http://anadoxin.org/blog

C++: Shooting yourself in the foot #6

Wed, 14 December 2022 :: #cpp :: #rant

Programmers say that C++ gives them a lot of power because it allows them to shoot themselves in the foot in more ways than one. With its complicated syntax, multiple paradigms, and many confusing features, C++ enables programmers to create convoluted, hard-to-maintain code that is still somehow able to run at high speeds. This "power" also comes with a steep learning curve and a high potential for bugs and security vulnerabilities, but hey, at least it's not Java, right?

One of the factors that surely contributes to the complexity of C and C++ is the sheer number of architectures they can target. Today it's very easy to forget that Intel x86 is not the only architecture out there. Apple did create a little vortex in this ossified world with its ARM-based machines, and I see it as a good thing, but it's apparently not enough to remind us that we still live in a world in which Little Endian is not the only encoding. And I'm not talking about Little Hiawatha.

What would be the most probable way for the C or C++ programmer to deserialize a structure from a file? I think it would be something like this:

// pseudocode

#include <stdio.h>
#include <stdint.h>

struct Header {
    uint32_t magic;
    uint32_t size;
    uint32_t value;
};

int main() {
    Header header;
    
    FILE* fp = fopen("file.bin", "rb");
    fread((void *)&header, sizeof(header), 1, fp);
    fclose(fp);
    
    // Use the fields in `header`.
    if (header.size < 100) {
        // ...
    }
}

That would work, except in the cases when it wouldn't.

By using this pattern, we're completely ignoring the byte order of the data inside the file, as well as the byte order of the machine we're running the program on.
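
To see what we're actually ignoring, here's a minimal sketch of what the same four bytes turn into on different machines (the values are mine, purely for illustration):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main() {
    // These four bytes encode 1 in little-endian and 16777216 in big-endian.
    const unsigned char bytes[4] = {0x01, 0x00, 0x00, 0x00};

    uint32_t n;
    memcpy(&n, bytes, sizeof(n));  // what fread() into a field effectively does

    // Prints 1 on a little-endian host and 16777216 on a big-endian one:
    // the bytes in the file are fixed, but their meaning depends on the CPU.
    printf("%u\n", (unsigned)n);
    return 0;
}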

In order to fix the issue, first we need to define two things:

  1. What is the byte order of the data stored inside the file?
  2. (optional) What is the byte order of the machine which is used to run the program?

Defining the first is usually enough. We can always refer to the byte order of the current machine as the "host" encoding, and good libraries resolve that to the proper encoding during the compilation step. So, a Linux-style fixed version of the code above could look like this:

// pseudocode

#include <stdio.h>
#include <stdint.h>
#include <endian.h>

struct Header {
    uint32_t magic = 0;
    uint32_t size = 0;
    uint32_t value = 0;
};

int main() {
    Header header;
    
    FILE* fp = fopen("file.bin", "rb");
    fread(&header.magic, sizeof(header.magic), 1, fp);
    fread(&header.size, sizeof(header.size), 1, fp);
    fread(&header.value, sizeof(header.value), 1, fp);
    fclose(fp);
    
    // Convert from little-endian encoding to host encoding
    header.magic = le32toh(header.magic);
    header.size = le32toh(header.size);
    header.value = le32toh(header.value);
    
    // Use the fields in `header`.
    if (header.size < 100) {
        // ...
    }
}

Of course, if we're running this program on a little-endian machine, then le32toh() does absolutely nothing. This is one of the complexities of C++: there are a lot of hidden places where a fix is needed, but the fix does real work only on some machines, so forgetting it can go unnoticed for years.
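
Note that <endian.h> and le32toh() are glibc/BSD extensions, not part of standard C or C++. Where they're unavailable, a portable equivalent can be written by hand; a minimal sketch (the function name is mine):

#include <stdint.h>

// Portable stand-in for le32toh(): reinterpret the in-memory bytes of `v`
// as a little-endian number, whatever the host byte order is.
static uint32_t my_le32toh(uint32_t v) {
    const uint8_t* b = (const uint8_t*)&v;
    return (uint32_t)b[0]
         | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[3] << 24);
}

On a little-endian host a decent compiler sees through this pattern and reduces it to a plain move; on a big-endian host it becomes a byte swap.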

There are some situations when this will never matter. For example, if you're programming for Windows, you will never run on anything that's not Little Endian; every architecture Windows ships on today, x86 and ARM alike, runs little-endian (the Xbox 360's PowerPC CPU was a big-endian exception, but it didn't run desktop Windows). Thus, your code is relatively safe to assume that all numbers in all files ever interfaced with will be little endian. But if one day you want to port your software to a different platform, or make it "portable" (whatever that means), then you'll probably end up in a quicksand of problems related to different byte ordering.
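
If portability does become a goal, C++20 at least lets you ask the compiler what the host byte order is, so the conversion can be selected at compile time; a minimal sketch (the function name is mine):

#include <bit>      // C++20: std::endian
#include <cstdint>

// Convert a little-endian on-disk value to host order; the branch is
// resolved at compile time, so little-endian builds pay nothing for it.
uint32_t le32_to_host(uint32_t v) {
    if constexpr (std::endian::native == std::endian::little) {
        return v;  // host already matches the file's byte order
    } else {
        return ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8) |
               ((v & 0x00FF0000u) >> 8)  | ((v & 0xFF000000u) >> 24);
    }
}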

Different languages cope with the problem using language-specific mechanisms. For example, Java runs inside the Java Virtual Machine, which hides the hardware byte order from the programmer entirely; wherever byte order does become observable (class files, DataInputStream/DataOutputStream, the default ByteBuffer order), the specification fixes it to Big Endian, regardless of the hardware platform the JVM runs on. So Java programmers can safely assume that serialized numbers are Big Endian, and if there's a need to read a little-endian number from a file, the byte order always has to be swapped (or explicitly requested via ByteOrder.LITTLE_ENDIAN).

Python and Ruby use their own structure packing and unpacking functions that convert between numbers and binary data: Python's struct.pack()/struct.unpack() and Ruby's Array#pack/String#unpack. Their format strings take special characters that define what byte order the caller would like to use (e.g. "<I" vs ">I" in Python).

Rust has the byteorder crate, which behaves similarly to the Python/Ruby approach; each read and each write is tagged as little-endian or big-endian (e.g. read_u32::<LittleEndian>()), so the conversion functions know whether the numeric values should be swapped or not.

But C++? In C++, you have the power. You can shoot yourself in the foot as many times as you desire.
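
Which, to be fair, also means you can build the same discipline yourself. One way to never depend on the host byte order at all is to read raw bytes and assemble the number explicitly; a minimal sketch (the names are mine, not a standard API):

#include <stdio.h>
#include <stdint.h>

// Read one little-endian uint32_t from a file, independently of the host
// byte order; returns 0 on a short read, 1 on success.
int read_u32_le(FILE* fp, uint32_t* out) {
    uint8_t b[4];
    if (fread(b, 1, sizeof(b), fp) != sizeof(b))
        return 0;
    *out = (uint32_t)b[0]
         | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[3] << 24);
    return 1;
}

With a helper like this, the struct-packing questions from the bonus track below disappear as well, because nothing is ever fread() directly into a struct.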

Bonus track

Let's see, how many bugs can you spot in the first code example? (A corrected sketch follows the list.)

  1. The structure layout is at the mercy of the compiler's padding and packing rules. Here all three fields are 4-byte uint32_t values, so in practice there's no padding, but add a single uint8_t field and the in-memory layout will no longer match the file layout, meaning the bytes from the file won't be stored in the proper fields during the fread() call. The correct solution is to use #pragma pack directives around struct Header.
  2. If you use #pragma pack(1) without #pragma pack(push) and #pragma pack(pop), you generate another bug: you overwrite whatever packing directive was in effect earlier (possibly set in some header file).
  3. fopen() isn't checked for success.
  4. fread() calls aren't checked for success.
  5. If you returned an error after detecting that fread() has read fewer bytes than expected, you would produce a resource leak: fclose() would never run. The proper approach is to use C++ RAII to register an automatic fclose() call on scope exit, using a scope guard.
  6. main() doesn't explicitly return a value. That happens to be legal for main() specifically (it implicitly returns 0 in C++), but an explicit return makes the intent clear, and the same omission in any other value-returning function is undefined behavior.
  7. Header header initially holds indeterminate ("random") memory; paired with skipping the return values of fopen() and fread(), this can make the program act on garbage, which is undefined behavior.
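
For completeness, here's one way to address all of the above; a sketch, not the only valid approach (the std::unique_ptr deleter trick is just one possible scope guard):

#include <stdio.h>
#include <stdint.h>
#include <endian.h>
#include <memory>

#pragma pack(push, 1)            // bugs 1 and 2: explicit packing, saved and restored
struct Header {
    uint32_t magic = 0;          // bug 7: fields are initialized
    uint32_t size = 0;
    uint32_t value = 0;
};
#pragma pack(pop)

int main() {
    Header header;

    FILE* raw = fopen("file.bin", "rb");
    if (!raw)                    // bug 3: fopen() is checked
        return 1;

    // bug 5: RAII scope guard; fclose() runs on every path out of main()
    std::unique_ptr<FILE, int(*)(FILE*)> fp(raw, fclose);

    if (fread(&header, sizeof(header), 1, fp.get()) != 1)
        return 1;                // bug 4: fread() is checked

    // Byte-order conversion still applies, as in the fixed example above.
    header.magic = le32toh(header.magic);
    header.size = le32toh(header.size);
    header.value = le32toh(header.value);

    if (header.size < 100) {
        // ...
    }
    return 0;                    // bug 6: explicit return value
}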