Alignment and Packing
When we write structs and classes in C++, we usually think of a structure's size as the sum of its members' sizes.
struct Foo {
U32 uv; // 4 bytes
F32 fv; // 4 bytes
I32 iv; // 4 bytes
};
For the struct Foo, just sum 4 + 4 + 4 and we get the size of 12 bytes. Very simple.
struct Bar {
U32 uv; // 4 bytes
F32 fv; // 4 bytes
bool bv; // 1 byte
};
But what is the size of the struct Bar? 9 bytes? Actually, the compiler typically leaves “holes” in the struct, making Bar 12 bytes wide. Due to alignment requirements, Bar gets 3 bytes of padding, so 4 + 4 + 1 (the member sizes) + 3 bytes of padding gives a size of 12 bytes. Exactly the same size as Foo.
This can seem a bit confusing if you have never dealt with alignment, so here is proof that, with clang's default settings, Bar is actually 12 bytes wide.
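One way to check this claim ourselves is to let the compiler report the layout. The sketch below assumes that U32 and F32 are aliases for std::uint32_t and float (the post does not show their definitions) and that bool is 1 byte, which holds on mainstream compilers but is not guaranteed by the standard.

```cpp
#include <cstddef>  // offsetof
#include <cstdint>  // std::uint32_t

// Assumed aliases: the post does not show how U32 and F32 are defined.
using U32 = std::uint32_t;
using F32 = float;

struct Bar {
    U32 uv;   // 4 bytes, offset 0
    F32 fv;   // 4 bytes, offset 4
    bool bv;  // 1 byte, offset 8, followed by 3 bytes of padding
};

// Compilation fails if the layout differs from what the text describes.
static_assert(offsetof(Bar, bv) == 8, "bv sits right after fv");
static_assert(alignof(Bar) == 4, "Bar takes the alignment of its strictest member");
static_assert(sizeof(Bar) == 12, "9 bytes of data plus 3 bytes of tail padding");
```

If the file compiles, the layout matches: there is nothing to run.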
What is memory alignment?
The natural alignment of a data type is the best placement of that data in memory, which is usually at addresses that are multiples of its size. For example, a 4-byte object is aligned at addresses that are multiples of 4, that is, any address whose last hexadecimal digit is 0x0, 0x4, 0x8 or 0xC. The same logic applies to data of other sizes:
Size (bytes) | Aligned in addresses ending with |
---|---|
1 | any |
2 | 0x0, 0x2, 0x4, 0x6, 0x8, 0xA, 0xC or 0xE |
4 | 0x0, 0x4, 0x8 or 0xC |
8 | 0x0 or 0x8 |
16 | 0x0 |
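The table boils down to a single test: an address is aligned when it is a multiple of the alignment. As a small sketch, the helper below (is_aligned is a hypothetical name, not something from the post) checks this with a bit mask, assuming the alignment is a power of two.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when `address` is a multiple of `alignment`.
// Assumes `alignment` is a power of two, so `alignment - 1` is a mask of the
// low bits that must all be zero.
constexpr bool is_aligned(std::uintptr_t address, std::size_t alignment) {
    return (address & (alignment - 1)) == 0;
}

static_assert(is_aligned(0xD47C, 4), "last hex digit 0xC: fine for 4 bytes");
static_assert(!is_aligned(0xD47C, 8), "but not for 8 bytes");
```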
Why is alignment necessary?
With default settings, C++ compilers always respect alignment so that the CPU can read and write memory efficiently. This matters because processors access memory in aligned blocks, so data that straddles a block boundary costs extra work (and on some processors cannot be accessed at all).
For example, if we ask the processor to read a 4-byte float from an address ending with 0xD47C, the memory controller can load it into a register with a single read. But if we try to read the same 4-byte float from an address ending with 0xD473, the processor has to read two blocks of 4 bytes each, one at 0xD470 and the other at 0xD474. It then applies a mask to each register, shifts them to line up the bits, and finally ORs them into the destination register.
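To make the extra work concrete, here is a rough software model of that two-read sequence. It is only an illustration, not how any particular CPU is implemented: it assumes memory is an array of aligned 32-bit words with little-endian byte numbering inside each word, and load_u32 is a hypothetical name. The shifts double as the masking step, because they discard the bytes that do not belong to the value.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical model: `words` is memory seen as aligned 4-byte blocks, and
// byte 0 of each block is its least significant byte (little-endian).
constexpr std::uint32_t load_u32(const std::uint32_t* words, std::size_t byte_address) {
    const std::size_t index = byte_address / 4;  // block holding the first byte
    const unsigned shift = static_cast<unsigned>(byte_address % 4) * 8;  // misalignment in bits
    if (shift == 0) {
        return words[index];  // aligned: a single read is enough
    }
    // Unaligned: read two neighbouring blocks, keep the wanted bytes of each,
    // and OR the two halves together.
    const std::uint32_t low = words[index] >> shift;
    const std::uint32_t high = static_cast<std::uint32_t>(words[index + 1] << (32 - shift));
    return low | high;
}
```

Reading at byte offset 3 of the two words 0x33221100 and 0x77665544 touches both words and yields 0x66554433, while reading at offset 0 or 4 needs only one.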
As Jason Gregory states in Game Engine Architecture, some processors don’t go this far with unaligned blocks: they just read garbage or even crash. One example of this is the PlayStation 2 processor.
While researching recent game consoles, I found that the PlayStation 3’s processor throws exceptions when you try to read unaligned memory blocks using 8-byte addresses. The PlayStation 4’s processor also requires more bus cycles to access misaligned memory, and we should always avoid spending bus cycles unnecessarily.
How to correctly calculate our data size?
Now that you understand the importance of always accessing aligned memory, we can calculate the correct size of any data structure by assuming that each member is aligned and that the structure as a whole uses the alignment of its most strictly aligned member (usually the largest one). That’s why the boolean inside Bar is followed by 3 bytes of padding: even though the boolean itself is properly aligned at any address, the structure needs an alignment of 4 bytes so that every element stays aligned when Bar is used in an array.
The easiest way to calculate the data size is to draw boxes as wide as the largest member in the structure. Draw a first box and fit as many members into it as you can. If the next member is wider than the space remaining in the box, fill the box with padding bytes and draw a new box. Repeat this until all members fit. If the last box isn’t completely filled, fill it with padding bytes as well.
Doing this process for Foo and Bar should give these results:
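The same calculation can be sketched in code. The version below is an approximation of what compilers do, not a quote from any compiler: it places each member at the next offset that satisfies its alignment and then pads the total up to the strictest alignment, which is the rule the boxes visualize. Member, align_up and struct_size are hypothetical names, and alignments are assumed to be powers of two.

```cpp
#include <cstddef>

// Hypothetical description of one member: its size and alignment in bytes.
struct Member {
    std::size_t size;
    std::size_t align;
};

// Rounds `offset` up to the next multiple of `align` (a power of two).
constexpr std::size_t align_up(std::size_t offset, std::size_t align) {
    return (offset + align - 1) & ~(align - 1);
}

// Size of a struct whose members are laid out in declaration order.
constexpr std::size_t struct_size(const Member* members, std::size_t count) {
    std::size_t offset = 0;     // end of the last member placed so far
    std::size_t box_width = 1;  // strictest alignment seen so far
    for (std::size_t i = 0; i < count; ++i) {
        // Skip padding until the member is aligned, then place it.
        offset = align_up(offset, members[i].align) + members[i].size;
        if (members[i].align > box_width) {
            box_width = members[i].align;
        }
    }
    return align_up(offset, box_width);  // fill the last box with padding
}

constexpr Member foo[] = {{4, 4}, {4, 4}, {4, 4}};  // U32, F32, I32
constexpr Member bar[] = {{4, 4}, {4, 4}, {1, 1}};  // U32, F32, bool

static_assert(struct_size(foo, 3) == 12, "Foo: no padding needed");
static_assert(struct_size(bar, 3) == 12, "Bar: 9 bytes of data + 3 bytes of padding");
```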
Best way to pack our data
Consider this struct, compiled with default settings for a 32-bit x86 target:
struct InefficientPacking {
U32 ua; // 4 bytes
F32 fb; // 4 bytes
U8 uc; // 1 byte
I32 id; // 4 bytes
bool be; // 1 byte
char* cf; // 4 bytes on x86
};
The only way to pack each member following their declaration order is:
This results in a struct 24 bytes wide. Just by rearranging the members we can achieve a more efficient packing:
struct EfficientPacking {
U32 ua; // 4 bytes
F32 fb; // 4 bytes
I32 id; // 4 bytes
char* cf; // 4 bytes on x86
bool be; // 1 byte
U8 uc; // 1 byte
};
This gives us 20 bytes for the same structure. We saved 4 bytes just by reordering the members.
If you’re thinking that we don’t need to care about this because the compiler will automatically rearrange the members, you’re wrong: the standard prohibits the compiler from doing this. So if you want the most efficient packing, an easy method that works in most cases is to arrange members from largest to smallest.
Another way of declaring the same structure is to make the padding bytes explicit. This can be a good practice to quickly visualize the structure’s width without having to account for memory alignment.
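Both sizes can be checked at compile time. As a sketch, the version below swaps char* for a 4-byte integer (the hypothetical alias Ptr32) so the numbers also hold when the file is built for a 64-bit target, where a real pointer is 8 bytes wide. The other aliases are assumptions as well, since the post does not show their definitions.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed aliases for the types used in the post.
using U8 = std::uint8_t;
using U32 = std::uint32_t;
using I32 = std::int32_t;
using F32 = float;
using Ptr32 = std::uint32_t;  // stand-in for a 4-byte char* on 32-bit x86

struct InefficientPacking {
    U32 ua;    // offset 0
    F32 fb;    // offset 4
    U8 uc;     // offset 8, then 3 bytes of padding
    I32 id;    // offset 12
    bool be;   // offset 16, then 3 bytes of padding
    Ptr32 cf;  // offset 20
};

struct EfficientPacking {
    U32 ua;    // offset 0
    F32 fb;    // offset 4
    I32 id;    // offset 8
    Ptr32 cf;  // offset 12
    bool be;   // offset 16
    U8 uc;     // offset 17, then 2 bytes of tail padding
};

static_assert(sizeof(InefficientPacking) == 24, "18 bytes of data + 6 of padding");
static_assert(sizeof(EfficientPacking) == 20, "18 bytes of data + 2 of padding");
```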
struct BestPacking {
U32 ua; // 4 bytes
F32 fb; // 4 bytes
I32 id; // 4 bytes
char* cf; // 4 bytes on x86
bool be; // 1 byte
U8 uc; // 1 byte
U8 pad[2]; // 2 bytes
};
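A nice side effect of explicit padding is that the size becomes the plain sum of the member sizes, which we can ask the compiler to confirm. In this sketch the aliases are assumptions, and Ptr32 is a hypothetical 4-byte stand-in for char* on 32-bit x86 so the check also holds on 64-bit builds.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed aliases for the types used in the post.
using U8 = std::uint8_t;
using U32 = std::uint32_t;
using I32 = std::int32_t;
using F32 = float;
using Ptr32 = std::uint32_t;  // stand-in for a 4-byte char* on 32-bit x86

struct BestPacking {
    U32 ua;
    F32 fb;
    I32 id;
    Ptr32 cf;
    bool be;
    U8 uc;
    U8 pad[2];  // the padding the compiler would otherwise add silently
};

// With the padding spelled out, there is nothing left for the compiler to add.
static_assert(sizeof(BestPacking) == 4 + 4 + 4 + 4 + 1 + 1 + 2,
              "size equals the sum of the declared members");
```

If someone later adds a member and forgets to adjust pad, the assertion fails at compile time instead of the layout changing silently.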
The full source code for this post can be found here.
How to disable automatic memory alignment in C++?
The directive #pragma pack (1) disables it. This should always be avoided because, as I explained, working with unaligned memory leads to wasted cycles on reads and writes, a chance of multiple cache misses, garbage data or exceptions depending on the processor, and countless other pessimizations. But, just for the sake of curiosity:
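A minimal sketch of what that looks like for Bar, using the push/pop form of the pragma so the setting does not leak into later declarations. #pragma pack is a compiler extension rather than standard C++, though clang, GCC and MSVC all accept it. The aliases are assumptions, and PackedBar is a hypothetical name.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed aliases for the types used in the post.
using U32 = std::uint32_t;
using F32 = float;

#pragma pack(push, 1)  // pack members on 1-byte boundaries from here on
struct PackedBar {
    U32 uv;   // offset 0
    F32 fv;   // offset 4
    bool bv;  // offset 8, no padding after it
};
#pragma pack(pop)      // restore the previous packing

static_assert(sizeof(PackedBar) == 9, "4 + 4 + 1 with no padding");
static_assert(alignof(PackedBar) == 1, "the struct may now start at any address");
// In an array, the second element starts at byte 9, so its uv is misaligned.
static_assert(sizeof(PackedBar[2]) == 18, "array elements are 9 bytes apart");
```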