Packing for the Holidays

Every time I visit my family for the holidays, as the date approaches, I find myself filled with dread. It’s nothing sinister: my family’s great, and the season is nice. The reason is simple:

I hate packing.

In fact, I hate both kinds of packing: trip packing, and bit packing. Let me tell you a story about bit packing.

Bit packing

If you’ve ever browsed around the available type attributes in GCC, you may have noticed the entry for “packed”. It seems straightforward enough, and if you’re trying to cram a lot of data in a system (like we do), it can be pretty attractive.

There are plenty of warnings about using __attribute__((__packed__)). Most of them boil down to the cost of misaligned accesses: on architectures that can’t load from unaligned addresses, the compiler has to do two separate loads plus some shifting and arithmetic where a single aligned integer load would have worked without the packing. That might not seem too bad if you’re concerned about space.
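
Here’s a hypothetical example (not our code) of the kind of layout packing produces:

#include <stdint.h>

/* With __packed__, no padding is inserted after tag, so value starts at
 * byte offset 1 instead of its natural offset 4, and sizeof(struct record)
 * is 5 instead of 8. */
struct __attribute__((__packed__)) record {
    uint8_t  tag;
    uint32_t value;   /* misaligned */
};

uint32_t get_value(const struct record *r) {
    /* On targets that require aligned loads, the compiler has to assemble
     * this value from narrower loads plus shifts. */
    return r->value;
}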

But it gets much worse.

Concurrency

We’ve been working hard on concurrency for a while here at Tokutek. Probably the strangest thing I saw happen recently was this iiBench graph:

This is a pretty old graph, but the important point is right there in the middle. There’s a hole in the graph! In fact, there are a couple smaller ones in the beginning, too, if you look closely.

The distribution of these dips through the benchmark led us to guess that it had to do with the height of the tree. After some more digging, we found that the problems occurred when the root of the tree had 3 or 4 children.

Cache misses

The perf(1) tool is pretty amazing. It can count hardware events like cache misses, perf annotate shows you which lines of code are the most expensive, and you can attach to a running process or thread on demand, so you can hunt down problems that don’t appear immediately. Unfortunately, we didn’t discover perf until after this particular issue, but we checked later and confirmed that it will find these types of problems if you track the cache-misses event.

When we looked at our most expensive code, it was a call to __sync_fetch_and_add() that we had added when we relaxed our locking rules to allow for more concurrency. We added some instrumentation, and we could see its cost fluctuate by a couple orders of magnitude, depending on the number of times the root had split.

It didn’t make much sense in context, but we had a hunch, so we wrote a micro-benchmark to check the behavior of unaligned memory operations. Our machines have 64-byte cache lines, which is pretty typical, so we wrote the benchmark to update memory locations 64 bytes apart within a little chunk of memory. We wanted to see how fast we could do it, depending on the offsets of the locations we updated.

What we saw was impressive, and didn’t even need multiple threads to demonstrate.

Benchmarking

We set up the benchmark with something like this:

Setup

/* a 1 KB buffer, aligned to a 1 KB boundary (so offset 0 starts a cache line) */
void *mem;
posix_memalign(&mem, 1024, 1024);
const size_t stride = 64, length = 1024;

The main part of the benchmark was the difference between these two functions (we did use another thread, but just for timing):

With __sync_fetch_and_add()

int n_ops = 0;
while (running) {
    /* touch one 8-byte integer every 64 bytes, starting at OFFSET */
    for (size_t idx = OFFSET; idx + 8 < length; idx += stride, ++n_ops) {
        uint64_t * const p = (uint64_t *) (mem + idx);
        (void) __sync_fetch_and_add(p, 42);
    }
}
return n_ops;

Without __sync_fetch_and_add()

int n_ops = 0;
while (running) {
    for (size_t idx = OFFSET; idx + 8 < length; idx += stride, ++n_ops) {
        /* volatile keeps the compiler from optimizing the plain add away */
        uint64_t volatile * const p = (uint64_t *) (mem + idx);
        *p += 42;
    }
}
return n_ops;

What we’re doing here is updating several integers located at various offsets in memory. We use posix_memalign(3) to guarantee alignment of our buffer, and then, given OFFSET, we update integers located at OFFSET, OFFSET + 64, OFFSET + 128, etc. It’s worth noting that there’s only one thread accessing this memory at all.

We expected to see things slow down for offsets 57 through 63 (the offsets that will place the integer across a cache line boundary), but what came out was much worse than expected.
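
Concretely, with 8-byte integers and 64-byte lines, an access starting at offset off straddles two cache lines exactly when off mod 64 falls in the range 57 through 63. A tiny helper (ours, just for illustration, not part of the benchmark) makes the check explicit:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True when an 8-byte access starting at offset straddles two 64-byte
 * cache lines, i.e. when offset % 64 is in 57..63. */
static bool crosses_cache_line(size_t offset) {
    return (offset % 64) + sizeof(uint64_t) > 64;
}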

offset    time per add (ns)    time per __sync_fetch_and_add() (ns)
    24          1.41                    8.73
    28          1.39                    8.73
    56          1.42                    8.71
    57          1.78                 1797.05
    60          1.77                 1838.19
Time to add to an integer at each offset (the last two offsets cross a cache line boundary)

Analysis

For regular integers, aligned or not, as long as they stay within a single cache line, a __sync_fetch_and_add() makes an update cost 6-7x as much as a plain add. Once the integer crosses a cache line boundary, though, the same change makes it cost over 1000x what it would with a normal add.

It’s useful to note that this isn’t a compiler issue: the compiler doesn’t know the offset of our integer until runtime (the offset is a command-line argument). If we look at the assembly, that __sync_fetch_and_add() just compiles down to

lock addq $0x2a,(%rax)    # 0x2a is 42

This is actually happening in hardware. The LOCK prefix asks the processor to handle the locking for you. If your value fits within a single cache line, the processor can usually satisfy the lock through the normal cache coherency protocol (this is called "cache locking"), but if the operation spans two cache lines, it has to be much more careful and things get amazingly more expensive (it asserts a lock signal on the memory bus).

Section 8.1 of The Intel 64 and IA-32 Architectures Software Developer’s Manual - Volume 3A: System Programming Guide, Part 1 has more details on hardware locks.

I should note that on my desktop, a __sync_fetch_and_add() makes a cache-line-crossing update "only" 400x slower. The tabulated results above were measured on a two-socket machine, which tends to amplify performance issues with cache coherency.

Back to the Fractal Tree

Somewhere in the node, we had a struct containing a value that we updated with the sync builtins, and the struct was packed:

struct __attribute__((__packed__)) foobar_info {
    uint64_t important;   /* the value we update with __sync_fetch_and_add() */
    int x;
    char y;               /* packed: 13 bytes total, no padding */
};

Being packed alone wasn’t enough to cause us such troubles, but when we allocated an array of these structs, it ended up that the 3rd and 4th structs in the array were lined up so that the important value crossed a cache line, and that’s when we saw the dip.
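
Here’s a standalone sketch (the struct matches the one above, but the demo itself is ours, not the actual node code) of how packing makes important drift off its natural alignment as you walk an array:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct __attribute__((__packed__)) foobar_info {
    uint64_t important;
    int x;
    char y;
};

int main(void) {
    /* Packed, each element is 13 bytes (8 + 4 + 1, no padding), so in an
     * array the important field lands at byte offsets 0, 13, 26, 39, ...
     * and loses its natural 8-byte alignment.  Whether a given element
     * also straddles a 64-byte cache line depends on where the array
     * itself sits in memory. */
    printf("sizeof(struct foobar_info) = %zu\n", sizeof(struct foobar_info));
    for (int i = 0; i < 8; i++)
        printf("element %d: important at array offset %zu\n", i,
               i * sizeof(struct foobar_info) + offsetof(struct foobar_info, important));
    return 0;
}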

Needless to say, we unpacked the offending struct and allowed the compiler to align our values normally. The dip disappeared, and we kept the concurrency improvements. Those are pretty exciting, by the way; I can’t wait to talk more about what we did there.
