I read more about the slice-by-N algorithms because they sounded really interesting. They work by using a set of lookup tables that are 4KB to 16KB in total size (larger tables for larger N). The reason they are fast is that the lookup tables fit within the L1 cache on modern CPUs. So when you do 100M rounds of CRC32 it is super fast because the table is always cache hot, but I don't think this result is informative if you just want to occasionally do a CRC in between doing other types of work (especially for small buffer sizes). You will have to wait as the lookup tables are brought up through the cache hierarchy _and_ you are potentially evicting other useful data from the cache at the same time. Presumably PCLMULQDQ does not have this drawback.
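For reference, here's roughly what the slice-by-4 variant looks like (a minimal sketch for the reflected CRC-32 polynomial 0xEDB88320; the table layout and the unaligned little-endian load are my own choices, not taken from any particular implementation). The four 256-entry tables are the 4KB end of the footprint mentioned above; slice-by-8 and slice-by-16 just add more tables and consume 8 or 16 input bytes per iteration:

```c
#include <stdint.h>
#include <string.h>

/* Sketch of slice-by-4 CRC-32 (reflected polynomial 0xEDB88320).
 * Four 256-entry uint32_t tables = 4 KB total. */
static uint32_t T[4][256];

static void crc32_init_tables(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? (c >> 1) ^ 0xEDB88320u : c >> 1;
        T[0][i] = c;
    }
    for (uint32_t i = 0; i < 256; i++) {
        /* each extra table is the CRC of the previous table's entry
         * followed by one zero byte */
        T[1][i] = (T[0][i] >> 8) ^ T[0][T[0][i] & 0xFF];
        T[2][i] = (T[1][i] >> 8) ^ T[0][T[1][i] & 0xFF];
        T[3][i] = (T[2][i] >> 8) ^ T[0][T[2][i] & 0xFF];
    }
}

static uint32_t crc32_slice4(uint32_t crc, const unsigned char *p, size_t len)
{
    crc = ~crc;
    while (len >= 4) {
        uint32_t x;
        memcpy(&x, p, 4);        /* assumes a little-endian host */
        x ^= crc;
        /* one lookup per input byte, but four bytes consumed per loop */
        crc = T[3][x & 0xFF] ^ T[2][(x >> 8) & 0xFF]
            ^ T[1][(x >> 16) & 0xFF] ^ T[0][x >> 24];
        p += 4;
        len -= 4;
    }
    while (len--)                /* tail bytes, one at a time */
        crc = (crc >> 8) ^ T[0][(crc ^ *p++) & 0xFF];
    return ~crc;
}
```

The speed depends entirely on those table lookups hitting L1, which is exactly the cache-residency question raised above.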
This is a great point and a huge mental problem I have when looking at (and writing) benchmarks. If anyone knows more about this (i.e. if they can show why it isn't true) please speak up.
I think that mostly, your benchmarks have to match your workloads. Most of the CRC32 benchmarks I've seen are looking at larger buffers. The xxhash function mentioned elsewhere in this thread was claimed to be "an order of magnitude" faster, but again, large buffers - the gain over CRC32 on the same tests was rather more modest (though not at all insignificant).
In this case, I think (though I'm curious and will investigate further at some point) our Cyrus servers are doing enough checksumming work to keep any tables hot in the cache. So the tests are hopefully a useful indicator of where improvements can be made.
According to the Stephan Brumme website you linked to, the slice-by-8 lookup table is 8K and the slice-by-16 table is 16K, so your combo version of crc32 needs 24K of L1 cache to run at full speed. Modern server class CPUs typically have 32K of L1 dcache so that doesn't leave much room for the rest of your work. Maybe that's reasonable (I don't really know what Cyrus does), but I thought it was worth thinking about.
Most of the time we're iterating through a cyrus.index, where there are 104 bytes per record and we're doing a CRC32 over 100 of them, or we're reading through a twoskip database, where we CRC the header (24+ bytes, average 32) and then do a couple of mmap lookups and memcmp operations before jumping to the next header, which is normally only within a hundred bytes forwards on a mostly sorted database. The mmap lookahead will also have been in that close-by range.
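To make the first pattern concrete, here is a hypothetical sketch of the per-record loop (reading "a CRC32 over 100 of them" as one CRC per 104-byte record; if it's one CRC over all 100 records the buffer is ~10KB instead, but either way the buffers are small). The record size constant and function names are illustrative only, not the real cyrus.index code:

```c
#include <stddef.h>
#include <zlib.h>            /* zlib's crc32(); Cyrus may use its own routine */

#define RECORD_SIZE 104      /* bytes per cyrus.index record, per the comment above */

/* Hypothetical sketch: CRC each fixed-size record in an mmapped index,
 * interleaved with whatever other per-record work the server does. */
static void scan_index_records(const unsigned char *base, unsigned nrecords)
{
    for (unsigned i = 0; i < nrecords; i++) {
        const unsigned char *rec = base + (size_t)i * RECORD_SIZE;
        uLong crc = crc32(0L, rec, RECORD_SIZE);  /* small, ~104-byte buffer */
        (void)crc;  /* compare against the CRC stored in the record, etc. */
    }
}
```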
Also, our oldest CPU on production servers seems to be the E5520 right now, which has 128kb of L1 data cache.
- Instruction Cache = 32kB, per core
- Data Cache = 32kB, per core
- 256kB Mid-Level cache, per core
- 8MB Last-Level Cache, shared among cores (up to 4)
So I guess the confusion is that Intel moved the L2 cache onto each core (from Nehalem onwards, I think?) and used that opportunity to substantially lower latency for it.