I read more about the slice-by-N algorithms because they sounded really interesting. The way they work is by using a set of lookup tables that are 4k to 16k in size (larger lookup tables for larger N). The reason they are fast is that the lookup tables fit within the L1 cache on modern CPUs. So when you do 100M rounds of CRC32 it is super fast because the table is always cache hot, but I don't think this result is informative if you just want to occasionally do a CRC in between doing other types of work (especially for small buffer sizes). You will have to wait as the lookup tables are brought up through the cache hierarchy _and_ you are potentially evicting other useful data from the cache at the same time. Presumably PCLMULQDQ does not have this drawback.
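For anyone curious what the technique actually looks like, here's a generic slice-by-4 sketch (four 256-entry tables, 4 KB total; this is the textbook form of the algorithm, not any particular project's code):

    #include <stdint.h>
    #include <stddef.h>

    /* Four 256-entry tables, 4 KB total; larger N just means more tables. */
    static uint32_t crc_tab[4][256];

    static void crc32_init_tables(void)
    {
        for (uint32_t i = 0; i < 256; i++) {
            uint32_t c = i;
            for (int k = 0; k < 8; k++)
                c = (c & 1) ? (c >> 1) ^ 0xEDB88320u : c >> 1;  /* zlib polynomial */
            crc_tab[0][i] = c;
        }
        for (uint32_t i = 0; i < 256; i++) {
            crc_tab[1][i] = (crc_tab[0][i] >> 8) ^ crc_tab[0][crc_tab[0][i] & 0xFF];
            crc_tab[2][i] = (crc_tab[1][i] >> 8) ^ crc_tab[0][crc_tab[1][i] & 0xFF];
            crc_tab[3][i] = (crc_tab[2][i] >> 8) ^ crc_tab[0][crc_tab[2][i] & 0xFF];
        }
    }

    static uint32_t crc32_slice4(uint32_t crc, const uint8_t *buf, size_t len)
    {
        crc = ~crc;
        while (len >= 4) {                 /* one XOR plus four table lookups per 4 bytes */
            crc ^= (uint32_t)buf[0] | (uint32_t)buf[1] << 8 |
                   (uint32_t)buf[2] << 16 | (uint32_t)buf[3] << 24;
            crc = crc_tab[3][crc & 0xFF] ^ crc_tab[2][(crc >> 8) & 0xFF] ^
                  crc_tab[1][(crc >> 16) & 0xFF] ^ crc_tab[0][crc >> 24];
            buf += 4;
            len -= 4;
        }
        while (len--)                      /* byte-at-a-time tail */
            crc = crc_tab[0][(crc ^ *buf++) & 0xFF] ^ (crc >> 8);
        return ~crc;
    }

Every iteration touches four different tables, which is exactly why the whole working set needs to be resident in L1 for the benchmark numbers to hold.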
This is a great point and a huge mental problem I have when looking at (and writing) benchmarks. If anyone knows more about this (i.e. if they can show why it isn't true) please speak up.
I think that, mostly, your benchmarks have to match your workloads. Most of the CRC32 benchmarks I've seen are looking at larger buffers. The xxhash function mentioned elsewhere in this thread was claimed to be "an order of magnitude" faster, but again on large buffers - the gain over CRC32 on the same tests was rather more modest (though not at all insignificant).
In this case, I think (but am curious, will investigate further at some point) our Cyrus servers are doing enough checksumming work to keep any tables hot in the cache. So the tests are hopefully a useful indicator of where improvements can be made.
According to the Stephan Brumme website you linked to, the slice-by-8 lookup table is 8K and the slice-by-16 table is 16K, so your combo version of crc32 needs 24K of L1 cache to run at full speed. Modern server class CPUs typically have 32K of L1 dcache so that doesn't leave much room for the rest of your work. Maybe that's reasonable (I don't really know what Cyrus does), but I thought it was worth thinking about.
Most of the time we're iterating through a cyrus.index, where there are 104 bytes per record and we're doing a CRC32 over 100 of them, or we're reading through a twoskip database, where we're CRCing the header (24+ bytes, average 32) and then doing a couple of mmap lookups and memcmp operations before jumping to the next header, which is normally only within a hundred bytes forward in a mostly-sorted database. The mmap lookahead will also have been in that close-by range.
Also, our oldest CPU on production servers seems to be the E5520 right now, which has 128kb of L1 data cache.
- Instruction Cache = 32kB, per core
- Data Cache = 32kB, per core
- 256kB Mid-Level cache, per core
- 8MB shared among cores (up to 4)
So I guess the confusion is that Intel moved the L2 cache onto each core (from Nehalem onwards, I think?) and used that opportunity to substantially lower latency for it.
> cloudflare is amazing until the input buffer gets under 80 bytes. That's the point where it stops using the optimised implementation and falls back to the regular zlib implementation (slice-by-4). I'm not sure why (no explanatory comments I could find), but it's a showstopper for our uses.
If they have CRC accounting for 10% of CPU, they must be using these checksums a lot. At some point I'd imagine the false error rate simply due to bit flips and other random errors on the path from database through CRC function will outweigh whatever value you are getting from the constant rechecks.
Also, literature suggests a throughput of ∼2.67 bytes per cycle for the CRC32 instruction (the 64-bit form processes 8 bytes at a 3-cycle latency, so a single dependency chain retires 8/3 ≈ 2.67 bytes per cycle), a threefold improvement over best-in-class non-HW routines. I'm quite sure it would be worth it to recompute previous checksums if you can do so in a way that minimizes downtime (think TrueCrypt doing a transparent initial encryption: not encrypting when there's IO load).
Is the CRC32 instruction much faster than an optimized implementation using PCLMULQDQ? It's available on a wider range of CPUs, but I thought I remembered that PCLMULQDQ worked very quickly.
I think when I looked, the single-instruction crc32 was ~5 times faster than the pclmulqdq version.
As an aside, if they have a server in production that doesn't support CLMUL, they should junk that machine - it appeared in Westmere and Bulldozer - everything earlier is EOLed. Crc32 was in sse4.2, so Nehalem.
The crc32 instruction is hard to beat, has been around for years, and I would guess the different poly could be phased in on their systems like you would for changing a password algorithm, or just try the new poly since it is quick, and if it fails use the old poly.
A parallel PCLMULQDQ version is faster than a serial CRC32, because the CRC32 instruction has a latency of 3 cycles.
A parallel CRC32 can be slightly faster still when more data segments are used, at the cost of more code size.
Also, PCLMULQDQ can use an arbitrary polynomial, I think, so they could make a compatible version.
The lookup table version they used must have needed a lot more CPU cache. Also, I would suggest they look at reducing the number of times they call the function, since there are two ways to speed up a function that is called a lot, and one of them is to stop calling it so many times.
Or, since IO is limiting, they could consider compressing/decompressing the data as well, using something fast like LZ4. This would give them faster IO and a checksum at the same time. Something like Blosc can be faster than memcpy. And since the data is replicated over the network as well, they would save space on their storage too.
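If anyone wants to see what "compression gives you a checksum for free" looks like in practice: the LZ4 frame format can embed an xxHash32 content checksum that the decompressor verifies. A minimal sketch using liblz4's frame API (error handling trimmed; the function name is mine):

    #include <stdlib.h>
    #include <string.h>
    #include <lz4frame.h>   /* liblz4; link with -llz4 */

    /* Compress a buffer into an LZ4 frame with the content checksum enabled.
       Returns the compressed size and stores a malloc'd buffer in *dst_out,
       or returns 0 on failure. */
    size_t compress_with_checksum(const void *src, size_t src_size, void **dst_out)
    {
        LZ4F_preferences_t prefs;
        memset(&prefs, 0, sizeof(prefs));
        prefs.frameInfo.contentChecksumFlag = LZ4F_contentChecksumEnabled;

        size_t bound = LZ4F_compressFrameBound(src_size, &prefs);
        void *dst = malloc(bound);
        if (!dst)
            return 0;

        size_t written = LZ4F_compressFrame(dst, bound, src, src_size, &prefs);
        if (LZ4F_isError(written)) {
            free(dst);
            return 0;
        }
        *dst_out = dst;
        return written;
    }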
Modern CPU performance optimization is very often about memory IO. If the data is in L2/L3 then you can do a LOT of computation on it, almost for free, compared to the time it takes to get it into L2/L3.
During that time waiting for memory/disk/network IO, they might consider doing other things, like indexing, better checksums (SHA1, for instance) or even encryption - especially if the data is already in L2/L3.
Modern Intel CPUs have an instruction specifically to compute CRC. This instruction is easily consumed through a C++ intrinsic, literally in one line of code. You can't do any better than that, no matter what you use.
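For reference, a minimal sketch of how that looks with the SSE4.2 intrinsics (note this computes CRC-32C, the Castagnoli polynomial, so it will not match checksums computed with the zlib polynomial):

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>
    #include <nmmintrin.h>   /* SSE4.2 intrinsics; build with -msse4.2 */

    static uint32_t crc32c_hw(uint32_t crc, const uint8_t *buf, size_t len)
    {
        crc = ~crc;
        while (len >= 8) {
            uint64_t chunk;
            memcpy(&chunk, buf, 8);                    /* avoid unaligned access UB */
            crc = (uint32_t)_mm_crc32_u64(crc, chunk); /* 8 bytes per instruction */
            buf += 8;
            len -= 8;
        }
        while (len--)
            crc = _mm_crc32_u8(crc, *buf++);           /* trailing bytes */
        return ~crc;
    }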
[Intel's CRC32 CPU instruction] uses a different input to the CRC32 algorithm (known as the polynomial) which is apparently more robust, and is used in networks, filesystems, that sort of thing. It gives different results from the "standard" polynomial typically used in compression.
They would have to go back and recompute all their existing stored checksums.
That's why you have to tag the value with an identifier of the algorithm, so you can change your mind later. The problems they currently face are a direct consequence of an architectural mistake a few years ago.
Indeed. As I note in the post, this wasn't something we _needed_ to do at that time, we were just curious.
Replacing a function with a faster implementation is trivial; it's just another deployment and we do several of those each week. Changing the data format adds operational complexity - two codepaths need to be run and maintained, and we increase system load and degrade response times for the duration. (The data format is versioned, so that part is easy - no architectural problem there.)
It's totally worth doing if you need to, and we're not afraid of that, but you don't do it on a whim - it takes proper planning and testing and needs multiple people involved.
> Next, we have to get our optimisation settings right. We compile Cyrus with no optimisations and full debugging because it makes it really easy to work with crash dumps.
Wait, so are these results without letting the compiler optimize?
The results in the tests are all with optimisations (-O3 -march=sandybridge -mtune=intel), as mentioned in the post.
The exception is the Debian packaged version of zlib, because we don't control that. That's the reason I include stock zlib in the tests - if it had been wildly different from the system zlib, I would have looked into recompiling the Debian package with more optimisations. There were no major differences and indeed, inspecting the package further shows that it is compiled with optimisations.
On the Cyrus side, we used to link to the Debian zlib, so we get those optimisations. For the new code bundled in Cyrus itself, we have to enable optimisations ourselves.
FNV is fast because the code is tiny and hashing a few bytes is fast. xxhash uses multiple parallel streams and has very high throughput. They are fast hashes, each at its own end of the spectrum: FNV for 1-16 byte keys and xxhash for long data. It's better to call xxhash a checksum, since that's its purpose.
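To illustrate the "code is tiny" point, 32-bit FNV-1a in its entirety is one XOR and one multiply per byte (a generic sketch, not either project's code):

    #include <stdint.h>
    #include <stddef.h>

    static uint32_t fnv1a_32(const uint8_t *buf, size_t len)
    {
        uint32_t h = 2166136261u;        /* FNV offset basis */
        for (size_t i = 0; i < len; i++) {
            h ^= buf[i];                 /* fold in one byte */
            h *= 16777619u;              /* FNV prime */
        }
        return h;
    }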
There are hash functions that are as fast as or faster than FNV and statistically stronger (and faster) than xxhash. There are multiple hash function families that should be used before either of the above in modern applications unless you need backward compatibility (like the CRC case in the article). This is an active research area and both of the above, while adequate for many casual use cases, should not be used for checksums for the same reason CRC32 should not be used for checksums in large systems.
As an example from one of my hash function research projects, MetroHash64 will outperform xxhash both for speed and statistical robustness (which is quasi-cryptographic). If you only need 32 bits, truncate larger hashes; if they have very strong statistical properties, that works well.
With all due respect, xxHash is a decent hash function but you are comparing it against some relatively weak or slow hash functions. As was pointed out last time this came up, the Metro64 hash example I offered is simultaneously faster and higher quality than xxHash. There are hash functions that are faster than xxHash and stronger than some of the cryptographic hashes, and xxHash definitely has much more bias than even MD5.
Getting a perfect score with SMHasher is table stakes. The default settings on SMHasher are much too loose for current research; I know it well, I have hacked it extensively. A state-of-the-art non-cryptographic hash function today generally has the following properties:
- statistical quality greater than or equal to MD5, a cryptographic hash. xxHash is not close to MD5 in quality.
- faster than xxHash on large keys, and for really good hashes, memory bandwidth bound. (Metro64 is, again, faster than xxHash, though not memory bandwidth bound.)
- faster than xxHash on small keys (Metro64, to use that example, is almost 2x faster)
There are a lot of good hash functions being developed by researchers. But empirically, xxHash is nowhere near the state-of-the-art any way you slice it. It is a decent hash function but there are many functions produced by many researchers that are both faster and higher quality. It isn't personal, people can measure it for themselves.
If your spam signatures are small (around 32 bytes) and fixed size, you might also want to look into a pure XOR tabulation hash. It's excellent for generating fixed-size keys for hash tables:
I use it all the time when I need a hash table index. It's simple to understand, safe and fast. Even though it can generate more assembly than other hashes, it's actually often faster because it has no multiply operation, only XOR. If the tables are small enough to fit into cache (i.e. small keys), it's brilliant. It has properties suitable for triangular probing, where other hashes sometimes don't have enough independence. And if you need a rolling hash, e.g. for Rabin-Karp, it's also perfect.
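A minimal sketch of the idea, assuming fixed 32-byte keys and a 64-bit result (the table size and the placeholder PRNG are my choices, not a recommendation):

    #include <stdint.h>
    #include <stdlib.h>

    #define KEY_BYTES 32

    /* One 256-entry table of random values per key byte: 32 * 256 * 8 = 64 KB. */
    static uint64_t tab[KEY_BYTES][256];

    static void tabhash_init(void)
    {
        for (int i = 0; i < KEY_BYTES; i++)
            for (int j = 0; j < 256; j++)
                /* any decent 64-bit PRNG will do; rand() is just a placeholder */
                tab[i][j] = ((uint64_t)rand() << 32) ^ (uint64_t)rand();
    }

    static uint64_t tabhash(const uint8_t key[KEY_BYTES])
    {
        uint64_t h = 0;
        for (int i = 0; i < KEY_BYTES; i++)
            h ^= tab[i][key[i]];         /* one lookup and one XOR per byte */
        return h;
    }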
If you were going to rewrite your data, why wouldn't you go with something that's been standardized by every hardware implementer as the next-generation CRC algorithm, CRC32c (the one the article explicitly throws out at the beginning, because the author did not want to rewrite the data)? Even ARM chips and network- and storage-focused microcontrollers have hardware acceleration for it these days...
"Because it produces different results, it would render all our stored checksums invalid, so we'd have to recalculate them all. That's a big job, so we'd need to do a lot of testing first to make sure that it's so much better that it's actually worth the effort."
My quick tests here (using the same methods outlined in the post) suggest it's around 30% faster on 64-byte buffers. If we're ever shopping for a new hash function entirely, I'll make sure it's considered. Right now it's not worth it because, as noted in the post, we're not actually under any particular stress and changing hash functions is a big job.
If they were going to use a different hash function then they would more likely use the built-in CRC32 instruction, which computes CRC-32C. That would certainly be faster.
No, if you have a HW-supported crc32 builtin, nothing will be faster than that.
You can safely skip the PCLMULQDQ optimization, which only works on newer CPUs, but the crc32 instruction should be available almost everywhere nowadays.
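"Almost everywhere" is still worth verifying at runtime, though. A sketch of the usual dispatch pattern with the GCC/Clang builtin (crc32c_hw and crc32c_sw here are stand-ins for a hardware routine like the intrinsic sketch above and a table-based fallback using the same polynomial):

    #include <stdint.h>
    #include <stddef.h>

    uint32_t crc32c_hw(uint32_t crc, const uint8_t *buf, size_t len);  /* SSE4.2 path */
    uint32_t crc32c_sw(uint32_t crc, const uint8_t *buf, size_t len);  /* table fallback */

    uint32_t crc32c(uint32_t crc, const uint8_t *buf, size_t len)
    {
        /* GCC/Clang builtin: cheap check against a CPU model probed at startup */
        if (__builtin_cpu_supports("sse4.2"))
            return crc32c_hw(crc, buf, len);
        return crc32c_sw(crc, buf, len);
    }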