I am surprised to see any speedup in this! I'd expect something trivial like this to be completely memory bound with the CPU sitting almost idle waiting for bytes coming in from memory.
Looking at the benchmark code, this is using rdtsc to read the CPU time stamp counter. That does not take waiting for memory into account, does it?
I wonder if there's a difference when measured in wall-clock time. It's still somewhat beneficial to have the CPU work efficiently, since that gives hyperthreading an opportunity to run while waiting for memory.
If you really wanted to make something like this faster, you should focus on cache utilization and make use of prefetch instructions. x86 has pretty bad prefetch instructions and pretty good speculative fetching, so don't expect massive speedups there; but on ARM/AArch64 you get finer-grained control over cache prefetching (L1 and L2 separately) and could see much bigger differences.
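For illustration, a rough sketch of what explicit prefetching could look like in the scalar despace loop (the function shape follows the blog post's scalar despace; the 512-byte prefetch distance is just a guess and would need per-machine tuning):

    #include <stddef.h>
    #include <xmmintrin.h>   /* _mm_prefetch; on ARM/AArch64 you'd use __builtin_prefetch */

    #define PREFETCH_DISTANCE 512   /* illustrative; needs tuning */

    size_t despace_prefetch(char *bytes, size_t howmany) {
        size_t pos = 0;
        for (size_t i = 0; i < howmany; i++) {
            /* issue one prefetch per 64-byte cache line, well ahead of the read */
            if ((i & 63) == 0 && i + PREFETCH_DISTANCE < howmany)
                _mm_prefetch(bytes + i + PREFETCH_DISTANCE, _MM_HINT_T0);
            char c = bytes[i];
            if (c == '\r' || c == '\n' || c == ' ')
                continue;
            bytes[pos++] = c;
        }
        return pos;
    }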
As for benchmarking this kind of problem: you obviously want to measure real-world performance, so you need wall-clock time as well as the time stamp counter, but I'd look for optimization clues in "perf stat" and other CPU perf counters, with an emphasis on cache misses and branch mispredictions.
The figure you should be staring at is the total throughput of the algorithm, measured in gigabytes per second. You should be getting figures close to the memory bandwidth available (25-50 GB/s depending on CPU and memory).
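For reference, a minimal sketch of that kind of wall-clock throughput measurement (not the benchmark's own harness; the buffer size and fill are placeholders, and a single pass is short enough that in practice you'd repeat it many times):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* the simple scalar despace from the post, approximately */
    size_t despace(char *bytes, size_t howmany) {
        size_t pos = 0;
        for (size_t i = 0; i < howmany; i++) {
            char c = bytes[i];
            if (c == '\r' || c == '\n' || c == ' ')
                continue;
            bytes[pos++] = c;
        }
        return pos;
    }

    int main(void) {
        size_t len = 1 << 20;                    /* 1 MB buffer */
        char *buf = malloc(len);
        for (size_t i = 0; i < len; i++)
            buf[i] = (rand() % 8) ? 'a' : ' ';   /* roughly 1-in-8 spaces */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        size_t kept = despace(buf, len);         /* real runs should repeat and refill */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("kept %zu bytes, %.2f GB/s\n", kept, len / secs / 1e9);
        free(buf);
        return 0;
    }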
edit: I measured the wall clock time with clock_gettime before/after all the repeats (using a megabyte sized buffer) and there is indeed no significant difference, here are my results:

Here's the diff to the original: http://pasteall.org/208511

edit2: surprisingly, Clang is about 10% slower than GCC in my experiments.
Thanks for looking into this! I know that Daniel appreciates feedback, but rarely reads HN, so an email to him or blog comment with your results might be helpful.
Looking at the benchmark code, this is using rdtsc to read the CPU time stamp counter. That does not take waiting for memory into account, does it?
On modern Intel processors, the "time stamp counter" is monotonically increasing at a constant rate, so it does take memory latency into account. On many Linux systems (including the one used here) clock_gettime() uses the same underlying clock source, so there should be no difference in accuracy for long measurements. The CPUID-RDTSC/RDTSCP-CPUID pattern used here has the advantage of a somewhat lower and significantly more predictable overhead, which helps when measuring shorter events.
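For reference, that pattern looks roughly like this (a sketch close to the usual RDTSC_START/RDTSC_FINAL macros, not necessarily byte-for-byte what the benchmark uses):

    #include <stdint.h>

    /* CPUID serializes before RDTSC at the start; RDTSCP waits for prior
       instructions to retire, then CPUID fences again at the end. */
    static inline uint64_t rdtsc_start(void) {
        uint32_t hi, lo;
        __asm__ volatile("cpuid\n\t"
                         "rdtsc\n\t"
                         "mov %%edx, %0\n\t"
                         "mov %%eax, %1\n\t"
                         : "=r"(hi), "=r"(lo)
                         :
                         : "%rax", "%rbx", "%rcx", "%rdx", "memory");
        return ((uint64_t)hi << 32) | lo;
    }

    static inline uint64_t rdtsc_final(void) {
        uint32_t hi, lo;
        __asm__ volatile("rdtscp\n\t"
                         "mov %%edx, %0\n\t"
                         "mov %%eax, %1\n\t"
                         "cpuid\n\t"
                         : "=r"(hi), "=r"(lo)
                         :
                         : "%rax", "%rbx", "%rcx", "%rdx", "memory");
        return ((uint64_t)hi << 32) | lo;
    }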
I'd expect something trivial like this to be completely memory bound with the CPU sitting almost idle waiting for bytes coming in from memory.
If I remember the numbers right, the Skylake processor this is running on can read about 64B per cycle if the source is L1, about 24B per cycle if the source is L3, and about 6B per cycle from main memory. Using the existing RDTSC framework, I get equal speeds at L3 size, and still get sub .4B/cycle coming from main memory.
I measured the wall clock time with clock_gettime before/after all the repeats (using a megabyte sized buffer) and there is indeed no significant difference
I agree that testing on larger buffers would be informative (as would testing different ratios of whitespace), but I don't think your approach is capturing what you think it is. The macro being used for time measurement uses a different random input for each iteration (look at 'pre'), and the unoptimized cost of that initialization dominates the wall-clock time. So while I think your test is worthwhile, I think you need a better way to perform the measurement. I think you'll see that RDTSC maps exactly to wall time in this case, but surprises are definitely possible.
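To make that concrete, the timing loop has roughly this shape (a simplified sketch, not the exact BEST_TIME macro from the benchmark): only the window around 'test' is counted by RDTSC, while 'pre' (the per-iteration refill of the input) runs outside it, so wrapping the whole repeat loop in clock_gettime mostly measures 'pre'.

    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>   /* __rdtsc (plain, unserialized, for brevity) */

    /* simplified sketch of the benchmark's timing loop */
    #define BEST_TIME(test, pre, repeat, size)                          \
        do {                                                            \
            uint64_t min_diff = UINT64_MAX;                             \
            for (int i = 0; i < (repeat); i++) {                        \
                pre;                      /* refill input: NOT timed */ \
                uint64_t start = __rdtsc();                             \
                test;                     /* timed region */            \
                uint64_t end = __rdtsc();                               \
                if (end - start < min_diff)                             \
                    min_diff = end - start;                             \
            }                                                           \
            printf("%.3f cycles per byte\n",                            \
                   (double)min_diff / (double)(size));                  \
        } while (0)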
I'd look for optimization clues in "perf stat" and other CPU perf counters, with an emphasis on cache misses and branch mispredictions.
Yes, although as with wall time, one needs to be sure to measure only the section of code that one is optimizing. Perf makes this difficult, so something like "likwid", with an API that allows profiling fragments of code, would be required (a rough sketch follows below).
I haven't done that yet for this, but at the faster speeds (sse4_despace_branchless) I don't think there are any cache misses or branch mispredictions in the code of interest.
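Something along these lines with the likwid marker API, for example (a sketch, not tested here; the header is <likwid.h> on older versions, and you'd build with -DLIKWID_PERFMON, link -llikwid, and run under likwid-perfctr -m):

    #include <likwid-marker.h>   /* <likwid.h> on older likwid versions */
    #include <stddef.h>

    size_t despace(char *bytes, size_t howmany);   /* function under test */

    void profile_despace(char *buf, size_t len) {
        LIKWID_MARKER_INIT;
        LIKWID_MARKER_START("despace");
        despace(buf, len);              /* counters are attributed to this region only */
        LIKWID_MARKER_STOP("despace");
        LIKWID_MARKER_CLOSE;
    }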
surprisingly, Clang is about 10% slower than GCC in my experiments
My surprise was the opposite: Clang was 25% faster than GCC and ICC on the L1- and L3-sized vectorized branchless tests (0.25 cycles/byte versus 0.33 cycles/byte, although the code I'm running is not quite what's on GitHub). Most of this benefit seems to be because Clang unrolls 2x to reduce loop overhead.
I dropped a comment on his blog and a link to this discussion...
And how sloppy of me! I didn't notice the `pre;` in the code consuming most of the time (to be fair, it's just four characters and I didn't put much time into it). When I move that outside the loop, before the timer, I get results that show an improvement with the optimized version.
And indeed it looks like rdtsc is giving similar figures to clock_gettime now. I falsely presumed that it counts retired instructions.
I'm still surprised to see a speedup, and by how badly the original version performs.
My guess is that the conditional stores are poison to the CPU pipelines. The 64-bit version already gets most of the performance gain; I presume that's because of the more efficient memory access pattern (a scalar branchless sketch follows below).
I can't edit my earlier post any more; I'd correct it if I could.
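For comparison, the branchless trick that avoids the conditional store is roughly: always write the byte, and only advance the output index when the byte is kept. A scalar sketch (the SSE4 version in the post does the compaction with shuffles instead):

    #include <stddef.h>

    /* scalar branchless despace: no data-dependent branch around the store */
    size_t despace_branchless(char *bytes, size_t howmany) {
        size_t pos = 0;
        for (size_t i = 0; i < howmany; i++) {
            char c = bytes[i];
            bytes[pos] = c;                                   /* unconditional store */
            pos += (c != ' ') & (c != '\n') & (c != '\r');    /* adds 0 or 1 */
        }
        return pos;
    }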