I am surprised to see any speedup in this! I'd expect something trivial like this to be completely memory bound with the CPU sitting almost idle waiting for bytes coming in from memory.
Looking at the benchmark code, this is using rdtsc to read the CPU time stamp counter. That does not take waiting for memory into account, does it?
I wonder if there's a difference when measured in wall-clock time. It's still somewhat beneficial to have the CPU work efficiently, since that gives hyperthreading an opportunity to run while waiting for memory.
If you really wanted to make something like this faster, you should focus on cache utilization and make use of prefetch instructions. x86 has pretty bad prefetch instructions and pretty good speculative fetching, so don't expect massive speedups there; but on ARM/AArch64 you get finer-grained control over cache prefetching (L1 and L2 separately) and could see much bigger differences.
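For illustration, a rough sketch of what explicit prefetching could look like in the scalar despace loop (the function shape follows the blog post's scalar despace; the 512-byte prefetch distance is just a guess and would need per-machine tuning):

    #include <stddef.h>
    #include <xmmintrin.h>   /* _mm_prefetch; on ARM/AArch64 you'd use __builtin_prefetch */

    #define PREFETCH_DISTANCE 512   /* illustrative; needs tuning */

    size_t despace_prefetch(char *bytes, size_t howmany) {
        size_t pos = 0;
        for (size_t i = 0; i < howmany; i++) {
            /* issue one prefetch per 64-byte cache line, well ahead of the read */
            if ((i & 63) == 0 && i + PREFETCH_DISTANCE < howmany)
                _mm_prefetch(bytes + i + PREFETCH_DISTANCE, _MM_HINT_T0);
            char c = bytes[i];
            if (c == '\r' || c == '\n' || c == ' ')
                continue;
            bytes[pos++] = c;
        }
        return pos;
    }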
As for benchmarking this kind of problem: you obviously want to measure real-world performance, so you need wall-clock time as well as the time stamp counter, but I'd look for optimization clues in "perf stat" and other CPU perf counters, with an emphasis on cache misses and branch mispredictions.
The figure you should be staring at is the total throughput of the algorithm, measured in gigabytes per second. You should be getting figures close to the memory bandwidth available (25-50 GB/s depending on CPU and memory).
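For reference, a minimal sketch of that kind of wall-clock throughput measurement (not the benchmark's own harness; the buffer size and fill are placeholders, and a single pass is short enough that in practice you'd repeat it many times):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* the simple scalar despace from the post, approximately */
    size_t despace(char *bytes, size_t howmany) {
        size_t pos = 0;
        for (size_t i = 0; i < howmany; i++) {
            char c = bytes[i];
            if (c == '\r' || c == '\n' || c == ' ')
                continue;
            bytes[pos++] = c;
        }
        return pos;
    }

    int main(void) {
        size_t len = 1 << 20;                    /* 1 MB buffer */
        char *buf = malloc(len);
        for (size_t i = 0; i < len; i++)
            buf[i] = (rand() % 8) ? 'a' : ' ';   /* roughly 1-in-8 spaces */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        size_t kept = despace(buf, len);         /* real runs should repeat and refill */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("kept %zu bytes, %.2f GB/s\n", kept, len / secs / 1e9);
        free(buf);
        return 0;
    }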
edit: I measured the wall clock time with clock_gettime before/after all the repeats (using a megabyte sized buffer) and there is indeed no significant difference, here are my results:

Here's the diff to the original: http://pasteall.org/208511

edit2: surprisingly, Clang is about 10% slower than GCC in my experiments.
Thanks for looking into this! I know that Daniel appreciates feedback, but rarely reads HN, so an email to him or blog comment with your results might be helpful.
Looking at the benchmark code, this is using rdtsc to read the CPU time stamp counter. That does not take waiting for memory into account, does it?
On modern Intel processors, the "time stamp counter" is monotonically increasing at a constant rate, so it does take memory latency into account. On many Linux systems (including the one used here) clock_gettime() uses the same underlying clock source, so there should be no difference in accuracy for long measurements. The CPUID-RDTSC/RDTSCP-CPUID pattern used here has the advantage of a somewhat lower and significantly more predictable overhead, which helps when measuring shorter events.
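For reference, that pattern looks roughly like this (a sketch close to the usual RDTSC_START/RDTSC_FINAL macros, not necessarily byte-for-byte what the benchmark uses):

    #include <stdint.h>

    /* CPUID serializes before RDTSC at the start; RDTSCP waits for prior
       instructions to retire, then CPUID fences again at the end. */
    static inline uint64_t rdtsc_start(void) {
        uint32_t hi, lo;
        __asm__ volatile("cpuid\n\t"
                         "rdtsc\n\t"
                         "mov %%edx, %0\n\t"
                         "mov %%eax, %1\n\t"
                         : "=r"(hi), "=r"(lo)
                         :
                         : "%rax", "%rbx", "%rcx", "%rdx", "memory");
        return ((uint64_t)hi << 32) | lo;
    }

    static inline uint64_t rdtsc_final(void) {
        uint32_t hi, lo;
        __asm__ volatile("rdtscp\n\t"
                         "mov %%edx, %0\n\t"
                         "mov %%eax, %1\n\t"
                         "cpuid\n\t"
                         : "=r"(hi), "=r"(lo)
                         :
                         : "%rax", "%rbx", "%rcx", "%rdx", "memory");
        return ((uint64_t)hi << 32) | lo;
    }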
I'd expect something trivial like this to be completely memory bound with the CPU sitting almost idle waiting for bytes coming in from memory.
If I remember the numbers right, the Skylake processor this is running on can read about 64B per cycle if the source is L1, about 24B per cycle if the source is L3, and about 6B per cycle from main memory. Using the existing RDTSC framework, I get equal speeds at L3 size, and still get sub .4B/cycle coming from main memory.
I measured the wall clock time with clock_gettime before/after all the repeats (using a megabyte sized buffer) and there is indeed no significant difference
I agree that testing on larger buffers would be informative (as would testing different ratios of whitespace), but I don't think your approach is capturing what you think it is. The macro being used for time measurement uses a different random input for each iteration (look at 'pre'), and the unoptimized cost of that initialization dominates the wall-clock time. So while I think your test is worthwhile, I think you need a better way to perform the measurement. I think you'll see that RDTSC maps exactly to wall time in this case, but surprises are definitely possible.
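To make that concrete, the timing loop has roughly this shape (a simplified sketch, not the exact BEST_TIME macro from the benchmark): only the window around 'test' is counted by RDTSC, while 'pre' (the per-iteration refill of the input) runs outside it, so wrapping the whole repeat loop in clock_gettime mostly measures 'pre'.

    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>   /* __rdtsc (plain, unserialized, for brevity) */

    /* simplified sketch of the benchmark's timing loop */
    #define BEST_TIME(test, pre, repeat, size)                          \
        do {                                                            \
            uint64_t min_diff = UINT64_MAX;                             \
            for (int i = 0; i < (repeat); i++) {                        \
                pre;                      /* refill input: NOT timed */ \
                uint64_t start = __rdtsc();                             \
                test;                     /* timed region */            \
                uint64_t end = __rdtsc();                               \
                if (end - start < min_diff)                             \
                    min_diff = end - start;                             \
            }                                                           \
            printf("%.3f cycles per byte\n",                            \
                   (double)min_diff / (double)(size));                  \
        } while (0)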
I'd look for optimization clues in "perf stat" and other CPU perf counters, with an emphasis on cache misses and branch mispredictions.
Yes, although as with wall time, one needs to be sure to measure only the section of code that one is optimizing. Perf makes this difficult, so something like "likwid", with an API that allows profiling fragments of code, would be required (a rough sketch follows below).
I haven't done that yet for this, but at the faster speeds (sse4_despace_branchless) I don't think there are any cache misses or branch mispredictions in the code of interest.
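Something along these lines with the likwid marker API, for example (a sketch, not tested here; the header is <likwid.h> on older versions, and you'd build with -DLIKWID_PERFMON, link -llikwid, and run under likwid-perfctr -m):

    #include <likwid-marker.h>   /* <likwid.h> on older likwid versions */
    #include <stddef.h>

    size_t despace(char *bytes, size_t howmany);   /* function under test */

    void profile_despace(char *buf, size_t len) {
        LIKWID_MARKER_INIT;
        LIKWID_MARKER_START("despace");
        despace(buf, len);              /* counters are attributed to this region only */
        LIKWID_MARKER_STOP("despace");
        LIKWID_MARKER_CLOSE;
    }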
surprisingly, Clang is about 10% slower than GCC in my experiments
My surprise was the opposite: Clang was 25% faster than GCC and ICC on the L1- and L3-sized vectorized branchless tests (0.25 cycles/byte versus 0.33 cycles/byte, although the code I'm running is not quite what's on GitHub). Most of this benefit seems to be because Clang unrolls 2x to reduce loop overhead.
I dropped a comment on his blog and a link to this discussion...
And how sloppy of me! I didn't notice the `pre;` in the code consuming most of the time (to be fair, it's just four characters and I didn't put much time into it). When I move that outside the loop, before the timer, I get results that show an improvement with the optimized version.
And indeed it looks like rdtsc is giving similar figures to clock_gettime now. I falsely presumed that it counts retired instructions.
I'm still surprised to see a speedup, and by how badly the original version performs.
My guess is that the conditional stores are poison to the CPU pipelines. The 64-bit version already gets most of the performance gain; I presume that's because of the more efficient memory access pattern (a scalar branchless sketch follows below).
I can't edit my earlier post any more; I'd correct it if I could.
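For comparison, the branchless trick that avoids the conditional store is roughly: always write the byte, and only advance the output index when the byte is kept. A scalar sketch (the SSE4 version in the post does the compaction with shuffles instead):

    #include <stddef.h>

    /* scalar branchless despace: no data-dependent branch around the store */
    size_t despace_branchless(char *bytes, size_t howmany) {
        size_t pos = 0;
        for (size_t i = 0; i < howmany; i++) {
            char c = bytes[i];
            bytes[pos] = c;                                   /* unconditional store */
            pos += (c != ' ') & (c != '\n') & (c != '\r');    /* adds 0 or 1 */
        }
        return pos;
    }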