
I am surprised to see any speedup in this! I'd expect something trivial like this to be completely memory bound with the CPU sitting almost idle waiting for bytes coming in from memory.

Looking at the benchmark code, this is using rdtsc to read the CPU time stamp counter. That does not take waiting for memory into account, does it?

I wonder if there's a difference when measured in wall clock time. Even so, it's somewhat beneficial to have the CPU work efficiently, since that gives hyperthreading an opportunity to fill in while waiting for memory.

If you really wanted to make something like this faster, you should focus on cache utilization and make use of prefetch instructions. x86 has pretty weak prefetch instructions and pretty good speculative fetching, so don't expect massive speedups there; but on ARM or AArch64 you have finer-grained control over cache prefetching (L1 and L2 separately) and could see much bigger differences.
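For what it's worth, here's a minimal sketch of the idea using GCC/Clang's __builtin_prefetch; the 512-byte prefetch distance and the locality hint are guesses that would need per-machine tuning, not measured values:

    #include <stddef.h>

    /* Count spaces while software-prefetching ahead of the scan;
       one prefetch per 64-byte cache line, 512 bytes ahead. */
    size_t countspaces_prefetch(const char *buf, size_t n) {
        size_t count = 0;
        for (size_t i = 0; i < n; i++) {
            if ((i & 63) == 0 && i + 512 < n)
                __builtin_prefetch(buf + i + 512, 0, 0); /* read, low temporal locality */
            count += (buf[i] == ' ');
        }
        return count;
    }
On AArch64 the same builtin should lower to PRFM hints, where the target cache level can matter more.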

As for benchmarking this kind of problem: you obviously want to measure real-world performance, so you need wall clock time as well as the time stamp counter, but I'd look for optimization clues in "perf stat" and other CPU perf counters, with an emphasis on cache misses and branch mispredictions.
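Something along these lines, where ./despacer stands in for whatever binary you built (available event names vary a bit across kernels and CPUs):

    perf stat -e cycles,instructions,cache-references,cache-misses,branch-misses ./despacer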

The figure you should be staring at is the total throughput of the algorithm, measured in gigabytes per second. You should be getting figures close to the memory bandwidth available (25-50 GB/s depending on CPU and memory).
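To make that concrete: assuming a 3 GHz clock (a number I'm picking purely for illustration), converting a cycles-per-byte figure is just

    throughput = clock rate / (cycles per byte)
    e.g. 3.0e9 Hz / 0.34 cycles/byte ≈ 8.8 GB/s
so even a kernel around 0.34 cycles/byte would sit well below a 25-50 GB/s memory ceiling.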

edit: I measured the wall clock time with clock_gettime before/after all the repeats (using a megabyte-sized buffer) and there is indeed no significant difference. Here are my results:

    memcpy(tmpbuffer,buffer,N):  0.122945 cycles / ops 1495907352 nsec (1.495907 sec) 
    countspaces(buffer, N):  3.657322 cycles / ops 1544915395 nsec (1.544915 sec) 
    despace(buffer, N):  6.521193 cycles / ops 1621204460 nsec (1.621204 sec) 
    faster_despace(buffer, N):  1.721657 cycles / ops 1500507217 nsec (1.500507 sec) 
    despace64(buffer, N):  3.595031 cycles / ops 1544993649 nsec (1.544994 sec) 
    despace_to(buffer, N, tmpbuffer):  6.307885 cycles / ops 1615101563 nsec (1.615102 sec) 
    avx2_countspaces(buffer, N):  0.190992 cycles / ops 1460961459 nsec (1.460961 sec) 
    avx2_despace(buffer, N):  5.750583 cycles / ops 1615971010 nsec (1.615971 sec) 
    sse4_despace(buffer, N):  0.985002 cycles / ops 1482901389 nsec (1.482901 sec) 
    sse4_despace_branchless(buffer, N):  0.338737 cycles / ops 1460874704 nsec (1.460875 sec) 
    sse4_despace_trail(buffer, N):  1.950657 cycles / ops 1502268447 nsec (1.502268 sec) 
    sse42_despace_branchless(buffer, N):  0.562246 cycles / ops 1468638389 nsec (1.468638 sec) 
    sse42_despace_branchless_lookup(buffer, N):  0.624913 cycles / ops 1472445127 nsec (1.472445 sec) 
    sse42_despace_to(buffer, N,tmpbuffer):  1.747046 cycles / ops 1507705780 nsec (1.507706 sec)
Here's the diff to the original: http://pasteall.org/208511

edit2: surprisingly, Clang is about 10% slower than GCC in my experiments.




Thanks for looking into this! I know that Daniel appreciates feedback, but rarely reads HN, so an email to him or blog comment with your results might be helpful.

Looking at the benchmark code, this is using rdtsc to read the CPU time stamp counter. That does not take waiting for memory into account, does it?

On modern Intel processors, the "time stamp counter" is monotonically increasing at a constant rate, so it does take memory latency into account. On many Linux systems (including the one used here) clock_gettime() uses the same underlying clock source, so there should be no difference in accuracy for long measurements. The CPUID-RDTSC/RDTSCP-CPUID pattern used here has the advantage of a somewhat lower and significantly more predictable overhead, which helps when measuring shorter events.
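For reference, here's the general shape of that serialized pattern with GCC/Clang intrinsics (a sketch of the technique, not necessarily the exact macro the benchmark uses):

    #include <stdint.h>
    #include <cpuid.h>     /* __get_cpuid */
    #include <x86intrin.h> /* __rdtsc, __rdtscp */

    static inline uint64_t tsc_begin(void) {
        unsigned a, b, c, d;
        __get_cpuid(0, &a, &b, &c, &d); /* CPUID serializes: earlier work can't leak in */
        return __rdtsc();
    }

    static inline uint64_t tsc_end(void) {
        unsigned aux, a, b, c, d;
        uint64_t t = __rdtscp(&aux);    /* reads TSC after prior instructions retire */
        __get_cpuid(0, &a, &b, &c, &d); /* keeps later instructions from creeping up */
        return t;
    }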

I'd expect something trivial like this to be completely memory bound with the CPU sitting almost idle waiting for bytes coming in from memory.

If I remember the numbers right, the Skylake processor this is running on can read about 64 B per cycle if the source is L1, about 24 B per cycle if the source is L3, and about 6 B per cycle from main memory. Using the existing RDTSC framework, I get equal speeds at L3 size, and still under 0.4 cycles per byte coming from main memory.

I measured the wall clock time with clock_gettime before/after all the repeats (using a megabyte sized buffer) and there is indeed no significant difference

I agree that testing on larger buffers would be informative (as would testing different ratios of whitespace), but I don't think your approach is capturing what you think it is. The macro used for time measurement generates a different random input for each iteration (look at 'pre'), and the unoptimized cost of initializing that input dominates the wall-clock time. So while I think your test is worthwhile, I think you need a better way to perform the measurement. I think you'll see that RDTSC maps exactly to wall time in this case, but surprises are definitely possible.
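To illustrate the distinction, here's a self-contained sketch of the measurement shape as I understand it (my reconstruction, not the actual macro from the repo): the wall clock brackets the whole loop, including the per-iteration re-randomization, while the TSC window brackets only the kernel.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <x86intrin.h> /* __rdtsc */

    /* trivial stand-in for the kernel under test */
    static size_t despace(char *buf, size_t n) {
        size_t pos = 0;
        for (size_t i = 0; i < n; i++)
            if (buf[i] != ' ') buf[pos++] = buf[i];
        return pos;
    }

    int main(void) {
        enum { N = 1 << 20, REPEATS = 100 };
        char *buf = malloc(N);
        struct timespec ws, we;
        uint64_t cycles = 0;
        clock_gettime(CLOCK_MONOTONIC, &ws);
        for (int r = 0; r < REPEATS; r++) {
            for (size_t j = 0; j < N; j++)         /* the 'pre' step: */
                buf[j] = (rand() & 3) ? 'a' : ' '; /* dominates wall time */
            uint64_t t0 = __rdtsc();
            despace(buf, N);            /* only this is inside the TSC window */
            cycles += __rdtsc() - t0;
        }
        clock_gettime(CLOCK_MONOTONIC, &we);
        double wall = (we.tv_sec - ws.tv_sec) + (we.tv_nsec - ws.tv_nsec) * 1e-9;
        printf("%.3f cycles/op in TSC window, %.3f sec wall (includes 'pre')\n",
               (double)cycles / ((double)N * REPEATS), wall);
        free(buf);
        return 0;
    }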

I'd look for optimization clues in "perf stat" and other CPU perf counters, with an emphasis on cache misses and branch mispredictions.

Yes, although as with wall time, one needs to be sure to measure only the section of code that one is optimizing. Perf makes this difficult, so something like "likwid", with an API that allows profiling fragments of code, would be required. I haven't done that yet for this, but at the faster speeds (sse4_despace_branchless) I don't think there are any cache misses or branch mispredictions in the code of interest.
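For anyone who wants to try, likwid's marker API looks roughly like this, from memory, so check the docs before trusting the details (compile with -DLIKWID_PERFMON and -llikwid, then run under something like likwid-perfctr -m -C 0 -g CACHE ./a.out):

    #include <likwid.h>

    int main(void) {
        LIKWID_MARKER_INIT;
        LIKWID_MARKER_START("despace");
        /* ... only the kernel being optimized goes here ... */
        LIKWID_MARKER_STOP("despace");
        LIKWID_MARKER_CLOSE;
        return 0;
    }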

surprisingly, Clang is about 10% slower than GCC in my experiments

My surprise was the opposite. Clang was 25% faster than GCC and ICC on the L1- and L3-sized vectorized branchless tests (.25 cycles/byte versus .33 cycles/byte, although the code I'm running is not quite what's on GitHub). Most of this benefit seems to come from Clang unrolling the loop 2x to reduce loop overhead.


I dropped a comment on his blog and a link to this discussion...

And how sloppy of me! I didn't notice the `pre;` in the code consuming most of the time (to be fair, it's just 4 characters and I didn't put much time into it). When I move that outside of the loop, before the timer, I get results that show improvement with the optimized version.

And indeed it looks like rdtsc is giving figures similar to clock_gettime now. I had mistakenly presumed that it counted retired instructions.

I'm still surprised to see a speedup, and how badly the original version is performing.

My guess is that the conditional stores are poison to the CPU pipelines. The 64-bit version already gets most of the performance; I presume that's because of its more efficient memory access pattern.
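To illustrate what I mean by the conditional store (my own simplified sketches, not the exact functions from the repo):

    #include <stddef.h>

    /* Branchy: the store happens only when the byte is kept, so the
       CPU must predict an essentially random branch per byte. */
    size_t despace_branchy(char *buf, size_t n) {
        size_t pos = 0;
        for (size_t i = 0; i < n; i++) {
            char c = buf[i];
            if (c != ' ')
                buf[pos++] = c;
        }
        return pos;
    }

    /* Branchless: always store, conditionally advance the write
       cursor; a data dependency replaces the branch mispredictions. */
    size_t despace_branchless(char *buf, size_t n) {
        size_t pos = 0;
        for (size_t i = 0; i < n; i++) {
            char c = buf[i];
            buf[pos] = c;
            pos += (c != ' ');
        }
        return pos;
    }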

I can't edit my earlier post any more, I'd correct it if I could.

Clang:

    memcpy(tmpbuffer,buffer,N):  0.000000 cycles / ops 65232 nsec (0.000065 sec) 
    countspaces(buffer, N):  0.000000 cycles / ops 65176 nsec (0.000065 sec) 
    despace(buffer, N):  4.547286 cycles / ops 7654310198 nsec (7.654310 sec) 
    faster_despace(buffer, N):  1.582721 cycles / ops 2651677500 nsec (2.651678 sec) 
    despace64(buffer, N):  0.583952 cycles / ops 1025215835 nsec (1.025216 sec) 
    despace_to(buffer, N, tmpbuffer):  0.000000 cycles / ops 63847 nsec (0.000064 sec) 
    avx2_countspaces(buffer, N):  0.000000 cycles / ops 63697 nsec (0.000064 sec) 
    avx2_despace(buffer, N):  0.307253 cycles / ops 602528061 nsec (0.602528 sec) 
    sse4_despace(buffer, N):  0.310504 cycles / ops 534542967 nsec (0.534543 sec) 
    sse4_despace_branchless(buffer, N):  0.353851 cycles / ops 594652080 nsec (0.594652 sec) 
    sse4_despace_trail(buffer, N):  0.314221 cycles / ops 562439990 nsec (0.562440 sec) 
    sse42_despace_branchless(buffer, N):  0.608811 cycles / ops 1020576734 nsec (1.020577 sec) 
    sse42_despace_branchless_lookup(buffer, N):  0.608494 cycles / ops 1020062750 nsec (1.020063 sec) 
    sse42_despace_to(buffer, N,tmpbuffer):  1.779908 cycles / ops 2983955982 nsec (2.983956 sec) 
GCC:

    memcpy(tmpbuffer,buffer,N):  0.285702 cycles / ops 489599835 nsec (0.489600 sec) 
    countspaces(buffer, N):  0.000000 cycles / ops 63995 nsec (0.000064 sec) 
    despace(buffer, N):  4.751018 cycles / ops 8014809122 nsec (8.014809 sec) 
    faster_despace(buffer, N):  1.718575 cycles / ops 2898082416 nsec (2.898082 sec) 
    despace64(buffer, N):  0.883421 cycles / ops 1560479651 nsec (1.560480 sec) 
    despace_to(buffer, N, tmpbuffer):  6.424313 cycles / ops 10892716258 nsec (10.892716 sec) 
    avx2_countspaces(buffer, N):  0.031227 cycles / ops 52369789 nsec (0.052370 sec) 
    avx2_despace(buffer, N):  0.315633 cycles / ops 627793484 nsec (0.627793 sec) 
    sse4_despace(buffer, N):  0.319739 cycles / ops 554173689 nsec (0.554174 sec) 
    sse4_despace_branchless(buffer, N):  0.366240 cycles / ops 615638316 nsec (0.615638 sec) 
    sse4_despace_trail(buffer, N):  0.318973 cycles / ops 572262343 nsec (0.572262 sec) 
    sse42_despace_branchless(buffer, N):  0.638114 cycles / ops 1070772787 nsec (1.070773 sec) 
    sse42_despace_branchless_lookup(buffer, N):  0.637988 cycles / ops 1069642163 nsec (1.069642 sec) 
    sse42_despace_to(buffer, N,tmpbuffer):  1.768081 cycles / ops 2963488990 nsec (2.963489 sec)



