
An ASCII SIMD wc would likely be much more than 4× faster. The 320 MB/s speed TFA quotes is good for a byte-at-a-time solution, but pitiful as far as the capabilities of modern machines are concerned. A decent SSD is an order of magnitude faster than that. Around 1 GB/s (i.e. ~100 cycles/fetch) is table stakes for a task of this complexity.
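
For a sense of the baseline, the byte-at-a-time approach is essentially a loop of this shape (a minimal sketch, not TFA's actual code):

  /* Minimal byte-at-a-time wc sketch (ASCII/locale-naive), for comparison
     with the SIMD version below; counts lines, words, and bytes. */
  #include <ctype.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void) {
      unsigned long long bytes = 0, words = 0, lines = 0;
      int in_word = 0;
      char buf[1 << 16];
      ssize_t n;
      while ((n = read(STDIN_FILENO, buf, sizeof buf)) > 0) {
          bytes += n;
          for (ssize_t i = 0; i < n; i++) {
              unsigned char c = buf[i];
              lines += (c == '\n');
              if (isspace(c))
                  in_word = 0;
              else if (!in_word) { in_word = 1; words++; }
          }
      }
      printf("%llu %llu %llu\n", lines, words, bytes);
      return 0;
  }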

(TFA almost nerd-sniped me into writing a SIMD implementation already, and you finished the job. End result: on an aging 2.5 GHz Broadwell, I’m seeing 300 MB/s for TFA’s C implementation and 3200 MB/s for my first attempt at an x86-64-v3 one. So more like a 10× speedup, and it’s highly likely one can do better. Admittedly I did spend around an hour on these 50 lines, and would need to spend more in reality to check for CPU features at runtime and so on.)
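
(For the runtime-feature-check part, the detection itself is only a couple of lines with the GCC/Clang __builtin_cpu_supports extension; a sketch, with the kernel choice reduced to a message:)

  /* Sketch of the runtime CPU-feature check a deployable version would need:
     the AVX2 kernel further down also relies on POPCNT, both of which are
     implied by x86-64-v3. __builtin_cpu_supports() is a GCC/Clang extension. */
  #include <stdio.h>

  int main(void) {
      if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("popcnt"))
          puts("dispatch to the AVX2 kernel");
      else
          puts("fall back to a byte-at-a-time kernel");
      return 0;
  }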

However, TFA calls going ASCII-only “cheating”, and I can see its point. I don’t know how difficult a Unicode SIMD wc would be or if it’s even possible to do well. (Bonus points: do you want to define a word boundary as a space/nonspace boundary the way[1] POSIX tells you to, or the more complex way[2] Unicode tells you to?)

[1] https://pubs.opengroup.org/onlinepubs/9699919799/utilities/w...

[2] https://www.unicode.org/reports/tr29/#Word_Boundaries
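
(To make the contrast concrete: even the simpler POSIX-style definition, made locale-aware, ends up decoding one code point at a time, roughly as in the sketch below; mbrtowc/iswspace are the standard C facilities for this, and the UAX #29 rules in [2] would need considerably more state. Hard to see how to vectorize either directly.)

  /* Sketch of a locale-aware, POSIX-style word count: decode multibyte input
     one code point at a time with mbrtowc() and classify it with iswspace().
     Single fixed-size chunk and crude error handling, purely to show the
     shape of the problem; not meant to be fast or complete. */
  #include <locale.h>
  #include <stdio.h>
  #include <wchar.h>
  #include <wctype.h>

  int main(void) {
      setlocale(LC_CTYPE, "");                       /* e.g. a UTF-8 locale */
      unsigned long long words = 0;
      int in_word = 0;
      char buf[1 << 16];
      size_t len = fread(buf, 1, sizeof buf, stdin); /* one chunk, for brevity */
      mbstate_t st = {0};
      for (size_t i = 0; i < len; ) {
          wchar_t wc;
          size_t k = mbrtowc(&wc, buf + i, len - i, &st);
          if (k == (size_t)-1 || k == (size_t)-2) {  /* invalid or truncated */
              st = (mbstate_t){0}; wc = L'?'; k = 1;
          } else if (k == 0) {                       /* embedded NUL */
              wc = L'\0'; k = 1;
          }
          if (iswspace((wint_t)wc))
              in_word = 0;
          else if (!in_word) { in_word = 1; words++; }
          i += k;
      }
      printf("%llu\n", words);
      return 0;
  }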



The (cheating, ASCII-only) SIMD implementation for reference:

  #if 0 /* shell polyglot */
  exec ${CC:-cc} $CPPFLAGS -g -O2 -march=x86-64-v3 $CFLAGS -o "${0%.c}" "$0" || exit $?
  #endif
  /* SPDX-License-Identifier: CC0-1.0 */
  
  #include <immintrin.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <unistd.h>
  
  int main(int argc, char **argv) {
      const char *const progname = argc ? argv[0] : "<unknown>";
      unsigned long long bytes = 0, words = 0, lines = 0;
      uint32_t prev = -1;
      for (;;) {
          static __m256i buf[1024 * 1024];
          ssize_t k, n = 0;
          do {
              k = read(STDIN_FILENO, (char *)buf + n, sizeof buf - n);
          } while (k > 0 && (n += k) < sizeof buf);
          if (k < 0) { perror(progname); return 1; }
          if (!n) break;
          bytes += n;
  
          /* Pad the tail up to a whole 32-byte vector with spaces: padding
             can only close a word, never open one, and adds no newlines. */
          while (n % sizeof *buf) ((char *)buf)[n++] = ' ';
          n /= sizeof *buf;
  
          for (size_t i = 0; i < n; i++) {
              __m256i x = _mm256_load_si256(&buf[i]);
  
              __m256i nl = _mm256_cmpeq_epi8(x, _mm256_set1_epi8('\n'));
              lines += _mm_popcnt_u32(_mm256_movemask_epi8(nl));
  
              /* Whitespace test via a nibble-indexed shuffle: look up each
                 byte's low nibble in the table and check whether the result
                 equals the byte itself, which holds exactly for " \t\n\v\f\r". */
              const __m256i table = _mm256_broadcastsi128_si256(_mm_setr_epi8(
                  /* 0 */ ' ', 0,    0,    0,    0,    0,    0, 0,
                  /* 8 */ 0,   '\t', '\n', '\v', '\f', '\r', 0, 0));
              __m256i ws = _mm256_cmpeq_epi8(x, _mm256_shuffle_epi8(table,
                  _mm256_and_si256(x, _mm256_set1_epi8(0x0F))));
              /* Count whitespace/non-whitespace transitions: prev holds the
                 previous block's mask, of which only the top bit (the byte
                 just before this block) is used; each word contributes two
                 transitions, hence the words / 2 below. */
              uint64_t mask = _mm256_movemask_epi8(ws);
              words += _mm_popcnt_u32(mask ^ (mask << 32 | prev) >> 31);
              prev = mask;
          }
      }
      /* If the input ended on a non-whitespace byte, the last word is still
         open; add its missing closing transition before halving. */
      words += !(prev >> 31);
      printf("%llu %llu %llu\n", lines, words / 2, bytes);
      return 0;
  }
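
(Build/run note: the #if 0 block at the top makes the file a shell polyglot, so running it as e.g. "sh simdwc.c" should compile it with cc -O2 -march=x86-64-v3 into "simdwc" (the name is just an example; anything ending in .c works), and "./simdwc < file" then prints lines, words, and bytes in the same order as wc. It assumes an AVX2-capable CPU, per the runtime-check caveat above.)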


I remember exploring the SIMD fizz-buzz solution[0] and had forgotten just how fast computers can be

[0] https://codegolf.stackexchange.com/questions/215216/high-thr...


There are degrees to this kind of thing, and this is far, far away from that one (exercise: find one instruction in the code that can be deleted without changing anything else). On my (Ryzen 7x40) laptop it does run around two times faster than its SSD can supply data (11 GB/s vs 6 GB/s), and—to my great surprise—gets four times(!) slower if you feed it via (GNU) cat(1) instead of shell redirection (the slowdown from piping from pv is below 1.5×, though).

Yet it’s nowhere near being bottlenecked on memory bandwidth (theoretical at 80 GB/s, sequential-read actual at 55 GB/s, memset at 45 GB/s, or—probably most realistically given we’re not using zero-copy I/O—memcpy at 25 GB/s). As Daniel Lemire put it, “Data engineering at the speed of your disk”[1].
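
(If you want to eyeball numbers like these yourself, a crude probe along the following lines is enough; the buffer size and iteration count are arbitrary, and a careful measurement would pin the thread, warm up, and take the best of several runs:)

  /* Crude memcpy-bandwidth probe: copy a large buffer repeatedly and divide
     bytes copied by elapsed time. Reports bytes copied per second; actual
     bus traffic (read + write) is roughly twice that. */
  #define _POSIX_C_SOURCE 200809L
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <time.h>

  int main(void) {
      size_t size = (size_t)1 << 30;            /* 1 GiB per copy */
      int iters = 8;
      char *src = malloc(size), *dst = malloc(size);
      if (!src || !dst) return 1;
      memset(src, 'x', size);                   /* fault the pages in */
      memset(dst, 'y', size);
      struct timespec t0, t1;
      clock_gettime(CLOCK_MONOTONIC, &t0);
      for (int i = 0; i < iters; i++)
          memcpy(dst, src, size);
      clock_gettime(CLOCK_MONOTONIC, &t1);
      double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
      printf("%.1f GB/s\n", (double)size * iters / s / 1e9);
      return 0;
  }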

Unfortunately, to get that speed out of your computer, you end up having to program any transformations you need in strange, target-dependent, and, most importantly, nearly noncomposable ways. Compiler engineers have been working on the problem for more than two decades, but I don’t think we’re putting away our intrinsics references and latency tables any time soon. One good way[2] to explain the problem is that a single core of the CPU mentioned above (for example) has about as much memory it can access essentially instantly as the original IBM PC had without add-ons: about 64 KB (back then it was called “RAM”; now we call it “physical registers” and “L1 cache”).

[1] https://www.youtube.com/watch?v=p6X8BGSrR9w

[2] https://retrocomputing.stackexchange.com/a/26457



