
An ASCII SIMD wc would likely be much more than 4× faster. The 320 MB/s speed TFA quotes is good for a byte-at-a-time solution, but pitiful as far as the capabilities of modern machines are concerned. A decent SSD is an order of magnitude faster than that. Around 1 GB/s (i.e. ~100 cycles/fetch) is table stakes for a task of this complexity.
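
For a sense of the baseline, the byte-at-a-time approach is essentially a loop of this shape (a minimal sketch, not TFA's actual code):

  /* Minimal byte-at-a-time wc sketch (ASCII/locale-naive), for comparison
     with the SIMD version below; counts lines, words, and bytes. */
  #include <ctype.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void) {
      unsigned long long bytes = 0, words = 0, lines = 0;
      int in_word = 0;
      char buf[1 << 16];
      ssize_t n;
      while ((n = read(STDIN_FILENO, buf, sizeof buf)) > 0) {
          bytes += n;
          for (ssize_t i = 0; i < n; i++) {
              unsigned char c = buf[i];
              lines += (c == '\n');
              if (isspace(c))
                  in_word = 0;
              else if (!in_word) { in_word = 1; words++; }
          }
      }
      printf("%llu %llu %llu\n", lines, words, bytes);
      return 0;
  }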

(TFA almost nerd-sniped me into writing a SIMD implementation already, and you finished the job. End result: on an aging 2.5 GHz Broadwell, I’m seeing 300 MB/s for TFA’s C implementation and 3200 MB/s for my first attempt at an x86-64-v3 one. So more like a 10× speedup, and it’s highly likely one can do better. Admittedly I did spend around an hour on these 50 lines, and would need to spend more in reality to check for CPU features at runtime and so on.)
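
(For the runtime-feature-check part, the detection itself is only a couple of lines with the GCC/Clang __builtin_cpu_supports extension; a sketch, with the kernel choice reduced to a message:)

  /* Sketch of the runtime CPU-feature check a deployable version would need:
     the AVX2 kernel further down also relies on POPCNT, both of which are
     implied by x86-64-v3. __builtin_cpu_supports() is a GCC/Clang extension. */
  #include <stdio.h>

  int main(void) {
      if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("popcnt"))
          puts("dispatch to the AVX2 kernel");
      else
          puts("fall back to a byte-at-a-time kernel");
      return 0;
  }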

However, TFA calls going ASCII-only “cheating”, and I can see its point. I don’t know how difficult a Unicode SIMD wc would be or if it’s even possible to do well. (Bonus points: do you want to define a word boundary as a space/nonspace boundary the way[1] POSIX tells you to, or the more complex way[2] Unicode tells you to?)

[1] https://pubs.opengroup.org/onlinepubs/9699919799/utilities/w...

[2] https://www.unicode.org/reports/tr29/#Word_Boundaries
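
(To make the contrast concrete: even the simpler POSIX-style definition, made locale-aware, ends up decoding one code point at a time, roughly as in the sketch below; mbrtowc/iswspace are the standard C facilities for this, and the UAX #29 rules in [2] would need considerably more state. Hard to see how to vectorize either directly.)

  /* Sketch of a locale-aware, POSIX-style word count: decode multibyte input
     one code point at a time with mbrtowc() and classify it with iswspace().
     Single fixed-size chunk and crude error handling, purely to show the
     shape of the problem; not meant to be fast or complete. */
  #include <locale.h>
  #include <stdio.h>
  #include <wchar.h>
  #include <wctype.h>

  int main(void) {
      setlocale(LC_CTYPE, "");                       /* e.g. a UTF-8 locale */
      unsigned long long words = 0;
      int in_word = 0;
      char buf[1 << 16];
      size_t len = fread(buf, 1, sizeof buf, stdin); /* one chunk, for brevity */
      mbstate_t st = {0};
      for (size_t i = 0; i < len; ) {
          wchar_t wc;
          size_t k = mbrtowc(&wc, buf + i, len - i, &st);
          if (k == (size_t)-1 || k == (size_t)-2) {  /* invalid or truncated */
              st = (mbstate_t){0}; wc = L'?'; k = 1;
          } else if (k == 0) {                       /* embedded NUL */
              wc = L'\0'; k = 1;
          }
          if (iswspace((wint_t)wc))
              in_word = 0;
          else if (!in_word) { in_word = 1; words++; }
          i += k;
      }
      printf("%llu\n", words);
      return 0;
  }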



The (cheating, ASCII-only) SIMD implementation for reference:

  #if 0 /* shell polyglot */
  exec ${CC:-cc} $CPPFLAGS -g -O2 -march=x86-64-v3 $CFLAGS -o "${0%.c}" "$0" || exit $?
  #endif
  /* SPDX-License-Identifier: CC0-1.0 */
  
  #include <immintrin.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <unistd.h>
  
  int main(int argc, char **argv) {
      const char *const progname = argc ? argv[0] : "<unknown>";
      unsigned long long bytes = 0, words = 0, lines = 0;
      uint32_t prev = -1;
      for (;;) {
          static __m256i buf[1024 * 1024];
          ssize_t k, n = 0;
          do {
              k = read(STDIN_FILENO, (char *)buf + n, sizeof buf - n);
          } while (k > 0 && (n += k) < sizeof buf);
          if (k < 0) { perror(progname); return 1; }
          if (!n) break;
          bytes += n;
  
          /* Pad the tail up to a whole 32-byte vector with spaces: padding
             can only close a word, never open one, and adds no newlines. */
          while (n % sizeof *buf) ((char *)buf)[n++] = ' ';
          n /= sizeof *buf;
  
          for (size_t i = 0; i < n; i++) {
              __m256i x = _mm256_load_si256(&buf[i]);
  
              __m256i nl = _mm256_cmpeq_epi8(x, _mm256_set1_epi8('\n'));
              lines += _mm_popcnt_u32(_mm256_movemask_epi8(nl));
  
              /* Whitespace test via a nibble-indexed shuffle: look up each
                 byte's low nibble in the table and check whether the result
                 equals the byte itself, which holds exactly for " \t\n\v\f\r". */
              const __m256i table = _mm256_broadcastsi128_si256(_mm_setr_epi8(
                  /* 0 */ ' ', 0,    0,    0,    0,    0,    0, 0,
                  /* 8 */ 0,   '\t', '\n', '\v', '\f', '\r', 0, 0));
              __m256i ws = _mm256_cmpeq_epi8(x, _mm256_shuffle_epi8(table,
                  _mm256_and_si256(x, _mm256_set1_epi8(0x0F))));
              /* Count whitespace/non-whitespace transitions: prev holds the
                 previous block's mask, of which only the top bit (the byte
                 just before this block) is used; each word contributes two
                 transitions, hence the words / 2 below. */
              uint64_t mask = _mm256_movemask_epi8(ws);
              words += _mm_popcnt_u32(mask ^ (mask << 32 | prev) >> 31);
              prev = mask;
          }
      }
      /* If the input ended on a non-whitespace byte, the last word is still
         open; add its missing closing transition before halving. */
      words += !(prev >> 31);
      printf("%llu %llu %llu\n", lines, words / 2, bytes);
      return 0;
  }
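
(Build/run note: the #if 0 block at the top makes the file a shell polyglot, so running it as e.g. "sh simdwc.c" should compile it with cc -O2 -march=x86-64-v3 into "simdwc" (the name is just an example; anything ending in .c works), and "./simdwc < file" then prints lines, words, and bytes in the same order as wc. It assumes an AVX2-capable CPU, per the runtime-check caveat above.)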


I remember exploring the SIMD fizz-buzz solution[0] and had forgotten just how fast computers can be

[0] https://codegolf.stackexchange.com/questions/215216/high-thr...


There are degrees to this kind of thing, and this is far, far away from that one (exercise: find one instruction in the code that can be deleted without changing anything else). On my (Ryzen 7x40) laptop it does run around two times faster than its SSD can supply data (11 GB/s vs 6 GB/s), and—to my great surprise—gets four times(!) slower if you feed it via (GNU) cat(1) instead of shell redirection (the slowdown from piping from pv is below 1.5×, though).

Yet it’s nowhere near being bottlenecked on memory bandwidth (theoretical at 80 GB/s, sequential-read actual at 55 GB/s, memset at 45 GB/s, or—probably most realistically given we’re not using zero-copy I/O—memcpy at 25 GB/s). As Daniel Lemire put it, “Data engineering at the speed of your disk”[1].
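
(If you want to eyeball numbers like these yourself, a crude probe along the following lines is enough; the buffer size and iteration count are arbitrary, and a careful measurement would pin the thread, warm up, and take the best of several runs:)

  /* Crude memcpy-bandwidth probe: copy a large buffer repeatedly and divide
     bytes copied by elapsed time. Reports bytes copied per second; actual
     bus traffic (read + write) is roughly twice that. */
  #define _POSIX_C_SOURCE 200809L
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <time.h>

  int main(void) {
      size_t size = (size_t)1 << 30;            /* 1 GiB per copy */
      int iters = 8;
      char *src = malloc(size), *dst = malloc(size);
      if (!src || !dst) return 1;
      memset(src, 'x', size);                   /* fault the pages in */
      memset(dst, 'y', size);
      struct timespec t0, t1;
      clock_gettime(CLOCK_MONOTONIC, &t0);
      for (int i = 0; i < iters; i++)
          memcpy(dst, src, size);
      clock_gettime(CLOCK_MONOTONIC, &t1);
      double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
      printf("%.1f GB/s\n", (double)size * iters / s / 1e9);
      return 0;
  }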

Unfortunately, to get that speed out of your computer, you end up having to program any transformations you need in strange, target-dependent, and, most importantly, nearly noncomposable ways. Compiler engineers have been working on the problem for more than two decades, but I don’t think we’re putting away our intrinsics references and latency tables any time soon. One good way[2] to explain the problem is that a single core of the CPU mentioned above (for example) has about as much memory it can access essentially instantly as the original IBM PC had without add-ons: about 64 KB (back then it was called “RAM”; now we call it “physical registers” and “L1 cache”).

[1] https://www.youtube.com/watch?v=p6X8BGSrR9w

[2] https://retrocomputing.stackexchange.com/a/26457



