Hacker News

There are degrees to this kind of thing, and this is far, far away from that one (exercise: find one instruction in the code that can be deleted without changing anything else). On my (Ryzen 7x40) laptop it runs about twice as fast as its SSD can supply data (11 GB/s vs 6 GB/s), and—to my great surprise—gets four times(!) slower if you feed it via (GNU) cat(1) instead of shell redirection (piping from pv, though, costs less than 1.5×).
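For what it’s worth, the feed-mechanism difference is easy to measure with a trivial stdin-draining reader (a Python sketch, names mine; the real program obviously does more than count bytes):

```python
import io
import sys
import time

def drain(stream, chunk_size=1 << 20):
    """Read a binary stream to EOF in large chunks; return (bytes, GB/s)."""
    total = 0
    read = stream.read
    start = time.perf_counter()
    # Large reads keep per-call overhead negligible relative to the copy.
    while chunk := read(chunk_size):
        total += len(chunk)
    elapsed = time.perf_counter() - start
    return total, total / elapsed / 1e9

if __name__ == "__main__":
    n, gbps = drain(sys.stdin.buffer)
    print(f"{n} bytes, {gbps:.2f} GB/s")
```

Run it once as `./drain.py < big.bin`, then as `cat big.bin | ./drain.py` and `pv big.bin | ./drain.py`, and compare the reported rates.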

Yet it’s nowhere near being bottlenecked on memory bandwidth (theoretical at 80 GB/s, sequential-read actual at 55 GB/s, memset at 45 GB/s, or—probably most realistically given we’re not using zero-copy I/O—memcpy at 25 GB/s). As Daniel Lemire put it, “Data engineering at the speed of your disk”[1].
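The memcpy figure in particular is straightforward to sanity-check even from a high-level language, since a CPython bytearray slice assignment bottoms out in a single memcpy (a rough sketch; buffer size and repeat count are arbitrary):

```python
import time

def copy_bandwidth(size, repeats):
    """Copy `size` bytes `repeats` times via slice assignment; return GB/s."""
    src = bytearray(size)
    dst = bytearray(size)
    start = time.perf_counter()
    for _ in range(repeats):
        dst[:] = src  # CPython implements this as one memcpy
    elapsed = time.perf_counter() - start
    return size * repeats / elapsed / 1e9

if __name__ == "__main__":
    # 64 MiB is far bigger than any cache level, so this approximates
    # sustained memory-to-memory copy bandwidth.
    print(f"copy: {copy_bandwidth(64 << 20, repeats=16):.1f} GB/s")
```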

Unfortunately, to get that speed out of your computer, you end up having to program whatever transformations you need in strange, target-dependent, and, most importantly, nearly noncomposable ways. Compiler engineers have been working on the problem for more than two decades, but I don’t think we’re putting away our intrinsics references and latency tables any time soon. One good way[2] to explain the problem is that a single core of the CPU mentioned above (for example) has about as much memory it can access essentially instantly as the original IBM PC had without add-ons: about 64 KB (then it was called “RAM”, now we call it “physical registers” and “L1 cache”).
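You can see the edge of that fast region yourself by sweeping the working-set size of a copy loop: throughput drops once the buffers stop fitting in a given cache level (a sketch; interpreter overhead blurs the smallest sizes, so only the large-scale trend is meaningful, and the exact sizes below are arbitrary):

```python
import time

def sweep_bandwidth(size, total=256 << 20):
    """Repeatedly copy a working set of 2*`size` bytes; return GB/s."""
    src = bytearray(size)
    dst = bytearray(size)
    repeats = max(1, total // size)
    start = time.perf_counter()
    for _ in range(repeats):
        dst[:] = src
    elapsed = time.perf_counter() - start
    return size * repeats / elapsed / 1e9

if __name__ == "__main__":
    # Working sets straddling a typical ~32 KiB L1, then L2/L3, then DRAM.
    for size in (8 << 10, 128 << 10, 2 << 20, 64 << 20):
        print(f"{size >> 10:6d} KiB buffers: {sweep_bandwidth(size):6.1f} GB/s")
```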

[1] https://www.youtube.com/watch?v=p6X8BGSrR9w

[2] https://retrocomputing.stackexchange.com/a/26457


