I wonder how much these tricks matter nowadays. Even if we ignore the time to read from disk, for simple algorithms that scan data linearly, it is often the case that the CPU is spending most of its time waiting for the data to be read from RAM.
Also, in modern x86 processors, an entire cache line (64 bytes) is read whenever a single byte is needed. So for strings shorter than 64 bytes, even algorithms that don't check every byte will read every byte from disk to RAM and from RAM to the processor cache.
Even if we ignore the time to read from disk, for simple algorithms that scan data linearly, it is often the case that the CPU is spending most of its time waiting for the data to be read from RAM.
Nope. Even DDR-3 can perform sequential transfers at speeds well exceeding one byte per CPU clock cycle.
And yes, you can ignore the time to read from disk. Think of a SAN over a 10-gigabit link. That gets you close to byte-per-cycle territory as well. It takes on the order of a dozen cycles (more or less, depending on architecture and algorithm) to perform one "step" of a search. So yes, these algorithms very much matter.
Also, in modern x86 processors, an entire cache line (64 bytes) is read whenever a single byte is needed. So for strings shorter than 64 bytes, even algorithms that don't check every byte will read every byte from disk to RAM and from RAM to the processor cache.
Yep. But this is not necessarily the bottleneck; see my above comments about the CPU.
The CPU can also execute more than one instruction per clock cycle. For example, instructions such as MOV, CMP, and even conditional jumps run at a typical throughput of three instructions per clock cycle on the Intel Core line of CPUs. (source: http://www.agner.org/optimize/instruction_tables.pdf)
This means that the simplest hand-crafted loop that loads a byte, compares its value to the first byte of the searched string, and then does a conditional jump should (if unrolled) take about one clock cycle per byte examined!
This means that if our DDR3 memory can read 6 GB of data per second and our CPU core is clocked at 3 GHz, this completely naive algorithm will run at half of the theoretical maximum speed.
Using XMM instructions (which work on 16 bytes at a time) should probably get us to the limit of the RAM speed.
Regarding the SAN over a 10-gig link: I don't know about you, but the computer I'm typing this on has an SSD that can read only a few hundred megabytes per second.