Not serving, but handling 2100k requests. Your skepticism is rightly placed: HTTP is yet another example of an inefficient protocol that is nonetheless used as the primary protocol on the internet. Some webservers[1] can serve millions of requests per second, but I'd never use HTTP in code where efficiency is key.

No, I'm talking about handling requests. In this particular case, requests (32 to 64 bytes each) were flowing through several services on the same machine. I replaced the processing chain with a single application to remove the overhead of serialization between processes. Requests were filtered early in the pipeline, which yielded a ~55% reduction in the work needed (rough sketch below).
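To make the early-filtering idea concrete, here's a simplified sketch, not my actual code: the 32-byte request layout, the type/flags fields, and the predicate are all made up for illustration. The point is that a cheap in-place check drops requests before any parsing, allocation, or hand-off happens.

    /* Hypothetical fixed-size request; layout is illustrative only. */
    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uint8_t type;          /* hypothetical: message kind     */
        uint8_t flags;         /* hypothetical: routing flags    */
        uint8_t payload[30];   /* rest of the 32-byte request    */
    } request_t;

    /* Cheap predicate applied before any expensive stage runs. */
    static inline int wanted(const request_t *r)
    {
        return r->type == 0x01 && (r->flags & 0x80) == 0;
    }

    /* Compact the batch in place; returns the survivor count. */
    static size_t prefilter(request_t *batch, size_t n)
    {
        size_t kept = 0;
        for (size_t i = 0; i < n; i++)
            if (wanted(&batch[i]))
                batch[kept++] = batch[i];  /* 32-byte struct copy */
        return kept;
    }

Because everything lives in one process, the "filter" is just a branch over bytes already in cache; there's no deserialize-check-reserialize round trip.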

Requests were then batched into succinct data structures and processed via SIMD. Output used to be JSON; instead I wrote a custom memory allocator and just memcpy the entire blob onto the wire.
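Roughly like this (again a hedged sketch, assuming SSE2 and a hypothetical batch of contiguous 32-bit values; the "transform" and the bump allocator are stand-ins, not my real structures):

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stdint.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Bump allocator over one contiguous arena: output records are
     * built in place in wire format, so "serialization" is just one
     * write of the arena. */
    typedef struct { uint8_t *base; size_t used, cap; } arena_t;

    static void *arena_alloc(arena_t *a, size_t n)
    {
        if (a->used + n > a->cap) return NULL;
        void *p = a->base + a->used;
        a->used += n;
        return p;
    }

    /* Process 4 values per iteration; in this simplified sketch n is
     * a multiple of 4 and `vals` is 16-byte aligned. */
    static void process_batch(int32_t *vals, size_t n)
    {
        const __m128i bias = _mm_set1_epi32(42);  /* made-up transform */
        for (size_t i = 0; i < n; i += 4) {
            __m128i v = _mm_load_si128((__m128i *)&vals[i]);
            v = _mm_add_epi32(v, bias);
            _mm_store_si128((__m128i *)&vals[i], v);
        }
    }

    /* One syscall, no JSON: the arena already holds wire-format
     * bytes. Error handling elided. */
    static void flush(int fd, arena_t *a)
    {
        write(fd, a->base, a->used);
        a->used = 0;
    }

Skipping the JSON step matters twice: you save the encode on the way out and the decode on the way back in, and the allocator guarantees the output is already one contiguous blob.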

Before: no pre-filtering, off-the-shelf database (PostgreSQL), a queue system for I/O, named pipes and TCP/IP for local data transfer. Lots of concurrency issues, thread starvation, and I/O-bound work.

After: aggressive pre-filtering, succinct data structures for cache coherence, no serialization overhead, SIMD processing. Can saturate a 32-core CPU with almost no overhead.
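Saturating all cores with no overhead basically means one pinned worker per core, each owning its own shard of the data, so there's no locking and no shared mutable state. A minimal Linux-specific sketch (pthread_setaffinity_np; the worker body is a placeholder, error handling elided):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    #define NCORES 32

    static void *worker(void *arg)
    {
        long core = (long)arg;
        /* Each worker would drain its own queue shard here. */
        printf("worker pinned to core %ld\n", core);
        return NULL;
    }

    int main(void)
    {
        pthread_t tids[NCORES];
        for (long i = 0; i < NCORES; i++) {
            pthread_create(&tids[i], NULL, worker, (void *)i);
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET((int)i, &set);
            pthread_setaffinity_np(tids[i], sizeof(set), &set);
        }
        for (int i = 0; i < NCORES; i++)
            pthread_join(tids[i], NULL);
        return 0;
    }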

[1] https://www.techempower.com/benchmarks/#section=data-r13&hw=...


