Indeed. I think you hit different saturation points the wider the use cases you cover. One example with a single core (which, btw, I agree with wholeheartedly for IO) is checksumming + decoding.
For Kafka, we have multiple indexes - a time index and an offset index - which are simple metadata. The trouble becomes how you handle decompression+checksumming+compression to support compacted topics. ( https://github.com/vectorizedio/redpanda/blob/dev/src/v/stor... )
So a single core starts to get saturated while doing both foreground and background requests.
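To make the "simple metadata" vs. heavy-path contrast concrete, here is roughly what the two index entries look like (field layout per Kafka's on-disk index format; the struct names are just for illustration, not our actual code):

    #include <cstdint>

    // Offset index: maps a relative offset within the segment to a byte
    // position in the log file. Kafka packs these as two 4-byte fields,
    // so each entry is 8 bytes.
    struct offset_index_entry {
        uint32_t relative_offset; // message offset minus the segment base offset
        uint32_t file_position;   // byte position of the batch in the segment
    };

    // Time index: maps a timestamp to the relative offset of the first
    // message at-or-after it. 8-byte timestamp + 4-byte offset = 12 bytes.
    struct time_index_entry {
        int64_t  timestamp_ms;
        uint32_t relative_offset;
    };

    // Lookups are a binary search over fixed-width entries - cheap metadata.
    // The expensive part is the decompress -> checksum -> recompress cycle
    // that compaction forces on the record batches themselves.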
.....
Now assume that you handle that with correct priorities for IO and CPU scheduling... the next bottleneck will be keeping up with background tasks.
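In Seastar terms, the CPU half of that prioritization looks roughly like this - scheduling groups with different shares for foreground and background work. A minimal sketch; the group names and share numbers are made up for illustration:

    #include <seastar/core/future-util.hh>
    #include <seastar/core/scheduling.hh>

    seastar::future<> prioritized_work() {
        // 4:1 CPU shares between client-facing work and compaction; the
        // ratio only bites under contention, so idle CPU stays available
        // to whichever group has runnable tasks.
        return seastar::create_scheduling_group("foreground", 800).then(
            [](seastar::scheduling_group fg) {
            return seastar::create_scheduling_group("background", 200).then(
                [fg](seastar::scheduling_group bg) {
                auto fg_fut = seastar::with_scheduling_group(fg, [] {
                    return seastar::make_ready_future<>(); // serve requests here
                });
                auto bg_fut = seastar::with_scheduling_group(bg, [] {
                    return seastar::make_ready_future<>(); // run compaction here
                });
                return seastar::when_all(std::move(fg_fut),
                                         std::move(bg_fut)).discard_result();
            });
        });
    }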
So then you start to add more threads. But as you mentioned, and as I tried to highlight in that article, the cost of implicit or simple synchronization is very high (as your intuition noted).
Thread-per-core buffer management with deferred destructors is really handy at doing 3 things explicitly (see the sketch after the list):
1. Your cross-core communication is explicit - that is, you give it shares as part of a quota so that you understand how your system priorities work across the whole system for any kind of workload. This is helpful for prioritizing foreground and background work.
2. There is effectively a const memory address once you parse it - so you treat it as largely immutable, and you can add hooks (say, crash if it is modified on a remote core).
3. It makes memory accounting fast, i.e.: instead of pushing a global barrier for the allocator, you simply send a message back to the originating core for allocator accounting. This becomes hugely important as you start to increase the number of cores.
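A hand-rolled sketch of what points 2 and 3 boil down to in Seastar (its foreign_ptr packages roughly this discipline; the batch type here is illustrative, not our actual code):

    #include <seastar/core/future.hh>
    #include <seastar/core/smp.hh>
    #include <vector>

    struct batch {
        std::vector<char> data; // parsed once, then treated as immutable
        unsigned owner_shard;   // the shard whose allocator owns this memory
    };

    seastar::future<> process_on(unsigned target_shard, batch* b) {
        return seastar::smp::submit_to(target_shard, [b] {
            // Read-only work on the remote core: the buffer is effectively
            // const here (point 2), so no locking is needed.
            (void)b->data.size();
        }).then([b] {
            // Deferred destruction (point 3): ship the free back to the
            // originating core so allocator accounting stays shard-local
            // instead of requiring a global barrier.
            return seastar::smp::submit_to(b->owner_shard, [b] { delete b; });
        });
    }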
>>> the trouble becomes how you handle decompression+checksumming+compression
gzip will cap at about 1 MB/s with the strongest compression setting and 50 MB/s with the fastest setting, which is really slow.
The first step to improving Kafka is for it to adopt zstd compression.
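For a sense of the swap, this is stock libzstd one-shot usage (not Kafka code); level 3 is zstd's default and already sits far above gzip's throughput at a comparable ratio:

    #include <zstd.h>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // One-shot zstd compression of a record batch. Level 3 is the default;
    // higher levels trade throughput for ratio, negative levels do the reverse.
    std::vector<char> compress_batch(const std::string& batch, int level = 3) {
        std::vector<char> out(ZSTD_compressBound(batch.size()));
        size_t n = ZSTD_compress(out.data(), out.size(),
                                 batch.data(), batch.size(), level);
        if (ZSTD_isError(n)) {
            throw std::runtime_error(ZSTD_getErrorName(n));
        }
        out.resize(n);
        return out;
    }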
Another thing that really hurts is SSL. A desktop CPU with AES instructions can push 1 GB/s, so it's not too bad, but that may not be the CPU you have or the default algorithm used by the software.
lz4 is a good option for really high-performance compression as well. (Zstd is my general recommendation, and both beat the pants off of gzip, but for very high-throughput applications lz4 still beats zstd. Both are designs from Yann Collet.)
Indeed. Though the recent zstd changes with different (negative) compression levels sort of close the perf gap that lz4 had over zstd. (If you're interested in this kind of detail for a new streaming storage engine, I gave a talk last week at the Facebook Performance Summit - https://twitter.com/perfsummit1/status/1337603028677902336)
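If anyone wants to poke at that gap themselves, these are the two calls to compare - zstd's negative "fast" levels (added in v1.3.4) are the change I mean. Toy buffer below; benchmark with real log data:

    #include <lz4.h>
    #include <zstd.h>
    #include <cstdio>
    #include <string>
    #include <vector>

    int main() {
        std::string src(1 << 20, 'x'); // 1 MiB of filler, just to show the calls

        // lz4: one fast mode, tuned purely for speed.
        std::vector<char> lz4_out(LZ4_compressBound((int)src.size()));
        int lz4_n = LZ4_compress_default(src.data(), lz4_out.data(),
                                         (int)src.size(), (int)lz4_out.size());

        // zstd: negative levels trade ratio for lz4-class speed.
        std::vector<char> zstd_out(ZSTD_compressBound(src.size()));
        size_t zstd_n = ZSTD_compress(zstd_out.data(), zstd_out.size(),
                                      src.data(), src.size(), /*level=*/-5);

        std::printf("lz4: %d bytes, zstd(-5): %zu bytes\n", lz4_n, zstd_n);
    }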