It can be many things, my guess is likely memory bandwidth. There's so much MB/s...

hinkley · on June 18, 2023

Higher parallelism is usually about timeliness for interactive workloads, not throughput. Unless it’s just the classic fallacy that parallelism = speed.

For throughput tasks it’s often the case that you go with less parallelism to reduce Amdahl’s law a few percent, and instead investing in keeping the pipeline saturated, so that the variance in concurrent tasks is lower. Work stealing being one of the more notable tricks.