Came to a similar conclusion back in the day when writing a raytracer; it stopped scaling past 8 or so cores.
Ended up with a system where each thread accumulated results in small buffers and appended pointers to those buffers to a shared "buffer list", which was very fast due to low contention with a typical spinlock+mutex combo.
The thread that overflowed the buffer list would then become the single writer by taking on the responsibility of accumulating the results into the shared output image. It would start by swapping in a fresh list so the other threads could carry on.
The system would self-tune by regulating the size of the shared buffer list so that the other threads could keep working while the one "writer thread" accumulated.
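Roughly the shape of it, as a minimal C++ sketch (not the original code; SampleBuffer, submit, and the plain std::mutex standing in for the spinlock+mutex combo are all made-up for illustration):

    #include <cstddef>
    #include <mutex>
    #include <vector>

    struct Sample { std::size_t pixel; float r, g, b; };

    // Small per-thread buffer that a worker fills with samples.
    struct SampleBuffer {
        std::vector<Sample> samples;
    };

    class Accumulator {
    public:
        Accumulator(std::size_t pixel_count, std::size_t list_capacity)
            : image_(pixel_count * 3, 0.0f), list_capacity_(list_capacity) {}

        // Called by a worker thread when its local buffer is full.
        void submit(SampleBuffer* buf) {
            std::vector<SampleBuffer*> to_drain;
            {
                std::lock_guard<std::mutex> lock(list_mutex_);  // short critical section
                pending_.push_back(buf);
                if (pending_.size() < list_capacity_)
                    return;                                     // fast path: just append
                // This thread overflowed the list: swap in a fresh one and
                // take responsibility for draining the full one.
                to_drain.swap(pending_);
            }
            // Outside the lock: other threads keep appending to the fresh list
            // while this thread accumulates into the shared output image.
            for (SampleBuffer* b : to_drain) {
                for (const Sample& s : b->samples) {
                    image_[s.pixel * 3 + 0] += s.r;
                    image_[s.pixel * 3 + 1] += s.g;
                    image_[s.pixel * 3 + 2] += s.b;
                }
                delete b;
            }
        }

    private:
        std::vector<float> image_;            // shared accumulated output image
        std::vector<SampleBuffer*> pending_;  // the shared "buffer list"
        std::mutex list_mutex_;               // stand-in for the spinlock+mutex combo
        std::size_t list_capacity_;           // could be adjusted at runtime to self-tune
    };

Workers would allocate a SampleBuffer, render into it, call submit when full, and start a fresh one; list_capacity_ is the knob the self-tuning would adjust.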
Probably had room for improvement, but after this change it scaled almost linearly to at least 32 cores, which was the largest system available for testing at the time.
The reason for not simply allocating a full output image per thread and accumulating post-render was mainly the memory requirements for large output images.