
> it's absolutely not the case that all lanes of a hardware thread (warp / wavefront) have to come from the same triangle. That would be insanely inefficient

I am no GPU expert, but I ran some experiments a while ago that suggested this is in fact how it works, at least on Nvidia.

I would expect that having all the interpolants come from the same triangle simplifies the fragment-processing pipeline. Another factor that comes to mind is that, because of the 2x2 quad padding, you would end up with multiple shader invocations for the same pixel location coming from different triangles; that would probably require complicated bookkeeping, especially with MSAA.
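
To put a rough number on the quad-padding overhead, here is a small illustrative C sketch (my own, not anything the hardware or driver actually runs) that counts how many lanes get launched for one triangle's coverage once pixels are padded out to full 2x2 quads; pixels along a shared edge get pulled into quads from both triangles, so the same screen location ends up shaded more than once:

    #include <stdint.h>

    /* Illustrative only: given an 8x8 coverage mask for one triangle
     * (bit y*8+x set if the triangle covers pixel (x,y)), count how many
     * lanes would be launched once pixels are padded out to full 2x2 quads.
     * Pixels along a shared edge are covered by quads from both triangles,
     * so the same screen location can be shaded twice. */
    static int lanes_launched(uint64_t coverage)
    {
        int quads = 0;
        for (int qy = 0; qy < 4; ++qy) {
            for (int qx = 0; qx < 4; ++qx) {
                uint64_t quad_mask =
                    (UINT64_C(3) << (2 * qy * 8 + 2 * qx)) |      /* top row of the quad */
                    (UINT64_C(3) << ((2 * qy + 1) * 8 + 2 * qx)); /* bottom row of the quad */
                if (coverage & quad_mask)
                    ++quads;        /* any covered pixel drags in the whole quad */
            }
        }
        return quads * 4;           /* 4 lanes per quad, helper lanes included */
    }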




It would be interesting to see how you were testing for that, because at least on AMD it's fairly certain that a single hardware thread (wavefront) can be shading multiple primitives.

For example, according to the ISA docs [1], pixel waves are preloaded with an SGPR containing a bit mask indicating exactly that:

> The new_prim_mask is a 15-bit mask with one bit per quad; a one in this mask indicates that this quad begins a new primitive, a zero indicates it uses the same primitive as the previous quad. The mask is 15 bits, not 16, since the first quad in a wavefront begins a new primitive and so it is not included in the mask

The mask is used by the interp instructions to load the correct interpolants from local memory.
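
As a concrete illustration of those semantics (plain C, assuming a 64-lane wavefront, i.e. 16 quads; the function name is mine), expanding the mask into a per-quad primitive index looks something like this:

    #include <stdint.h>

    /* Illustrative only: expand a 15-bit new_prim_mask into a per-quad
     * primitive index, assuming a 64-lane wavefront = 16 quads.
     * Quad 0 always begins a new primitive, so it has no bit in the mask;
     * bit (q-1) set means quad q begins a new primitive.
     * Returns the number of distinct primitives in the wavefront. */
    static int expand_new_prim_mask(uint16_t new_prim_mask, int prim_index[16])
    {
        int prim = 0;
        prim_index[0] = 0;                      /* first quad starts primitive 0 */
        for (int quad = 1; quad < 16; ++quad) {
            if (new_prim_mask & (1u << (quad - 1)))
                ++prim;                         /* this quad starts a new primitive */
            prim_index[quad] = prim;
        }
        return prim + 1;
    }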

In fact, in the (older) GCN3 docs [2] there is a diagram showing the memory layout of attributes from multiple primitives for a single wavefront (page 99).

That being said, I would of course expect this process to be "lazy": you would not want to hold back a partially filled wave forever, so depending on the workload you might measure different things.

[1] https://developer.amd.com/wp-content/resources/RDNA2_Shader_...

[2] http://developer.amd.com/wordpress/media/2013/12/AMD_GCN3_In...


I think I drew that conclusion from the following: I rendered a full-screen quad as two triangles and made each fragment display a hash of its threadgroup id. Most threadgroups were laid out in nice, aligned 4x8 rectangles, but near the boundary between the two triangles they became warped and distorted so they could stay within the same triangle. That said, it occurs to me now that this could be opportunistic; I am going to repeat the experiment with many triangles, each smaller than a single threadgroup.
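
For reference, the colouring half of that experiment is just a generic integer hash mapped to RGB, something like the C sketch below (the mixing constants are arbitrary avalanching constants, and how you actually obtain a wave/threadgroup id inside a fragment shader is API-specific):

    #include <stdint.h>

    /* Illustrative only: turn an arbitrary wave/threadgroup id into a
     * visually distinct RGB colour so adjacent groups are easy to tell apart.
     * The constants are just a generic integer mixer, nothing GPU-specific. */
    static void id_to_rgb(uint32_t id, float rgb[3])
    {
        uint32_t h = id;
        h ^= h >> 16; h *= 0x7feb352dU;   /* avalanching mix */
        h ^= h >> 15; h *= 0x846ca68bU;
        h ^= h >> 16;
        rgb[0] = (float)((h >>  0) & 0xff) / 255.0f;
        rgb[1] = (float)((h >>  8) & 0xff) / 255.0f;
        rgb[2] = (float)((h >> 16) & 0xff) / 255.0f;
    }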



