
> it's absolutely not the case that all lanes of a hardware thread (warp / wavefront) have to come from the same triangle. That would be insanely inefficient

I am no GPU expert, but I ran some experiments a while ago that suggested this is in fact how it works, at least on Nvidia.

I would expect that having all the interpolants come from the same triangle simplifies the fragment-processing pipeline. Another factor that comes to mind is that, because of the 2x2 quad padding, you would end up with multiple shader invocations for the same pixel location coming from different triangles; that would probably require complicated bookkeeping, especially with MSAA.
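
To put a rough number on the quad-padding overhead, here is a small illustrative C sketch (my own, not anything the hardware or driver actually runs) that counts how many lanes get launched for one triangle's coverage once pixels are padded out to full 2x2 quads; pixels along a shared edge get pulled into quads from both triangles, so the same screen location ends up shaded more than once:

    #include <stdint.h>

    /* Illustrative only: given an 8x8 coverage mask for one triangle
     * (bit y*8+x set if the triangle covers pixel (x,y)), count how many
     * lanes would be launched once pixels are padded out to full 2x2 quads.
     * Pixels along a shared edge are covered by quads from both triangles,
     * so the same screen location can be shaded twice. */
    static int lanes_launched(uint64_t coverage)
    {
        int quads = 0;
        for (int qy = 0; qy < 4; ++qy) {
            for (int qx = 0; qx < 4; ++qx) {
                uint64_t quad_mask =
                    (UINT64_C(3) << (2 * qy * 8 + 2 * qx)) |      /* top row of the quad */
                    (UINT64_C(3) << ((2 * qy + 1) * 8 + 2 * qx)); /* bottom row of the quad */
                if (coverage & quad_mask)
                    ++quads;        /* any covered pixel drags in the whole quad */
            }
        }
        return quads * 4;           /* 4 lanes per quad, helper lanes included */
    }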




It would be interesting to see how you were testing for that, because at least on AMD it's fairly certain that a single hardware thread (wavefront) can be shading multiple primitives.

For example, according to the ISA docs [1], pixel waves are preloaded with an SGPR containing a bit mask indicating exactly that:

> The new_prim_mask is a 15-bit mask with one bit per quad; a one in this mask indicates that this quad begins a new primitive, a zero indicates it uses the same primitive as the previous quad. The mask is 15 bits, not 16, since the first quad in a wavefront begins a new primitive and so it is not included in the mask

The mask is used by the interp instructions to load the correct interpolants from local memory.
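
As a concrete illustration of those semantics (plain C, assuming a 64-lane wavefront, i.e. 16 quads; the function name is mine), expanding the mask into a per-quad primitive index looks something like this:

    #include <stdint.h>

    /* Illustrative only: expand a 15-bit new_prim_mask into a per-quad
     * primitive index, assuming a 64-lane wavefront = 16 quads.
     * Quad 0 always begins a new primitive, so it has no bit in the mask;
     * bit (q-1) set means quad q begins a new primitive.
     * Returns the number of distinct primitives in the wavefront. */
    static int expand_new_prim_mask(uint16_t new_prim_mask, int prim_index[16])
    {
        int prim = 0;
        prim_index[0] = 0;                      /* first quad starts primitive 0 */
        for (int quad = 1; quad < 16; ++quad) {
            if (new_prim_mask & (1u << (quad - 1)))
                ++prim;                         /* this quad starts a new primitive */
            prim_index[quad] = prim;
        }
        return prim + 1;
    }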

In fact, in the (older) GCN3 docs [2] there is a diagram showing the memory layout of attributes from multiple primitives for a single wavefront (page 99).

That being said, I would of course expect this process to be "lazy": you would not want to hold back a partially filled wave forever, so depending on the workload you might measure different things.

[1] https://developer.amd.com/wp-content/resources/RDNA2_Shader_...

[2] http://developer.amd.com/wordpress/media/2013/12/AMD_GCN3_In...


I think I drew that conclusion from the following: I rendered a full-screen quad as two triangles and made each fragment display a hash of its threadgroup id. Most threadgroups were laid out in nice, aligned 4x8 rectangles, but near the boundary between the two triangles they became warped and distorted so they could stay within the same triangle. That said, it occurs to me now that this could be opportunistic; I am going to repeat the experiment with many triangles, each smaller than a single threadgroup.
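
For reference, the colouring half of that experiment is just a generic integer hash mapped to RGB, something like the C sketch below (the mixing constants are arbitrary avalanching constants, and how you actually obtain a wave/threadgroup id inside a fragment shader is API-specific):

    #include <stdint.h>

    /* Illustrative only: turn an arbitrary wave/threadgroup id into a
     * visually distinct RGB colour so adjacent groups are easy to tell apart.
     * The constants are just a generic integer mixer, nothing GPU-specific. */
    static void id_to_rgb(uint32_t id, float rgb[3])
    {
        uint32_t h = id;
        h ^= h >> 16; h *= 0x7feb352dU;   /* avalanching mix */
        h ^= h >> 15; h *= 0x846ca68bU;
        h ^= h >> 16;
        rgb[0] = (float)((h >>  0) & 0xff) / 255.0f;
        rgb[1] = (float)((h >>  8) & 0xff) / 255.0f;
        rgb[2] = (float)((h >> 16) & 0xff) / 255.0f;
    }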



