> Sharing of data structures between CPU and GPU is nice too
How did they do it? It's hard to do, because GPU hardware can convert data types on the fly, e.g. you can store bytes in VRAM and have them converted to 32-bit floats in [ 0 .. +1 ] in the shader. GPUs can do that for both inputs (loaded texture texels, loaded buffer elements, vertex attributes) and outputs (rendered pixels, stored UAV elements).
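For reference, that [ 0 .. +1 ] mapping is the usual UNORM conversion; here's a minimal sketch of both directions on the CPU side (plain Rust, hypothetical helper names, not any particular API):

```rust
// Roughly what the fixed-function UNORM conversion does, written out by hand.
fn unorm8_to_f32(stored: u8) -> f32 {
    // load path: byte in VRAM -> float in [0.0, 1.0] in the shader
    stored as f32 / 255.0
}

fn f32_to_unorm8(value: f32) -> u8 {
    // store path: shader output in [0.0, 1.0] -> byte in the render target
    (value.clamp(0.0, 1.0) * 255.0 + 0.5) as u8
}
```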
If you are using plain buffers, the GPU and the CPU access data in pretty much the same way. With scalar block layout, the alignments are pretty much the same too.
To get the format conversion stuff you talk about, you need to use images, vertex input or texel buffers and configure the format conversion explicitly.
It's a good question how much of these conversions is actually done by GPU hardware and how much is just software (which you could write yourself in a shader and get the same perf). I have not seen an apples-to-apples benchmark of these format conversions.
> If you are using plain buffers the GPU and the CPU access data pretty much exactly the same way
Yeah, that will work fine for byte address buffers, and to a lesser extent constant buffers (they don’t convert data types, but the access semantics and alignment are a bit tricky), but not much else. Vertex buffers, textures, and texel buffers / typed buffers in D3D are all widely used in real-time graphics.
> which you could write yourself in a shader and get same perf
Pretty sure it’s hardware. Emulating an anisotropic texture sampler in HLSL code would take hundreds of instructions, which is prohibitively expensive. Even simpler trilinear sampling is surprisingly tricky to emulate because of the screen-space partial derivatives it needs as input.
> I have not seen an apples to apples benchmark about these format conversions.
> Yeah, that will work fine for byte address buffers, and to a lesser extent constant buffers (they don’t convert data types, but the access semantics and alignment are a bit tricky), but not much else.
This is where sharing the CPU and GPU side struct declaration is helpful. With scalar block layout (VK_EXT_scalar_block_layout in Vulkan, not sure about d3d land) you don't even need to worry about alignment rules, because they're the same for GPU and CPU (just make sure your binding base address/offset is aligned).
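To illustrate (a made-up example, not rust-gpu's or any crate's API): with scalar layout, one #[repr(C)] declaration can describe the bytes on both sides, because the member alignment rules coincide:

```rust
// Hypothetical shared declaration. Under VK_EXT_scalar_block_layout the GPU
// side uses natural scalar alignment, which matches #[repr(C)] on the CPU,
// so no vec4-style padding is needed.
#[repr(C)]
#[derive(Clone, Copy)]
struct PointLight {
    position: [f32; 3], // 12 bytes, 4-byte aligned on both sides
    radius: f32,
    color: [f32; 3],
    intensity: f32,
}
```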
> Vertex buffers, textures, and texel buffers / typed buffers in D3D are all widely used in real-time graphics.
Of course. You don't get to share "structs" between CPU and GPU transparently here, because you need to program the GPU hardware (vertex input, texture samplers) to match.
There is some reflection-based trickery that can help here, but rust-gpu afaik doesn't do that. I've seen some projects use proc macros to generate the vertex input layout config for GL/Vulkan from Rust structs with some custom #[attribute] annotations (sketched below).
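A sketch of the idea, with made-up annotation and format names rather than a specific crate: the macro reads per-field attributes and emits the attribute/offset table that would otherwise be written by hand.

```rust
// What such a proc macro could generate from per-field annotations; written
// out by hand here, with made-up format names and types.
#[repr(C)]
struct Vertex {
    position: [f32; 3],   // conceptually: #[attribute(location = 0, format = "rgb32_sfloat")]
    normal_oct: [u16; 2], // conceptually: #[attribute(location = 1, format = "rg16_unorm")]
}

struct VertexAttribute {
    location: u32,
    offset: u32,
    format: &'static str,
}

// The generated table: one entry per annotated field, offsets taken from #[repr(C)].
const VERTEX_ATTRIBUTES: [VertexAttribute; 2] = [
    VertexAttribute { location: 0, offset: 0, format: "rgb32_sfloat" },
    VertexAttribute { location: 1, offset: 12, format: "rg16_unorm" },
];
```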
> Pretty sure it’s hardware.
Now this is just guessing.
> Emulating anisotropic texture sampler with HLSL codes would need hundreds of instructions...
Texture sampling / interpolation is certainly hardware.
But the conversion from rgba8_unorm to rgba32f, for example? Or r10g10b10a2?
I've not seen any conclusive benchmark results that suggest whether it's faster to just grab these from a storage buffer in a shader and do the few arithmetic instructions, or whether it's faster to use a texel buffer. Images are a different beast entirely due to tiling formats (you can't really memory-map them, so the point about sharing struct declarations doesn't apply).
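For scale, the shader-side alternative is a shift, mask and divide per channel; a sketch of the r10g10b10a2 case (rgba8_unorm is analogous with 8-bit fields), written here as plain Rust rather than shader code:

```rust
/// Manual unpack of an r10g10b10a2_unorm value fetched as a raw u32 from a
/// storage buffer -- the work the typed-buffer path would otherwise do for you.
fn unpack_r10g10b10a2_unorm(packed: u32) -> [f32; 4] {
    [
        (packed & 0x3ff) as f32 / 1023.0,
        ((packed >> 10) & 0x3ff) as f32 / 1023.0,
        ((packed >> 20) & 0x3ff) as f32 / 1023.0,
        ((packed >> 30) & 0x3) as f32 / 3.0,
    ]
}
```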
> Here’s a benchmark for vertex buffers
I am familiar with this benchmark from 8 years ago, which is highly specific to vertex buffers (and the post-transform cache, etc.).
It's a nicely done benchmark, but it has two small flaws: the hw tested is quite old by now, and it doesn't take into account the benefit of improved batching / reduced draw calls that is only possible with custom vertex fetch (so you don't need BindVertex/IndexBuffer calls). It would be great if this benchmark could be re-run on some newer hw.
But this benchmark doesn't answer the question whether the typed buffer format conversions are faster than doing it in a shader (outside of vertex input).
> however on nVidia Maxwell vertex buffers were 2-4 times faster.
The relevant hardware got revamped in Turing series to facilitate mesh shaders, so can't extrapolate the results to present day hardware.
Fwiw. I've been using custom vertex fetch with buffer device address in my projects for a few years now and I haven't noticed adverse performance implications on any hw I've used (Intel, NV and AMD). But I haven't done rigorous benchmarking that would compare to using vertex input stage.
I'm not using rust-gpu for shaders at the moment, but if I was, it would be helpful to just use the same struct declarations. All my vertex data, instance data, constant buffers and compute buffers are a 1:1 translation from Rust to GLSL struct declarations, which is just redundant work.
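For example (made-up types, not my actual code), every declaration like this Rust struct currently needs a hand-written GLSL mirror that has to be kept in sync manually:

```rust
// Written once on the CPU side...
#[repr(C)]
#[derive(Clone, Copy)]
struct InstanceData {
    model: [[f32; 4]; 4],
    color: [f32; 4],
    flags: u32,
}

// ...and then translated by hand into the GLSL equivalent:
//
//   struct InstanceData {
//       mat4 model;
//       vec4 color;
//       uint flags;
//   };
```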
> This is where sharing the CPU and GPU side struct declaration is helpful
Indeed, but sharing code between CPU and shaders is not the only way to solve the problem. I wrote a simple design-time tool which loads compiled shaders through the shader reflection API and generates a source file with the C++ (or, for other projects, C#) structures for these constant buffers. At least with D3D11, compiled shaders carry enough type and memory layout info to generate these structures, matching the memory layout by generating padding fields when necessary.
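The core of that generation step is just gap filling; a rough sketch of the idea in Rust, with a made-up input format rather than the actual tool or the D3D11 reflection types:

```rust
// Reflection reports each member's type string, byte offset and size; the
// generator fills any gaps with explicit pad fields so the emitted struct's
// memory layout matches the constant buffer.
struct Member {
    ty: &'static str,   // e.g. "float4x4", as reported by reflection
    name: &'static str,
    offset: usize,      // byte offset within the constant buffer
    size: usize,        // byte size
}

fn generate_struct(name: &str, members: &[Member], total_size: usize) -> String {
    let mut out = format!("struct {name} {{\n");
    let (mut cursor, mut pad) = (0usize, 0usize);
    for m in members {
        if m.offset > cursor {
            out += &format!("    uint8_t _pad{pad}[{}];\n", m.offset - cursor);
            pad += 1;
        }
        out += &format!("    {} {};\n", m.ty, m.name);
        cursor = m.offset + m.size;
    }
    if total_size > cursor {
        out += &format!("    uint8_t _pad{pad}[{}];\n", total_size - cursor);
    }
    out + "};\n"
}
```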
> not sure how about d3d land
Pretty sure D3D11 doesn’t have an equivalent of that Vulkan extension. Not sure about D3D12 though; I’ve only used it briefly.
> I've been using custom vertex fetch with buffer device address in my projects
In my projects, I sometimes use a lot of non-trivial input layout features. Sometimes I need multiple vertex buffers, e.g. to generate normals on GPU with a compute shader. Sometimes I need instancing. Often I need FP16 or SNORM/UNORM vertex attributes, like RG16_UNORM for octahedron-encoded normals.
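As an aside, the RG16_UNORM octahedral case on the CPU side looks roughly like this (the standard octahedral mapping, sketched in Rust, not anyone's exact code):

```rust
/// Encode a unit normal into two UNORM16 channels via octahedral mapping.
fn encode_normal_rg16_unorm(n: [f32; 3]) -> [u16; 2] {
    // Project onto the octahedron (L1 normalization).
    let inv_l1 = 1.0 / (n[0].abs() + n[1].abs() + n[2].abs());
    let (mut x, mut y) = (n[0] * inv_l1, n[1] * inv_l1);
    if n[2] < 0.0 {
        // Fold the lower hemisphere over the diagonals.
        let (ox, oy) = (x, y);
        x = (1.0 - oy.abs()) * ox.signum();
        y = (1.0 - ox.abs()) * oy.signum();
    }
    // Map [-1, 1] to [0, 1] and quantize to 16-bit UNORM
    // (the part the RG16_UNORM vertex attribute undoes on load).
    let to_unorm16 = |v: f32| ((v * 0.5 + 0.5).clamp(0.0, 1.0) * 65535.0 + 0.5) as u16;
    [to_unorm16(x), to_unorm16(y)]
}
```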