
I haven't benchmarked, so these opinions might be worthless, but here's how they're laid out internally.

* nalgebra uses fixed-size arrays (so a Vec4 is like [[f32; 4]; 1])

* this library seems to use fields (so a Vec4 is a struct with x,y,z,w fields)

* glam uses SIMD types for some types (so a Vec4 is a __m128)

I think glam might win for some operations, but if you want performance, people usually apply SIMD in the other direction when possible, like:

    struct Vec4 { x: __m128, y: __m128, z: __m128, w: __m128 }
According to mathbench-rs[0] (which I looked at after typing this comment...) it looks like nalgebra and ultraviolet have such types. The benchmarks have "N/A" for many of the "wide" nalgebra entries though, which might indicate that nalgebra hasn't implemented many of those functions for "wide" types.
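To make that layout concrete, here's a minimal sketch of operating on four Vec4s at once (x86-64 SSE intrinsics from std::arch; the type and method names are just illustrative):

    #[cfg(target_arch = "x86_64")]
    use std::arch::x86_64::{__m128, _mm_add_ps, _mm_mul_ps};

    // Four Vec4s stored component-wise: x holds x0..x3, y holds y0..y3, etc.
    #[cfg(target_arch = "x86_64")]
    struct Vec4x4 { x: __m128, y: __m128, z: __m128, w: __m128 }

    #[cfg(target_arch = "x86_64")]
    impl Vec4x4 {
        // Computes four dot products at once, one per SIMD lane.
        unsafe fn dot(&self, o: &Vec4x4) -> __m128 {
            unsafe {
                let xy = _mm_add_ps(_mm_mul_ps(self.x, o.x), _mm_mul_ps(self.y, o.y));
                let zw = _mm_add_ps(_mm_mul_ps(self.z, o.z), _mm_mul_ps(self.w, o.w));
                _mm_add_ps(xy, zw)
            }
        }
    }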

[0]: https://github.com/bitshifter/mathbench-rs



The glam guy wanted to go all aligned so that SIMD would work, but that would break so much code that he was talked out of it.

Hint for language designers: when you design a new language, put this stuff, and multidimensional arrays, in the standard library. Multiple incompatible versions of such types are as bad for number-crunching as multiple incompatible string types would be for string manipulation. You want your standard numeric libraries to work on the standard types.

This is part of why Matlab is so successful. You don't have to worry about this stuff.


It's honestly surprising how many programming languages ignore the needs of floating-point users. Rust has integer types that can't be 0, but no std type for floats that can't be NaN? In some sense, IEEE 754 floats are better suited to this than ints: NaNs are essentially hardware-supported error-tagged enum values.

I think it comes from a CS education that treats the "naturals" as fundamental, versus an engineering background where the "reals" are fundamental and matrix math is _essential_; people live on one side of this fence or the other.


That was true in the past, for a few reasons.

- Floating point operations used to be slow. On early PCs, you didn't even have a floating point unit. AutoCAD on DOS required an FPU, and this was controversial at the time.

- Using the FPU inside system code was a no-no for a long time. Floating point usage inside the Linux kernel is still strongly discouraged.[1] System programmers tended not to think in terms of floating point.

- Attempts to put multidimensional arrays in modern languages tend to result in bikeshedding. If a language has array slices, some people want multidimensional slices. That requires "stride" fields on slices, which slows down slice indexing. Now there are two factions arguing. Rust and Go both churned on this in the early days, and neither came out with a good language-level solution. It's embarrassing that FORTRAN has better multidimensional arrays.
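For a sense of the cost, here's a rough sketch (in Rust, with illustrative names, not any real proposal) of what a strided 2-D view has to do on every access:

    // A 2-D view over a flat buffer. Because the row stride is a runtime
    // value, every index computation needs an extra multiply, and the
    // compiler can no longer assume elements are contiguous.
    struct Slice2D<'a> {
        data: &'a [f64],
        rows: usize,
        cols: usize,
        row_stride: usize, // in elements; differs from cols for sub-views
    }

    impl<'a> Slice2D<'a> {
        fn get(&self, r: usize, c: usize) -> f64 {
            assert!(r < self.rows && c < self.cols);
            self.data[r * self.row_stride + c] // extra multiply on every access
        }
    }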

Now that the AI world, the GPU world, and the graphics world all run on floating point arrays, it's time to get past that.

[1] https://www.kernel.org/doc/html/next/core-api/floating-point...


> This enables some memory layout optimization. For example, Option<NonZero<u32>> is the same size as u32

NaN doesn’t get this optimization because the niche optimization isn’t generic across all possible representations. Trying to make it generic gets quite complex, and floats have many candidate niches (e.g. you want NaN to be the niche, someone else needs NaN and thinks infinity works better, etc.). In other words:

Nonzero is primarily for size optimization of Option<number>. If you want sentinels, then write your own wrapper, it’s not hard.
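A quick check of the size claim, plus a minimal sketch of such a sentinel wrapper (NotNan is a hypothetical name, not a std type; it won't get the Option size optimization, but it enforces the invariant):

    use std::mem::size_of;
    use std::num::NonZeroU32;

    // Hypothetical sentinel wrapper: a float guaranteed not to be NaN.
    #[derive(Clone, Copy, Debug, PartialEq, PartialOrd)]
    struct NotNan(f32);

    impl NotNan {
        fn new(v: f32) -> Option<NotNan> {
            if v.is_nan() { None } else { Some(NotNan(v)) }
        }
        fn get(self) -> f32 { self.0 }
    }

    fn main() {
        // The integer niche is free: 0 encodes None.
        assert_eq!(size_of::<Option<NonZeroU32>>(), size_of::<u32>());
        // Floats get no niche, so Option<f32> (and Option<NotNan>) pay for a tag.
        assert!(size_of::<Option<f32>>() > size_of::<f32>());
        assert!(size_of::<Option<NotNan>>() > size_of::<NotNan>());
    }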


The code example is absolutely the way to do SIMD. A SIMD type is not a geometric vector; it's a magic float that happens to do 4 float operations at a time.

If your vector is generic (using C++ syntax here): vec<3, float>, then you can just put in vec<3, float4> and solve 4 vector math problems at a time.

It helps tremendously if your interfaces already take N inputs at a time, so then instead of iterating one at a time you do 4 at a time.
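In Rust terms, a sketch of the same idea (my own toy 4-wide float standing in for a real SIMD type like __m128 or core::simd::f32x4):

    use std::ops::{Add, Mul};

    // Toy 4-wide float; a real implementation would wrap a SIMD register.
    #[derive(Clone, Copy)]
    struct F32x4([f32; 4]);

    impl Add for F32x4 {
        type Output = F32x4;
        fn add(self, o: F32x4) -> F32x4 {
            F32x4(std::array::from_fn(|i| self.0[i] + o.0[i]))
        }
    }

    impl Mul for F32x4 {
        type Output = F32x4;
        fn mul(self, o: F32x4) -> F32x4 {
            F32x4(std::array::from_fn(|i| self.0[i] * o.0[i]))
        }
    }

    // A geometric vector generic over its scalar type.
    #[derive(Clone, Copy)]
    struct Vec3<T> { x: T, y: T, z: T }

    impl<T: Copy + Add<Output = T> + Mul<Output = T>> Vec3<T> {
        fn dot(self, o: Vec3<T>) -> T {
            self.x * o.x + self.y * o.y + self.z * o.z
        }
    }

    // Vec3<f32> solves one dot product; Vec3<F32x4> solves four at a time.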


Right. Glam (maybe because it's stuck with its data layout, maybe to present a cleaner interface) instead uses a SIMD type for a single Vec4, which tends to be a much less efficient way of using SIMD types.

> If your vector is generic (using C++ syntax here): vec<3, float>, then you can just put in vec<3, float4> and solve 4 vector math problems at a time.

Yeah, that's the idea, but for anyone reading, the main complication is when you need to branch. There are usually multiple ways to handle branching (e.g., sometimes it's worth adding a "fast path" for when all the branches are true, and sometimes it isn't; sometimes you should turn a branch into branchless code and sometimes you shouldn't) and AVX-512 adds even more ways to do it.
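A tiny sketch of two of those options, using plain arrays in place of real SIMD registers:

    // Per-lane condition: if x > 0.0 double it, otherwise negate it.
    fn step(x: [f32; 4]) -> [f32; 4] {
        let mask = x.map(|v| v > 0.0);
        if mask.iter().all(|&m| m) {
            // "Fast path": every lane agrees, skip the blend entirely.
            x.map(|v| v * 2.0)
        } else {
            // Branchless path: compute both sides, then select per lane.
            let doubled = x.map(|v| v * 2.0);
            let negated = x.map(|v| -v);
            std::array::from_fn(|i| if mask[i] { doubled[i] } else { negated[i] })
        }
    }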


I did some digging and added rudimentary 256-bit (AVX) structure-of-arrays Vec3 (f32) support. Seems to work. It has constructor/unpack methods to convert between these types and [Vec3; 8].
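Presumably something shaped like this (names assumed; the actual code isn't shown here):

    #[derive(Clone, Copy)]
    struct Vec3([f32; 3]);

    // Eight Vec3s in SOA form; each [f32; 8] maps onto one 256-bit AVX register.
    struct Vec3x8 { x: [f32; 8], y: [f32; 8], z: [f32; 8] }

    impl Vec3x8 {
        fn pack(v: [Vec3; 8]) -> Vec3x8 {
            Vec3x8 {
                x: std::array::from_fn(|i| v[i].0[0]),
                y: std::array::from_fn(|i| v[i].0[1]),
                z: std::array::from_fn(|i| v[i].0[2]),
            }
        }
        fn unpack(&self) -> [Vec3; 8] {
            std::array::from_fn(|i| Vec3([self.x[i], self.y[i], self.z[i]]))
        }
    }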


I'm not a fan of such vector libraries; AFAIK they all just inhibit auto-vectorization. At most you can take advantage of 128-bit SIMD, with a bunch of shuffles and extracts whenever you also need to work with scalar variables.

I did a small experiment comparing 6 possible implementations of the n-body [0] update loop: https://godbolt.org/z/sfehEfPGT

The implementations are:

* AOS: a simple scalar implementation with coordinates stored in an array of structs

* SOA: a simple scalar implementation with coordinates stored as a struct of arrays

* float3: uses a struct of three floats as a vector type

* float4: uses a struct of four floats as a vector type, ignores the last element

* vec4: like float4, but using a generic SIMD abstraction (so basically what glam does)

* floats3: attempts to do SOA with nice syntax. The floats3 type has three arrays of floats, and there are operations to extract and store a float3 at a given index.

Since these abstractions are often used in games, I'll start off by looking at what the compiler produces when targeting Zen5 with -O3 -ffast-math:

* Zen5 O3 ffast-math:

    AOS:     gcc: 11119 ~SSE    clang:  3688 AVX512, but quite messy
    SOA:     gcc:  1283 AVX512  clang:  1202 AVX512
    float3:  gcc: 11050 ~SSE    clang: 10894 ~SSE
    float4:  gcc:  8646 ~SSE    clang: 10815 ~SSE
    vec4:    gcc:  7913 ~SSE    clang:  8196 ~SSE
    floats3: gcc:  1284 AVX512  clang: 13351 ~SSE
The numbers next to the compilers are the cycle estimates from the llvm-mca model of Zen5 for processing 1024 elements. AVX512 indicates that the compiler was able to vectorize the loop with AVX-512, and ~SSE means it achieved at best partial vectorization with SSE.

Now let's also look at a different ISA, this time the RISC-V Vector extension:

* P670 2xVLEN O3 ffast-math:

    AOS:     gcc: 17445         clang:  3357 RVV
    SOA:     gcc:  3355 RVV     clang:  3334 RVV
    float3:  gcc: 17445         clang: 17449
    float4:  gcc: 25668 RVV128  clang: 17470 RVV128
    vec4:    gcc: 45091 RVV128  clang: 23111 RVV128
    floats3: gcc:  3333 RVV     clang: 17446
This time the llvm-mca model for the SiFive-P670 was used, but I pretended it has 256-bit vectors instead of 128-bit ones, since the vector length is transparent to the codegen and this amplifies the effect I'd like to show. RVV means the loop could be fully vectorized, while RVV128 is similar to ~SSE and means it could only take advantage of the lower 128 bits of the vector registers.

So if you are using such vector types to do computations in loops, you are likely preventing your compiler from optimizing them for modern hardware. In general, writing simple SOA scalar code seems to vectorize best, as long as you make sure the compiler isn't confused by aliasing. Even the plain old AOS scalar code can be vectorized by modern clang (though not by gcc), yet sadly clang fails on the float3/float4 implementations, which should be very similar. Modern ISAs like NEON/SVE/RVV have more complex vector loads/stores that let you retrieve data efficiently even from a traditionally bad data layout like AOS. You can dress up the SOA code to make it a bit nicer; unfortunately my attempt with floats3 currently only works properly with gcc.
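For reference, the floats3 idea translated into rough Rust (the actual experiment is C++; names here are mine):

    #[derive(Clone, Copy)]
    struct Float3 { x: f32, y: f32, z: f32 }

    // SOA storage with Vec3-flavored accessors: the hope is that loops
    // written against get/set still look like three contiguous float
    // streams to the vectorizer.
    struct Floats3 { x: Vec<f32>, y: Vec<f32>, z: Vec<f32> }

    impl Floats3 {
        fn get(&self, i: usize) -> Float3 {
            Float3 { x: self.x[i], y: self.y[i], z: self.z[i] }
        }
        fn set(&mut self, i: usize, v: Float3) {
            self.x[i] = v.x;
            self.y[i] = v.y;
            self.z[i] = v.z;
        }
    }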

Below are the results when compiling without -ffast-math:

* Zen5 O3:

    AOS:     gcc: 11819 ~SSE    clang: 10788 ~SSE
    SOA:     gcc:  4146 AVX512  clang: 13734 AVX512
    float3:  gcc: 11826 ~SSE    clang: 11499 ~SSE
    float4:  gcc:  8662 ~SSE    clang: 11810 ~SSE
    vec4:    gcc:  8575 ~SSE    clang:  7451 ~SSE
    floats3: gcc:  4148 AVX512  clang: 14367 ~SSE
* P670 2xVLEN O3:

    AOS:     gcc: 17464 RVV64   clang:  6122 RVV
    SOA:     gcc:  7140 RVV     clang:  6118 RVV
    float3:  gcc: 17445         clang: 17464 RVV64
    float4:  gcc: 25665 RVV128  clang: 19184 RVV128
    vec4:    gcc: 17463 RVV128  clang: 56868 RVV128
    floats3: gcc:  7140 RVV     clang: 17444
Weirdly, clang seems to struggle with the SOA here, and overall vec4 looks like the best performance tradeoff on x86. Still, with proper SOA (and I bet you could coax clang into generating it as well) you can get a 2x performance improvement. Additionally, vec4 performs horribly with current compilers for VLA SIMD ISAs.

I'll try to experiment with some real-world code, if I can find some that is bottlenecked by such types.

[0] https://benchmarksgame-team.pages.debian.net/benchmarksgame/...


> I'm not a fan of such vector libraries, AFAIK they all just inhibit auto vectorization.

Yes, I don't think anyone using them is depending on autovectorization.



