
GFNI only does 8 bits and sticks you with a single modulus. But for some reason I'd forgotten it completely; you're totally right to give me a mystified response.

(FWIW, it's possible to project elements from one field to another isomorphic field, though it takes enough operations that for fast code like RS decoding the conversion is probably performance limiting).
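
Rough sketch of what that projection can look like (my own illustration, not from any real codebase; the 0x11D -> 0x11B polynomial pair is just an example): find a root of one field's defining polynomial inside the other field. The resulting map is GF(2)-linear, i.e. an 8x8 bit matrix per byte, which happens to be exactly the shape GF2P8AFFINEQB applies.

  #include <stdint.h>
  #include <stdio.h>

  /* Multiply in GF(2^8) with an arbitrary reduction polynomial
     (given as the full 9-bit value, e.g. 0x11D or 0x11B). */
  static uint8_t gf_mul(uint8_t a, uint8_t b, uint16_t poly)
  {
      uint16_t r = 0;
      for (int i = 0; i < 8; i++)
          if (b & (1u << i)) r ^= (uint16_t)(a << i);
      for (int i = 14; i >= 8; i--)
          if (r & (1u << i)) r ^= (uint16_t)(poly << (i - 8));
      return (uint8_t)r;
  }

  /* Evaluate the source field's polynomial x^8+x^4+x^3+x^2+1 (0x11D)
     at c, with all arithmetic done in the *target* field. */
  static uint8_t eval_src_poly(uint8_t c, uint16_t target_poly)
  {
      static const uint8_t coeff[9] = {1,0,1,1,1,0,0,0,1};
      uint8_t acc = 0, cpow = 1;
      for (int i = 0; i <= 8; i++) {
          if (coeff[i]) acc ^= cpow;
          cpow = gf_mul(cpow, c, target_poly);
      }
      return acc;
  }

  int main(void)
  {
      const uint16_t SRC = 0x11D, DST = 0x11B;  /* RS-style field -> GFNI field */
      uint8_t map[256], c = 0;

      /* Find a root of the source polynomial in the target field.  Any
         degree-8 irreducible over GF(2) splits completely in GF(2^8),
         so a root always exists. */
      for (int x = 2; x < 256; x++)
          if (eval_src_poly((uint8_t)x, DST) == 0) { c = (uint8_t)x; break; }

      /* phi(sum a_i x^i) = sum a_i c^i is then a field isomorphism, and
         it is GF(2)-linear, so it collapses to one 8x8 bit matrix. */
      for (int v = 0; v < 256; v++) {
          uint8_t out = 0, cpow = 1;
          for (int bit = 0; bit < 8; bit++) {
              if (v & (1 << bit)) out ^= cpow;
              cpow = gf_mul(cpow, c, DST);
          }
          map[v] = out;
      }

      /* Sanity check: phi(a*b) == phi(a)*phi(b). */
      uint8_t a = 0x53, b = 0xCA;
      printf("%02x %02x\n", map[gf_mul(a, b, SRC)], gf_mul(map[a], map[b], DST));
      return 0;
  }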

For hybrid codes GFNI should be sufficient, though for things like using RS at 16/32 bit sizes it's not.




The matrix multiplication needed to correct errors grows as n^2 even though the code itself only grows as n. There are asymptotically faster matrix multiplication algorithms in theory, but in practice every algorithm in use is O(n^2).
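
To make that scaling concrete, here's roughly what the erasure-decode inner loop looks like (my own sketch, not any particular library; inv_rows is assumed to be the precomputed decode-matrix rows). Every recovered byte is a GF(2^8) dot product across all surviving shards:

  #include <stdint.h>
  #include <stddef.h>

  /* GF(2^8) multiply, 0x11D reduction polynomial. */
  static uint8_t gf_mul(uint8_t a, uint8_t b)
  {
      uint8_t r = 0;
      while (b) {
          if (b & 1) r ^= a;
          a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1D : 0));
          b >>= 1;
      }
      return r;
  }

  /* Recover n_out missing shards from n_in surviving ones.  inv_rows is
     the precomputed n_out x n_in decode matrix.  Work is
     n_out * n_in * shard_len: quadratic in the shard count even though
     the data only grows linearly. */
  void rs_recover(const uint8_t *inv_rows, const uint8_t *const *in,
                  uint8_t **out, size_t n_out, size_t n_in, size_t shard_len)
  {
      for (size_t i = 0; i < n_out; i++)
          for (size_t k = 0; k < shard_len; k++) {
              uint8_t acc = 0;
              for (size_t j = 0; j < n_in; j++)
                  acc ^= gf_mul(inv_rows[i * n_in + j], in[j][k]);
              out[i][k] = acc;
          }
  }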

As such, large Reed-Solomon codes are impractical. If you need a larger code than what GF(2^8) can offer, you grow with two-dimensional codes, slicing, or other techniques.

In practice, this sacrifices the Minimum Distance property, meaning you should use a Turbo Code (or another XOR-based code), which is O(n) but imperfect.

---------

CRC32 can be implemented with GFNI. And AES is also GF(2^8) arithmetic.
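
For reference, the plain bit-at-a-time CRC-32 (the standard reflected 0xEDB88320 form, nothing GFNI-specific) makes the GF(2) structure obvious; every step is a shift plus a conditional XOR by the polynomial, which is exactly the kind of work carryless multiply and GF(2) affine instructions speed up:

  #include <stdint.h>
  #include <stddef.h>

  /* Bit-at-a-time CRC-32 (reflected form, polynomial 0xEDB88320).
     Each step is long division by a fixed polynomial over GF(2). */
  uint32_t crc32_bitwise(const uint8_t *p, size_t n)
  {
      uint32_t crc = 0xFFFFFFFFu;
      for (size_t i = 0; i < n; i++) {
          crc ^= p[i];
          for (int b = 0; b < 8; b++)
              crc = (crc >> 1) ^ ((crc & 1) ? 0xEDB88320u : 0);
      }
      return ~crc;
  }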

----------

I don't think there are many algorithms where GF(2^16) or bigger fields are needed.

And even where they are, it's possible to build 16x16 or 32x32 multiplies out of 8x8 ones anyway.
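
Sketch of that construction (mine; the constant S below is an assumption you'd have to verify, not a known-good value): treat GF(2^16) as GF(2^8)[y]/(y^2 + y + S), so one 16-bit multiply costs three 8-bit field multiplies plus XORs, and the same trick applied again gets you to 32 bits.

  #include <stdint.h>

  /* GF(2^8) multiply, 0x11D reduction polynomial. */
  static uint8_t gf256_mul(uint8_t a, uint8_t b)
  {
      uint8_t r = 0;
      while (b) {
          if (b & 1) r ^= a;
          a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1D : 0));
          b >>= 1;
      }
      return r;
  }

  /* GF(2^16) viewed as GF(2^8)[y]/(y^2 + y + S).  S is an ASSUMED value:
     pick one that makes y^2 + y + S irreducible over your GF(2^8). */
  static const uint8_t S = 0x20;

  uint16_t gf65536_mul(uint16_t a, uint16_t b)
  {
      uint8_t a0 = (uint8_t)a, a1 = (uint8_t)(a >> 8);
      uint8_t b0 = (uint8_t)b, b1 = (uint8_t)(b >> 8);

      uint8_t lo  = gf256_mul(a0, b0);                       /* a0*b0 */
      uint8_t hi2 = gf256_mul(a1, b1);                       /* a1*b1, the y^2 term */
      uint8_t mid = gf256_mul(a0 ^ a1, b0 ^ b1) ^ lo ^ hi2;  /* a1*b0 + a0*b1 (Karatsuba) */

      /* Reduce with y^2 = y + S:  (a1*b1 + mid)*y + (a0*b0 + S*a1*b1). */
      return (uint16_t)(((mid ^ hi2) << 8) | (lo ^ gf256_mul(S, hi2)));
  }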


RS codes can be done in N log N time for both encoding and erasure decoding, and sometimes achieving MDS is useful for attack resistance: you can always decode with N blocks, whereas with non-minimum-distance codes there can be adversarial subsets that fail to decode even when you have significantly more blocks than needed. Larger word sizes for RS codes are also useful for list decoding even when the number of data blocks isn't that great.

And sure, list decoding is slow. But I think there are two distinct groups of applications for good instruction sets. One is where you're trying to make a fast thing, like an 8-bit RS code, "free" by having it run at near memory, wire, or disk speeds (or make it consume little power). The other is where you're doing something legitimately slow, including things that are O(N^2) or worse; in those cases a small constant factor sometimes makes the difference between usable and very much not usable.


In those cases, I know that ARM has PMULL / PMULL2 and x86 has PCLMULQDQ, which go up to a 64x64 -> 128-bit carryless multiplication.

There is even an AVX-512 version of PCLMULQDQ (VPCLMULQDQ).
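
Minimal usage sketch of the x86 side (assumes gcc/clang and a PCLMUL-capable CPU; build with -mpclmul):

  #include <stdint.h>
  #include <stdio.h>
  #include <wmmintrin.h>   /* PCLMULQDQ intrinsics */

  /* 64x64 -> 128-bit carryless multiply.  The immediate selects which
     64-bit halves of the operands get multiplied; 0x00 means low x low. */
  static void clmul64(uint64_t a, uint64_t b, uint64_t out[2])
  {
      __m128i x = _mm_set_epi64x(0, (long long)a);
      __m128i y = _mm_set_epi64x(0, (long long)b);
      __m128i r = _mm_clmulepi64_si128(x, y, 0x00);
      out[0] = (uint64_t)_mm_cvtsi128_si64(r);                         /* low 64  */
      out[1] = (uint64_t)_mm_cvtsi128_si64(_mm_unpackhi_epi64(r, r));  /* high 64 */
  }

  int main(void)
  {
      uint64_t r[2];
      clmul64(0x87, 0x3, r);   /* (x^7+x^2+x+1) * (x+1) over GF(2) = 0x189 */
      printf("%016llx %016llx\n",
             (unsigned long long)r[1], (unsigned long long)r[0]);
      return 0;
  }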


In my experience, the hybrid approach is better in practice due to cache locality anyway, so I don't think even native GF16 or GF32 would be of much help.

FWIW, I've ported rs-leopard to Go and found it very effective, except in the "recover one" scenario, where it runs at only 50% of the speed of the "recover all" scenario, since it has to do several reconstructions to get one output. But even so, I am not sure it would be much better with plain GF16, since you would still need to touch most shards to get one shard out.



