
GFNI only does 8 bits and sticks you with a single modulus. But for some reason I'd forgotten it completely; you're totally right to give me a mystified response.

(FWIW, it's possible to project elements from one field to another isomorphic field, though it takes enough operations that for fast code like RS decoding the conversion is probably performance limiting).
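
Rough sketch of what that projection can look like (my own illustration, not from any real codebase; the 0x11D -> 0x11B polynomial pair is just an example): find a root of one field's defining polynomial inside the other field. The resulting map is GF(2)-linear, i.e. an 8x8 bit matrix per byte, which happens to be exactly the shape GF2P8AFFINEQB applies.

  #include <stdint.h>
  #include <stdio.h>

  /* Multiply in GF(2^8) with an arbitrary reduction polynomial
     (given as the full 9-bit value, e.g. 0x11D or 0x11B). */
  static uint8_t gf_mul(uint8_t a, uint8_t b, uint16_t poly)
  {
      uint16_t r = 0;
      for (int i = 0; i < 8; i++)
          if (b & (1u << i)) r ^= (uint16_t)(a << i);
      for (int i = 14; i >= 8; i--)
          if (r & (1u << i)) r ^= (uint16_t)(poly << (i - 8));
      return (uint8_t)r;
  }

  /* Evaluate the source field's polynomial x^8+x^4+x^3+x^2+1 (0x11D)
     at c, with all arithmetic done in the *target* field. */
  static uint8_t eval_src_poly(uint8_t c, uint16_t target_poly)
  {
      static const uint8_t coeff[9] = {1,0,1,1,1,0,0,0,1};
      uint8_t acc = 0, cpow = 1;
      for (int i = 0; i <= 8; i++) {
          if (coeff[i]) acc ^= cpow;
          cpow = gf_mul(cpow, c, target_poly);
      }
      return acc;
  }

  int main(void)
  {
      const uint16_t SRC = 0x11D, DST = 0x11B;  /* RS-style field -> GFNI field */
      uint8_t map[256], c = 0;

      /* Find a root of the source polynomial in the target field.  Any
         degree-8 irreducible over GF(2) splits completely in GF(2^8),
         so a root always exists. */
      for (int x = 2; x < 256; x++)
          if (eval_src_poly((uint8_t)x, DST) == 0) { c = (uint8_t)x; break; }

      /* phi(sum a_i x^i) = sum a_i c^i is then a field isomorphism, and
         it is GF(2)-linear, so it collapses to one 8x8 bit matrix. */
      for (int v = 0; v < 256; v++) {
          uint8_t out = 0, cpow = 1;
          for (int bit = 0; bit < 8; bit++) {
              if (v & (1 << bit)) out ^= cpow;
              cpow = gf_mul(cpow, c, DST);
          }
          map[v] = out;
      }

      /* Sanity check: phi(a*b) == phi(a)*phi(b). */
      uint8_t a = 0x53, b = 0xCA;
      printf("%02x %02x\n", map[gf_mul(a, b, SRC)], gf_mul(map[a], map[b], DST));
      return 0;
  }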

For hybrid codes GFNI should be sufficient, though for things like using RS at 16/32 bit sizes it's not.




The matrix multiplication needed to correct errors grows as n^2 even though the code itself only grows as n. There are asymptotically faster matrix multiplication algorithms in theory, but in practice every algorithm in use is O(n^2).
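
To make that scaling concrete, here's roughly what the erasure-decode inner loop looks like (my own sketch, not any particular library; inv_rows is assumed to be the precomputed decode-matrix rows). Every recovered byte is a GF(2^8) dot product across all surviving shards:

  #include <stdint.h>
  #include <stddef.h>

  /* GF(2^8) multiply, 0x11D reduction polynomial. */
  static uint8_t gf_mul(uint8_t a, uint8_t b)
  {
      uint8_t r = 0;
      while (b) {
          if (b & 1) r ^= a;
          a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1D : 0));
          b >>= 1;
      }
      return r;
  }

  /* Recover n_out missing shards from n_in surviving ones.  inv_rows is
     the precomputed n_out x n_in decode matrix.  Work is
     n_out * n_in * shard_len: quadratic in the shard count even though
     the data only grows linearly. */
  void rs_recover(const uint8_t *inv_rows, const uint8_t *const *in,
                  uint8_t **out, size_t n_out, size_t n_in, size_t shard_len)
  {
      for (size_t i = 0; i < n_out; i++)
          for (size_t k = 0; k < shard_len; k++) {
              uint8_t acc = 0;
              for (size_t j = 0; j < n_in; j++)
                  acc ^= gf_mul(inv_rows[i * n_in + j], in[j][k]);
              out[i][k] = acc;
          }
  }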

As such, large Reed-Solomon codes are impractical. If you need a larger code than what GF(2^8) can offer, you grow with two-dimensional codes, slicing, or other techniques.

In practice, this sacrifices the Minimum Distance property, meaning you should use a Turbo Code (or another XOR-based code), which is O(n) but imperfect.

---------

CRC32 can be implemented with GFNI. And AES is also GF(2^8) arithmetic.
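
For reference, the plain bit-at-a-time CRC-32 (the standard reflected 0xEDB88320 form, nothing GFNI-specific) makes the GF(2) structure obvious; every step is a shift plus a conditional XOR by the polynomial, which is exactly the kind of work carryless multiply and GF(2) affine instructions speed up:

  #include <stdint.h>
  #include <stddef.h>

  /* Bit-at-a-time CRC-32 (reflected form, polynomial 0xEDB88320).
     Each step is long division by a fixed polynomial over GF(2). */
  uint32_t crc32_bitwise(const uint8_t *p, size_t n)
  {
      uint32_t crc = 0xFFFFFFFFu;
      for (size_t i = 0; i < n; i++) {
          crc ^= p[i];
          for (int b = 0; b < 8; b++)
              crc = (crc >> 1) ^ ((crc & 1) ? 0xEDB88320u : 0);
      }
      return ~crc;
  }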

----------

I don't think there are many algorithms where GF(2^16) or bigger fields are needed.

And even where they are, it's possible to build 16x16 or 32x32 multiplies out of 8x8 ones anyway.
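
Sketch of that construction (mine; the constant S below is an assumption you'd have to verify, not a known-good value): treat GF(2^16) as GF(2^8)[y]/(y^2 + y + S), so one 16-bit multiply costs three 8-bit field multiplies plus XORs, and the same trick applied again gets you to 32 bits.

  #include <stdint.h>

  /* GF(2^8) multiply, 0x11D reduction polynomial. */
  static uint8_t gf256_mul(uint8_t a, uint8_t b)
  {
      uint8_t r = 0;
      while (b) {
          if (b & 1) r ^= a;
          a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1D : 0));
          b >>= 1;
      }
      return r;
  }

  /* GF(2^16) viewed as GF(2^8)[y]/(y^2 + y + S).  S is an ASSUMED value:
     pick one that makes y^2 + y + S irreducible over your GF(2^8). */
  static const uint8_t S = 0x20;

  uint16_t gf65536_mul(uint16_t a, uint16_t b)
  {
      uint8_t a0 = (uint8_t)a, a1 = (uint8_t)(a >> 8);
      uint8_t b0 = (uint8_t)b, b1 = (uint8_t)(b >> 8);

      uint8_t lo  = gf256_mul(a0, b0);                       /* a0*b0 */
      uint8_t hi2 = gf256_mul(a1, b1);                       /* a1*b1, the y^2 term */
      uint8_t mid = gf256_mul(a0 ^ a1, b0 ^ b1) ^ lo ^ hi2;  /* a1*b0 + a0*b1 (Karatsuba) */

      /* Reduce with y^2 = y + S:  (a1*b1 + mid)*y + (a0*b0 + S*a1*b1). */
      return (uint16_t)(((mid ^ hi2) << 8) | (lo ^ gf256_mul(S, hi2)));
  }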


RS codes can be done in N log N time for both encoding and erasure decoding, and sometimes achieving MDS is useful for attack resistance: you can always decode with N blocks, whereas with non-minimum-distance codes there can be adversarial subsets that fail to decode even when you have significantly more blocks than needed. Larger word sizes for RS codes are also useful for list decoding even when the number of data blocks isn't that great.

And sure, list decoding is slow. But I think there are two distinct groups of applications for good instruction sets. One is where you're trying to make a fast thing, like an 8-bit RS code, "free" by having it run at near memory, wire, or disk speeds (or make it consume little power). The other is where you're doing something legitimately slow, including things that are O(N^2) or worse; in those cases a small constant factor sometimes makes the difference between usable and very much not usable.


In those cases, I know that ARM has PMULL / PMULL2 and x86 has PCLMULQDQ, which go up to a 64x64 -> 128-bit carryless multiplication.

There is even an AVX-512 version of PCLMULQDQ (VPCLMULQDQ).
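
Minimal usage sketch of the x86 side (assumes gcc/clang and a PCLMUL-capable CPU; build with -mpclmul):

  #include <stdint.h>
  #include <stdio.h>
  #include <wmmintrin.h>   /* PCLMULQDQ intrinsics */

  /* 64x64 -> 128-bit carryless multiply.  The immediate selects which
     64-bit halves of the operands get multiplied; 0x00 means low x low. */
  static void clmul64(uint64_t a, uint64_t b, uint64_t out[2])
  {
      __m128i x = _mm_set_epi64x(0, (long long)a);
      __m128i y = _mm_set_epi64x(0, (long long)b);
      __m128i r = _mm_clmulepi64_si128(x, y, 0x00);
      out[0] = (uint64_t)_mm_cvtsi128_si64(r);                         /* low 64  */
      out[1] = (uint64_t)_mm_cvtsi128_si64(_mm_unpackhi_epi64(r, r));  /* high 64 */
  }

  int main(void)
  {
      uint64_t r[2];
      clmul64(0x87, 0x3, r);   /* (x^7+x^2+x+1) * (x+1) over GF(2) = 0x189 */
      printf("%016llx %016llx\n",
             (unsigned long long)r[1], (unsigned long long)r[0]);
      return 0;
  }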


In my experience, the hybrid approach is better in practice due to cache locality anyway, so I don't think even native GF16 or GF32 would be of much help.

FWIW, I've ported rs-leopard to Go and found it very effective, except in the "recover one" scenario, where it runs at only 50% of the speed of the "recover all" scenario, since it has to do several reconstructions to get one output. But even so, I am not sure it would be much better with plain GF16, since you would still need to touch most shards to get one shard out.



