
VByte is a weird baseline to compare to: it is a byte-aligned encoding scheme, so it trades off some space efficiency for speed of decoding. The proposed method is bit-aligned, so it should be compared against bit-aligned encoding schemes.
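For concreteness, VByte (a.k.a. varint) packs 7 payload bits per byte with a continuation flag in the high bit — a minimal sketch of the common LEB128-style variant (some VByte implementations flip the flag convention):

```python
def vbyte_encode(n: int) -> bytes:
    """Classic varint: 7 payload bits per byte, high bit set on every
    byte except the last (some variants invert this convention)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # final byte: high bit clear
            return bytes(out)

def vbyte_decode(data: bytes) -> int:
    n = 0
    for shift, byte in enumerate(data):
        n |= (byte & 0x7F) << (7 * shift)
    return n
```

Any value below 128 takes one byte, 300 takes two, and so on — byte alignment keeps decoding cheap at the cost of a 1-bit-in-8 overhead.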

For large integers, it is hard to beat Elias Delta coding [1], which asymptotically uses log(n) + o(log(n)) bits, and in practice it breaks even with most other encoding schemes quite early on.
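A sketch of the standard gamma/delta construction in Python (delta gamma-codes the bit length of n, then appends n's bits without the leading 1):

```python
def elias_gamma(n: int) -> str:
    # n >= 1: (bitlen - 1) zeros, then n in binary;
    # codeword length = 2*floor(log2 n) + 1.
    assert n >= 1
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

def elias_delta(n: int) -> str:
    # n >= 1: gamma-code the bit length of n, then n's bits minus the
    # leading 1; length ~ log2 n + 2*log2 log2 n, i.e. log(n) + o(log(n)).
    assert n >= 1
    b = bin(n)[2:]
    return elias_gamma(len(b)) + b[1:]

def elias_delta_decode(bits: str, pos: int = 0) -> tuple[int, int]:
    """Decode one delta codeword starting at pos; return (value, next pos).
    Works anywhere in a concatenated stream -- the code is prefix-free."""
    z = 0
    while bits[pos + z] == "0":
        z += 1
    length = int(bits[pos + z: pos + 2 * z + 1], 2)  # gamma-decoded bit length
    pos += 2 * z + 1
    n = int("1" + bits[pos: pos + length - 1], 2)
    pos += length - 1
    return n, pos
```

Because the code is prefix-free, codewords can be concatenated with no separators at all and decoded left to right.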

More importantly, Elias Gamma and Delta are complete, meaning that there is no redundancy (another way to look at it is that any sequence over the alphabet is decodable). VByte is complete over the byte alphabet. Any complete code is optimal for some integer distribution (see "implied probability" on the Wikipedia page).

So if your integer distribution is heavier on the large integers, there are already plenty of complete encodings to pick from, and almost certainly one that fits the distribution well.

The scheme proposed here, instead, is not complete, as mentioned in the README ("this method does introduce some redundancy, particularly when sequences must be padded to prevent unintentional "101" patterns"), so it cannot be optimal for any distribution.

The Wikipedia page on the Kraft–McMillan inequality [2] explains this in more detail.
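Completeness is easy to check numerically: a code is complete exactly when its Kraft sum ∑ 2^(-len(code(n))) equals 1. For Elias gamma the codeword length is 2⌊log₂ n⌋ + 1, so each "level" k contributes 2^k codewords of length 2k+1, i.e. 2^-(k+1) to the sum — a quick check using `bit_length` for the exact length:

```python
def gamma_len(n: int) -> int:
    # Elias gamma codeword length: 2*floor(log2 n) + 1.
    return 2 * (n.bit_length() - 1) + 1

# Partial Kraft sum over n = 1 .. 2^20 - 1. Levels 0..19 are complete,
# each contributing 2^-(k+1), so the sum is exactly 1 - 2^-20.
kraft = sum(2.0 ** -gamma_len(n) for n in range(1, 2 ** 20))
```

The partial sums approach 1, confirming gamma is complete; a redundant (incomplete) code's Kraft sum stays strictly below 1 no matter how far you sum.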

[1] https://en.wikipedia.org/wiki/Elias_delta_coding

[2] https://en.wikipedia.org/wiki/Kraft%E2%80%93McMillan_inequal...



Yeah I think the paper could be better, too. Thank you for your suggestions — and for the information, it’s very interesting!

Although, the padding requirement for integers whose bits end in "10" can actually be dismissed: you join on just "101". Then to decode, you first split on "10101", re-attach the removed "10" to the piece on its left, and then split the resulting pieces on "101" — so no padding is needed.
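If I follow, and assuming (as I believe the irradix representation guarantees — both are assumptions on my part) that chunks never contain "101" internally and never start with "0", that decode trick round-trips. A quick sketch:

```python
def join_chunks(chunks: list[str]) -> str:
    # Assumption: each chunk starts with "1" and contains no "101"
    # substring (properties I'm assuming irradix provides).
    return "101".join(chunks)

def split_chunks(stream: str) -> list[str]:
    # Each "10101" is a chunk-final "10" fused with a "101" separator,
    # so split those off first and re-attach "10" to the left piece.
    pieces = stream.split("10101")
    repaired = [p + "10" for p in pieces[:-1]] + [pieces[-1]]
    # The remaining separators are plain "101".
    out: list[str] = []
    for piece in repaired:
        out.extend(piece.split("101"))
    return out
```

For example, ["110", "1", "100", "11"] joins to "110101110110010111" (note the fused "10101" at the front) and splits back intact, with no padding bits.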

I guess you can consider that a spec bug in the draft? Hahahaha! :)

I don’t know what "complete" means here, and I don’t know whether this fix makes the scheme complete, but anyway it’s really interesting.

Sounds like it would be a good idea to add these codings to the benchmark!

There’s another potential criticism I was thinking about: what if we encoded the lengths with VByte and then concatenated the bits, just like we do with irradix to make the L1 variant? It’s not really fair to compare L1 with VByte when they’re operating in different modes like that. It’s possible that any advantage of our current scheme disappears if you do that; I don’t know.

I picked VByte, though, because it’s so simple and very common. So just a question for you: why aren’t the codings you mentioned used as frequently as VByte? Are they more complex or slower?



