VByte is a weird baseline to compare to: it is a byte-aligned encoding scheme, so it trades off some space efficiency for speed of decoding. The proposed method is bit-aligned, so it should be compared against bit-aligned encoding schemes.
For large integers, it is hard to beat Elias Delta coding [1], which asymptotically uses log(n) + o(log(n)) bits, and in practice it breaks even with most other encoding schemes quite early on.
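For concreteness, the delta code of an integer is just its bit length, gamma-coded, followed by its bits minus the leading 1. A quick sketch of that textbook construction (my own, the function name is arbitrary):

    def elias_delta(n: int) -> str:
        """Elias delta code of n >= 1 as a bit string: gamma-code the bit
        length of n, then append n's bits without the leading 1."""
        assert n >= 1
        nbits = n.bit_length()
        gamma = "0" * (nbits.bit_length() - 1) + bin(nbits)[2:]
        return gamma + bin(n)[3:]

    # e.g. elias_delta(10) == "00100010": "00100" gamma-codes the length 4,
    # and "010" is 10 = 0b1010 without its leading 1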
More importantly, Elias Gamma and Delta are complete, meaning that there is no redundancy (another way to look at it is that any sequence over the alphabet is decodable). VByte is complete over the byte alphabet.
Any complete code is optimal for some integer distribution (see "implied probability" in the Wikipedia page).
So if your integer distribution is heavier on the large integers, there are already plenty of complete encodings to pick from, and almost certainly one that fits the distribution well.
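To make the completeness / implied-probability point concrete, here is a toy check (mine), using Elias gamma because its length formula is the simplest:

    # Gamma spends 2*floor(log2 n) + 1 bits on n; the Kraft sum of 2**-length
    # approaches 1, so there is no redundancy, and the code is optimal for the
    # distribution P(n) = 2**-length(n).
    def gamma_len(n: int) -> int:
        return 2 * (n.bit_length() - 1) + 1

    kraft = sum(2.0 ** -gamma_len(n) for n in range(1, 1 << 20))
    print(kraft)  # ~0.999999, i.e. 1 - 2**-20 for this cutoff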
The scheme proposed here, instead, is not complete, as mentioned in the README ("this method does introduce some redundancy, particularly when sequences must be padded to prevent unintentional "101" patterns"), so it cannot be optimal for any distribution.
The Wikipedia page on the Kraft–McMillan inequality [2] explains this in more detail.
Yeah I think the paper could be better, too. Thank you for your suggestions — and for the information, it’s very interesting!
Although, the padding requirement for integers whose bits end in 10 can actually be dropped: you still join on just 101, but to decode you first split on 10101, re-attach the removed 10 to the left part, then split the resulting pieces on 101, so no padding is needed.
I guess you can consider that a spec bug in the draft? Hahahaha! :)
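Something like this is what I have in mind (a rough sketch only, not the real irradix code; it assumes chunks never contain "101" internally and never start with a 0, so the only special case is a chunk ending in "10"):

    def join_chunks(chunks):
        return "101".join(chunks)

    def split_chunks(stream):
        # Each "10101" is a chunk's trailing "10" plus the "101" delimiter,
        # so split on that first and re-attach the "10" to the left piece.
        pieces = stream.split("10101")
        pieces = [p + "10" for p in pieces[:-1]] + [pieces[-1]]
        # Every "101" that remains is a genuine delimiter.
        return [c for piece in pieces for c in piece.split("101")]

    chunks = ["110", "1110", "100", "11"]   # no "101" inside, some end in "10"
    assert split_chunks(join_chunks(chunks)) == chunks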
I don’t know what complete means here, or whether this change makes it complete, but it’s really interesting anyway.
Sounds like it would be a good idea to add these codings to the benchmark!
There’s another potential criticism I was thinking about: what if we encode the lengths with VByte then concat the bits, just like we do with irradix to make the L1 variant? It’s not really fair to compare L1 with VByte when they’re operating in different modes like that. It’s possible that any advantage with our current scheme disappears if you do that, I don’t know.
I picked VByte tho because it’s so simple and very common, so just a question for you: why aren’t the codings you mentioned used as frequently as VByte? Are they more complex or slower?
I'm not sure I understand the prime density thing. Of the numbers up to 8258, about 12.5% are prime. Accounting for the fact that about a quarter of these primes end in 101 in binary, i.e. cannot occur, I would expect about 10.7% = 12.5% * (3/4) / (7/8), which is fairly close to the observed 9.4%.
The 2.1% in the README seems to be the density of primes < 1000 among numbers up to 8258. That's not what was counted.
I haven't read all of that, but the problem at hand seems significantly less complicated.
We're mapping the numbers from 1 to 1000 to distinct numbers up to 8258, and the README claims that we should expect 2.1% of the resulting numbers to be prime. I see no reason for this claim, and as I understand it, the 2.1% comes from pi(1000) / 8258, which seems like nonsense to me.
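For what it's worth, here's a quick throwaway check of those numbers (a simple sieve, nothing from the repo; I'm reading "ends in 101" as n % 8 == 5):

    def primes_up_to(n):
        is_p = [True] * (n + 1)
        is_p[0] = is_p[1] = False
        for i in range(2, int(n ** 0.5) + 1):
            if is_p[i]:
                is_p[i * i :: i] = [False] * len(is_p[i * i :: i])
        return [i for i, p in enumerate(is_p) if p]

    N = 8258
    primes = primes_up_to(N)
    print(len(primes) / N)                                 # ~0.125, prime density up to 8258
    allowed = [n for n in range(1, N + 1) if n % 8 != 5]   # binary does not end in "101"
    print(len([p for p in primes if p % 8 != 5]) / len(allowed))  # ~0.107
    print(len(primes_up_to(1000)) / N)                     # ~0.02, i.e. pi(1000) / 8258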
[1] https://en.wikipedia.org/wiki/Elias_delta_coding
[2] https://en.wikipedia.org/wiki/Kraft%E2%80%93McMillan_inequal...