VByte is a weird baseline to compare to: it is a byte-aligned encoding scheme, so it trades off some space efficiency for speed of decoding. The proposed method is bit-aligned, so it should be compared against bit-aligned encoding schemes.
For large integers, it is hard to beat Elias Delta coding [1], which asymptotically uses log(n) + o(log(n)) bits, and in practice it breaks even with most other encoding schemes quite early on.
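For concreteness, the delta code of an integer is just its bit length, gamma-coded, followed by its bits minus the leading 1. A quick sketch of that textbook construction (my own, the function name is arbitrary):

    def elias_delta(n: int) -> str:
        """Elias delta code of n >= 1 as a bit string: gamma-code the bit
        length of n, then append n's bits without the leading 1."""
        assert n >= 1
        nbits = n.bit_length()
        gamma = "0" * (nbits.bit_length() - 1) + bin(nbits)[2:]
        return gamma + bin(n)[3:]

    # e.g. elias_delta(10) == "00100010": "00100" gamma-codes the length 4,
    # and "010" is 10 = 0b1010 without its leading 1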
More importantly, Elias Gamma and Delta are complete, meaning that there is no redundancy (another way to look at it is that any sequence over the alphabet is decodable). VByte is complete over the byte alphabet.
Any complete code is optimal for some integer distribution (see "implied probability" in the Wikipedia page).
So if your integer distribution is heavier on the large integers, there are already plenty of complete encodings to pick from, and almost certainly one that fits the distribution well.
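To make the completeness / implied-probability point concrete, here is a toy check (mine), using Elias gamma because its length formula is the simplest:

    # Gamma spends 2*floor(log2 n) + 1 bits on n; the Kraft sum of 2**-length
    # approaches 1, so there is no redundancy, and the code is optimal for the
    # distribution P(n) = 2**-length(n).
    def gamma_len(n: int) -> int:
        return 2 * (n.bit_length() - 1) + 1

    kraft = sum(2.0 ** -gamma_len(n) for n in range(1, 1 << 20))
    print(kraft)  # ~0.999999, i.e. 1 - 2**-20 for this cutoff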
The scheme proposed here, instead, is not complete, as mentioned in the README ("this method does introduce some redundancy, particularly when sequences must be padded to prevent unintentional "101" patterns"), so it cannot be optimal for any distribution.
The Wikipedia page on the Kraft–McMillan inequality [2] explains this in more detail.
Yeah I think the paper could be better, too. Thank you for your suggestions — and for the information, it’s very interesting!
Although, the padding requirement for integers whose bits end in 10 can actually be dropped: you still join on just 101, but to decode you first split on 10101, re-attach the removed 10 to the left part, then split the resulting pieces on 101, so no padding is needed.
I guess you can consider that a spec bug in the draft? Hahahaha! :)
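Something like this is what I have in mind (a rough sketch only, not the real irradix code; it assumes chunks never contain "101" internally and never start with a 0, so the only special case is a chunk ending in "10"):

    def join_chunks(chunks):
        return "101".join(chunks)

    def split_chunks(stream):
        # Each "10101" is a chunk's trailing "10" plus the "101" delimiter,
        # so split on that first and re-attach the "10" to the left piece.
        pieces = stream.split("10101")
        pieces = [p + "10" for p in pieces[:-1]] + [pieces[-1]]
        # Every "101" that remains is a genuine delimiter.
        return [c for piece in pieces for c in piece.split("101")]

    chunks = ["110", "1110", "100", "11"]   # no "101" inside, some end in "10"
    assert split_chunks(join_chunks(chunks)) == chunks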
I don’t know what complete means here, or whether this change makes it complete, but it’s really interesting anyway.
Sounds like it would be a good idea to add these codings to the benchmark!
There’s another potential criticism I was thinking about: what if we encode the lengths with VByte then concat the bits, just like we do with irradix to make the L1 variant? It’s not really fair to compare L1 with VByte when they’re operating in different modes like that. It’s possible that any advantage with our current scheme disappears if you do that, I don’t know.
I picked VByte tho because it’s so simple and very common, so just a question for you: why aren’t the codings you mentioned used as frequently as VByte? Are they more complex or slower?
I'm not sure I understand the prime density thing. Of the numbers up to 8258, about 12.5% are prime. Accounting for the fact that about a quarter of these primes end in 101 in binary, i.e. cannot occur, I would expect about 10.7% = 12.5% * (3/4) / (7/8), which is fairly close to the observed 9.4%.
The 2.1% in the README seems to be the density of primes < 1000 among numbers up to 8258. That's not what was counted.
I haven't read all of that, but the problem at hand seems significantly less complicated.
We're mapping the numbers from 1 to 1000 to distinct numbers up to 8258, and the README claims that we should expect 2.1% of the resulting numbers to be prime. I see no reason for this claim, and as I understand it, the 2.1% comes from pi(1000) / 8258, which seems like nonsense to me.
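For what it's worth, here's a quick throwaway check of those numbers (a simple sieve, nothing from the repo; I'm reading "ends in 101" as n % 8 == 5):

    def primes_up_to(n):
        is_p = [True] * (n + 1)
        is_p[0] = is_p[1] = False
        for i in range(2, int(n ** 0.5) + 1):
            if is_p[i]:
                is_p[i * i :: i] = [False] * len(is_p[i * i :: i])
        return [i for i, p in enumerate(is_p) if p]

    N = 8258
    primes = primes_up_to(N)
    print(len(primes) / N)                                 # ~0.125, prime density up to 8258
    allowed = [n for n in range(1, N + 1) if n % 8 != 5]   # binary does not end in "101"
    print(len([p for p in primes if p % 8 != 5]) / len(allowed))  # ~0.107
    print(len(primes_up_to(1000)) / N)                     # ~0.02, i.e. pi(1000) / 8258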
[1] https://en.wikipedia.org/wiki/Elias_delta_coding
[2] https://en.wikipedia.org/wiki/Kraft%E2%80%93McMillan_inequal...