Say that experiment is correct. Wouldn't that imply that the information content of a single letter varies based on the possible continuations?
I.e., the string "I'v_" provides way more context than "con_", because it's much more likely that I'm typing "I've" than, say, "contraception".
That seems to disprove the idea that a letter is a bit.
Also, the fact that there are more than two letters indicates more than one bit, though I wouldn't even want to start guessing at the encoding scheme of the brain.
I don’t follow. Of course the probabilities change depending on context. One bit per letter is an average, not an exact measure for each individual letter. There are cases where the next letter is virtually guaranteed, and its information content is much less than one bit. There are cases where many different letters are plausible, and then it’s more than one bit. On average it’s about one bit.
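For instance, here's a rough sketch in Python (the context probabilities are made-up numbers, purely for illustration) of how the surprisal of a single letter swings around that one-bit average:

```python
import math

# Hypothetical next-letter probabilities in a few contexts; the numbers are
# invented purely to show how surprisal varies around the one-bit average.
cases = {
    'after "I\'v", the letter "e"': 0.99,   # nearly forced
    'after "th", the letter "e"':   0.60,   # likely but not certain
    'after "con", the letter "t"':  0.15,   # many continuations are plausible
}

for context, p in cases.items():
    print(f'{context}: p = {p:.2f}, surprisal = {-math.log2(p):.2f} bits')

# The "one bit per letter" figure is the expected surprisal over real text:
# near-certain letters (far below one bit) offset the uncertain ones (above it).
```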
> Also, the fact that there are more than two letters indicates more than one bit
This seems to deny the possibility of data compression, which I hope you’d reconsider, given that this message has probably been compressed and decompressed several times before it gets to you.
Anyway, it should be easy to see that the number of bits per symbol isn’t tied to the number of symbols when there’s knowledge about the structure of the data. Start with the case where there are 256 symbols. With a straightforward fixed-width encoding, that implies eight bits per symbol. Now take this comment, encode it as ASCII, and run it through gzip. The result is less than eight bits per symbol.
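A quick sketch of that experiment (using a stand-in paragraph for "this comment", so the exact ratio will differ a bit from yours):

```python
import gzip

# Stand-in for "this comment": a few sentences of ordinary English, ASCII-encoded.
text = (
    "Anyway, it should be easy to see that the number of bits per symbol "
    "isn't tied to the number of symbols when there's knowledge about the "
    "structure of the data. Start with the case where there are 256 symbols. "
    "That implies eight bits per symbol. Now take this comment, encode it as "
    "ASCII, and run it through gzip. The result is less than 8 bits per symbol."
)

raw = text.encode("ascii")             # 8 bits per character, uncompressed
compressed = gzip.compress(raw)        # includes a small fixed gzip header/footer

print(f"raw: {len(raw)} bytes  compressed: {len(compressed)} bytes")
print(f"bits per character after gzip: {8 * len(compressed) / len(raw):.2f}")
# On longer texts the ratio keeps improving, since gzip's fixed overhead matters less.
```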
For a contrived example, consider a language with three symbols: A, B, and C. A appears with a frequency of 999,999,998 per billion; B and C each appear once per billion. Now take some text from this language and apply a basic run-length encoding to it. Rare letters show up about twice per billion characters, so a typical run of A’s is roughly 500 million long; encoding each run takes around 29 bits for its length plus a couple of bits for the rare letter that ends it. That works out to something like 60 bits per billion letters on average, which is way less than one bit per letter.
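If you want to check the arithmetic, here's a small sketch (analytic, since nobody wants to generate a billion letters) comparing the Shannon entropy of that distribution with the rough run-length cost:

```python
import math

# The contrived language: A is overwhelmingly common, B and C are one-in-a-billion.
p = {"A": 1 - 2e-9, "B": 1e-9, "C": 1e-9}

# Shannon entropy per letter, in bits: the floor any code has to respect on average.
entropy = -sum(q * math.log2(q) for q in p.values())
print(f"entropy: {entropy:.2e} bits per letter")

# Rough run-length-encoding cost: rare letters arrive about twice per billion
# characters, so a typical run of A's is ~500 million long.  Each run costs
# roughly log2(5e8) ~ 29 bits for its length plus ~2 bits for the rare letter.
bits_per_run = math.log2(5e8) + 2
rle_bits_per_letter = 2 * bits_per_run / 1e9
print(f"RLE estimate: {rle_bits_per_letter:.2e} bits per letter")
```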
> I.e., the string "I'v_" provides way more context than "con_", because it's much more likely that I'm typing "I've" than, say, "contraception".
Yes, the entropy of the next letter always depends on the context. One bit per letter is just an average over all kinds of contexts.
> Also, the fact that there are more than two letters indicates more than one bit
Our alphabet is simply not the most efficient way of encoding information. It takes about 5 bits to encode 26 letters plus space, comma, and period. Plain character-level Huffman coding already gets English down to roughly 4 to 4.5 bits per letter, and LZ77-based compressors like gzip to around 3. Current state-of-the-art algorithms compress the English Wikipedia using a mere 0.8 bits per character: https://www.mattmahoney.net/dc/text.html
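As a rough check, here's a sketch of character-level Huffman coding over a short made-up sample (the exact figure wobbles with the sample, but it already beats a fixed 5-bit code):

```python
import heapq
import math
from collections import Counter

def huffman_code_lengths(text):
    """Return {char: code length in bits} for a Huffman code built from char frequencies."""
    freq = Counter(text)
    if len(freq) == 1:
        return {ch: 1 for ch in freq}
    # Heap entries: (total weight, unique tiebreaker, {char: depth so far}).
    heap = [(w, i, {ch: 0}) for i, (ch, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        merged = {ch: depth + 1 for ch, depth in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

# Short made-up sample of ordinary English; a longer text gives a steadier estimate.
sample = (
    "our alphabet is simply not the most efficient way of encoding information "
    "it takes about five bits to encode twenty six letters space comma and period"
)
freq = Counter(sample)
lengths = huffman_code_lengths(sample)
avg_bits = sum(freq[ch] * lengths[ch] for ch in freq) / len(sample)
fixed_bits = math.ceil(math.log2(len(freq)))  # naive fixed-width code for this alphabet
print(f"fixed-width: {fixed_bits} bits/char   Huffman: {avg_bits:.2f} bits/char")
```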