Got it. I was thinking of the locations as being i.i.d., in the sense that you don't have any information on where each symbol falls. You and others are saying that adjusting the symbol counts as you go would match the performance of other algorithms. Looking into that, thanks.
Whenever you're compressing a file on your computer, the frequencies are known beforehand. Agreed this does not apply to noisy-channel coding, only to the Shannon source coding theorem in lossless compression. Good point, though I'm not sure whether this would still be considered an entropy encoder, since previous values do have some impact. Any more info I should look at for that?
I was using what seemed to me to be the standard with other encoders: storing the ratios costs roughly the same as with other compressors, plus a few extra bits for exact counts. But I do see how this isn't i.i.d., so I'm fixing that.
I don't think you understand the basic concepts, and you are wasting your time trying to "fix" it. Let me try to explain a bit differently: the Shannon limit depends on assuming a model of the source (the thing producing symbols).

For example, imagine the output of something that you think is "true" noise, like a fair coin flip that produces 0 (for heads) and 1 (for tails). That has an entropy of 1 bit and you cannot compress it. Now you learn that it is actually a pseudorandom number generator that produces 0 or 1 in a 1:1 ratio. If you know the program (and seed, if required), the source now has 0 entropy: you can run the program to know every bit that will be produced. Same symbol outputs, different models of how they are produced, different theoretical compression limits.

As another example, if you are compressing text, one model would be to assume that words are produced independently of one another. We know that isn't true (certain words are more likely to follow other words, which is how the next-word prediction on your phone works), but you might still choose to implement a compression system based on that naive model because it's less complex, uses less memory or compression/decompression time, etc. If you implemented compression using a less naive model (e.g. "what is the probability of the next symbol in all human text, given all the symbols I have seen before this one?") you could get better compression rates, but you wouldn't be breaking the Shannon limit that you calculated using the naive model, because the "true" limit is based on the "true" model: whatever the entropy of human-produced text is, which is not really calculable (people produce new sentences every day) but which you might approximate using something like all the human text you can find.
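To make the model dependence concrete, here's a quick Python sketch (the toy string and function names are mine, purely illustrative): the same data gets a different entropy figure depending on whether you model symbols as independent or condition on the previous symbol.

    from collections import Counter
    from math import log2

    def order0_entropy(s):
        # Entropy under an i.i.d. model: each symbol drawn independently.
        n = len(s)
        return -sum(c / n * log2(c / n) for c in Counter(s).values())

    def bigram_entropy(s):
        # Conditional entropy H(next | previous) under a first-order Markov model.
        pairs = Counter(zip(s, s[1:]))
        prev = Counter(s[:-1])
        n = len(s) - 1
        return -sum(c / n * log2(c / prev[a]) for (a, b), c in pairs.items())

    text = "abababababababab"
    print(order0_entropy(text))  # 1.0 bit/symbol under the naive i.i.d. model
    print(bigram_entropy(text))  # 0.0 -- the richer model predicts every symbol

Same symbols, two models, two different "Shannon limits"; neither model is automatically the data's one true limit.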
As I alluded to above, note that Shannon entropy doesn't take into consideration practical complexities (memory, program size, processing time) that are real considerations when implementing a compression scheme. Most of the time you are going to trade higher entropy for simpler, faster, less resource-intensive compression.
To sum up: You cannot beat the Shannon limit of the "true" model of the source but you can beat the Shannon limit of a naive approximation of the source. Those naive models are used because they are more practical to implement.
I appreciate the input. I did not mean to imply I could encode a random stream of symbols/characters; that point is absolutely valid. I was approaching this as a compression technique for something like a text file, where the symbol counts are known at compression time.
First issue is that if you do comparisons you have to do them apples-to-apples: you can't assume that method A has that information available for free but method B does not, or that method A assumes a given source model while method B assumes a different one.
Second, I really don't understand how you intend to use a table of symbol counts. If you do it over the entire file, the table might be a reasonable size, but the number of permutations becomes infeasible. Conversely, if you do it in small windows (like 8 or so in your examples), you have to store a separate symbol count table for each window, which would explode the symbol count table. I really doubt you are gaining anything from doing this. You are going to create an enormous per-file symbol frequency table and then not count it against the compressed size; that isn't compression, it's just misdirection.
When I get time I may compress the table as well, then. Looking at example arithmetic encoders, they seem to have some flat table implementations too (https://github.com/nayuki/Reference-arithmetic-coding/blob/m...). I don't see how that's misdirection; I left the table uncompressed specifically for transparency reasons. Was just trying to keep that part simple.
Yeah, this was a big fear in posting. No, I'm not experienced in academia, or this would be a paper; I code things for fun. Have you tried looking at the math for multinomials vs. Shannon? Or running the code? Please do point out my mistakes in the math.
It is not our responsibility to do that. You should explain how your method differs from well-established methods such as arithmetic coding (see papers by Rissanen). While doing that, you should quickly realize that you’ve got nothing at all.
A basic data compression course would have sufficed too, but as you say, you don’t have any knowledge of the basics of information theory or data compression. You sound like an over-eager yet ignorant grad student who insists he has found flaws in the professor’s lecture material.
> As an example, given the symbol frequencies of 1, 2, and 3 for a total of 6 symbols. Shannon limit: [-(1/6)log2(1/6) - (2/6)log2(2/6) - (3/6)log2(3/6)] × 6 ≈ 8.75 => 9 bits
This isn't how you calculate the Shannon limit, fwiw. If all symbols are of equal frequency, simply take the total number of different symbols and compute log2(different_symbol_count). That's it.
So say you had 60 different permutations. Each of those is a symbol in entropy encoding. Each individual symbol takes log2(60) bits to store using arithmetic encoding, which is exactly the statement and correct calculation you had for the Shannon limit in the very next line :). As in, the Shannon limit for 60 equally likely symbols is absolutely ~5.9 bits, not 9 bits.
Instead you have some weird calculation here that seems to take individual probabilities and add them back up again to give 9 bits. That's very far off and incorrect.
I think you are confused about probabilities and realized frequencies. A probability is a before-the-fact model; a realized frequency is an after-the-fact observation. If you flip a fair coin, the probability of heads or tails is 50%, but any given single flip will have a frequency of 1 for either heads or tails and 0 for the other. (What is true is that as you flip the coin a very large number of times, the ratio you actually see will approach 50/50, but that's definitely not true for, say, 8 tosses. This is related to the asymptotic equipartition property, but it isn't really relevant here.)
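To see that gap concretely, a tiny simulation (the seed and toss counts are arbitrary, chosen only so the run is reproducible):

    import random

    random.seed(42)  # arbitrary seed so the run is repeatable
    for n in (8, 1_000_000):
        heads = sum(random.random() < 0.5 for _ in range(n))
        print(f"{n} tosses: heads ratio = {heads / n:.3f}")
    # 8 tosses can easily come out 6:2 or 7:1; a million sits very near 0.500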
As far as I can tell, the symbol counts are not included when measuring the size of the output. I did say that, but there is a lot of text :) Huffman, ANS, arithmetic coding, etc. do not include the frequency counts as part of the size afaik, though of course they need to be stored to decode. If not, please link me.
> Huffman, ANS, arithmetic coding, etc. do not include the frequency counts as part of the size afaik, though of course they need to be stored to decode. If not, please link me.
They may or they may not. There are schemes doing both. Either way, you cannot compare arithmetic coding given no information to your scheme that is given the exact frequencies of every symbol.
Sorry, I (and others) have explained several times what an equivalent arithmetic coding scheme would have to do to make the same assumptions (and have the same limitations) as your scheme.
It would have to assume there are _exactly_ N ones and N zeros in the bit string, and start with those counts:
    remaining_ones = N;
    remaining_zeros = N;
It would encode bit by bit. After every zero, it decrements the remaining number of zeros; after every one, it decrements the remaining number of ones. The probability of the next bit being zero or one is derived from the known remaining counts: P(one) = remaining_ones / (remaining_ones + remaining_zeros). When one of the counts reaches zero, the probability of the other becomes 100% and no more information needs to be encoded.
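For what it's worth, here's a minimal Python sketch of that idea (function names are mine, purely illustrative). Rather than a full arithmetic coder it computes the equivalent enumerative rank: the index of the sequence among all sequences with exactly those counts, which needs ceil(log2(C(ones+zeros, ones))) bits, the same total the adaptive coder above approaches.

    from math import comb, ceil, log2

    def encode_rank(bits, ones, zeros):
        # Lexicographic rank of `bits` among all sequences with
        # exactly `ones` ones and `zeros` zeros.
        rank = 0
        for b in bits:
            if b == 1:
                # Skip every sequence with a 0 in this position: the
                # remaining `ones` ones fill the other ones+zeros-1 slots.
                rank += comb(ones + zeros - 1, ones)
                ones -= 1
            else:
                zeros -= 1
        return rank

    def decode_rank(rank, ones, zeros):
        # Inverse of encode_rank: walk the same counts back down.
        bits = []
        while ones + zeros > 0:
            zero_first = comb(ones + zeros - 1, ones) if zeros > 0 else 0
            if zeros > 0 and rank < zero_first:
                bits.append(0)
                zeros -= 1
            else:
                rank -= zero_first
                bits.append(1)
                ones -= 1
        return bits

    seq = [0, 1, 1, 0, 1, 0, 0, 1]      # 4 ones, 4 zeros
    r = encode_rank(seq, 4, 4)
    assert decode_rank(r, 4, 4) == seq
    print(r, ceil(log2(comb(8, 4))))    # rank fits in ceil(log2(70)) = 7 bits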
Does your linked code do that? No.
I don't have a ready-made example because this is not an assumption about the input that is useful in the majority of cases. The linked example can encode _any sequence_, not just ones that have an equal number of ones and zeros.
If you spent time implementing this scheme you would learn a lot about this field.
Thanks, I'll work on that. One thing though:
> The linked example can encode _any sequence_, not just ones that have an equal number of ones and zeros.
The compression I made can also encode any sequence of bytes; see the testinput folder (the caveats in the FAQ are because of implementation specifics, not the algorithm). It just also has to store the freq table, which I'm working on compressing too. It has no requirement for an equal number of 1s and 0s; that was just an example in the description that I guess did more harm than good.
I mean without storing the freq table, which the last example you linked doesn't do. Having to store a frequency table limits the encoder to doing no better than the 0th-order entropy. Arithmetic coding can do arbitrarily complex prediction per symbol; see e.g. http://mattmahoney.net/dc/text.html#1072 which reaches the world record by training a neural network to predict the next symbol while it's encoding a file.
Not to say a 0th-order entropy coder could not be useful for certain applications if it were, e.g., faster than a range coder while maintaining similar ratios. But I doubt that is possible with this method.
I've added basic bit packing to the frequencies table; so long as the dictionary doesn't dominate (like in a pangram), it's slightly smaller than two arithmetic examples (static and adaptive) I could find online. I understand this isn't the best that arithmetic coding can do; next I'll look into that.
Never was trying to compare to NNs or PPM (or even simple LZ), since all this algorithm does is look at the frequency counts.
Agreed it's not very practical and the calculations are very expensive; it's just fun to work on.
I was initially surprised to find that the size of multinomials is less than the entropy predicted by the Shannon source coding theorem, which is generally accepted as a lower bound. Not sure if this encoder is new, but hopefully others find it interesting. Step-by-step math here: https://github.com/Peter-Ebert/Valli-Encoding/blob/main/the-...
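To put concrete numbers on it, here's a small Python check of the arithmetic using the 1/2/3 frequency example from this thread (my own sketch, not the repo's code):

    from math import factorial, log2

    counts = [1, 2, 3]                     # symbol frequencies from the example
    n = sum(counts)                        # 6 symbols total

    # Shannon (order-0) cost: n * H(p), with p_i = count_i / n
    H = -sum(c / n * log2(c / n) for c in counts)
    shannon_bits = n * H                   # ~8.75 bits

    # Distinct arrangements with exactly these counts:
    # the multinomial coefficient 6! / (1! * 2! * 3!) = 60
    arrangements = factorial(n)
    for c in counts:
        arrangements //= factorial(c)
    multinomial_bits = log2(arrangements)  # log2(60) ~ 5.91 bits

    print(f"Shannon: {shannon_bits:.2f}, multinomial index: {multinomial_bits:.2f}")

The index over arrangements is smaller because "exactly these counts" is extra information that the order-0 Shannon bound does not assume.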
I don't see why arithmetic coding wouldn't match this if the probabilities are suitably updated after every symbol. If you encode 8 bits at a 4:4 ratio, after the first bit is encoded either 1 or 0 has a remaining count of 3, and so on. And when the count of either 0 or 1 reaches 0, the rest of the bits are known.
Here's a simpler example: 2 symbols at a 1:1 ratio. Shannon would say the entropy is 1 bit per symbol, so this needs 2 bits: 01 or 10 encode both permutations.
However, I can also just store 1 or 0 to indicate what's stored in the first position, using only a single bit; the next value is inferred.
With arithmetic coding this requires 1 bit to store: the value of the first bit. The other is known with 100% probability. This is very simple to implement, especially with bit-wise arithmetic coding.
The problem is that the Shannon limit doesn't apply to your example: a bit string with a fixed 1:1 ratio is not i.i.d.
This. Optimal use of an arithmetic encoder is with P(symbol | all previous symbols), not just P(symbol). Obviously there are practical problems with storing all possible contingency tables, so for something like HEVC quite a lot of effort goes into identifying where this matters.
You are saying that, conditional on some additional information (or structure), you can get better compression. But that is tautological: conditional on the entropy being lower, the entropy is lower. In your case you assume you already know the symbol frequencies and you don't count that as information, yet you compare against a model where that is not assumed.
If I had a text that was 100% 'aaaa' or 'bbbb' or 'cccc' with equal probability, I'd feed probability 1/3 for 'aaaa', 1/3 for 'bbbb', and 1/3 for 'cccc' into an encoder. In this case, since the probabilities are not (negative) powers of two, I'd use an arithmetic encoder to optimally compress this down to ~1.58 bits per block: 'aaaa' would take ~1.58 bits to store, as would 'bbbb' and 'cccc'.
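A tiny sketch of where the ~1.58 figure comes from (the message is made up; this just multiplies interval widths the way an idealized arithmetic coder would):

    from fractions import Fraction
    from math import log2

    # Three equally likely whole-block "symbols".
    p = {'aaaa': Fraction(1, 3), 'bbbb': Fraction(1, 3), 'cccc': Fraction(1, 3)}

    msg = ['aaaa', 'cccc', 'bbbb', 'aaaa']  # made-up message
    width = Fraction(1)
    for block in msg:
        width *= p[block]                   # each block shrinks the interval
    ideal_bits = -log2(float(width))        # total ideal code length
    print(ideal_bits / len(msg))            # ~1.585 bits per block = log2(3)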
Wouldn't two symbols with a 1:1 ratio have four possible bit patterns for two bits (00, 01, 10, 11), with the ratio only holding on average over a large number of bits?
If there really could only be 01 or 10, then those are the two symbols in the alphabet, and you only need one bit to pick the next symbol (two bits of output).