
I've studied the Burrows-Wheeler Transform, I understand the transformation, I've re-implemented it countless times for kicks, I see how it improves compressibility, but for the life of me the intuition of _why_ it works has never really clicked.

It's a fantastic bit of algorithmic magic that will always impress me to see it.



The Burrows-Wheeler transform has been described as a unique algorithmic idea in that there are no non-trivial variations or related algorithms, unlike more conventional compression algorithms, which can be tweaked and improved in so many ways. There is no general compression theory in which BWT could be described as a special case.

It looks to me that the above still holds: Bzip2 and Bzip3 are simply combining more conventional compression algorithms with the BWT, which itself is still the same old transform. Bzip2 does Huffman coding after BWT, and Bzip3 does arithmetic coding.


> There is no general compression theory in which BWT could be described as a special case.

I would not really say this is true. BWT is spiritually similar to the various Prediction by Partial Matching (PPM) algorithms, except that instead of needing to decide how much context (i.e., preceding symbols) to use to model the next symbol, and carefully learning the probabilities for each unique context, it naturally sorts symbols with the same context (of _any_ length) so they appear next to each other, and relies on adaptation to update the learned statistics as you move from one context to the next, without ever explicitly tracking context boundaries [0].

[0] N.J. Larsson, "The Context Trees of Block Sorting Compression," 1998. https://ieeexplore.ieee.org/abstract/document/672147


Thanks for the reference, looks interesting.


It definitely is related to prediction by partial match (PPM).

BWT sorts the rotations of the data, and the effect is that the same suffixes group together:

  ...
  "Bzip2 and Bzip3 are simply combining more"
  "Bzip3 are simply combining moreBzip2 and "
The character preceding the suffix goes to the end and then gets output. This is much like PPM going backward. There is a PPM* algorithm (unbounded context length) whose authors considered reconstructing contexts from the data, using something like an LZSS search. The same idea is in BWT: the context is reconstructed from the data.

BWT also breaks up nearby dependencies in the data, which is why move-to-front with Huffman or arithmetic coding works well on it.
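For anyone who wants to play with it, a minimal sketch of the naive forward transform (sort all rotations, read the last column); the concrete details here are just for illustration, not how real implementations build the transform:

  # Naive Burrows-Wheeler transform: build every rotation, sort them,
  # and read off the last column (the character preceding each sorted context).
  def bwt(s: str, sentinel: str = "\x00") -> str:
      s += sentinel  # unique terminator so the transform is invertible
      rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
      return "".join(rot[-1] for rot in rotations)

  text = "Bzip2 and Bzip3 are simply combining more"
  print(repr(bwt(text)))
  # Rotations starting with the same context (e.g. "zip2 ..." and "zip3 ...")
  # sort next to each other, so their preceding characters ("B", "B") cluster
  # in the output.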


Wow, thanks. As always, the best way to learn more on the internet is to be confidently and totally wrong!


Can BWT be combined with zstd, which uses asymmetric numeral systems?


Yes, it would actually be interesting to just have a BWT pass which does no compression, so we can then try lots of post-compression options.


I've been thinking about adding support for this kind of stacking to DACT [0].

[0] http://dact.rkeene.org/


> BWT be combined with zstd

BWT can be combined with anything which does RLE and get a benefit.

What it does is give RLE more to work with.
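A quick way to see that is to count the runs RLE would get to encode before and after the transform (naive rotation-sort BWT, purely for illustration):

  # Compare how many runs of identical characters RLE sees before and after
  # a (naive) Burrows-Wheeler transform.
  from itertools import groupby

  def bwt(s: str) -> str:
      s += "\x00"  # unique terminator
      return "".join(r[-1] for r in sorted(s[i:] + s[:i] for i in range(len(s))))

  def run_count(s: str) -> int:
      return sum(1 for _ in groupby(s))

  text = "the cat sat on the mat while the rat ate the hat " * 4
  print(run_count(text), len(text))  # nearly one run per character in the raw text
  print(run_count(bwt(text)))        # far fewer, longer runs: more for RLE to work with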


I think you would just need ANS, not the rest of zstd.


I don’t know enough to evaluate that, but it sounds plausible. Apparently modifying or integrating zstd into custom solutions was a common path in submissions to, at the very least, the GDCC 2021 T2 contest. This is all well outside of my learned competence so I’m just here to learn and ask naive questions.


Zstd is basically LZ77 + FSE. There are some other pieces to it, but they’re minor. FSE is a form of entropy coding that you can swap out with other entropy coding techniques, like Huffman or arithmetic coding. For most objectives, FSE is the clear winner of those three.

As people mentioned elsewhere in the thread, Burrows-Wheeler isn’t really composable with other systems. Nobody has figured out a reasonable way to combine LZ77 and Burrows-Wheeler. That’s why zstd + bzip2 does not work. But you do need an entropy coding system to go with Burrows-Wheeler… and FSE fits the bill.


Thank you for the reference. I learned something new today. That algorithm is wild. If you had shown me the transform and asked if it had an inverse, I would have said of course it doesn't, it's too weird.


I always understood it as working because of the predictability of a symbol/letter/token given the previous one.

Sorting all the shifts of a string puts all the characters in order, then looking at the last column shows you all the _preceding_ characters. If there's any predictability there (which there often is), it's now easier to compress. It's sorta like an entropy coder in that way.

I've never thought of it as being that deep, and I've understood it since I was a kid -- building an intuition for "why" the FFT works is much harder -- but that being said, I clicked quickly to reply thinking "that's easy! I can explain this!" then struggled for a while trying to get the picture in my mind into text. :)


But shouldn't Huffman coding already detect that same predictability and compress it the same?

What I don't get isn't the benefits of BWT on its own. It's why BWT should add any additional benefit if you're already doing Huffman.


> What I don't get isn't the benefits of BWT on its own. It's why BWT should add any additional benefit if you're already doing Huffman.

Ahhhh. Now we're on the same page. :) Seeing how it helps when combined is somewhat subtle/non-obvious. I believe it relates to BWT and Huffman both being approximations of something more optimal. The two transforms could also have different window sizes -- one rarely does BWT on a whole 1GB file -- which introduce inefficiencies. Huffman coding is also only optimal in the very large alphabet and very long data limits. As your data length and alphabet size decrease, it gets less optimal.

Put differently, "I think that's a wonderfully phrased question, this _is_ my specialization/subfield, and I'm gonna need to chew on it for a while."


Thanks. Yeah, I can see how that would make more sense if BWT was redundant under a theoretically perfect Huffman compression, but it happens to pick up some things that real-world Huffman encoders don't, with their practical limits on CPU and memory.


Nearly any real-world Huffman encoder is trivially optimal, i.e., given a set of symbols with probabilities, it is easy to construct the optimal set of output bits for each symbol. (There are some exceptions in that if you have a huge amount of symbols, or some that are extremely rare, you can bound the upper length and get a non-optimal code. This matters very little in practice, i.e., less than 0.1% in a typical setting. And of course, if you allow fractional bits or probabilities that change based on context, then you can do better, but then it's no longer Huffman.)

BWT is orthogonal to Huffman; like LZ, it exploits that some symbol _sequences_ are more common than others, while Huffman is about the optimality of coding each symbol on its own.
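To illustrate the "easy to construct" part, a minimal heap-based sketch (the symbol probabilities below are made up):

  # Textbook Huffman construction: repeatedly merge the two least probable
  # nodes; the code for each symbol falls out of the merge order.
  import heapq
  from itertools import count

  def huffman_codes(probs: dict[str, float]) -> dict[str, str]:
      tiebreak = count()  # keeps heap entries comparable when probabilities tie
      heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probs.items()]
      heapq.heapify(heap)
      while len(heap) > 1:
          p1, _, left = heapq.heappop(heap)
          p2, _, right = heapq.heappop(heap)
          merged = {s: "0" + c for s, c in left.items()}
          merged.update({s: "1" + c for s, c in right.items()})
          heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
      return heap[0][2]

  print(huffman_codes({"e": 0.4, "t": 0.25, "a": 0.2, "q": 0.15}))
  # More probable symbols get shorter bit strings, e.g. "e" -> a single bit.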


One way to look at data compression is splitting it into a model and an encoder. The model describes the data, while the encoder encodes (or equivalently predicts) the data according to the model. The compressed output consists of the serialized model and the encoded data. BWT is a model, while Huffman is an encoder.

Huffman takes a probability distribution and a symbol and encodes the symbol according to the distribution. If you encode all symbols independently according to the same distribution, you probably don't get very good compression.

You get a bit better results with a model that has a separate distribution for each context. If the previous symbols were X, Y, and Z, you encode the next symbol according to the distribution for context XYZ. This approach doesn't really scale, because the size of the model grows rapidly (exponentially in a naive implementation) with context length. You get better compression with an adaptive model. You start with a uniform distribution and update the available contexts and distributions after encoding each symbol. On the one hand, you don't have to store the model explicitly. But on the other hand, updating the model is very slow.

The Burrows-Wheeler transform is an implicit model. It sorts the symbols according to the context that follows them, and it does that simultaneously for every context length. Because you don't have to store the model explicitly, you can effectively use longer context lengths than with a fixed explicit model. And because you don't have to update an explicit model after encoding each symbol, using the BWT is much faster than using an adaptive model.
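To make the explicit-model version above concrete, here is a tiny (hypothetical, order-2) counting-model sketch; the point is the number of distinct context tables it has to keep, which is exactly what the BWT lets you avoid storing:

  # Explicit order-2 context model: one symbol-count table per 2-character context.
  # The number of tables is what blows up as the context length grows.
  from collections import Counter, defaultdict

  def order2_model(text: str) -> dict[str, Counter]:
      model: dict[str, Counter] = defaultdict(Counter)
      for i in range(2, len(text)):
          model[text[i - 2:i]][text[i]] += 1
      return model

  text = "the theory of the thing is that the theme recurs"
  model = order2_model(text)
  print(model["th"])  # distribution of symbols observed after the context "th"
  print(len(model))   # distinct contexts the model must store and serialize
  # An order-k model needs up to |alphabet|**k such tables; BWT avoids storing
  # any of them by sorting symbols so that equal contexts simply end up adjacent.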


Hi, tool author here.

Huffman coding is a static minimum-redundancy code. What this means is that it finds an optimal assignment of bit sequences to letters in the input alphabet (commonly US-ASCII or extensions). This, however, means that Huffman coding cannot exploit redundancies that stem from the concrete sequence of characters. For example, you could easily predict that an `e` comes after `Th`, but Huffman coding cannot know that.

Hence, after applying the Burrows-Wheeler transform, you need some sort of higher-order transform (i.e. a transform that considers more than just individual bytes) which benefits from the changed distribution of the transform's output. But we will get to that in a second.

The joke here is that the Burrows-Wheeler transform is closely related to suffix trees and suffix arrays, which are often used in bioinformatics and HPC for full-text search. If you want to find a pattern of length `p` in a text of length `n` and you already have a suffix tree of the original text, the search is linear in the length /of the pattern/, i.e. O(p). The suffix tree stores all suffixes of a string in a compressed manner (i.e. it has a linear space overhead, approximately O(20n) as given by Gusfield, D. (1997). Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press), so you can search for a word in it by simply traversing from the root node to an internal or leaf node, following the sequence of bytes that comprise the word.

As such, a suffix tree (and equivalently a suffix array or the BWT, which is trivially computed from a suffix array) forms something which can be thought of as a static PPM model. Notably, real-world implementations of PPM use suffix trees as part of their main storage data structure (e.g. PPMd). What this all means is that given a suffix tree, we can very cheaply give the probability distribution for the next byte that follows a given fixed-order sequence of bytes. This is nice, because then e.g. an order-2 predictor would be able to tell that `Th` is followed by `e` once enough data has been gathered.

As you can probably guess, the more preceding bytes you know, the better will be your estimate for what is the most likely next byte. But the larger your context, the more expensive the searches and computations become due to pointer chasing in the suffix tree.

So how do we remedy this? We notice that the Burrows-Wheeler transform essentially clusters similar contexts together, meaning that a low-order predictor (= faster, simpler) on the BWT compresses as well as a high-order predictor (= slow, complicated) on the original data, at the cost of an extra transformation. This is viable because the Burrows-Wheeler transform can be computed quickly, and there have been recent advancements in running it on the GPU. So bzip3 uses BWT + a low-order predictor with an arithmetic coder to encode the bytes, which means it can make use of high-order statistics for compression and performs comparably at a faster speed.
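To make the "low-order predictor on the BWT ≈ high-order predictor on raw data" point tangible, here is a rough, self-contained sketch that compares order-0 entropy before and after BWT + move-to-front. MTF is only a stand-in for a low-order adaptive model, and the rotation-sort BWT is purely illustrative - bzip3 builds a proper suffix array instead:

  # Compare per-symbol order-0 entropy of raw text vs. BWT + move-to-front output.
  # A plain order-0 coder does much better on the transformed data.
  import math
  from collections import Counter

  def bwt(s: bytes) -> bytes:
      s += b"\x00"  # unique terminator
      return bytes(r[-1] for r in sorted(s[i:] + s[:i] for i in range(len(s))))

  def mtf(data: bytes) -> list[int]:
      table = list(range(256))
      out = []
      for b in data:
          i = table.index(b)
          out.append(i)
          table.insert(0, table.pop(i))
      return out

  def order0_entropy(symbols) -> float:
      counts = Counter(symbols)
      n = len(symbols)
      return -sum(c / n * math.log2(c / n) for c in counts.values())

  text = b"the cat sat on the mat while the rat ate the hat " * 8
  print(order0_entropy(text))            # bits/symbol of the raw text
  print(order0_entropy(mtf(bwt(text))))  # noticeably lower after BWT + MTF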


You are spot on.

Btw, the Burrows-Wheeler transform is often explained as taking the last column.

I find it easier to understand why it works if you think of BWT as sorting all rotations of your string by their _second_ character onwards, and then writing down all the first characters.
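A quick sanity check of that view: sorting each rotation by its cyclic continuation from the second character onward and writing down the first characters gives the same string as the usual last-column formulation (the terminator below is just to keep all rotations distinct):

  # Two equivalent ways to read off the BWT of a string with a unique terminator.
  def bwt_last_column(s: str) -> str:
      rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
      return "".join(r[-1] for r in rotations)

  def bwt_first_of_sorted_tails(s: str) -> str:
      # Sort rotations by everything after their first character (continuing
      # cyclically), then write down the first character of each.
      rotations = sorted((s[i:] + s[:i] for i in range(len(s))),
                         key=lambda r: r[1:] + r[:1])
      return "".join(r[0] for r in rotations)

  text = "banana\x00"
  assert bwt_last_column(text) == bwt_first_of_sorted_tails(text)
  print(repr(bwt_last_column(text)))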


> But shouldn't Huffman coding already detect that same predictability and compress it the same?

Huffman coding only works on individual letters. It doesn't know anything about relationships between adjacent letters.


Understanding why increasing predictability helps with compression is not the hard part though. What's hard to grasp is why the transform is reversible.


A word can be factored into the set and frequency of letters plus the specific permutation. Compressible patterns in either channel seem likely when the underlying words are language-like.


And in general that set of descriptors provides no compressibility. BWT is much richer than that set, in that the way it works performs well for the data we care about.

Describing a multiset takes as much information as the multiset contained to begin with, on average. BWT somehow works better on things of use.


I remember the lecturer commenting on what sort of sick and twisted mind could come up with such a ridiculously convoluted notion when I was taught it at university.


Wheeler was also one of the inventors of the "closed subroutine" AKA function, which had to be implemented via a hack as machines of the time did not include ISA support for "return":

https://en.m.wikipedia.org/wiki/Wheeler_Jump


Yes. He also happened to be in residence just down the hall from the lecture theatre at the time.


I really liked the Computerphile video about it: https://www.youtube.com/watch?v=GYbCttCF25A


I feel exactly the same, and have also implemented it backwards and forwards. I've thought about it in my sleep, trying to recall how it REALLY works. Happens every few years ;-) I always thought it was probably obvious to everyone else what the "magic" is.


Isn't it basically run length encoding but on patterns? Sorting lexicographical on all rotations means blocks of patterns get grouped together, which means you can do compression more easily, right?


> the intuition of _why_ it works has never really clicked.

In terms of why it aids compression, or why it's reversible?

As far as the first goes: it transforms n-grams into repeated characters.


Haha, I also studied it as part of a student project, and I remember my computer science teacher saying "it's dark magic" ^^


Yeah. BWT and zero-knowledge proofs are my go-to examples of CS things that seem like pure magic.


Public-key cryptography is magic. Zero-knowledge proofs are a consequence that's difficult to find on your own but easy to understand once you've seen it.

I remember seeing someone (probably Avi Wigderson) demonstrating a zero-knowledge proof for graph coloring on an overhead projector when I was starting my second year studying CS. He had a slide with a graph, multiple slides with different permutations of the same valid coloring of the vertices, and a piece of paper with "doors" over the vertices to cover everything. The audience could ask him to open the doors over the vertices connected by any edge, and he would show that the coloring is valid, as far as those two vertices are concerned. And then he would replace the coloring with another permutation for the next iteration. The idea felt clever but kind of obvious in retrospect.
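For anyone who wants to poke at it, here is a toy simulation of one round of that demo, with salted SHA-256 commitments standing in for the paper "doors" (the graph and coloring below are made up):

  # One round of the graph 3-coloring zero-knowledge demo: the prover commits to
  # a freshly permuted coloring, the verifier opens the two "doors" on one edge.
  import hashlib
  import random
  import secrets

  edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]          # small example graph
  coloring = {0: "red", 1: "green", 2: "blue", 3: "green"}  # a valid 3-coloring

  def commit(value: str) -> tuple[str, bytes]:
      salt = secrets.token_hex(16)
      return salt, hashlib.sha256((salt + value).encode()).digest()

  def one_round() -> bool:
      # Prover: permute the colors and commit to each vertex's (permuted) color.
      colors = ["red", "green", "blue"]
      permuted = dict(zip(colors, random.sample(colors, 3)))
      hidden = {v: permuted[c] for v, c in coloring.items()}
      openings = {v: commit(c) for v, c in hidden.items()}               # kept by prover
      commitments = {v: digest for v, (_, digest) in openings.items()}   # sent to verifier

      # Verifier: pick one edge and ask for those two doors to be opened.
      u, w = random.choice(edges)
      for v in (u, w):
          salt, _ = openings[v]
          assert hashlib.sha256((salt + hidden[v]).encode()).digest() == commitments[v]
      return hidden[u] != hidden[w]  # the two colors must differ; nothing else is revealed

  print(all(one_round() for _ in range(100)))  # an honest prover passes every round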


Magical things are usually built up from the trivial examples.

Graph colouring is usually used as a teaching example for ZKP because of how easy it is to understand. It's still amazing how you can go from that example to "Here is a file that anyone can verify (non-interactively) which shows I have a passport and that passport is not from a country sanctioned by the USA, but otherwise does not reveal anything about me" or "I will not reveal my IP or any other identifying/unique info, but I will prove I have not been previously blocked from this website, including during the other times I anonymously accessed this website".


Things that look magical often stop being magical when you have the right perspective and the right abstractions. The step from a toy example to proving any verifiable statement is just NP-completeness. Which is simple enough that undergraduates are often expected to understand it.

Interactive zero-knowledge proofs are also technically non-interactive. They are defined in terms of the verifier evaluating a transcript of the protocol. If the verifier accepts some causal assumptions about the provenance of the transcript, they will accept the proof. If they disagree with the assumptions, the proof is indistinguishable from random noise they could generate themselves. An interactive commitment – challenge – response protocol is one possible causal assumption. A source of randomness could replace the challenges, making the protocol non-interactive. Or there could be a pre-committed secret, making a single-round protocol effectively non-interactive.

These are things a sufficiently interested CS undergraduate can prove and understand. Public-key cryptography, on the other hand, remains magical. There are many things people assume to be true, which need to be true for public-key cryptography to function. Empirically these things seem to be true, but nobody has been able to prove them. And I don't think anyone really understands them.


Agreed - I've worked with PKI for many years, and I know why the various systems work... in terms of "why you can decrypt again", not in terms of why they are secure (no other way to decrypt), which no one really knows. But if we assume for a moment that the systems are secure, it is truly fascinating to think about in abstract terms. Who would have thought it is possible to give someone an exact recipe to follow that scrambles a message to be sent - we can consider, e.g., an encryption routine with the public key inlined, so it is really just a standalone scrambling procedure. Even though it is completely public, it can only be used to scramble and not unscramble. The sender, who literally went through the steps to scramble the message, cannot undo what they just did (though they could have saved the original before scrambling!). And it is not because data is thrown away that it can't be unscrambled - all the information is there, fully recoverable, but only by the person who made the scrambling recipe, and there's no practical way to deduce the unscrambling recipe from the scrambling recipe.


> Public-key cryptography is magic.

Public key cryptography is kind of magic, but the basic piece that makes everything work isn't too hard. It's been a while since I took discrete math, so this is a bit rusty, but it should get you in the right direction. It's modular multiplication in a field. If you take a small field and make a times table, you can see the magic. All of the factors that are relatively prime to the modulus will have a pattern where F times X is only equal to F times Y if X and Y are equal. There will also be a multiplicative inverse I so that F times X times I equals X for all X; because of commutativity, I times X times F also equals X. When you've determined an F and an I, you give your correspondent I and keep F --- then if you multiply your message by F, they multiply it by I and get back the message. If they want to send you a message, they multiply it by I and send it, then you multiply it by F.

There's some other stuff, but that's the magic part, IMHO. You just have to use a bigger field and some other stuff I forgot. ;)
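A tiny numeric illustration of the multiplicative-inverse idea (just the "magic part" in a small prime field - not actual public-key crypto, since anyone who knows F can compute I here):

  # Multiplication in a prime field: F and its inverse I undo each other.
  p = 101            # a small prime modulus
  F = 37             # the "scrambling" factor, coprime to p
  I = pow(F, -1, p)  # multiplicative inverse of F mod p (Python 3.8+)

  message = 42
  scrambled = (message * F) % p
  recovered = (scrambled * I) % p
  print(scrambled, recovered)  # recovered == 42

  # The times-table pattern from the comment: F*x is distinct for distinct x.
  assert len({(F * x) % p for x in range(p)}) == p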



