Taking a Look at Compression Algorithms (cefboud.github.io)
214 points by thunderbong 6 months ago | 38 comments


Amazing writeup - I needed this a few months ago :)

My impression after my own, shallower dive is that trainable dictionaries are an underappreciated part of the (or at least my) system design toolkit.

For example, say you're serving Wikipedia - a bunch of pages that are kind of static. In order to minimize disk space, you'll be tempted to compress the content. Compressing the whole corpus gets a good compression ratio, but it means that, to read an arbitrary item, you need to decompress everything (or 50% of everything, on average, I guess).

So to get random access, you compress each page. That's ok, but you get a worse compression ratio because every compressor starts from scratch.

But with Zstandard and a trainable dictionary, you could train a dictionary on a couple pages, then use that dictionary to compress and decompress arbitrary items.

As far as I can tell, that's probably the best of both worlds - close to the compression ratio of compressing the whole corpus with gzip, but the random access of compressing each item individually.
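
A rough sketch of that workflow with the Python zstandard bindings (the file paths and dictionary size here are just placeholders):

    from pathlib import Path
    import zstandard

    # Train a shared dictionary on a sample of pages
    # (in practice you'd feed it many more samples).
    samples = [Path(p).read_bytes() for p in ("pages/cat.html", "pages/dog.html", "pages/tree.html")]
    dict_data = zstandard.train_dictionary(110 * 1024, samples)

    # Compress each page individually, but against the shared dictionary.
    cctx = zstandard.ZstdCompressor(dict_data=dict_data)
    compressed = cctx.compress(Path("pages/einstein.html").read_bytes())

    # Random access later: decompress just that one page with the same dictionary.
    dctx = zstandard.ZstdDecompressor(dict_data=dict_data)
    original = dctx.decompress(compressed)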

This seems really generalizable - e.g. maybe Facebook has to store a zillion photos that are very rarely accessed, but 10% of them are selfies. If we use a vector search to find clusters of similar items, we can compress those items with a single dictionary.

In fact, taking another step back, it seems like databases ought to offer this out of the box. Just like the concept of an index, it's not always a win and there are a lot of knobs that you might want to tune, but the benefits seem clear.

Maybe all of this already exists, or there's something I'm missing, but I really appreciate articles like OP's that break things down so clearly.


What you are describing is sometimes called a shared dictionary and it's a great trick for task-specific compression, where you know what data you're going to be compressing ahead of time.

The Brotli algorithm is typical LZ plus a shared dictionary aimed at common web documents and markup. It does work well and fast for HTML. A common criticism is that it's basically targeted at compressing Wikipedia: the dictionary is loaded with a bunch of junk, and now every browser needs a copy of that 120 kB of junk, some of which will very rarely be used unless you're compressing Wikipedia. (Both "II, Holy Roman" and "Holy Roman Emperor" are tokens in the Brotli dictionary, for example. Whole dictionary here for the curious: https://gist.github.com/duskwuff/8a75e1b5e5a06d768336c8c7c37... )


In fact there is a new feature Chrome is championing (and just shipped) called "Compression dictionary transport" - https://datatracker.ietf.org/doc/draft-ietf-httpbis-compress... / https://chromestatus.com/feature/5124977788977152 - that allows any HTTP resource to specify the dictionary it wants to use (including "use me as the dictionary for future requests"), which allows a website to use a dictionary specialized to _its_ contents instead of the contents of something completely different.


Another state machine in the browser, what can go wrong?


Heh, and if dictionaries can be shared between sites, that's another potential security leak.


If anyone is interested in an example of how ZSTD's dictionary compression performs against standard gzip, a number of years ago I put together an example using some Common Crawl data.

"I was able to achive a random WARC file compression size of 793,764,785 bytes vs Gzip's compressed size of 959,016,011" [0]

In hindsight, I could have written that up and tested it better, but it's at least something.

[0] https://github.com/benwills/proposal-warc-to-zstandard?tab=r...


RocksDB has support for Zstd and preset dictionaries, which makes sense since it has the same kind of level-span chunking, being a fork of LevelDB.

Entries are stored in memory and logged (instead of being put into a B-tree like classic DBs) and then periodically placed into span files that are "linear" for faster search. Since these span files are built in bulk, a lot of data is handled at once, so it makes more sense to compress blocks of them (even though the file is linear, it's still stored as blocks, and a read just decompresses the relevant variable-size block).


I remember tools that worked with the Wikipedia dumps, in bzip2, and built indexes to allow decent random access. Once you know where the compressed blocks are, and which Wikipedia entries they contain, you can start from a given block (something like 900k) rather than from the beginning of the file. Compressing roughly a megabyte at a time, rather than a page, is a pretty solid win for compressibility.


Good thoughts. I'm going to keep this in mind. I've been working on a custom UDP netcode for a while. I experimented with LZMAing / RLEing the binary snapshot diffs I send down, and neither felt great, but RLEing has beaten LZMA 100% of the time for what I'm doing so far. Some kind of trained dictionary does sound better.


In general, it's often worth combining transforms like RLE with general-purpose compression. General compression algorithms don't know the details of your data and typically have a maximum window size, period, so if your data has long runs that RLE would squash, LZMA (or most other compressors) will be seeing a giant block of zeros most of the time and won't be able to see nearly as much of the actual data. Running compression after RLE means the giant chunks of zeros are squashed down first, so the regular compressor can fit non-trivially compressible data within its window and look for improvements more usefully.
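
A toy sketch of that chaining, with made-up snapshot data (whether the RLE pass actually wins depends entirely on your run lengths versus the compressor's window, so measure on your own data):

    import zlib

    def rle(data: bytes) -> bytes:
        """Very simple RLE: emit (count, byte) pairs, runs capped at 255."""
        out = bytearray()
        i = 0
        while i < len(data):
            run = 1
            while i + run < len(data) and data[i + run] == data[i] and run < 255:
                run += 1
            out += bytes((run, data[i]))
            i += run
        return bytes(out)

    # Made-up snapshot diff: long zero runs with a little real data sprinkled in.
    snapshot = (b"\x00" * 5000 + b"\x07\x2a\x00\x13") * 200

    print(len(zlib.compress(snapshot)))        # general-purpose compressor alone
    print(len(zlib.compress(rle(snapshot))))   # RLE transform first, then compress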


The conclusion, "One interesting idea is that, at their core, AI models are nothing more than compression models that take the corpus of data humanity has and boil it down to a set of weights", is also good. There is an interesting paper about compressing weather data with neural networks: https://arxiv.org/abs/2210.12538

What's missing a bit is that the comparison is mostly about general-purpose data; there are some very interesting and super fast compression algorithms for, e.g., numbers (TurboPFor, Gorilla, etc.). Daniel Lemire's blog is super interesting on the different algorithms and how to make them faster.
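
Those integer codecs are built on transforms like delta coding plus bit packing; a minimal sketch of just the delta idea (made-up timestamps, with zlib standing in for the real bit-packing stage):

    import struct, zlib

    # Made-up, regularly spaced timestamps.
    timestamps = list(range(1_700_000_000, 1_700_300_000, 30))
    deltas = [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]

    pack = lambda xs: b"".join(struct.pack("<q", x) for x in xs)
    print(len(zlib.compress(pack(timestamps))))  # raw 64-bit integers
    print(len(zlib.compress(pack(deltas))))      # small deltas compress much further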


Information Theory, Inference, and Learning Algorithms (2005) by David J.C. MacKay (sadly deceased) was one of my favorite books covering some of the Maths in this area. I need to look at it again.

Free link to online version http://www.inference.org.uk/itprnn/book.pdf


I believe LZ + Huffman hit a sweet spot in ratio/speed/complexity, and that's why it has remained very popular since the late 80s. It's only more recently that faster hardware made arithmetic coding with higher-order models fast enough to be practically usable.


More importantly, there's been a huge shift in the cost of tables vs multiplications.

Back in the 80s and early 90s, multiplication was expensive (multiple cycles) while tables were more or less "free" in comparison; today, cache misses are super expensive (hundreds of cycles) while multiplications can be run in parallel (MMX, SSE, etc.). Sure, a Huffman table will probably mostly be in cache, but it still comes at the cost of cache space.

In addition to that various arithmetic encoding methods were patented and thus avoided.


I did some benchmarking of compression and decompression last year: Raspberry Pi 4, Debian, and my corpus was a filesystem with a billion files on it, as a sparse tar of a drive image, which I acknowledge is an odd choice, but that's where my focus was. I made graphs, because the tables were overwhelming. (Which also applies to this post, I think.) There's a blog post, but these are probably more quickly useful:

* the graphs - https://gitlab.com/pronoiac/billion-file-fs/-/tree/main/grap...

* the Pareto frontier for compression vs time: zstd -1 and -9, plzip -0, xz -1 and -9. lzop -1 was a bit faster, and plzip -9 a bit smaller, but they had heavy penalties on the other axis.

I wasn't aware of Snappy.



I really like the conclusion. "Gzip is all you need" has been losing momentum for some time, but there is definitely more gold to dig in that area.


Does anyone have a favorite arithmetic coding reference implementation? It's something I've never implemented myself, and it's a gap in my tactile understanding of algorithms that I've always meant to fill in.


"Arithmetic coding for data compression" by Witten Neal and Cleary. 1987. https://dl.acm.org/doi/10.1145/214762.214771 It explains that all that's left to work on for better lossless compression factors is better (higher likelihood) models of sources. It provides C code and a good explanation of arithmetic coding.


I really enjoy https://andrewiggins.github.io/gz-heatmap/ to help visualize gzip/deflate at a per-byte level.


If any of you are interested in more details on compression and its uses, take a look at "Bit-mapped Graphics" by Steve Rimmer and "The Data Compression Book" by Mark Nelson.


I’d be interested in hearing more about how to “abuse” compression algorithms to detect outliers in datasets or to filter out repetitions. For example, LLMs or Whisper sometimes get stuck in a loop, repeating a word, a group of words, or whole sentences multiple times. I imagine a compression algorithm would be an interesting tool to filter out such noise. And there’s maybe even more shenanigans you can do with them besides actual compression…
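
A quick-and-dirty version of that idea is to use the compression ratio itself as the repetitiveness signal (zlib here, but any codec works; the sample strings are made up):

    import zlib

    def repetitiveness(text: str) -> float:
        """Compressed size over raw size; lower means more repetitive."""
        raw = text.encode("utf-8")
        return len(zlib.compress(raw, 9)) / len(raw)

    normal = "the quick brown fox jumps over the lazy dog, then takes a nap."
    looping = "I'm sorry. " * 40   # the kind of stuck-in-a-loop output you describe

    print(repetitiveness(normal))    # stays fairly high for short, varied text
    print(repetitiveness(looping))   # drops sharply once the text is mostly repeats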


Why not use a dedicated algorithm?

Rolling hashes and Bloom filters would get you a lot more than trying to glean something out of a compressed stream. Just because compression algorithms use dictionaries doesn't mean that's the only way to build a dictionary...
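
For what it's worth, a minimal sketch of that kind of dedicated approach, using a polynomial rolling hash over a sliding window (a Bloom filter could stand in for the `seen` set to bound memory):

    def rolling_hashes(data: bytes, window: int, base=257, mod=(1 << 61) - 1):
        """Yield (offset, hash) for every window-sized slice, updated in O(1) per step."""
        if len(data) < window:
            return
        top = pow(base, window - 1, mod)
        h = 0
        for i in range(window):                    # hash of the first window
            h = (h * base + data[i]) % mod
        yield 0, h
        for i in range(window, len(data)):         # slide: drop one byte, add one byte
            h = ((h - data[i - window] * top) * base + data[i]) % mod
            yield i - window + 1, h

    def repeated_offsets(data: bytes, window: int = 16):
        """Offsets whose window-sized slice (probably) appeared earlier."""
        seen, repeats = set(), []
        for off, h in rolling_hashes(data, window):
            if h in seen:
                repeats.append(off)
            seen.add(h)
        return repeats

    print(repeated_offsets(b"I am sorry. I am sorry. I am sorry. Something new."))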


I was promised a description of how FSE works but all I got was an implementation, by a person muttering about how hard ANS was to understand.

So... any links on how it actually works?


"The Data Compression Book" by Mark Nelson has a good explanations of some basic techniques - and examples of C implementations of them.


Available on Open Library (though I don't believe downloading it is an option).

https://openlibrary.org/books/OL803861M/The_data_compression....


Thanks for the easy explanations. Very clear. Compressing data using different strategies is always useful to understand: after all, many of us software developers deal with saving data under size constraints and loading chunks of data quickly. Interesting...


Wish he took a look at middle-out too. Kind of a missed opportunity tbh.


Compression is always an interesting topic.

Is there room for creativity in this field, or has the last juice been squeezed by proving that it can't be done better?


Outside of lossy compression, we've been fairly close to the theoretical limits for quite some time. What changes is partly what we want to compress, but also how we leverage resources to do the compression.

A good deal of the time we care about decompression speeds, or the tradeoff between speed and bandwidth. The algorithms that reigned in the 90's and still kind of reign today are unwieldy. So new techniques that get within a few percent of optimal much faster or using less memory are an easy sell.

And once in a while we get something like Burrows-Wheeler, which isn't a compressor but a transform (hence BWT) that can unearth some broader patterns in a file and make them more conducive to being compressed, without a large memory structure that grows faster than the data under inspection.
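
For anyone who hasn't seen it, the forward transform is tiny; this naive version sorts all rotations (real implementations use suffix arrays instead), and the inverse shows it's lossless:

    def bwt(s: str) -> str:
        """Burrows-Wheeler transform: sort all rotations, keep the last column."""
        s += "\0"                                  # unique end-of-string sentinel
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(r[-1] for r in rotations)

    def ibwt(last: str) -> str:
        """Invert by repeatedly prepending the last column and re-sorting (quadratic, but simple)."""
        rows = [""] * len(last)
        for _ in range(len(last)):
            rows = sorted(last[i] + rows[i] for i in range(len(last)))
        return next(r for r in rows if r.endswith("\0"))[:-1]

    print(repr(bwt("banana")))       # 'annb\x00aa' -- like characters end up clustered
    assert ibwt(bwt("banana")) == "banana"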


I'd say there are some possibilities in compression formats that are transparent to certain operations, that is, compressed data you can process as-is (without decompressing).

For example, look at Algorithmics on SLP-compressed strings: A survey (https://www.degruyter.com/document/doi/10.1515/gcc-2012-0016...).


The article is excellent!

Btw, does anyone know of an article or book about GPU texture compression? Would love a good, detailed reference.


Very good article for what it covers. I'm just a little disappointed about zstd. I now more or less understand arithmetic coding, which is quite nice, but some other aspects of zstd are only a name and a (probably good) URL for further research.


One observation is that performance differs tremendously based on the nature of the data, and devs need to run data-specific benchmarks themselves.

For example, on my data I somehow get a 2x compression ratio with lz4 and 7x with zstd.
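
That's exactly why it's worth a quick benchmark on your own data; a minimal sketch with the third-party lz4 and zstandard packages (the input path is a placeholder):

    # pip install lz4 zstandard
    import lz4.frame
    import zstandard

    def ratios(data: bytes):
        lz4_out = lz4.frame.compress(data)
        zstd_out = zstandard.ZstdCompressor(level=3).compress(data)
        return len(data) / len(lz4_out), len(data) / len(zstd_out)

    with open("your_data.bin", "rb") as f:   # a representative sample of your own data
        sample = f.read()
    print("lz4 %.1fx, zstd %.1fx" % ratios(sample))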


I don't know if it has been done already, but it should be possible to make a compression format that also aids in searching the archive, in a Bloom-filter-ish kind of way.

Then you get four goals: compression ratio, compression speed, decompression speed, and search (which could be split further).


As pointed out: it's been done, look for algorithmics over grammar-based compression. Querying and search are among the operations that are doable on compressed data.

See Algorithmics on SLP-compressed strings: A survey (Markus Lohrey) (https://www.degruyter.com/document/doi/10.1515/gcc-2012-0016...)

Implementation-wise, you probably lose on the first goal and gain on the second and third (simpler, faster implementations), if you make the fourth one easy to implement.


Note that this is about general purpose compression. For example, most machine learning (including LLMs) results in a compression algorithm, as the objective function is minimizing entropy (see nn.CrossEntropyLoss).
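
That connection is easy to make concrete: the training loss in nats, divided by ln 2, is literally a code length in bits per symbol. A sketch with made-up PyTorch tensors:

    import math
    import torch
    import torch.nn.functional as F

    # Hypothetical model output: logits over a 256-symbol alphabet at 1000 positions.
    logits = torch.randn(1000, 256)
    targets = torch.randint(0, 256, (1000,))

    nats = F.cross_entropy(logits, targets)   # what training minimizes
    bits_per_symbol = nats / math.log(2)      # the code length an ideal entropy coder would hit
    print(float(bits_per_symbol))             # a bit over 8 bits here, since the "model" is random noise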


I didn't find the middle-out compression



