
I feel like compression will do a better job than deciding on an encoding scheme ahead of time, no? Once gzipped, I wouldn't expect a difference.



I thought I'd read something about it, but when I googled, what I did find was this old HN comment: https://news.ycombinator.com/item?id=8514519

> UTF-8 + gzip is 32% smaller than UTF-32 + gzip using the HN frontpage as corpus.


In theory, yes. In practice, no. At my previous job, I wrote a short Python script that took /usr/dict/words, gzipped it, and also converted it to UTF-16LE (inserting a null byte before every character) and gzipped that. The information content is the same, but the compressed UTF-16LE ends up a bit bigger. IIRC, the difference was more than 1% but less than 20%.

My use case was to show the flaw in a colleague's logic in asserting that gzipped JSON should be the same size as gzipped MessagePack for the same data, because the information content was the same. It was a quick 5-minute script, and it spared me from having to come up with a suitable JSON corpus to convert to MessagePack.

Among other things, the zlib compression window only holds half as many characters if your characters are twice as big.
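A rough reconstruction of that kind of script (assuming /usr/share/dict/words exists and is plain ASCII/UTF-8; the path and compression level 9 are just what I'd reach for):

  import gzip

  # Read the dictionary as raw bytes (assumes valid ASCII/UTF-8 input).
  with open("/usr/share/dict/words", "rb") as f:
      utf8_bytes = f.read()

  # Re-encode the same text as UTF-16LE: every ASCII character gains a null byte.
  utf16_bytes = utf8_bytes.decode("utf-8").encode("utf-16-le")

  # Same information content, different byte-level redundancy.
  print("utf-8    gzipped:", len(gzip.compress(utf8_bytes, 9)))
  print("utf-16le gzipped:", len(gzip.compress(utf16_bytes, 9)))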


For anyone still reading, out of curiosity, I reran the experiment on my Debian box:

  $ </usr/share/dict/words  gzip --best | wc -c
  261255
  $ </usr/share/dict/words iconv -f utf-8 -t utf-16le | gzip --best | wc -c
  303404
A bit over a 16% size increase from converting the "wamerican" dictionary to UTF-16LE and then compressing.


For the large dictionary it’s a tiny bit worse, but still rounds to 16%:

  $ < /usr/share/dict/american-english-insane gzip --best | wc --bytes
  1778330
  $ < /usr/share/dict/american-english-insane iconv -f utf-8 -t utf-16le | gzip --best | wc --bytes
  2061457


You still have to decompress it on the other end to actually parse and use it, at which point you have four times the memory usage, unless you turn it into some smaller in-memory encoding...such as UTF-8.


Most encoders let you define filters that restructure the data using out-of-band knowledge to achieve better rates. For example, if you have an array of floating-point numbers, splitting the exponents and mantissas into separate consecutive runs can yield significant savings, because the generic compressor doesn't know anything about the structure of the data. Compressors are great, but out-of-band structural reorganization plus a compressor will generally outperform the compressor alone.
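A rough illustration of the idea (not any particular encoder's filter API; the synthetic data and zlib are just for the demo): a byte-plane "shuffle" of float32 data groups the highly repetitive sign/exponent bytes into long consecutive runs before the generic compressor sees them.

  import math
  import struct
  import zlib

  # Synthetic float32 data: smoothly varying values with similar exponents.
  values = [math.sin(i / 100.0) * 1000.0 for i in range(100_000)]
  raw = struct.pack(f"<{len(values)}f", *values)

  # "Shuffle" filter: regroup byte 0 of every float, then byte 1, etc.,
  # so the repetitive sign/exponent bytes end up in one long run.
  planes = b"".join(raw[i::4] for i in range(4))

  print("plain   :", len(zlib.compress(raw, 9)))
  print("shuffled:", len(zlib.compress(planes, 9)))

This is essentially what the shuffle filters in HDF5 and Blosc do before handing the bytes to a general-purpose compressor.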



