
I feel like compression will do a better job than deciding on an encoding scheme ahead of time, no? Once gzipped, I wouldn't expect a difference.



I thought I'd read something about it, but when I googled, what I did find was this old HN comment: https://news.ycombinator.com/item?id=8514519

> UTF-8 + gzip is 32% smaller than UTF-32 + gzip using the HN frontpage as corpus.


In theory, yes. In practice, no. At my previous job, I wrote a short Python script that took /usr/dict/words, gzipped it, and also converted it to UTF-16LE (inserting a null byte before every character) and gzipped that. The information content is the same, but the compressed UTF-16LE ends up a bit bigger. IIRC, the difference was more than 1% but less than 20%.

My use case was to show the flaw in a colleague's logic in asserting that gzipped JSON should be the same size as gzipped MessagePack for the same data, because the information content was the same. It was a quick 5-minute script, and it spared me from having to come up with a suitable JSON corpus to convert to MessagePack.

Among other things, the zlib compression window only holds half as many characters if your characters are twice as big.
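A rough reconstruction of that kind of script (assuming /usr/share/dict/words exists and is plain ASCII/UTF-8; the path and compression level 9 are just what I'd reach for):

  import gzip

  # Read the dictionary as raw bytes (assumes valid ASCII/UTF-8 input).
  with open("/usr/share/dict/words", "rb") as f:
      utf8_bytes = f.read()

  # Re-encode the same text as UTF-16LE: every ASCII character gains a null byte.
  utf16_bytes = utf8_bytes.decode("utf-8").encode("utf-16-le")

  # Same information content, different byte-level redundancy.
  print("utf-8    gzipped:", len(gzip.compress(utf8_bytes, 9)))
  print("utf-16le gzipped:", len(gzip.compress(utf16_bytes, 9)))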


For anyone still reading, out of curiosity, I reran the experiment on my Debian box:

  $ </usr/share/dict/words  gzip --best | wc -c
  261255
  $ </usr/share/dict/words iconv -f utf-8 -t utf-16le | gzip --best | wc -c
  303404
A bit over a 16% size increase from converting the "wamerican" dictionary to UTF-16LE and then compressing.


For the large dictionary it’s a tiny bit worse, but still rounds to 16%:

  $ < /usr/share/dict/american-english-insane gzip --best | wc --bytes
  1778330
  $ < /usr/share/dict/american-english-insane iconv -f utf-8 -t utf-16le | gzip --best | wc --bytes
  2061457


You still have to decompress it on the other end to actually parse and use it, at which point you have four times the memory usage, unless you turn it into some smaller in-memory encoding...such as UTF-8.


Most encoders let you define filters that restructure the data using out-of-band knowledge to achieve better rates. For example, if you have an array of floating-point numbers, splitting the exponents and mantissas into separate consecutive runs can yield significant savings, because the generic compressor doesn't know anything about the structure of the data. Compressors are great, but out-of-band structural reorganization plus a compressor will generally outperform the compressor alone.
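A rough illustration of the idea (not any particular encoder's filter API; the synthetic data and zlib are just for the demo): a byte-plane "shuffle" of float32 data groups the highly repetitive sign/exponent bytes into long consecutive runs before the generic compressor sees them.

  import math
  import struct
  import zlib

  # Synthetic float32 data: smoothly varying values with similar exponents.
  values = [math.sin(i / 100.0) * 1000.0 for i in range(100_000)]
  raw = struct.pack(f"<{len(values)}f", *values)

  # "Shuffle" filter: regroup byte 0 of every float, then byte 1, etc.,
  # so the repetitive sign/exponent bytes end up in one long run.
  planes = b"".join(raw[i::4] for i in range(4))

  print("plain   :", len(zlib.compress(raw, 9)))
  print("shuffled:", len(zlib.compress(planes, 9)))

This is essentially what the shuffle filters in HDF5 and Blosc do before handing the bytes to a general-purpose compressor.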



