
Also note that the dictionary is strongly biased towards English. Sure, there is some Russian, Chinese, Arabic (and probably some other scripts in there which I don't recognize), but there seem to be more English words in there than all the others combined. If you're compressing small documents in any language other than English, it might not be worth it to use Brotli.

Edit: They wrote this about it in http://www.gstatic.com/b/brotlidocs/brotli-2015-09-22.pdf :

> Unlike other algorithms compared here, brotli includes a static dictionary. It contains 13’504 words or syllables of English, Spanish, Chinese, Hindi, Russian and Arabic, as well as common phrases used in machine readable languages, particularly HTML and JavaScript. The total size of the static dictionary is 122’784 bytes. The static dictionary is extended by a mechanism of transforms that slightly change the words in the dictionary. A total of 1’633’984 sequences, although not all of them unique, can be constructed by using the 121 transforms. To reduce the amount of bias the static dictionary gives to the results, we used a multilingual web corpus of 93 different languages where only 122 of the 1285 documents (9.5 %) are in languages supported by our static dictionary.
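The figures in that quote are at least consistent with each other. A quick sanity check, using only the numbers quoted above (the variable names are mine, not anything from the paper or the brotli source):

```python
# Figures taken directly from the quoted paragraph.
WORDS = 13_504          # entries in the static dictionary
DICT_BYTES = 122_784    # total size of the static dictionary
TRANSFORMS = 121        # transforms applied to each entry
SUPPORTED_DOCS = 122    # corpus documents in dictionary-supported languages
TOTAL_DOCS = 1_285      # all documents in the test corpus

# Every entry can be combined with every transform.
sequences = WORDS * TRANSFORMS

# Average entry length in bytes (entries are of varying length).
avg_entry_len = DICT_BYTES / WORDS

# Share of the corpus in languages the dictionary covers.
supported_share = SUPPORTED_DOCS / TOTAL_DOCS

print(sequences)                        # 1633984
print(round(avg_entry_len, 2))          # 9.09
print(round(100 * supported_share, 1))  # 9.5
```

So the 1'633'984 is just words times transforms, and the average entry is about 9 bytes long.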




Here's the list of words, by the way. I couldn't find it anywhere in non-hexadecimal form:

https://gist.github.com/xnyhps/677f7c1b444f346bef99

(I cleaned it up a bit to remove newlines and tabs, and dropped a couple of entries that consist entirely of unprintable characters.)
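For anyone who wants to do a similar cleanup on their own dump, here is a rough sketch of that kind of filtering. It is not the exact script used for the gist. It assumes you already have the entries as a list of byte strings; `clean_words` and its input are hypothetical names for illustration.

```python
def clean_words(raw_words):
    """Decode entries, strip newlines/tabs, drop entries with nothing printable.

    `raw_words` is assumed to be a list of `bytes`, one per dictionary entry.
    """
    cleaned = []
    for raw in raw_words:
        # Invalid UTF-8 becomes U+FFFD instead of raising an exception.
        text = raw.decode("utf-8", errors="replace")
        # Remove the characters that would break a one-entry-per-line listing.
        text = text.replace("\n", "").replace("\r", "").replace("\t", "")
        # Skip entries with no printable character left; U+FFFD only marks
        # undecodable bytes, so it does not count as printable here.
        if not any(ch.isprintable() and ch != "\ufffd" for ch in text):
            continue
        cleaned.append(text)
    return cleaned


sample = [b"hello\n", b"\x00\x01", b"\ttab", b"\xff\xfe"]
print(clean_words(sample))  # ['hello', 'tab']
```

Entries that mix printable and unprintable characters are kept as they are; only the ones with nothing printable left get dropped.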



