
The dictionary quality was highly sensitive to some of the tricks that the original authors implemented in their C++ code. Many were documented in the paper, but a few were not:

1. Always promoting single-byte symbols by boosting their scores by a factor of 8 in candidate search

2. Boosting the calculated gains of single-byte candidates by a factor of 8 to prevent them from falling off in later generations

3. Having an adaptive threshold for which symbols are included as the rounds go on

I didn't document these in the blog post to keep the content accessible, but they're definitely something you run into once you start digging into compression ratios! Perhaps they will end up in a part 2 at some point.
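For anyone curious, here is a rough sketch of how the single-byte boost and the adaptive threshold listed above could fit together. It is not the code from the C++ implementation or the Rust crate: `Candidate`, `gain`, `min_count`, and `select` are hypothetical names, the gain formula (length times count) is a simplification, and the linear threshold ramp is just an assumption to make the idea concrete.

```rust
/// Hypothetical candidate symbol: its bytes and how often it was seen in the sample.
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    bytes: Vec<u8>,
    count: usize,
}

/// The "factor of 8" boost for single-byte candidates.
const SINGLE_BYTE_BOOST: usize = 8;

/// Simplified gain: bytes covered by the symbol (length * count).
/// Single-byte candidates get boosted so they are not crowded out
/// by longer symbols in later generations.
fn gain(c: &Candidate) -> usize {
    let base = c.bytes.len() * c.count;
    if c.bytes.len() == 1 {
        base * SINGLE_BYTE_BOOST
    } else {
        base
    }
}

/// Assumed adaptive threshold: the minimum count a candidate needs
/// grows linearly with the round number, up to `max_threshold`.
/// `round` is 1-based and `rounds` must be at least 1.
fn min_count(round: usize, rounds: usize, max_threshold: usize) -> usize {
    (max_threshold * round / rounds).max(1)
}

/// Keep the `table_size` best candidates for this round:
/// drop those below the threshold, then rank by (boosted) gain.
fn select(
    cands: &[Candidate],
    round: usize,
    rounds: usize,
    max_threshold: usize,
    table_size: usize,
) -> Vec<Candidate> {
    let threshold = min_count(round, rounds, max_threshold);
    let mut kept: Vec<&Candidate> = cands
        .iter()
        .filter(|c| c.count >= threshold)
        .collect();
    // Highest gain first; the sort is stable, so ties keep input order.
    kept.sort_by(|a, b| gain(b).cmp(&gain(a)));
    kept.into_iter().take(table_size).cloned().collect()
}
```

With these definitions, a single byte `e` seen 10 times gets a gain of 80 and outranks `the` seen 20 times (gain 60), whereas without the boost it would lose with a gain of 10.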

[1]: https://github.com/spiraldb/fsst/blob/develop/src/builder.rs...


