The dictionary quality was highly sensitive to some of the tricks the original authors implemented in their C++ code. Many were documented in the paper, but a few were not:

1. Always promoting single-bytes by boosting their scores by a factor of 8 in candidate search

2. Boosting the calculated gains of single-byte candidates by a factor of 8 to prevent them from falling off in later generations

3. Having an adaptive threshold for which symbols are included as the rounds go on
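To make the three tricks concrete, here is a rough Python sketch of how they might fit into one round of symbol selection. It is not the C++ or Rust code. The names (`effective_count`, `min_count`, `select_symbols`), the `count * length` gain formula, the linear threshold schedule and the choice to apply the threshold to the raw count are all assumptions for illustration. Only the factor of 8 comes from the list above, and the sketch collapses tricks 1 and 2 into a single boost.

```python
import heapq

SINGLE_BYTE_BOOST = 8  # factor from the list above
MAX_SYMBOLS = 255      # assumption: table size, one code kept for escapes


def effective_count(symbol: bytes, count: int) -> int:
    """Boost the observed count of single-byte symbols (tricks 1 and 2)."""
    return count * SINGLE_BYTE_BOOST if len(symbol) == 1 else count


def min_count(round_idx: int, total_rounds: int, base: int = 5) -> int:
    """Hypothetical adaptive threshold (trick 3): grows linearly per round."""
    return max(1, base * (round_idx + 1) // total_rounds)


def select_symbols(counts, round_idx, total_rounds, max_symbols=MAX_SYMBOLS):
    """Pick the highest-gain symbols for the next generation of the table."""
    threshold = min_count(round_idx, total_rounds)
    scored = []
    for symbol, count in counts.items():
        # Assumption: the threshold is checked against the raw count.
        if count < threshold:
            continue
        # Assumed gain: (boosted) count times symbol length.
        gain = effective_count(symbol, count) * len(symbol)
        scored.append((gain, symbol))
    # Keep the top candidates, highest gain first.
    return [symbol for _, symbol in heapq.nlargest(max_symbols, scored)]
```

With counts like `{b"a": 10, b"the": 20}`, the unboosted gains would be 10 and 60, so `b"a"` could easily be crowded out of a full table. With the boost its gain becomes 80, which is the "don't let single bytes fall off" effect described above.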

I didn't document these in the blog post to keep the content accessible, but they're definitely something you run into once you start digging into compression ratios! Perhaps they will end up in a part 2 at some point.

[1]: https://github.com/spiraldb/fsst/blob/develop/src/builder.rs...