The dictionary quality was definitely highly sensitive to some of the tricks that the original authors implemented in their C++ code. Many were documented in the paper, but a few were not:
1. Always promoting single-bytes by boosting their scores by a factor of 8 in candidate search
2. Boosting the calculated gains of single-byte candidates by a factor of 8 to prevent them from falling off in later generations
3. Having an adaptive threshold for which symbols are included as the rounds go on
I didn't document these in the blog post to keep the content accessible, but they're definitely something you run into once you start digging into compression ratios! Perhaps they will end up in a part 2 at some point.
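
For anyone who wants a feel for how the three tricks listed above fit together, here is a rough Python sketch. It is not the code from the C++ implementation or the Rust port. The function names, the linear threshold schedule, and the `base` parameter are all made up for illustration. It also folds tricks 1 and 2 into a single factor-of-8 boost on the count, which may not match how the real code applies them.

```python
SINGLE_BYTE_BOOST = 8  # the factor of 8 mentioned in the list above


def effective_count(symbol: bytes, count: int) -> int:
    """Boost the observed count of single-byte symbols (tricks 1 and 2)."""
    return count * SINGLE_BYTE_BOOST if len(symbol) == 1 else count


def min_count(round_idx: int, total_rounds: int, base: int = 5) -> int:
    """Hypothetical adaptive threshold (trick 3): rises linearly per round.

    The real schedule is an assumption here; only the idea that the
    cutoff changes as the rounds go on comes from the comment above.
    """
    return max(1, (base * (round_idx + 1)) // total_rounds)


def select_symbols(counts, round_idx, total_rounds, table_size=255):
    """Pick the highest-gain symbols for the next generation's table."""
    threshold = min_count(round_idx, total_rounds)
    scored = []
    for symbol, count in counts.items():
        boosted = effective_count(symbol, count)
        if boosted < threshold:
            continue  # too rare for this round
        # Gain estimate: (boosted) occurrences times bytes covered.
        scored.append((boosted * len(symbol), symbol))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [symbol for _, symbol in scored[:table_size]]
```

With counts like `{b"a": 10, b"ab": 30, b"abc": 10}` and a table of size 2, the unboosted gain for `b"a"` would be 10 and it would lose its slot to both longer symbols. With the boost it scores 80 and stays in, which is the "prevent them from falling off" effect.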
[1]: https://github.com/spiraldb/fsst/blob/develop/src/builder.rs...