I don’t know what I’m talking about (pure fantasy), but what if you train a model on compressed data and then perform inference on compressed data as well? Could this work? With the output also being compressed and then decompressed by the client?
The tokenizer is already a form of (somewhat lossy) compression, mapping a string of plaintext to a stream of token identifiers. You can reason about tokenizers (and their "embedding spaces") as a sort of massive dictionary table/function, like the one you might use in a zip/gzip stream.
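To make the dictionary analogy concrete, here's a minimal sketch (assuming the tiktoken library and its cl100k_base vocabulary; any BPE tokenizer behaves much the same) of plaintext round-tripping through that token-id "dictionary table":

    # Sketch of the "tokenizer as dictionary table" analogy.
    # Assumes the tiktoken library is installed; any BPE tokenizer works similarly.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # ~100k-entry vocabulary

    text = "compression is really just a big lookup table"
    ids = enc.encode(text)          # plaintext -> stream of token identifiers
    print(ids)                      # a short list of integers
    print(enc.decode(ids) == text)  # decodes back to the original string here

    # Each id indexes a byte sequence in the vocabulary, much like a
    # back-reference into the dictionary a zip/gzip stream builds as it goes.
    print(enc.decode_single_token_bytes(ids[0]))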
Starting with already-compressed data doesn't necessarily mean fewer tokens. A compressed stream is already close to maximum entropy per byte, so you can probably assume similar (or worse) entropy when expanding the "dictionary words" of a compressed stream versus the tokens of a plaintext stream.
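A quick way to sanity-check that intuition (a sketch, assuming tiktoken plus the standard-library zlib and base64 modules; base64 is only there to keep the compressed bytes text-safe and adds its own ~33% overhead):

    # Compare token counts for plaintext vs. the same text compressed first.
    # Assumes tiktoken is installed; zlib and base64 are standard library.
    import base64
    import zlib

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    plaintext = (
        "Large language models operate on token identifiers rather than raw "
        "characters, so the tokenizer vocabulary already squeezes out much of "
        "the redundancy a general-purpose compressor would otherwise exploit."
    )
    compressed = base64.b64encode(zlib.compress(plaintext.encode())).decode()

    print("plaintext tokens: ", len(enc.encode(plaintext)))
    print("compressed tokens:", len(enc.encode(compressed)))
    # The deflate output is near-random, so BPE finds few reusable patterns and
    # burns roughly a token per 2-3 characters of the base64 stream -- on text
    # like this it tends to come out to more tokens than the plaintext, not fewer.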
Since all input is run through a tokenizer anyway, I would expect the tokenizer space wouldn't change much between one trained on uncompressed data and one trained on compressed data.