
It makes sense - publications write about the things they added, changed, or evaluated, not about all the (many!) things they do exactly the same way as everyone else; so tokenization would be mentioned only if the publication is explicitly about a different tokenization.

And while it's a core feature, it's a fairly robust one: you can get some targeted improvements, but the default option(s) are good enough and you won't improve much over them.



Thanks for your reply.

That's my first point. In 10 years we have word2vec, GloVe, GPT-2 and... tiktoken. lol. It's as if directional, numeric magnitudes in an embedding space of arbitrary dimensionality have magically captured, or will magically capture, the nuances and expressivity of language. Optimization techniques and new strategies for domain adaptation are what matters, particularly for mobile devices, on-device ASR and short-form videos.

I don't think robust is a good characterization of clusters of semantic attributes in space or a distributional semantics of language. I'd say crude and without understanding are more accurate descriptions. Capturing semantic properties sometimes is not the same thing as having a semantics.

By targeted improvements you must be referring to domain adaptation, and by the default option you must be referring to attention over BPE tokens? You can move directional quantities around in directional quantity space all day. If it results in expected behavior for your application that you weren't getting before, that's great. If that's all you want to get out of these models, then indeed there's nothing to do here. I'm not after improvements so much as I'm after something that works.


What I mean by robust and targeted improvements is not about the concept as such, but about the choices you make specifically in how you build the tokenization layer. If you're building a particular system, better targeted choices for tokenization, character filtering/preprocessing, or vocabulary can give you some improvements in efficiency, but they are rarely a dealbreaker, and tokenization is never a key enabler. If some tokenization or filtering destroys data your specific task happens to need, that's a problem, but you don't need advanced future research to fix it; going back to simpler tokenization and removing features is sufficient. At the extreme you could always use a naive character-level tokenizer - it's trivial, just less computationally efficient.
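To be concrete about how trivial that extreme fallback is, here's a minimal character-level tokenizer sketch (Python; class and variable names are just illustrative, and it assumes every character at encode time was seen in the training corpus):

    # naive character-level tokenizer: lossless and trivial,
    # but it produces much longer sequences than BPE
    class CharTokenizer:
        def __init__(self, corpus):
            # the vocabulary is just the set of characters in the training text
            self.itos = sorted(set(corpus))
            self.stoi = {ch: i for i, ch in enumerate(self.itos)}

        def encode(self, text):
            # assumes no unseen characters; a real system would add a fallback id
            return [self.stoi[ch] for ch in text]

        def decode(self, ids):
            return "".join(self.itos[i] for i in ids)

    tok = CharTokenizer("hello world")
    assert tok.decode(tok.encode("hello")) == "hello"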

If you don't care about tokenization and just use any of the reasonable default options, and if you're doing a proper pre-training on non-tiny quantities of data, then the next few layers of whatever neural architecture you have on top of those tokens will generally be able to learn to compensate for any drawbacks in your tokenization, perhaps at some computational overhead - e.g. perhaps you could have had one fewer layer, or smaller layers, with the best possible tokenization. Edging out that computation cost is pretty much the only thing you can hope to get out of having a better tokenizer.
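A rough way to see where that overhead comes from: BPE packs the same text into far fewer positions than a character-level scheme, and attention cost grows with sequence length. A quick comparison using tiktoken (the exact numbers depend on the text; this just shows the sequence-length gap):

    import tiktoken

    text = "Tokenization is a means to an end, not the end itself."
    enc = tiktoken.get_encoding("gpt2")  # GPT-2's BPE vocabulary

    bpe_len = len(enc.encode(text))   # number of BPE tokens
    char_len = len(text)              # number of character-level tokens

    # character-level sequences are several times longer, and attention
    # cost scales roughly quadratically with sequence length
    print(bpe_len, char_len, round(char_len / bpe_len, 1))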


Thanks for this perspective on the tradeoff between accuracy and efficiency and the insight that an adequately pre-trained model should be in a position to recover lost information from bad tokens.

Tokenization, the gateway to word embeddings, is a means to an end. I'm not suggesting that better tokens are needed or that BPE tokens should be replaced with something else. I'm suggesting that aiming for a distributional semantics is setting the bar pretty low and that there are better places to end up than These Things Are Over Here And Those Things Are Over There Let's Combine Them And See What Happens. I'm expressing disbelief that these representations have been taken at face value and that there has been practically no discussion of applying alternative formalisms which may be more expressive.

Modeling language in a latent space only makes sense for certain aspects of language and certain kinds of analyses. Crucially, you have to have meaningful primitives to begin with. This line of thinking that an understanding of language and an understanding of the world are somehow going to emerge from mapping character spans onto a latent space and combining them with dot product attention is pretty half-baked. These systems remain in Firth Mode™.
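To be concrete about the "combining" step I'm waving at, it's scaled dot-product attention over the embedded spans; stripped of the learned projections, it's just this (NumPy sketch, toy shapes):

    import numpy as np

    def dot_product_attention(Q, K, V):
        # similarity of each query with every key, scaled by sqrt(d)
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        # softmax over keys -> mixing weights
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        # each output is a weighted average of the value vectors
        return weights @ V

    # toy example: 4 token embeddings of dimension 8, attending to themselves
    X = np.random.randn(4, 8)
    out = dot_product_attention(X, X, X)  # shape (4, 8)

Weighted averages of directional quantities, all the way down.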



