
Because the way LLMs work is more or less "for every token, read the entire weight matrix from memory and do math on it". The math is fast; it's the memory reads that dominate, so if you use only half the bits to store each weight, you only have to move half as much data per token. Of course, sometimes those least-significant bits were relied upon by the original training.
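For intuition, here is a toy sketch of the trade-off, assuming a simple symmetric int8 scheme (real quantizers use fancier group-wise or outlier-aware methods); it just shows that fewer bits per weight means fewer bytes to stream, at the cost of some rounding error:

  # Toy sketch, not a production kernel: symmetric per-row int8 quantization
  # of a weight matrix, assuming weights are roughly zero-centered.
  import numpy as np

  def quantize_int8(W: np.ndarray):
      # One scale per output row: map the row's max |weight| to 127.
      scales = np.abs(W).max(axis=1, keepdims=True) / 127.0
      q = np.round(W / scales).astype(np.int8)
      return q, scales

  def dequant_matvec(q: np.ndarray, scales: np.ndarray, x: np.ndarray):
      # Reconstruct approximate weights on the fly and multiply.
      return (q.astype(np.float32) * scales) @ x

  rng = np.random.default_rng(0)
  W = rng.normal(size=(4096, 4096)).astype(np.float32)
  x = rng.normal(size=(4096,)).astype(np.float32)

  q, s = quantize_int8(W)
  print("fp32 bytes:", W.nbytes, "int8 bytes:", q.nbytes)  # 4x fewer bytes to read

  ref = W @ x
  approx = dequant_matvec(q, s, x)
  print("worst-case relative error:", np.abs(approx - ref).max() / np.abs(ref).max())
  # small, but not zero -- this is where the "relied-upon bits" can bite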



Has anyone worked on making tokens 'clusters of words with specific semantic meaning'?

e.g. instead of the tokens ['I', 'am', 'beautiful'], have the tokens ['I am', 'beautiful'], on the premise that 'I am' is a common byte sequence that works as a single semantic token identifying a 'property of self'?

Or, taking that further, much larger tokens based on statistical analysis of common phrases of ~5 words or so?
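Subword tokenizers like BPE already do something in this spirit at the byte/character level: the most frequent adjacent pair keeps getting merged into a single token. A toy sketch of the same idea lifted to the word level (purely illustrative, not how any real tokenizer is trained):

  # Word-level analogue of a BPE merge step: repeatedly fuse the most
  # frequent adjacent pair of tokens into one multi-word token.
  from collections import Counter

  def apply_merge(sent, a, b, merged):
      out, i = [], 0
      while i < len(sent):
          if i + 1 < len(sent) and sent[i] == a and sent[i + 1] == b:
              out.append(merged)
              i += 2
          else:
              out.append(sent[i])
              i += 1
      return out

  def merge_most_frequent_pairs(corpus, n_merges=1):
      for _ in range(n_merges):
          pairs = Counter()
          for sent in corpus:
              pairs.update(zip(sent, sent[1:]))
          if not pairs:
              break
          (a, b), _count = pairs.most_common(1)[0]
          corpus = [apply_merge(sent, a, b, a + " " + b) for sent in corpus]
      return corpus

  corpus = [
      ["i", "am", "beautiful"],
      ["i", "am", "tired"],
      ["you", "are", "beautiful"],
      ["i", "am", "here"],
  ]
  print(merge_most_frequent_pairs(corpus))
  # [['i am', 'beautiful'], ['i am', 'tired'], ['you', 'are', 'beautiful'], ['i am', 'here']]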


I think what you're describing is a kind of low-rank decomposition applied to the vocabulary embeddings. A quick search on Google Scholar suggests that this might be useful in the context of multilingual tokenization.
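If it helps make that concrete, here is a minimal sketch of factoring an embedding table with a truncated SVD; the shapes and rank are made up, and it says nothing about whether the approximation is good enough for a given model:

  # Minimal sketch (assumed shapes, not from any specific paper): factor a
  # V x d embedding table E into A @ B with a small inner rank r, so the
  # parameter count drops from V*d to r*(V + d).
  import numpy as np

  V, d, r = 10_000, 768, 128            # hypothetical vocab size, embedding dim, rank
  rng = np.random.default_rng(0)
  E = rng.normal(size=(V, d)).astype(np.float32)   # stand-in for a trained table

  U, S, Vt = np.linalg.svd(E, full_matrices=False)
  A = U[:, :r] * S[:r]                  # (V, r)
  B = Vt[:r]                            # (r, d)
  E_approx = A @ B                      # rank-r approximation of E

  print("full params:    ", V * d)          # 7,680,000
  print("low-rank params:", r * (V + d))    # 1,378,304
  # How small r can be before quality drops is an empirical question; random
  # data like this has no low-rank structure, so the sketch only shows the
  # bookkeeping, not the quality trade-off.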



Much larger tokens require a much larger token vocabulary.
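To put a rough number on that (hypothetical sizes, counting only the embedding and unembedding matrices, ignoring the other costs like softmax compute and per-token data sparsity):

  # Back-of-the-envelope sketch: the input embedding table and the output
  # projection each cost V * d parameters, so vocabulary size feeds
  # straight into model size.
  d = 4096                      # assumed hidden dimension
  for V in (50_000, 500_000, 5_000_000):
      params = 2 * V * d        # input embeddings + output unembedding
      print(f"vocab {V:>9,}: {params / 1e9:5.1f}B parameters just for the (un)embeddings")
  # vocab    50,000:   0.4B
  # vocab   500,000:   4.1B
  # vocab 5,000,000:  41.0B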



