
Because the way LLMs work is more or less "for every token, read the entire weight matrix from memory and do math on it". The math is fast; it's the memory reads that dominate, so if you use only half the bits to store each weight, you only have to move half as much data per token. Of course, sometimes those least-significant bits were relied upon by the original training.
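For intuition, here is a toy sketch of the trade-off, assuming a simple symmetric int8 scheme (real quantizers use fancier group-wise or outlier-aware methods); it just shows that fewer bits per weight means fewer bytes to stream, at the cost of some rounding error:

  # Toy sketch, not a production kernel: symmetric per-row int8 quantization
  # of a weight matrix, assuming weights are roughly zero-centered.
  import numpy as np

  def quantize_int8(W: np.ndarray):
      # One scale per output row: map the row's max |weight| to 127.
      scales = np.abs(W).max(axis=1, keepdims=True) / 127.0
      q = np.round(W / scales).astype(np.int8)
      return q, scales

  def dequant_matvec(q: np.ndarray, scales: np.ndarray, x: np.ndarray):
      # Reconstruct approximate weights on the fly and multiply.
      return (q.astype(np.float32) * scales) @ x

  rng = np.random.default_rng(0)
  W = rng.normal(size=(4096, 4096)).astype(np.float32)
  x = rng.normal(size=(4096,)).astype(np.float32)

  q, s = quantize_int8(W)
  print("fp32 bytes:", W.nbytes, "int8 bytes:", q.nbytes)  # 4x fewer bytes to read

  ref = W @ x
  approx = dequant_matvec(q, s, x)
  print("worst-case relative error:", np.abs(approx - ref).max() / np.abs(ref).max())
  # small, but not zero -- this is where the "relied-upon bits" can bite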



Has anyone worked on making tokens 'clusters of words with specific semantic meaning'?

e.g. instead of the tokens ['I', 'am', 'beautiful'], have the tokens ['I am', 'beautiful'], on the premise that 'I am' is a common byte sequence that works as a single semantic token identifying a 'property of self'?

Or, taking that further, much larger tokens based on statistical analysis of common phrases of ~5 words or so?
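Subword tokenizers like BPE already do something in this spirit at the byte/character level: the most frequent adjacent pair keeps getting merged into a single token. A toy sketch of the same idea lifted to the word level (purely illustrative, not how any real tokenizer is trained):

  # Word-level analogue of a BPE merge step: repeatedly fuse the most
  # frequent adjacent pair of tokens into one multi-word token.
  from collections import Counter

  def apply_merge(sent, a, b, merged):
      out, i = [], 0
      while i < len(sent):
          if i + 1 < len(sent) and sent[i] == a and sent[i + 1] == b:
              out.append(merged)
              i += 2
          else:
              out.append(sent[i])
              i += 1
      return out

  def merge_most_frequent_pairs(corpus, n_merges=1):
      for _ in range(n_merges):
          pairs = Counter()
          for sent in corpus:
              pairs.update(zip(sent, sent[1:]))
          if not pairs:
              break
          (a, b), _count = pairs.most_common(1)[0]
          corpus = [apply_merge(sent, a, b, a + " " + b) for sent in corpus]
      return corpus

  corpus = [
      ["i", "am", "beautiful"],
      ["i", "am", "tired"],
      ["you", "are", "beautiful"],
      ["i", "am", "here"],
  ]
  print(merge_most_frequent_pairs(corpus))
  # [['i am', 'beautiful'], ['i am', 'tired'], ['you', 'are', 'beautiful'], ['i am', 'here']]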


I think what you're describing is a kind of low-rank decomposition applied to the vocabulary embeddings. A quick search on Google Scholar suggests that this might be useful in the context of multilingual tokenization.
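If it helps make that concrete, here is a minimal sketch of factoring an embedding table with a truncated SVD; the shapes and rank are made up, and it says nothing about whether the approximation is good enough for a given model:

  # Minimal sketch (assumed shapes, not from any specific paper): factor a
  # V x d embedding table E into A @ B with a small inner rank r, so the
  # parameter count drops from V*d to r*(V + d).
  import numpy as np

  V, d, r = 10_000, 768, 128            # hypothetical vocab size, embedding dim, rank
  rng = np.random.default_rng(0)
  E = rng.normal(size=(V, d)).astype(np.float32)   # stand-in for a trained table

  U, S, Vt = np.linalg.svd(E, full_matrices=False)
  A = U[:, :r] * S[:r]                  # (V, r)
  B = Vt[:r]                            # (r, d)
  E_approx = A @ B                      # rank-r approximation of E

  print("full params:    ", V * d)          # 7,680,000
  print("low-rank params:", r * (V + d))    # 1,378,304
  # How small r can be before quality drops is an empirical question; random
  # data like this has no low-rank structure, so the sketch only shows the
  # bookkeeping, not the quality trade-off.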



Much larger tokens require a much larger token vocabulary.
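To put a rough number on that (hypothetical sizes, counting only the embedding and unembedding matrices, ignoring the other costs like softmax compute and per-token data sparsity):

  # Back-of-the-envelope sketch: the input embedding table and the output
  # projection each cost V * d parameters, so vocabulary size feeds
  # straight into model size.
  d = 4096                      # assumed hidden dimension
  for V in (50_000, 500_000, 5_000_000):
      params = 2 * V * d        # input embeddings + output unembedding
      print(f"vocab {V:>9,}: {params / 1e9:5.1f}B parameters just for the (un)embeddings")
  # vocab    50,000:   0.4B
  # vocab   500,000:   4.1B
  # vocab 5,000,000:  41.0B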



