
By "parameters" they probably mean float32s, and 65B of those is 0.25 TB of data - more than enough to memorize a 1.5T sequence of "tokens" (3 letter triplets?). This begs the question: are these models better than a fuzzy hash table?



Yes and no. Information theoretically, tokens are pretty well compressed, and you can't get another 6x losslessly.
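
For reference, a back-of-the-envelope on those figures (assuming float32 parameters and the quoted 1.5T training tokens):

    params = 65e9                 # 65B parameters
    param_bytes = params * 4      # float32 = 4 bytes each -> ~0.26 TB
    tokens = 1.5e12               # 1.5T training tokens

    print(f"{param_bytes / 1e12:.2f} TB of parameters")
    print(f"{tokens / param_bytes:.1f} tokens per byte of parameters")  # ~5.8, the ~6x above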

Moreover, anything even kind of looking like a hash table in the input/output space is ruled out by the observed facts that the models very frequently respond sensibly to samples crafted not to be in the training set and that they take many long-range dependencies into account (i.e., the hash table would have to be exponentially larger than it is to match the model's performance).

That said, they are just statistical party tricks. The magic happens because the lookup tables are in a latent space. That's why you can drop in garbage like "uberworldchefinatormichelingodfoodpleasureorgasmmaestro" when asking for recipes and food recommendations and get an experience planets apart from queries excluding the nonsense phrases. The model is just pulling together some token associations, and throwing in the right tokens can take advantage of those in situations where a thinking person would barely be able to parse what you're asking.

Your question feels like it has a motive though. What are you really asking?


LLMs need a baseline to compare with. I suspect that when they get compared with a fuzzy hash table of a similar size (that returns a range of probabilities), their performance will become unimpressive.


You can just directly calculate what would happen. To respond to novel words (which these demonstrably do) it needs to be equivalent to a character-wise hash table, and to be the same size as LLaMA you can do lookups on around 4 characters (and you have to deal with the data sparsity in constructing many of those tuples). If you want worse output but a better hash table on the output that remains, you could hash words or common words and get contexts of up to a few words rather than a few letters.
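
Rough numbers behind the "around 4 characters" estimate (a sketch; the ~128-character alphabet and one-slot-per-parameter accounting are my assumptions):

    import math

    params = 65e9      # LLaMA-65B-sized table, one stored value per parameter (assumed)
    alphabet = 128     # assumed distinct characters
    # Each stored context needs a next-character distribution: one slot per character.
    contexts = params / alphabet
    # Longest context length n such that alphabet**n <= contexts:
    n = math.log(contexts, alphabet)
    print(f"~{contexts:.1e} storable contexts -> ~{n:.1f} characters of context")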

LLMs can track mid-range dependencies though. Consider the following input

> Translate the phrase "the lazy brown fox jumped over the thorny brambles" into French, write the translation, and then write the second through fourth words of that translation.

Looking at any one word of the output you need to track many of the input words to get it correct, and the relative positions of those necessary input words is not consistent from one output word to the next. ChatGPT solves the task flawlessly (aside from its habit of explaining what it's doing before doing it). Any hash table solution, at a minimum, would need a complicated heuristic for determining which words/characters to look up.

Doing so brings us back closer to the state of language models before transformers. You had a lot of hand-tuned features, formal grammars, complicated orders of operations, expert lookup tables, and whatnot. Performance was still much, much worse than what we're getting now with deep learning.

None of that is to say that philosophically we're doing anything more than mishmashing probabilities or that something better doesn't exist, but without significant innovation rule-guided fuzzy hash tables aren't it.


The fuzzy hash table would use 8192-token-long sequences as keys, and when requested to fetch a key, it would find the nearest keys and return that distribution. The internal representation of this hash table is a cloud of points in an 8192×sizeof(token)-dimensional space.

The procedure for constructing this table would be just taking all 1.5 trillion subsequences, each 8192 tokens long, and inserting them: table[seq8192] = token8193 (the next token). Arranging this data efficiently to allow fast lookups is the problem.
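
A toy version of that construction, shrunk down so it runs (4-token contexts instead of 8192, Hamming distance standing in for "nearest"; the details are illustrative, not anything from the paper):

    from collections import Counter, defaultdict

    CONTEXT = 4  # toy context length standing in for 8192

    def build_table(tokens):
        """table[context] -> Counter of observed next tokens."""
        table = defaultdict(Counter)
        for i in range(len(tokens) - CONTEXT):
            key = tuple(tokens[i:i + CONTEXT])
            table[key][tokens[i + CONTEXT]] += 1
        return table

    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    def lookup(table, query):
        """Exact hit if present, otherwise the nearest stored key by Hamming distance.
        The brute-force min() over every key is exactly the fast-lookup problem above."""
        if query in table:
            return table[query]
        nearest = min(table, key=lambda k: hamming(k, query))
        return table[nearest]

    corpus = "the lazy brown fox jumped over the thorny brambles".split()
    table = build_table(corpus)
    print(lookup(table, ("lazy", "brown", "fox", "jumped")))  # exact key -> Counter({'over': 1})
    print(lookup(table, ("lazy", "brown", "fax", "jumped")))  # off-by-one key, same bucket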


Ah, so less a hash table and more vanilla KNN?

Edit: I missed this on the first pass, but I'm totally lost as to where 1.5T comes from. Even if you only have two tokens there are vastly more 8192-length subsequences than that (something like 2^8151.5 times more), and if we're just trying to replicate the same space as something like GPT3.5 or LLaMA then you only get on the order of 0.065T to 0.175T entries to play with, much less when you consider that you have a full probability distribution to store (divide by your unique token count, and again by at least 2 if we store at least IEEE f16 probabilities).
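
The counting behind that figure, for reference:

    import math

    possible = 8192                  # log2 of 2**8192 possible binary-token sequences
    observed = math.log2(1.5e12)     # log2 of ~1.5T observed subsequences, ~40.4
    # prints ~2^8151.6, i.e. the ~2^8151.5 above
    print(f"~2^{possible - observed:.1f} times more possible sequences than observed")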


k-nearest neighbors? Sort of, but I'd rather describe it as a geospatial map in many dimensions.


There are lots of interpretations. I actually like KNN for a lot of tasks. My gut says that it still wouldn't perform well here (and for the record, there are efficient data structures for the idea you're describing unless you have some nonstandard modifications, so "arranging the data efficiently to allow fast lookups" is definitely not the core problem), but I admittedly don't have proof of that yet.

For some intuition, imagine the following tasks:

> Repeat the following phrase exactly twice: "sdflhasdflhasdf"

> Repeat the following phrase exactly twice: "sdflhasdflhasdg"

Your fuzzy dictionary or geospatial map can't possibly have enough keys to distinguish the requests (or if it distinguishes those, you can adversarially select different keyboard mashes), and so the result, no matter what it is, would have the same probability distribution for both prompts. Since the desired results are different, at least one of those would have some unavoidable wrongness.
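
To make the collision concrete, a toy nearest-key lookup (character-trigram Jaccard distance is my stand-in for the fuzziness; the stored keys are made up):

    def trigrams(s):
        return {s[i:i + 3] for i in range(len(s) - 2)}

    def distance(a, b):
        A, B = trigrams(a), trigrams(b)
        return 1 - len(A & B) / len(A | B)   # Jaccard distance on character trigrams

    # Whatever keyboard mashes happened to be stored as keys at training time:
    stored = ["sdflhasdflhasda", "qwertyqwertyqwe", "zxcvbnzxcvbnzxc"]

    for query in ["sdflhasdflhasdf", "sdflhasdflhasdg"]:
        nearest = min(stored, key=lambda k: distance(query, k))
        print(query, "->", nearest)
    # Both queries resolve to the same stored key, so the table returns the same
    # distribution for both, even though the correct outputs differ.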

The GPT family, on the other hand, has few issues with random phrase duplication since positional information is something it explicitly considers and is capable of prioritizing over other token information.


Indeed, that's a good counterexample.



