
True - I think the key is "empirical probability", so just keying on the combinations of all the tokens in your context window.

(Also, I think the "last n tokens" term is a bit misleading: ChatGPT seems to have an n of "approximately 4000 tokens or 3000 words" [1, 2], which would amount to ~6 pages of text [3].

I've seen very few conversations even approaching that length - and in the ones that did, there were reports of it breaking down, e.g. continuity errors in long RP sessions, etc. So I think for practical purposes we can say it's "probability of the next token given all the previous tokens".)

Building a naive Markov chain with such a large context is infeasible even before compression; you couldn't even gather enough training data. If you have a vocabulary of 200 words, that gives you 200^4000 [4] possible contexts, each needing its own conditional probability distribution, and you'd have to gather enough data to get a useful empirical probability for each of them. (Even if you shorten the context to the length of realistic prompts, say 50 words, 200^50 is still a number too big to have a name.)
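
To make the counting argument concrete, here is a rough Python sketch of such a naive empirical-probability model (the function names, vocabulary size and context length are just illustrative, not anything an actual model uses):

    from collections import Counter, defaultdict

    def train(tokens, context_len):
        """Count how often each next token follows each exact context."""
        counts = defaultdict(Counter)
        for i in range(len(tokens) - context_len):
            context = tuple(tokens[i:i + context_len])
            counts[context][tokens[i + context_len]] += 1
        return counts

    def next_token_probs(counts, context):
        """Empirical P(next token | context) -- only defined if this exact context was seen."""
        seen = counts.get(tuple(context))
        if not seen:
            return None  # the common case: almost no long context ever repeats in the data
        total = sum(seen.values())
        return {tok: n / total for tok, n in seen.items()}

    # The table needed to cover every context grows as vocab_size ** context_len:
    print(f"200**50 = {200 ** 50:.2e} contexts")
    print(f"200**4000 has {len(str(200 ** 4000))} digits")

The point of the sketch: almost every context of that length occurs zero times in any real corpus, so the empirical probabilities simply can't be estimated.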

Which is why, from what I've gathered, the big innovation in transformer networks was that they don't weight every token in their context window equally, but have a number of "meta models" that learn which tokens to attend to - the "attention head" mechanism.

And I think that's where the Markov chain intuition breaks down a bit: those meta models make the selection based on some internal representation of the tokens, but at least I haven't really understood yet what that internal representation contains.
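
For what it's worth, the mechanism itself is compact. Here is a toy single-head NumPy sketch of the standard scaled dot-product attention (the sizes and "learned" projection matrices are made-up random numbers; real models stack many heads and layers):

    import numpy as np

    def attention(Q, K, V):
        """Single-head scaled dot-product attention: each position scores every
        position by query-key similarity, softmaxes the scores into weights,
        and returns a weighted mix of the value vectors."""
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                # (seq, seq) similarity matrix
        scores -= scores.max(axis=-1, keepdims=True)   # for numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True) # softmax over the tokens
        return weights @ V, weights

    # Toy example: 5 tokens with 8-dimensional internal representations
    rng = np.random.default_rng(0)
    x = rng.normal(size=(5, 8))                               # token representations
    Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))  # projections (random here, learned in practice)
    out, attn = attention(x @ Wq, x @ Wk, x @ Wv)
    print(attn.round(2))  # each row sums to 1: how strongly each token attends to the others

In this picture, the "internal representation" is just the vector each token gets mapped to before it is projected into queries, keys and values - what those vectors actually encode in a trained model is exactly the open question.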

[1] https://www.reddit.com/r/deeplearning/comments/zk5esp/chatgp...

[2] https://help.openai.com/en/articles/6787051-does-chatgpt-rem...

[3] https://capitalizemytitle.com/page-count/1500-words/

[4] this number: https://www.calculator.net/big-number-calculator.html?cx=200...



Oh yeah, good points. And citations! Thanks so much, really appreciated.

I'm currently doing work on this and hopefully it will yield some fruit -- at least, the work relating to the internals of what's happening inside of Transformers. Just from slowly absorbing the research over the years, I have a few gut hypotheses about what is happening. I think a good chunk of it is surprisingly standard, just hidden due to the complexity of millions of parameters slinging information hither and yon.

Thanks again for putting all of the thought and effort into your post, I really appreciate it. This is something I love about being here in this particular place! :D


That sounds extremely interesting! I'm still trying to understand the basic transformer architecture myself, but I think those kinds of insights are exactly what we'll need if we don't want the whole field to degrade into alchemy, with no one understanding what is going on.

Do you have a blog?


I do not (unfortunately? Maybe I need one?), though if you want, you can follow me on GitHub at https://github.com/tysam-code. I try to update/post to my projects regularly, as I'm able to. It alternates between different ones, though LLMs are the main focus right now, if I can get something to an (appropriately and healthfully) publishable state. :3 :)

I try to be skeptical about certain possibilities within the field, but I do feel bullish about us being able to at least tease out some of the structure of what's happening, due to how some properties of transformers work. At least, I think it'll be easier than figuring out how certain brain informational structures work (which has happened to some tiny degree, and will, I think, be even cooler in the future)! :) XD :DDDD :)



