you mean this http://karpathy.github.io/2012/10/22/state-of-computer-visio...?
Very funny to revisit. It's astounding how primitive our tools were compared to now. It feels like the first flight of the Wright Brothers vs a jetliner. ImageNet was the new frontier. Simpler times...
I think the interesting thing here is the very, very surprising result that LLMs turned out to be capable of abstracting the things in the second-to-last paragraph purely from amalgamated written human descriptions of experience.
It's the thing most people, even in this thread, don't seem to realize has emerged from research over the past year.
Give a Markov chain a lot of text about fishing and it will tell you about fish. Give GPT a lot of text about fishing and it turns out that it will probably learn how to fish.
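To make the Markov-chain half of that contrast concrete, here's a toy bigram chain (the corpus and function names are invented for illustration). All it stores is which word follows which, so it can only recombine surface text; nothing in it could represent fishing itself:

```python
import random
from collections import defaultdict

def train(text):
    # Bigram table: for each word, the list of words observed after it.
    words = text.split()
    table = defaultdict(list)
    for a, b in zip(words, words[1:]):
        table[a].append(b)
    return table

def generate(table, start, n=10, seed=0):
    # Walk the table: each next word depends only on the current word.
    random.seed(seed)
    out = [start]
    for _ in range(n):
        followers = table.get(out[-1])
        if not followers:
            break
        out.append(random.choice(followers))
    return " ".join(out)

corpus = "cast the line wait for a bite reel in the fish cast the line again"
table = train(corpus)
print(generate(table, "cast"))
```

Every word it emits was seen verbatim in the training text, and the only "knowledge" is one-step word adjacency. That's the gap the comment is pointing at: the surprise is that transformer LMs trained on next-token prediction appear to go beyond this kind of surface statistics.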
World model representations are occurring in GPT. And people really need to start realizing there's already published research demonstrating that, as it goes a long way toward explaining why the multimodal parts work.