What does reasoning have to do with geometry? Is this like the idea that different concepts have inherent geometrical forms? A Platonic or noetic take on the geometries of reason? (I struggled to understand much of this paper…)
A follow-up comment after having studied the paper a bit more, since you asked about where the geometry comes into play.
One of the references the paper provides is to this[1] paper, which shows how the non-linear layers in modern deep neural networks partition the input into regions and apply region-dependent affine mappings[2] to generate the output. It also mentions how that connects to vector quantization and k-means clustering.
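To make that spline picture concrete, here's a toy numpy sketch (my own illustration, not code from either paper): within a fixed ReLU activation pattern, the network really is a single affine map, and the pattern itself identifies which input region you're in.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny two-layer ReLU network: f(x) = W2 @ relu(W1 @ x + b1) + b2
W1, b1 = rng.normal(size=(8, 2)), rng.normal(size=8)
W2, b2 = rng.normal(size=(1, 8)), rng.normal(size=1)

def forward(x):
    pre = W1 @ x + b1
    mask = (pre > 0).astype(float)  # activation pattern = which region x is in
    return W2 @ (mask * pre) + b2, mask

def local_affine(mask):
    # Within one region the mask is constant, so f(x) = A @ x + c exactly.
    A = W2 @ (mask[:, None] * W1)
    c = W2 @ (mask * b1) + b2
    return A, c

x = rng.normal(size=2)
y, mask = forward(x)
A, c = local_affine(mask)
assert np.allclose(y, A @ x + c)  # the region-dependent affine map reproduces f
```

The k-means/vector-quantization connection comes from reading the activation pattern as a codeword that assigns each input to a region.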
So, the geometric perspective isn't referring to your typical high-school geometry, but to more abstract concepts like vector spaces[3] and combinatorial computational geometry[4].
The submitted paper shows that this partitioning is directly linked to the approximation power of the neural network. They then show how increasing the approximation power results in better answers to math word problems, and hence that the approximation power correlates with the reasoning ability of LLMs.
Modern neural networks, in particular the transformer[1] architecture that powers modern LLMs, make heavy use of linear algebra.
Since linear algebra is closely related to geometry[2], it seems quite reasonable that there are some geometric aspects that define their capabilities and performance.
Specifically, in this paper they're considering the intrinsic dimension[3] of the attention layers, and seeing how it correlates with the performance of LLMs.
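I don't know exactly which estimator they use, but to give a flavor of what "intrinsic dimension" means in practice, here's a sketch of the TwoNN estimator (Facco et al., 2017), which infers dimension from the ratio of each point's second- to first-nearest-neighbor distance:

```python
import numpy as np
from scipy.spatial import cKDTree

def twonn_id(X):
    # TwoNN: under mild assumptions, d = N / sum(log(r2 / r1)),
    # where r1, r2 are distances to the 1st and 2nd nearest neighbors.
    dists, _ = cKDTree(X).query(X, k=3)  # k=3: each point's nearest "neighbor" is itself
    mu = dists[:, 2] / dists[:, 1]
    return len(X) / np.sum(np.log(mu))

# Sanity check: a 2-D plane embedded in a 50-D ambient space.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 2)) @ rng.normal(size=(2, 50))
print(twonn_id(X))  # ~2, far below the ambient dimension of 50
```

The interesting claim is that attention-layer representations have intrinsic dimension far below their ambient width, and that this number tracks performance.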
> it seems quite reasonable that there are some geometric aspects that define their capabilities and performance.
Sure, but this doesn't mean terribly much when you can relate either concept to virtually any other concept. "Reasonable" would imply that one specific term implies another specific term, and you haven't filled in those blanks yet.
"different concepts have inherent geometrical forms"
Absolutely, in fact you can build the foundation of mathematics on this concept. You can build proofs and reasoning (for some value of "reasoning").
That's how dependent type systems work; search for homotopy type theory (HoTT) and modal homotopy type theory. That's how Lean 4, Coq, and other theorem provers work.
If you recall, at the foundation of lambda calculus or Boolean algebra, proofs proceed through a series of transformations of mathematical objects that are organized as lattices or semilattices, i.e. partially ordered sets (e.g. in Boolean algebra, where the partial order is given by implication).
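To spell out the Boolean-algebra example: implication really does give a partial order, with AND and OR as the lattice meet and join, and a few lines of Python can check this exhaustively:

```python
from itertools import product

B = (False, True)
leq = lambda a, b: (not a) or b  # "a implies b" as the order relation

# Partial order: reflexive, antisymmetric, transitive.
assert all(leq(a, a) for a in B)
assert all(not (leq(a, b) and leq(b, a)) or a == b for a, b in product(B, B))
assert all(not (leq(a, b) and leq(b, c)) or leq(a, c)
           for a, b, c in product(B, B, B))

# Lattice structure: AND is a lower bound, OR is an upper bound.
assert all(leq(a and b, a) and leq(a and b, b) for a, b in product(B, B))
assert all(leq(a, a or b) and leq(b, a or b) for a, b in product(B, B))
```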
It would be interesting to understand whether the density of attention mechanisms follows a progression similar to dependent type systems, and whether we can find a link between the dependent types involved in a proof and the corresponding spaces in an LLM, via some continuous relaxation analogous to a proximal operator plus some transformation (from high-level concepts into output tokens).
We have found in embeddings that geometry has meaning. Specific simple concepts correspond to vector directions. I wouldn't be surprised at all if we find that reasoning over dependent concepts corresponds to complex subspaces in the paths an LLM takes, and that with enough training these connections become closer and closer to the logical structure of the corresponding proofs (for a self-consistent input corpus, like math proofs, and given enough training data).
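The classic demonstration of "concepts as directions" is word-vector arithmetic; with gensim and a small pretrained GloVe model (downloaded on first use), something like this works:

```python
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-50")  # small pretrained GloVe model

# Apply the man->woman offset to "king"; "queen" should rank near the top.
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```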
The paper doesn't make this point at all, but one thing you could do here is an AlphaGeometry-style[1] synthetic benchmark, where you have a geometry engine crank out a hundred million word problems, and have an LLM try to solve them.
Geometry problems have the nice property that they're easy to generate and solve mechanically, but there's no particular reason why a vanilla transformer LLM would be any good at them, and you can reach absolutely huge scale. (Unlike, say, the HumanEval benchmark, which has only 164 problems and has consequently drawn lots of accusations that LLMs simply memorize the answers.)
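As a sketch of the generator side (the template and phrasing here are just my invention; a real AlphaGeometry-style pipeline would use a symbolic geometry engine and much richer constructions):

```python
import random

def make_problem(rng):
    # Trivially checkable triangle-angle word problem.
    a, b = rng.randint(20, 80), rng.randint(20, 80)
    question = (f"In triangle ABC, angle A is {a} degrees and "
                f"angle B is {b} degrees. What is angle C?")
    return question, 180 - a - b  # ground-truth answer for grading

rng = random.Random(0)
for question, answer in (make_problem(rng) for _ in range(5)):
    print(question, "->", answer)  # feed the question to the LLM, grade its reply
```

At a hundred million problems, memorization stops being a viable strategy.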
You'd have the second problem of figuring out how to encode geometry as a sequence of tokens, since the encoding you choose will surely affect what you might reasonably expect an LLM to draw from it.
Only if your purpose is to create the best geometry solver. If you're trying to improve the general intelligence of a frontier LLM, you're probably better off feeding in the synthetic data as some combination of raw images and text (as part of its existing tokenisation).
I think they are talking about word embeddings, where context is embedded into a high-dimensional geometric space (one dimension might capture how 'feminine' a word is, or how 'blue' it is).
My extremely naive understanding is that the more useful attributes, which also tend to be structural features of language like gender or color, get their own dimensions, and other concepts are represented with combinations of them.
A weak illustration of this is this site[1], from an HN post a few months ago[2].
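In practice attributes like gender rarely occupy a literal coordinate axis; a common proxy is projecting embeddings onto a difference direction. A rough sketch with gensim (the word choices are just for illustration):

```python
import numpy as np
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-50")

# A crude "femininity axis": the direction from "he" to "she".
axis = model["she"] - model["he"]
axis /= np.linalg.norm(axis)

for word in ["king", "queen", "actor", "actress", "chair"]:
    v = model[word] / np.linalg.norm(model[word])
    print(f"{word:>8s} {v @ axis:+.3f}")  # signed projection onto the axis
```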
If the curvature metric weren't steep to begin with, AdamW wouldn't work. If the regions of interest weren't roughly Euclidean, control vectors wouldn't work.
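For anyone unfamiliar with control vectors: the whole trick is adding a fixed direction to the residual-stream activations, which only makes sense if the local geometry is roughly Euclidean. A toy numpy sketch (all shapes and values are placeholders):

```python
import numpy as np

def steer(hidden, control_vec, alpha=4.0):
    # hidden: (seq_len, d_model) residual-stream activations.
    # control_vec: typically a difference of mean activations between
    # contrastive prompts; here it's just random for illustration.
    return hidden + alpha * control_vec / np.linalg.norm(control_vec)

rng = np.random.default_rng(0)
hidden = rng.normal(size=(10, 512))
steered = steer(hidden, rng.normal(size=512))
```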
I think the connection is that the authors could convincingly write a paper on it, thus inflating the AI publication bubble, furthering their academic standing, and improving their chances of getting research grants or selective jobs in the field. Some of the authors' other interests seem to be detecting exoplanets using AI and detecting birds through audio analysis.
Since nobody can really say what a good AI department does, companies seem to be driven by credentialism: they load up on machine-learning PhDs and master's graduates so they can show their board and investors that they are ready for the AI revolution. This creates economic pressure to write such papers, the vast majority of which will amount to nothing.
I think a lot of the time you would be correct. But this is published on arXiv, so it's not peer-reviewed and doesn't boost the authors' credentials. It could be designed to attract attention to the company they work at. Or it could just be a cool idea the authors wanted to share.