First, this is awesome and we need more of this kind of thing.
Second, disclaimer: I am not now and might never be a serious algebraic and/or differential geometer. Just a fan at the moment.
I've been calling the useful transformations in LLM latent manifolds things like "substantially affine", and I think that's probably true enough of the current crop.
I don't think this about `{V, I}-JEPA` (about which there's a lot of information and I plan to look into it a lot more) or Sora (about which there is less information but is still impressive AF). One imagines that `V-JEPA` and Sora have some deep parallels/symmetries.
Either way, I'll wager that serious Riemannian geometry is rapidly on its way to table stakes. We have extremely high-dimensional spaces that result from backprop and gradient descent; some combination of smooth/continuous/differentiable/compact seems pretty likely to fall out? Along with interesting curvature tensors and parallel transport for moving around in them? And TDA for figuring it out numerically/computationally?
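(As a concrete starting point for the TDA part: a minimal sketch, assuming the `ripser` library and stand-in random "activations" rather than real model latents, of computing persistent homology on a latent point cloud.)

```python
# Minimal sketch: persistent homology of a latent point cloud with ripser.
# The "activations" below are stand-in random data, not real model latents.
import numpy as np
from ripser import ripser

rng = np.random.default_rng(0)
# Stand-in for N latent vectors of dimension D (e.g., penultimate-layer outputs).
activations = rng.normal(size=(500, 64))

# Persistence diagrams up to H1: connected components (H0) and loops (H1).
diagrams = ripser(activations, maxdim=1)["dgms"]

# Long-lived H1 features would hint at loop-like structure in the latents.
h1 = diagrams[1]
if len(h1):
    lifetimes = h1[:, 1] - h1[:, 0]
    print("longest-lived H1 feature:", lifetimes.max())
```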
I'd love it if an expert chimed in; I'm trying to describe an intuition with a fluency that involves pointing and gesturing.
To my knowledge as a math-turned-ML guy, there are currently no useful geometric characterizations of deep net latent spaces that are both "deep" (in the sense of using advanced mathematics) and "useful" (in the sense of revealing properties of networks or their latent spaces that aren't understood otherwise). Of course if anyone knows better I'd love to hear about it.
Continuous geometric concepts don't play super well with the way we like to decompose model outputs into discrete entities (classes, words, visual properties). We can, e.g., find variables in celebrity-face GAN latent spaces that seem related to face orientation or hair color, sort of, over some variable range and under some input conditions. But that doesn't really translate cleanly into any typical mathematical characterization, geometric or otherwise, where you'd be looking for some property to hold everywhere, or at least to have an atlas of connected local approximations to simple characterizations.
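(For concreteness, the usual trick behind those GAN attribute "variables" is embarrassingly linear: a hypothetical sketch, with stand-in random latents instead of real labeled ones, of taking a difference of class means as the attribute direction.)

```python
# Hypothetical sketch: a linear "attribute direction" in a GAN latent space.
# z_with_attr / z_without_attr would be latent codes whose generated outputs
# were labeled (e.g., "smiling" vs. not); here they are stand-in random arrays.
import numpy as np

rng = np.random.default_rng(0)
z_with_attr = rng.normal(size=(1000, 512)) + 0.1  # stand-in labeled latents
z_without_attr = rng.normal(size=(1000, 512))

# The classic move: the attribute direction is just a difference of means.
direction = z_with_attr.mean(axis=0) - z_without_attr.mean(axis=0)
direction /= np.linalg.norm(direction)

# Traversal: nudge a latent along the direction and re-render. This tends to
# hold only over some range and for some seeds, which is the point above.
z = rng.normal(size=512)
for alpha in (-2.0, 0.0, 2.0):
    z_edit = z + alpha * direction
    # image = generator(z_edit)  # hypothetical generator call
```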
Instead, we get high-dimensional messes of spaces, and network gradients during training don't exhibit clean or easy to understand dynamics except in the simplest toy cases.
To paraphrase a more serious "math for ML" prof I've chatted with at times -- "doing math" classically involves being able to find a description with only a few free parameters for a complex phenomenon that may superficially appear to have many/infinite free parameters. It's possible that for large ML models trained on natural data, such a reduction just doesn't exist: you can't break the contributions of millions or billions of parameters down into a low-dimensional approximation. He was/is skeptical of us attaining deep mathematical insight into their operation, but he could always be wrong. I'd certainly love to see cool novel insights come out of mathematics that give clarity to what's been going on these past 15 years.
Thank you very much for the thoughtful and insightful reply!
This is obviously speculation/intuition, but it's not terribly surprising to me, at least, that operating in e.g. pixel space or a straightforwardly lifted latent manifold (modern diffusers, basically) wouldn't show apparent structure under the fancy t-SNE-type things that seem to be the heaviest artillery brought to the party (at least in the open). In pixel space, you get 6-17 fingers on 1-3 hands.
The `paris - france + uk === london` thing is real, and it's not surprising, because there isn't typically much in the way of nonlinearities in `word2vec`/`fasttext`/`glove`-type stuff. But this substantially survives all the leaky ReLUs or whatever in LLMs. They're pretty clearly interpolating in a way that you could get close to with a composition of affine transforms.
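(For reference, a minimal sketch of that arithmetic with gensim, assuming the pretrained `glove-wiki-gigaword-100` vectors from gensim-data:)

```python
# Sketch of the classic analogy arithmetic with pretrained GloVe vectors.
# Assumes `pip install gensim`; the model downloads on first use (~130 MB).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")

# london ~= paris - france + uk
result = vectors.most_similar(positive=["paris", "uk"], negative=["france"], topn=3)
print(result)  # 'london' typically shows up at or near the top
```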
JEPA (and maybe Sora, if... fuck it, who knows) seems like a dramatic shift in forcing the joint loss into a much higher-level space/manifold with (to me at least) shockingly semantic properties. I mean, look at the I-JEPA reconstructions from the pre-trained lifted space, with some dinky diffuser/VAE-thing eating the hyperplane:
That's not pixel space, and you've got a lot of freedom to make it smoother. I suspect no one says "L1 regularization" anymore, but there's some modern version of that; we know how to do this.
AFAIU (and again, I welcome expert correction), TDA at least, and really a lot of modern geometry, is about "scruffy intrinsic / smooth embedded" (or vice versa), and "scruffy at this scale, but smooth if you set the focus right".
Preserving topology during dimension reduction might affect this? Something something fractal dimensionality, erm, tropical geometry and amoebas and the... here it is: https://proceedings.mlr.press/v80/zhang18i.html
Edit: obviously I don't know my zonotope from my tropical hypersurface; I do, however, like the pretty pictures ;)
"By computing geometric descriptors of DNNs and performing large-scale model comparisons, we discovered a geometric phenomenon that has been overlooked in previous work: DNN models of high-level visual cortex benefit from high-dimensional latent representations. This finding runs counter to the view that both DNNs and neural systems benefit by compressing representations down to low-dimensional subspaces [20–39, 78]. "
Maybe! I'm lost in the maths, but "Our results suggest that learned optimizers can benefit from considering the (symmetry) structure of the weight space they optimize." This came out of DeepMind on 7th Feb: https://arxiv.org/abs/2402.05232
I haven't dug into this much at all, but I am aware that some teams have been looking for methods to either discover symmetries and invariances in latent spaces or train them in. For instance, in an image recognition model one might expect shift- and rotation-invariance properties.
I don't know if this line of research is going anywhere, but I thought it was interesting in terms of actual geometry being applied to ANNs.
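(Not any particular team's method, just the flavor of "training an invariance in": a hypothetical sketch with a stand-in encoder, penalizing the embedding distance between an image and a randomly shifted/rotated copy. In practice you'd combine this with a task loss so the embedding doesn't collapse to a constant.)

```python
# Hypothetical sketch: encouraging shift/rotation invariance in an embedding.
# The encoder is a stand-in; the penalty pulls together embeddings of an
# image and a randomly transformed copy. Combine with a task loss in practice.
import torch
import torchvision.transforms as T

encoder = torch.nn.Sequential(  # stand-in encoder network
    torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 128)
)
augment = T.RandomAffine(degrees=15, translate=(0.1, 0.1))

def invariance_loss(x: torch.Tensor) -> torch.Tensor:
    z = encoder(x)
    z_aug = encoder(augment(x))
    return torch.nn.functional.mse_loss(z, z_aug)

x = torch.randn(8, 3, 32, 32)  # stand-in image batch
loss = invariance_loss(x)
loss.backward()
```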
The math is often beyond me; I just try to visualize what I can. So I don't know how interesting this would be to you, but these folks are pretty good at math and are trying to figure out what's going on in the latent spaces:
https://transformer-circuits.pub/
If you find this stuff useful, I recommend also checking out our pygraphistry umap() for a few reasons here:
- by visually linking similar entities, vs. just a scatter plot of values floating around or binning, you see a lot of the substructure and non-linkages that these tools otherwise mislead you on
- you get a full interactive suite for on-the-fly visual encodings, drilling, reclustering, etc. for when you start investigating it
- automatic support for heterogeneous data. Ex: for user analysis, combine profile + name embeddings with numeric, date, Boolean, etc. features, so you can work with the real dataset, not just a couple of columns
For video/images, there are other tools I'd recommend, because they add another layer of nuance. But for this kind of text, tabular, etc. data, we've come to be opinionated and have built in a lot here over the last few years, including with cool work by our collaborators at NVIDIA (RAPIDS).
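(A rough sketch of the flow, from my reading of the PyGraphistry docs rather than anything authoritative; assumes `pip install graphistry[ai]` plus a Graphistry account for the hosted viewer, and the toy DataFrame is made up:)

```python
# Rough sketch of the pygraphistry umap() flow; check the PyGraphistry docs
# for exact parameters. Assumes `pip install graphistry[ai]` plus an account
# for the hosted visualization; the DataFrame below is a made-up toy.
import graphistry
import pandas as pd

graphistry.register(api=3, username="...", password="...")  # hypothetical creds

df = pd.DataFrame({
    "user": ["a", "b", "c"],
    "bio": ["likes cats", "likes dogs", "likes cats and dogs"],
    "age": [23, 35, 29],
})

# umap() featurizes mixed text/numeric columns, embeds them, and links
# similar entities so substructure shows up as actual edges, not just dots.
g = graphistry.nodes(df).umap()
g.plot()  # opens the interactive explorer
```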
Author of the project here. Definitely appreciating the supportive comments. I'd be happy to answer questions folks have and am very interested in what kind of data folks end up visualizing with it!
This is great! I used to hack something like that together whenever I was working with embeddings, clustering, semantic search, etc., using UMAP and Plotly. This looks a lot more polished!
I wrote a little tool for myself last year that I called "hyperspace", haha. It allows me to do a similar "inspection" of model activity and output across a series of visualizations.
Honestly, this looks more useful. The TensorFlow Embedding Projector is pretty limited beyond quick, nifty visualizations; it doesn't really inform you much about different clusters of points or why different clusters or hierarchies emerge. From a quick glance, it looks like this library lets you do that.
How to "create an embedding" depends a lot on what kind of data you have.
Usually, you train a neural network to solve some kind of task with your data. The most common task is probably classification, for example, "Is the animal shown in this image a dog or a cat?" or "Does this text sound happy or sad?".
Once your network is trained, you discard its last layer, which was responsible for classification, and use the output of the second-to-last layer as your embedding vector.
This works because the first few layers of the network have already transformed the data into a generally useful representation, which gets turned into specific classes by the last layer, or can be used as an embedding vector instead.
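(As a concrete sketch of the "discard the last layer" step, using a torchvision classifier; any trained classifier works the same way:)

```python
# Sketch of "discard the last layer": swap a trained classifier's final
# classification layer for Identity, so the forward pass returns the
# penultimate-layer representation as the embedding vector.
import torch
import torchvision.models as models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = torch.nn.Identity()  # drop the 1000-way classification head
model.eval()

with torch.no_grad():
    batch = torch.randn(4, 3, 224, 224)  # stand-in preprocessed images
    embeddings = model(batch)            # shape: (4, 512)
print(embeddings.shape)
```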
Atlas from Nomic AI is popular, and Weights & Biases has some tools, but high-dimensional data is generally hard to visualize whatever you do with it. This is a solid roll-it-yourself-at-home implementation, though, and well documented, so nice work and thanks to the author. This would be a good Show HN post.