First, this is awesome and we need more of this kind of thing.
Second, disclaimer: I am not now and might never be a serious algebraic and/or differential geometer. Just a fan at the moment.
I've been calling the useful transformations in LLM latent manifolds things like "substantially affine", and I think that's probably true enough of the current crop.
I don't think this about `{V, I}-JEPA` (about which there's a lot of information and I plan to look into it a lot more) or Sora (about which there is less information but is still impressive AF). One imagines that `V-JEPA` and Sora have some deep parallels/symmetries.
Either way, I'll wager that serious Riemannian geometry is rapidly on its way to table stakes. We have extremely high-dimensional spaces that result from backprop and gradient descent; some combination of smooth/continuous/differentiable/compact seems pretty likely to fall out? Along with interesting curvature tensors and parallel transport for moving around in them? And TDA for figuring it out numerically/computationally?
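(As a concrete starting point for the TDA part: a minimal sketch, assuming the `ripser` library and stand-in random "activations" rather than real model latents, of computing persistent homology on a latent point cloud.)

```python
# Minimal sketch: persistent homology of a latent point cloud with ripser.
# The "activations" below are stand-in random data, not real model latents.
import numpy as np
from ripser import ripser

rng = np.random.default_rng(0)
# Stand-in for N latent vectors of dimension D (e.g., penultimate-layer outputs).
activations = rng.normal(size=(500, 64))

# Persistence diagrams up to H1: connected components (H0) and loops (H1).
diagrams = ripser(activations, maxdim=1)["dgms"]

# Long-lived H1 features would hint at loop-like structure in the latents.
h1 = diagrams[1]
if len(h1):
    lifetimes = h1[:, 1] - h1[:, 0]
    print("longest-lived H1 feature:", lifetimes.max())
```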
I'd love it if an expert chimed in; I'm trying to describe an intuition with a fluency that involves pointing and gesturing.
To my knowledge as a math-turned-ML guy, there are currently no useful geometric characterizations of deep net latent spaces that are both "deep" (in the sense of using advanced mathematics) and "useful" (in the sense of revealing properties of networks or their latent spaces that aren't understood otherwise). Of course if anyone knows better I'd love to hear about it.
Continuous geometric concepts don't play super well with the way we like to decompose model outputs into discrete entities (classes, words, visual properties). We can, e.g., find variables in celebrity-face GAN latent spaces that seem related to face orientation or hair color, sort of, over some variable range and under some input conditions. But that doesn't really translate cleanly into any typical mathematical characterization, geometric or otherwise, where you'd be looking for some property to hold everywhere, or at least to have an atlas of connected local approximations to simple characterizations.
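(For concreteness, the usual trick behind those GAN attribute "variables" is embarrassingly linear: a hypothetical sketch, with stand-in random latents instead of real labeled ones, of taking a difference of class means as the attribute direction.)

```python
# Hypothetical sketch: a linear "attribute direction" in a GAN latent space.
# z_with_attr / z_without_attr would be latent codes whose generated outputs
# were labeled (e.g., "smiling" vs. not); here they are stand-in random arrays.
import numpy as np

rng = np.random.default_rng(0)
z_with_attr = rng.normal(size=(1000, 512)) + 0.1  # stand-in labeled latents
z_without_attr = rng.normal(size=(1000, 512))

# The classic move: the attribute direction is just a difference of means.
direction = z_with_attr.mean(axis=0) - z_without_attr.mean(axis=0)
direction /= np.linalg.norm(direction)

# Traversal: nudge a latent along the direction and re-render. This tends to
# hold only over some range and for some seeds, which is the point above.
z = rng.normal(size=512)
for alpha in (-2.0, 0.0, 2.0):
    z_edit = z + alpha * direction
    # image = generator(z_edit)  # hypothetical generator call
```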
Instead, we get high-dimensional messes of spaces, and network gradients during training don't exhibit clean or easy to understand dynamics except in the simplest toy cases.
To paraphrase a more serious "math for ML" prof I've chatted with at times -- "doing math" classically involves being able to find a description with only a few free parameters for a complex phenomenon that may superficially appear to have many/infinite free parameters. It's possible that for large ML models trained on natural data, such a reduction just doesn't exist: you can't break the contributions of millions or billions of parameters down into a low-dimensional approximation. He was/is skeptical of us attaining deep mathematical insight into their operation, but he could always be wrong. I'd certainly love to see cool novel insights come out of mathematics that give clarity to what's been going on these past 15 years.
Thank you very much for the thoughtful and insightful reply!
This is obviously speculation/intuition, but it's not terribly surprising to me, at least, that operating in e.g. pixel space or a straightforwardly lifted latent manifold (modern diffusers, basically) wouldn't show apparent structure under the fancy t-SNE-type things that seem to be the heaviest artillery brought to the party (at least in the open). In pixel space, you get 6-17 fingers on 1-3 hands.
The `paris - france + uk === london` thing is real, and it's not surprising, because there isn't typically much in the way of nonlinearities in `word2vec`/`fasttext`/`glove`-type stuff. But this substantially survives all the leaky ReLUs or whatever in LLMs. They're pretty clearly interpolating in a way that you could get close to with a composition of affine transforms.
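(For reference, a minimal sketch of that arithmetic with gensim, assuming the pretrained `glove-wiki-gigaword-100` vectors from gensim-data:)

```python
# Sketch of the classic analogy arithmetic with pretrained GloVe vectors.
# Assumes `pip install gensim`; the model downloads on first use (~130 MB).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")

# london ~= paris - france + uk
result = vectors.most_similar(positive=["paris", "uk"], negative=["france"], topn=3)
print(result)  # 'london' typically shows up at or near the top
```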
JEPA (and maybe Sora, if... fuck it, who knows) seems like a dramatic shift in forcing the joint loss into a much higher-level space/manifold with (to me at least) shockingly semantic properties. I mean, look at the I-JEPA reconstructions from the pre-trained lifted space, with some dinky diffuser/VAE-thing eating the hyperplane:
That's not pixel space, and you've got a lot of freedom to make it smoother. I suspect no one says "L1 regularization" anymore, but there's some modern version of that; we know how to do this.
AFAIU (and again, I welcome expert correction), TDA at least, and really a lot of modern geometry, is about "scruffy intrinsic / smooth embedded" (or vice versa), and "scruffy at this scale, but smooth if you set the focus right".
Preserving topology during dimension reduction might affect this? Something something fractal dimensionality, erm, tropical geometry and amoebas and the... here it is: https://proceedings.mlr.press/v80/zhang18i.html
Edit: obviously I don't know my zonotope from my tropical hypersurface; I do, however, like the pretty pictures ;)
"By computing geometric descriptors of DNNs and performing large-scale model comparisons, we discovered a geometric phenomenon that has been overlooked in previous work: DNN models of high-level visual cortex benefit from high-dimensional latent representations. This finding runs counter to the view that both DNNs and neural systems benefit by compressing representations down to low-dimensional subspaces [20–39, 78]. "
Maybe! I'm lost in the maths, but "Our results suggest that learned optimizers can benefit from considering the (symmetry) structure of the weight space they optimize." This came out of DeepMind on 7th Feb: https://arxiv.org/abs/2402.05232
I haven't dug into this much at all, but I am aware that some teams have been looking for methods to either discover symmetries and invariances in latent spaces or train them in. For instance, in an image recognition model one might expect shift- and rotation-invariance properties.
I don't know if this line of research is going anywhere, but I thought it was interesting in terms of actual geometry being applied to ANNs.
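(Not any particular team's method, just the flavor of "training an invariance in": a hypothetical sketch with a stand-in encoder, penalizing the embedding distance between an image and a randomly shifted/rotated copy. In practice you'd combine this with a task loss so the embedding doesn't collapse to a constant.)

```python
# Hypothetical sketch: encouraging shift/rotation invariance in an embedding.
# The encoder is a stand-in; the penalty pulls together embeddings of an
# image and a randomly transformed copy. Combine with a task loss in practice.
import torch
import torchvision.transforms as T

encoder = torch.nn.Sequential(  # stand-in encoder network
    torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 128)
)
augment = T.RandomAffine(degrees=15, translate=(0.1, 0.1))

def invariance_loss(x: torch.Tensor) -> torch.Tensor:
    z = encoder(x)
    z_aug = encoder(augment(x))
    return torch.nn.functional.mse_loss(z, z_aug)

x = torch.randn(8, 3, 32, 32)  # stand-in image batch
loss = invariance_loss(x)
loss.backward()
```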
The math is often beyond me; I just try to visualize what I can. So I don't know how interesting this would be to you, but these folks are pretty good at math and are trying to figure out what's going on in the latent spaces:
https://transformer-circuits.pub/
If you find this stuff useful, I recommend also checking out our pygraphistry umap() for a few reasons here:
- by visually linking similar entities, vs. just a scatter plot of values floating around or binning, you see a lot of the substructure and non-linkages that these tools otherwise mislead you on
- you get a full interactive suite for on-the-fly visual encodings, drilling, reclustering, etc. for when you start investigating it
- automatic support for heterogeneous data. Ex: for user analysis, combine profile + name embeddings with numeric, date, Boolean, etc. features, so you can work with the real dataset, not just a couple of columns
For video/images, there are other tools I'd recommend, because they add another layer of nuance. But for this kind of text, tabular, etc. data, we've come to be opinionated and have built in a lot here over the last few years, including with cool work by our collaborators at NVIDIA (RAPIDS).
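(A rough sketch of the flow, from my reading of the PyGraphistry docs rather than anything authoritative; assumes `pip install graphistry[ai]` plus a Graphistry account for the hosted viewer, and the toy DataFrame is made up:)

```python
# Rough sketch of the pygraphistry umap() flow; check the PyGraphistry docs
# for exact parameters. Assumes `pip install graphistry[ai]` plus an account
# for the hosted visualization; the DataFrame below is a made-up toy.
import graphistry
import pandas as pd

graphistry.register(api=3, username="...", password="...")  # hypothetical creds

df = pd.DataFrame({
    "user": ["a", "b", "c"],
    "bio": ["likes cats", "likes dogs", "likes cats and dogs"],
    "age": [23, 35, 29],
})

# umap() featurizes mixed text/numeric columns, embeds them, and links
# similar entities so substructure shows up as actual edges, not just dots.
g = graphistry.nodes(df).umap()
g.plot()  # opens the interactive explorer
```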
Author of the project here. Definitely appreciating the supportive comments. I'd be happy to answer questions folks have and am very interested in what kind of data folks end up visualizing with it!
This is great! I used to hack something like that together whenever I was working with embeddings, clustering, semantic search, etc., using UMAP and Plotly. This looks a lot more polished!
I wrote a little tool for myself last year that I called "hyperspace", haha. It allows me to do a similar "inspection" of model activity and output across a series of visualizations.
Honestly, this looks more useful. The TensorFlow Embedding Projector is pretty limited beyond quick, nifty visualizations; it doesn't really inform you much about different clusters of points or why different clusters or hierarchies emerge. From a quick glance, it looks like this library lets you do that.
How to "create an embedding" depends a lot on what kind of data you have.
Usually, you train a neural network to solve some kind of task with your data. The most common task is probably classification, for example, "Is the animal shown in this image a dog or a cat?" or "Does this text sound happy or sad?".
Once your network is trained, you discard its last layer, which was responsible for classification, and use the output of the second-to-last layer as your embedding vector.
This works because the first few layers of the network have already transformed the data into a generally useful representation, which gets turned into specific classes by the last layer, or can be used as an embedding vector instead.
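(As a concrete sketch of the "discard the last layer" step, using a torchvision classifier; any trained classifier works the same way:)

```python
# Sketch of "discard the last layer": swap a trained classifier's final
# classification layer for Identity, so the forward pass returns the
# penultimate-layer representation as the embedding vector.
import torch
import torchvision.models as models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = torch.nn.Identity()  # drop the 1000-way classification head
model.eval()

with torch.no_grad():
    batch = torch.randn(4, 3, 224, 224)  # stand-in preprocessed images
    embeddings = model(batch)            # shape: (4, 512)
print(embeddings.shape)
```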
Atlas from Nomic AI is popular, and Weights & Biases has some tools, but high-dimensional data is generally hard to visualize whatever you do with it. This is a solid roll-it-yourself-at-home implementation, though, and well documented, so nice work and thanks to the author. This would be a good Show HN post.