TopoNets: High performing vision and language models with brain-like topography

gzer0 · 2025-01-31T08:30:12 1738312212

I spent time working with Andrej and the rest of the FSD team back in 2020/2021, and we had plenty of conversations on how human visual processing maps onto our neural network architectures. Our approach—transformer-based attention blocks, multi-scale feature extraction, and temporal fusion—mirrors elements of the biological visual cortex (retina → LGN → V1 → V2 → V4 → IT) which break down raw inputs and integrate them over time. It’s amazing how closely this synthetic perceptual pipeline parallels the way our own brains interpret the world.

The key insight we discovered was that explicitly enforcing brain-like topographic organization (as some academic work attempts - such as this one here) isn't necessary - what matters is having the right functional components that parallel biological visual processing. Our experience showed that the key elements of biological visual processing - like hierarchical feature extraction and temporal integration - emerge naturally when you build architectures that have to solve real visual tasks.

The brain's organization serves its function, not the other way around. This was validated by the real-world performance of our synthetic visual cortex in the Tesla FSD stack.

Link to the 2021 Tesla AI day talk: https://www.youtube.com/live/j0z4FweCy4M?t=3010s

lukan · 2025-01-31T09:12:16 1738314736

"It’s amazing how closely this synthetic perceptual pipeline parallels the way our own brains interpret the world."

It is amazing, that the synthetic pipeline, that was build to mimick the brain, seems to mimick the brain?

That sounds a bit tautological and otherwise I doubt we have really understood how our brain exactly interprets the world.

In general this is definitely interesting research, but worded like this, it smells a bit hyped to me.

Shorel · 2025-01-31T12:54:36 1738328076

I interpreted it the other way around.

We can think of a solution space, with potentially many good solutions to the vision problem, and we can, in science fiction-like speculation, that the other solutions will be very different and surprise us.

Then this experiment shows its solution is the same we already knew, and that's it.

Then there aren't many good potential solutions, there is only one, and the ocean of possibilities becomes the pond of this solution.

trhway · 2025-01-31T09:43:41 1738316621

The convolutional kernels in the first levels do converge to Gabors like the ones in V1 (and there were math works in the 90-ies, in neuro research, about optimality of such kernels) so it wouldn't be surprising if higher levels would converge to something that is similar to the higher levels of visual cortex (like hierarchical feature aggregation that is nicely illustrated by deep dreaming and also feels like it can be optimal under reasonable conditions and thus would be expected to emerge).

perching_aix · 2025-01-31T11:38:43 1738323523

Did you read the part where he explicitly mentioned that they discovered how enforcing that architecture was not necessary, as it would emerge on its own?

lukan · 2025-01-31T12:17:17 1738325837

I did, but it was not clear to me, how it was meant. I assume the basic design was done before (with the brain in mind).

iandanforth · 2025-01-31T11:37:02 1738323422

Unlike neural networks the brain contains massive numbers of lateral connections. This, combined with topographical organization, allows it to do within layer temporal predictions as activations travel across the visual field, create active competition between similarly tuned neurons in a layer (forming natural sub networks), and quite a bit more. So, yeah, the brain's organisation serves it's function, and it does so very very well.

dmarchand90 · 2025-01-31T09:35:32 1738316132

I've found how CNN map to visual cortex to be very clear. But I've always been a bit confused about how llms map to the brain. Is that even the case?

nickpsecurity · 2025-01-31T15:17:21 1738336641

They probably don’t. They’re very different. LLM’s seem to be based on pragmatic, mathematical techniques developed over time to produce patterns from data.

There’s at least three fields in this:

1. Machine learning using non-neurological techniques (most stuff). These use a combination of statistical algorithms stitched together with hyperparameter tweaking. Also, usually global optimization by heavy methods like backpropagation.

2. “Brain-inspired” or “biologically accurate”algorithms that try to imitate the brain. They sometimes include evidence their behavior matches experimental observations of brain behavior. Many of these use complex neurons, spiking nets, and/or local learning (Hebbian).

(Note: There is some work on hybrids such as integrating hippocampus-like memory or doing limited backpropagation on Hebbian-like architectures.)

3. Computational neuroscience which aims to make biologically-accurate models at various levels of granularity. Their goal is to understand brain function. A common reason is diagnosing and treating neurological disorders.

Making an LLM like the brain would require use of brain-inspired components, multiple systems specialized for certain tasks, memory integrated into all of them, and a brain-like model for reinforcement. Imitating God’s complex design is simply much more difficult than combining proven algorithms that work well enough. ;)

That said, I keep collecting work on both efficient ML and brain-inspired ML. I think some combination of the techniques might have high impact later. I think the lower, training costs of some brain-inspired methods, especially Hebbian learning, justify more experimentation by small teams with small, GPU budgets. Might find something cost-effective in that research. We need more of it on common platforms, too, like HughingFace libraries and cheap VM’s.

trhway · 2025-01-31T10:39:19 1738319959

> how llms map to the brain

For the lower level - word embedings (word2vec, "King – Man + Woman = Queen") - one can see a similarity

https://www.nature.com/articles/d41586-019-00069-1 and https://gallantlab.org/viewer-huth-2016/

"The map reveals how language is spread throughout the cortex and across both hemispheres, showing groups of words clustered together by meaning."

nyrikki · 2025-01-31T15:45:01 1738338301

That is the latent space.

Very different from a feed forward network with perceptrons, auttograd, etc...

Inner product spaces are fixed points, mapping between models is less surprising because the general case is a merger set IIRC.

energy123 · 2025-01-31T03:39:46 1738294786

The main reason topography emerges in physical brains is because spatially distant connections are physically difficult and expensive in biological systems. Artificial neural nets have no such trade-off. So what's the motivation here? I can understand this might be a very good regularizer, so it could help with generalization error on small-data tasks. But hard to see why this should be on the critical path to AGI. As compute and data grows, you want less inductive bias. For example, CNN will beat ViT on small data tasks, but that flips with enough scale because ViT imposes less inductive bias. Or at least any inductive bias should be chosen because it models the structure of the data well, such as with causal transformers and language.

AYBABTME · 2025-01-31T05:53:46 1738302826

Locality of data and computation is very important in neural nets. It's the number one reason why training and inference are as slow as they are. It's why GPUs need super expensive HBM memory, why NVLink is a thing, why Infiniband is a thing.

If the problem of training and inference on neural networks can be optimized so that a topology can be used to keep closely related data together, we will see huge advancements in training and inference speed, and probably in model size as a result.

And speed isn't just speed. Speed makes impossible (not enough time in our lifetime) things possible.

A huge factor in Deepseek being able to train on H800 (half HBM bandwith as H100) is that they used GPU cores to compress/decompress the data moved around between the GPU memory and the compute units. This reduces latency in accessing data and made up for the slower memory bandwith (which translates in higher latency when fetching data). Anything that reduces the latency of memory accesses is a huge accelerator for neural nets. The number one way to achieve this is to keep related data next to each other, so that it fits in the closest caches possible.

mirekrusin · 2025-01-31T06:24:40 1738304680

It's true, but isn't OP also correct? Ie. it's about speed, which implies locality, which implies approaches like MoE which does exactly that and it's unlike physical brain topology?

Having said that it would be fun to see things like rearrangement data moves based on temerature of silicon parts after training cycle.

nickpsecurity · 2025-01-31T15:20:07 1738336807

Well, locality and the global nature of pre-training methods. The brain mostly uses local learning (Hebbian learning) which requires less, data movement. AI firms putting as much money into making that scale as they did on backpropagation might drop costs a lot.

vlovich123 · 2025-01-31T03:48:51 1738295331

Unless GPUs work markedly differently somehow or there’s been some fundamental shift in computer architecture I’m not aware of, spatial locality is still a factor in computers.

Aside from HW acceleration today, designs like Cebras would benefit heavily by reducing the amount of random access from accessing the weights (and thus freeing up cross-chip memory bandwidth for other things).

whynotminot · 2025-01-31T03:55:03 1738295703

This makes me remember game developers back when games could still be played directly from the physical disc. They would often duplicate data to different parts of the disc, knowing that certain data would often be streamed from disc together, so that seek times were minimized.

But those game devs knew where everything was spatially on the disc, and how the data would generally be used during gameplay. It was consistent.

Do engineers have a lot of insight into how models get loaded spatially onto a given GPU at run time? Is this constant? Is it variable on a per GPU basis? I would think it would have to be.

Hard to optimize for this.

jaek · 2025-01-31T04:26:46 1738297606

This brings to mind The Story of Mel from programming folklore.

http://beza1e1.tuxen.de/lore/story_of_mel.html

abrookewood · 2025-01-31T05:28:31 1738301311

Such a good read - some people really are on another level in their chosen field.

vlovich123 · 2025-01-31T15:55:58 1738338958

Right now models have no structure so that access is random but you definitely know where the data is located in memory since you put it there. It doesn’t matter about the physical location - it’s all through a TLB but if you ask the GPU for a contiguos memory allocation it gives it to you. This is probable the absolute easiest thing to optimize for if your data access pattern is amenable to it.

harles · 2025-01-31T04:21:16 1738297276

That could explain compute efficiency, but has nothing to do with the parameter efficiency pointed at in the paper.

vlovich123 · 2025-01-31T15:57:22 1738339042

Haven’t read the paper but my guess around that is that the same reason sparse attention networks (where they 0 out many weights) just have the sparse tensors be larger.

mayukhdeb · 2025-02-01T13:57:57 1738418277

In this paper, we don't zero out the weights. We remove them.

vlovich123 · 2025-02-02T20:40:08 1738528808

Thanks for the correction! Can it be retrofitted into existing models through distillation or do you have to train the model from scratch?

cma · 2025-01-31T03:56:11 1738295771

> The main reason topography emerges in physical brains is because spatially distant connections are physically difficult and expensive in biological systems.

The brain itself seems to have bottlenecks that aren't distance related, like hemispheres and the corpus callosum that are preserved over all placental mammals and other mammalian groups have something similar and still hemispheres. Maybe it's just an artifact of bilateral symmetry that is stuck in there from path dependence, or forcing a redundancy to make damage more recoverable, but maybe it has a big regularizing or alternatively specializing effect (regularization like dropout tends to force more distributed representations which seems kind of opposite to this work and other work like "Seeing is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability," https://arxiv.org/abs/2305.08746 ).

jlpom · 2025-01-31T19:53:44 1738353224

It increases modularity and small-worldness, which are in my book critical for AGI (surprised by the way that this publication doesn't cite https://www.nature.com/articles/s42256-023-00748-9).

mayukhdeb · 2025-01-31T20:19:47 1738354787

Thank you for sharing this! We'll read through this and update the camera-ready version accordingly for ICLR 2025.

exe34 · 2025-01-31T17:16:50 1738343810

> CNN will beat ViT on small data tasks, but that flips with enough scale because ViT imposes less inductive bias

any idea why this is the case? CNN have the bias that neighbouring pixels are somehow relevant - they are neighbours. ViTs have to re-learn this from scratch. So why do they end up doing better than CNN?

TZubiri · 2025-01-31T04:21:01 1738297261

Maybe this would be relevant for datacenters with significant distance between machines, or multidatacenter systems.

xpl · 2025-01-31T03:44:46 1738295086

> So what's the motivation here?

Better interpretability, I suppose. Could give insights into how cognition works.

mayukhdeb · 2025-01-31T04:24:48 1738297488

The motivation was to induce structure in the weights of neural nets and see if the functional organization that emerges aligns with that of the brain or not. Turns out, it does -- both for vision and language.

The gains in parameter efficiency was a surprise even to us when we first tried it out.

energy123 · 2025-01-31T03:59:19 1738295959

That's true, and interpretability is helpful for AI safety.

mayukhdeb · 2025-01-31T04:26:59 1738297619

Indeed. What's cool is that we were able to localize literal "regions" in the GPTs which encoded toxic concepts related to racism, politics, etc. A similar video can be found here: https://toponets.github.io

More work is being done on this as we speak.

fakeparmesean · 2025-01-31T05:16:08 1738300568

My understanding coming from mechanistic interpretability is that models are typically (or always) in superposition, meaning that most or all neurons are forced to encode semantically unrelated concepts because there are more concepts than neurons in a typical LM. We train SAEs (where we apply L1 reg and a sparsity penalty to “encourage” the encoder output latents to yield sparse representations of the originating raw activations), to hopefully disentangle these features, or make them more monosemantic.This allows us to use the SAE as a sort of microscope to see what’s going on in the LM, and apply techniques like activation patching to localize features of interest, which sounds similar to what you’ve described. I’m curious what this work means for mech interp. Is this a novel alternative to mitigating polysemanticity? Or perhaps neurons are still encoding multiple features, but the features tend to have greater semantical overlap? Fascinating stuff!

mayukhdeb · 2025-01-31T14:58:56 1738335536

> the features tend to have greater semantical overlap?

This is true. The features closer together now have much stronger semantic overlap. You can watch how the weights self-organize in a GPT here: https://toponets.github.io/webpage_assets/banner_video.mp4

We're already studying the effects of topographic structure on polysemanticity.

cwillu · 2025-01-31T04:35:42 1738298142

Was it toxicity though as understood by the model, or just a cluster of concepts that you've chosen to label as toxic?

I.e., is this something that could (and therefore, will) be turned towards identifying toxic concepts as understood by the chinese or us government, or to identify (say) pro-union concepts so they can be down-weighted in a released model, etc?

mayukhdeb · 2025-01-31T04:43:51 1738298631

We localized "toxic" neurons by contrasting the activations of each neuron for toxic v/s normal texts. It's a method inspired by old-school neuroscience.

immibis · 2025-01-31T09:41:49 1738316509

Defining all politics as toxic is concerning, if it's not just a proof of concept. That's something dictatorships do so that people won't speak up.

jv22222 · 2025-01-31T04:42:59 1738298579

I had this idea the other day. Not sure if it relates but maybe?

https://twitter.com/justinvincent/status/1884357300703400274

mercer · 2025-01-31T04:03:36 1738296216

I imagine it could be easier to make sense of the 'biological' patterns that way? like, having bottlenecks or spatially-related challenges might have to be simulated too, to make sense of the ingested 'biological' information.

ziofill · 2025-01-31T04:23:44 1738297424

Perhaps they are more easily compressible? Once a bunch of nearby weights have similar roles one may not need all of them.

mayukhdeb · 2025-01-31T04:40:20 1738298420

Yep. That is exactly the idea here. Our compression method is super duper naive. We literally keep every n-th weight column and discard the rest. Turns out that even after getting rid of 80% of the weight columns in this way, we were able to retain the same performance in a 125M GPT.

w-m · 2025-01-31T10:48:25 1738320505

If you have things organized neatly together, you can also use pre-existing compression algorithms, like JPEG, to compress your data. That's what we're doing in Self-Organizing Gaussians [0]. There we take an unorganised (noisy) set of primitives that have 59 attributes and sort them into 59 2D grids which are locally smooth. Then we use off-the-shelf image formats to store the attributes. It's an incredibly effective compression scheme, and quite simple.

[0]: https://fraunhoferhhi.github.io/Self-Organizing-Gaussians/

FrereKhan · 2025-01-31T05:11:13 1738300273

This paper imports an arbitrarily-chosen aspect of cortical architecture — topological maps of function — and ignores every other aspect of biological neural tissue. The resulting models show lower performance for the same number of parameters — not surprising, since they are more constrained compared with baseline. They may be slightly more robust against pruning — not surprising, since they are more regularised.

The figures show individual seeds, presumably, with no statistical analysis in the performance or pruning comparisons, so the null hypothesis is there is no difference between toponets and baseline. I would never let this paper be submitted by my team.

We haven't learned anything about the brain, or about ANNs.

brrrrrm · 2025-01-31T13:38:12 1738330692

this paper plays into some popular fantasy about the aesthetic of ANNs. it’s not scientifically useful

mayukhdeb · 2025-01-31T15:15:06 1738336506

If by popular fantasy you mean replicating the functional profiles of the visual and language cortex of the brain, then yes. These ideas in neuroscience are popular, but not fantasy. I encourage you to read up on functional organization in the brain, it's very fascinating.

> it’s not scientifically useful

Having structured weights in GPTs enables us to localize and control various concepts and study stuff like polysemanticity, superposition, etc. Other scientific directions include sparse inference (already proven to work) and better model editing. Turns out, topographic structure also helps these models better predict neural data, which is yet another direction we're exploring in computational neuroscience.

photon_lines · 2025-01-31T16:22:25 1738340545

I love the paper - don't read into the negative comments. I find that a lot of online feedback (more so on Reddit and much less so on HN (usually)) tends to be opinionated and misinformed by quite a bit these days. Fantastic work and fantastic read.

mayukhdeb · 2025-01-31T19:05:14 1738350314

Thank you for your kind words!

brrrrrm · 2025-01-31T22:21:52 1738362112

I probably came in too hot on that (dealing with some personal stuff). Although I disagree with purported the impact of the paper, I don’t think this is fundamentally incorrect or bad science and I wish you the best on future research.

slama · 2025-01-31T03:38:27 1738294707

The title here doesn't seem to match. The paper is called "TopoNets: High Performing Vision and Language Models with Brain-Like Topography"

Even with their new method, models with topography seem to perform worse than models without.

dang · 2025-01-31T05:01:58 1738299718

Submitted title was "Inducing brain-like structure in GPT's weights makes them parameter efficient". We've reverted it now in keeping with the site guidelines (https://news.ycombinator.com/newsguidelines.html).

Since the submitter appears to be one of the authors, maybe they can explain the connection between the two titles? (Or maybe they already have! I haven't read the entire thread)

mayukhdeb · 2025-01-31T15:21:02 1738336862

Thanks for clarifying your reason for renaming the title.

The explanation for the original title is this plot from our publication in ICLR 2025: https://toponets.github.io/webpage_assets/FigureEfficiencyNa...

You can find more details on the website: https://toponets.github.io (see section: "Toponets deliver sparse, parameter-efficient language models")

We find out that inducing topographic structure in the weights of GPTs made them compressible (during inference) without losing out on performance.

I encourage you to revert the name if you find it justified after looking into the evidence I've shown here. Thanks.

vessenes · 2025-01-31T04:03:52 1738296232

I hate to dog on research papers. They’re work to write. That said, I think this paper is not likely to be of interest to AI researchers — instead it may be of interest to Neuroscience folks or other brain research types.

The lede — adding topography worsens networks at similar weights — is not only buried, it’s obscured with statements claiming that topo networks show less upheaval when scaled down, e.g. they are more efficient than similar weight networks.

It’s hard for me to see how both these things can be true — the graphs show the more topography is added, the worse the networks perform at the trained model sizes.

To have the second statement “They compress better and are therefore more efficient” also be true, I think you’d need to show a pretty remarkable claim, which is that while a model trained at the same scale as a llama architecture is worse, when you scale them both down, this model becomes not only better than the scaled down llama, but also better than a natively trained model at the new smaller scale.

There is no proof of this in the paper, and good reason to be skeptical of this idea based on the data presented.

That said, like a lot of ideas in AI, this .. works! You can train a model successfully imposing these outside structures on it, and that model doesn’t even suck very much. Which is a cool statement about complexity theory and the resilience of these architectures, in my opinion. But I don’t think it says much else about either the brain or underlying AI ‘truths’.

igleria · 2025-01-31T08:01:58 1738310518

This is excellent. Since reading https://books.google.de/books/about/Models_of_the_Mind.html?... I've been expecting someone to start looking back into biology to try to move forward. I guess the poster is one of the authors. Kudos!

mayukhdeb · 2025-01-31T15:27:58 1738337278

Thank you for your kind words!

Indeed. The problem with most AI research today is they simply do trial and error with large amounts of compute. No room for taking inspiration from nature, which is requires more thought and less FLOPS.

LZ_Khan · 2025-01-31T03:29:20 1738294160

Shouldn't there be a comparison in performance on common benchmarks to other models?

Like a 7B toponet model vs a 7B Llama model?

As a layperson I don't understand why topology is a thing to optimize for.

TOMDM · 2025-01-31T03:45:08 1738295108

The only potential benefit shown in the paper is the topologically local models seem to be more resilient after pruning.

So you may be able to prune a 7B model down to 6B while maintaining most of the capability.

mayukhdeb · 2025-01-31T04:20:52 1738297252

> The only potential benefit

Other benefits:

1. Significantly lower dimensionality of internal representations 2. More interpretable (see: https://toponets.github.io)

> 7B model down to 6B

We remove ~80% of the parameters in topographic layers and retain the same performance in the model. The drop in parameter count is not significant because we did not experiment with applying TopoLoss in all of the layers of the model (did not align with the goal of the paper)

We are currently performing those strong sparsity experiments internally, and the results look very promising!

michalsustr · 2025-01-31T04:54:28 1738299268

The blurring in the sheets and the topo loss reminded me of https://arxiv.org/abs/2408.05446

light_hue_1 · 2025-01-31T03:53:46 1738295626

They bury the part where inducing brain like structure hurts performance!

This is a method to just hurt your network in exchange for nothing useful at all aside from some sketchy story that this is "brain like".

mayukhdeb · 2025-01-31T04:15:13 1738296913

Our goal was never to optimize for performance. There's a long standing hypothesis that topographic structure in the human brain leads to metabolic efficiency. Thanks to topography in ANNs, we were able to test out this hypothesis in a computational setting.

> sketchy story this is "brain like".

we reproduce the hallmarks of functional organization seen in the visual and language cortex of the brain. I encourage you to read the paper before making such comments

light_hue_1 · 2025-02-05T15:36:14 1738769774

I did read the paper. I really hope I don't get assigned to be a reviewer for it.

You don't reproduce anything about the functional organization of the visual or language cortex. You make a pretty picture with blobs in it. And one that's trivial to get from current methods. If you think "the functional organization of the visual or language system" means random blobs of activation/connectivity, well, then it's time for a class on neuroscience. I cannot imagine what neuroscientist would let this fly reviewing the paper.

The whole "we don't optimize for performance" is nonsense. Take any modern method that prunes weights and beats your approach with ease. Then smooth its output a bit to make nice blobs. The performance loss from smoothing will still beat your method and look "brain-like" by your definition. There you go. Your experiments don't show anything at all aside from the fact that a bad method performs poorly.

You didn't think through controls or alternative hypotheses. You didn't take into account a decade of research on methods to prune networks. You don't take seriously what we know about functional organization in the brain.

All sorts of bad papers make it through reviewing these days. But.. you can definitely do better. Good luck!

devmor · 2025-01-31T06:10:21 1738303821

Is this "brain-like" in any functional way, or "brain-like" in the same way that a tall rectangle is "door-like" even if it doesn't share any functions with a door?

I know quite a bit about machine learning, but very little to nothing about neuroscience and human cognition, so I am curious how an expert (that didn't work on the paper) would describe it.

(Forgive me for the pre-emptive negativity but I am so utterly exhausted by dishonest comparisons to sapient thought in the field of artificial intelligence that it has nearly drained me of the incredible amount of enthusiasm I used to carry for it.)

mayukhdeb · 2025-01-31T14:43:19 1738334599

It is indeed brain-like in a functional way. Topographic structure is what enables the brain to have low dimensionality and metabolic efficiency. We find that inducing such structure in neural nets made them have significantly lower dimensionality and also more parameter efficient (After training, we could take advantage of the structure to remove ~80% of the weights in topographic layers without sacrificing performance)

devmor · 2025-02-01T07:16:21 1738394181

>After training, we could take advantage of the structure to remove ~80% of the weights in topographic layers without sacrificing performance

This is really interesting to me. Is it that the structure clustered the neurons in such a way that they didn't need to be weighted because their function were grouped by similar black box properties?

mayukhdeb · 2025-02-01T13:54:50 1738418090

> Is it that the structure clustered the neurons in such a way that they didn't need to be weighted

Yep. Because of the structure, we did not have to compute the output of each weight column and simply copied the outputs of nearby weight columns whose outputs were computed.

devmor · 2025-02-03T16:20:10 1738599610

That is really cool and deserves the descriptor "brain-like", thank you for answering my questions!

mayukhdeb · 2025-02-06T17:28:36 1738862916

Thanks for the kind words! Happy to know that there are people out there who find this stuff just as interesting as I do.