Interesting. All developers I know who tinkered around with embeddings and vector similarity scoring were instantly hooked. The efficiency of computing the embeddings once and then reusing them as many times as needed, comparing the vectors with a cheap <30-line function, is extremely appealing. Not to mention the indexing capabilities to make it work at scale.
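For anyone who hasn't done this, here's a minimal sketch of the kind of cheap comparison function meant here (pure Python, no libraries; a real system would use numpy or an index, this is just to show how little work the comparison itself is):

    import math

    def cosine_similarity(a, b):
        # The expensive part (producing the embedding vectors a and b) happened
        # once, up front. Comparing them is just a dot product and two norms.
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b)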
IMO vector embedding is the most important innovation in computing of the last decade. There's something magical about it. These people deserve some kind of prize. The idea that you can reduce almost any intricate concept including whole paragraphs to a fixed-size vector which encapsulates its meaning and proximity to other concepts across a large number of dimensions is pure genius.
Vector embedding is not an invention of the last decade. Featurization in ML goes back to the 60s; even deep learning-based featurization is decades old at a minimum. Like everything else in ML, this became much more useful with data and compute scale.
Yes, but it doesn't generalize very well, even on simple features like gender. If you go look at embeddings you'll find that man and woman are neighbors, just as king and queen are[0]. That proximity is a better explanation for the result: you're just taking very small steps in the latent space.
Here, play around[1] (a sketch for reproducing these locally follows the examples below):
mother - parent + man = woman
father - parent + woman = man
father - parent + man = woman
mother - parent + woman = man
woman - human + man = girl
Or some that should be trivial
woman - man + man = girl
man - man + man = woman
woman - woman + woman = man
Working in very high dimensions is funky stuff. Embedding high dimensions into low dimensions results in even funkier stuff
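To reproduce this kind of arithmetic locally, something like the following works with gensim (the model name here is just one of the small pretrained sets gensim-data ships, not the 300-d model the website uses, so the exact neighbors you get will differ):

    import gensim.downloader as api

    # Small pretrained GloVe vectors; the website uses different, larger vectors.
    model = api.load("glove-wiki-gigaword-50")

    # most_similar does the element-wise add/subtract, then returns the nearest
    # remaining words by cosine similarity. Note that gensim drops the query
    # words themselves from the returned neighbors.
    print(model.most_similar(positive=["mother", "man"], negative=["parent"], topn=3))  # mother - parent + man
    print(model.most_similar(positive=["woman", "man"], negative=["man"], topn=3))      # woman - man + man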
Calling it addition is hairy here. Do you just mean an operator? If so, I'm with you. But normally people are expecting addition to have the full abelian group properties, which this certainly doesn't. It's not a ring because it doesn't have the multiplication structure. But it also isn't even a monoid[0] since, as we just discussed, it doesn't have associativity nor unitality.
There is far less structure here than you are assuming, and that's the underlying problem. There is local structure and so the addition operation will work as expected when operating on close neighbors, but this does greatly limit the utility.
And if you aren't aware of the terms I'm using here I think you should be extra careful. It highlights that you are making assumptions that you weren't aware were even assumptions (an unknown unknown just became a known unknown). I understand that this is an easy mistake to make since most people are not familiar with these concepts (including many in the ML world), but this is also why you need to be careful. Because even those that do are probably not going to drop these terms when discussing with anyone except other experts as there's no expectation that others will understand them.
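For readers who haven't met these terms, the checklist being invoked is just the standard one from abstract algebra (nothing embedding-specific about it):

    \begin{align*}
    \text{Associativity:}           \quad & (a \oplus b) \oplus c = a \oplus (b \oplus c) \\
    \text{Identity (unitality):}    \quad & \exists\, e : \ a \oplus e = e \oplus a = a \\
    \text{Inverses (group):}        \quad & \forall a \ \exists\, a^{-1} : \ a \oplus a^{-1} = e \\
    \text{Commutativity (abelian):} \quad & a \oplus b = b \oplus a
    \end{align*}

Plain element-wise vector addition satisfies all of these; the dispute further down the thread is about whether the end-to-end word-in, word-out operation does.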
I think you misinterpreted the tone of my original comment as some sort of gotcha. Presumably you're overloading the addition symbol with some other operational meaning in the context of vector embeddings. I'm just calling it addition because you're using a plus sign and I don't know what else to call it; I wasn't referring to addition as it's commonly understood, which is clearly associative.
It's just plain old addition. There is nothing fancy about the operation. The fancy part is training a model such that it would produce vector representations of words which had this property of conceptually making sense.
If someone says: "conceptually, what is king - man + woman". One might reasonably say "queen". This isn't some well defined math thing, just sort of a common sense thing.
Now, imagine you have a function (let's call it an "embedding model") which turns words into vectors. The function turns king into [3,2], man into [1,1], woman into [1.5, 1.5] and queen into [3.5, 2.5].
Now for king - man + woman you get [3,2] - [1,1] + [1.5,1.5] = [3.5, 2.5] and hey presto, that's the same as queen [3.5, 2.5].
Now you have to ask - how do you get a function to produce those numbers? If you look at the word2vec paper, you'll come to see they use a couple of methods to train a model and if you think about those methods and the data, you'll realize it's not entirely surprising (in retrospect) that you could end up with a function that produced vectors which had such properties. And, if at the same time you are sort of mind blown, welcome to the club. It blew Jeff Dean's big brain too.
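That toy example in runnable form, just to make "plain old element-wise addition" concrete (these 2-d vectors are made up for illustration, as above, not real embeddings):

    king  = [3.0, 2.0]
    man   = [1.0, 1.0]
    woman = [1.5, 1.5]
    queen = [3.5, 2.5]

    # king - man + woman, element by element
    result = [k - m + w for k, m, w in zip(king, man, woman)]
    print(result)           # [3.5, 2.5]
    print(result == queen)  # True, because the toy numbers were chosen that way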
I'm sorry, but I think you are overestimating your knowledge.
Have you gone through abstract algebra? Are you familiar with monoids, groups, rings, fields, algebras, and so on?
Because it seems you aren't aware that these structures exist and are a critical part of mathematics. It's probably why you're not understanding the conversation. @yellocake seems to understand that "addition" doesn't mean 'addition' (sorry, I assumed you meant how normal people use the word lol). You may not realize it, but you're already showing that addition doesn't have a single meaning. 1+1 = 2, but [1,0] + [0,1] = [1,1], and (1+0i) + (0+i) = 1+i. The operator symbol is the same but the operation actually isn't.
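The same point in Python, where + already means different things depending on the operands:

    import numpy as np

    print(1 + 1)                                # 2: integer addition
    print(np.array([1, 0]) + np.array([0, 1]))  # [1 1]: element-wise vector addition
    print((1 + 0j) + (0 + 1j))                  # (1+1j): complex addition
    print([1, 0] + [0, 1])                      # [1, 0, 0, 1]: same symbol, list concatenation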
> Now for king - man + woman you get [3,2] - [1,1] + [1.5,1.5] = [3.5, 2.5] and hey presto, that's the same as queen [3.5, 2.5].
The same as? Or is queen the closest?
If it were just "plain old addition" then @yellowcake (or me![0]) wouldn't have any confusion. Because
man - man + man
= (man - man) + man
= 0 + man
= man != woman
We literally just proved that it isn't "plain old addition". So stop being overly confident and look at the facts.
>>> Vector addition is absolutely associative
This is commonly true, but not necessarily. Floating point arithmetic is not associative.
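A quick demonstration of that, in case the floating point caveat sounds theoretical:

    a, b, c = 1e16, -1e16, 1.0
    print((a + b) + c)  # 1.0
    print(a + (b + c))  # 0.0 -- the 1.0 gets absorbed into -1e16 before it can cancel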
> you'll realize it's not entirely surprising that you could end up with a function that produced vectors which had such properties
Except it doesn't work as well as you think, and that's the issue. There are many examples of it working, and this is indeed surprising, but the effect does not generalize. If you go back to Jeff's papers you'll find some reasonable assumptions that are also limiting. Go look at "Distributed Representations of Words and Phrases and their Compositionality"[1] and look at Figure 2. See anything interesting? Notice that the capitals aren't always the closest? You might notice Ankara is closer to Japan than Tokyo is. You'll also notice that the lines don't all point in the same direction, so if we assume the space is well defined then clearly we aren't following the geodesic. There's a second issue you probably didn't notice: PCA only works on linear structure, yet the model is not linear. There aren't many details on what they did for the PCA, but it is easy to add information implicitly, and there's a good chance that happened here. The model is also still facing the usual problems with metrics in high-dimensional spaces, where notions such as distance become ill-defined.
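If you want to poke at the capitals example yourself, something along these lines works with gensim. Caveats: this checks the full 300-d space rather than the paper's 2-d PCA projection, so it's a related sanity check rather than a reproduction of the figure, and the model download is large.

    import gensim.downloader as api

    # The Google News word2vec vectors from roughly that era (~1.6 GB download).
    model = api.load("word2vec-google-news-300")

    # Raw nearest neighbors of "Japan": is Tokyo anywhere near the top?
    print(model.most_similar("Japan", topn=10))

    # The classic country -> capital analogy: Germany is to Berlin as Japan is to ?
    print(model.most_similar(positive=["Japan", "Berlin"], negative=["Germany"], topn=5))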
I've met Jeff and even talked with him at length. He's a brilliant dude and I have no doubt about that. But I don't believe he thinks this works in general. I'm aware he isn't a mathematician, but anyone who plays around with vector embeddings will experience the results I'm talking about. He certainly seems to understand that there are major limits to these models, but also that just because something has limits doesn't mean it isn't useful. The paper says as much and references several works that go into that even further. If you've misinterpreted me as saying embeddings are not useful then you're sorely mistaken. But neither should we talk about tools as if they are infallible and work perfectly. All that does is make us bad tool users.
[0] I also have no idea what mathematical structure vector embeddings follow. I'm actually not sure anyone does. This is definitely an under-researched domain despite being very important. The issue applies even to modern LLMs! But good luck getting funding for that kind of research. You're going to have a hard time getting it at a big lab (despite the high value), and you don't have the time in academia unless you're tenured, but then you've got students to prioritize.
Maybe spend more time reading a response than writing. Yellowcake doesn't know what you are talking about either (note the "pulling teeth" comment).
The examples you gave are a result of the embedding model in question not producing vectors which would map to most people's conceptual view of the world. Go through the website you quote from and see for yourself - it's just element-wise addition.
The examples I gave are entirely made-up, 2-dimensional vectors to explain what plain old addition means (i.e., plain old "add the vectors element-wise") in the context of embedding models. And yes, it's "the same as", because I defined it that way. Your website uses 300 dimensions, not 2.
As I mentioned, not all embedding models work the same way (or, as you've said, "this doesn't generalize"). They get trained differently, on different data. The word "similar" is used very loosely.
You even directly quote me and don't seem to be able to read the quote. The word "could" is there. You could end up with a model which had these nice properties.
The entire point of my post was to highlight that yellowcake's confusion arises because he assumes your examples are the result of some esoteric definition of addition, when they're not.
> Maybe spend more time reading a response than writing.
Quite ironic considering
> Yellowcake doesn't know what you are talking about either
I actually said
>> @yellocake seems to understand that "addition" doesn't mean 'addition'
Which is entirely based off of
>>>>>> Presumably you're overloading the addition symbol
I didn't assume their knowledge, they straight up told me and I updated my understanding based on that. That's how conversations work. And the fact that they understand operator overloading doesn't mean they understand more either. Do they understand monoids, fields, groups, and rings? Who knows? We'll have to let yellowcake tell us.
Regardless, what you claim I assumed about yellowcake's knowledge is quite different than what I actually said. So maybe take your own advice.
I write a lot because, unlike you, I understand these things are complex. Were it simpler, I would not need as many words.
Yeah, except addition does mean addition in this case - ask anyone what plain old addition means for a vector and they'll tell you element-wise addition. The website you quoted is a simple example of element-wise addition, and you made it sound as complex as possible because you are desperate to sound smart.
You really don't understand that the illogical-sounding results from that website are due to the vectors themselves, huh. It has zero to do with the definition of +.
Please, tell me more. I was naively under the impression that normal addition had Abelian group properties[0]. Maybe you can inform me as to what the inverse element is. That will get me to change my mind.
You’re lost in abstractions. ‘King’ and ‘queen’ and 'man' etc etc aren’t algebraic symbols, they’re mapped to vectors of real numbers. The model learns those mappings, then we just add and subtract numbers element wise. That’s it. You’re giving a group theory lecture about an operation that’s literally just a[i] + b[i]. The semantics come from training, not from some deep mathematical revelation you think everyone missed.
Yes, I'm in agreement here. But you need to tell me how
a - a + a = b
Use whatever the fuck you want for a. A vector (e.g. [1,2,3]), a number (e.g. 1), an embedding (e.g. [[1,2,3],[4,5,6]]), words (e.g. "man"), I really don't give a damn. You have to tell me why b is a reasonable answer to that equation. You have to tell me how a==b while also a!=b.
Because I expect the usual addition to be
a - a + a = a
This is the last time I'm going to say this to you.
You're telling me I'm lost in abstraction and I'm telling you it is not usual addition because a != b. That's it! That's the whole fucking argument. You literally cannot see the contradiction right in front of you. The only way it is usual addition is if you tell me "man == woman", because that is literally the example from several comments ago. Stop being so smart and just read the damn comment.
Vector embeddings are slightly interesting because they come pre-trained with large amounts of data.
But similar ways to reduce huge numbers of dimensions to a much smaller set of "interesting" dimensions have been known for a long time.
Examples include principal component analysis/singular value decomposition, which was the first big breakthrough in face recognition (in the early 90s) and was also used in latent semantic indexing, the Netflix prize, and a large pile of other things. And the underlying technique was invented in 1901.
Dimensionality reduction is cool, and vector embedding is definitely an interesting way to do it (at significant computational cost).
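For reference, the SVD route to dimensionality reduction is only a few lines of numpy (a sketch; the random matrix here is just a stand-in for whatever high-dimensional features you actually have):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 512))      # 1000 samples, 512 "raw" dimensions

    X_centered = X - X.mean(axis=0)       # center the data, as PCA requires
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

    k = 50                                # keep the top-k principal directions
    X_reduced = X_centered @ Vt[:k].T     # 1000 x 50 reduced representation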
Vector embeddings are so overhyped. They're decent as a secondary signal, but they're expensive to compute and fragile. BM25-based solutions are more robust and WAY lower latency, at the cost of some accuracy loss vs hybrid solutions. You can get the majority of the lift of hybrid solutions with ingest-time semantic expansion / reverse-HyDE-style input annotation feeding a sparse BM25 index, at a fraction of the computational cost.
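A minimal sketch of the sparse side of that, assuming the rank_bm25 package; the "semantic expansion" step is stubbed out as a comment, since how you generate the expansion terms is the whole trick and isn't specified above:

    from rank_bm25 import BM25Okapi  # pip install rank-bm25

    corpus = [
        "how to rotate api keys safely",
        "postgres connection pooling guide",
        "kubernetes pod eviction explained",
    ]

    # Hypothetical ingest-time semantic expansion: append related terms to each
    # document before indexing so sparse matching catches paraphrases, e.g.
    # "rotate api keys" -> also index "rotation credentials secrets".
    tokenized = [doc.split() for doc in corpus]
    bm25 = BM25Okapi(tokenized)

    query = "rotate api keys".split()
    print(bm25.get_scores(query))    # one cheap sparse score per document, no GPU involved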