Interesting. All developers I know who tinkered around with embeddings and vector similarity scoring were instantly hooked. The efficiency of computing the embeddings once and then reusing them as many times as needed, comparing the vectors with a cheap <30-line function, is extremely appealing. Not to mention the indexing capabilities to make it work at scale.
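For anyone who hasn't done this, here's a minimal sketch of the kind of cheap comparison function meant here (pure Python, no libraries; a real system would use numpy or an index, this is just to show how little work the comparison itself is):

    import math

    def cosine_similarity(a, b):
        # The expensive part (producing the embedding vectors a and b) happened
        # once, up front. Comparing them is just a dot product and two norms.
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b)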
IMO vector embedding is the most important innovation in computing of the last decade. There's something magical about it. These people deserve some kind of prize. The idea that you can reduce almost any intricate concept including whole paragraphs to a fixed-size vector which encapsulates its meaning and proximity to other concepts across a large number of dimensions is pure genius.
Vector embedding is not an invention of the last decade. Featurization in ML goes back to the 60s; even deep learning-based featurization is decades old at a minimum. Like everything else in ML, this became much more useful with data and compute scale.
Yes, but it doesn't generalize very well, even on simple features like gender. If you go look at embeddings you'll find that man and woman are neighbors, just as king and queen are[0]. That proximity is a better explanation for the result: you're just taking very small steps in the latent space.
Here, play around[1] (a sketch for reproducing these locally follows the examples below):
mother - parent + man = woman
father - parent + woman = man
father - parent + man = woman
mother - parent + woman = man
woman - human + man = girl
Or some that should be trivial
woman - man + man = girl
man - man + man = woman
woman - woman + woman = man
Working in very high dimensions is funky stuff. Embedding high dimensions into low dimensions results in even funkier stuff
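To reproduce this kind of arithmetic locally, something like the following works with gensim (the model name here is just one of the small pretrained sets gensim-data ships, not the 300-d model the website uses, so the exact neighbors you get will differ):

    import gensim.downloader as api

    # Small pretrained GloVe vectors; the website uses different, larger vectors.
    model = api.load("glove-wiki-gigaword-50")

    # most_similar does the element-wise add/subtract, then returns the nearest
    # remaining words by cosine similarity. Note that gensim drops the query
    # words themselves from the returned neighbors.
    print(model.most_similar(positive=["mother", "man"], negative=["parent"], topn=3))  # mother - parent + man
    print(model.most_similar(positive=["woman", "man"], negative=["man"], topn=3))      # woman - man + man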
Calling it addition is hairy here. Do you just mean an operator? If so, I'm with you. But normally people are expecting addition to have the full abelian group properties, which this certainly doesn't. It's not a ring because it doesn't have the multiplication structure. But it also isn't even a monoid[0] since, as we just discussed, it doesn't have associativity nor unitality.
There is far less structure here than you are assuming, and that's the underlying problem. There is local structure and so the addition operation will work as expected when operating on close neighbors, but this does greatly limit the utility.
And if you aren't aware of the terms I'm using here I think you should be extra careful. It highlights that you are making assumptions that you weren't aware were even assumptions (an unknown unknown just became a known unknown). I understand that this is an easy mistake to make since most people are not familiar with these concepts (including many in the ML world), but this is also why you need to be careful. Because even those that do are probably not going to drop these terms when discussing with anyone except other experts as there's no expectation that others will understand them.
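For readers who haven't met these terms, the checklist being invoked is just the standard one from abstract algebra (nothing embedding-specific about it):

    \begin{align*}
    \text{Associativity:}           \quad & (a \oplus b) \oplus c = a \oplus (b \oplus c) \\
    \text{Identity (unitality):}    \quad & \exists\, e : \ a \oplus e = e \oplus a = a \\
    \text{Inverses (group):}        \quad & \forall a \ \exists\, a^{-1} : \ a \oplus a^{-1} = e \\
    \text{Commutativity (abelian):} \quad & a \oplus b = b \oplus a
    \end{align*}

Plain element-wise vector addition satisfies all of these; the dispute further down the thread is about whether the end-to-end word-in, word-out operation does.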
I think you misinterpreted the tone of my original comment as some sort of gotcha. Presumably you're overloading the addition symbol with some other operational meaning in the context of vector embeddings. I'm just calling it addition because you're using a plus sign and I don't know what else to call it; I wasn't referring to addition as it's commonly understood, which is clearly associative.
It's just plain old addition. There is nothing fancy about the operation. The fancy part is training a model such that it would produce vector representations of words which had this property of conceptually making sense.
If someone says: "conceptually, what is king - man + woman". One might reasonably say "queen". This isn't some well defined math thing, just sort of a common sense thing.
Now, imagine you have a function (let's call it an "embedding model") which turns words into vectors. The function turns king into [3,2], man into [1,1], woman into [1.5, 1.5] and queen into [3.5, 2.5].
Now for king - man + woman you get [3,2] - [1,1] + [1.5,1.5] = [3.5, 2.5] and hey presto, that's the same as queen [3.5, 2.5].
Now you have to ask - how do you get a function to produce those numbers? If you look at the word2vec paper, you'll come to see they use a couple of methods to train a model and if you think about those methods and the data, you'll realize it's not entirely surprising (in retrospect) that you could end up with a function that produced vectors which had such properties. And, if at the same time you are sort of mind blown, welcome to the club. It blew Jeff Dean's big brain too.
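That toy example in runnable form, just to make "plain old element-wise addition" concrete (these 2-d vectors are made up for illustration, as above, not real embeddings):

    king  = [3.0, 2.0]
    man   = [1.0, 1.0]
    woman = [1.5, 1.5]
    queen = [3.5, 2.5]

    # king - man + woman, element by element
    result = [k - m + w for k, m, w in zip(king, man, woman)]
    print(result)           # [3.5, 2.5]
    print(result == queen)  # True, because the toy numbers were chosen that way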
I'm sorry, but I think you are overestimating your knowledge.
Have you gone through abstract algebra? Are you familiar with monoids, groups, rings, fields, algebras, and so on?
Because it seems you aren't aware that these structures exist and are a critical part of mathematics. It's probably why you're not understanding the conversation. @yellocake seems to understand that "addition" doesn't mean 'addition' (sorry, I assumed you meant how normal people use the word lol). You may not realize it, but you're already showing that addition doesn't have a single meaning. 1+1 = 2, but [1,0] + [0,1] = [1,1], and (1+0i) + (0+i) = 1+i. The operator symbol is the same but the operation actually isn't.
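The same point in Python, where + already means different things depending on the operands:

    import numpy as np

    print(1 + 1)                                # 2: integer addition
    print(np.array([1, 0]) + np.array([0, 1]))  # [1 1]: element-wise vector addition
    print((1 + 0j) + (0 + 1j))                  # (1+1j): complex addition
    print([1, 0] + [0, 1])                      # [1, 0, 0, 1]: same symbol, list concatenation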
> Now for king - man + woman you get [3,2] - [1,1] + [1.5,1.5] = [3.5, 2.5] and hey presto, that's the same as queen [3.5, 2.5].
The same as? Or is queen the closest?
If it were just "plain old addition" then @yellowcake (or me![0]) wouldn't have any confusion. Because
man - man + man
= (man - man) + man
= 0 + man
= man != woman
We literally just proved that it isn't "plain old addition". So stop being overly confident and look at the facts.
>>> Vector addition is absolutely associative
This is commonly true, but not necessarily. Floating point arithmetic is not associative.
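A quick demonstration of that, in case the floating point caveat sounds theoretical:

    a, b, c = 1e16, -1e16, 1.0
    print((a + b) + c)  # 1.0
    print(a + (b + c))  # 0.0 -- the 1.0 gets absorbed into -1e16 before it can cancel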
> you'll realize it's not entirely surprising that you could end up with a function that produced vectors which had such properties
Except it doesn't work as well as you think, and that's the issue. There are many examples of it working, and this is indeed surprising, but the effect does not generalize. If you go back to Jeff's papers you'll find some reasonable assumptions that are also limiting. Go look at "Distributed Representations of Words and Phrases and their Compositionality"[1] and look at Figure 2. See anything interesting? Notice that the capitals aren't always the closest? You might notice Ankara is closer to Japan than Tokyo is. You'll also notice that the lines don't all point in the same direction, so if we assume the space is well defined then clearly we aren't following the geodesic. There's a second issue you probably didn't notice: PCA only works on linear structure, yet the model is not linear. There aren't many details on what they did for the PCA, but it is easy to add information implicitly, and there's a good chance that happened here. The model is also still facing the usual problems with metrics in high-dimensional spaces, where notions such as distance become ill-defined.
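If you want to poke at the capitals example yourself, something along these lines works with gensim. Caveats: this checks the full 300-d space rather than the paper's 2-d PCA projection, so it's a related sanity check rather than a reproduction of the figure, and the model download is large.

    import gensim.downloader as api

    # The Google News word2vec vectors from roughly that era (~1.6 GB download).
    model = api.load("word2vec-google-news-300")

    # Raw nearest neighbors of "Japan": is Tokyo anywhere near the top?
    print(model.most_similar("Japan", topn=10))

    # The classic country -> capital analogy: Germany is to Berlin as Japan is to ?
    print(model.most_similar(positive=["Japan", "Berlin"], negative=["Germany"], topn=5))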
I've met Jeff and even talked with him at length. He's a brilliant dude and I have no doubt about that. But I don't believe he thinks this works in general. I'm aware he isn't a mathematician, but anyone who plays around with vector embeddings will experience the results I'm talking about. He certainly seems to understand that there are major limits to these models, but also that just because something has limits doesn't mean it isn't useful. The paper says as much and references several works that go into that even further. If you've misinterpreted me as saying embeddings are not useful then you're sorely mistaken. But neither should we talk about tools as if they are infallible and work perfectly. All that does is make us bad tool users.
[0] I also have no idea what mathematical structure vector embeddings follow. I'm actually not sure anyone does. This is definitely an under-researched domain despite being very important. The issue applies even to modern LLMs! But good luck getting funding for that kind of research. You're going to have a hard time getting it at a big lab (despite the high value), and you don't have the time in academia unless you're tenured, but then you've got students to prioritize.
Maybe spend more time reading a response than writing. Yellowcake doesn't know what you are talking about either (note the "pulling teeth" comment).
The examples you gave are a result of the embedding model in question not producing vectors which would map to most people's conceptual view of the world. Go through the website you quote from and see for yourself - it's just element-wise addition.
The examples I gave are entirely made-up, 2-dimensional vectors to explain what plain old addition means (i.e., plain old "add the vectors element-wise") in the context of embedding models. And yes, it's "the same as", because I defined it that way. Your website uses 300 dimensions, not 2.
As I mentioned, not all embedding models work the same way (or, as you've said, "this doesn't generalize"). They get trained differently, on different data. The word "similar" is used very loosely.
You even directly quote me and don't seem to be able to read the quote. The word "could" is there. You could end up with a model which had these nice properties.
The entire point of my post was to highlight that yellowcake's confusion arises because he assumes your examples are the result of some esoteric definition of addition, when they're not.
> Maybe spend more time reading a response than writing.
Quite ironic considering
> Yellowcake doesn't know what you are talking about either
I actually said
>> @yellocake seems to understand that "addition" doesn't mean 'addition'
Which is entirely based off of
>>>>>> Presumably you're overloading the addition symbol
I didn't assume their knowledge, they straight up told me and I updated my understanding based on that. That's how conversations work. And the fact that they understand operator overloading doesn't mean they understand more either. Do they understand monoids, fields, groups, and rings? Who knows? We'll have to let yellowcake tell us.
Regardless, what you claim I assumed about yellowcake's knowledge is quite different than what I actually said. So maybe take your own advice.
I write a lot because, unlike you, I understand these things are complex. Were it simpler, I would not need as many words.
Yeah, except addition does mean addition in this case - ask anyone what plain old addition means for a vector and they'll tell you element-wise addition. The website you quoted is a simple example of element-wise addition, and you made it sound as complex as possible because you are desperate to sound smart.
You really don't understand that the illogical-sounding results from that website are due to the vectors themselves, huh. It has zero to do with the definition of +.
Please, tell me more. I was naively under the impression that normal addition had Abelian group properties[0]. Maybe you can inform me as to what the inverse element is. That will get me to change my mind.
You’re lost in abstractions. ‘King’ and ‘queen’ and 'man' etc etc aren’t algebraic symbols, they’re mapped to vectors of real numbers. The model learns those mappings, then we just add and subtract numbers element wise. That’s it. You’re giving a group theory lecture about an operation that’s literally just a[i] + b[i]. The semantics come from training, not from some deep mathematical revelation you think everyone missed.
Yes, I'm in agreement here. But you need to tell me how
a - a + a = b
Use whatever the fuck you want for a. A vector (e.g. [1,2,3]), a number (e.g. 1), an embedding (e.g. [[1,2,3],[4,5,6]]), words (e.g. "man"), I really don't give a damn. You have to tell me why b is a reasonable answer to that equation. You have to tell me how a==b while also a!=b.
Because I expect the usual addition to be
a - a + a = a
This is the last time I'm going to say this to you.
You're telling me I'm lost in abstraction and I'm telling you it is not usual addition because a != b. That's it! That's the whole fucking argument. You literally cannot see the contradiction right in front of you. The only way it is usual addition is if you tell me "man == woman", because that is literally the example from several comments ago. Stop being so smart and just read the damn comment.
Vector embeddings are slightly interesting because they come pre-trained with large amounts of data.
But similar ways to reduce huge numbers of dimensions to a much smaller set of "interesting" dimensions have been known for a long time.
Examples include principal component analysis/singular value decomposition, which was the first big breakthrough in face recognition (in the early 90s) and was also used in latent semantic indexing, the Netflix prize, and a large pile of other things. And the underlying technique was invented in 1901.
Dimensionality reduction is cool, and vector embedding is definitely an interesting way to do it (at significant computational cost).
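For reference, the SVD route to dimensionality reduction is only a few lines of numpy (a sketch; the random matrix here is just a stand-in for whatever high-dimensional features you actually have):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 512))      # 1000 samples, 512 "raw" dimensions

    X_centered = X - X.mean(axis=0)       # center the data, as PCA requires
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

    k = 50                                # keep the top-k principal directions
    X_reduced = X_centered @ Vt[:k].T     # 1000 x 50 reduced representation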
Vector embeddings are so overhyped. They're decent as a secondary signal, but they're expensive to compute and fragile. BM25-based solutions are more robust and WAY lower latency, at the cost of some accuracy loss vs hybrid solutions. You can get the majority of the lift of hybrid solutions with ingest-time semantic expansion / reverse-HyDE-style input annotation feeding a sparse BM25 index, at a fraction of the computational cost.
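A minimal sketch of the sparse side of that, assuming the rank_bm25 package; the "semantic expansion" step is stubbed out as a comment, since how you generate the expansion terms is the whole trick and isn't specified above:

    from rank_bm25 import BM25Okapi  # pip install rank-bm25

    corpus = [
        "how to rotate api keys safely",
        "postgres connection pooling guide",
        "kubernetes pod eviction explained",
    ]

    # Hypothetical ingest-time semantic expansion: append related terms to each
    # document before indexing so sparse matching catches paraphrases, e.g.
    # "rotate api keys" -> also index "rotation credentials secrets".
    tokenized = [doc.split() for doc in corpus]
    bm25 = BM25Okapi(tokenized)

    query = "rotate api keys".split()
    print(bm25.get_scores(query))    # one cheap sparse score per document, no GPU involved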