Those would really just be identifiers. I think the key property of embeddings is that the dimensions each individually mean/measure something, and therefore the dot product of two embeddings (similarity of direction of the vectors) is a meaningful similarity measure of the things being represented.
The classic example is word embeddings such as word2vec or GloVe, where, due to the embeddings being meaningful in this way, one can see vector relationships such as "man - woman" = "king - queen".
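If you want to see that concretely, here's a minimal sketch using small pretrained GloVe vectors via gensim's downloader (the model name is just one of the small sets in its catalog; any pretrained word vectors behave similarly):

    # Rough sketch of the analogy property using small pretrained GloVe vectors.
    # The model name is one of gensim's downloadable sets (~66 MB download).
    import gensim.downloader as api

    vectors = api.load("glove-wiki-gigaword-50")

    # "king" - "man" + "woman" lands near "queen" because the man->woman
    # offset is roughly parallel to the king->queen offset.
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))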
> I think the key property of embeddings is that the dimensions each individually mean/measure something, and therefore the dot product of two embeddings (similarity of direction of the vectors) is a meaningful similarity measure of the things being represented.
In this case each dimension is the presence of a word in a particular text. So when you take the dot product of two texts, you are effectively counting the number of words the two texts have in common (subject to some normalization constants, depending on how you normalize the embedding). Cosine similarity still works even for these super naive embeddings, which makes it slightly easier to understand before getting into any mathy stuff.
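To make that concrete, here's a toy sketch (made-up four-word vocabulary, un-normalized binary vectors), where the dot product comes out as the number of shared in-vocabulary words:

    # Toy example: with binary bag-of-words vectors, the dot product of two
    # texts counts the vocabulary words they share.
    vocab = ["cat", "dog", "rain", "sun"]

    def bow(text):
        words = set(text.lower().split())
        return [1 if w in words else 0 for w in vocab]

    a = bow("the cat likes rain")
    b = bow("a dog hates rain")
    print(sum(x * y for x, y in zip(a, b)))  # 1 -> the one shared word, "rain"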
You are 100% right that this won't give you the word embedding analogies like king - man + woman = queen or stuff like that. This embedding has no concept of relationships between words.
But that doesn't seem to be what you are describing in terms of using incrementing indices and adding occurrence counts.
If you want to create a bag of words text embedding then you set the number of embedding dimensions to the vocabulary size and the value of each dimension to the count of the corresponding word in the text.
Heh -- my explanation isn't the clearest I realize, but yes, it is BoW.
E.g. fix your vocab of 50k words (or whatever) and enumerate it.
Then to make an embedding for some piece of text:
1. initialize an all-zero vector of size 50k
2. for each word in the text, add one at the index of the corresponding word (per our enumeration). If the word isn't in the 50k words in your vocabulary, then discard it
3. (optionally) normalize the embedding to unit length (though you don't really need this and can leave it off for the toy example)
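A minimal sketch of those three steps, shrinking the vocab so it fits on screen (words and counts are made up):

    import numpy as np

    # Tiny vocabulary standing in for the 50k words, plus its enumeration.
    vocab = ["rain", "sun", "cat", "dog", "hate", "love"]
    index = {word: i for i, word in enumerate(vocab)}

    def embed(text):
        v = np.zeros(len(vocab))              # step 1: all-zero vector
        for word in text.lower().split():
            if word in index:                 # step 2: count in-vocab words,
                v[index[word]] += 1           #         discard the rest
        norm = np.linalg.norm(v)
        return v / norm if norm > 0 else v    # step 3 (optional): unit-normalize

    print(embed("i hate hate rain"))  # "i" is out of vocabulary and dropped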
Are you talking about sentence/text chunk embeddings, or just embeddings in general?
If you need high quality text embeddings (e.g. to use with a vector DB for text chunk retrieval), then they are going to come from the output of a language model, either a local one or via an embeddings API.
Other embeddings are normally going to be learned in an end-to-end fashion.
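For the local-model route, here's a minimal sketch with the sentence-transformers library (the model name is just a commonly used small default, not a recommendation):

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    chunks = ["How do I reset my password?", "Steps for recovering account access"]
    embeddings = model.encode(chunks, normalize_embeddings=True)

    # With unit-normalized vectors, the dot product is the cosine similarity.
    print(embeddings.shape)               # (2, 384)
    print(embeddings[0] @ embeddings[1])  # fairly high: the chunks are related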
I disagree. In most subjects, recapitulating the historical development of a thing helps motivate modern developments. E.g.:
1. Start with bag of words. Represent each word as a vector that is all zeros except for a single one at that word's index. Then a document is the sum (or average) of all the words in the document. We now have a method of embedding a variable length piece of text into a fixed size vector, and we start to see how "similar" is approximately "close", though clearly there are some issues. We're somewhere at the start of NLP now.
2. One big issue is that there are a lot of common, noisy words (like "good", "think", "said", etc.) that can make two embeddings more similar than we feel they should be. So now we develop strategies for reducing the impact of those words on our vector. Remember how we just summed up the individual word vectors in (1)? Now we'll scale each word vector down by its corpus frequency, so that the more frequent the word is in our corpus, the smaller we make the corresponding word vector. That brings us to tf-idf embeddings.
3. Another big issue is that our representation of words doesn't capture word similarity at all. The sentences "I hate when it rains" and "I dislike when it rains" should be more similar than "I hate when it rains" and "I like when it rains", but with our embeddings from (2) the two pairs are going to score about the same. So now we revisit our method of constructing word vectors and start to explore ways to "smear" words out. This is where things like word2vec and GloVe pop up as methods of creating distributed representations of words. Now we can represent documents by summing/averaging/tf-idfing our word vectors the same as we did in (2); there's a small sketch of this right after this list.
4. Now we notice there is an issue where words can have multiple meanings depending on their surrounding context. Think of things like irony, metaphor, humor, etc. Consider "She rolled her eyes and said, 'Don't you love it here?'" and "She rolled the dough and said, 'Don't you love it here?'". Odds are, the similarity per (3) is going to be pretty high, despite the fact that it's clear these are wildly different meanings. The issue is that our model in (3) just uses a static operation for combining our words, and because of that we aren't capturing the fact that "Don't you love it here?" shouldn't mean the same thing in the first and second sentences. So now we start to consider ways in which we can combine our word vectors differently and let the context affect the way in which we combine them.
5. And that brings us to now, where we have a lot more compute than we did before and access to way bigger corpora, so we can do some really interesting things, but it's all still the same basic steps: breaking down text into its constituent parts, representing those numerically, and then defining a method to combine the various parts to produce a final representation for a document. The above steps help greatly by showing the motivation for each change and understanding why we do the things we do today.
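Here's the small sketch promised in (3): a document vector built as an idf-weighted average of word vectors. The 3-dimensional "word vectors" and idf values are made up purely for illustration; in practice they'd come from word2vec/GloVe training and corpus statistics.

    import numpy as np

    # Made-up word vectors and idf weights, just to show the combination step.
    word_vectors = {
        "hate":    np.array([0.9, 0.1, 0.0]),
        "dislike": np.array([0.8, 0.2, 0.1]),
        "like":    np.array([-0.7, 0.3, 0.1]),
        "rain":    np.array([0.0, 0.9, 0.4]),
    }
    idf = {"hate": 2.0, "dislike": 2.3, "like": 1.8, "rain": 1.5}

    def doc_vector(text):
        words = [w for w in text.lower().split() if w in word_vectors]
        return np.mean([idf[w] * word_vectors[w] for w in words], axis=0)

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cosine(doc_vector("i hate rain"), doc_vector("i dislike rain")))  # ~0.99
    print(cosine(doc_vector("i hate rain"), doc_vector("i like rain")))     # ~0.19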
This explanation is better, because it puts things into perspective, but you don't seem to realize that your 1 and 2 are almost trivial compared to 3 and 4.
At the heart of it are "methods of creating distributed representations of words" -- that's where the magic happens. So I'd focus on helping people understand those methods. Should probably also mention subword methods like BPE, since that's what everyone uses today.
I've noticed that many educators make this mistake: they spend a lot of time explaining very basic, trivial things, then rush over difficult-to-grasp concepts or details.
> We now have a method of embedding a variable length piece of text into a fixed size vector
Question: Is it a rule that the embedding vector must be higher-dimensional than the source text? Ideally 1 token -> a 1000+ length vector? The reason I ask is that it seems like it would lose value as a mechanism if I sent in a 1000-character string and only got, say, a 4-length vector embedding for it. Because only 4 metrics/features can't possibly describe such a complex statement, I thought it was necessary for the dimensionality of the embedding to be higher than the source?
OK, sounds counter-intuitive, but I'll take your word for it!
It seems odd since the basis of word similarity captured in this type of way is that word meanings are associated with local context, which doesn't seem related to these global occurrence counts.
Perhaps it works because two words with similar occurrence counts are more likely to appear close to each other than two words where one has a high count and the other a small count? But this wouldn't seem to work for small counts, and anyway the counts are just being added at the base index rather than making similar-count words closer in the embedding space.
Do you have any explanation for why this captures any similarity in meaning?
> rather than making similar-count words closer in the embedding space.
Ah, I think I see the confusion here. They are describing creating an embedding of a document or piece of text. At the base, the embedding of a single word would just be a vector with a single 1 in it (and zeros everywhere else). There is absolutely no help with word similarity.
The problem of multiple meanings isn't solved by this approach at all, at least not directly.
Talking about the "gravity of a situation" in a political piece makes the text a bit more similar to physics discussions about gravity. But most of the words won't match as well, so your document vector is still more similar to other political pieces than physics.
Going up the scale, here are a few basic starting points that were (are?) the backbone of many production text AI/ML systems.
1. Bag of words. Here your vector has a 1 for words that are present, and 0 for ones that aren't.
2. Bag of words with a count. A little better: now we've got the information that you said "gravity" fifty times, not once. Normalise it so text length doesn't matter and everything fits into 0-1.
3. TF-IDF. It's not very useful to know that you said a common word a lot. Most texts do; what we care about is texts that say it more than you'd expect, so we take into account how often each word appears in the entire corpus.
These don't help with word similarity, but given how simple they are, they're shockingly useful. They have their stupid moments, although one benefit is that it's very easy to debug why they cause a problem.
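For anyone who wants to poke at these, here's a minimal sketch of the three variants using scikit-learn's built-in vectorizers (the corpus is made up):

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    corpus = [
        "gravity of the political situation",
        "gravity bends light in physics",
        "the political situation is tense",
    ]

    bag_of_words   = CountVectorizer(binary=True).fit_transform(corpus)  # 1. present/absent
    word_counts    = CountVectorizer().fit_transform(corpus)             # 2. raw counts (normalisation left out here)
    tfidf_weighted = TfidfVectorizer().fit_transform(corpus)             # 3. counts discounted by corpus frequency

    print(tfidf_weighted.shape)      # (3 documents, vocabulary size)
    print(tfidf_weighted.toarray())  # rows are L2-normalized by default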
It's not pure chance that the above calculus shakes out, but it doesn't have to be that way. If you are embedding on a word-by-word level then it can happen; if the units are a little smaller or larger than words, it's not immediately clear what the calculation is doing.
But the main difference here is that you get one embedding for the document in question, not an embedding per word like word2vec. So it's something more like "document about OS/2 Warp" - "wiki page for IBM" + "wiki page for Microsoft" = "document on Windows 3.1".