Word2Vec Explained. Explaining the Intuition of Word2Vec (towardsdatascience.com)
108 points by ColinWright on March 27, 2022 | 28 comments



Am I alone in really disliking Towards Data Science?

While their articles always look nice, the content is written quickly by data scientists wanting to polish their resumes, with the ultimate aim of rapidly generating content for TDS that will match every conceivable data-science-related search. This post clearly exists solely so that TDS can take the top spot for "Word2vec explained" (which it has). As evidence of this tactic, note that there is already a TDS post, "Word2vec made easy" [0], offering nothing substantially different from this one.

The problem is that the content is almost never useful; it just looks nice on a first skim. The authors, at no real fault of their own, are eager novices who rarely have a new perspective to add to a topic. It's not uncommon to find huge conceptual errors (or at least gaps) in the content there.

I personally encourage everyone at every level to write about what they can, but the issue is that TDS has manipulated this population of eager data scientists in order to dominate search results on nearly every DS-related topic they can cover, which has made searching for anything tedious.

Compare this post to the fantastic work of Jay Alammar [1]. Jay's post is truly excellent, covering many interesting details about word2vec and pairing them with superb visuals.

I'm assuming TDS will fold as soon as DS stops being a "hot" topic (which I think will happen in the relatively near future), and I will personally be glad to see the web rid of their low-signal blog spam.

[0] https://towardsdatascience.com/word2vec-made-easy-139a31a4b8...

[1] https://jalammar.github.io/illustrated-word2vec/


TDS is a banned domain on HN: https://news.ycombinator.com/from?site=towardsdatascience.co...

It's unusual that this article got vouched.


I thought this particular article gave a balanced, high-level overview, along with enough detail and references to provide a good starting point.

Yes, perhaps it's a bit lightweight, but as an introduction I thought it did a good job.


Unusual in the sense that getting vouched is an outlier, not necessarily as an indication of the article's quality.


What makes you say it's banned? From that link it doesn't seem to be.


Turn on showdead. (But even if you don't, the fact that the last submission before this one was 3 months ago is a hint.)


Agree on poor TDS quality.

You should check out Amazon MLU's interactive explainers; they're like mini NYT articles on different algorithms:

https://mlu-explain.github.io/


As another recommendation, Distill is higher level, and has less topic coverage, but their article quality is fantastic:

https://distill.pub/



Also take a look at Semantris:

https://research.google.com/semantris


Also:

https://transorthogonal-linguistics.herokuapp.com/TOL/boy/ma...

(Which can reproduce an old XKCD about the 'purity' of other scientific fields compared to math: https://twitter.com/RadimRehurek/status/638531775333949440)


If you think Word2Vec is cool, try googling X2Vec with whatever X comes to mind. You can turn almost anything into a vector using the right neural network. At my previous job I created a LinkedPage2Vec and used it to find look-alike people for B2B marketing and sales. I also played a bit with Code2Vec.


I have been tempted to try word2vec-like techniques on e-commerce shopping carts as a way to find a particular type of recommendation. I suspect the data will be too sparse, though.

Has anyone applied similar techniques to non-text corpora?


It works…for some definition of “works”. It’s been applied to all kinds of problems—including graphs (Node2Vec) and many other cases where the input isn’t “words”—to the point that I’d consider it a weak baseline for any embedding task. In my experience it is unreasonably effective for simple problems (make a binary classifier for tweets), but the effectiveness drops quickly as the problem gets more complicated.

In your proposed use case I would bet that you will “see” the kind of similarity you’re looking for based on vector similarity, but I also expect it to largely be an illusion due to confirmation bias. It will be much harder to make that similarity actionable to solve the actual business use case. (Like 30% of the time it’ll work like magic; 60% of the time it’ll be “meh”; 10% of the time it’ll be hilariously wrong.)
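
For reference, that kind of weak baseline really is only a few lines with gensim and scikit-learn. A minimal sketch (the toy tweets, labels, and choice of pretrained vectors below are mine, not anything canonical):

    # Weak baseline: average word vectors per tweet -> logistic regression.
    import numpy as np
    import gensim.downloader as api
    from sklearn.linear_model import LogisticRegression

    wv = api.load("glove-twitter-25")  # small pretrained vectors; downloads on first use

    def embed(text):
        # Average the vectors of in-vocabulary tokens; zeros if nothing matches.
        vecs = [wv[w] for w in text.lower().split() if w in wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

    tweets = ["i love this phone", "worst service ever", "great movie", "terrible food"]
    labels = [1, 0, 1, 0]  # 1 = positive sentiment

    clf = LogisticRegression().fit([embed(t) for t in tweets], labels)
    print(clf.predict([embed("love the food")]))

With a real dataset this usually gets you a respectable-but-beatable score, which is exactly what you want from a baseline.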


I've been looking at ways to use transformer-based models on tabular data. The hope is that these models have a much better contextual understanding of words, so embeddings from these models should be of better quality than plain word2vec ones.


Same here. Find any good resources? I've been leaning on autoencoders to encode better than word2vec and its ilk.


Network node embeddings are the best for tabular data. I maintain a library for this here, but there are plenty of good alternatives:

https://github.com/VHRanger/nodevectors
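
A minimal example with it (API as in the current README; parameter names may differ across versions):

    import networkx as nx
    from nodevectors import Node2Vec

    G = nx.fast_gnp_random_graph(n=100, p=0.05)  # toy graph standing in for your data

    g2v = Node2Vec(n_components=32)  # embedding dimension
    g2v.fit(G)                       # random walks + word2vec under the hood

    vec = g2v.predict(0)             # embedding vector for node 0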


My idea is to turn a table row into a textual description, feed it into a transformer, and get what is effectively a sentence embedding. This is effectively a query embedding. Then make a couple of value embeddings for the target you are trying to predict, use cosine similarity to pick the right value embedding, and feed that to the ML model as part of the feature set. It works if the categorical values in your table are entities that the model might have learned.

I tried this approach and it did improve the overall performance. The next step would be fine-tuning the transformer model; I want to see if I can do it without disturbing the existing weights too much. Here's the library I used to get the embeddings:

https://www.sbert.net/
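
A rough sketch of the idea with that library (the model name, row, and candidate values below are just illustrative):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Verbalize a table row into a textual description (the "query").
    row = {"employer": "NASA", "title": "flight engineer", "city": "Houston"}
    query = ", ".join(f"{k}: {v}" for k, v in row.items())

    # Candidate value embeddings for the target being predicted.
    candidates = ["aerospace", "retail", "finance"]

    q_emb = model.encode(query, convert_to_tensor=True)
    c_emb = model.encode(candidates, convert_to_tensor=True)

    scores = util.cos_sim(q_emb, c_emb)[0]   # cosine similarity per candidate
    best = candidates[int(scores.argmax())]  # closest value embedding
    # scores (or best) then go into the feature set alongside the other columns.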


For sparser data, you should just do normal network node embeddings.

Look into node2vec libraries, for instance.


You may find this Airbnb paper relevant. They use skip-grams to generate feature vectors for their listings.

https://www.kdd.org/kdd2018/accepted-papers/view/real-time-p...


I have applied it to e-commerce shopping carts and it works quite well :). The item IDs (words) viewed in sequence during a session can be thought of as a sentence.
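
Concretely, the training data looks like this with gensim (item IDs made up):

    from gensim.models import Word2Vec

    # Each "sentence" is the ordered list of item IDs viewed in one session.
    sessions = [
        ["item_12", "item_7", "item_7", "item_93"],
        ["item_7", "item_93", "item_41"],
        ["item_12", "item_41", "item_8"],
    ]

    model = Word2Vec(sessions, vector_size=64, window=5, min_count=1, sg=1)

    # Items co-viewed in similar sessions end up close together:
    print(model.wv.most_similar("item_7", topn=3))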


I have applied it to the names in a population database. It learned interesting, and expected, structure. Visualized with UMAP, it clustered by gender first, and then by something that could probably be described as the cultural origin of the name.


For me the key point to understanding what's going on (assuming I got it) is that the hidden layer "has" to produce similar representations for words that appear in the same contexts, so the output layer can predict them.

The intuition behind doc2vec is a bit harder to grasp. I understand the role of the "paragraph vector": it provides context to the prediction, so for "the ball hit the ---" in a basketball text the classifier would predict "rim", and in a football one "goalpost" (simplifying). But I still don't get why similar texts get similar latent representations.
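
A toy experiment at least makes the claim concrete. With gensim's Doc2Vec (made-up corpus, so results are noisy at this scale), documents drawn from similar vocabularies should come out close in the latent space:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    corpus = [
        "the ball hit the rim and bounced off the backboard",
        "he dunked the ball over the rim",
        "the striker shot the ball past the goalpost",
        "the keeper tipped the ball onto the goalpost",
    ]
    docs = [TaggedDocument(text.split(), [i]) for i, text in enumerate(corpus)]

    model = Doc2Vec(docs, vector_size=32, min_count=1, epochs=200)

    # Document 0 (basketball) should rank document 1 above the football texts.
    print(model.dv.most_similar(0))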


The problem is, words are not the issue; concepts are. And understanding meaning requires both causal and conceptual understanding of a spacetime model of the world. That's why the word2vec approach to NLP is truly a dead end, although some associations can be gleaned.


word2vec is 9 years old, hardly a "recent breakthrough".


I'm a little surprised that that's the most useful and constructive thing you can say about the article.


Since it's 2022, use Sentence Transformers to embed short phrases. They are leaps above w2v. Or just use any model from Hugging Face. It's just 10 lines of code, really easy to start with.

https://sbert.net

https://huggingface.co
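
The 10 lines look roughly like this (the model name is one popular choice, not the only one):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    phrases = ["word embeddings", "distributed word representations", "pizza recipe"]
    embeddings = model.encode(phrases)

    # The two related phrases score far higher than the unrelated one.
    print(util.cos_sim(embeddings[0:1], embeddings[1:]))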


Although Hacker News can reach Stack Overflow levels of "just use X lol" reductiveness, using Transformers for NLP is indeed the best answer for all in terms of performance, speed, and ease of implementation.

A weakness of TDS monopolizing data science SEO is that it's hard for better techniques to surface.



