
That’s why in NLP we use term frequency times inverse document frequency (tf-idf). It gives you a measure of how common uncommon things are.

Wonder how you’d implement that in a heat map. Just call each pixel a document and see where it takes you?



People have been critiquing the collaborative filtering aspect of this work vs content analysis ("[why use stars instead of code similarity]") but there's something elegant about the simplicity of using fewer priors here.

A tf*idf weighting could be applied to the star-feature matrix too. Document = GitHub repo. Term = name of a user who starred it.

THUS, users who overstar are simply less important for computing similarities.

This would mitigate the phenomenon of massively popular GitHub repos being clustered together because of folks who blithely star the most well-known stuff.
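
A rough sketch of what that could look like, not the original project's code. The `stars` layout, the function names, and the plain `log(N / df)` form of idf are all my own assumptions; tf is binary here since a user can only star a repo once.

```python
import math
from collections import defaultdict

def star_tfidf(stars):
    """stars: dict of repo name -> set of users who starred it (hypothetical layout).
    Returns dict of repo name -> {user: tf*idf weight}."""
    n_repos = len(stars)
    # Document frequency of a "term" (user) = number of repos that user starred.
    df = defaultdict(int)
    for users in stars.values():
        for user in users:
            df[user] += 1
    # tf is 1 for every star; idf = log(N / df), so a user who stars
    # every repo gets weight log(1) = 0 and drops out of the comparison.
    return {
        repo: {user: math.log(n_repos / df[user]) for user in users}
        for repo, users in stars.items()
    }

def cosine(a, b):
    """Cosine similarity between two sparse {user: weight} vectors."""
    dot = sum(w * b[user] for user, w in a.items() if user in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Made-up data: alice stars everything, erin only stars the two small repos.
stars = {
    "linux": {"alice", "bob", "carol"},
    "react": {"alice", "dave"},
    "tiny-lib": {"alice", "erin"},
    "tiny-fork": {"alice", "erin"},
}
weights = star_tfidf(stars)
popular_pair = cosine(weights["linux"], weights["react"])
niche_pair = cosine(weights["tiny-lib"], weights["tiny-fork"])
```

With this toy data the two popular repos share only alice, whose weight is zero, so their similarity vanishes, while the two small repos stay similar through erin.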


Winsorize the data points to clamp the outliers and then divide them by the population count for the case of the heatmap?
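
Something like this, as a minimal sketch. The function names, the 5%/95% cutoffs, and the nearest-rank way of picking the percentile are my assumptions, not anything from the post.

```python
def winsorize(values, lower_pct=0.05, upper_pct=0.95):
    """Clamp values to the [lower_pct, upper_pct] percentile range.
    Percentiles are picked by simple index truncation (an assumption)."""
    ordered = sorted(values)
    n = len(ordered)
    low = ordered[int(lower_pct * (n - 1))]
    high = ordered[int(upper_pct * (n - 1))]
    # Outliers are pulled in to the cutoff value rather than dropped.
    return [min(max(v, low), high) for v in values]

def per_capita(counts, populations, lower_pct=0.05, upper_pct=0.95):
    """Winsorize the raw counts first, then divide by population per cell."""
    clipped = winsorize(counts, lower_pct, upper_pct)
    return [c / p if p else 0.0 for c, p in zip(clipped, populations)]

# Made-up heatmap cells: one cell has a wildly high raw count.
counts = [1, 2, 3, 4, 100]
populations = [1, 2, 3, 4, 4]
clipped = winsorize(counts, lower_pct=0.0, upper_pct=0.75)
rates = per_capita(counts, populations, lower_pct=0.0, upper_pct=0.75)
```

Doing it in the other order (divide first, then winsorize the rates) is also defensible; this just follows the order in the comment.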



