Hashing or tiny neural nets, combined with a vector search engine using Tanimoto/Jaccard similarity, is a very common deduplication strategy for large datasets. It might be wiser than running linear-complexity MapReduce operations.

There is a nice Google project using a 0.5M-parameter RETSim model and the USearch engine for that: https://github.com/google/unisim
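To make the pattern concrete, here is a rough sketch: bit-fingerprint each document, pack the bits, and let USearch do the Tanimoto-based neighbor lookup. The `fingerprint` helper below is a toy SimHash-style stand-in for whatever hashing or embedding model you would actually use (RETSim, MinHash, etc.); the USearch calls follow its Python binary-vector API as in its docs.

    import hashlib
    import numpy as np
    from usearch.index import Index, MetricKind

    def fingerprint(text: str, nbits: int = 256) -> np.ndarray:
        # Toy SimHash-style fingerprint: hash character 3-grams and vote on each bit.
        # A real pipeline would use MinHash signatures or model embeddings instead.
        votes = np.zeros(nbits, dtype=np.int64)
        for i in range(max(len(text) - 2, 1)):
            digest = hashlib.blake2b(text[i:i + 3].encode(), digest_size=nbits // 8).digest()
            bits = np.unpackbits(np.frombuffer(digest, dtype=np.uint8))
            votes += np.where(bits == 1, 1, -1)
        return (votes > 0).astype(np.uint8)

    docs = ["the quick brown fox", "the quick brown fox!!", "something else entirely"]
    bits = np.vstack([fingerprint(d) for d in docs])
    packed = np.packbits(bits, axis=1)  # USearch takes binary vectors packed into uint8

    index = Index(ndim=256, metric=MetricKind.Tanimoto)
    index.add(np.arange(len(docs)), packed)

    # Distance near 0 means near-duplicate; threshold it to form duplicate clusters.
    for doc_id, doc in enumerate(docs):
        matches = index.search(packed[doc_id], 3)
        print(doc_id, list(zip(matches.keys, matches.distances)))

The index build and the k-NN queries are both roughly n log n with an HNSW-style index, which is where the win over pairwise comparison schemes comes from.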
