Hashing or tiny neural nets, combined with a vector-search engine using Tanimoto/Jaccard similarity, is a very common deduplication strategy for large datasets. It can be a wiser choice than linear-complexity MapReduce operations.
There is a nice Google project that uses the 0.5M-parameter RETSim model together with the USearch engine for exactly this: https://github.com/google/unisim
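For a concrete sense of the approach, here is a minimal sketch of the hashing + Tanimoto-search idea, assuming the `usearch` Python package (its `Index`/`MetricKind` API) and NumPy. The character n-gram hashing, the 1024-bit fingerprint width, and the 0.3 distance threshold are illustrative assumptions, not part of RETSim or unisim.

```python
import zlib
import numpy as np
from usearch.index import Index, MetricKind

NDIM = 1024  # fingerprint width in bits (assumed value, tune per corpus)

def fingerprint(text: str, ndim: int = NDIM) -> np.ndarray:
    """Hash character 3-grams into a fixed-width binary fingerprint."""
    bits = np.zeros(ndim, dtype=np.uint8)
    for i in range(len(text) - 2):
        gram = text[i : i + 3].encode("utf-8")
        bits[zlib.crc32(gram) % ndim] = 1
    return np.packbits(bits)  # binary vectors are stored as packed bytes

docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumped over the lazy dog",  # near-duplicate of the first
    "an entirely different sentence about databases",
]
fingerprints = np.vstack([fingerprint(d) for d in docs])

# Tanimoto distance over packed bit-vectors; the approximate index makes the
# neighbor lookup sub-linear instead of an all-pairs comparison.
index = Index(ndim=NDIM, metric=MetricKind.Tanimoto)
index.add(np.arange(len(docs)), fingerprints)

matches = index.search(fingerprints, 2)  # top-2 neighbors for every document
for i, (keys, dists) in enumerate(zip(matches.keys, matches.distances)):
    for key, dist in zip(keys, dists):
        if key != i and dist < 0.3:  # 0.3 threshold is an arbitrary assumption
            print(f"doc {i} ~ doc {int(key)}  (Tanimoto distance {dist:.2f})")
```

A learned encoder such as RETSim would simply replace the `fingerprint()` step with model inference; the index-and-query flow stays the same.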