
Building a library to deduplicate data at scale in Apache Spark, where there is no unique record identifier (i.e. fuzzy/probabilistic matching).

https://github.com/moj-analytical-services/sparklink

It's currently in alpha testing, but the goal is for it to:

- Work at much greater scale than current open source implementations (the ambition is 100 million+ records)

- Get results much faster than current open source implementations - ideally runtimes of less than an hour.

- Have a highly transparent methodology, so the match scores can be easily explained both graphically and in words (it isn't a 'black box')

- Have accuracy similar to the best products on the market (rough sketch of the general approach below).
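For anyone curious what this kind of matching involves under the hood, here's a minimal PySpark sketch of the usual blocking-and-comparison pattern. To be clear, this is not sparklink's API: the column names (record_id, first_name, surname, dob, city), the blocking key and the thresholds are invented for illustration, and a real probabilistic model (Fellegi-Sunter style) would learn weights for the comparisons rather than hard-threshold them.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("dedupe-sketch").getOrCreate()

    # Hypothetical input: one row per record, with a surrogate record_id.
    df = spark.read.parquet("people.parquet")

    left = df.alias("l")
    right = df.alias("r")

    # Blocking: only generate candidate pairs that agree on a cheap key (here,
    # date of birth), so the full n^2 cross join is never materialised.
    # record_id < record_id keeps each unordered pair once.
    pairs = left.join(
        right,
        (F.col("l.dob") == F.col("r.dob"))
        & (F.col("l.record_id") < F.col("r.record_id")),
    )

    # Compare candidate pairs with cheap similarity measures built into Spark SQL.
    scored = pairs.select(
        F.col("l.record_id").alias("id_l"),
        F.col("r.record_id").alias("id_r"),
        F.levenshtein("l.first_name", "r.first_name").alias("fn_dist"),
        F.levenshtein("l.surname", "r.surname").alias("sn_dist"),
        (F.col("l.city") == F.col("r.city")).cast("int").alias("city_match"),
    )

    # Naive decision rule, purely for illustration; a probabilistic model would
    # combine these comparisons into a match score instead.
    matches = scored.where(
        (F.col("fn_dist") <= 2) & (F.col("sn_dist") <= 2) & (F.col("city_match") == 1)
    )
    matches.show()

The blocking step is what makes the 100 million record target even thinkable: without it you'd be looking at on the order of 5 x 10^15 candidate pairs.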



I'm currently working on a solution for larger data sets that matches each record with a binary score (0/1). I'm using Redis with the Bloom filter module. This works, in that query results are sub-second, but the data ingestion/filter population side is comparatively slow (<100 MB/s). Another blocker is having to use multiple filters to query across multiple sets, which just multiplies the resources needed. Does Spark have any advantages or specialized filters for this use case? (I have nil experience with Spark, but am ready to dig in if it would really help.)
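I haven't used the RedisBloom setup you describe, so take this with a grain of salt, but in Spark the usual way to express that kind of membership test is a join rather than a pre-built filter, and the engine parallelises the ingestion and lookup across the cluster. A rough sketch, with table and column names (records.parquet, reference.parquet, match_key) made up:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("membership-sketch").getOrCreate()

    records = spark.read.parquet("records.parquet")      # the large set to score
    reference = spark.read.parquet("reference.parquet")  # the set being matched against

    # left_semi keeps only the records whose key appears in the reference set;
    # broadcast() ships the smaller side to every executor, avoiding a shuffle
    # when it fits in memory.
    present = records.join(broadcast(reference), on="match_key", how="left_semi")

    # For a 0/1 flag per record instead of a filtered set: left join plus fillna.
    flags = records.join(
        broadcast(reference.select("match_key").distinct().withColumn("is_match", F.lit(1))),
        on="match_key",
        how="left",
    ).fillna(0, subset=["is_match"])

    flags.show()

If you genuinely want a Bloom filter, the Scala/Java API exposes one as a sketch (DataFrameStatFunctions.bloomFilter, if I remember right), but for a 0/1 flag across several reference sets the joins above are usually simpler, and multiple sets just become multiple joins or a union of the reference tables.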



