
Building a library to deduplicate data at scale in Apache Spark, where there is no unique record identifier (i.e. fuzzy/probabilistic matching).

https://github.com/moj-analytical-services/sparklink

It's currently in alpha testing, but the goal is for it to:

- Work at much greater scale than current open source implementations (the ambition is 100 million+ records)

- Get results much faster than current open source implementations - ideally runtimes of less than an hour.

- Have a highly transparent methodology, so the match scores can be easily explained both graphically and in words (it isn't a 'black box')

- Have accuracy similar to the best products on the market (rough sketch of the general approach below).
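For anyone curious what this kind of matching involves under the hood, here's a minimal PySpark sketch of the usual blocking-and-comparison pattern. To be clear, this is not sparklink's API: the column names (record_id, first_name, surname, dob, city), the blocking key and the thresholds are invented for illustration, and a real probabilistic model (Fellegi-Sunter style) would learn weights for the comparisons rather than hard-threshold them.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("dedupe-sketch").getOrCreate()

    # Hypothetical input: one row per record, with a surrogate record_id.
    df = spark.read.parquet("people.parquet")

    left = df.alias("l")
    right = df.alias("r")

    # Blocking: only generate candidate pairs that agree on a cheap key (here,
    # date of birth), so the full n^2 cross join is never materialised.
    # record_id < record_id keeps each unordered pair once.
    pairs = left.join(
        right,
        (F.col("l.dob") == F.col("r.dob"))
        & (F.col("l.record_id") < F.col("r.record_id")),
    )

    # Compare candidate pairs with cheap similarity measures built into Spark SQL.
    scored = pairs.select(
        F.col("l.record_id").alias("id_l"),
        F.col("r.record_id").alias("id_r"),
        F.levenshtein("l.first_name", "r.first_name").alias("fn_dist"),
        F.levenshtein("l.surname", "r.surname").alias("sn_dist"),
        (F.col("l.city") == F.col("r.city")).cast("int").alias("city_match"),
    )

    # Naive decision rule, purely for illustration; a probabilistic model would
    # combine these comparisons into a match score instead.
    matches = scored.where(
        (F.col("fn_dist") <= 2) & (F.col("sn_dist") <= 2) & (F.col("city_match") == 1)
    )
    matches.show()

The blocking step is what makes the 100 million record target even thinkable: without it you'd be looking at on the order of 5 x 10^15 candidate pairs.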



I'm currently working on a solution for larger data sets that matches each record with a binary score (0/1). I'm using Redis with the Bloom filter module. This works, in that query results are sub-second, but the data ingestion/filter population side is comparatively slow (<100 MB/s). Another blocker is having to use multiple filters to query across multiple sets, which just multiplies the resources needed. Does Spark have any advantages or specialized filters for this use case? (I have nil experience with Spark, but am ready to dig in if it would really help.)
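I haven't used the RedisBloom setup you describe, so take this with a grain of salt, but in Spark the usual way to express that kind of membership test is a join rather than a pre-built filter, and the engine parallelises the ingestion and lookup across the cluster. A rough sketch, with table and column names (records.parquet, reference.parquet, match_key) made up:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("membership-sketch").getOrCreate()

    records = spark.read.parquet("records.parquet")      # the large set to score
    reference = spark.read.parquet("reference.parquet")  # the set being matched against

    # left_semi keeps only the records whose key appears in the reference set;
    # broadcast() ships the smaller side to every executor, avoiding a shuffle
    # when it fits in memory.
    present = records.join(broadcast(reference), on="match_key", how="left_semi")

    # For a 0/1 flag per record instead of a filtered set: left join plus fillna.
    flags = records.join(
        broadcast(reference.select("match_key").distinct().withColumn("is_match", F.lit(1))),
        on="match_key",
        how="left",
    ).fillna(0, subset=["is_match"])

    flags.show()

If you genuinely want a Bloom filter, the Scala/Java API exposes one as a sketch (DataFrameStatFunctions.bloomFilter, if I remember right), but for a 0/1 flag across several reference sets the joins above are usually simpler, and multiple sets just become multiple joins or a union of the reference tables.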



