
There are other reasons for duplicates in event streams, not just the dupes introduced by at-least-once processing in Kinesis or Kafka workers. We've done a lot of thinking about this (all open-source) at Snowplow; this is a good starting point:

http://snowplowanalytics.com/blog/2015/08/19/dealing-with-du...

Our last release started to tackle dupes caused by bots, spiders and dodgy UUID algos:

http://snowplowanalytics.com/blog/2016/12/20/snowplow-r86-pe...



Hi, Jin here from Amplitude. You are absolutely right that there are other sources of duplicates. Our real-time data store sits behind an event processor (not covered in this blog post) that handles all major event duplication scenarios. This is why the real-time store focuses on duplicates introduced by message bus replays, something that systems such as Druid do not address.
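
To make the replay case concrete, here is a minimal sketch of one common way to drop duplicates caused by message bus replays: remember the highest offset applied per partition and skip anything at or below it. This is an illustration only, not a description of Amplitude's store; the class name `ReplaySafeStore`, the `apply` method and the (partition, offset, payload) tuple shape are all assumptions made for the example.

```python
from typing import Dict, List


class ReplaySafeStore:
    """Hypothetical in-memory store that ignores replayed messages."""

    def __init__(self) -> None:
        # Highest offset already applied, keyed by partition.
        self.last_offset: Dict[int, int] = {}
        self.events: List[str] = []

    def apply(self, partition: int, offset: int, payload: str) -> bool:
        """Apply a message once; return False if it is a replay."""
        if offset <= self.last_offset.get(partition, -1):
            return False  # already applied, so this delivery is a replay
        self.events.append(payload)
        self.last_offset[partition] = offset
        return True


store = ReplaySafeStore()
# Offsets 0 and 1 are delivered twice, as can happen after a worker restart.
deliveries = [(0, 0, "a"), (0, 1, "b"), (0, 0, "a"), (0, 1, "b"), (0, 2, "c")]
results = [store.apply(*d) for d in deliveries]
```

This only covers replays of the same message from the bus. Duplicates that arrive as distinct messages (client retries, bots, colliding UUIDs) would need a separate check, for example on an event ID, which is the kind of work the upstream event processor is described as doing.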



