Hey HN,
I hope some smart brains here can point me in the right direction. I am looking for a simple yet robust data replication and transfer mechanism for a web application I am writing.
The main things I need:
1) Replicate login & configuration data to X nodes (email servers)
2) Count emails by status (dropped, forwarded, ...) [statistics]
3) Save (some) emails back to the Web application [logging]
4) 2 & 3 are A LOT of data. The system must be able to handle that volume without losing any of it
The guys from the postgres IRC made me realize that multimaster is not only overkill (and many people are afraid of running it) but also simply not the right solution.
1) is easy using Postgres master-slave replication
2 & 3 however are not so easy in my head. And 4 is kind of out of my knowledge scope.
I've thought about the following 3 implementations:
a) Doing a master-master replication, with the secondary master doing the hard work: replicating to the other nodes, plus receiving statistics (directly over the network). I am not sure how smart it is to use the same database, even more so in such an approach.
b) Doing statistics (and most likely also the email logs) in a separate database that provides a thin API my application can query & cache. Statistics are everywhere in my interface, so I would likely still replicate/cache the relevant data back to my main application. But the heavy writing would not directly impact my web application.
c) Maybe using something like logstash to handle the information load and drip the relevant info out into the web application's database.
I realize this topic is just hard, however I feel like I am missing something obvious.
Write all of your emails into a kafka topic from your webapp. Read from the topic to do processing. Use flume to sync results back to your webapp db.
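To make the shape of that pipeline concrete, here is a minimal in-memory sketch in Python. A `queue.Queue` stands in for the Kafka topic (in production you would use a real producer/consumer client such as kafka-python); the event fields are made up for illustration:

```python
import json
import queue

# Stand-in for a Kafka topic. In production this would be a
# KafkaProducer / KafkaConsumer pair; queue.Queue just shows the flow.
topic = queue.Queue()

def produce_email_event(status, subject):
    """Webapp side: append an email event to the topic."""
    topic.put(json.dumps({"status": status, "subject": subject}))

def consume_batch(max_events=100):
    """Processing side: drain up to max_events events from the topic."""
    events = []
    while not topic.empty() and len(events) < max_events:
        events.append(json.loads(topic.get()))
    return events

produce_email_event("dropped", "spam offer")
produce_email_event("forwarded", "weekly report")
batch = consume_batch()
```

The point is the decoupling: the webapp only appends to the topic, and every downstream concern (stats, logging, syncing back) is a separate consumer.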
1) For this I would probably use something like Chef/Ansible, but I don't know the first thing about configuring email servers. You could have something that wakes up, reads the latest config off a topic, and then applies that config via a config management tool.
2) You can throw Apache Spark onto the Kafka stream to calculate these aggregations.
3) Flume can read the emails and then save them back to wherever you need (this is typically s3/postgres for me). Flume can scale out over the kafka topic naturally using the same consumer group id.
I like this approach because you can scale it cheaply and easily: start with Kinesis streams instead of Kafka if you don't have the ops resources to run Kafka, and run Spark in standalone mode until you need a cluster.
With spark you can do your statistics in there (streaming over a time window or batch) and then sink them over to your stats db.
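The windowed counting Spark would do can be sketched in plain Python with a tumbling window over event timestamps. This is only an illustration of the aggregation, not Spark's API; the `ts`/`status` field names are assumptions:

```python
from collections import Counter, defaultdict

def count_by_status(events, window_seconds=60):
    """Group events into tumbling windows and count statuses per window.

    Each event is a dict with a unix timestamp 'ts' and a 'status'
    field (illustrative names, not from any real schema).
    """
    windows = defaultdict(Counter)
    for e in events:
        window_start = e["ts"] - (e["ts"] % window_seconds)
        windows[window_start][e["status"]] += 1
    return dict(windows)

events = [
    {"ts": 0, "status": "dropped"},
    {"ts": 10, "status": "forwarded"},
    {"ts": 70, "status": "dropped"},
]
stats = count_by_status(events)
# Two windows: [0, 60) holds the first two events, [60, 120) the third.
```

Each window's counters are what you would sink over to the stats db.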
With the flume/kafka combo you can treat kafka as the "channel" and you get some nice transaction functionality out of flume that makes handling failures a breeze.
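The "read a batch, write it downstream, only then commit" pattern behind that transaction functionality looks roughly like this. It's a generic at-least-once sketch, not Flume's actual API; in a real sink `write_fn` would be a postgres/s3 write and `commit_fn` would commit consumer offsets:

```python
def sink_with_retry(batch, write_fn, commit_fn, max_attempts=3):
    """Write a batch downstream; commit only on success, so a failed
    write is retried rather than lost (at-least-once delivery)."""
    for attempt in range(1, max_attempts + 1):
        try:
            write_fn(batch)   # e.g. INSERT rows into postgres
            commit_fn()       # e.g. commit Kafka consumer offsets
            return True
        except Exception:
            if attempt == max_attempts:
                raise
    return False

# Toy downstream that fails once, then succeeds:
written, commits = [], []
calls = {"n": 0}

def flaky_write(batch):
    calls["n"] += 1
    if calls["n"] == 1:
        raise IOError("transient failure")
    written.extend(batch)

ok = sink_with_retry(["email-1", "email-2"],
                     flaky_write,
                     lambda: commits.append(True))
```

Because the commit happens only after the write succeeds, a crash mid-batch just means the batch is re-delivered, which is exactly the failure handling the poster needs for requirement 4.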
It does take some tooling/monitoring to run confidently, and the whole Apache "big data" ecosystem is daunting at first, but it's well worth it in my opinion.