Hey HN,
I hope some smart brains here can point me in the right direction. I am looking for a simple yet robust data replication and transfer mechanism for a web application I am writing.
The main things I need:
1) Replicate login & configuration data to X nodes (email servers)
2) Count emails by status (dropped, forwarded, ...) [statistics]
3) Save (some) emails back to the Web application [logging]
4) 2 & 3 are A LOT of data. The system must be able to handle that volume without losing any of it
The guys from the postgres IRC made me realize that multimaster is not only overkill (and many people are afraid of running it) but also simply not the right solution.
1) is easy using Postgres master-slave replication
2 & 3 however are not so easy in my head. And 4 is kind of out of my knowledge scope.
I've thought about the following 3 implementations:
a) Doing a master-master replication, with the secondary master doing the hard work: replicating to the other nodes, plus receiving statistics (directly over the network). I am not sure how smart it is to use the same database, even more so in such an approach.
b) Doing statistics (and most likely also the email logs) in a separate database that provides a thin API my application can query & cache. Statistics are everywhere in my interface, so I would likely still replicate/cache the relevant data back to my main application. But the heavy writing would not directly impact my web application.
c) Maybe using something like logstash to handle the information load and drip the relevant info out into the web application's database.
I realize this topic is just hard, however I feel like I am missing something obvious.
Write all of your emails into a kafka topic from your webapp. Read from the topic to do processing. Use flume to sync results back to your webapp db.
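To make the shape of that pipeline concrete, here is a minimal in-memory sketch in Python. A `queue.Queue` stands in for the Kafka topic (in production you would use a real producer/consumer client such as kafka-python); the event fields are made up for illustration:

```python
import json
import queue

# Stand-in for a Kafka topic. In production this would be a
# KafkaProducer / KafkaConsumer pair; queue.Queue just shows the flow.
topic = queue.Queue()

def produce_email_event(status, subject):
    """Webapp side: append an email event to the topic."""
    topic.put(json.dumps({"status": status, "subject": subject}))

def consume_batch(max_events=100):
    """Processing side: drain up to max_events events from the topic."""
    events = []
    while not topic.empty() and len(events) < max_events:
        events.append(json.loads(topic.get()))
    return events

produce_email_event("dropped", "spam offer")
produce_email_event("forwarded", "weekly report")
batch = consume_batch()
```

The point is the decoupling: the webapp only appends to the topic, and every downstream concern (stats, logging, syncing back) is a separate consumer.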
1) For this I would probably use something like Chef/Ansible, but I don't know the first thing about configuring email servers. You could have something that wakes up, reads the latest config off a topic, and then applies that config via a config management tool.
2) You can throw Apache Spark onto the Kafka stream to calculate these aggregations.
3) Flume can read the emails and then save them back to wherever you need (this is typically s3/postgres for me). Flume can scale out over the kafka topic naturally using the same consumer group id.
I like this approach because you can scale it cheaply and easily: start with Kinesis streams instead of Kafka if you don't have the ops resources to run Kafka, and run Spark in standalone mode until you need a cluster.
With spark you can do your statistics in there (streaming over a time window or batch) and then sink them over to your stats db.
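The windowed counting Spark would do can be sketched in plain Python with a tumbling window over event timestamps. This is only an illustration of the aggregation, not Spark's API; the `ts`/`status` field names are assumptions:

```python
from collections import Counter, defaultdict

def count_by_status(events, window_seconds=60):
    """Group events into tumbling windows and count statuses per window.

    Each event is a dict with a unix timestamp 'ts' and a 'status'
    field (illustrative names, not from any real schema).
    """
    windows = defaultdict(Counter)
    for e in events:
        window_start = e["ts"] - (e["ts"] % window_seconds)
        windows[window_start][e["status"]] += 1
    return dict(windows)

events = [
    {"ts": 0, "status": "dropped"},
    {"ts": 10, "status": "forwarded"},
    {"ts": 70, "status": "dropped"},
]
stats = count_by_status(events)
# Two windows: [0, 60) holds the first two events, [60, 120) the third.
```

Each window's counters are what you would sink over to the stats db.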
With the flume/kafka combo you can treat kafka as the "channel" and you get some nice transaction functionality out of flume that makes handling failures a breeze.
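The "read a batch, write it downstream, only then commit" pattern behind that transaction functionality looks roughly like this. It's a generic at-least-once sketch, not Flume's actual API; in a real sink `write_fn` would be a postgres/s3 write and `commit_fn` would commit consumer offsets:

```python
def sink_with_retry(batch, write_fn, commit_fn, max_attempts=3):
    """Write a batch downstream; commit only on success, so a failed
    write is retried rather than lost (at-least-once delivery)."""
    for attempt in range(1, max_attempts + 1):
        try:
            write_fn(batch)   # e.g. INSERT rows into postgres
            commit_fn()       # e.g. commit Kafka consumer offsets
            return True
        except Exception:
            if attempt == max_attempts:
                raise
    return False

# Toy downstream that fails once, then succeeds:
written, commits = [], []
calls = {"n": 0}

def flaky_write(batch):
    calls["n"] += 1
    if calls["n"] == 1:
        raise IOError("transient failure")
    written.extend(batch)

ok = sink_with_retry(["email-1", "email-2"],
                     flaky_write,
                     lambda: commits.append(True))
```

Because the commit happens only after the write succeeds, a crash mid-batch just means the batch is re-delivered, which is exactly the failure handling the poster needs for requirement 4.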
It does take some tooling/monitoring to run confidently, and the whole Apache "big data" ecosystem is daunting at first, but it's well worth it in my opinion.