A good writeup, but quite shocking that this managed to happen in the first place. I'd have expected that an email service provider would have very good monitoring on deliverability and failure reasons on both sending and receiving, and that something like a cloud migration would be done very incrementally to ensure no loss of service.
For this particular issue I would have expected some or all internal email at HEY! to be moved before any customers so that the new system could be tested.
Email is notoriously finicky when it comes to networks, IPs, the cryptography involved, and all sorts of details that are in flux during a cloud migration, and it's also notorious for being difficult to recover from if you accidentally get your email listed in denylists.
I'm glad that they posted a "miss" - but this reads over and over like a sales pitch:
- I created a card in <X> Basecamp
- Someone posted a message in Campfire
- We have our own encryption
- Another message posted in a different Campfire
- Oh, this one uses custom categories!
- Todos in a Basecamp project
I get it, 37signals dogfoods their system. What we don't normally see from other posts is that person/company X posted in Slack, made a ticket in Jira, and then created a todo on their Trello board.
I'm admittedly a bit of a 37signals fanboy, so take this with a grain of salt, but I also love nerding out on seeing what other companies' processes/tools are.
When folks post here with various monitoring/log exploration tools as a part of a postmortem, I always find myself at least doing a quick google on each of the tools to learn more about how they can be used.
I had the same thought that this post was a low-key product showcase, but it also showed how we might incorporate some of their workflow into our process. So, even if it was an advertisement, it added value. Something that is increasingly rare on company blogs as more and more content just smells like SEO bait.
I'm a little surprised this was published. It is hard to sound charitable when writing something like this, but it was such a trivial, obvious fault (moving an email system and then SPF starts failing) that normally things like this are swept under the rug out of embarrassment. Generally that is probably the best path.
While I appreciate the transparency and it's a great write-up, at the same time somehow I leave the post with a worse opinion of 37signals.
> Senior SRE Paul Shuvashish first noticed that these emails weren’t failing DKIM but SPF. [...] This pointed out a flaw in our application-level analysis system: we were assimilating DMARC errors – which can be either because of SPF or DKIM – to DKIM errors. So while the app was doing the right thing nevertheless – marking the email as spam – the insight it was collecting internally was misleading.
I don't agree with 'the app was doing the right thing' here: for a DMARC pass you need either SPF or DKIM to pass with alignment. One of the two is enough.
So an email from a domain with DMARC enabled that passes DKIM but fails SPF should pass. The application should not have rejected the email based on SPF when it was actually DKIM aligned.
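To make the "one of the two is enough" rule concrete, here is a minimal sketch of how a DMARC verdict is derived from the SPF and DKIM results. The names (`AuthResult`, `dmarc_passes`, `failed_mechanisms`) are made up for illustration and are not from HEY's code; it also treats alignment as a precomputed boolean and ignores relaxed vs. strict alignment modes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuthResult:
    """Hypothetical container for per-message authentication results."""
    spf_pass: bool      # SPF check passed for the envelope sender domain
    spf_aligned: bool   # that domain aligns with the From: header domain
    dkim_pass: bool     # at least one DKIM signature verified
    dkim_aligned: bool  # the signing domain aligns with the From: domain


def dmarc_passes(r: AuthResult) -> bool:
    # DMARC passes if EITHER mechanism passes with alignment.
    spf_ok = r.spf_pass and r.spf_aligned
    dkim_ok = r.dkim_pass and r.dkim_aligned
    return spf_ok or dkim_ok


def failed_mechanisms(r: AuthResult) -> list[str]:
    # Report each mechanism separately instead of lumping every
    # DMARC failure under "DKIM", which is the mislabeling the
    # quoted passage describes.
    failed = []
    if not (r.spf_pass and r.spf_aligned):
        failed.append("spf")
    if not (r.dkim_pass and r.dkim_aligned):
        failed.append("dkim")
    return failed


# The case under discussion: SPF broke after the move, DKIM is still fine.
moved = AuthResult(spf_pass=False, spf_aligned=False,
                   dkim_pass=True, dkim_aligned=True)
print(dmarc_passes(moved), failed_mechanisms(moved))
```

Under these assumptions a message with a valid, aligned DKIM signature still passes DMARC even though SPF fails, and the SPF failure is recorded as an SPF failure.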
As someone that works in a team with minimal collaboration software overhead—is there a ton of bloat in their process (Basecamp this, Campfire that, etc.) or is that just the reality of modern software development?
I don’t think there is one example of “modern” software development. Some orgs will have a very minimal process, some won’t. Some teams will have more involved processes than their surrounding org, some less. I’ve seen worse than what is described in the article, I’ve seen better. Strikes me as about middling, especially for a company that dogfoods its own tools.