> Website, API, Git (ssh and https), Pages, Registry, CI/CD, Background Processi...

m712 · on Nov 28, 2019

The status page has been updated to indicate that they misconfigured their firewall. Apparently their entire set of services goes through a single firewall (or at least, multiple firewalls with the same config). It's worrying that they don't have a staging setup for these kinds of things.

(NOTE: I am speculating here; if they do have a staging system and this wasn't reproduced there, then the last sentence doesn't apply.)

ownagefool · on Nov 28, 2019

Likely just means they have a Single Point of Failure.

Some guesses would be:

Automation/orchestration - Maybe they've migrated to k8s (I don't believe they've actually done this yet); either way, their orchestration/automation tool could have rolled out a broken change everywhere.

Database/Auth - Pretty much everything in GitLab touches the database as far as I'm aware. Otherwise, how do you check whether users are auth'd to take some action? You wouldn't expect this to break the static website, i.e. the sales landing pages, but these could be based off an internal CMS, or could be checking for a "guest" role session.

DNS/Service Discovery - As a sibling posted, "it's always DNS". It's good practice to use names for services instead of IP addresses, but this means your DNS needs to generally work, or everything will go down. Service Discovery could rely on DNS, but it could also be an API call that finds out DNS addresses or IP addresses directly.

CDN - You wouldn't typically put this in front of auth'd usage, and typically a CDN might not be helpful in front of something like SSH, but a quick look at Fastly suggests they might support this. The main downside is sharing all the user data / auth tokens with the CDN.

Security Product / CA - All you need is a requirement to encrypt internal traffic and rotate secrets, and you end up with a secret store that sits in the middle of everything.

Storage Layer - I believe they were big on Ceph for a while. If everything is backed by Ceph, everything will go down if Ceph fails.

Obviously, whatever it is, you'd expect them to split up their failover plan a bit more in the future if it is something like that, but usually there's a single point of failure somewhere.

ownagefool · on Nov 28, 2019

Replying to myself, because it's now on their status page that a firewall change took down the database.

This points to there being:

- A lack of process and testing on key networking changes. Aren't they doing CI/CD, automated testing and peer review for this?

- A SPOF in the database; why couldn't things connect to a secondary for a read-only mode?
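
To make the second point concrete, here's a rough sketch of what a read-only fallback could look like. This is not how GitLab does it; `connect_with_fallback` and the connect callables are hypothetical names, and I'm assuming the driver raises `ConnectionError` when a host is unreachable (real drivers usually raise their own exception types).

```python
from typing import Callable, Iterable, Tuple


class AllDatabasesUnreachable(Exception):
    """Raised when neither the primary nor any secondary accepts connections."""


def connect_with_fallback(
    connect_primary: Callable[[], object],
    connect_secondaries: Iterable[Callable[[], object]],
) -> Tuple[object, bool]:
    """Return (connection, read_only).

    Tries the primary first. If it is unreachable, falls back to the first
    reachable secondary and flags the connection as read-only so the
    application can disable writes.
    """
    try:
        return connect_primary(), False
    except ConnectionError:
        pass  # primary unreachable, e.g. blocked by a bad firewall rule

    for connect in connect_secondaries:
        try:
            return connect(), True
        except ConnectionError:
            continue  # this secondary is down too, try the next one

    raise AllDatabasesUnreachable("no primary or secondary reachable")
```

The caller would check the `read_only` flag and reject pushes, merges and so on while still serving reads. It only helps if the secondary is reachable over a path that the broken change didn't also hit.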

Quite a lot of the time, things break for stupid reasons. The main difference is when a normal company does something stupid, they can hide it, lie about it, or make it sound more complex.

The fact that GitLab publishes their fuck-ups is supposed to force them to do a better job, actually look at root causes, and apply proper fixes that we can all judge. I wouldn't hold any particular fuck-up against them.

wbl · on Nov 28, 2019

Network devices are generally hostile to advanced automation, and if they had both primary and secondary as the same class of machine then the changes would apply to both.

ownagefool · on Nov 28, 2019

I believe they're hosting in-cloud, which means it's probably not a device and can be automated. Obviously, public IP addresses will be specific to environments, but that's what PRs should double check.
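
As a sketch of the kind of check a PR pipeline could run, assuming the rules live in some reviewable form: the `Rule` model, first-match evaluation and default-deny behaviour below are all made up for illustration and don't correspond to any particular cloud provider's API.

```python
import ipaddress
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Rule:
    action: str  # "allow" or "deny"
    source: str  # source range in CIDR notation
    port: int    # destination port


def is_allowed(rules: Sequence[Rule], source_ip: str, port: int) -> bool:
    """Evaluate rules top to bottom; the first match wins, default is deny."""
    addr = ipaddress.ip_address(source_ip)
    for rule in rules:
        if rule.port == port and addr in ipaddress.ip_network(rule.source):
            return rule.action == "allow"
    return False


def broken_flows(
    rules: Sequence[Rule], required: Sequence[Tuple[str, int]]
) -> List[Tuple[str, int]]:
    """Return every required (source_ip, port) flow the rule set would block."""
    return [(ip, port) for ip, port in required if not is_allowed(rules, ip, port)]
```

CI would fail the change if `broken_flows(proposed_rules, required_flows)` comes back non-empty, where `required_flows` lists things like "app servers must reach the database port".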

ktsmith · on Nov 28, 2019

You can look through what happened here: https://gitlab.com/gitlab-com/gl-infra/production/issues/142...

beaconstudios · on Nov 28, 2019

presumably a single point of failure - my guess would be something at the network level.

manojlds · on Nov 28, 2019

It's always DNS!

yyx · on Nov 28, 2019

This time it might be Consul

dordoka · on Nov 28, 2019

Spot on. The status page confirms it's a bad firewall configuration

beaconstudios · on Nov 28, 2019

It's always the network.

RossM · on Nov 28, 2019

At the bottom of the page they list availability of third party services used - Fastly has a warning symbol, and I imagine they put that CDN in front of everything.

bArray · on Nov 28, 2019

> Fastly has a warning symbol, and I imagine they put that CDN in front of everything.

Check their status page; it's just a simple reroute that has been in place since the 13th [1].

[1] https://status.fastly.com/

k_ · on Nov 28, 2019

Latest tweet by @gitlabstatus said "We've identified an issue with database connectivity", which could explain why so many services are impacted.

tyingq · on Nov 28, 2019

Bad firewall change...they just updated the page.