I think an audited algorithm in which each type is strictly defined (int32, for example) would, on top of that, really help pin down exactly what can be fed into it so that it stays correct.
On the other hand, when you have a reasonably complex deployment it's easy to get swamped with dashboards showing CPU, memory, I/O, application metrics, signups, active users/sessions, etc.
Instead it's nice to think about how you can express the state of a complete system as a single number. It might be that you divide active user sessions by database connections, and then scale by memory capacity.
As a single number, you can get used to its normal range and keep it always visible somewhere obvious. A single number won't show details, but when it changes you can go look at the specific metrics. It's a cute shorthand, and it can work well as a basic "are we normal" check.
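A minimal sketch of that kind of single-number check, in Python. The function names, the choice to scale by the fraction of memory in use, and the "normal" band are all assumptions for illustration; the right formula depends entirely on the system.

```python
import math


def health_score(active_sessions: int, db_connections: int,
                 memory_used_bytes: int, memory_total_bytes: int) -> float:
    """Sessions per DB connection, scaled by the fraction of memory in use.

    Hypothetical formula: one possible reading of "divide sessions by
    connections, then scale by memory capacity".
    """
    if db_connections <= 0 or memory_total_bytes <= 0:
        return math.nan  # no meaningful score without these inputs
    sessions_per_connection = active_sessions / db_connections
    memory_fraction = memory_used_bytes / memory_total_bytes
    return sessions_per_connection * memory_fraction


def is_normal(score: float, low: float, high: float) -> bool:
    """True when the score sits inside the band you have learned is normal."""
    return low <= score <= high  # NaN compares False, so it counts as abnormal


score = health_score(active_sessions=1200, db_connections=40,
                     memory_used_bytes=8, memory_total_bytes=16)
print(score, is_normal(score, low=10.0, high=20.0))
```

The number means nothing by itself; the value comes from watching it long enough to know what band is normal, then drilling into the individual metrics when it leaves that band.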
splitting the status page like they do, to the point where it is only a bit of humorous exaggeration to say that they track broken `git push` and `git pull` separately, is a sleight of hand / accounting / SLA-fudging that we should not excuse
there is a subset of the site that pretty much everyone uses — git, issues, pull requests, actions — and if any part of that is broken then the site is broken and the status page should indicate how often this happens
> splitting the status page like they do, to the point where it is only a bit of humorous exaggeration to say that they track broken `git push` and `git pull` separately, is a sleight of hand / accounting / SLA-fudging that we should not excuse
This is a pretty ungenerous take. You could look at it the other way: if I don't use actions then it's useful for me to know that only actions are broken, and I can continue in my normal usage. If you bundle everything up then the status page is reporting an unhelpful false positive for me.
you can do both: report a number that shows how often your service as a whole is degraded, with a breakdown for individual components
example (not sponsored, i barely use codex and today's the first time i've ever had to look at this page; i don't know how much they're fudging the individual numbers or not reporting minor incidents):
It’s obviously a meme website; the meme is funnier when the number isn’t high. Anyone looking for actual accurate info would go to the real status page.
Ironically I’ve never found official status pages to be all that accurate either since companies love to exclude all kinds of outages from counting towards uptime. Anthropic is hilariously egregious about that as a recent example I can think of, but I assume GitHub does the same since it’s so common in the industry.
If S3, EC2, EKS and RDS alone had uptime similar to all of GitHub's right now, we'd all know.
No one cares that much if repo wikis, commit stats or gists had these issues. What matters is the set of inter-dependent services that get used together, like PRs, Actions, discussions, etc.
If one were to build a single percentage for each of these components of both systems, GitHub would still lose. Maybe it would gain a few more days without outages, but the two still wouldn't be comparable.
From a user perspective this makes sense. But if you’re MSFT or GitHub this number is pretty embarrassing.
They would love it if everyone on the platform used all of the features and had massive lock-in, right? So if some part of that is always broken, it’s not a confidence booster for users to adopt more of the feature set.
Sure, the more things you use, the more likely it is that one of them has an issue, but clearly stability isn’t a goal for these types of companies anymore.
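To put rough numbers on "the more things you use, the more likely one has an issue", here is a small Python sketch. It assumes the features fail independently and that you need all of them up at once, which real platforms only approximate; the 99% figures are made up for illustration.

```python
from math import prod


def combined_availability(availabilities):
    """Probability that every feature is up at once, assuming independent failures."""
    return prod(availabilities)


# Hypothetical: five features, each individually up 99% of the time.
five_features = [0.99] * 5
print(f"{combined_availability(five_features):.3f}")  # roughly 0.951
```

So five features at 99% each leave you with everything working only about 95% of the time, which is why a platform pushing users toward the full feature set needs each part to be much more reliable than the headline target.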
I think the correct middle ground is a site that lets you select the parts of the platform you rely on and ignore the others. For example, GitHub is "down" for me when I can't push, process PRs, or release packages, but I don't care about Actions or AI features.
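A rough Python sketch of how such a "pick your components" number could be computed. The component names, outage intervals, and 720-hour window are invented for illustration; overlapping outages are merged so that an hour where two selected components are both down counts once.

```python
def merged_downtime(intervals):
    """Total length of the union of (start_hour, end_hour) intervals."""
    total = 0.0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)  # overlapping: extend the run
    if current_end is not None:
        total += current_end - current_start
    return total


def uptime_for(outages, selected, window_hours):
    """Fraction of the window in which none of the selected components was down."""
    intervals = [iv for name in selected for iv in outages.get(name, [])]
    return 1.0 - merged_downtime(intervals) / window_hours


# Hypothetical outage log: hours since the start of a 30-day (720 h) window.
outages = {
    "git": [(10, 12)],
    "pull_requests": [(11, 14)],
    "actions": [(100, 110)],
    "ai_features": [(200, 230)],
}

mine = uptime_for(outages, ["git", "pull_requests"], 720)
everything = uptime_for(outages, list(outages), 720)
print(f"my components: {mine:.4f}, whole site: {everything:.4f}")
```

The same function gives the per-component breakdown (select one name) and the whole-site figure (select all names), so a status page would not have to choose between the two views.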
You’re kind of an outlier - nobody wants AI but Actions are core for tons of workflows and deployment pipelines. Everyone bought into the “only robots can deploy” mantra (correctly IMO, it’s a huge time and friction saver) only to be bitten in the ass by the platform being so unreliable they can be stuck for days without deploys.
They can do this with certificate transparency; otherwise a CA can sign whatever date it wants. But if the CA colludes with a CT log, it can still issue rogue certificates for targeted attacks.
Yes, that's all right, there's already a requirement that they submit to one Google CT log and one non-Google CT log. They thought about it already. The playbook I mentioned they've been rehearsing contains a specific threat against backdating certs: they say they'll distrust immediately if they detect it, and they have means of detecting backdating at significant scale (esp. for LE, which submits 100% of issued certs, not just the subset that is intended for consumption with Chrome).
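As a sketch of what detection could look like in principle (not a description of what any browser vendor actually runs): compare the `notBefore` date a certificate claims with the earliest timestamp at which a CT log recorded it. The `LoggedCert` type, the one-day tolerance, and the sample data are assumptions; CAs commonly backdate by a small amount to absorb clock skew, so some tolerance is needed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass
class LoggedCert:
    """Hypothetical record: a cert's claimed start date plus its CT log timestamps."""
    serial: str
    not_before: datetime
    sct_timestamps: List[datetime]


def looks_backdated(cert: LoggedCert,
                    tolerance: timedelta = timedelta(days=1)) -> bool:
    """Flag certs whose claimed start is much earlier than their first log entry."""
    if not cert.sct_timestamps:
        return True  # never logged at all: treat as suspicious in this sketch
    first_logged = min(cert.sct_timestamps)
    return first_logged - cert.not_before > tolerance


logged_at = datetime(2025, 6, 1, 12, 0, tzinfo=timezone.utc)
honest = LoggedCert("01", logged_at - timedelta(hours=1), [logged_at])
backdated = LoggedCert("02", logged_at - timedelta(days=90), [logged_at])
print(looks_backdated(honest), looks_backdated(backdated))
```

A check like this only works if the log timestamps themselves can be trusted, which is the point of requiring submission to logs run by more than one operator.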