This looks very useful for our database-heavy teams.
Getting this information is certainly already possible, but there is a bit of a barrier in front of it. You need to realize the query is slow, then re-run it with the right EXPLAIN and/or ANALYZE incantation with 8-9 parameters, paste the output into a query visualizer, and only then do you get a nice, easily digested overview of what is going on.
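For illustration, a minimal sketch of the kind of incantation I mean, assuming Python with psycopg2; the connection string and query are made-up placeholders, and the exact EXPLAIN options you need depend on the visualizer:

    # Re-run a suspect query with EXPLAIN so a plan visualizer can digest it.
    # DSN and query are hypothetical placeholders.
    import json
    import psycopg2

    conn = psycopg2.connect("dbname=app user=readonly")
    with conn.cursor() as cur:
        cur.execute(
            "EXPLAIN (ANALYZE, BUFFERS, VERBOSE, FORMAT JSON) "
            "SELECT * FROM foo_report WHERE customer_id = %s",
            (42,),
        )
        plan = cur.fetchone()[0]  # the JSON plan, paste-able into a visualizer
        print(json.dumps(plan, indent=2))
    conn.close()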
Teams either don't know how to do that, or don't do it, due to permissions or because it's a hassle. Having a slow "calculateFooReport()" trace go straight to a bunch of slow SequentialScan and NestedLoop nodes would remove one excuse from that equation.
Kinda bummed that we're updating out of the supported versions starting next month.
I find it important to include system information here as well, so that just copy-pasting an invocation from system A to system B does not work.
For example, our database restore script has a parameter `--yes-delete-all-data-in` and it needs to be parametrized with the PostgreSQL cluster name. So a command with `--yes-delete-all-data-in=pg-accounting` works on exactly one system and not on other systems.
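A minimal sketch of that pattern, with hypothetical names (not our actual script): the destructive flag has to name the local cluster, so a command pasted from another system refuses to run.

    # Sketch of a destructive CLI that must be told which cluster it may wipe.
    # The flag and the cluster-name lookup are hypothetical placeholders.
    import argparse
    import socket
    import sys

    def local_cluster_name() -> str:
        # Stand-in for however the real script identifies its cluster,
        # e.g. reading a config file; here it is derived from the hostname.
        return "pg-" + socket.gethostname().split(".")[0]

    parser = argparse.ArgumentParser(description="Restore a database dump.")
    parser.add_argument("--yes-delete-all-data-in", required=True,
                        metavar="CLUSTER", dest="cluster")
    args = parser.parse_args()

    if args.cluster != local_cluster_name():
        sys.exit(f"Refusing to run: this system is '{local_cluster_name()}', "
                 f"not '{args.cluster}'.")

    print(f"Restoring into {args.cluster} ...")  # destructive work goes here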
When we designed the (by now largely self-hosted) stack for our production environment, we had that discussion. And honestly, on the persistence side, most people agreed that PostgreSQL, S3 and a file system for some special services is plenty. Maybe add some async queueing as well. Add some container scheduling, the usual TLS/edge load balancing and some monitoring, and you have a fairly narrow stack that can run a lot of applications with different purposes and customers.
We (10 people) run this plus CI on just a VM and storage provider: mostly vSphere from our sister team of 6 (and yes, it hurts, and we have no time to move it), Hetzner, and some legacy things on AWS.
Though that's currently the problem -- there is a somewhat steep minimum investment of time into this. But that's good, because it means there could be value for European cloud providers in building up this narrow stack as a managed offering and getting paid for it. We will see.
> I would wager the main reason for this is the same reason it’s also hard to teach these skills to people: there’s not a lot of high quality training for distributed debugging of complex production issues. Competence comes from years of experience fighting fires.
Beyond a certain size, the search space for a cause can also be big. Very big.
Like, at work we're at the beginning of where the power law starts going nuts. Somewhere around 700-1000 services in production, across several datacenters, with a few dozen infrastructure clusters behind it. For each bug, if you looked into it, there'd probably be 20-30 changes, 10-20 anomalies, and 5 weird things someone noticed in the 30 minutes around it.
People already struggle at triaging the relevance of everything in this context. That's something I can see AI start helping with, and there were some talks about Meta doing just that - ranking changes and anomalies in order of relevance to a bug ticket so people don't run after other things.
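Just to make concrete what I imagine such ranking could look like (a toy sketch, not Meta's actual system; all fields and weights are invented):

    # Toy sketch: rank recent changes by relevance to an incident.
    from dataclasses import dataclass
    from datetime import datetime, timedelta

    @dataclass
    class Change:
        service: str
        time: datetime
        description: str

    def relevance(change, incident_services, incident_start):
        score = 0.0
        if change.service in incident_services:
            score += 2.0                       # touches an affected service
        minutes = abs((incident_start - change.time).total_seconds()) / 60
        score += max(0.0, 1.0 - minutes / 30)  # close in time to the incident
        return score

    incident_start = datetime(2024, 5, 1, 12, 0)
    changes = [
        Change("billing", incident_start - timedelta(minutes=5), "config rollout"),
        Change("search", incident_start - timedelta(hours=3), "library bump"),
    ]
    for c in sorted(changes, reverse=True,
                    key=lambda c: relevance(c, {"billing"}, incident_start)):
        print(c.service, c.description)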
That is, however, just the reactive part of ops and SRE work. The proactive part is much harder and oftentimes not technical. What if most negatively rated support cases run into a dark hole in a certain service, but the responsible team never allocates time to improve monitoring, because sales is on their butt for features? Maybe LLMs can identify this, or help them implement the tracing faster, but those 10 minutes could also be spent on features for money.
And what AI model told you to collect the metrics about support cases and resolution to even have that question?
Good alert deduplication and dependency rules are worth so much. "Dear alerting, don't start throwing a fit about those 600 systems over there if you can't even reach the firewall all traffic to those systems goes through". Suddenly you don't get throttled by your SMS provider for the volume of alerts it tries to send, and instead just get one very spicy message.
Snark aside, this also impacts resolution time, because done well, it instantly points out the most critical problem instead of all the consequences of one big thing breaking. "Dear operator, don't worry about the hundreds of apps, the database cluster is down".
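A toy sketch of that dependency rule, just to make the idea concrete (the topology and host names are made up):

    # Suppress alerts for hosts whose upstream dependency is itself down.
    # Made-up topology: everything sits behind one edge firewall.
    depends_on = {
        "app-01": "fw-edge",
        "app-02": "fw-edge",
        "db-01": "fw-edge",
        "fw-edge": None,  # top of the chain
    }

    def root_causes(down):
        """Keep only the alerts whose upstream dependency is still reachable."""
        return {host for host in down if depends_on.get(host) not in down}

    down_hosts = {"app-01", "app-02", "db-01", "fw-edge"}
    print(root_causes(down_hosts))  # -> {'fw-edge'}: one spicy page, not 600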
They used to, but I wouldn't want to go back to that. Believe me, compilers that continue and try their best are a massive improvement in many cases, allowing you to fix more issues between compilation attempts.
Perhaps; I don't really program much C/C++. But in my experience most of the subsequent errors are due to the first error. So even where there might be several places I could fix the code, my standard operating practice is to find the first error, fix that, and see what it cleans up.
But like I said, I am not much of a C programmer. The compiler authors feel strongly about pushing past all possible errors and keep doing it, so perhaps there is merit to this practice. But it bugs the heck out of me.
98% of the time those lengthy messages are useless, but the other 2% of the time they're critical to tracking down the problem.
A year or two ago Visual Studio added a pop-up that parses such lengthy compiler messages into a clickable tree list. I found it annoying at first, until I discovered I could dock it to the side, ignore it 98% of the time, but still go look at the details when relevant. This is an idea other compilers should copy.
Maybe ships should copy this approach too: issue fewer warnings, but provide a list of warning details for review when necessary.
Also, an EV is only as green as the grid. Hamburg's public transportation is heavily investing in electric buses, because a bus is expected to function for 10-15 years. Meaning, a diesel bus built today will be as polluting in 2035 as it is today, though they are also looking at alternatives there. But an electric bus will become cleaner and cleaner over time.
> For example, Azure Standard_E192ibds_v6 is 96 cores with 1.8 TB of memory and 10 TB of local SSD storage with 3 million IOPS.
That's a well-stocked Dell server going for ~50-60K capex without storage, before the RAM prices exploded. I'm wondering a bit about the CPU in there, but the storage + RAM is fairly normal and nothing crazy. I'm pretty sure you could have that in a rack for 100k hardware pricing.
> When they tell their base managers to crack the whip and force them to give the whole “you are not working hard enough, tighten up. Shorter lunches, clock in 5 minutes early, etc” speech to the base employees, they will absolutely feel resentment and do LESS work, not more.
The most influential question from team lead trainings over the years has been: Do you trust your employee to want to complete the task and purpose they have, or do you need to control them? There are a few names for this, Theory X and Theory Y mainly.
And don't be snide and just say that the current economy forces you to work due to wages. A lot of people I know would just create their own creative work if they had all the money in the world. So yeah, I think if you frame a person's job and purpose in the company right, you can trust them to work. This may not work in all industries, but in tech it seems to hold.
An example where this is, in my experience, good guidance: someone starts slipping on their metrics, whatever those are. Comes in late, is hard to reach remotely. Naturally they should get slapped with the book, right? Nah. If you assume they want to work well, the first question should be: why, what is going on?
In a lot of cases, there will be something going on in their private life they are struggling with. If you help them with that, or at least help them navigate work around this, you will end up with a great team member.
Like one guy on the team recently had some trouble during the last legs of building a house and needed more flexible time. We could've been strict and told him to push through and take his entire annual vacation to manage it, even though he just needed to be able to jump away for an hour or two here and there. Instead we made sure to schedule simple work for him, had him focus more on educating his sidekicks, tracked the total time away, and then booked it as 3-4 days at the end. Now it's a fun story in the team's lore that they are fond of, having navigated that together, instead of one guy sulking about having lost all of his vacation to that nonsense.
> In a lot of cases, there will be something going on in their private life they are struggling with. If you help them with that, or at least help them navigate work around this, you will end up with a great team member.
Note that there already has to be a pretty high level of trust between that employee and their manager for this to work; if I don't feel like I can trust my manager, I will absolutely keep my lips zipped about anything not directly work related.
Oh absolutely, and it would be my responsibility to build this. In fact, I don't even need details. I just prefer to know about a team member's situation and have a plan around it before clients, internal customers, our boss and HR start coming knocking with hard questions or worse.
I'm now somewhat interested in the study to see how they accounted for possible hidden factors.
If a team lead or manager spent the time to track birthdays and took time out of their day to have a 10 minute chat with someone on their birthday, they probably exhibit a number of other behaviors that could be summarized as "treating their employees as humans". That's the kind of boss people tend to like working with and will possibly go another mile for.
If tolerating your boss during a normal day takes 9 of your 12 spoons of energy for the day, it takes very little further push to be spiteful. At worst, they may force you to find another workplace with a better boss.
This is a study from an elite institution published in a respectable journal in the social sciences. Certainly they took the time to perform a controlled experiment and assigned managers at random to deliver the birthday cards late or on time. That would be cheap to do and minimally invasive for the human subjects.
[Reads abstract]
They didn't? It's a pure observational study that one measure of sloppiness in the organisation correlates with another? What do we pay these guys for?
Per the abstract it's a "dynamic difference-in-differences" analysis, which likely means they look at whether the employee's behavior changes after the event. But establishing causation with it still requires quite a few assumptions.
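Roughly, the workhorse behind that is a regression with a treatment/post interaction. A minimal sketch with fabricated toy data and statsmodels, not the paper's actual specification:

    # Minimal difference-in-differences sketch; the data is invented.
    # 'treated' = employee whose card was late, 'post' = after the birthday.
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.DataFrame({
        "productivity": [10, 10, 10, 8, 11, 11, 11, 11],
        "treated":      [1,  1,  1,  1,  0,  0,  0,  0],
        "post":         [0,  1,  0,  1,  0,  1,  0,  1],
    })
    # The coefficient on treated:post is the DiD estimate; reading it causally
    # still hinges on assumptions like parallel trends.
    model = smf.ols("productivity ~ treated * post", data=df).fit()
    print(model.params)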
PNAS is kinda known for headline-grabbing research with, at times, somewhat less rigorous methodology.
> Certainly they took the time to perform a controlled experiment and assigned managers at random to deliver the birthday cards late or on time. That would be cheap to do and minimally invasive for the human subjects.
If the results are true, it would actually be quite expensive because of the drop in productivity. It could also be a bit of a nightmare to push through ethical review.
They could start by observing the rate at which birthday cards are delivered on time, and not vary too much from that.
I suppose the impact on productivity isn't known in advance, and it might be that failing to receive a birthday card from a normally diligent manager costs the company more in productivity than it gains from a sloppy manager unexpectedly giving one on time.
However, if at some point it somehow shines through that this is just another checklist item being ticked off, without actual sincerity behind it, this all goes down the drain, and the time would be better spent on actual work environment improvements rather than wet handshakes and a pseudo "we are a family".
This has been my understanding for e.g. European chips as well:
First you subsidize and support the creation of currently not commercially viable chip fabs on-shore. Literally handing companies money under some obligation into the future.
Eventually the on-shore chips are produced, but they have a higher total cost of ownership for the users: logistics may be cheaper due to shorter distances, or more expensive because the paths are not well-trodden yet. Production costs like labor, water and energy could be higher. And the chips could just have a higher failure rate, because the kinks in the new processes still need to be worked out.
But to get local consumers to switch to these chips, one applies tariffs to other sources of chips so the on-shore chips artificially become more competitive, until they actually become cheaper and competitive on their own.
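With made-up numbers, the mechanism looks something like this (purely illustrative; none of these are actual chip prices or tariff rates):

    # Illustrative arithmetic only; all numbers are invented.
    offshore_price = 10.00   # established fab, cheaper today
    onshore_price = 12.50    # new subsidized fab, higher cost at first
    tariff = 0.30            # tariff applied to the off-shore chips

    offshore_landed = offshore_price * (1 + tariff)
    print(f"off-shore landed cost: {offshore_landed:.2f}")  # 13.00
    print(f"on-shore cost:         {onshore_price:.2f}")    # 12.50
    # The on-shore chip is now the cheaper buy, giving the new fab time to
    # drive its real costs down until the tariff is no longer needed.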
The way tariffs are being threatened here doesn't fit that use case at all, from my limited understanding.