That's really interesting. Using statistical process control for failure rates in HPC systems sounds like a very solid approach.
In your experience, were there usually early signals in metrics before job failures increased? For example, patterns like latency changes, resource saturation, or network anomalies.
I'm trying to understand whether those signals appear consistently enough to detect issues before incidents actually happen.
For the mean time to failure, I based it on a section of Mastering Statistical Process Control by Tim Stapenhurst, specifically the section on using SPC to measure earthquakes and other rare events. The system worked well and ran for years: using R, I built a free system to monitor all the job-scheduler information for our HPC systems. I presented the most egregious offenders in a daily Pareto chart and would try to shame the code owners who appeared at the top of it. Mostly, though, I just did not want people reaching for their go-to excuse of blaming the system administrators when the real cause was their own recent code update.
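The ranking behind that daily Pareto chart is simple to sketch. My actual system was in R, but here is an illustrative Python version using made-up owner/job failure records (all names and counts are hypothetical):

```python
from collections import Counter

def pareto_table(failures):
    """Given (owner, job_id) failure records, return owners sorted by
    failure count with cumulative percentages: the data behind a
    daily Pareto chart of who is failing most."""
    counts = Counter(owner for owner, _job in failures)
    total = sum(counts.values())
    table, cumulative = [], 0
    for owner, n in counts.most_common():  # descending by count
        cumulative += n
        table.append((owner, n, round(100 * cumulative / total, 1)))
    return table

# Hypothetical day's failure records: (code_owner, job_id)
failures = [("alice", 101), ("alice", 102), ("alice", 103),
            ("bob", 104), ("bob", 105), ("carol", 106)]
print(pareto_table(failures))
# [('alice', 3, 50.0), ('bob', 2, 83.3), ('carol', 1, 100.0)]
```

The cumulative-percentage column is what makes it a Pareto view: it shows how few owners account for most of the failures.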
There were other SPC charts one could drill down into to look at job run times, which nodes the jobs ran on, and so on. But working the culture to get people to take responsibility for their applications was a little outside my wheelhouse, and always a challenge. For the few people who really embraced their application ownership and wanted to make sure things ran well, it was rewarding. It was always useful to be able to say something like, "Your job used to crash three times a year, and now it seems to be crashing six times a year." At least then we had a concrete starting point for discussing potential causes.
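For that "three crashes a year versus six" conversation, the standard SPC tool is a c-chart of failure counts per period, which tells you whether the jump is beyond normal variation. A minimal sketch with invented history (not my original R code):

```python
import math

def c_chart_limits(counts):
    """Control limits for a c-chart (failure counts per period).
    A new period's count above the UCL signals a real change,
    not just random variation."""
    c_bar = sum(counts) / len(counts)          # center line
    sigma = math.sqrt(c_bar)                   # Poisson-based spread
    ucl = c_bar + 3 * sigma
    lcl = max(0.0, c_bar - 3 * sigma)          # counts can't go below 0
    return lcl, c_bar, ucl

# Hypothetical yearly crash counts for one application
history = [3, 2, 4, 3, 3]
lcl, center, ucl = c_chart_limits(history)
print(f"center={center:.1f} UCL={ucl:.1f}")
# center=3.0 UCL=8.2
```

Note that with this made-up history, a year with 6 crashes is double the mean but still inside the control limits; either way, the chart gives the discussion a quantitative footing instead of a vague feeling that things got worse.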
I know some of the developers got pulled into tools like Splunk, but to me that was always cost-prohibitive for our budget and our volume of data.
To answer your question about early signals in metrics before job failures increased: the mean-time-to-failure SPC chart would show a job's failure signature, and if there were problem nodes or problems with a software update, that would become apparent and allow further investigation. The other SPC charts, such as job run time, would surface things like steadily increasing run times. That was pretty basic stuff, and lots of tools can do it; for example, one user was generating a daily tar file that grew over time and eventually filled up a file system. But getting people to take action always seemed so hard.
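The run-time drift case, like the growing tar file, is what an individuals (XmR) chart catches: a slow climb eventually pushes points past the upper limit even before anything outright fails. A small illustrative Python sketch with invented numbers (the 2.66 factor is the standard XmR constant):

```python
def xmr_limits(values):
    """Individuals (XmR) chart limits for a metric such as daily job
    run time: mean +/- 2.66 * average moving range."""
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    x_bar = sum(values) / len(values)
    return x_bar - 2.66 * mr_bar, x_bar + 2.66 * mr_bar

# Hypothetical baseline of daily run times in minutes; a job whose
# input grows every day will drift up and eventually cross the UCL.
baseline = [30, 31, 29, 32, 30, 31]
lcl, ucl = xmr_limits(baseline)
print(round(lcl, 1), round(ucl, 1))
# 25.7 35.3
```

The point is that the limits come from the process's own short-term variation, so a gradual trend shows up as out-of-control points rather than being absorbed into an inflated average.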
I’m building EventSentinel.ai, a predictive AI platform that monitors hardware and network infrastructure to detect early signals of failures and connectivity issues before they cause downtime.
I’m looking for a few early-stage design partners (SRE / DevOps / IT / Network teams) who:
- Manage on-prem or hybrid infrastructure with critical uptime requirements
- Are currently using tools like Datadog, PRTG, Zabbix, or similar, but still deal with “surprise” incidents
- Are open to trying an MVP and giving candid feedback in short feedback sessions
What you’d get:
- Early access to our predictive failure and anomaly detection features
- Direct influence on the roadmap based on your needs
- Free usage during the MVP phase (and preferential terms later)
If this sounds relevant, drop a comment with “interested” and I’ll follow up with details, or email me at gabriele@eventsentinel.ai
That’s true, but AI is interesting because consumption-based pricing introduces a lot more variance than typical SaaS infrastructure. One user action can trigger dozens of model calls in an agent workflow. That’s partly why we started experimenting with models like https://oxlo.ai where the pricing flips back to a fixed subscription and we absorb the usage spikes.
Yeah, it feels like an unsolved problem still. I've also seen many teams spend hours on human review in eval pipelines (and this accumulates with each new model that gets released).