Hacker News | zippyman55's comments

You know the story about how a frog thrown into a pot of boiling water will jump right out, but if you turn up the heat slowly, it just stays put and eventually dies? Well, the other day at work, we were called into a room to watch a mandatory video about frogs in exactly that situation. I noticed that management had turned the thermostat up really high. I hopped out of that meeting very quickly.

I have always wondered, when I used a pair of chopsticks to push food onto my fork, whether there was a name for my type.

Rankine, Celsius, and Centigrade have the degrees. Kelvin is a base unit, absolute, and takes no degree!

Are you suggesting that melting them will increase the value?

Morning Fred! Morning Sam.

Pimp is a bad word and its use should be avoided. Don't glamorize the word.

Yep! Maybe some money could be put in an S&P 500 equal-weight index?

This is not my first rodeo.


My team was responsible for system administration of a large-scale HPC center. We seemed to get blamed, incorrectly, for a lot of sloppy user code. I implemented statistical process controls for job aborts and reported the results as mean-time-to-failure rates over the years. It was pretty cool, as I could respond with failure rates for each of several thousand different programs. What did not work was changing the culture to get people to improve their code. But I was able to push back hard when my team was arbitrarily blamed for someone else's bad code: it was easy to show that a job's failure rate was increasing and link it to a recent upgrade or change. Still, I often felt I was just shining a flashlight at an issue and trying to encourage the responsible party to take ownership.
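The "statistical process controls for job aborts" idea can be sketched as an individuals (XmR) control chart on the time between successive failures of one program: estimate limits from a stable baseline period, then flag recent gaps that fall below the lower limit. This is a minimal illustration in Python with made-up numbers, not the author's actual system (which was built in R):

```python
def xmr_limits(values):
    """Return (center, lcl, ucl) for an individuals (XmR) control chart.

    Sigma is estimated from the average moving range of successive
    points, divided by the d2 constant (1.128 for subgroups of size 2).
    """
    n = len(values)
    mean = sum(values) / n
    mr_bar = sum(abs(values[i] - values[i - 1]) for i in range(1, n)) / (n - 1)
    sigma_est = mr_bar / 1.128
    return mean, max(0.0, mean - 3 * sigma_est), mean + 3 * sigma_est

# Hypothetical hours between successive failures of one application,
# during a period when it was believed to be running normally.
baseline = [30.0, 42.0, 35.0, 38.0, 31.0, 36.0]
center, lcl, ucl = xmr_limits(baseline)

# After a code update, gaps between failures shrink sharply; points
# below the lower control limit signal a real change, not bad luck.
recent = [6.0, 5.5, 7.0]
alarms = [g for g in recent if g < lcl]
```

A drop below the lower limit on a time-between-failures chart is exactly the kind of evidence that lets you tie a rising failure rate to a recent upgrade or change.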

That's really interesting. Using statistical process control for failure rates in HPC systems sounds like a very solid approach.

In your experience, were there usually early signals in metrics before job failures increased? For example, patterns like latency changes, resource saturation, or network anomalies.

I'm trying to understand whether those signals appear consistently enough to detect issues before incidents actually happen.


For the mean time to failure, I based it on a section of Mastering Statistical Process Control by Tim Stapenhurst, specifically the section on using SPC to measure rare events such as earthquakes. The system worked pretty well and ran for years; using R, I built a free system to monitor all the job-scheduler information for our HPC systems. I'd present the most egregious results as a daily Pareto chart and attempt to shame the code owners who appeared at the top of it. Mostly, though, I just did not want people falling back on their go-to excuse of blaming the system administrators when it was really their recent code update.

There were other SPC charts one could drill into: job run times, which nodes the jobs ran on, and so on. But working the culture to get people to take responsibility for their applications was a little out of my wheelhouse, and always a challenge. For the few people who really embraced application ownership and wanted to make sure things ran well, it was always nice to be able to say something like, "your job used to crash 3 times a year and now it seems to be crashing 6 times a year." At least then we had a good starting point to discuss potential causes. I know some of the developers got sucked into tools like Splunk, but to me that was always cost prohibitive for our budget and our volume of data.

To answer your question about "early signals in metrics before job failures increased": the mean-time-to-failure SPC chart would show a job's failure signature, and if there were problem nodes or problems with a software update, that would become apparent and allow further investigation. The other SPC charts, like job run time, would show things like increasing run times. But that was pretty basic stuff (lots of tools can do it), for example a user generating a daily tar file that was growing over time and eventually filling up a file system.
But getting people to take action always seemed so hard.

I’m building EventSentinel.ai, a predictive AI platform that monitors hardware and network infrastructure to detect early signals of failures and connectivity issues before they cause downtime.

I’m looking for a few early-stage design partners (SRE / DevOps / IT / Network teams) who:

Manage on‑prem or hybrid infrastructure with critical uptime requirements

Are currently using tools like Datadog, PRTG, Zabbix, or similar, but still deal with "surprise" incidents

Are open to trying an MVP and giving candid feedback in short sessions

What you’d get:

- Early access to our predictive failure and anomaly detection features

- Direct influence on the roadmap based on your needs

- Free usage during the MVP phase (and preferential terms later)

If this sounds relevant, drop a comment saying "interested" and I'll follow up with details, or email me at gabriele@eventsentinel.ai.


This product used to be a favorite of mine when I was a kid, but today, it tastes so artificial and inferior. Terrible chocolate and peanut butter. Nope.


In college chemistry, I discovered that when you were given temperatures in Fahrenheit, you did not need to convert to Kelvin before taking the natural log. A far easier method was to add 459.67 to the Fahrenheit value, converting it to degrees Rankine, and then just take the natural log. So much easier.
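The shortcut works because Rankine, like Kelvin, is an absolute scale, and °R = 1.8 K exactly, so ln(T in °R) and ln(T in K) differ only by the constant ln(1.8). Any difference of logs between two temperatures (the quantity that matters in, say, an Arrhenius-style comparison) comes out identical either way. A quick check in Python:

```python
import math

def f_to_rankine(t_f):
    """Fahrenheit -> degrees Rankine (absolute scale): just add 459.67."""
    return t_f + 459.67

def f_to_kelvin(t_f):
    """Fahrenheit -> Kelvin: add 459.67, then scale by 5/9."""
    return (t_f + 459.67) * 5.0 / 9.0

# ln(T_R) = ln(1.8 * T_K) = ln(T_K) + ln(1.8), a constant offset,
# so differences of natural logs are the same on either scale.
t1, t2 = 77.0, 212.0  # two temperatures in Fahrenheit
diff_rankine = math.log(f_to_rankine(t2)) - math.log(f_to_rankine(t1))
diff_kelvin = math.log(f_to_kelvin(t2)) - math.log(f_to_kelvin(t1))
assert abs(diff_rankine - diff_kelvin) < 1e-12
```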

