The older I get and the more systems I encounter, the more I'm convinced everything is in a state of failing. The only question is how long before it becomes noticeable.
One consequence is that when something goes very wrong, the people involved say it really shouldn't have, because something like 4 separate things would all have to fail at the same time for that to happen.
What they don't realise is that, thanks to the FFT, 2 of those 4 things actually failed over a year ago, and the system has kept working ever since without anyone noticing. A third fails intermittently a couple of times per month; a couple of people have noticed and reported it, but either it was never investigated, or when it was, everything looked fine and the report was written off as an unexplained glitch. All that was needed was for the 4th thing to fail, and this "can't happen" failure mode suddenly ruined everyone's day.
All of this is further exacerbated by the modern business take on capitalism: "optimize away any work that doesn't visibly produce".
Just because you're paying someone to go check, and 99% of the time they don't find anything, you don't get rid of that check. If you do, there goes the only sensor capable of propagating that signal to the rest of the control network. This applies equally well in electronics, circuits, or human domains.
If only C-levels and boards viewed such jobs as a good way to provide someone with a living, rather than as poor worker utilization hurting their bottom line.
Experienced sysadmins see the occasional chatter of problems in logs and emails as the pulse of the system, the EKG hooked up to the patient letting you know how things are going.
The right level of logging is usually enough to be annoying, but not so annoying that you need to reduce it. Ideally that means the logging is thorough enough to report what you care about, and the things you see are the things you can manually fix or ignore, at a frequency you can live with.
That works until the stuff you're managing becomes too large for it to make sense anymore, because it only scales so far (though surprisingly far for diligent admins). Then you need to move to something more complex (like Prometheus), which is a big job and requires an entirely different way of reviewing. You're probably better off doing that from the beginning, but it's time and effort not everyone can spare, or the tooling may not have existed at the time.
> because that only scales so far (but surprisingly far for diligent admins)
Surprisingly far. We log terabytes of data a day, and most of it is unnecessary. I've been trying to tell people it's not a good idea, but so far I've had limited success. It's easy for people to just "printf" everything (we're lucky if logging libraries are used with proper levels).
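For what "proper levels" means in practice: a minimal sketch using Python's stdlib `logging` module (the logger name and messages here are made up). Chatter goes to DEBUG/INFO, where it can be turned off fleet-wide with one config change instead of deleting the call sites:

```python
import logging

# Configure the root logger once; WARNING is a common production floor.
logging.basicConfig(level=logging.WARNING,
                    format="%(levelname)s %(name)s: %(message)s")
log = logging.getLogger("payments")  # hypothetical subsystem name

log.debug("entered handler")          # suppressed at WARNING
log.info("processed order 123")       # suppressed at WARNING
log.warning("retrying flaky upstream")  # emitted
```

The point is that a `log.debug()` call that's disabled costs almost nothing, whereas a `print()` always pays the full serialization-and-shipping price downstream.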
Right now, we have the most expensive 'add' operation in history. A K8s pod logs some data. That is eventually noticed by fluentd, which is watching the filesystem. Fluentd sends it to our logging servers over the network. That goes to Kafka, then is shipped to the system that actually does the indexing. And then a query runs, which counts the number of times that string appeared and updates a counter.
All of which could have been avoided if the app had just added +1 to a counter and exposed it at /metrics for Prometheus to scrape.
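A stdlib-only sketch of that alternative (in real code you'd use a client library like prometheus_client; the metric name, port, and handler here are illustrative). The entire fluentd-to-Kafka-to-indexer pipeline collapses into an in-process `+1`, with the counters served in Prometheus's plain-text exposition format:

```python
from collections import Counter
from http.server import BaseHTTPRequestHandler, HTTPServer

counters = Counter()

def inc(name, by=1):
    """The whole 'pipeline': a local +1 instead of a log line."""
    counters[name] += by

def render_metrics():
    """Serialize counters in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(counters.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

# Usage (blocking, so commented out here):
# inc("requests_total")
# HTTPServer(("", 8000), MetricsHandler).serve_forever()
```

Prometheus then does the counting and alerting on its side, instead of a query grepping indexed log text for a string.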
I don't know if we worked for the same company, but your logging pipeline sounds almost identical to the one at my last company. That said, logs are definitely abused often. Usually all it takes is an engineer saying to themselves, "well, I know this gets logged 1000x a second per host, but someday during an outage I could potentially use this!", without realizing that one log call can cost thousands a month or more. Tools like ElastAlert unfortunately don't help, in that they make people comfortable using logs as a fail signal. It seems like the best way to get teams to limit their logging is to give them the exact dollar amount their log call costs; i.e., "this one log line accounts for 5% of all log traffic, that costs $X a month, and it's only been queried N times."
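To see how fast that dollar amount grows, here's a back-of-envelope calculation. Every number besides the 1000x/sec rate quoted above is a made-up assumption (line size, fleet size, $/GB ingest price); substitute your own pipeline's figures:

```python
lines_per_sec = 1_000   # "logged 1000x a second per host"
bytes_per_line = 200    # assumed average line size
hosts = 100             # assumed fleet size
cost_per_gb = 0.10      # assumed ingest + index price, $/GB

daily_gb = lines_per_sec * bytes_per_line * hosts * 86_400 / 1e9
monthly_cost = daily_gb * cost_per_gb * 30

print(f"{daily_gb:.0f} GB/day -> ${monthly_cost:,.0f}/month")
# ≈ 1728 GB/day -> $5,184/month
```

One innocuous-looking log call turning into terabytes a month is exactly how "we log terabytes a day, most of it unnecessary" happens.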
Time to market. Nothing else matters: use whatever stack/library works and whatever I can copy/paste off Stack Overflow to get it working. We can fix the problems in software as we go, and we only have to support it for 2 years before we make another.