Experienced sysadmins see the occasional chatter of problems in logs and emails as the pulse of the system, the EKG hooked up to the patient letting you know how things are going.
The right level of logging is usually enough to be annoying, but not so annoying that you need to reduce it. That means you're hopefully thorough enough in the logging that you're reporting what you care about, the things you see are things you can manually fix or ignore, and they arrive at a frequency you can live with.
Until the stuff you're managing becomes too large for that to make sense anymore, because that only scales so far (but surprisingly far for diligent admins). Then you need to move to something more complex, which is a big job and requires an entirely different way of reviewing (like Prometheus). You're probably better off doing that from the beginning, but it's time and effort not everyone can spare, or the tooling may not have existed at the time.
> because that only scales so far (but surprisingly far for diligent admins)
Surprisingly far. We log terabytes of data a day, and most of it is unnecessary. I've been trying to tell people that it's not a good idea, but so far I've had limited success. It's easy for people to just "printf" everything (we are lucky if logging libraries are used with proper levels).
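For what it's worth, a minimal sketch of what "proper levels" buys you, here in Go with the standard log/slog package (the Warn threshold and the fields are made up for illustration): anything below the threshold never even becomes a log line, so it never enters the pipeline.

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// Drop everything below Warn at the source. The threshold and the
	// fields below are illustrative, not from any real service.
	logger := slog.New(slog.NewJSONHandler(os.Stderr, &slog.HandlerOptions{
		Level: slog.LevelWarn,
	}))
	slog.SetDefault(logger)

	slog.Debug("cache miss", "key", "user:42")      // filtered out here
	slog.Warn("upstream call failed", "attempt", 3) // actually emitted
}
```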
Right now, we have the most expensive 'add' operation in history. A K8s pod logs some data. That is eventually noticed by fluentd, which is watching the filesystem. Fluentd sends it over the network to our logging servers. That goes to Kafka, then is shipped to the system that actually does the indexing. And then there's a query running that counts the number of times that string appeared and updates a counter.
All of which could have been avoided if the apps just added +1 to a counter and exposed that as /metrics so Prometheus could scrape it.
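A rough sketch of what that looks like, assuming Go and the official client_golang library (the metric name, label, and handler here are invented for the example, not from the original setup):

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// eventsTotal replaces the log line we used to grep and count downstream.
// Metric name and label are illustrative.
var eventsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "app_events_total",
	Help: "Number of interesting events, by type.",
}, []string{"type"})

func handler(w http.ResponseWriter, r *http.Request) {
	// The whole fluentd -> Kafka -> indexer -> query chain becomes
	// a single in-process increment.
	eventsTotal.WithLabelValues("thing_happened").Inc()
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/work", handler)
	http.Handle("/metrics", promhttp.Handler()) // Prometheus scrapes this
	http.ListenAndServe(":8080", nil)
}
```

The count-how-often-it-happened math then happens at query time in PromQL (e.g. rate(app_events_total[5m])) instead of in a log query over string matches.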
I don’t know if we worked for the same company, but your logging pipeline sounds almost identical to the one at my last company. That said, logs are definitely abused often; usually all it takes is an engineer saying to themselves, “well, I know this gets logged 1000x a second per host, but someday during an outage I could potentially use this!”, without realizing that one log call can cost thousands or more a month. Tools like ElastAlert unfortunately don't help, in that they make people comfortable using logs as a failure signal. It seems like the best way to get teams to limit their logging is to give them the exact dollar amount their log call costs; i.e., “this one log line accounts for 5% of all log traffic, that costs $X a month, and it's only been queried N times.”
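A back-of-envelope version of that dollar figure; every number below is an assumption, purely to show the shape of the math:

```go
package main

import "fmt"

func main() {
	// All figures are assumptions for illustration, not real pricing.
	const (
		linesPerSecPerHost = 1000.0 // "1000x a second per host"
		hosts              = 10.0
		bytesPerLine       = 300.0 // JSON-wrapped log line
		dollarsPerGBMonth  = 0.50  // ingest + index + retention
	)
	gbPerMonth := linesPerSecPerHost * hosts * bytesPerLine * 86400 * 30 / 1e9
	fmt.Printf("~%.0f GB/month, roughly $%.0f/month for this one log call\n",
		gbPerMonth, gbPerMonth*dollarsPerGBMonth)
	// prints: ~7776 GB/month, roughly $3888/month for this one log call
}
```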