Experienced sysadmins see the occasional chatter of problems in logs and emails as the pulse of the system, the EKG hooked up to the patient letting you know how things are going.
The right level of logging is usually enough to be annoying, but not so annoying that you need to reduce it. That means you're hopefully thorough enough in the logging that you're reporting what you care about, the things you see are things you can manually fix or ignore, and they arrive at a frequency you can live with.
Until the stuff you're managing becomes too large for that to make sense anymore, because that only scales so far (but surprisingly far for diligent admins). Then you need to move to something more complex, which is a big job and requires an entirely different way of reviewing (like Prometheus). You're probably better off doing that from the beginning, but it's time and effort not everyone can spare, or the tooling may not have existed at the time.
> because that only scales so far (but surprisingly far for diligent admins)
Surprisingly far. We log terabytes of data a day, and most of it is unnecessary. I've been trying to tell people that it's not a good idea, but so far I've had limited success. It's easy for people to just "printf" everything (we are lucky if logging libraries are used with proper levels).
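For what it's worth, a minimal sketch of what "proper levels" buys you, here in Go with the standard log/slog package (the Warn threshold and the fields are made up for illustration): anything below the threshold never even becomes a log line, so it never enters the pipeline.

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// Drop everything below Warn at the source. The threshold and the
	// fields below are illustrative, not from any real service.
	logger := slog.New(slog.NewJSONHandler(os.Stderr, &slog.HandlerOptions{
		Level: slog.LevelWarn,
	}))
	slog.SetDefault(logger)

	slog.Debug("cache miss", "key", "user:42")      // filtered out here
	slog.Warn("upstream call failed", "attempt", 3) // actually emitted
}
```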
Right now, we have the most expensive 'add' operation in history. A K8s pod logs some data. That is eventually noticed by fluentd, which is watching the filesystem. Fluentd sends it over the network to our logging servers. That goes to Kafka, then is shipped to the system that actually does the indexing. And then there's a query running that counts the number of times that string appeared and updates a counter.
All of which could have been avoided if the apps just added +1 to a counter and exposed that as /metrics so Prometheus could scrape it.
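A rough sketch of what that looks like, assuming Go and the official client_golang library (the metric name, label, and handler here are invented for the example, not from the original setup):

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// eventsTotal replaces the log line we used to grep and count downstream.
// Metric name and label are illustrative.
var eventsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "app_events_total",
	Help: "Number of interesting events, by type.",
}, []string{"type"})

func handler(w http.ResponseWriter, r *http.Request) {
	// The whole fluentd -> Kafka -> indexer -> query chain becomes
	// a single in-process increment.
	eventsTotal.WithLabelValues("thing_happened").Inc()
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/work", handler)
	http.Handle("/metrics", promhttp.Handler()) // Prometheus scrapes this
	http.ListenAndServe(":8080", nil)
}
```

The count-how-often-it-happened math then happens at query time in PromQL (e.g. rate(app_events_total[5m])) instead of in a log query over string matches.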
I don’t know if we worked for the same company, but your logging pipeline sounds almost identical to the one at my last company. That said, logs are definitely abused often; usually all it takes is an engineer saying to themselves, “well, I know this gets logged 1000x a second per host, but someday during an outage I could potentially use this!”, without realizing that one log call can cost thousands or more a month. Tools like ElastAlert unfortunately don't help, in that they make people comfortable using logs as a failure signal. It seems like the best way to get teams to limit their logging is to give them the exact dollar amount their log call costs; i.e., “this one log line accounts for 5% of all log traffic, that costs $X a month, and it's only been queried N times.”
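A back-of-envelope version of that dollar figure; every number below is an assumption, purely to show the shape of the math:

```go
package main

import "fmt"

func main() {
	// All figures are assumptions for illustration, not real pricing.
	const (
		linesPerSecPerHost = 1000.0 // "1000x a second per host"
		hosts              = 10.0
		bytesPerLine       = 300.0 // JSON-wrapped log line
		dollarsPerGBMonth  = 0.50  // ingest + index + retention
	)
	gbPerMonth := linesPerSecPerHost * hosts * bytesPerLine * 86400 * 30 / 1e9
	fmt.Printf("~%.0f GB/month, roughly $%.0f/month for this one log call\n",
		gbPerMonth, gbPerMonth*dollarsPerGBMonth)
	// prints: ~7776 GB/month, roughly $3888/month for this one log call
}
```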