Logs are for humans, metrics are for machines. Metrics tell you there's a proble...

ludovicianul · on Oct 6, 2022

To be honest, these feel like excuses to avoid structure. Keeping things consistent requires dedication and explicit effort. Just saying "everyone do logging in their preferred format" is the easy way out. Structured logging gives you lots of benefits. Your monitoring will just work out of the box if you keep to the standard. Additional logs won't arbitrarily show up in your system unanticipated. I'm not saying it will never happen, but you have control over it.
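For illustration, "keeping to the standard" could be as small as one shared formatter. This is only a sketch using Python's standard `logging` module; the logger name `payments`, the `fields` convention, and the `reason` key are made up for the example, not taken from any particular standard.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render every record as one JSON object with a fixed set of keys."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical convention: callers pass extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(stream):
    """Attach the shared formatter to a logger writing to `stream`."""
    logger = logging.getLogger("payments")  # assumed name
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
log = build_logger(stream)
log.info("transaction failed", extra={"fields": {"reason": "bad_card_number"}})
```

Anything downstream can then rely on `level`, `logger` and `message` always being present, which is the "works out of the box" part.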

0xbadcafebee · on Oct 6, 2022

Do you have control over the format that a router uses to send you logs? Or a cloud vendor's services? Probably not. Inconsistency is therefore a certainty. The time you spend on consistency for a small subset of your logs can instead be spent working on a telemetry management system which decomposes and analyses all your logs regardless of format.
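As a rough sketch of what "regardless of format" might mean in practice, a normalizer can try a few parsers in order and fall back to keeping the raw line. The function name `normalize`, the `format` tag, and the sample lines are assumptions for illustration; a real system would need many more formats than these three.

```python
import json
import re

# Matches key=value pairs, where the value may be double-quoted.
KV_PAIR = re.compile(r'(\w+)=("[^"]*"|\S+)')


def normalize(line):
    """Turn one log line of unknown format into a dict of fields."""
    line = line.strip()
    # 1) Try JSON first; only accept objects, not bare numbers or strings.
    try:
        parsed = json.loads(line)
        if isinstance(parsed, dict):
            return {"format": "json", **parsed}
    except json.JSONDecodeError:
        pass
    # 2) Try key=value pairs, as many network devices emit.
    pairs = KV_PAIR.findall(line)
    if pairs:
        return {"format": "kv", **{k: v.strip('"') for k, v in pairs}}
    # 3) Give up on structure but keep the text searchable.
    return {"format": "raw", "message": line}


records = [
    normalize('{"level": "error", "msg": "disk full"}'),
    normalize('level=warn msg="link down" port=3'),
    normalize("Your transaction failed"),
]
```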

ludovicianul · on Oct 7, 2022

The fact that you will have inconsistencies doesn't mean you shouldn't strive for a standard where you do have control. It will still significantly simplify things.

rrwo · on Oct 5, 2022

Yes.

I used to work at a company that invested too much time and effort into an ELK stack that handled specialist log formats sent over a custom UDP service. It was brittle and fell over in all sorts of strange ways. This was all for monitoring a handful of servers.

The only good that came of it was when we realized that the support staff could search the logs and diagnose 95% of customer complaints without bothering developers. ("Your transaction failed because you entered a bad card number.")

But that was because the text logs were more useful than the metrics.

wyiske · on Oct 6, 2022

Agreed. I was actually hoping the article would talk about how to log (buffering, async, files or output streams, etc.). I’ve seen at least 2 cases of excessive logging causing outages: one was log4j's zip rollover, which blocked all threads in the app and caused timeouts; the other was JSON logging in an older Android VM, which couldn’t cope with all the garbage and caused OOM due to fragmentation (before a compacting garbage collector was introduced).
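To make the "buffer, async" option concrete, here is a minimal sketch in Python (not log4j) using the standard library's `QueueHandler` and `QueueListener`. Application threads only enqueue records; a background thread does the slow I/O. The `DroppingQueueHandler` class, the logger name `app.async`, and the `max_pending` limit are assumptions for the example, and dropping records when the buffer is full is one possible trade-off, not the only one.

```python
import io
import logging
import logging.handlers
import queue


class DroppingQueueHandler(logging.handlers.QueueHandler):
    """Enqueue without blocking; drop the record if the buffer is full."""

    def enqueue(self, record):
        try:
            self.queue.put_nowait(record)
        except queue.Full:
            pass  # Losing a log line is preferred here over stalling a thread.


def build_async_logger(stream, max_pending=1000):
    """Return (logger, listener); the listener thread writes to `stream`."""
    log_queue = queue.Queue(maxsize=max_pending)
    target = logging.StreamHandler(stream)  # the potentially slow sink
    listener = logging.handlers.QueueListener(log_queue, target)
    logger = logging.getLogger("app.async")  # assumed name
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(DroppingQueueHandler(log_queue))
    listener.start()
    return logger, listener


stream = io.StringIO()
log, listener = build_async_logger(stream)
log.info("slow disk should not block request threads")
listener.stop()  # processes what is still queued, then joins the thread
```

The bounded queue is the important part: an unbounded buffer just moves the outage from blocked threads to memory exhaustion.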