
Logs are for humans, metrics are for machines.

Metrics tell you there's a problem. They're an application's way of reporting metadata or state. They are designed to be small, frequent, and machine readable/writable: each data point is nearly useless on its own, but very useful in aggregate.
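To make the distinction concrete, here is a minimal Python sketch of metrics as small, frequent data points. The `Metrics` class and the metric names are hypothetical (not any particular library's API); a real system would ship these to a time-series backend. No single increment tells you much, but the aggregate error rate does.

```python
from collections import Counter, defaultdict
from statistics import mean


class Metrics:
    """Hypothetical in-process metrics store: counters plus raw observations."""

    def __init__(self):
        self.counters = Counter()
        self.observations = defaultdict(list)

    def incr(self, name, value=1):
        # A counter increment is tiny and cheap; it carries almost no context.
        self.counters[name] += value

    def observe(self, name, value):
        # A single timing sample; only meaningful alongside many others.
        self.observations[name].append(value)

    def summary(self):
        # Aggregation is where metrics become useful.
        return {
            "counters": dict(self.counters),
            "means": {name: mean(vals) for name, vals in self.observations.items()},
            "maxima": {name: max(vals) for name, vals in self.observations.items()},
        }


metrics = Metrics()
for latency_ms, failed in [(12, False), (15, False), (11, False), (240, True)]:
    metrics.incr("http.requests")
    metrics.observe("http.latency_ms", latency_ms)
    if failed:
        metrics.incr("http.errors")

summary = metrics.summary()
error_rate = summary["counters"]["http.errors"] / summary["counters"]["http.requests"]
print(summary)
print(f"error rate: {error_rate:.0%}")
```

The aggregate says something is wrong (a 25% error rate, a 240 ms outlier); it does not say why, which is the job of the logs.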

Logs describe the problem in detail to a human. They're a developer's way of diagnosing problems in an application. They can have machine-readable components, but being machine-readable is not the point. The point is for a human to diagnose and fix an arbitrary problem quickly, by whatever means you choose. It's extremely annoying if you can't read a log because it was machine-formatted.

From the article:

  Invest time in designing your log structure
Don't. Eventually you will have to deal with logs with a random structure. Invest in telemetry management and distributed tracing.

  Log as much as possible.
Don't. It's about quality, not quantity. Too many logs with too little information leave you buried and unable to diagnose quickly. And too much information is bad for security; always mask sensitive information. I can't tell you how many giant companies have been hacked because of this. And cost is a factor: I have seen people stop logging because it was costing them too much money, when they should have been logging smarter (fewer messages with more useful info) and using metrics.
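On masking: one way to enforce it with Python's standard `logging` module is a `logging.Filter` that rewrites each message before any handler writes it. This is a sketch only: the regex (13 to 19 digits, optionally separated by single spaces or dashes) is an assumption about what a card number looks like and will both miss and over-match on real data, so treat it as a starting point rather than a compliance control.

```python
import logging
import re

# Assumption: a card number is 13-19 digits, optionally separated by single spaces or dashes.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def _mask(match):
    digits = re.sub(r"\D", "", match.group())
    return "****" + digits[-4:]  # keep only the last four digits


class MaskCardNumbers(logging.Filter):
    def filter(self, record):
        # Render the final message first so that interpolated arguments are masked too.
        record.msg = CARD_RE.sub(_mask, record.getMessage())
        record.args = ()
        return True  # never drop the record, only rewrite it


logger = logging.getLogger("payments")
logger.addFilter(MaskCardNumbers())
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")

logger.info("Transaction failed for card %s: invalid number", "4111 1111 1111 1111")
# Logged as: INFO payments: Transaction failed for card ****1111: invalid number
```

A filter on a logger only sees records logged directly on that logger; attaching it to a handler instead covers every record that reaches that handler, whichever logger produced it.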

  Keeping consistency is everyone’s priority
It is guaranteed that you will have to deal with some metrics and logs in arbitrary formats. Do not worry about making them perfect. Invest in telemetry management and distributed tracing.
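The core of the distributed tracing idea is a trace ID created at the edge of a request and attached to every log line and outgoing call, so a single request can be followed across services whatever their individual log formats. Below is a minimal Python sketch using `contextvars` and a logging filter. The `handle_request` function, the `X-Trace-Id` header name and the example ID are hypothetical; real deployments would typically use a tracing library and a standard header such as the W3C `traceparent`.

```python
import contextvars
import logging
import uuid

# The current request's trace ID; "-" when no request is in progress.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    def filter(self, record):
        # Stamp every record with the active trace ID.
        record.trace_id = trace_id_var.get()
        return True


# The filter sits on the handler, so every record the format string sees has trace_id.
handler = logging.StreamHandler()
handler.addFilter(TraceIdFilter())
handler.setFormatter(logging.Formatter("%(trace_id)s %(levelname)s %(name)s: %(message)s"))
logging.basicConfig(level=logging.INFO, handlers=[handler])
logger = logging.getLogger("billing")


def handle_request(incoming_trace_id=None):
    """Hypothetical request handler: reuse the caller's trace ID or start a new one."""
    token = trace_id_var.set(incoming_trace_id or uuid.uuid4().hex)
    try:
        logger.info("charging card")
        # Propagate the same ID to downstream calls (header name is an assumption).
        return {"X-Trace-Id": trace_id_var.get()}
    finally:
        trace_id_var.reset(token)


outgoing_headers = handle_request("4bf92f3577b34da6")
```

Because the ID lives in a context variable rather than a global, concurrent requests (threads or asyncio tasks) each see their own value.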


To be honest, these feel like excuses to avoid structure. Keeping things consistent requires dedication and explicit effort. Just saying "everyone log in their preferred format" seems like the easy way out. Structured logging gives you lots of benefits. Your monitoring will just work out of the box if you stick to the standard, and log formats you didn't anticipate won't arbitrarily show up in your system. I'm not saying it will never happen, but you have control over it.
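As an illustration of sticking to a standard, here is a minimal JSON formatter for Python's standard `logging` module. The field names and the `extra={"fields": ...}` convention are assumptions for this sketch, not an established standard; libraries such as structlog or python-json-logger exist if you would rather not roll your own.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        entry = {
            "time": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Assumed convention: structured fields are passed as extra={"fields": {...}}.
        entry.update(getattr(record, "fields", {}))
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info("payment declined", extra={"fields": {"order_id": "A123", "reason": "bad_card_number"}})
```

Every line then carries the same keys, so a dashboard can filter on `reason` without a per-service parser. The trade-off raised at the top of the thread still applies: raw JSON is less pleasant for a human to read, so a viewer such as `jq` helps.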


Do you have control over the format that a router uses to send you logs? Or a cloud vendor's services? Probably not. Inconsistency is therefore a certainty. The time you spend on consistency for a small subset of your logs can instead be spent working on a telemetry management system which decomposes and analyses all your logs regardless of format.
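A toy version of that "decompose everything" approach in Python: try JSON, then a syslog-style `<PRI>` prefix (as many network devices send), then `key=value` pairs, and fall back to raw text. The function name, output field names and detection heuristics are all assumptions for illustration; a real telemetry pipeline would have many more parsers and far more robust detection.

```python
import json
import re

# <PRI> prefix used by syslog; PRI = facility * 8 + severity.
SYSLOG_RE = re.compile(r"^<(?P<pri>\d{1,3})>(?P<rest>.*)$")
# key=value or key="quoted value"
KV_RE = re.compile(r'(\w+)=("[^"]*"|\S+)')


def normalize(line):
    """Map one raw log line, in whatever format, to a flat dict (heuristic)."""
    line = line.strip()
    if line.startswith("{"):
        try:
            return {"format": "json", **json.loads(line)}
        except json.JSONDecodeError:
            pass  # not JSON after all; try the other parsers
    match = SYSLOG_RE.match(line)
    if match:
        pri = int(match.group("pri"))
        return {
            "format": "syslog",
            "facility": pri // 8,
            "severity": pri % 8,
            "message": match.group("rest").strip(),
        }
    pairs = KV_RE.findall(line)
    if pairs:
        return {"format": "kv", **{key: value.strip('"') for key, value in pairs}}
    return {"format": "text", "message": line}


samples = [
    '{"level": "error", "msg": "db timeout", "service": "orders"}',
    "<134>Oct 11 22:14:15 router1 link down on ge-0/0/1",
    'level=warn service=billing msg="retrying charge"',
    "disk full on /dev/sda1",
]
for raw in samples:
    print(normalize(raw))
```

Once every line is a dict tagged with its `format`, the same search and alerting can run over all of them. The order of the checks matters: a plain-text message that happens to contain `=` would be misread as key/value pairs.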


The fact that you will have inconsistencies doesn't mean you shouldn't strive for a standard where you do have control. It will still significantly simplify things.


Yes.

I used to work at a company that invested too much time and effort into an ELK stack that handled specialist log formats sent over a custom UDP service. It was brittle and fell over in all sorts of strange ways. This was all for monitoring a handful of servers.

The only good that came of it was when we realized that the support staff could search the logs and diagnose 95% of customer complaints without bothering developers. ("Your transaction failed because you entered a bad card number.")

But that was because the text logs were more useful than the metrics.


Agreed. I was actually hoping the article would talk about how to log (buffering, async, files vs. output streams, etc.). I’ve seen at least two cases of excessive logging causing outages. One was log4j's zip rollover, which blocked all threads in the app and caused timeouts; the other was JSON logging on an older Android VM that couldn’t cope with all the garbage, causing OOMs due to heap fragmentation (before the compacting garbage collector was introduced).
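On the "how to log" question, one common mitigation for the first failure mode is to move slow I/O (file writes, rollover) off the application threads. Python's standard library supports this with `logging.handlers.QueueHandler` and `QueueListener`; a minimal sketch follows. The file location, size limit and format string are arbitrary choices for illustration.

```python
import logging
import logging.handlers
import os
import queue
import tempfile

# Unbounded queue: application threads never block on it, but memory grows
# if the writer falls behind, so this trades latency risk for memory risk.
log_queue = queue.Queue(-1)

# Slow handler: file writes and rollover happen on the listener's thread only.
log_path = os.path.join(tempfile.mkdtemp(), "app.log")
file_handler = logging.handlers.RotatingFileHandler(log_path, maxBytes=10_000_000, backupCount=5)
file_handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))
listener = logging.handlers.QueueListener(log_queue, file_handler, respect_handler_level=True)

# Fast handler: application threads just enqueue the record and move on.
logger = logging.getLogger("app")
logger.setLevel(logging.INFO)
logger.addHandler(logging.handlers.QueueHandler(log_queue))
logger.propagate = False

listener.start()
try:
    logger.info("order %s accepted", "A123")
finally:
    listener.stop()  # processes records already queued, then stops the thread
    file_handler.close()
```

This does not make logging free: `QueueHandler` still formats the message on the calling thread before enqueueing, and with a bounded queue its `put_nowait` raises `queue.Full`, which the handler reports via `handleError` (dropping the record) rather than blocking.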



