Do we need to store all that telemetry? (mattklein123.dev)
102 points by mklein123 on April 15, 2024 | hide | past | favorite | 71 comments


I understand the point, but I'd also advocate for the opposite. It's not cool for the planet, for sure, but having all the data points for at least a couple of months is very useful on any large system, plus 15+ months for metrics so you can compare with the year before.

I can't count the number of times users (or I myself) discovered a bug after many weeks because something gradually failed over time. It also saves a lot of time to be able to pinpoint the exact day a behavior changed, so you can check that day's deploy and quickly find the bug. Sometimes a trend isn't obvious right after a deploy but is clearly visible on the graph over a longer period.

And for business intelligence, it's always when you badly need a metric that you realize you never tracked it.


Yeah I've definitely been saved a bunch of times by long retention, and the BI questions that might arise are impossible to predict. So some sort of retention is definitely necessary, IME.

But let's take the case of metrics as an example--do we need full sample granularity for "old" data? Do we need full tag cardinality? Sample granularity reduction could be done with a transform to rollups at a coarser time granularity. That's a 60x reduction going from Hz to 1/min. You might lose a bunch of frequency information this way, but maybe that's ok?
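Concretely, the rollup transform is tiny. A minimal sketch (assuming (unix_ts, value) samples; the aggregate field set is made up for illustration):

  import math
  from collections import defaultdict

  def rollup(samples, bucket_secs=60):
      """Collapse (unix_ts, value) samples into per-bucket aggregates.
      Keeps count/sum/min/max so averages and extremes survive,
      but discards anything faster than bucket_secs."""
      buckets = defaultdict(lambda: [0, 0.0, math.inf, -math.inf])
      for ts, value in samples:
          b = buckets[ts - ts % bucket_secs]
          b[0] += 1
          b[1] += value
          b[2] = min(b[2], value)
          b[3] = max(b[3], value)
      return {t: {"count": c, "sum": s, "min": lo, "max": hi}
              for t, (c, s, lo, hi) in sorted(buckets.items())}

  # 60 one-second samples collapse into a single row: the 60x reduction.
  print(rollup([(1700000000 + i, float(i)) for i in range(60)]))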

Numbers are really nice in ways that text is not.


Another facet of this is how we store telemetry data. Fully indexed and instantaneously searchable seems to be the "default" these days, but who actually needs that?

I keep harping on this, but compressed utf-8 text (or even worse, compressed json) is a horribly wasteful way to do it. See [1]. Putting a small amount of thought into storing telemetry data seems like it could yield incredible savings at scale.

[1] https://lists.w3.org/Archives/Public/www-logging/1996May/000...


I was gonna make a post but this took the words out of my mouth. I have a whole talk about this exact topic, but the summary is that the paradigm of hot storage, followed two weeks later by a compressed archive, is the most wasteful way we could possibly organize this data. I discuss this at length in the talk below:

https://www.youtube.com/watch?v=XXgBJmqv0ok


Nice talk. The first (and best!) logs search solution I experienced in my career was simply a gigantic tree of compressed logs on a hadoop cluster. As someone who spent a bunch of time analyzing logs, the "query interface" being "anything you can sling at the hadoop cluster" was phenomenally awesome. The basic tools of computing are programming languages, and eventually you encounter problems where you need a real (Turing-complete) one.

One great side effect of this was service developers weren't afraid to write logs. We logged excessively, and it didn't cost too much. If we'd been indexing everything in ES it would have bankrupted us.

These days with S3 and the cloud, hadoop (or the EMR suite) per se probably isn't the way to go, but I'd sure like to see observability solutions giving me a first-class programming model that I as a user can interact with--not some bespoke "query DSL", and for them to accept that instantaneous indexed retrieval isn't important.

This paper is really interesting: https://www.usenix.org/system/files/osdi21-rodrigues.pdf

Stuff like this gives me hope we can have it both ways. With highly tuned compression and programmatic access the user is empowered and the cost is minimized.


I thought compressed JSON was pretty efficient. How much would you expect to save over that with a custom binary format?


Storing data as compressed JSON consists of:

- converting every number into its sequence of digits in decimal notation,

- writing those out one character at a time,

- writing the string label of each value repeatedly, once per record,

- compressing all of this with a structure-unaware, generic text compression algorithm based on longest-match search.

Each time you want to read that data, undo all of the above in reverse order.

You can optimize to some degree, but that's basically it.

I expect that not doing any of this saves the time spent doing it. I also expect data-type-aware compression to be much more efficient than text-compressing the text expansion.

In numbers, I expect a 2 to 3 orders of magnitude difference in time, and also in space (for non-random data).
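A toy comparison of the two routes (synthetic data, not a benchmark; the record shape is made up for illustration):

  import json, struct, zlib

  # 10k records of (epoch seconds, float value), timestamps ascending.
  records = [(1700000000 + i, 0.5 + (i % 7) * 0.01) for i in range(10_000)]

  # Text route: labels repeated per record, numbers spelled out in decimal.
  as_json = json.dumps([{"ts": t, "value": v} for t, v in records]).encode()

  # Binary route: fixed-width little-endian pairs, no labels, no digit strings.
  as_bin = b"".join(struct.pack("<Id", t, v) for t, v in records)

  for name, blob in (("json", as_json), ("binary", as_bin)):
      print(name, len(blob), "->", len(zlib.compress(blob, 9)))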


The network difference between compressed JSON and a compressed binary format is likely negligible.

But jcgrillo was talking about storage (at least his link was). And when parsing for analysis, or when storing millions of points daily, there's no doubt that a binary format is simply a lot more CPU- and disk-efficient.


Usually the JSON gets transformed into a binary format (example: BSON).


The thing about telemetry data is it's extremely repetitive. Take for example a CLF[1] log line:

  127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
As written this is 99 bytes (792 bits), but how much information is actually in it?

- The IP address takes up 9 bytes but needs at most 4 (fewer in cases like this, where two of the bytes are zero, if we employ varint encoding).

- Across log lines the ident and user will likely be very repetitive, so storing each unique occurrence more than once is really wasteful.

- The timestamp takes up 28 bytes but only needs 13--far fewer if that field is delta-encoded between log lines.

- The HTTP method takes up 5+ bytes; it's only worth 1 byte.

- The URLs are also super redundant--no need to store a copy in each line.

- The HTTP version is 1 byte of information but takes up 8.

- The status code takes up 3 bytes but is only worth 1--there are only 63 "real" HTTP status codes.

- The content length takes up 4 bytes when it needs only 2.

So I guess this log line only really has ~33 bytes of information in it (assuming a 32-bit pointer for each string--ident, user, URL), and much less if amortized across many lines. So maybe by naively parsing log lines and throwing a bunch of them into columnar, packed protobuf fields (where we get varint encoding for free), delta-encoding the timestamps, and maintaining a dictionary for all the strings, we might achieve something like a ~5x compression ratio.

Playing around with gzip -9 on some test data[2] (not exactly CLF, but maybe similar entropy) I'm getting like ~1.9x compression.

Obviously if I parse this log line into a JSON blob, that blob will compress with a much higher ratio due to the repetitive nature of JSON, but it'll still be larger than the equivalent compressed CLF.

I'm working on a demo for my "protobuf + fst[3]" idea, so I'm not sure if my "maybe ~5x" claim is totally off the mark or not. But I'm confident we can do way better than JSON.

[1] https://en.wikipedia.org/wiki/Common_Log_Format

[2] https://www.sec.gov/about/data/edgar-log-file-data-sets

[3] https://crates.io/crates/fst

EDIT: I guess maybe another way to state my conjecture is "telemetry compression is not general purpose text compression". These data have a schema, and by ignoring that fact and treating them always as schemaless data (employing general purpose text compression methods) we're leaving something on the table.
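To make the packing concrete, here's a toy sketch of that layout--hand-rolled varints standing in for protobuf's wire format, a trimmed field set, and none of the fst machinery:

  import zlib
  from collections import defaultdict

  def varint(n):
      """Protobuf-style base-128 varint for a non-negative int."""
      out = bytearray()
      while True:
          b, n = n & 0x7F, n >> 7
          out.append(b | (0x80 if n else 0))
          if not n:
              return bytes(out)

  def pack_clf(rows):
      """rows: (epoch_ts, method, url, status, length), sorted by time.
      Columnar layout: delta-encoded timestamps, dictionary-coded
      strings, varints everywhere, then one zlib pass per column."""
      strings, ids = [], {}
      def intern(s):
          if s not in ids:
              ids[s] = len(strings)
              strings.append(s)
          return ids[s]

      cols, prev_ts = defaultdict(bytearray), 0
      for ts, method, url, status, length in rows:
          cols["ts"] += varint(ts - prev_ts)  # deltas stay tiny for sorted logs
          prev_ts = ts
          cols["method"] += varint(intern(method))
          cols["url"] += varint(intern(url))
          cols["status"] += varint(status)
          cols["len"] += varint(length)
      cols["dict"] = bytearray("\0".join(strings).encode())
      return {name: zlib.compress(bytes(col), 9) for name, col in cols.items()}

  # Repetitive sample data, as telemetry tends to be:
  packed = pack_clf([(971211336, "GET", "/apache_pb.gif", 200, 2326)] * 1000)
  print({name: len(col) for name, col in packed.items()})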


My hunch is that JSON using a custom compression dictionary with zlib (see zdict argument to https://docs.python.org/3/library/zlib.html#zlib.compressobj) or zstandard would get you most of the benefit while still letting you interact with existing JSON tools. I've not put the work in to prove that to myself though!
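For stdlib zlib that looks roughly like this (the dictionary is hand-written here purely for illustration; in practice you'd train one from sample records, which zstandard supports directly):

  import json, zlib

  # Seed the dictionary with the keys and common values every record repeats.
  zdict = b'{"ts": , "level": "info", "service": "", "msg": ""}'

  record = json.dumps({"ts": 1700000000, "level": "info",
                       "service": "checkout", "msg": "cart updated"}).encode()

  c = zlib.compressobj(level=9, zdict=zdict)
  compressed = c.compress(record) + c.flush()

  d = zlib.decompressobj(zdict=zdict)  # the reader needs the same dictionary
  assert d.decompress(compressed) + d.flush() == record
  print(len(record), "->", len(compressed))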


Labels and other predefined constants carry no information, so compressing them better is not going to win the argument.

Have a look at the description and performance of a non-toy time series database published 10 years ago:

https://www.vldb.org/pvldb/vol8/p1816-teller.pdf

Convenience of text and JSON is an argument, but performance??
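For a taste of why it wins: the paper delta-of-delta encodes timestamps (and XOR-compresses the float values). A simplified sketch that ignores the paper's variable-width bit packing:

  def delta_of_delta(timestamps):
      """Gorilla-style: store the change between successive deltas.
      On a steady scrape interval this is almost always zero, which
      the real encoding then squeezes into ~1 bit per sample."""
      out, prev, prev_delta = [], None, 0
      for ts in timestamps:
          if prev is None:
              out.append(ts)  # first timestamp stored raw
          else:
              delta = ts - prev
              out.append(delta - prev_delta)
              prev_delta = delta
          prev = ts
      return out

  print(delta_of_delta([1700000000, 1700000060, 1700000120, 1700000181]))
  # [1700000000, 60, 0, 1] -- mostly zeros on a regular interval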


Yeah that would be an interesting experiment too.

This blog post has some interesting ideas as well: https://www.uber.com/blog/reducing-logging-cost-by-two-order...


This is all quite true, but a possibly faster way to prototype would be to use the DRAIN algorithm (there are Rust and Python impls that are easy to use) to determine the "log template". Then push the log template when it's first seen, and nothing but values after that, into a programmatically generated table in a common columnar format like Parquet or Iceberg. Then you can point the myriad of data analysis tools like DuckDB, DataFusion, or the latest InfluxDB at it, and you've got your SQL-on-logs implemented.

It can feel a bit Rube-Goldbergish, and it's a bit tricky to navigate the space uninformed because it's early, but it can also handle all the other data your company uses in one platform--no need to special-case "applications" from the "data analysis"/historical/log side, and one place to handle permissions. Then there are tools like Dagster for managing this humongous single database in a straightforward way, rather than writing a web of applications that push and pull to it without a complete picture being possible, or needing devs to remember their place in the system.

Search up Uber CLP for prior art, or more generally the "modern data stack" (PRQL will be perfect for querying logs). By piggybacking on big systems like this, you can take advantage of future advancements in the state of the art, like BtrBlocks https://www.google.com/url?sa=t&source=web&rct=j&opi=8997844.... Of course, if the savings/earnings are high enough, I guess you start implementing this now.
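For anyone who wants to try the template-mining step, the Python impl (drain3) is roughly this easy to drive--a sketch from memory, so double-check its docs:

  from drain3 import TemplateMiner  # pip install drain3

  miner = TemplateMiner()
  for line in ("connected to 10.0.0.1",
               "connected to 10.0.0.2",
               "disk full on /var/log"):
      result = miner.add_log_message(line)
      print(result["change_type"], "->", result["template_mined"])

  # The two "connected to ..." lines collapse into one template like
  # "connected to <*>"; per-line storage shrinks to just the parameters.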


Thanks, I wasn't aware of either DRAIN or BtrBlocks. CLP is very cool. Honestly I'm not sure what a good query experience looks like. I really enjoy the flexibility of mapreduce because there are no "unsolvable" problems--if a high-level DSL like Hive or Pig gets in the way, you just drop down a level to Spark or streaming Python mapreduce or whatever. So ultimately, rather than a "DSL for logs", I'd rather have something more like a "programming model for logs". I don't know what that looks like in 2024--hopefully not still actually hadoop/EMR.


Great post. The observability folks have gone off the rails in the last 5 years. I've seen it do more harm than good in terms of dev speed, and ironically it often makes things less observable for the common path.


Isn’t the issue more that off-the-shelf solutions optimize for features and not cost? For instance, if I sell you an observability product, I want to show off all the cool realtime debugging features and such. And since there’s a cost to having all these features available (retention, indexing, sampling), we end up paying for features we don’t need. In a world of usage-based XaaS, there’s very little incentive to be cost-effective. Arguably even a perverse incentive to waste resources.

I bet you a full dollar that both in-house and open source solutions, on average, are way more stingy with resources. As they should be.


Hey! Just wanna wave a flag for Coralogix (I work there, disclaimer). We've built a ton of cost optimization because we know that the industry is just ridiculous right now. Your assessment is absolutely correct, and there are more than a few multi-billion dollar companies whose bottom line is predicated on their customers wasting money and being inefficient. We're not one of them!

https://coralogix.com/


> we can also add local storage of telemetry data in an efficient circular buffer. Typically, local storage is cheap and underutilized, allowing for “free” storage of a finite amount of historical data, that wraps automatically. Local storage provides the ability to “time travel” when a particular event is hit.

I think this is a good idea when storage is a concern for high-volume logs / production. Persisting the buffer when high error rates or unusual system behavior are observed would be a cool idea.


This is a good approach and is pretty common in the embedded world. You use a ring buffer to store a relatively short but detailed log, and then if you encounter an error (or whatever other relevant trigger criteria you use) you snapshot the contents of that ring buffer. Then later you can retrieve the snapshots to figure out what happened.
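The same pattern ports straight to services. A minimal sketch on top of Python's logging machinery (the snapshot() stub stands in for shipping to durable storage):

  import logging
  from collections import deque

  class RingBufferHandler(logging.Handler):
      """Keep only the last `capacity` records; dump them on ERROR."""

      def __init__(self, capacity=1000):
          super().__init__()
          self.buffer = deque(maxlen=capacity)  # old entries fall off the end

      def emit(self, record):
          self.buffer.append(self.format(record))
          if record.levelno >= logging.ERROR:  # the trigger criterion
              self.snapshot()

      def snapshot(self):
          # Stand-in: a real system would ship this somewhere durable.
          print(f"--- snapshot: last {len(self.buffer)} records ---")
          for line in self.buffer:
              print(line)

  logging.getLogger("svc").addHandler(RingBufferHandler(capacity=500))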


We've turned off logging & tracing on a bunch of our high-volume routes. Ideally I'd prefer we still sample them, at like 0.1% or whatnot, to give us some indicator, some chance of seeing anomalies. It just seems easier to gather & use this information than it is to go develop a suite of metrics that can register all issues.
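Head sampling at that rate is nearly free to implement. A sketch (hash-based rather than random, so every service makes the same keep/drop decision for a given trace):

  SAMPLE_PER_MILLE = 1  # 0.1%

  def should_record(trace_id: int) -> bool:
      # Deterministic: all services keep or drop the same traces,
      # so the sampled traces come out complete end to end.
      return trace_id % 1000 < SAMPLE_PER_MILLE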

OpenTelemetry recently-ish gained the Open Agent Management Protocol (OpAMP), which allows some runtime control over the things generating telemetry. The ability to stay fairly low but then scale up as needed sounds tempting, but gee, it also sends shivers down my spine thinking of having such elastic demands on one's telemetry infrastructure, as engineers turn telemetry up while problems are occurring. https://opentelemetry.io/docs/specs/opamp/

The idea of having a local circular buffer sounds excellent to me. Being able to run local queries & aggregate would be sweet. Are there any open otel issues discussing these ideas?


We continue to recreate features of single-computer OSes in distributed systems. This seems like the dtrace/bpftrace of microservices world.


“A lot of telemetry doesn’t need to be stored for very long” is the attitude I take. Keeps costs down but gives good visibility.


I think most places don't collect enough telemetry in the right formats.

It's also possible they collect too much in the wrong formats.

But the ability to vet a hypothesis (I bet our users are confused about feature X, which we can test by looking at how many times they go to page X, then Y, then X again within a 30-second window) in an hour versus 2 sprints is vastly underappreciated/underutilized.
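(That example hypothesis really is a ten-line scan over raw events--a toy sketch, with made-up event tuples:)

  from collections import defaultdict

  def confused_users(events, window=30):
      """events: (user_id, page, unix_ts) tuples sorted by time.
      Flags users who go X -> Y -> X within `window` seconds."""
      history = defaultdict(list)  # user -> last two (page, ts) views
      flagged = set()
      for user, page, ts in events:
          h = history[user]
          if (len(h) == 2 and h[0][0] == page
                  and h[1][0] != page and ts - h[0][1] <= window):
              flagged.add(user)
          history[user] = (h + [(page, ts)])[-2:]
      return flagged

  print(confused_users([(1, "X", 0), (1, "Y", 10), (1, "X", 20),
                        (2, "X", 0), (2, "Y", 10), (2, "X", 100)]))
  # {1} -- user 2's return to X fell outside the 30-second window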

I feel like this article paints with too broad a brush.


Agree with the article enough that I did something about it which I call "Poor Fred's SIEM". The heart of it is a DNS proxy for Redis (https://github.com/m3047/rkvdns). However it's not targeted at environments where everything is in a "bubble" such that there are no ingress / egress costs. (Lookin' at you, Cloud.) Furthermore "control plane" is an important concept, and it's well understood in the industrial control world as the Purdue Model.

From a systems standpoint do you need to have all resources stored centrally in order to do centralized reporting? No, of course not. Admittedly it's handy if bandwidth and storage are free. The alternative is distributed storage, with or without summarization at the edge (and aggregating from distributed storage for reporting).

Having it distributed does raise access issues: access needs to be controlled, and that access control needs to be managed. Philosophically, the Cloud solutions sell centralized management, but federation is a perfectly viable option. The choice is largely dictated by organizational structure, not technology.

There is also a difference between diagnostic and evaluative indicators. Trying to evaluate from diagnostics causes fatigue because humans aren't built that way; evaluatives can and should be built from diagnostics. Diagnostics can't be built from evaluatives.

The logging/telemetry stack that I propose is:

1) Ephemeral logging at the limits of whatever observability you can build. E.g.: systemd journal with a small backing store, similar to a ring buffer.

2) Your compliance framework may require shipping some classes of events off of the local host, but I don't think any of them require shipping it to the cloud.

3) Build evaluatives locally in Redis.

4) Use DNS to query those evaluatives from elsewhere for ad hoc as well as historical purposes. This could be a centralized location, or it could be true federation where each site accesses all other sites' evaluatives.

I wouldn't put Redis on the internet, but I don't worry too much about DNS, and there are well-understood ways of securing DNS from tampering, unauthorized access, and even observation. By the way, DNS will handle hundreds or thousands of queries per second; you just have to build for it.


So I went off and set up an actual "live fire" demo because it's that easy:

  curl http://athena.m3047.net/grafitti.html
  dig @athena.m3047.net grafitti\;*.keys.redis.athena.m3047 txt


> For 30 years how telemetry is produced has not changed: we define all of the data points that we need ahead of time and ship them out of the origin process, typically at large expense. If we apply the control plane / data plane split to observability telemetry production we can fundamentally change the status quo for the first time in three decades

Has Matt read any prior art in this field? https://research.google/pubs/monarch-googles-planet-scale-in...


NO! You don't!

I couldn't agree with the author more. Keeping historical records of business metrics makes a ton of sense. But historical telemetry (CPU, memory, network, error logs) makes little sense.

If an issue occurs, then turn on telemetry around that issue until you track it down. If an issue occurs once and never again, did it really matter? This obviously does not apply to security, I'm just speaking of operational issues.

Keeping all of your application logs and telemetry forever is expensive, and I can't recall a single time when having more than a day's worth of history was ever useful in tracking down an operational issue.


I feel like this swings the pendulum a little too far to the other side. There's very little harm in having telemetry on at all times, but log rotate once a week/month/whatever works for you. If you have telemetry off to begin with, you might not even notice you have an issue while your users do.


You should have a ton of telemetry on business metrics. You would absolutely notice an issue before your users if you have those. For example at Netflix we monitored stream starts per second -- how often you hit play and it worked. That metric was the most important, and the one that triggered most investigations.

If your CPU and memory aren't affecting the business metrics, then it's not super relevant.


As someone who's been on a maintenance team for years, keeping monitoring (cpu, memory, disk, etc) for at least two weeks is critical, and I'd prefer 6 months to easily identify larger trends and prevent issues before they happen.


Very little harm... If the telemetry is from your users I'd like you to value them more than that.

Also consider the potential risks of handling personal data and leaks.


This only holds if you assume telemetry means personal data, but that is a very big if. Meta, Google and other giants generally deal in telemetry that includes personal data, however for most run of the mill software that's not the case. Outside of advertising, I would argue that for most applications you're already pretty close to being clear of personal data as long as you exclude the user's email and other identifiers from the logs. Sure, there are examples where this is not the case, but it isn't even remotely as big of a problem as you claim it to be.


A lot of telemetry can become personal data. Filenames etc. are the easy parts.

Telemetry needs to be motivated for it not to be considered spyware. You need to really consider what you are logging and why, and then whether it's worth the downsides.

It is not something to take lightly, hardly "no harm".


Very few things are worth keeping after two weeks. I like short retention policies.


> Keeping all of your application logs and telemetry forever is expensive, and I can't recall a single time when having more than a day's worth of history was ever useful in tracking down an operational issue.

A day is a pretty small window; I'd say a week or a bit more is good enough for most orgs. That way you can compare specific endpoints/code between deploys, answering questions like "was this endpoint this slow last week too, or did I break it?". Some issues take a few days to brew, and having historical data is important in debugging them. Many orgs don't do load testing at all or have any real performance analysis done before things crash.

Log retention is also directly tied to how fast and easily you can detect and recover from issues.


> Log retention is also directly tied to how fast and easily you can detect and recover from issues.

I disagree. Every issue I've ever debugged, I did a tail -f on the logs. I can't recall ever searching the old logs.

Even if it takes a few days for an issue to brew, usually the logs right now will show the issue. Or if they don't, you can turn on the logs and have them in a few days' time. It's so rare that it's almost never worth keeping the logs around just for that one case where an old log might lead to resolution, and rarely does one have time during an active incident to look at old logs anyway.


> I can't recall a single time when having more than a day's worth of history was ever useful in tracking down an operational issue.

User writes into support 3 days after the problem occurred, and support goes back and forth covering level 1 possibilities for an additional 2 days before escalating. It's common for 1 support complaint to represent some larger factor of users who never complain, so it would be useful to understand how common the issue is once it has been identified in the observability data. Having one day isn't sufficient in this scenario.


I think you missed my key point -- I'm talking about operational metrics not business metrics. With business metrics you can get historical context, but I don't see how CPU/Memory/Storage/App logs will help you.


Here is a good piece on gaining value from long term operational metrics. https://danluu.com/metrics-analytics/


Yep, I missed that. I definitely have hit cases where having a larger window of infrastructure metrics has been very useful. Being able to correlate it against other observability factors can help to understand what caused a problem. But I agree that you don't have to keep it forever. I think a few weeks is fine, assuming the scale of the system doesn't mean that a few weeks is an unwieldy amount of data


if you don't have metrics for cpu/memory/storage how do you know when to scale the app, or when you are at the limit of the storage? i feel like you have never touched servers/backend in anything more than simple projects (or at all). with full storage/memory there could be an issue where you won't be able to ssh to the server, so it speaks to your knowledge in this matter.

collecting user-identified telemetry is debatable (depends on the case), but not collecting anything at all is just plain stupid.


> if you don't have metrics for cpu/memory/storage how do you know when to scale the app

When the business metrics start to fail. You don't need constant metrics on storage; you can poll it every so often. If your app is constrained by CPU or RAM, then the business metrics will reflect that, and then you can turn on collection of those metrics to identify the problem.

> i feel like you have never touched servers/backend in anything more than simple projects (or at all). with full storage/memory there could be an issue where you won't be able to ssh to the server, so it speaks to your knowledge in this matter.

I ran all of ops for reddit for four years and headed up SRE at Netflix, so I have some experience in large scale systems. Not that it should matter.


> If your app is constrained by CPU or RAM, then the business metrics will reflect that, and then you can turn on collection of those metrics to identify the problem.

After having annoyed how many users and lost how much revenue? Having metrics to identify brewing problems before issues start to arise (be they approaching CPU, memory, disk, or network constraints, or increasing network latency that will soon, but not yet, show up in the business metrics) is valuable.

> I ran all of ops for reddit for four years and headed up SRE at Netflix, so I have some experience in large scale systems. Not that it should matter.

I have a hard time believing that at either of those places it was acceptable to have a problem ongoing for days without any idea what was happening because logs and metrics weren't enabled in the first place.


> If your app is constrained by CPU or RAM, then the business metrics will reflect that, and then you can turn on collection of those metrics to identify the problem.

…but why incur that round trip on my feedback loop? Having those metrics on doesn’t cost me much.

This feels potentially like the perspective of a large organisation with both mature monitoring systems and quite steady state user base activity (through scale). When I have a customer who had an issue yesterday because they had an unusual workload that won’t be repeated often, I can’t afford not to have had the basic metrics turned on, in case they point us in the right direction.


where you worked doesn't matter to me very much when what you are saying contradicts what you probably did ("experience in large scale systems"); it also sounds like an argument from authority.

not having cpu/mem/hdd metrics is just plain bogus and sounds like a fantasy world where everything works like we expect it to, and there are no bugs at all. ridiculous


You question his competence.

> i feel like you have never touched servers/backend in anything more than simple projects (or at all). with full storage/memory there could be an issue where you won't be able to ssh to the server, so it speaks to your knowledge in this matter.

He was answering that.

If, instead of dismissing someone outright and questioning their competence, you had raised specific concerns, this would have been a more productive conversation.


> You question his competence.

> He was answering that.

> If, instead of dismissing someone outright and questioning their competence, you had raised specific concerns, this would have been a more productive conversation.

he first said that we don't need to monitor anything, just enable debugging when "business metrics" are failing, and then he changed his stance to "polling from time to time". that just shows that his first take wasn't thoughtful, so I assumed that he never worked in "the field" or only worked on smaller projects, as nobody who has worked on bigger projects would say "we don't need CPU/mem/hdd metrics". it's not like he's proposing something novel; it's just a ridiculous take that needs to be called out


> i feel like you have never touched servers/backend in anything more than simple projects (or at all)

I feel like if you are going to go out on a limb and call someone's expertise into question...

> I ran all of ops for reddit for four years and headed up SRE at Netflix

And they provide excellent credentials which you failed to check...

> where you worked doesn't matter to me very much

You can't just weasel out of it by pretending like you didn't start the interaction by calling someone's expertise into question.


> And they provide excellent credentials which you failed to check...

that's a logical fallacy; you can work at any place on earth and still be wrong on the subject.

> You can't just weasel out of it by pretending like you didn't start the interaction by calling someone's expertise into question.

why? if his take is bad, then his job or experience doesn't change the outcome. i'm not an expert by any means, but the things he's saying just contradict everything that is standard practice, and my own experience. based on that i'm able to say that he doesn't know what he's saying/proposing, and using his "excellent credentials" just makes things worse, as it shows that he doesn't have an argument, just wishful thinking


At the scale of Netflix or Reddit it very well may make sense to only keep very limited CPU/memory stats on such a massive fleet. Look, I have a different opinion as well, but the difference between you and me is I'm not resorting to personal attacks and instead discussing it on the merits.


> At the scale of Netflix or Reddit it very well may make sense to only keep very limited CPU/memory stats on such a massive fleet.

read again what his argument is: that we don't need to store __any__ cpu/mem/storage metrics, other than "business metrics" (or, later, he walked it back to polling from time to time).

> Look, I have a different opinion as well, but the difference between you and me is I'm not resorting to personal attacks and instead discussing it on the merits.

maybe that's due to a difference in culture/region, but i'm unaware of where i've attacked him personally. i've just pointed out that what he's saying is what you'd expect from someone without experience/knowledge "in the field".


I read his argument again.

> Keeping all of your application logs and telemetry forever is expensive, and I can't recall a single time when having more than a day's worth of history was ever useful in tracking down an operational issue.

That doesn't say don't store any; it says you can get by storing a 24-hour window. And his broader point is that retention should be time-bound--storing these metrics indefinitely isn't useful and can be very expensive.

I'm of the mind that a week or two of fast online access is the right amount myself (with offline "cold" storage of logs for a longer period), but the overall premise stands: storing logs and infrastructure metrics forever is unnecessary and wasteful.


Strongly disagree. Having stored telemetry has helped me debug so many things.

Forever is probably too much, but keeping a month or so is totally sane.


What kind of things did you debug with CPU/memory/storage telemetry that you couldn't have debugged by turning those things on only after you knew there was a problem?


Identifying patterns where problems coincide with other processes or times, eventually tracking it down to a release done by another team.

It's happened to me a few times.


So your business metrics suddenly dropped, but what has changed?

This service is using 80% CPU, that seems a bit high... but is it always this high? Looks like it spiked within the last hour. But wait, it does that every Monday at 9 am, so probably a red herring.

This cache has a hit ratio of 60%... is that good? A bit low? Actually it's suspiciously high compared to last week - looks like a lot of people aren't getting a personalised feed.

Metrics are incredibly cheap to keep around for the value you get from a good operational dashboard, despite what Datadog/Amazon/Grafana Cloud tells you. It's just the most egregiously overpriced data you can buy since 20 cent text messages.

A good start is to set up VictoriaMetrics with some collectors and set retention to 14 days.


when storage is full and you don't know about it, you can't release anything to enable the logs in the first place.


You can poll storage periodically though; you don't need to keep a constant metrics stream of where it's at. Also, you can set up each machine to alert when its own storage fills up.

Also, as your storage hits 97%+, you'll probably start seeing effects in your business metrics, and then you can look into it.
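A sketch of that poll-and-alert loop (alert() is a stand-in for paging/email):

  import shutil, time

  def alert(msg):
      print("ALERT:", msg)  # stand-in for paging/email/chat

  def check_disk(path="/", threshold=0.90):
      """Alert when usage crosses the threshold; nothing gets stored."""
      usage = shutil.disk_usage(path)
      frac = usage.used / usage.total
      if frac >= threshold:
          alert(f"{path} at {frac:.0%} capacity")

  while True:
      check_disk()
      time.sleep(300)  # every 5 minutes is plenty for slow-moving disks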


I think you are confusing real-time metrics, streamed with very high precision (below 1s), with metrics that are simply polled every N seconds (most use cases).

Real-time, high-precision metrics aren't necessary. When you say that you don't need metrics and then say that you can poll metrics periodically, you are contradicting yourself.


I'm not contradicting myself. I'm saying you just poll for storage, you don't store the results. My entire thesis is that those metrics aren't worth storing.


crossing fingers that the process polling the storage doesn't crash in the future, so you won't be left in the dark; since no metric is stored, you will never know when things are about to go down the drain.


> You can poll storage periodically though; you don't need to keep a constant metrics stream of where it's at. Also, you can set up each machine to alert when its own storage fills up.

Unless you want to have trends over time, either for capacity planning (needing to order more storage in the case of bare metal, or planning costs ahead) or to correlate with other things (storage consumption is growing twice as fast since deployment X--did we change something there?).

You don't need to have 1s granularity metrics on storage consumption, but having none is just stupid levels of fake "optimisation" that will cost you more in the long run.


In general I think many programmers have internalized the idea that it’s best to waste as many computing resources as we can possibly afford as long as it’s not the bottleneck. Then, in the future, if and when it becomes the bottleneck, we’ll have plenty of headroom to optimize and look like heroes for saving the millions of dollars we never had to spend in the first place. It’s really insane (at best) or genuinely a type of grift at worst.


"Data is the new oil" - if you don't collect your customer data, and treat it as an asset, you are guilty of mismanagement . /s


Data is more like uranium than oil.[1] Valuable for its limited purpose, but dangerous to just collect and hold on to forever.

1: https://www.forbes.com/sites/forbestechcouncil/2022/10/03/th...


Does that mean Google is the new Saudi Arabia?


More like Iraq or Daniel Plainview. Drilling diagonally to tap neighboring fields.


That sounds like a popular misunderstanding of the mutual accusations before the 1990 invasion between Iraq and Kuwait of overpumping from the oil field that crosses their border (which did not involve “slant drilling”).


Rock the Kasbah


YES you do. BUT with varying retention periods for each a) environment b) region c) function d) criticality e) metric namespace/name f) team etc.

Nobody needs to retain metrics like CPU and memory for weeks, but I may want to see their numbers during an incident, or not long after it's over.



