if you don't have metrics for cpu/memory/storage how do you know when to scale the app, or when you are at the limit of your storage? i feel like you have never touched servers/backend in anything more than simple projects (or at all). with full storage/memory there could be an issue that you won't be able to ssh to the server, so it speaks about your knowledge in this matter.
collecting user-identified telemetry is debatable (depends on the case), but not collecting anything at all is just plain stupid.
> if you don't have metrics for cpu/memory/storage how do you know when to scale the app
When the business metrics start to fail. You don't need constant metrics on storage; you can poll it every so often. If your app is constrained by CPU or RAM, then the business metrics will reflect that, and then you can turn on collection of those metrics to identify the problem.
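For illustration, polling storage "every so often" could look roughly like the sketch below. This is a minimal Python example, not anyone's production setup: the function names, the 90% threshold, and the five-minute interval are all assumptions made up for the example.

```python
import shutil
import time


def disk_used_fraction(path="/"):
    """Return the fraction (0.0 to 1.0) of the filesystem at `path` in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def check_disk(path="/", threshold=0.9):
    """Return a warning string if usage is at or above `threshold`, else None.

    The 0.9 default is an arbitrary, assumed value.
    """
    fraction = disk_used_fraction(path)
    if fraction >= threshold:
        return f"{path} is {fraction:.0%} full"
    return None


def poll(path="/", threshold=0.9, interval_s=300, iterations=None):
    """Check disk usage every `interval_s` seconds.

    Runs forever when `iterations` is None; otherwise stops after that
    many checks (handy for trying it out).
    """
    done = 0
    while iterations is None or done < iterations:
        warning = check_disk(path, threshold)
        if warning:
            print(warning)
        done += 1
        if iterations is None or done < iterations:
            time.sleep(interval_s)
```

Calling `poll("/", iterations=1)` does a single check; a cron job calling `check_disk` would do the same without a long-running process.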
> i feel like you have never touched servers/backend in anything more than simple projects (or at all). with full storage/memory there could be an issue that you won't be able to ssh to the server, so it speaks about your knowledge in this matter.
I ran all of ops for reddit for four years and headed up SRE at Netflix, so I have some experience in large scale systems. Not that it should matter.
> If your app is constrained by CPU or RAM, then the business metrics will reflect that, and then you can turn on collection of those metrics to identify the problem.
After having annoyed how many users and lost how much revenue? Having metrics that identify brewing problems before issues arise (be they impending CPU, memory, disk, or network constraints, or rising network latency that will soon, but does not yet, show up in the business metrics) is valuable.
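As a rough illustration of catching a "brewing" problem, here is a minimal Python sketch that fits a straight line to recent disk-usage samples and estimates how long until the disk fills. The function name, the sample format, and the linear-growth assumption are all hypothetical; real usage is rarely this linear.

```python
def hours_until_full(samples, capacity):
    """Estimate hours from the last sample until `capacity` is reached.

    `samples` is a list of (hour, used) pairs in time order. The growth
    rate is the least-squares slope over all samples; the projection
    starts from the last observed value. Returns None when there are
    too few samples or usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    _, last_used = samples[-1]
    return max(0.0, (capacity - last_used) / slope)
```

With samples of 50, 60, and 70 GB taken one hour apart on a 100 GB disk, the slope is 10 GB per hour, so the estimate is 3 hours of headroom. None of this is possible without some history of the metric already being collected.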
> I ran all of ops for reddit for four years and headed up SRE at Netflix, so I have some experience in large scale systems. Not that it should matter.
I have a hard time believing that at either of those it was acceptable to have a problem ongoing for days without any idea what's happening because logs and metrics weren't enabled in the first place.
> If your app is constrained by CPU or RAM, then the business metrics will reflect that, and then you can turn on collection of those metrics to identify the problem.
…but why incur that round trip on my feedback loop? Having those metrics on doesn’t cost me much.
This feels potentially like the perspective of a large organisation with both mature monitoring systems and quite steady state user base activity (through scale). When I have a customer who had an issue yesterday because they had an unusual workload that won’t be repeated often, I can’t afford not to have had the basic metrics turned on, in case they point us in the right direction.
where you worked doesn't matter to me very much when what you are saying contradicts what you probably did ("experience in large scale systems"). also, it sounds like an argument from authority.
not having cpu/mem/hdd metrics is just plain bogus and sounds like a fantasy world, where everything works like we expect it to work, and there are no bugs at all. ridiculous
> i feel like you have never touched servers/backend in anything more than simple projects (or at all). with full storage/memory there could be an issue that you won't be able to ssh to the server, so it speaks about your knowledge in this matter.
He was answering that.
If, instead of dismissing someone outright and questioning their competence, you had raised specific concerns, this would have been a more productive conversation.
> You question his competence.
> He was answering that.
> If, instead of dismissing someone outright and questioning their competence, you had raised specific concerns, this would have been a more productive conversation.
he first said that we don't need to monitor anything, just enable debugging when "business metrics" are failing, and then he changed his stance to "polling from time to time". that just shows that his first take wasn't thoughtful, so I assumed that he never worked in "the field" or only worked on smaller projects, as nobody who has worked on bigger projects would say that "we don't need CPU/mem/hdd metrics". it's not like he's proposing something novel, that's just a ridiculous take that needs to be called out
> And they provide excellent credentials which you failed to check...
that's a logical fallacy, you can work in any place on earth and still be wrong on the subject.
> You can't just weasel out of it by pretending like you didn't start the interaction by calling someone's expertise into question.
why? if his take is bad, then his job or experience doesn't change the outcome. i'm not an expert by any means, but the things that he's saying just contradict everything that is standard practice and my own experience. based on that i'm able to say that he doesn't know what he's saying/proposing, and using his "excellent credentials" just makes things worse, as it shows that he doesn't have an argument, just wishful thinking
At the scale of Netflix or Reddit it very well may make sense to only keep very limited CPU/memory stats on such a massive fleet. Look, I have a different opinion as well, but the difference between you and me is I'm not resorting to personal attacks and instead discussing it on the merits.
> At the scale of Netflix or Reddit it very well may make sense to only keep very limited CPU/memory stats on such a massive fleet.
read again what his argument is: that we don't need to store __any__ cpu/mem/storage metrics, other than "business metrics" (or later he walked it back to polling from time to time).
> Look, I have a different opinion as well, but the difference between you and me is I'm not resorting to personal attacks and instead discussing it on the merits.
maybe that's due to a difference in culture/region, but i'm unaware of where i've attacked him personally. i've just pointed out that what he's saying is to be expected from someone without experience/knowledge "in the field".
> Keeping all of your application logs and telemetry forever is expensive, and I can't recall a single time when having more than a day's worth of history was ever useful in tracking down an operational issue.
That doesn't say don't store any, that says you can get by storing a 24 hour period. And his broader point is that it should be time bound, that storing these metrics indefinitely isn't useful and can be very expensive.
I'm of the mind that a week or two of fast online access is the right amount myself (with offline "cold" storage of logs for a longer period), but the overall premise, that storing logs and infrastructure metrics forever is unnecessary and wasteful, holds.
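To make the time-bound retention idea concrete, here is a minimal Python sketch that drops metric samples older than a retention window. The function name and the in-memory list of `(timestamp, value)` pairs are assumptions for illustration; a real metrics store would normally handle expiry itself.

```python
from datetime import datetime, timedelta, timezone


def prune(samples, now, retention=timedelta(days=1)):
    """Return only the samples whose timestamp is within `retention` of `now`.

    `samples` is a list of (timestamp, value) pairs. The one-day default
    mirrors the 24 hour figure from the thread; pass timedelta(weeks=2)
    for the longer window suggested above.
    """
    cutoff = now - retention
    return [(ts, value) for ts, value in samples if ts >= cutoff]


# Example: keep one day of history out of three samples.
now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
samples = [
    (now - timedelta(days=3), 0.41),   # older than the window, dropped
    (now - timedelta(hours=6), 0.57),  # kept
    (now, 0.63),                       # kept
]
recent = prune(samples, now)
```

Here `recent` holds the last two samples; with `retention=timedelta(weeks=2)` all three would be kept.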