ElasticSearch was never really meant for log storage anyway. It’s a full text engine, and just happened to work reasonably well for that purpose at lower volume. ELK ran with it in an attempt to go after Splunk, but it is phenomenally difficult to scale an indexing pipeline like ELK to high volume. There are far better ways to handle log analysis, particularly when your primary query is counting things that happen over T instead of finding log entries that match a query (which it always is) — streaming analysis is a much better fit than indexing, just lesser known.
As someone who knows of several places using it for multi-petabyte log retention and log analysis, what is Elastic used for, if it's not good at log indexing and text search? If I need to find logs based on one snippet of data, then pivot on different fields to analyze events, what should I be using?
> ElasticSearch was never really meant for log storage anyway.
Indeed, it wasn't designed for log storage, though it happened to match this
use scenario well (now less so with every release).
> There are far better ways to handle log analysis, particularly when your primary query is counting things that happen over T instead of finding log entries that match a query (which it always is)
Oh? This is the first I've heard that my use case (storing logs from syslog
for diagnostics at a later time) counts things over time. Good to know. I may
ask you later for more insight about my environment.
> streaming analysis is a much better fit than indexing, just lesser known.
Well, I do this, too, and not only from logs. I still need arbitrary term
search for logs.
The snark is totally unnecessary, since the vast majority of people deploy ELK to do reporting. Full term search is achievable with grep; what does ES give you for your non-reporting use case, since troubleshooting is an extremely low frequency event? Are you primarily leaning on its clustering and replication? Genuinely curious.
Between the overhead per log record and building multiple indexes at log line rate, there are so many reasons not to do your use case in ES that I don’t even consider it. I think it’s a poorer fit than reporting, to be honest.
ELK > grep for searching. As the other poster said, per-field filtering and rapid pivoting are a MUCH more effective workflow than grepping for string fragments and hoping they match on the proper field in a syslog message.
And you keep talking about how much you know and how ELK is literally worse than grep for searching off fields in logs for troubleshooting, but offer no alternative setups or use cases. You're hand-waving.
I've seen some of the performance issues of ELK at scale, and I'd be interested in what's out there, because it's not my area of expertise. But you are just yelling "dataflow" and "streaming analytics".
> The snark is totally unnecessary, since the vast majority of people deploy ELK to do reporting.
You shouldn't have used a universal quantifier so authoritatively. There are
plenty of sysadmins who use ES for this case; you apparently just happened to
be exposed only to its use with websites.
Then, what do ES+Kibana give me over grep? Search over a specific field (my
logs are parsed to a proper data structure), which includes the type of event
(obviously, different types for different daemons), a query language, and
a frontend with histograms.
Mind you, troubleshooting around a specific event is but one of the things
sysadmins do with logs. There are also other uses, all landing in the realm of
post-hoc analysis.
Kibana and histograms are reporting. Now the snark is even more confusing, since you’re doing exactly what I say is a poor fit, but claiming it’s not your use case. I spend what time I can trying to show those very same sysadmins you’re talking about why ES is a poor architecture for log work, particularly at scale.
As an SRE, I’ve built high volume log processing at every employer in multiple verticals, including web. I know what sysadmins do. Not a fan of the condescension and assumptions you’re making. I have an opinion. We differ. That’s fine. Let it be fine.
> Kibana and histograms are reporting. [...] you’re doing exactly what I say is
> a poor fit, but claiming it’s not your use case.
You must be from the species that can predict each and every report before
it's needed. Good for you.
Also, I didn't claim that I don't use reports known in advance; I do use them.
But there are cases when preparing such a report just to see one trend is
overkill, and there's still troubleshooting that is helped by the query
language. Your defined-in-advance reports don't help with that.
> I spend what time I can trying to show those very same sysadmins you’re talking about why ES is a poor architecture for log work, particularly at scale.
OK. What works "particularly at scale", then?
Also, do you realize that "particularly at scale" is quite a rare setting, and
that "a dozen gigabytes a day or less" is a much, much more common scale, where
ES works (worked) reasonably well?
You should read the Dremel and Dataflow papers as examples of alternative approaches and dial down your sarcastic attitude by about four clicks. You don’t need to define reporting ahead of time when architected well; it’s quite possible to do ad-hoc and post-hoc without indexing per record. At small scale, your questions are quite infrequent and the corpus small, meaning waiting on a full scan isn’t the end of the world.
A dozen or less gigabytes a day means: use grep. This is just like throwing Hadoop at that log volume.
This was an opportunity to learn from someone with a different perspective, and I could learn something from yours, but instead, you’ve made me regret even saying anything. I’m sorry, I just can’t engage with you further.
(Edit: I’m genuinely mystified that discussing alternative architectures is somehow arrogant “pissing on” people. Why personalize this so much?)
So, basically, you have/had access to closed software designed
specifically for working with system logs, and based on that you piss on
everybody who uses what they have at hand on a smaller scale. Or at least this
is how I see your comments here.
I may need to tone down my sarcasm, but likewise, you need to tone down your
arrogance about working at Google or compatible.
But still, thank you for the search keyword ("dremel"). I will certainly read
the paper (though I don't expect too many very specific ideas from
a publication ten pages long), since I dislike the current landscape of only
having ES, flat files, and paid solutions for storing logs at a rate of a few GB
per day.
> A dozen or less gigabytes a day means: use grep. This is just like throwing Hadoop at that log volume.
No, not quite. I do also use grep and awk (and App::RecordStream) with that.
I still want to have a query language for working with this data, especially
if it is combined with an easily usable histogram plotter.
“Dataflow” and the open source ecosystem in that neighborhood (Flink, Spark, Beam, Kafka, that family of stuff) offer a much more powerful way to look at logs in real time, rather than indexing them into storage and then querying. As far as I’m aware, there just isn’t anything off the shelf with that architecture that is as easy to deploy as ElasticSearch. (There should be!) Once you jump the mental gap and treat events as identical to logs, then start looking at fun architectures like event sourcing, you start realizing streaming analysis is a pretty powerful way of thinking.
I’ve extracted insight from millions of log records per second on a single node with a similar setup, in real time, with much room to scale. The key to scaling log analysis is to get past textual parsing, which means using something structured, which usually negates the reason you were using ElasticSearch in the first place.
Google’s first Dataflow talk from I/O and the paper should give you an idea of what log analysis can look like when you get past the full text indexing mindset. Note that there’s nothing wrong with ELK, but you will run into scaling difficulty far sooner than you’d expect trying to index every log event as it comes. It’s also bad to discover this when you get slashdotted, and your ES cluster whimpers trying to keep up. One thing streaming often gets you is latency in that situation instead of death, since you’re probably building atop a distributed log (and falling behind is less bad than falling over).
The key here is: are you looking for specific log events or are you counting? You’re almost always counting. Why index individual events, then?
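To make the counting argument concrete, here is a minimal sketch of counting over fixed time windows in a single pass, with no per-record index. It is not taken from any of the systems named above; the function name, the 60-second window, and the sample events are all assumptions for illustration, and a real streaming engine would also have to deal with late or out-of-order events.

```python
from collections import Counter


def count_per_window(events, window_seconds=60):
    """Count events per (window_start, key) in one pass, storing no records."""
    counts = Counter()
    for timestamp, key in events:
        # Align each timestamp to the start of its fixed (tumbling) window.
        window_start = int(timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return counts


# Hypothetical (epoch_seconds, http_status) pairs, already parsed from log lines.
events = [(0, "200"), (12, "500"), (59, "200"), (60, "200"), (61, "500"), (125, "200")]

window_counts = count_per_window(events)
for (window_start, status), n in sorted(window_counts.items()):
    print(window_start, status, n)
```

The state kept is one counter per window and key, so memory grows with the number of distinct keys rather than with the number of log lines.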
I don't think this is "always" true. The power of having the data in records is that you can trivially pivot and slice and dice the data in different ways. When it's aggregated to a count - I don't have that ability. When trying to debug something that's happening, I find it far easier to have the entire records.
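As a rough illustration of what keeping the full records buys you, the sketch below groups the same records by whichever fields you pick after the fact. The field names, the sample records, and the `pivot` helper are hypothetical and only stand in for what a query language would do.

```python
from collections import Counter

# Hypothetical parsed log records; the field names are illustrative only.
records = [
    {"host": "web1", "daemon": "sshd", "severity": "err"},
    {"host": "web1", "daemon": "nginx", "severity": "info"},
    {"host": "web2", "daemon": "sshd", "severity": "err"},
    {"host": "web2", "daemon": "sshd", "severity": "info"},
]


def pivot(records, *fields, **filters):
    """Count records grouped by `fields`, keeping only those matching `filters`."""
    matching = (r for r in records if all(r.get(k) == v for k, v in filters.items()))
    return Counter(tuple(r.get(f) for f in fields) for r in matching)


# The same records answer two different questions, neither chosen in advance.
by_daemon = pivot(records, "daemon")
errors_by_host = pivot(records, "host", severity="err")
print(by_daemon)
print(errors_by_host)
```

Had only a per-daemon count been stored, the second question (errors per host) could not be answered without going back to the raw records.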
As for scaling, it scales very well (you can read through the Elastic blog/use cases to see plenty of examples). That's not to say there aren't levels of scale it won't handle. But I would venture to say that for 99% of the people out there, it will solve their problems very well.
Would it be accurate to say that your ability to count something new, starting from a previous point in time (i.e. not just from 'now') is dependent upon how easily you can replay your stream (I'm thinking in Kafka terms); or is there something in your architecture that allows you to consume from the stream from a (relative? absolute?) point in time (again, Kafka thinking leaking into my question)?
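One way to picture the replay question is a toy in-memory stand-in for a retained, append-only log. This is not Kafka code: the `ReplayableLog` class and its method names are hypothetical, and it assumes events are appended in timestamp order. Kafka's Java consumer does, as far as I know, offer a comparable lookup from timestamp to offset (`offsetsForTimes`), limited by how far back the topic's retention goes.

```python
import bisect


class ReplayableLog:
    """Toy append-only log; assumes timestamps arrive in non-decreasing order."""

    def __init__(self):
        self._timestamps = []
        self._events = []

    def append(self, timestamp, event):
        self._timestamps.append(timestamp)
        self._events.append(event)

    def offset_for_time(self, timestamp):
        # First offset whose timestamp is >= the requested absolute time.
        return bisect.bisect_left(self._timestamps, timestamp)

    def read_from(self, offset):
        return iter(self._events[offset:])


log = ReplayableLog()
for ts, event in [(90, "timeout"), (100, "ok"), (105, "timeout"), (130, "timeout")]:
    log.append(ts, event)

# A new question asked later: how many timeouts since t=100?
start = log.offset_for_time(100)
timeouts_since_100 = sum(1 for event in log.read_from(start) if event == "timeout")
print(start, timeouts_since_100)
```

A new count can start from any retained point; anything older than the retained data is out of reach unless it was archived somewhere else.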