Well... we have a 3-node MongoDB cluster and are processing up to a million trades... per second. And a trade is way more complex than a chat message. It has tens to hundreds of fields, may require enriching with data from multiple external services, and then needs to be stored, be searchable with unknown, arbitrary bitemporal queries, and may need multiple downstream systems to be notified when it is modified, depending on a lot of factors.
All this happens on the aforementioned MongoDB cluster and just two server nodes. And the two server nodes are really only for redundancy, a single node easily fits the load.
What I want to say is:
-- processing a hundred million simple transactions per day is nothing difficult on modern hardware.
-- modern servers have stupendous potential to process transactions which is 99.99% wasted by "modern" application stacks,
-- if you are willing to spend a little bit of learning effort, it is easily possible to run millions of non-trivial transactions per second on a single server,
-- most databases (even one as bad as MongoDB) have the potential to handle much more load than people think they can. You just need to kind of understand how the database works and what its strengths are, and play into them rather than against them.
And if you think we are running Rust on bare metal and some super large servers -- you would be wrong. It is a normal Java reactive application running on OpenJDK on an 8-core server with a couple hundred GB of memory. And the last time I needed to look at the profiler was about a year ago.
A million sounds impressive, but this is clearly not serialized throughput based on other comments here. Getting a million of anything to fast NVMe is trivial if there is no contention and you are a little clever with IO.
I have written experimental datastores that can hit in excess of 2 million writes per second on a samsung 980 pro. 1k object size, fully serialized throughput (~2 gigabytes/second, saturates the disk). I still struggle to find problem domains this kind of perf can't deal with.
If you just care about going fast, use 1 computer and batch everything before you try to put it to disk. Doesn't matter what fancy branding is on it. Just need to play by some basic rules.
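To make the "batch everything before you try to put it to disk" point concrete, here is a minimal group-commit sketch in Java (my own illustration rather than the commenter's code; the class name, queue size and batch size are invented): many producer threads enqueue records, one writer thread drains whatever has accumulated, writes the whole batch as a single buffer, and pays the fsync once per batch.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

/** Group-commit writer: many producers enqueue records, one thread
 *  batches them and pays the write/fsync cost once per batch. */
final class BatchedLogWriter implements Runnable {
    private final BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(65_536);
    private final FileChannel channel;

    BatchedLogWriter(Path file) throws IOException {
        this.channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
    }

    /** Producers call this; it blocks only if the queue is full (back-pressure). */
    void append(byte[] record) throws InterruptedException {
        queue.put(record);
    }

    @Override
    public void run() {
        List<byte[]> batch = new ArrayList<>(8_192);
        try {
            while (!Thread.currentThread().isInterrupted()) {
                // Wait for at least one record, then grab whatever else is queued.
                byte[] first = queue.poll(10, TimeUnit.MILLISECONDS);
                if (first == null) continue;
                batch.add(first);
                queue.drainTo(batch, 8_191);

                int size = batch.stream().mapToInt(r -> r.length).sum();
                ByteBuffer buf = ByteBuffer.allocate(size);
                batch.forEach(buf::put);
                buf.flip();
                while (buf.hasRemaining()) channel.write(buf);
                channel.force(false);   // one fsync amortized over the whole batch
                batch.clear();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

The throughput comes almost entirely from amortizing the write() and force() calls; the only real tunable is how long you let records sit in the queue, which is your latency budget.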
Primary advantage with 1 computer is that you can much more easily enforce a total global ordering of events (serialization) without resorting to round trip or PTP error bound delays.
Trading is ideal for this for multiple reasons; one is that total global ordering is a key feature (and requirement) of the domain, so the "one big fast server" approach is a good fit. It is also quite widely known that several of the big exchanges operate this model: a single sequencer application that then uses multicast to transmit the outcomes of what it sees.
The other thing that is helping a lot here compared to Discord: Trading is very neatly organized in trading days and shuts down for hours between each trading day. So you don't have the issue that Discord had where some channels have low message volumes and others have high, leading to having scattered data all over the place. You can naturally partition data by day and you know at query time which data you want to have.
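A tiny illustration of that natural partitioning (purely hypothetical names, not from the original comments): if the partition key is the trading day, both the write side and the query side always know exactly which compact partition they are touching.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

/** Hypothetical routing of trades into per-trading-day partitions (here just a
 *  collection/table name), so a query for a known day hits one compact partition. */
final class DayPartitioner {
    private static final DateTimeFormatter DAY = DateTimeFormatter.BASIC_ISO_DATE;

    static String partitionFor(LocalDate tradingDay) {
        return "trades_" + DAY.format(tradingDay);   // e.g. "trades_20240517"
    }
}
```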
The decentralized Solana network is currently capable of over 50k TPS 24/7 in the live Beta, with a target transaction finality speed of 30ms round trip from your browser/device. Their unofficial motto is "DeFi at Nasdaq Speed." Solana is nascent and will likely reach 10M+ TPS and 10ms finality within a couple of years' time.
Directed Acyclic Graph based networks (e.g. Hashgraph; these are not technically blockchains) can reach effectively infinite TPS but suffer in time to finality.
Solana is a blockchain with zero downtime (and a Turing-complete smart-contract chain), mind you -- not a centralized exchange.
You can either handle the load or you can't, right? A built-in maintenance window is super nice, but servers crash all the time. So either that's a problem, or you've got a system in place. And if you can handle failover, you get free maintenance windows anyway, so it doesn't seem any more difficult?
It is wise if you mean "be ready for servers to crash at any time by thinking they are going to crash at the worst possible moment".
But it is stupid, because people think they need massive parallel deployments just because servers will be constantly crashing, and that is just not true. The cost they pay is having a couple of times more nodes than they would really need if they got their focus right (making the application efficient first, scalable later).
The reality is, servers do not crash. At least not the kind of hardware I am working on.
I was responsible for keeping communication with a stock exchange running for about 3 years in one of my past jobs, and during that time we didn't lose a single packet.
And aside from some massively parallel workloads that used tens of thousands of nodes, and aside from the one time my server room boiled over due to a failed AC (and no environmental monitoring), I have not had a server crash on me in the past 20 years.
So you can reasonably assume that your servers will be functioning properly (if you bought quality), and that helps a lot at the design stage.
This is the regime we operate in as well. For our business, a failure, while really bad, is not catastrophic (we still maintain non-repudiation). We look at it like any other risk model in the market.
For many in our industry, the cost of not engineering this way and eating the occasional super rare bag of shit is orders of magnitude higher than otherwise tolerable. One well-managed server forged of the highest binned silicon is usually the cheapest and most effective engineering solution over the long term.
Another super important thing to remember is that the main goal of this is to have super simple code and very simple but rock-solid guarantees.
The main benefit is writing application code that is simple, easy to understand and simple to prove it works correctly, enabled by reliable infrastructure.
When you are not focusing on various ridiculous technologies that each require PhDs to understand well, you can focus on your application stack, domain modeling, etc. to make it even more reliable.
> When you are not focusing on various ridiculous technologies that each require PhDs to understand well, you can focus on your application stack, domain modeling, etc. to make it even more reliable.
This is 100% our philosophy. I honestly don't understand why all high-stakes software isn't developed in the same way that we build these trading/data systems.
I think this is the boundary between "engineering" and "art". In my experience, there are a lot of developers who feel like what they do is not engineering because they believe it to be so subjective and open to interpretation. Perhaps there is a mentality that it can't ever be perfect or 100% correct, so why even try to uphold such a standard? It is certainly more entertaining to consume new shiny technology than to sit down with business owners in boring meetings for hours every week...
In reality, you can build software like you build nuclear reactors. It is all a game of complexity management and discipline. Surprisingly, it usually costs less when you do it this way, especially after accounting for the total lifecycle of the product/service. If you can actually build a "perfect" piece of software, you can eliminate entire parts of the org chart. How many developer hours are spent every day at your average SV firm fighting bugs and other regressions? What if you could take this to a number approximating zero?
The classical retort I hear from developers when I pose comments like these is "Well, the business still isn't sure exactly what the app should do or look like". My response to that is "Then why are you spinning up Kubernetes clusters when you should be drawing wireframes and schema designs for the customer to review?"
Every time I write something like "Yes, you really can write a reliable application. No, if it breaks you can't blame everybody and the universe around you. You made a mistake and you need to figure out how it happened and how to prevent it from happening in the future." I just get downvoted to hell.
I suspect in large part it is because when people fail at something they feel a need to find some external explanation of it. And it is all too easy when "business" actually is part of the problem.
The best people I worked with, let's just say I never heard them blaming business for their bugs. They own it, they solve it and they learn from it.
What I am not seeing is people actually taking a hard look at what they have done and how they could have avoided the problems.
For example, the single biggest cause of failed projects I have seen, by far, is unnecessary complication stemming from easily avoidable technical debt.
Easily avoidable technical debt is something that could reasonably have been predicted at an early stage and solved by just making better decisions. Maybe don't split your application into 30 services and then run it on Kubernetes? Maybe, rather than separate services, pay attention to having proper modules and APIs within your application, and your application will fit on a couple of servers? Maybe having function calls rather than a cascade of internal network hops is a cheap way to get good performance, rather than (ignoring Amdahl's law and) trying to incorporate some exotic database that nobody knows and everybody will have to start learning from scratch?
Then people rewrite these projects and, rather than understanding what caused the previous version to fail, just repeat the same process, only with a new application stack.
We know because that communication happens over UDP and each packet carries an application-layer sequence number. It is used on the receiving side to rebuild the sequence of packets (events from the exchange can only be understood correctly when processed in the same order as they were generated, and only if you have the complete stream -- you can't process a packet until you have processed the one preceding it). It is trivial to detect whether we have missed a packet.
We had a special alert for a missing packet. To my knowledge that has never activated in production except for exchange-mandated tests (the exchange runs tests regularly to ensure every brokerage house can handle faults in communications, maximum loads, etc.)
If a packet is missed, the same data is replicated on another, independent link through another carrier.
And if this wasn't enough, if your system is down (which shouldn't happen during trading) you can contact the exchange's TCP service and request the missing sequences. But that never happened, either.
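As a rough sketch of the receive side described here (not their actual framework; class and method names are invented): packets carry an application-level sequence number, are released strictly in order, and a gap is noticed the moment a later packet arrives, at which point the alert/arbitration/replay machinery would kick in.

```java
import java.util.TreeMap;

/** Sketch of the receive-side reassembly described above: packets carry an
 *  application-level sequence number, are processed strictly in order, and a
 *  gap is detected as soon as a later packet arrives.  Names are invented. */
final class SequencedFeed {
    private long nextExpected = 1;                       // first expected sequence on this feed
    private final TreeMap<Long, byte[]> pending = new TreeMap<>();

    void onPacket(long seq, byte[] payload) {
        if (seq < nextExpected) return;                  // already processed (e.g. duplicate from the B-feed)
        pending.put(seq, payload);
        // Release the contiguous run starting at nextExpected, if there is one.
        while (pending.containsKey(nextExpected)) {
            process(pending.remove(nextExpected));
            nextExpected++;
        }
        if (!pending.isEmpty() && pending.firstKey() > nextExpected) {
            onGapDetected(nextExpected, pending.firstKey() - 1);
        }
    }

    private void process(byte[] payload) { /* hand off to the pipeline, in order */ }

    private void onGapDetected(long fromSeq, long toSeq) {
        // Raise the alert; in the setup described, the redundant carrier link usually
        // fills the hole, and the exchange's TCP replay service is the last resort.
    }
}
```

In an A/B-feed setup the same object would be fed from both links, so the duplicate-drop at the top doubles as arbitration between the two carriers.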
As we really liked this pattern, we built a small framework and used it for internal communication as well, including data flowing to traders' workstations.
Mind you, neither the carrier links, the networking devices, nor the people who maintain them are cheap.
In regular trading (not crypto; see the other comment about the volume differences) it is common, for example, to tune Java to run GC outside of trading hours. That works if you don't allocate new heap memory for every transaction/message but instead only use the stack plus pre-allocated pools.
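A common way to get to "no heap allocation per message" in Java is a pre-allocated object pool; the sketch below is generic (class and field names are made up, not taken from any specific trading system):

```java
import java.util.ArrayDeque;

/** Illustrative only: a pre-allocated pool of mutable message objects so the
 *  hot path allocates nothing on the heap and the GC has nothing to collect
 *  during trading hours.  Single-threaded by design (one pool per pipeline stage). */
final class OrderEventPool {
    static final class OrderEvent {
        long orderId;
        long price;      // fixed-point, e.g. price * 10_000
        int  quantity;
        void clear() { orderId = 0; price = 0; quantity = 0; }
    }

    private final ArrayDeque<OrderEvent> free = new ArrayDeque<>();

    OrderEventPool(int size) {
        for (int i = 0; i < size; i++) free.push(new OrderEvent()); // allocate everything up front
    }

    OrderEvent acquire() {
        OrderEvent e = free.poll();
        return e != null ? e : new OrderEvent(); // pool exhausted: fall back (and ideally alert)
    }

    void release(OrderEvent e) {
        e.clear();
        free.push(e);
    }
}
```

With every message borrowed from and returned to such a pool, the young generation stays quiet during the trading day and a full GC can be deferred to the maintenance window.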
That's not really what this article is about. Their problem wasn't throughput. What's the size of all the data in your MongoDB instance? And what's the latency on your reads?
In the big data world the "complexity" of the data doesn't really mean much. It's just bytes.
> What's the size of all the data in your MongoDB instance?
3x12TB
> In the big data world the "complexity" of the data doesn't really mean much.
Oh how wrong you are.
It is much easier to deal with data when the only thing you need to do is to just move it from A to B. Like "find who should see this message, make sure they see it".
It is much different when you have a large, rich domain model that runs tens of thousands of business rules on incoming data, and each entity can have very different processing depending on its state and the event that came in.
I am writing whole applications just to data-mine our processing flow just to be able to understand a little bit of what is happening there.
At that traffic you can't even log anything for each of the transactions. You have to work indirectly through various metrics, etc.
Nice, that's a pretty decent size; still curious about the latency. That's the primary problem for a real-time chat app.
Complexity of data and running business rules on it is not a data store problem though, that's a compute problem. It's highly parallelizable and compute is cheap.
For reference, my team runs transformations on about 1 PB of (uncompressed) data per day with 3 spark clusters, each with 50 nodes. We've got about 70ish PB of (compressed) data queryable. All our challenges come from storage, not compute.
In order to be able to run so much stuff on MongoDB, we almost never run single queries to the database. If I fetch or insert trade data, I probably run a query for 10 thousand trades at the same time.
So what happens is, as data comes from multiple directions it is batched (for example 1-10 thousand records at a time), split into groups that can be processed with a roughly similar process, and then travels the pipeline as a single batch, which is super important as it allows amortizing some of the costs.
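A minimal Project Reactor sketch of that application-level batching (my own example; the batch size, timeout and placeholder fetch are invented, and the real system would issue something like a single $in query per batch):

```java
import java.time.Duration;
import java.util.List;
import reactor.core.publisher.Flux;

/** Application-level micro-batching: individual trade ids stream in, get grouped
 *  into batches of up to 10_000 (or whatever arrived within 50 ms), and each batch
 *  becomes one bulk query instead of thousands of point lookups. */
final class BatchedLoader {

    Flux<Trade> loadTrades(Flux<Long> tradeIds) {
        return tradeIds
                .bufferTimeout(10_000, Duration.ofMillis(50)) // batch size vs. latency trade-off
                .flatMap(this::fetchBatch, 4);                // keep a few batches in flight
    }

    // One round trip per batch; in a real system this would be a bulk database query.
    private Flux<Trade> fetchBatch(List<Long> ids) {
        return Flux.fromIterable(ids).map(Trade::new);        // placeholder result
    }

    record Trade(long id) {}
}
```

bufferTimeout is exactly the throughput-versus-latency dial mentioned below: bigger batches amortize more cost, while the timeout caps how long any single item can sit waiting.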
Also the processing pipeline has many, many steps in it. A lot of them have buffers in between so that steps don't get starved for data.
All this causes latency. I try to keep it subsecond but it is a tradeoff between throughput and latency.
It could have been implemented better, but the implementation would be complex and inflexible. I think having clear, readable, flexible implementation is worth a little bit of tradeoff in latency.
As to storage being the source of most woes, I fully agree. In our case it is trying to deal with the bloat of data caused by the business wanting to add this or that. All this data makes database caches less effective, requires more network throughput and more CPU for parsing/serializing, needs to be replicated, etc. So half the effort is constantly trying to figure out why they want to add this or that, and whether it is really necessary or can be avoided somehow.
I thought you were doing millions of QPS with a 3-node MongoDB cluster, from the top-level comment. That would be impressive.
By batching 1-10 thousand records at a time, your use case is very different from Discord's, which needs to deliver individual messages as fast as possible.
Data doesn't come in or leave batched. This is just an internal mechanism.
Think in terms of Discord: their database probably already queues and batches writes. Or maybe they could decide to fetch details of multiple users with a single query by noticing there are 10k concurrent requests for user details. So why have 10k queries when you could have 10 queries for 1k user objects each?
If you complain that my process is different because I refuse to run it inefficiently when I can spot an occasion to optimize then yes, it is different.
Of course, Cassandra/MongoDB/etc. can perform their own batching when writing to the commit log, and can also benefit from write combining by not flushing the dirty data immediately. That's beside the point.
Your use case allows you to perform batching for writes at the *application layer*, while discord's use case doesn't.
Why couldn't others with lots of traffic use a similar approach? I assume they do. Seems like a pretty genius idea to batch things like that; especially when QPS is very high, batching (maybe waiting a few ms to fill a batch) makes a lot of sense.
I don't see why Discord's case can't use the same tricks. If they have a lot of stuff happening at the same time and their application is relatively simple (in terms of the number of different types of operations it performs), at any point in time it is bound to have many instances of the same operation being performed.
Then it is just a case of structuring your application properly.
Most applications are immediately broken, by design, by having a thread dedicated to each request/response pair. It then becomes difficult to have pieces of processing from different threads be selected and processed together to take advantage of amortized costs.
The alternative I am using is funneling all requests into a single pipeline and having that pipeline split into stages distributed over CPU cores. So it comes in (by way of Kafka or REST call, etc.), it is queued, it goes to CPU core #1, gets some processing there, then moves to CPU core #2, gets some other processing there, gets published to CPU core #3 and so on.
Now, each of these components can work on a huge number of tasks at the same time. For example, when the step is to enrich the data, it might be necessary to shoot a message to another REST service and wait for the response. During that time the component picks up other items and does the same for them.
As you see, this architecture practically begs to use batching and amortize costs.
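Here is roughly what that shape looks like in Project Reactor, which the commenter says elsewhere in the thread is what they use (the stage names, scheduler sizes and placeholder bodies are all invented for illustration):

```java
import java.time.Duration;
import java.util.List;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;
import reactor.core.scheduler.Scheduler;
import reactor.core.scheduler.Schedulers;

/** Sketch of "single pipeline, stages spread over cores": everything is funneled
 *  into one Flux, each stage runs on its own scheduler, and the final stage
 *  batches before touching the database. */
final class TradePipeline {
    private final Scheduler parseStage  = Schedulers.newSingle("parse");
    private final Scheduler enrichStage = Schedulers.newParallel("enrich", 4);
    private final Scheduler storeStage  = Schedulers.newSingle("store");

    Flux<String> run(Flux<byte[]> incoming) {
        return incoming
                .publishOn(parseStage)
                .map(this::parse)                       // stage 1: decode on its own core
                .publishOn(enrichStage)
                .flatMap(this::enrich, 256)             // stage 2: up to 256 enrichments in flight
                .publishOn(storeStage)
                .bufferTimeout(5_000, Duration.ofMillis(20))
                .flatMap(this::storeBatch);             // stage 3: amortized, batched writes
    }

    private String parse(byte[] raw)      { return new String(raw); }       // placeholder decode
    private Mono<String> enrich(String t) { return Mono.just(t); }          // would call a REST service
    private Flux<String> storeBatch(List<String> batch) { return Flux.fromIterable(batch); }
}
```

Because every stage sees a stream rather than a single request, batching (the bufferTimeout before the store stage) and overlapping slow enrichment calls fall out naturally, which is the "this architecture practically begs for batching" point.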
What you're describing sounds like vanilla async concurrency. I seriously doubt 'most applications' use the one-thread-per-request model at this point in time; most major frameworks are async now. And it's not a silver bullet either; there are plenty of articles on how single-threaded designs are sometimes a better fit for extremely high-performance apps.
After reading all of your responses, I still don't see how you think your learnings apply to Discord. They would not be able to fit the indexes in memory on MongoDB. They can't batch reads or writes at the application-server level (the latency cost for messaging is not acceptable). Millions of queries happen every second, not one-off analytical workloads. It seems these two systems are far enough apart that there is really no meaningful comparison to be made here.
Well, on one hand you've got engineers at a billion-dollar company explaining how they've solved a problem. On the other hand you've got some random commenter on HN over-simplifying a complex engineering solution.
I think you're reading into it. They are stating that the solution in the post was overengineered, and describing an alternate solution that doesn't require as much abstraction or resources, but is manageable for data with a much higher dimensional structure
The fact that you read that as "I am very smart" and that that was a reason to downvote the post, tells more about you than it does the person you're supposedly describing.
As an example, there are bitemporal queries like "for the given population of trades specified by the following rules, find the set of trades that met the rules at a particular point in time, based on our knowledge at another given point in time". Also, trades are versioned (they are a stream of business events from the trading system) and then have amendments (each event may be amended in the future but the older version must be preserved). Our system can also amend the data (for example to add some additional data to a trade later). All this causes trades to be a tree of immutable versions you need to comb through. A trade can have anywhere from 1 to 30k versions.
This takes about 20 seconds. The process opens about 200 connections to the cluster and transfers data at about 2-4GB/s.
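A toy model of that kind of bitemporal lookup, just to pin down the semantics (field names and the in-memory representation are my own; the real thing runs as queries against the versioned trade store): among the versions we had recorded by the knowledge time, pick the latest one that was effective at the valid time.

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Each trade is a set of immutable versions, stamped with when it became effective
 *  in the business sense (validFrom) and when we learned about it (recordedAt). */
record TradeVersion(long tradeId, int version, Instant validFrom, Instant recordedAt) {}

final class BitemporalQuery {
    /** "What did this trade look like at valid time V, based on what we knew at time K?" */
    static Optional<TradeVersion> asOf(List<TradeVersion> versions,
                                       Instant validTime, Instant knowledgeTime) {
        return versions.stream()
                .filter(v -> !v.recordedAt().isAfter(knowledgeTime)) // only what we knew then
                .filter(v -> !v.validFrom().isAfter(validTime))      // only what was effective then
                .max(Comparator.comparing(TradeVersion::validFrom)   // latest effective version...
                        .thenComparing(TradeVersion::recordedAt));   // ...as last amended by then
    }
}
```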
Are you not sure that financial data "with hundreds of fields" is more complex than chat data, which has relatively linear threading and only a handful of fields?
I'm asking about how your system scales to the number of queries, but you seem to be taking every question personally. You seem to really want to make sure everyone knows that you think Discord's problems are easy to solve. I'm not saying Discord is more complicated, but you're not really giving enough information to prove that Discord's problems are a subset of yours.
Do you support more simultaneous queries than Discord?
These days they added an extra thread_id field, sure. But the data itself is blisteringly uncomplex and there is only a single way to display it (ordered by time, i.e. the 'thread').
Just recently Symphony switched the order in which I received two messages from a colleague. This completely changed the meaning of the message, and we got into an argument that was only resolved after I posted a screenshot of what he wrote.
It seems the threading might not be that simple after all.
> -- if you are willing to spend a little bit of learning effort, it is easily possible to run millions of non-trivial transactions per second on a single server,
I got into programming through the Private Server (gaming) scene. You learn that the more you optimize and refactor your code to be more efficient, the more you can handle on less hardware, including embedded systems. So yeah, it's amazing how much is wasted. I'm kind of holding out hope that things like Rust and Go focus on letting you get more out of less hardware.
Go is slower because it optimizes for compile time.
Rust is slower (usually) because Rust does not revolve around making the most use of the hardware it has.
both are fine choices; nothing wrong with either direction.
Zig performs very well by default because it was designed to be efficient and fast from the start, without compromise. It has memory safety, too, but in a way that few seem to understand, myself included, so it's difficult for me to describe with my rudimentary understanding.
Maybe if you use no_std. I evaluated a bunch of Rust libraries for some server software, but I could not use any of them because they pervasively assume that it is OK to make syscalls to allocate memory. If you'd like to write software that makes few syscalls in the steady state, you can do it in Rust, but you can't use the libraries. Or String or Vec, I guess.
Why is memory allocation via syscalls bad? I get it for embedded (which was mentioned above so perhaps that's what this targets), but I kind of assumed malloc was a syscall underneath on an actual OS and that was fine.
Actually, you don't need to call any functions to stall the CPU. For example, at the very lowest level, even accessing a memory page that is present in physical memory but has no entry in the TLB stalls the CPU while the mapping is looked up in the page tables (and on architectures with software-managed TLBs, it actually traps into the OS to fill the mapping).
This is my experience too. Millions of persisted and serialized TPS on regular NVMe with ms latency. Though it took a bit of effort to get to these numbers.
Curious as to how many days of data you have in your cluster. It seems like it could be ~1/2 billion records per day, 125 billion per year-ish. In a few years your 3 node Mongo cluster would be getting towards volumes I associate with a 'big data' kind of solution like BigTable.
I'm not "we" but I have some experience in this area.
Computers are fast, basically. ACID transactions can be slow (if they write to "the" disk before returning success), but just processing data is alarmingly speedy.
If you break down things into small operations and you aggregate by day, you can always have big numbers. The monitoring system that I wrote for Google Fiber ran on one machine and processed 40 billion log lines per day, with only a few seconds of latency from upload start -> dashboard/alert status updated. (We even wrote to Spanner once-per-upload to store state between uploads, and this didn't even register as an increase in load to them. Multiple hundred thousand globally-consistent transactional writes per minute without breaking a sweat. Good database!)
apenwarr wrote a pretty detailed look into the system here: https://apenwarr.ca/log/20190216 And like him, I miss having it every day.
I have a plan to write a book on how to write reactive applications like that. Mostly a collection of observations, tips, tricks, patterns for reactive composition, some very MongoDB-specific solutions, etc.
Not sure how many people would be interested. Reactor has quite a steep learning curve and also very little literature on how to use it for anything non-trivial.
The aim is not just to enable good throughput, but also to achieve this without compromising on clarity of implementation. Which is where I think reactive, and specifically ReactiveX/Reactor, shines.
I'm interested in getting your book published. I've had a career in publishing and specialist media, a lot of it spent on problems related to your subject. I'm semi-retired and have risk capital to get the right distribution while maintaining well-above-industry-standard terms. Email in profile.
Thanks. I will try to self publish. I want to keep freedom over content and target and I am not looking for acclaim for having my name on a book from a well known publisher. I am just hoping to help people solve their problems.
Considering that a CPU can do 3 billion things a second, and a typical laptop can store 16 billion things in memory, it shouldn't take more than 5 of these to handle "billions of messages". I agree with you that modern frameworks are inefficient.
By 16 billion things you mean 16 billion bytes? If you are talking about physical memory, then no, you can't occupy the entire memory. If you are talking about virtual memory, then you can store more data.
Actually, the CPU processes things in words, not bytes. On a 64-bit architecture a word is 64 bits, or 8 bytes.
But there are a lot of things the CPU can do even faster than that, because this limit only relates to actual instruction execution (and even then there are instructions that can process multiple words at a time).