Well... we have a 3-node MongoDB cluster and are processing up to a million trades... per second. And a trade is way more complex than a chat message. It has tens to hundreds of fields, may require enriching with data from multiple external services, and then needs to be stored, be searchable with unknown, arbitrary bitemporal queries, and may need multiple downstream systems to be notified when it is modified, depending on a lot of factors.
All this happens on the aforementioned MongoDB cluster and just two server nodes. And the two server nodes are really only for redundancy, a single node easily fits the load.
What I want to say is:
-- processing a hundred million simple transactions per day is nothing difficult on modern hardware.
-- modern servers have stupendous potential to process transactions which is 99.99% wasted by "modern" application stacks,
-- if you are willing to spend a little bit of learning effort, it is easily possible to run millions of non-trivial transactions per second on a single server,
-- most databases (even one as bad as MongoDB) have the potential to handle much more load than people think they can. You just need to kind of understand how the database works and what its strengths are, and play into them rather than against them.
And if you think we are running Rust on bare metal and some super large servers -- you would be wrong. It is a normal Java reactive application running on OpenJDK on an 8-core server with a couple hundred GB of memory. And the last time I needed to look at the profiler was about a year ago.
A million sounds impressive, but this is clearly not serialized throughput based on other comments here. Getting a million of anything to fast NVMe is trivial if there is no contention and you are a little clever with IO.
I have written experimental datastores that can hit in excess of 2 million writes per second on a samsung 980 pro. 1k object size, fully serialized throughput (~2 gigabytes/second, saturates the disk). I still struggle to find problem domains this kind of perf can't deal with.
If you just care about going fast, use 1 computer and batch everything before you try to put it to disk. Doesn't matter what fancy branding is on it. Just need to play by some basic rules.
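To make the "batch everything before you try to put it to disk" point concrete, here is a minimal group-commit sketch in Java (my own illustration rather than the commenter's code; the class name, queue size and batch size are invented): many producer threads enqueue records, one writer thread drains whatever has accumulated, writes the whole batch as a single buffer, and pays the fsync once per batch.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

/** Group-commit writer: many producers enqueue records, one thread
 *  batches them and pays the write/fsync cost once per batch. */
final class BatchedLogWriter implements Runnable {
    private final BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(65_536);
    private final FileChannel channel;

    BatchedLogWriter(Path file) throws IOException {
        this.channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
    }

    /** Producers call this; it blocks only if the queue is full (back-pressure). */
    void append(byte[] record) throws InterruptedException {
        queue.put(record);
    }

    @Override
    public void run() {
        List<byte[]> batch = new ArrayList<>(8_192);
        try {
            while (!Thread.currentThread().isInterrupted()) {
                // Wait for at least one record, then grab whatever else is queued.
                byte[] first = queue.poll(10, TimeUnit.MILLISECONDS);
                if (first == null) continue;
                batch.add(first);
                queue.drainTo(batch, 8_191);

                int size = batch.stream().mapToInt(r -> r.length).sum();
                ByteBuffer buf = ByteBuffer.allocate(size);
                batch.forEach(buf::put);
                buf.flip();
                while (buf.hasRemaining()) channel.write(buf);
                channel.force(false);   // one fsync amortized over the whole batch
                batch.clear();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

The throughput comes almost entirely from amortizing the write() and force() calls; the only real tunable is how long you let records sit in the queue, which is your latency budget.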
Primary advantage with 1 computer is that you can much more easily enforce a total global ordering of events (serialization) without resorting to round trip or PTP error bound delays.
Trading is ideal for this for multiple reasons; one is that total global ordering is a key feature (and requirement) of the domain, so the "one big fast server" approach is a good fit. It is also quite widely known that several of the big exchanges operate this model: a single sequencer application that then uses multicast to transmit the outcomes of what it sees.
The other thing that is helping a lot here compared to Discord: Trading is very neatly organized in trading days and shuts down for hours between each trading day. So you don't have the issue that Discord had where some channels have low message volumes and others have high, leading to having scattered data all over the place. You can naturally partition data by day and you know at query time which data you want to have.
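A tiny illustration of that natural partitioning (purely hypothetical names, not from the original comments): if the partition key is the trading day, both the write side and the query side always know exactly which compact partition they are touching.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

/** Hypothetical routing of trades into per-trading-day partitions (here just a
 *  collection/table name), so a query for a known day hits one compact partition. */
final class DayPartitioner {
    private static final DateTimeFormatter DAY = DateTimeFormatter.BASIC_ISO_DATE;

    static String partitionFor(LocalDate tradingDay) {
        return "trades_" + DAY.format(tradingDay);   // e.g. "trades_20240517"
    }
}
```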
The decentralized Solana network is currently capable of over 50k TPS 24/7 in the live Beta, with a target transaction finality speed of 30ms round trip from your browser/device. Their unofficial motto is "DeFi at Nasdaq Speed." Solana is nascent and will likely reach 10M+ TPS and 10ms finality within a couple of years' time.
Directed Acyclic Graph based networks (e.g. Hashgraph; these are not technically blockchains) can reach effectively infinite TPS but suffer in time to finality.
Solana is a blockchain with zero downtime (and a Turing-complete smart-contract chain), mind you -- not a centralized exchange.
You can either handle the load or you can't, right? A built-in maintenance window is super nice, but servers crash all the time. So either that's a problem, or you've got a system in place. And if you can handle failover, you get free maintenance windows anyway, so it doesn't seem any more difficult?
It is wise if you mean "be ready for servers to crash at any time by thinking they are going to crash at the worst possible moment".
But it is stupid, because people think they need massive parallel deployments just because servers will be constantly crashing, and that is just not true. The cost they pay is having a couple of times more nodes than they would really need if they got their focus right (making the application efficient first, scalable later).
The reality is, servers do not crash. At least not the kind of hardware I am working on.
I was responsible for keeping communication with a stock exchange running for about 3 years in one of my past jobs, and during that time we didn't lose a single packet.
And aside from some massively parallel workloads that used tens of thousands of nodes, and aside from the one time my server room boiled over due to a failed AC (and no environmental monitoring), I have not had a server crash on me in the past 20 years.
So you can reasonably assume that your servers will be functioning properly (if you bought quality), and that helps a lot at the design stage.
This is the regime we operate in as well. For our business, a failure, while really bad, is not catastrophic (we still maintain non-repudiation). We look at it like any other risk model in the market.
For many in our industry, the cost of not engineering this way and eating the occasional super rare bag of shit is orders of magnitude higher than otherwise tolerable. One well-managed server forged of the highest binned silicon is usually the cheapest and most effective engineering solution over the long term.
Another super important thing to remember is that the main goal of this is to have super simple code and very simple but rock-solid guarantees.
The main benefit is writing application code that is simple, easy to understand and simple to prove it works correctly, enabled by reliable infrastructure.
When you are not focusing on various ridiculous technologies that each require PhDs to understand well, you can focus on your application stack, domain modeling, etc. to make it even more reliable.
> When you are not focusing on various ridiculous technologies that each require PhDs to understand well, you can focus on your application stack, domain modeling, etc. to make it even more reliable.
This is 100% our philosophy. I honestly don't understand why all high-stakes software isn't developed in the same way that we build these trading/data systems.
I think this is the boundary between "engineering" and "art". In my experience, there are a lot of developers who feel like what they do is not engineering because they believe it to be so subjective and open to interpretation. Perhaps there is a mentality that it can't ever be perfect or 100% correct, so why even try to uphold such a standard? It is certainly more entertaining to consume new shiny technology than to sit down with business owners in boring meetings for hours every week...
In reality, you can build software like you build nuclear reactors. It is all a game of complexity management and discipline. Surprisingly, it usually costs less when you do it this way, especially after accounting for the total lifecycle of the product/service. If you can actually build a "perfect" piece of software, you can eliminate entire parts of the org chart. How many developer hours are spent every day at your average SV firm fighting bugs and other regressions? What if you could take this to a number approximating zero?
The classical retort I hear from developers when I pose comments like these is "Well, the business still isn't sure exactly what the app should do or look like". My response to that is "Then why are you spinning up Kubernetes clusters when you should be drawing wireframes and schema designs for the customer to review?"
Every time I write something like "Yes, you really can write a reliable application. No, if it breaks you can't blame everybody and the universe around you. You made a mistake and you need to figure out how it happened and how to prevent it from happening in the future." I just get downvoted to hell.
I suspect in large part it is because when people fail at something they feel a need to find some external explanation of it. And it is all too easy when "business" actually is part of the problem.
The best people I worked with, let's just say I never heard them blaming business for their bugs. They own it, they solve it and they learn from it.
What I am not seeing is people actually taking a hard look at what they have done and how they could have avoided the problems.
For example, the single biggest cause of failed projects I have seen, by far, is unnecessary complication stemming from easily avoidable technical debt.
Easily avoidable technical debt is something that could reasonably have been predicted at an early stage and solved by just making better decisions. Maybe don't split your application into 30 services and then run it on Kubernetes? Maybe, rather than separate services, pay attention to having proper modules and APIs within your application, and your application will fit on a couple of servers? Maybe having function calls rather than a cascade of internal network hops is a cheap way to get good performance, rather than (ignoring Amdahl's law and) trying to incorporate some exotic database that nobody knows and everybody will have to start learning from scratch?
Then people rewrite these projects and, rather than understanding what caused the previous version to fail, just repeat the same process, only with a new application stack.
We know because that communication happens over UDP and each packet carries an application-layer sequence number. It is used on the receiving side to rebuild the sequence of packets (events from the exchange can only be understood correctly when processed in the same order as they were generated, and only if you have the complete stream -- you can't process a packet until you have processed the one preceding it). It is trivial to detect whether we have missed a packet.
We had a special alert for a missing packet. To my knowledge that has never activated in production except for exchange-mandated tests (the exchange runs tests regularly to ensure every brokerage house can handle faults in communications, maximum loads, etc.)
If a packet is missed, the same data is replicated on another, independent link through another carrier.
And if this wasn't enough, if your system is down (which shouldn't happen during trading) you can contact the exchange's TCP service and request the missing sequences. But that never happened, either.
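As a rough sketch of the receive side described here (not their actual framework; class and method names are invented): packets carry an application-level sequence number, are released strictly in order, and a gap is noticed the moment a later packet arrives, at which point the alert/arbitration/replay machinery would kick in.

```java
import java.util.TreeMap;

/** Sketch of the receive-side reassembly described above: packets carry an
 *  application-level sequence number, are processed strictly in order, and a
 *  gap is detected as soon as a later packet arrives.  Names are invented. */
final class SequencedFeed {
    private long nextExpected = 1;                       // first expected sequence on this feed
    private final TreeMap<Long, byte[]> pending = new TreeMap<>();

    void onPacket(long seq, byte[] payload) {
        if (seq < nextExpected) return;                  // already processed (e.g. duplicate from the B-feed)
        pending.put(seq, payload);
        // Release the contiguous run starting at nextExpected, if there is one.
        while (pending.containsKey(nextExpected)) {
            process(pending.remove(nextExpected));
            nextExpected++;
        }
        if (!pending.isEmpty() && pending.firstKey() > nextExpected) {
            onGapDetected(nextExpected, pending.firstKey() - 1);
        }
    }

    private void process(byte[] payload) { /* hand off to the pipeline, in order */ }

    private void onGapDetected(long fromSeq, long toSeq) {
        // Raise the alert; in the setup described, the redundant carrier link usually
        // fills the hole, and the exchange's TCP replay service is the last resort.
    }
}
```

In an A/B-feed setup the same object would be fed from both links, so the duplicate-drop at the top doubles as arbitration between the two carriers.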
As we really liked this pattern, we built a small framework and used it for internal communication as well, including data flowing to traders' workstations.
Mind you, neither the carrier links, the networking devices, nor the people who maintain them are cheap.
In regular trading (not crypto; see the other comment about the volume differences) it is common, for example, to tune Java to run GC outside of trading hours. That works if you don't allocate new heap memory for every transaction/message but instead only use the stack plus pre-allocated pools.
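A common way to get to "no heap allocation per message" in Java is a pre-allocated object pool; the sketch below is generic (class and field names are made up, not taken from any specific trading system):

```java
import java.util.ArrayDeque;

/** Illustrative only: a pre-allocated pool of mutable message objects so the
 *  hot path allocates nothing on the heap and the GC has nothing to collect
 *  during trading hours.  Single-threaded by design (one pool per pipeline stage). */
final class OrderEventPool {
    static final class OrderEvent {
        long orderId;
        long price;      // fixed-point, e.g. price * 10_000
        int  quantity;
        void clear() { orderId = 0; price = 0; quantity = 0; }
    }

    private final ArrayDeque<OrderEvent> free = new ArrayDeque<>();

    OrderEventPool(int size) {
        for (int i = 0; i < size; i++) free.push(new OrderEvent()); // allocate everything up front
    }

    OrderEvent acquire() {
        OrderEvent e = free.poll();
        return e != null ? e : new OrderEvent(); // pool exhausted: fall back (and ideally alert)
    }

    void release(OrderEvent e) {
        e.clear();
        free.push(e);
    }
}
```

With every message borrowed from and returned to such a pool, the young generation stays quiet during the trading day and a full GC can be deferred to the maintenance window.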
That's not really what this article is about. Their problem wasn't throughput. What's the size of all the data in your MongoDB instance? And what's the latency on your reads?
In the big data world the "complexity" of the data doesn't really mean much. It's just bytes.
> What's the size of all the data in your MongoDB instance?
3x12TB
> In the big data world the "complexity" of the data doesn't really mean much.
Oh how wrong you are.
It is much easier to deal with data when the only thing you need to do is to just move it from A to B. Like "find who should see this message, make sure they see it".
It is much different when you have a large, rich domain model that runs tens of thousands of business rules on incoming data, and each entity can have very different processing depending on its state and the event that came in.
I am writing whole applications just to data-mine our processing flow just to be able to understand a little bit of what is happening there.
At that traffic you can't even log anything for each of the transactions. You have to work indirectly through various metrics, etc.
Nice, that's a pretty decent size; still curious about the latency. That's the primary problem for a real-time chat app.
Complexity of data and running business rules on it is not a data store problem though, that's a compute problem. It's highly parallelizable and compute is cheap.
For reference, my team runs transformations on about 1 PB of (uncompressed) data per day with 3 spark clusters, each with 50 nodes. We've got about 70ish PB of (compressed) data queryable. All our challenges come from storage, not compute.
In order to be able to run so much stuff on MongoDB, we almost never run single queries to the database. If I fetch or insert trade data, I probably run a query for 10 thousand trades at the same time.
So what happens is, as data comes from multiple directions it is batched (for example 1-10 thousand records at a time), split into groups that can be processed with a roughly similar process, and then travels the pipeline as a single batch, which is super important as it allows amortizing some of the costs.
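A minimal Project Reactor sketch of that application-level batching (my own example; the batch size, timeout and placeholder fetch are invented, and the real system would issue something like a single $in query per batch):

```java
import java.time.Duration;
import java.util.List;
import reactor.core.publisher.Flux;

/** Application-level micro-batching: individual trade ids stream in, get grouped
 *  into batches of up to 10_000 (or whatever arrived within 50 ms), and each batch
 *  becomes one bulk query instead of thousands of point lookups. */
final class BatchedLoader {

    Flux<Trade> loadTrades(Flux<Long> tradeIds) {
        return tradeIds
                .bufferTimeout(10_000, Duration.ofMillis(50)) // batch size vs. latency trade-off
                .flatMap(this::fetchBatch, 4);                // keep a few batches in flight
    }

    // One round trip per batch; in a real system this would be a bulk database query.
    private Flux<Trade> fetchBatch(List<Long> ids) {
        return Flux.fromIterable(ids).map(Trade::new);        // placeholder result
    }

    record Trade(long id) {}
}
```

bufferTimeout is exactly the throughput-versus-latency dial mentioned below: bigger batches amortize more cost, while the timeout caps how long any single item can sit waiting.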
Also the processing pipeline has many, many steps in it. A lot of them have buffers in between so that steps don't get starved for data.
All this causes latency. I try to keep it subsecond but it is a tradeoff between throughput and latency.
It could have been implemented better, but the implementation would be complex and inflexible. I think having clear, readable, flexible implementation is worth a little bit of tradeoff in latency.
As to storage being the source of most woes, I fully agree. In our case it is trying to deal with the bloat of data caused by the business wanting to add this or that. All this data makes database caches less effective, requires more network throughput and more CPU for parsing/serializing, needs to be replicated, etc. So half the effort is constantly trying to figure out why they want to add this or that, and whether it is really necessary or can be avoided somehow.
I thought you were doing millions of QPS with a 3-node MongoDB cluster, from the top-level comment. That would be impressive.
By batching 1-10 thousand records at a time, your use case is very different from Discord's, which needs to deliver individual messages as fast as possible.
Data doesn't come in or leave batched. This is just an internal mechanism.
Think in terms of Discord: their database probably already queues and batches writes. Or maybe they could decide to fetch details of multiple users with a single query by noticing there are 10k concurrent requests for user details. So why have 10k queries when you could have 10 queries for 1k user objects each?
If you complain that my process is different because I refuse to run it inefficiently when I can spot an occasion to optimize then yes, it is different.
Of course, Cassandra/MongoDB/etc. can perform their own batching when writing to the commit log, and can also benefit from write combining by not flushing the dirty data immediately. That's beside the point.
Your use case allows you to perform batching for writes at the *application layer*, while discord's use case doesn't.
Why couldn't others with lots of traffic use a similar approach? I assume they do. Seems like a pretty genius idea to batch things like that; especially when QPS is very high, batching (maybe waiting a few ms to fill a batch) makes a lot of sense.
I don't see why Discord's case can't use the same tricks. If they have a lot of stuff happening at the same time and their application is relatively simple (in terms of the number of different types of operations it performs), at any point in time it is bound to have many instances of the same operation being performed.
Then it is just a case of structuring your application properly.
Most applications are immediately broken, by design, by having a thread dedicated to each request/response pair. It then becomes difficult to have pieces of processing from different threads be selected and processed together to take advantage of amortized costs.
The alternative I am using is funneling all requests into a single pipeline and having that pipeline split into stages distributed over CPU cores. So it comes in (by way of Kafka or REST call, etc.), it is queued, it goes to CPU core #1, gets some processing there, then moves to CPU core #2, gets some other processing there, gets published to CPU core #3 and so on.
Now, each of these components can work on a huge number of tasks at the same time. For example, when the step is to enrich the data, it might be necessary to shoot a message to another REST service and wait for the response. During that time the component picks up other items and does the same for them.
As you see, this architecture practically begs to use batching and amortize costs.
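Here is roughly what that shape looks like in Project Reactor, which the commenter says elsewhere in the thread is what they use (the stage names, scheduler sizes and placeholder bodies are all invented for illustration):

```java
import java.time.Duration;
import java.util.List;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;
import reactor.core.scheduler.Scheduler;
import reactor.core.scheduler.Schedulers;

/** Sketch of "single pipeline, stages spread over cores": everything is funneled
 *  into one Flux, each stage runs on its own scheduler, and the final stage
 *  batches before touching the database. */
final class TradePipeline {
    private final Scheduler parseStage  = Schedulers.newSingle("parse");
    private final Scheduler enrichStage = Schedulers.newParallel("enrich", 4);
    private final Scheduler storeStage  = Schedulers.newSingle("store");

    Flux<String> run(Flux<byte[]> incoming) {
        return incoming
                .publishOn(parseStage)
                .map(this::parse)                       // stage 1: decode on its own core
                .publishOn(enrichStage)
                .flatMap(this::enrich, 256)             // stage 2: up to 256 enrichments in flight
                .publishOn(storeStage)
                .bufferTimeout(5_000, Duration.ofMillis(20))
                .flatMap(this::storeBatch);             // stage 3: amortized, batched writes
    }

    private String parse(byte[] raw)      { return new String(raw); }       // placeholder decode
    private Mono<String> enrich(String t) { return Mono.just(t); }          // would call a REST service
    private Flux<String> storeBatch(List<String> batch) { return Flux.fromIterable(batch); }
}
```

Because every stage sees a stream rather than a single request, batching (the bufferTimeout before the store stage) and overlapping slow enrichment calls fall out naturally, which is the "this architecture practically begs for batching" point.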
What you're describing sounds like vanilla async concurrency. I seriously doubt 'most applications' use the one-thread-per-request model at this point in time; most major frameworks are async now. And it's not a silver bullet either; there are plenty of articles on how single-threaded designs are sometimes a better fit for extremely high-performance apps.
After reading all of your responses, I still don't see how you think your learnings apply to Discord. They would not be able to fit the indexes in memory on MongoDB. They can't batch reads or writes at the application-server level (the latency cost for messaging is not acceptable). Millions of queries happen every second, not one-off analytical workloads. It seems these two systems are far enough apart that there is really no meaningful comparison to be made here.
Well, on one hand you've got engineers at a billion-dollar company explaining how they've solved a problem. On the other hand you've got some random commenter on HN over-simplifying a complex engineering solution.
I think you're reading into it. They are stating that the solution in the post was overengineered, and describing an alternate solution that doesn't require as much abstraction or resources, but is manageable for data with a much higher dimensional structure
The fact that you read that as "I am very smart" and that that was a reason to downvote the post, tells more about you than it does the person you're supposedly describing.
As an example, there are bitemporal queries like "for the given population of trades specified by the following rules, find the set of trades that met the rules at a particular point in time, based on our knowledge at another given point in time". Also, trades are versioned (they are a stream of business events from the trading system) and then have amendments (each event may be amended in the future but the older version must be preserved). Our system can also amend the data (for example to add some additional data to a trade later). All this causes trades to be a tree of immutable versions you need to comb through. A trade can have anywhere from 1 to 30k versions.
This takes about 20 seconds. The process opens about 200 connections to the cluster and transfers data at about 2-4GB/s.
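A toy model of that kind of bitemporal lookup, just to pin down the semantics (field names and the in-memory representation are my own; the real thing runs as queries against the versioned trade store): among the versions we had recorded by the knowledge time, pick the latest one that was effective at the valid time.

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Each trade is a set of immutable versions, stamped with when it became effective
 *  in the business sense (validFrom) and when we learned about it (recordedAt). */
record TradeVersion(long tradeId, int version, Instant validFrom, Instant recordedAt) {}

final class BitemporalQuery {
    /** "What did this trade look like at valid time V, based on what we knew at time K?" */
    static Optional<TradeVersion> asOf(List<TradeVersion> versions,
                                       Instant validTime, Instant knowledgeTime) {
        return versions.stream()
                .filter(v -> !v.recordedAt().isAfter(knowledgeTime)) // only what we knew then
                .filter(v -> !v.validFrom().isAfter(validTime))      // only what was effective then
                .max(Comparator.comparing(TradeVersion::validFrom)   // latest effective version...
                        .thenComparing(TradeVersion::recordedAt));   // ...as last amended by then
    }
}
```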
Are you not sure that financial data "with hundreds of fields" is more complex than chat data, which has relatively linear threading and only a handful of fields?
I'm asking about how your system scales to the number of queries, but you seem to be taking every question personally. You seem to really want to make sure everyone knows that you think Discord's problems are easy to solve. I'm not saying Discord is more complicated, but you're not really giving enough information to prove that Discord's problems are a subset of yours.
Do you support more simultaneous queries than Discord?
These days they added an extra thread_id field, sure. But the data itself is blisteringly uncomplex and there is only a single way to display it (ordered by time, i.e. the 'thread').
Just recently Symphony switched the order in which I received two messages from a colleague. This completely changed the meaning of the message, and we got into an argument that was only resolved after I posted a screenshot of what he wrote.
It seems the threading might not be that simple after all.
> -- if you are willing to spend a little bit of learning effort, it is easily possible to run millions of non-trivial transactions per second on a single server,
I got into programming through the Private Server (gaming) scene. You learn that the more you optimize and refactor your code to be more efficient, the more you can handle on less hardware, including embedded systems. So yeah, it's amazing how much is wasted. I'm kind of holding out hope that things like Rust and Go focus on letting you get more out of less hardware.
Go is slower because it optimizes for compile time.
Rust is slower (usually) because Rust does not revolve around making the most use of the hardware it has.
both are fine choices; nothing wrong with either direction.
Zig performs very well by default because it was designed to be efficient and fast from the start, without compromise. It has memory safety, too, but in a way that few seem to understand, myself included, so it's difficult for me to describe with my rudimentary understanding.
Maybe if you use no_std. I evaluated a bunch of Rust libraries for some server software, but I could not use any of them because they pervasively assume that it is OK to make syscalls to allocate memory. If you'd like to write software that makes few syscalls in the steady state, you can do it in Rust, but you can't use the libraries. Or String or Vec, I guess.
Why is memory allocation via syscalls bad? I get it for embedded (which was mentioned above so perhaps that's what this targets), but I kind of assumed malloc was a syscall underneath on an actual OS and that was fine.
Actually, you don't need to call any functions to stall the CPU. For example, at the very lowest level, even accessing a memory page that is present in physical memory but has no entry in the TLB stalls the CPU while the mapping is looked up in the page tables (and on architectures with software-managed TLBs, it actually traps into the OS to fill the mapping).
This is my experience too. Millions of persisted and serialized TPS on regular NVMe with ms latency. Though it took a bit of effort to get to these numbers.
Curious as to how many days of data you have in your cluster. It seems like it could be ~1/2 billion records per day, 125 billion per year-ish. In a few years your 3 node Mongo cluster would be getting towards volumes I associate with a 'big data' kind of solution like BigTable.
I'm not "we" but I have some experience in this area.
Computers are fast, basically. ACID transactions can be slow (if they write to "the" disk before returning success), but just processing data is alarmingly speedy.
If you break down things into small operations and you aggregate by day, you can always have big numbers. The monitoring system that I wrote for Google Fiber ran on one machine and processed 40 billion log lines per day, with only a few seconds of latency from upload start -> dashboard/alert status updated. (We even wrote to Spanner once-per-upload to store state between uploads, and this didn't even register as an increase in load to them. Multiple hundred thousand globally-consistent transactional writes per minute without breaking a sweat. Good database!)
apenwarr wrote a pretty detailed look into the system here: https://apenwarr.ca/log/20190216 And like him, I miss having it every day.
I have a plan to write a book on how to write reactive applications like that. Mostly a collection of observations, tips, tricks, patterns for reactive composition, some very MongoDB-specific solutions, etc.
Not sure how many people would be interested. Reactor has quite a steep learning curve and also very little literature on how to use it for anything non-trivial.
The aim is not just to enable good throughput, but also to achieve this without compromising on clarity of implementation. Which is where I think reactive, and specifically ReactiveX/Reactor, shines.
I'm interested in getting your book published. I've had a career in publishing and specialist media, a lot of it spent on problems related to your subject. I'm semi-retired and have risk capital to get the right distribution while maintaining well-above-industry-standard terms. Email in profile.
Thanks. I will try to self publish. I want to keep freedom over content and target and I am not looking for acclaim for having my name on a book from a well known publisher. I am just hoping to help people solve their problems.
Considering that a CPU can do 3 billion things a second, and a typical laptop can store 16 billion things in memory, it shouldn't take more than 5 of these to handle "billions of messages". I agree with you that modern frameworks are inefficient.
By 16 billion things you mean 16 billion bytes? If you are talking about physical memory, then no, you can't occupy the entire memory. If you are talking about virtual memory, then you can store more data.
Actually, the CPU processes things in words, not bytes. On a 64-bit architecture a word is 64 bits, or 8 bytes.
But there are a lot of things the CPU can do even faster than that, because this limit only relates to actual instruction execution (and even then there are instructions that can process multiple words at a time).