
Trading is ideal for this for multiple reasons. One is that total global ordering is a key feature (and requirement) of the domain, so this "1 fast big server" approach fits well. It is also quite widely known that several of the big exchanges operate this model: a single sequencer application, which then uses multicast to transmit the outcomes of what it sees.
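A rough sketch of that sequencer pattern, in Java (this is not any exchange's actual code; the addresses, ports, and wire format are invented for illustration):

    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.net.InetAddress;
    import java.nio.ByteBuffer;

    // Toy single-process sequencer: every inbound event is stamped with the
    // next global sequence number, then multicast to all downstream consumers.
    // Total ordering is defined simply by arrival at this one process.
    public class Sequencer {
        public static void main(String[] args) throws Exception {
            DatagramSocket in = new DatagramSocket(9000);            // inbound events (made-up port)
            DatagramSocket out = new DatagramSocket();
            InetAddress group = InetAddress.getByName("239.1.1.1");  // made-up multicast group
            long seq = 0;
            byte[] buf = new byte[1500];
            while (true) {
                DatagramPacket pkt = new DatagramPacket(buf, buf.length);
                in.receive(pkt);
                ByteBuffer msg = ByteBuffer.allocate(8 + pkt.getLength());
                msg.putLong(seq++);                                  // stamp the global order
                msg.put(pkt.getData(), 0, pkt.getLength());
                out.send(new DatagramPacket(msg.array(), msg.position(), group, 9001));
            }
        }
    }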

The other thing that helps a lot here compared to Discord: trading is very neatly organized into trading days and shuts down for hours between them. So you don't have the issue Discord had, where some channels have low message volumes and others high, leaving data scattered all over the place. You can naturally partition data by trading day, and at query time you know exactly which partition you want.
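A minimal sketch of what that natural partitioning can look like (the layout and names here are my own assumption, just to make the idea concrete):

    import java.nio.file.Path;
    import java.time.LocalDate;

    // One self-contained directory per feed per trading day: a query for a
    // given day maps straight to exactly one partition, no scattered data.
    final class DayPartition {
        static Path forDay(Path root, String feed, LocalDate tradingDay) {
            return root.resolve(feed).resolve(tradingDay.toString());  // e.g. /data/marketdata/2024-03-15
        }
    }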



What about cryptocurrency trading, which goes on continuously, 24 hours a day?


Cryptocurrency venues handle extremely low throughput and, with maybe one exception, regularly go down for hours.


The decentralized Solana network is currently capable of over 50k TPS 24/7 in its live beta, with a target transaction finality of 30ms round trip from your browser/device. Its unofficial motto is "DeFi at Nasdaq Speed." Solana is nascent and will likely reach 10M+ TPS and 10ms finality within a couple of years.

Directed Acyclic Graph (DAG) based networks (e.g. Hashgraph, which are not technically blockchains) can reach effectively infinite TPS but suffer on time to finality.

Solana is a blockchain with zero downtime (and a Turing-complete smart contract chain), mind you -- not a centralized exchange.


Solana has 400ms block times, so I don't think it can achieve finality in 30ms, and the whole comment seems a bit off-topic.


Either you can handle the load or you can't, right? A built-in maintenance window is super nice, but servers crash all the time. So either that's a problem, or you've got a system in place. And if you can handle failover, you get free maintenance windows anyway, so it seems no more difficult?


> but servers crash all the time

This is both wise and stupid at the same time.

It is wise if you mean "be ready for servers to crash at any time by thinking they are going to crash at the worst possible moment".

But it is stupid, because people think they need massive parallel deployments just because servers will be constantly crashing, and that is just not true. The cost they pay is having several times more nodes than they would really need if they got their focus right (making the application efficient first, scalable later).

The reality is, servers do not crash. At least not the kind of hardware I am working on.

In one of my past jobs I was responsible for maintaining communication with a stock exchange for about 3 years, and during that time we didn't lose a single packet.

And aside from some massively parallel workloads that used tens of thousands of nodes, and aside from the one time my server room boiled over due to failed AC (and no environmental monitoring), I have not had a server crash on me in the past 20 years.

So you can reasonably assume that your servers will function properly (if you bought quality hardware), and that helps a lot at the design stage.


> The reality is, servers do not crash

This is the regime we operate in as well. For our business, a failure, while really bad, is not catastrophic (we still maintain non-repudiation). We look at it like any other risk model in the market.

For many in our industry, the cost of not engineering this way and eating the occasional super rare bag of shit is orders of magnitude higher than otherwise tolerable. One well-managed server forged of the highest binned silicon is usually the cheapest and most effective engineering solution over the long term.


Yes, that is my experience.

Another super important thing to remember is that the main goal of this is to have super simple code and very simple but rock-solid guarantees.

The main benefit is writing application code that is simple, easy to understand, and easy to prove correct, enabled by reliable infrastructure.

When you are not focusing on various ridiculous technologies that each require PhDs to understand well, you can focus on your application stack, domain modeling, etc. to make it even more reliable.


> When you are not focusing on various ridiculous technologies that each require PhDs to understand well, you can focus on your application stack, domain modeling, etc. to make it even more reliable.

This is 100% our philosophy. I honestly don't understand why all high-stakes software isn't developed in the same way that we build these trading/data systems.

I think this is the boundary between "engineering" and "art". In my experience, a lot of developers feel that what they do is not engineering, because they believe it to be so subjective and open to interpretation. Perhaps there is a mentality that it can't ever be perfect or 100% correct, so why even try to uphold such a standard? It is certainly more entertaining to consume shiny new technology than to sit with business owners in boring meetings for hours every week...

In reality, you can build software like you build nuclear reactors. It is all a game of complexity management and discipline. Surprisingly, it usually costs less when you do it this way, especially after accounting for the total lifecycle of the product/service. If you can actually build a "perfect" piece of software, you can eliminate entire parts of the org chart. How many developer hours are spent every day at your average SV firm fighting bugs and other regressions? What if you could take this to a number approximating zero?

The classic retort I hear from developers when I pose comments like these is "Well, the business still isn't sure exactly what the app should do or look like." My response to that is "Then why are you spinning up Kubernetes clusters when you should be drawing wireframes and schema designs for the customer to review?"


Every time I write something like "Yes, you really can write reliable applications. No, if it breaks you can't blame everybody and the universe around you. You made a mistake, and you need to figure out how it happened and how to prevent it from happening again." I just get downvoted to hell.

I suspect in large part it is because when people fail at something, they feel a need to find some external explanation for it. And that is all too easy when "business" actually is part of the problem.

The best people I worked with, let's just say I never heard them blaming business for their bugs. They own it, they solve it and they learn from it.

What I am not seeing is people actually taking a hard look at what they have done and how they could have avoided the problems.

For example, the single biggest cause of failed projects I have seen, by far, is unnecessary complication stemming from easily avoidable technical debt.

Easily avoidable technical debt is debt that could reasonably have been predicted at an early stage and avoided by just making better decisions. Maybe don't split your application into 30 services and then run it on Kubernetes? Maybe, rather than separate services, pay attention to having proper modules and APIs within your application, and the whole thing will fit on a couple of servers? Maybe plain function calls, rather than a cascade of internal network hops, are a cheap way to get good performance, instead of (ignoring Amdahl's law and) trying to incorporate some exotic database that nobody knows and everyone will have to learn from scratch?

Then people rewrite these projects and, rather than understanding what caused the previous version to fail, just repeat the same process, only with a new application stack.


[flagged]


We know because that communication happens over UDP, and each packet carries an application-layer sequence number. The receiving side uses it to rebuild the sequence of packets (events from the exchange can only be understood correctly when processed in the same order as generated, and only if you have the complete stream -- you can't process a packet until you have processed the one preceding it). This makes it trivial to detect whether we have missed a packet.

We had a special alert for a missing packet. To my knowledge that has never activated in production except for exchange-mandated tests (the exchange runs tests regularly to ensure every brokerage house can handle faults in communications, maximum loads, etc.)

If a packet was missed, the same data is replicated on another, independent link through another carrier.

And if that wasn't enough: if your system goes down (which shouldn't happen during trading), you can contact the exchange's TCP service and request the missing sequences. But that never happened, either.
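A rough sketch of that receive-side logic in Java (the field layout and buffering policy are my assumptions, not the exchange's spec):

    import java.nio.ByteBuffer;
    import java.util.TreeMap;

    // Rebuilds the ordered stream from sequenced UDP packets. Nothing is
    // processed until the stream is contiguous; a hole raises the
    // missing-packet alert, and duplicates from the redundant link are dropped.
    final class SequencedReceiver {
        private long expected = 0;                        // next sequence we can process
        private final TreeMap<Long, byte[]> pending = new TreeMap<>();

        void onPacket(ByteBuffer pkt) {
            long seq = pkt.getLong();
            byte[] body = new byte[pkt.remaining()];
            pkt.get(body);
            if (seq < expected) return;                   // duplicate (e.g. from the second carrier)
            pending.put(seq, body);
            while (pending.containsKey(expected)) {       // drain everything now contiguous
                process(pending.remove(expected));
                expected++;
            }
            if (!pending.isEmpty()) alertGap(expected);   // a hole: alert / request retransmission
        }

        private void process(byte[] event) { /* apply strictly in order */ }
        private void alertGap(long missingSeq) { /* alert + TCP replay fallback */ }
    }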

As we really liked this pattern, we built a small framework around it and used it for internal communication as well, including data flowing to traders' workstations.

Mind you, neither the carrier links, the networking devices, nor the people who maintain them are cheap.


In regular trading (not crypto -- see the other comment about the volume differences) it is common to tune Java, for example, to run GC only outside of trading hours. That works if you don't allocate new heap memory on every transaction/message, and instead use only the stack plus pre-allocated pools.
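A hedged illustration of that allocation style (class and method names invented here): keep the hot path free of 'new', and the collector has nothing to do until the session break.

    import java.util.ArrayDeque;

    // Pre-allocated pool: all objects are created at startup, so message
    // handling during trading hours allocates nothing on the heap.
    final class Order {
        long id;
        long price;        // fixed-point (e.g. price * 10^4) to avoid BigDecimal allocation
        int quantity;
        void reset() { id = 0; price = 0; quantity = 0; }
    }

    final class OrderPool {
        private final ArrayDeque<Order> free = new ArrayDeque<>();
        OrderPool(int size) {
            for (int i = 0; i < size; i++) free.push(new Order());  // allocate up front
        }
        Order acquire() { return free.pop(); }                      // no 'new' on the hot path
        void release(Order o) { o.reset(); free.push(o); }
    }

A full collection can then be triggered deliberately (e.g. via System.gc()) in the overnight window, when nobody is trading.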



