I think a stream processing engine written in Rust will have better performance, lower latency, more stable services, lower memory footprint, and cost savings. At the same time, ArkFlow is built on DataFusion, which gives ArkFlow a strong open source community behind it.
Rust is rather heavy on its copy/clone imposed semantics making it potentially less suitable for low-latency or large data volume processing workloads. Picking Rust for its performance potential only means that you're going to have a harder time beating other native performance-oriented stream processing engines written in either C or C++, if that is your goal of course.
This logic
> written in Rust will have better performance, lower latency, ..., lower memory footprint
is flawed and is cargo-cult programming unless you say what you are objectively comparing it against and how you intend to achieve those goals. Picking the right™ language just for the sake of these goals won't get you very far.
> Rust is rather heavy on its copy/clone imposed semantics making it potentially less suitable for low-latency or large data volume processing workloads. Picking Rust for its performance potential only means that you're going to have a harder time beating other native performance-oriented stream processing engines written in either C or C++, if that is your goal of course.
There is absolutely nothing in Rust's semantics preventing you from writing high-performance data processing workloads in it, and in fact it's one of the best languages for that purpose. Beyond that, the usual barrier to entry for working on a product like this written in C++ is incredibly high in part because stability and safety are so critical for these products--which is one of the reasons that in practice they are often written in memory safe languages, where C++ is not even an option. Have you worked on any nontrivial Rust data processing product where "copy/clone imposed semantics" somehow prevented you from getting big performance wins? I'd be very curious to hear about this if so.
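To make the copy/clone point concrete, here is a minimal sketch of a processing step written without a single `.clone()`. The `Batch` type and the function names are hypothetical, made up for illustration; they are not taken from ArkFlow or DataFusion.

```rust
/// A hypothetical record batch: a single column of 64-bit values.
pub struct Batch {
    pub values: Vec<i64>,
}

/// Reads the batch through a shared borrow; no element is copied.
pub fn sum(batch: &Batch) -> i64 {
    batch.values.iter().sum()
}

/// Takes ownership and mutates the buffer in place. Moving a `Batch`
/// copies only the Vec's pointer, length and capacity, not its elements.
pub fn double_in_place(mut batch: Batch) -> Batch {
    for v in batch.values.iter_mut() {
        *v *= 2;
    }
    batch
}

/// Returns a borrowed view over part of the batch, again without copying.
pub fn window(batch: &Batch, start: usize, len: usize) -> &[i64] {
    &batch.values[start..start + len]
}
```

Passing a batch through `double_in_place` hands back the same heap allocation it was given, so ownership transfer between pipeline stages does not by itself imply copying data. Whether a real engine ends up cloning in practice depends on its design, not on the language forcing it.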
Stability and safety are the least of the concerns in data processing and database workloads. That's totally not the reason why we saw an increase of these systems during the 90s and early 00s written in Java or similar alternative languages. It was ease of use, a low barrier to entry into the ecosystem, and generally the accessibility of the developer pool. Otherwise, cost is the main driver in infrastructure-type software and the reason why we see many of these systems rewritten in exactly C++. Rust is just another contender here, and it's usually because of the performance and a lot of hype recently, which is fair.
> Stability and safety are the least of the concerns in data processing and database workloads. That's totally not the reason why we saw an increase of these systems during the 90s and early 00s written in Java or similar alternative languages.
not_sure_if_serious.jpg
To be extra clear about it (and to avoid pure snark, that's frowned upon here at HN): that's the kind of software (alongside a lot of general enterprise code) that got rewritten from C++ to Java, not the other way around. The increased safety of Java was absolutely a consideration. Java was the 'Rust' of the mid-to-late 1990s and 2000s, only a whole lot slower and clunkier than the actual Rust of today.
I am serious. C is a simple language but rather complicated to wrap your head around, since it requires familiarity with low-level machine concepts. C++ ditto, with the difference that it is a rather complicated language with rather advanced programming language concepts - something that did not really exist at that time. So the net result was a very high entry barrier, and this was the main reason, and not "safety" as you say, why many people were running away from C and C++ to Java/C#, because those were the only alternatives we had at that time. I don't remember "safety" being mentioned at all during the past 20 years or so up until Rust came out. "Segfaults" were the 90s and 00s "safety" vocabulary but, as I said, it was a skill issue.
The frenzy around "safety" is IMO way overhyped, and when you and OP say that "safety" plays a huge role in data processing and database kernel development, no - it is literally not even 1% of what a developer in that domain spends his time on. C and C++ are still used in those domains full on.
> that's the kind of software (alongside a lot of general enterprise code) that got rewritten from C++ to Java, not the other way around
So you agree that many people were absolutely "running away from C and C++ to Java/C#" but somehow this didn't involve any data processing code, even though arguably the main thing that internally-developed enterprise code does is data processing of some kind? OK, I guess.
> Which C or C++ engines exactly got rewritten to Java?
It's difficult to give names precisely because private enterprise development was involved. But essentially every non-trivial Java project starting from the mid-1990s or so, would've been written in C++ if it had been around in the late 1980s or earlier in the 1990s. It's just not very sensible to suppose that "data processing" as a broad area was somehow exempted from this. And if writing segfault-free code in C/C++ could be dismissed as a mere "skill issue" we wouldn't need Rust either. It's a wrong take today and it was just as wrong back then.
(And yes, Java took significant steps forward in safety, including adding a GC - which means no wild pointers or double-free issues - and converting "null pointer" dereferences into a properly managed failure, with backtraces and all that. Just because the "safety" vocabulary wasn't around back then except for programming-theory experts, doesn't imply that people wouldn't care just as much about a guarantee of code being free from the old segfault errors.)
You're a servant to the business needs, so it's whatever the business needs at that moment. It's a vague answer, probably not appealing to many engineers, but that's what it really is. You're solving problems for your business stakeholder and for your business stakeholder's clients.
In other words, the programming language is usually not the focus of daily development, given that there's always much bigger fish to fry in this domain, but if Rust provides such an undisputed benefit to your business model, while keeping the cost and risk of it viable for the business, then it's going to be a no-brainer. The chances that this is going to be the case are very, very low.
So, my advice would rather be: use whichever language you prefer but don't dwell on it - rather put your focus on innovating workload-specific optimizations that solve real-world issues that are palpable and easily proven/demonstrated. Study the challenges of storage or data processing engines or vectorized query execution algorithms. Depending on the domain problem you're trying to solve, make sure that your language of choice does not get in your way.
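For anyone unfamiliar with what "vectorized query execution" refers to, here is a rough sketch of the idea: evaluate a filter and an aggregate one column batch at a time instead of one row at a time. The schema (`price`, `qty`) and all names are hypothetical, and this is a toy, not how any particular engine implements it.

```rust
/// Row-oriented layout: one struct per event (hypothetical schema).
pub struct Row {
    pub price: f64,
    pub qty: u32,
}

/// Column-oriented layout: one contiguous array per field.
pub struct Columns {
    pub price: Vec<f64>,
    pub qty: Vec<u32>,
}

/// Row-at-a-time evaluation of: SELECT sum(price) WHERE qty > min_qty.
pub fn sum_rows(rows: &[Row], min_qty: u32) -> f64 {
    let mut total = 0.0;
    for r in rows {
        if r.qty > min_qty {
            total += r.price;
        }
    }
    total
}

/// Batch-at-a-time evaluation of the same query.
/// Step 1 scans only the qty column and builds a selection mask.
/// Step 2 scans only the price column and sums the selected values.
pub fn sum_columns(cols: &Columns, min_qty: u32) -> f64 {
    let mask: Vec<bool> = cols.qty.iter().map(|&q| q > min_qty).collect();
    cols.price
        .iter()
        .zip(&mask)
        .map(|(&p, &keep)| if keep { p } else { 0.0 })
        .sum()
}
```

Both functions compute the same result. The column version runs tight loops over contiguous arrays, which is the shape that compilers and CPUs tend to handle well, but whether it is actually faster for a given workload is something to measure rather than assume.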
Why do you have to beat a native performance-oriented streaming engine written in C or C++?
Currently, most of the mainstream stream processing engines are written in Java. Sorry, I may have left out some qualifiers and caused a misunderstanding.
Software has no silver bullets, and neither do programming languages; each has its own strengths. I also like to use Go and Java to develop software.
So if you don't want to beat native engines in performance, what is it that you're trying to solve that Java-based engines don't? I think it's pretty important to set a vision upfront, otherwise you're going to set yourself up for a quick failure.