>What would be your language of choice for the next gen, stable distributed file system?
Here's my heavily biased subjective opinion on this entirely hypothetical software:
I think we should do one or both of two things:
A) Do it in very clean, fast, simple C. Put an emphasis on speed and simplicity.
B) Do it in very reliable, secure, simple Haskell. Put an emphasis on correctness and simplicity.
With some effort, the C one could be correct and the Haskell one could be fast.
I mention these two languages because they compile to native code and have very good cross-platform support. You won't have any trouble running either of these on embedded devices (which I can't say for Java or Go. Go has some weird compiler bugs on ARM platforms, and the JVM is frequently too memory intensive for embedded). C has an advantage of allowing the absolute minimal implementation, and Haskell has an advantage of allowing a massively concurrent implementation. Yada yada yada
Of course, it could be that the question is completely irrelevant. Just define a spec for a DFS, and then let different implementations pop up in whatever language is best suited to that implementation's specific details.
>You won't have any trouble running either of these on embedded devices (which I can't say for Java or Go. Go has some weird compiler bugs on ARM platforms, and the JVM is frequently too memory intensive for embedded).
Why is this important in this use-case? If the DFS is being used for data processing then presumably the nodes are reasonably capable machines.
There may well be a different use case for a DFS aimed at embedded and resource-constrained devices. That's not what Google or Hadoop is doing, though.
The biggest limiting factor, even in our relatively low-density rack, is heat and power. With off-the-shelf servers and relatively low density, I can trivially exceed the highest power allocation our colo provider will normally allow per rack. The more power you waste on inefficient CPU usage, the less you can devote to putting more drives in.
The OP's claim is that memory is the limiting factor in the case of Java. I don't entirely agree, but even if I did it would almost certainly be a fixed overhead per machine, and unlikely to be a problem on server-class machines.
Also, the read/processing characteristics of compute nodes often means the CPU is underutilized while filesystem operations are ongoing.
I will leave with an elliptical meta-comment: those whose competitive advantage lies in others not getting it right have little interest in correcting misconceptions. You might be interested in this anecdote: https://news.ycombinator.com/item?id=7948170
But how much of that is Java, and how much is Hadoop?
Spark runs on the JVM, and runs much, much faster than Hadoop on similar workloads (yes, I understand it isn't just doing Map/Reduce, but the point is that Java doesn't seem to be a performance limitation in itself).
Indeed, and as I said, it did surprise me that Hadoop was so much slower. But the buck really stops at resources consumed per dollar of usable results produced, and by that measure Java is going to consume a whole lot more. At large scales, running costs far exceed development costs. BTW, my point was not only about Java but also about your assessment of the hardware.
CPU and memory resources spent on an inefficient filesystem implementation are just wasted resources, not available for your workload. Keep in mind that the inefficiencies are multiplied over all your cluster nodes.
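To put rough, purely illustrative numbers on that (mine, not measured from any real system): an extra 2 GB of heap and half a core of overhead per node works out to 2 TB of RAM and 500 cores across a 1,000-node cluster, all capacity you paid for but can never hand to the actual workload.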
I don't think large-scale distributed file systems written in C are hypothetical. I'm pretty sure this is exactly what MapR has done: replace the Java-based HDFS with a C implementation while retaining the API. GlusterFS, now maintained by Red Hat, is another DFS written in C.
As someone who is currently implementing a next-gen distributed file system, I can highlight one aspect: you have a lot of concurrency and asynchronous processing. Thus you need at least reference counting.
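To make that concrete, here is a minimal sketch (my own illustration with hypothetical names like block_ref/block_unref, not code from any existing DFS) of atomic reference counting in C11: every asynchronous callback that may outlive its caller takes a reference, and whichever context drops the last one frees the block.

    /* Hypothetical sketch: a shared block touched by many concurrent
       I/O callbacks, kept alive by an atomic reference count. */
    #include <stdatomic.h>
    #include <stdlib.h>

    struct block {
        atomic_int refcount;
        size_t len;
        unsigned char *data;
    };

    struct block *block_new(size_t len) {
        struct block *b = malloc(sizeof *b);
        if (!b) return NULL;
        atomic_init(&b->refcount, 1);   /* creator holds the first reference */
        b->len = len;
        b->data = malloc(len);
        if (!b->data) { free(b); return NULL; }
        return b;
    }

    void block_ref(struct block *b) {
        atomic_fetch_add_explicit(&b->refcount, 1, memory_order_relaxed);
    }

    void block_unref(struct block *b) {
        /* acq_rel so the freeing thread sees all writes made by other holders */
        if (atomic_fetch_sub_explicit(&b->refcount, 1, memory_order_acq_rel) == 1) {
            free(b->data);
            free(b);
        }
    }

Each async operation would call block_ref before being queued and block_unref in its completion handler. In Haskell the GC absorbs exactly this bookkeeping, which is part of the control-vs-correctness trade-off discussed above.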
Can you really do Haskell on embedded? I thought that abstracting so far away from memory as a concern made it pretty much a non-starter for the foreseeable future.
Embedded meaning "ARM running an OS", yes. Embedded meaning "OS-less microcontroller", not so much. You'd have to use an embedded programming DSL for that, which isn't really Haskell anymore.
ATS will probably be an interesting best-of-both-worlds third option soon, though from what little I've seen of it, it is currently harder to write code in than either Haskell or C. But once you do put in the work to write your proofs etc., both correctness and speed should fall out naturally.