*You won't have any trouble running either of these on embedded devices (which I...

vidarh · on June 26, 2014

The biggest limiting factor even in our relatively low-density populated rack is heat and power. With off the shelf servers and relatively low density, I can trivially exceed the highest power allocations our colo provider will normally allow per rack. The more power you waste on inefficient CPU usage, the less you can devote to putting more drives in.

nl · on June 26, 2014

The OP's claim is that memory is the limiting factor in the case of Java. I don't entirely agree, but even if I did it would almost certainly be a fixed overhead per machine, and unlikely to be a problem on server class machines.

Also, the read/processing characteristics of compute nodes often means the CPU is underutilized while filesystem operations are ongoing.

srean · on June 26, 2014

I will leave with an elliptical meta-comment, for those whose competitive advantage lies in others not getting it right, have little interest in correcting misconceptions. You might have interest in this anecdote https://news.ycombinator.com/item?id=7948170

nl · on June 26, 2014

But how much of that is Java, and how much is Hadoop?

Spark runs on the JVM, and much, much faster than Hadoop on similar workloads (Yes, I understand it isn't just doing Map/Reduce, but the point is that Java doesn't seem to be a performance limitation in itself).

srean · on June 26, 2014

Indeed and as I said it did surprise me that Hadoop was so much slower. But the buck really stops at resources consumed per dollar of usable results produced, and in that Java is going to consume a whole lot more. At large scales, running costs far exceeds development costs. BTW my point was not only about Java but also about your assessment of the hardware.

delian66 · on June 26, 2014

CPU and memory resources spend on an inefficient filesystem implementation are just wasted resources, not available for your workload. Keep in mind that the inefficiencies are multiplied over all your cluster nodes.

wyager · on June 26, 2014

There are many use cases for a DFS. Some of those cases involve relatively low-resource nodes.

nl · on June 26, 2014

I agree. But that's a difference use-case to what HDFS is designed for.