It's "trivial" to build a distributed, petabyte scale filesystem.

It's hard to build a cost-effective, reliable and fast distributed, petabyte scale filesystem that's suitable for a wide range of workloads.

Consider that you need to minimise the number of copies of data to keep costs reasonable, yet the fewer copies you keep, the lower your IO capacity for accessing that data (since readers and writers will contend for the IO capacity of a small number of storage nodes), so you want to maximise the number of copies to maximise throughput. Yet the more copies you maintain, the more IO it takes to spread each write out through your storage network. Soon enough you start running into "fun" problems, such as not being able to naively push writes from the writer to every storage server they're meant to land on for data that needs to be replicated widely, because you'll be bandwidth constrained; instead you need a fan-out even for simple writes.
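To make the fan-out point concrete, here's a minimal Python sketch (my own illustration, not from any particular system): with replication factor R, a naive writer ships R full copies itself, while a fan-out tree with branching factor B caps any single node's outbound copies at B.

    # Sketch: build a forwarding plan for a simple fan-out tree.
    # replicas[0] is the node the client writes to; everyone else
    # receives the write from a parent instead of from the client.
    def fanout_plan(replicas, branching=2):
        plan = {r: [] for r in replicas}
        for i, r in enumerate(replicas):
            children = replicas[i * branching + 1 : i * branching + 1 + branching]
            plan[r].extend(children)
        return plan

    # With R = 7 replicas, the naive approach costs the writer 7 outbound
    # copies; in the tree below no single node sends more than 2.
    print(fanout_plan([f"node{i}" for i in range(7)], branching=2))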

You'll also want to minimise operational headaches; a disk going dead or an entire server failing needs to be handled transparently, as every additional disk or server you add increases the odds of a failure in any given unit of time.
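Back-of-the-envelope (my numbers, assuming a ~2% annualized failure rate per disk, which is in the ballpark for spinning disks): expected failures scale linearly with fleet size, so "a disk died" quickly becomes a weekly event.

    # Expected disk failures per year for a given fleet size and AFR.
    def expected_failures_per_year(num_disks, afr=0.02):
        return num_disks * afr

    for disks in (100, 1_000, 10_000):
        print(disks, "disks ->", expected_failures_per_year(disks), "failures/year")
    # 10,000 disks at 2% AFR is ~200 failures/year, i.e. several per week,
    # which is why replacement has to be routine and transparent.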

(Compare with the naive approach for just a 1PB system: I can "easily" get about 200TB per off-the-shelf storage server with hardware RAID. Let's say 150TB usable space; get about 14 of them so you can replicate everything across two servers, and put GlusterFS on it. It'll work. It'll also be expensive, horribly slow for a number of workloads, and a regular disk replacement nightmare.)
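For what it's worth, the sizing in that parenthetical checks out roughly like this:

    # 14 servers x ~150 TB usable each, replicated across two servers.
    servers = 14
    usable_per_server_tb = 150
    replication_factor = 2

    total_usable_tb = servers * usable_per_server_tb / replication_factor
    print(total_usable_tb, "TB usable")  # 1050 TB, i.e. just over 1 PB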




If you need to minimize the number of copies then yes, you need to have some "risk management" software to estimate which machines are more reliable and which files are more important, and then assign those files enough replicas to be able to statistically guarantee some SLA. Then you need failover so that at least one of the replicas is always available.
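A minimal sketch of how reliability estimates turn into replica counts, assuming independent node failures (a strong assumption in practice) and a simple availability SLA:

    # Smallest r such that 1 - (1 - node_availability)**r >= sla.
    def replicas_for_sla(node_availability, sla):
        r = 1
        while 1 - (1 - node_availability) ** r < sla:
            r += 1
        return r

    # Nodes that are each up 99% of the time need 3 replicas for five nines.
    print(replicas_for_sla(0.99, 0.99999))  # -> 3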

The routing table should be small enough to fit in RAM on every machine, and consulted on every request. It would be updated when failover occurs. The table would consist of general rules, with temporary exceptions for specific partition ranges that are being failed over.
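A hypothetical sketch of the "general rules plus temporary exceptions" idea (class and node names are mine): lookups hit a small in-memory table, and failover installs an exception for the affected partition range that overrides the general rule until re-replication finishes.

    class RoutingTable:
        def __init__(self, default_replicas):
            self.default_replicas = default_replicas  # general rule
            self.exceptions = []                      # [(lo, hi, replicas)]

        def route(self, partition):
            # Temporary exceptions (e.g. ranges mid-failover) win over the rule.
            for lo, hi, replicas in self.exceptions:
                if lo <= partition < hi:
                    return replicas
            return self.default_replicas

        def add_exception(self, lo, hi, replicas):
            # Installed when failover starts; removed once data is re-replicated.
            self.exceptions.append((lo, hi, replicas))
            self.exceptions.sort()

    table = RoutingTable(default_replicas=["node-a", "node-b", "node-c"])
    table.add_exception(1000, 2000, ["node-b", "node-c", "node-d"])  # node-a failing over
    print(table.route(1500))  # exception applies
    print(table.route(10))    # general rule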

You can store indexes in files in a similar way. Just avoid joins and treat it like a graph database: first load documents from the index, then do a map-reduce pass to get the related documents.
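A toy sketch of that "no joins, act like a graph database" pattern (data layout and names are mine, and in-memory dicts stand in for index and document files on the store):

    INDEX = {"tag:storage": ["doc1", "doc2"]}  # index "file": key -> doc ids
    DOCS = {
        "doc1": {"title": "GlusterFS notes", "related": ["doc3"]},
        "doc2": {"title": "Ceph notes", "related": ["doc3", "doc4"]},
        "doc3": {"title": "Erasure coding", "related": []},
        "doc4": {"title": "CRUSH maps", "related": []},
    }

    def query(key):
        docs = [DOCS[d] for d in INDEX.get(key, [])]      # pass 1: load from the index
        related_ids = {r for doc in docs for r in doc["related"]}
        related = [DOCS[r] for r in related_ids]          # pass 2: map over edges, no join
        return docs, related

    print(query("tag:storage"))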

But besides that, I can see how multi-user concurrent access might necessitate eventual-consistency algorithms for each app, but that's it.



