Inside the Magic Pocket (dropbox.com)
115 points by hepha1979 on May 6, 2016 | 29 comments



Hey folks, this is James from Dropbox. As with the last post (https://news.ycombinator.com/item?id=11282948) a few of us will check in from time to time to see if there are any questions.

Apologies for taking so long to get this blog post out! We'll follow up with a few more posts going into more technical details.


First, thanks so much for doing this; I really appreciate it. I only did a quick read of your post, so apologies if it already addresses these comments/questions. I recently built a system for image storage and made some similar but different tradeoffs, so I'm very curious about some of the motivation behind your decisions. It would definitely be cool to go into them in a follow-on post.

I'm curious why you chose what seems like an identical bucket stored on many nodes. I recently built a specialized system that resembles Facebook's Haystack for storing several petabytes of images. I think it is much simpler and more efficient to precreate all the possible 'slabs' on each machine and just place the data in a round-robin fashion over all open slabs. You can periodically grow these with system calls that allocate blocks of disk to keep it efficient, maybe several hundred megabytes at a time when running at scale. That way each server makes its own placement decisions, and no master is necessary to figure out data placement. Especially since you went with a MySQL architecture, I think this makes a lot of sense. I guess it does make things more complex on the erasure coding side, since each block would then have to be erasure coded individually. It might be worth exploring though, since you could completely eliminate the requirement of a master.
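
Roughly, the per-node placement looks like the sketch below. The names (Slab, Node, Put) and the 32 GB watermark are just illustrative, not production code from either system:

    package main

    import (
        "errors"
        "fmt"
        "sync"
    )

    type Slab struct {
        path   string
        size   int64
        closed bool
    }

    type Node struct {
        mu    sync.Mutex
        slabs []*Slab
        next  int // round-robin cursor over open slabs
    }

    const highWatermark = 32 << 30 // close a slab once it reaches ~32 GB

    // Put appends a block to the next open slab; no central master is
    // consulted for placement. The actual disk append is omitted here.
    func (n *Node) Put(block []byte) (*Slab, error) {
        n.mu.Lock()
        defer n.mu.Unlock()
        for i := 0; i < len(n.slabs); i++ {
            s := n.slabs[(n.next+i)%len(n.slabs)]
            if s.closed {
                continue
            }
            s.size += int64(len(block))
            if s.size >= highWatermark {
                s.closed = true
            }
            n.next = (n.next + i + 1) % len(n.slabs)
            return s, nil
        }
        return nil, errors.New("no open slabs on this node")
    }

    func main() {
        n := &Node{slabs: []*Slab{{path: "slab-0"}, {path: "slab-1"}, {path: "slab-2"}}}
        s, _ := n.Put(make([]byte, 4<<20)) // a 4 MB block lands in the next open slab
        fmt.Println("wrote to", s.path)
    }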

Of course my system is considerably simpler since my smallest atomic unit is a whole file (keyed by SHA-1 hash + length to deduplicate); like Facebook's Haystack, it was designed primarily to be an image store. I also do full replication rather than erasure coding, and only plan to add erasure coding later.

One thing you didn't go into very much that would be very interesting is the on-disk format of the bucket. I used a modified version of folly RecordIO that returns the disk offsets, and I stored my payload using Thrift with the SHA-1 hash and some flags. This allows for both quick indexing to each block and a nice recovery story. I'd also be interested in what sort of index format you are using; I store mine in both MySQL and in a Google sparsehash table by record token. Storing the index locally has great benefits if the MySQL server goes down, and I'd be curious whether your system does that as well.
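
For illustration, a record in that style is roughly a length prefix, the SHA-1 content key, some flags, then the payload, with the returned offset going into the index. This is just a sketch in the same spirit, not folly RecordIO's actual wire format:

    package main

    import (
        "bytes"
        "crypto/sha1"
        "encoding/binary"
        "fmt"
    )

    // appendRecord writes one length-prefixed record to the slab buffer and
    // returns the offset where it starts, which an external index (MySQL, an
    // in-memory hash table, etc.) can store keyed by the SHA-1.
    func appendRecord(slab *bytes.Buffer, flags uint32, payload []byte) int64 {
        offset := int64(slab.Len())
        sum := sha1.Sum(payload)
        binary.Write(slab, binary.LittleEndian, uint32(len(payload))) // length prefix
        slab.Write(sum[:])                                            // 20-byte content key
        binary.Write(slab, binary.LittleEndian, flags)                // e.g. deleted/compressed bits
        slab.Write(payload)
        return offset
    }

    func main() {
        var slab bytes.Buffer
        off := appendRecord(&slab, 0, []byte("hello blocks"))
        fmt.Printf("record at offset %d, slab now %d bytes\n", off, slab.Len())
    }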


Technically the different copies of the buckets aren't "identical" since they may get populated in unserialized ways (they may store data in different orders), but yes, they all store the same set of blocks.

Note that I just described the replicated data storage in the blog post. Approximately 24 hours later these volumes get packaged up and erasure coded into a much more efficient storage representation, so the initial replicated writes are just there for high durability and efficient writes while the buckets are mutable.

I'm not sure I understand exactly what your question is but I think it's asking why we use 1GB buckets allocated all over the place rather than just picking a set of disks and filling them up RAID-style?

This is a pretty important point: If we picked X nodes and stored all the same data on them then if we lost one of those nodes we'd only have X-1 other nodes to re-replicate from. This sets up a hotspot in the system and limits how quickly we can re-replicate. This can be a pretty big deal if we lose a machine storing 1PB of data - we need to amortize that reconstruction cost over a large number of NICs and disk spindles (this is covered in more detail in the blog post).
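
To put rough numbers on that (made-up round figures, not our actual NIC speeds or node counts):

    package main

    import "fmt"

    func main() {
        const lostBytes = 1e15        // ~1 PB on the failed machine
        const nicBytesPerSec = 1.25e9 // ~10 Gb/s usable per source node

        // Recovery time falls roughly linearly with the number of nodes that
        // each hold a small share of the lost buckets.
        for _, sources := range []int{2, 100, 1000} {
            hours := lostBytes / (float64(sources) * nicBytesPerSec) / 3600
            fmt.Printf("%4d source nodes: ~%.1f hours to re-replicate\n", sources, hours)
        }
    }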

The other advantage of having "small" buckets is that they're a manageable unit for combining into larger volumes when erasure coding, or for tracking partial repair operations, or for garbage collection and compaction when data is deleted.

Seems like it might be worth going into more detail on erasure coding in a future blog post.


My main question was why you always store the same blocks together, and you definitely answered that.

Interesting approach. It definitely achieves your goal of spreading out the load. I think this can be achieved without always storing the blocks together, though, and it might help move toward a less coordinated system.

Here is how I solved the problem:

You can arrange each instance (you call it a cell in your post) so it has different pools of storage. One arrangement that might make sense: the machine at the top of each rack is one pool, the next machine down in each rack is another pool, and so on. Replication is always done across pools. The machines in a pool are written to in a round-robin fashion at the client. The end user can choose how many pools they want to replicate over.

Abstract each machine as a storage unit. Each storage unit actually contains a series of 'slabs', which are largish files (generally 32 GB in size) that grow in some large increment and close out when they hit a high watermark. Each write to a machine goes to a different writable slab in round-robin fashion. Each server knows the path and the contents of each slab, and on disk failure all the individual pieces are replicated to another set of open slabs just as if the blocks came in fresh (excluding the slabs that already have a copy). Since all requests are balanced across the pools, and each client manages its own round-robin list locally, the individual pieces end up evenly distributed over the machines in the other pools.
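
The client-side placement is roughly like this toy sketch (hypothetical names, assuming one replica per pool):

    package main

    import "fmt"

    type Pool struct {
        name  string
        nodes []string
        next  int // this client's private round-robin cursor
    }

    // pickReplicas returns one target node per pool, advancing each pool's
    // cursor so that, from this client's point of view, writes spread evenly.
    func pickReplicas(pools []*Pool, replicas int) []string {
        targets := make([]string, 0, replicas)
        for i := 0; i < replicas && i < len(pools); i++ {
            p := pools[i]
            targets = append(targets, p.nodes[p.next%len(p.nodes)])
            p.next++
        }
        return targets
    }

    func main() {
        pools := []*Pool{
            {name: "rack-top", nodes: []string{"a1", "b1", "c1"}},
            {name: "rack-mid", nodes: []string{"a2", "b2", "c2"}},
            {name: "rack-bot", nodes: []string{"a3", "b3", "c3"}},
        }
        for i := 0; i < 4; i++ {
            fmt.Println(pickReplicas(pools, 3)) // three replicas, one per pool
        }
    }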

I tried really hard to optimize for a zero-master situation, though, so my clients handle contacting all the different machines and do the caching and retrying themselves. It sounds like I had different goals than you did, so I'm not surprised you made different trade-offs.


Yep, looks like we may have had slightly different goals here. Another one of our design requirements was that we wanted to be able to move data around within the system (repair operations, rebalancing, encoding, compaction, etc.) without dealing with the metadata overhead of addressing hundreds of billions of blocks. Statically mapping a block to a 1GB bucket allows us to manage placement decisions at the granularity of buckets rather than orders of magnitude more individual files/blocks.
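
As a loose illustration (hypothetical structures, not our actual schema), think of it as two levels of indirection, where only the bucket-level mapping ever needs to change when we move data:

    package main

    import "fmt"

    // Placement models the indirection: block -> bucket is written once and
    // never changes, while bucket -> nodes is the only thing repair and
    // rebalancing jobs need to rewrite.
    type Placement struct {
        blockToBucket map[string]uint64
        bucketToNodes map[uint64][]string
    }

    // moveBucket relocates every block in a bucket by updating a single entry.
    func (p *Placement) moveBucket(bucket uint64, newNodes []string) {
        p.bucketToNodes[bucket] = newNodes
    }

    func main() {
        p := &Placement{
            blockToBucket: map[string]uint64{"blockA": 42, "blockB": 42},
            bucketToNodes: map[uint64][]string{42: {"osd1", "osd2", "osd3", "osd4"}},
        }
        p.moveBucket(42, []string{"osd5", "osd6", "osd7", "osd8"})
        fmt.Println(p.bucketToNodes[p.blockToBucket["blockA"]]) // the block follows its bucket
    }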


Thanks for sharing a great post.

Do you think a 512MB bucket might work better than a 1GB bucket? Half as much data to read when restoring an erasure-coded bucket and 2x quicker restore times, for only a 2x increase in bucket metadata costs? Erasure-coded bucket slices (around 85MB per slice assuming 6/9 Reed-Solomon) should be just right to amortize disk seeks.


Thanks!

It's always a bit of a tradeoff. A 512MB bucket size would double the size of the Replication Table, and we want to make sure we keep the entire working set in the buffer cache. 1GB also has the advantage of filling up more slowly, which gives us more time to close full buckets and create new ones as needed.

That said, 1GB is a bit of an arbitrary number which just happens to work pretty well. 512MB might work just fine. The exact number isn't that significant since we typically lose a whole disk at a time and thus need to recover a certain number of terabytes regardless of the bucket size.


Thanks, that makes sense!


Nice engineering work! This is a really interesting case study.

Has the investment into bespoke storage technologies and data centers paid off already, given the real costs to build and set it up, and opportunity costs for other uses of engineering time? How long did it take to pay off through reduced costs, or when do you expect it will in the future? It would be interesting to learn at what point it might be worthwhile for a large player in a space like file storage to switch from existing commodity cloud solutions or proprietary on-premise solutions to custom ones -- as an extrapolation basis for other problems.

The general implication behind my question is this: a custom solution can provide lower cost when viewed in isolation on a per-unit basis, but the total cost to the business must also account for the engineering work to develop, troubleshoot, deploy, and migrate to the storage system; the time needed to lease and build data centers; the long-term operation of the facilities and software stacks; and the people hired specifically to do all of that. It also has to account for the other things one could have done but cannot, due to the decision to focus on a given problem (e.g., if the investment purely lowers cost, that might be a tradeoff against another project that adds differentiating features; I know your blog post also mentioned performance, so it's not either-or). To make a good decision one must also weigh the expected value of the alternatives, e.g., the current and predicted feature sets of existing cloud or on-premise storage technologies, their expected prices over time, and their expected regional availability. There's also the complication of owning substantially more assets rather than having operational expenses with cloud providers or other third parties.

I'm just curious to learn more about how you think about navigating the decision space. Of course, there's significant strategic value in "owning one's own destiny", but it also seems to have required significant investment and attention from the Dropbox engineering team for perhaps a few years.

Any general advice regarding how big your use-case needs to be before this kind of engineering is worthwhile? How do you analyze the situation to identify when that's the case -- how do you identify the inflection point with confidence? Was it your prototyping that helped you determine that you could execute on this approach?


These systems are of such a scale that the materials/power/real estate costs dwarf the engineering people costs.

The tradeoff depends on your business, your team, your risk tolerance, and your time scales. But a rule of thumb might be: if you don't have more than 10PB or so, you probably shouldn't bother.


I believe a team of one or two talented individuals can build something that scales well and that will save a HUGE amount of money once you get into any significant fraction of a petabyte of data. One or two months of S3's list price can easily pay for the cost of the hardware. You can easily go with whitebox hardware from, say, Supermicro, and colocation is so much less expensive. If you're not storing a significant fraction of a petabyte, it's probably going to be cheaper, development-wise, to just use S3. You also don't need to build something as complex as their system; they have to scale to exabytes. Unless you plan to hit that scale immediately or already have that much data, you can build something less complex and smaller to start.


Just a caveat I'll add: you have to work very hard to build a system with durability and availability guarantees equivalent to or better than the big cloud storage providers'. It's typically not possible to be significantly cheaper with those same guarantees unless you're at large scale with well-established supply-chain organizations, etc.

Of course if you don't need 11+ nines of durability or multi-datacenter redundancy, or if you have lower access requirements, then you can exploit those reduced requirements to your advantage.


To be clear, we do need those durability and availability guarantees :)

We just benefit from our very-large scale and close alignment between hardware and software to tailor our infrastructure to our workloads.



Great stuff! Thanks for publishing it. I'm looking forward to future posts.

I'm curious about how you're managing the data on a drive itself. Are you storing the blocks as individual files on a filesystem? Are you doing direct management of the block device itself? Something else?


Yup, as @jamwt says we directly manage the block device.

That said, up until somewhat recently our data really was written into 1GB extents stored as files on an XFS filesystem. This mostly worked fine, but we switched away from it so that we could directly manage disk layout for SMR storage (shingled magnetic recording, a newer type of hard drive with poor random-write performance), along with some performance improvements, increases in storage density, and some minor reliability improvements from avoiding the filesystem.

It takes a lot more operational tooling to directly manage a block device since you obviously can't use all the standard filesystem tools.


We're going to go into detail on that (the OSD component) in a future blog post... but short version, yep, it's a custom "filesystem" directly done on the block device.


That's very interesting. When you post about that, could you go into why you chose that over, say, large files (32 GB or so) that each contain a collection of blocks? XFS or ext4 can preallocate large chunks of a file beforehand, so the data looks nice on disk. A fully custom block-level device sounds like a nice way to minimize overhead; I guess it goes to show that extreme scale makes certain decisions that seem crazy from a support point of view start to make sense.
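
For reference, the preallocation approach I'm describing is roughly fallocate(2) on XFS/ext4, something like the sketch below (assuming Linux; error handling trimmed, file name made up):

    package main

    import (
        "log"
        "os"
        "syscall"
    )

    func main() {
        f, err := os.OpenFile("slab-0001.dat", os.O_CREATE|os.O_RDWR, 0644)
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        // Reserve a 256 MB extent up front (mode 0 also extends the file size),
        // so later appends land in contiguous, pre-reserved space.
        const extent = 256 << 20
        if err := syscall.Fallocate(int(f.Fd()), 0, 0, extent); err != nil {
            log.Fatal(err)
        }
    }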


I'm not an expert in this area, but at a glance, an MP cell looks a lot like a GFS cluster, right? It seems like the big differences are that MP is designed for

a) a huge number of smallish files, accessed by a large number of clients

b) file writes, rather than appends

c) cross-cell, cross-zone replication

Is this impression correct, or am I missing other major design differences?


It's been a very long time since I read the GFS paper but it's similar in many ways - there are only so many ways to design a large-scale blob store and they all share a lot of the same design elements.

If I remember correctly GFS has a coordinator node in the critical path, much like the name node in HDFS, which impacts availability if it goes down. The nice thing about the Master in MP is that it's completely disjoint from the data-plane and not a critical component in the system.

Your assessment is pretty accurate, and the cross-zone replication component is one distinguishing characteristic. MP also makes heavy use of highly-efficient erasure coding as a first-class design element, which wasn't included in the original implementation of GFS but was added later.

The biggest difference from an API perspective is that MP is only a blob store whereas GFS is an actual filesystem (we implement the filesystem at a higher abstraction layer). From what I understand MP is closer in spirit to the underpinnings of Colossus than GFS.


Do you really create a filesystem? A mountable one? I can't see why you would want that kind of overhead. (I don't mean the mount you provide to users for their files.)


The whole thing is an incredible achievement - you have a new design, running on a new infrastructure, written in a new language.

I'd be very interested in any information on how you gained confidence in the product. There are so many ways that a software issue could bring all that redundancy down.


This is a really great question since the vast majority of the work was in ensuring correctness and reliability: everything from testing discipline to fault injection to auditing. This also included hardware testing, like pulling out circuit breakers to test our power distribution, or overheating a rack to test graceful shutdown.

I'll give a slightly lazy answer here however and point you to a talk I gave about building durable systems, which covers a lot of this material: https://www.oreilly.com/events/velocity/devops-web-performan...


What are some of the cool advantages of rebuilding Magic Pocket with Rust (outside of just memory usage)?


(We actually only rebuilt two components in Rust -- most of Magic Pocket is still Go.)

Memory usage/control, easy FFI with C, safety/correctness of code. Those were the big advantages for our projects.

Edit: deterministic destruction of resources is pretty great too. Yay RAII.

Edit 2: So is the ability to use the "standard" C tool suite for performance, memory, and debugging: valgrind, jemalloc profiling, gdb/lldb, perf. All that stuff.


What were some of the funnest engineering challenges Dropbox faced when deciding on architectural implementations or how to get a system that could store exabytes?


Since this is a distributed system, did you consider using Erlang for the implementation?


Nope, honestly. Go is Dropbox's primary infrastructure programming language. There's an extensive internal library/culture/toolset around Go, so it made the most sense to use that.


> immutable block storage system

There are higher order problems that result in that insight.



