The mmap pattern (john.freml.in)
87 points by physicsistic on May 1, 2014 | 46 comments



"By preserving state explicitly to memory backed files, several classes of unexpected events causing the program to crash can be recovered from with minimal disruption."

The cure is surely worse than the disease!

The program state in memory at the time of abnormal termination is likely to be inconsistent, leading to an unusable file. The subset of it that happens to have been committed to disk is likely to be worse.

Contrast with the traditional approach of explicit serialisation / deserialisation for persistence:

* serialisation occurs when the program is in a known state. We can reason about what invariants hold at the point when state is sent to storage. We can take steps to confirm that data actually has been committed to persistent storage before we treat the operation as complete (see the sketch after these points).

* deserialisation recreates program state in a clean environment from minimal data. This has the side effect of reinitialising state that may have caused the unexpected termination. Coupled with the ability to reason about program state at the point of serialisation, it makes it feasible to attempt error recovery.

* Explicit serialisation and deserialisation stages make it possible to construct persistent state that is portable across environments and through development.
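A minimal sketch of the "confirm it actually hit persistent storage" point, assuming the traditional write-temp-file / fsync / rename pattern (file names are illustrative):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Write the serialised state to a temp file, force it to disk, then
       atomically replace the old copy. Only after this returns 0 do we
       treat the save as complete. */
    int save_state(const void *buf, size_t len)
    {
        int fd = open("state.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
            close(fd);
            return -1;
        }
        close(fd);
        return rename("state.tmp", "state.dat");   /* atomic swap into place */
    }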

mmap() is blindingly fast, but it is a tradeoff between performance, complexity and reliability - naive use across the board is unlikely to improve all three!


Portability of binary data structures across environments is IMO irrelevant. We export to plaintext for migrations. A binary format that is portable across both 32- and 64-bit CPU architectures will be wasteful on one architecture or the other, and the point of approaches like this is to optimize runtime performance.

Your point about serialisation occurring with a known state is not lost in an mmap approach. If you have transactional writes (as LMDB does), then writing to the mmap also only occurs with a known state.

In reality, e.g. in OpenLDAP, you wind up with a minimal serialisation step, but with zero deserialisation. E.g., if I want to store a record:

    typedef struct foo { int len; char *data; } foo;

The struct and the data of the char ptr may originally have been obtained from two separate mallocs and thus reside at very different locations in memory. My serializer would simply allocate a new block and store the character data contiguously with the end of the struct. If you mmap at a fixed address, then you can store the address of the character data directly in the struct's "data" field, and the next time you come around to use this record, no deserialization is needed. Explicit deserialization is truly just a waste.
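A minimal sketch of that packing step, assuming an allocator that hands back space inside the mapping (names here are hypothetical, not OpenLDAP's actual API):

    #include <stddef.h>
    #include <string.h>

    typedef struct foo { int len; char *data; } foo;

    /* Copy the struct and its character data into one contiguous block.
       If the file is mmap'd at a fixed address, rec->data can be stored
       as-is and the record is usable on the next run with no decode step. */
    foo *pack_foo(void *(*alloc)(size_t), const char *src, int len)
    {
        foo *rec = alloc(sizeof(foo) + len);
        if (!rec)
            return NULL;
        rec->len = len;
        rec->data = (char *)(rec + 1);   /* data sits right after the struct */
        memcpy(rec->data, src, len);
        return rec;
    }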


Also, beyond the state in the file being inconsistent: even if you manage to get a complete atomic write of your program state before you crash, the very same program state that caused the crash (if the crash was your own program's fault) is now being faithfully read back from disk! You're just asking to crash again immediately by blindly loading it all back.


This is why you must have ACID transactions to use this approach.

Even without mmap, if your program successfully writes an invalid data structure to disk, the next time you run and read it back, you will crash. That's orthogonal to the method used to write and read the program state.



The act of serializing data structures and reloading them can be a guard against long-term corruption.

If you only use mmap, there's a risk of some corner of the object graph getting subtly wrong owing to a bug in one version of the software, and never getting repaired.

Versioning of data structures is also a problem.

I'd reserve this pattern for use cases where copying memory on load has a measurable impact on performance, and where the required integrity of the data isn't very high - ideally, it should be possible to regenerate the data from primary sources.

One case I know of where this technique was used was the precompiled headers file for the Borland C++ compiler. There, it makes sense on both counts.


The "just mmap" pattern has also got portability issues if you might move the data file between systems which could have different endianness, pointer size, alignment requirements, etc. (Serialization done properly avoids this because you can swap to a consistent host-independent on-the-wire representation as part of the process.) This is also less of an issue for the cases you suggest where you can regenerate the data from a primary source -- you can just treat the host-dependent data file as a cache that can be blown away if moving to a different machine.


Isn't the potential for long-term corruption purely related to not rewriting the entire file from scratch every time you make changes? Even without mmap, any data store where you make incremental changes is potentially vulnerable to this sort of thing.

One obvious example would be filesystems. They're basically specialized databases which treat your whole disk as a single gigantic file, and of course there's a long, proud history of programs used to repair corruption in them due to bugs or other problematic events.

To me, that's an argument for not using formats with incremental updates if you can get away with rewriting the file each time, but once you have enough data that you can't afford a total rewrite each time, does mmap make the problem any worse?


Most app file formats do not get the same level of reliability testing as filesystems. Filesystems need years of use by brave people before they're considered OK for production.

If you're updating files in place without a whole rewrite, then you are effectively writing a filesystem.

I prefer to use a different approach where possible: write a diff log, and occasionally do a rewrite of the whole file.
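Roughly what I mean, as a sketch (the threshold and the snapshot routine are illustrative, not from any particular system):

    #include <stdio.h>

    static FILE *logf;                     /* append-only diff log */
    static long  changes_since_snapshot;

    void record_change(const char *diff_line)
    {
        fprintf(logf, "%s\n", diff_line);
        fflush(logf);                      /* add fsync(fileno(logf)) for durability */
        if (++changes_since_snapshot >= 10000) {
            /* hypothetical: dump the full state to a new file, then truncate the log */
            /* rewrite_whole_file(); */
            changes_since_snapshot = 0;
        }
    }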

When you do have to treat the on-disk file as a filesystem, I'd explicitly treat it as such. Have a low-level structure that does not need modification for different versions, is portable and highly tested. Do application-level structure with normal read-modify-write, but layered on top of the virtual filesystem layer. Consider having two modes, one where the virtual filesystem just proxies the real filesystem, and instead of a "file", you have a directory structure on disk. Consider using an existing component for this (zip archive libraries are popular, but not particularly efficient for writing).

Mmap does make the problem worse. It encourages you to be lazy towards versioning and portability.


This is basically used for this little gem:

http://symas.com/mdb/

LMDB is at the heart of the ubiquitous OpenLDAP and is very well optimized (look at the benchmarks). It is optimized for reads in particular, which is important.

I would imagine mmap-ing with a large volume of writes will result in unpredictable performance...


You can try and manage the predictability using msync(), which you'll need anyway if you're doing any kind of transactional write.

For more general use, a simple way to keep the amount of 'pending writes' down to reasonable levels is to have a thread walking your mmap()'d regions and enforcing an msync() on a regular basis. This also gives you guarantees such as "each part of mmap'd file is no more than N seconds out of date".
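A minimal sketch of that flusher thread, assuming a single mapped region (the struct and interval are illustrative):

    #include <pthread.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    struct mapping { void *addr; size_t len; };

    /* Periodically force dirty pages of the mapping out to the file, so no
       part of it is ever more than ~1 second behind memory. */
    static void *flusher(void *arg)
    {
        struct mapping *m = arg;
        for (;;) {
            sleep(1);
            msync(m->addr, m->len, MS_ASYNC);   /* MS_SYNC to block until written */
        }
        return NULL;
    }

    /* usage: pthread_create(&tid, NULL, flusher, &my_mapping); */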

Of course, you'll potentially increase disk load overall, since a page dirtied three times, once every second, might only get written out once by the kernel (after 5s) but if you're syncing every 1s it'll get written three times.

But you should be able to take control.


Writes to mmap-ed regions aren't very reliable: besides the synchronization issue, there's also the question of what happens on an I/O error or ENOSPC. Usually your program just gets a SIGBUS and crashes. IIRC LMDB uses regular syscalls to update the mapping, and uses mmap only for reads.


Correct, at least for the original design of LMDB. We later added an option to write thru the mmap as well, at the request of some other users. Since LMDB is quite careful about writing and use of fsync/msync, reliability generally isn't an issue, but yes, hard errors on the disk and such can crash your program when using a writable mmap. Also, in the default mode of read-only mmap, the DB is incorruptible, stray pointer writes will crash the program instead of corrupting the DB. With the writable mmap you obviously have no such protection. We recommend developing your app with the default read-only mode, and only enabling the writable mmap after your app code is well tested.


For a while I've wanted a nice C library for an mmapped heap with allocation and common data structures (and maybe locks?) - all the usual stuff you'd expect in a standard library, but with support for relative offsets instead of pointers, crash robustness, introspection, and other features required to work well with a persistent file. I don't know of any such library that currently exists.
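The relative-offset part of that wish list is easy enough to sketch; something like this (names are illustrative, not from an existing library):

    #include <stdint.h>

    typedef uint64_t rel_ptr;   /* byte offset from the start of the mapping; 0 = NULL */

    /* Turn an offset back into a usable pointer for this mapping of the file. */
    static inline void *rel_get(void *base, rel_ptr off)
    {
        return off ? (char *)base + off : NULL;
    }

    /* Store a pointer as an offset so the file stays valid wherever it's mapped. */
    static inline rel_ptr rel_make(void *base, void *p)
    {
        return p ? (rel_ptr)((char *)p - (char *)base) : 0;
    }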


I believe this Java library would qualify: https://github.com/peter-lawrey/Java-Chronicle

"This library also supports distributed, durable, observable collections (Map, List, Set)" "It uses almost no heap, trivial GC impact, can be much larger than your physical memory size (only limited by the size of your disk) and can be shared between processes with better than 1/10th latency of using Sockets over loopback."


For C++:

http://www.boost.org/doc/libs/1_55_0/doc/html/interprocess.h...

"Boost.Interprocess offers useful tools to construct C++ objects, including STL-like containers, in shared memory and memory mapped files"


SQLite


You can also check out libgist at http://gist.cs.berkeley.edu/libgist-2.0/


BerkeleyDB?


> One very key architectural decision for a system is the degree of reliability that it should possess. This is an explicit trade-off between the rapidity of development (in particular the level of indoctrination needed before new contributors are able to augment the feature set) and the operational stability. By preserving state explicitly to memory backed files, several classes of unexpected events causing the program to crash can be recovered from with minimal disruption. The benefit here is that the system can be developed rapidly with code reviews focusing on data integrity and paying less attention to complex interactions that can lead to crashes.

The mmapped files are going to break every time you make the smallest change to your data structures. This strikes me as not terribly useful for rapid development.


That depends on how they were created.

On our game project we use a data tool that allows you to create tables of data out of columns with specific data types as well as references to other table rows.

When you export from the tool it creates data files but also generates the C++ struct that represents the schema. The loader simply mmaps the data into memory. Variable-length data like strings gets packed at the end of the table, with the fixed-length records containing relative pointers into it.

The data on disk is stored in the exact format that is needed at run time.

References to rows in other tables turn into pointer-like objects, so you get a perfectly natural syntax, as if the data had been loaded into hand-written classes.

It's perfect for rapid development because designers can add new columns or tables of data and then just tell the programmer what they are called.

It also loads lightning fast and has excellent run time properties due to cache coherency.


This sounds like it could be a good fit for mobile on iOS (I can't speak to Android). If the app has significant startup costs, this lets you basically save a core dump of the structures that were slow to build. Then you could have an alternate fast startup mode that reads from that dump.


Your data structure design needs serious help if this is a problem for you. In OpenLDAP we store LDAP entries which can contain variable numbers of attributes and variable numbers of values per attribute, all in a unified format. Redefining your data structures is as simple as adding or removing an attribute or value from an entry. No sweat. Rolling your own data model is probably a bad idea...


You can plan ahead and leave "spare" space in the memory mapped structures for future expansion. Combined with versioning, you can take this quite far.
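For example, a sketch of what such a record might look like (field names are made up):

    #include <stdint.h>

    typedef struct account_rec {
        uint32_t version;        /* bumped whenever the layout changes */
        uint32_t flags;
        uint64_t balance_cents;
        uint8_t  reserved[48];   /* spare space for fields added in later versions */
    } account_rec;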

I've worked on financial transaction processing systems using memory mapped files as their primary means of data storage. It is very effective.



If you can live with boost, then boost::interprocess is a good way of managing a mmapped segment - shared with other processes or shared with yourself over time.

This gives you string, set, map, list, deque, and vector types that work in the shared space.

http://www.boost.org/doc/libs/1_55_0/doc/html/interprocess/a...


I did this for a high-performance time sequence database of robot activity. It's pretty hard. You need custom data structures for everything (no STL), adding fields to the data structures requires a schema migration tool, and it's easy to have subtle bugs around things like hash table growth.

However, it is insanely fast when it works.


This is one place where I'm wondering how Cap'n Proto [1] would help. Since the structures are already made for in-memory usage, I think it might work pretty well. And given that it's all versioned/tagged structure-wise, it should be easy to do the schema migration.

[1] http://kentonv.github.io/capnproto/


Wow, that looks great!


You can't use most STL containers by default, but Boost Interprocess solves most of that: http://www.boost.org/doc/libs/1_55_0/doc/html/interprocess/a...


kdb also mmaps its db files - it's a high-perf time series db, with exactly the same advantages and disadvantages you mention.


"explicitly choosing a file and mapping that into the memory space of the program means it can persist across crashes"

It's not that your program crashes, it's that your memory becomes corrupt and that causes the crash.

Restarting the program essentially rebuilds all that memory from scratch. If you persist your memory, your program will crash again on the next run.


Different types of crashes and different memory. You can crash because your program did not handle a network error well, etc. Also, you do not need to use mmap'ed memory for everything. You can still use the stack and the heap for temporary structures, but store permanent data in the mmap'ed segments. Think about persistent data vs. running queries in a database server.


How is it then different from sqlite?


SQLite does it for you; here you do it yourself. Presumably you'd have done a cost/benefit analysis before choosing this over SQLite, so your reasoning would be sound...


I don't know how sqlite3 works so I cannot tell you, but it sounds like a pattern that would fit it well.


I had a G+ thread about mmap a couple months ago that surfaced a lot of really good points. I like mmap a lot, but there are certainly some down sides:

https://plus.google.com/u/0/+KentonVarda/posts/NKUUzx2nEsN

If you're looking for an easy way to exploit mmap in your code, Cap'n Proto is a serialization format that works similarly to Protocol Buffers but is designed to work well with mmap():

http://capnproto.org

(Disclosure: I am the author of Cap'n Proto... and also the former maintainer of protobufs.)


Microsoft has __based pointers - i.e. pointers that are expressed relative to another pointer - not sure whether gcc/clang has something like it - http://msdn.microsoft.com/en-us/library/57a97k4e.aspx
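Roughly what that looks like, going by the linked docs (MSVC-only, sketch):

    /* Pointers declared __based(vpBuffer) are stored as offsets from vpBuffer,
       so the structure stays valid wherever that buffer ends up being mapped.
       Not portable to gcc/clang. */
    void *vpBuffer;   /* point this at the start of the mmap'd region */

    struct node {
        int value;
        struct node __based(vpBuffer) *next;   /* stored as an offset, used like a pointer */
    };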


Reminds me of a post about the internals of ITA, which mentions that they used memory-mapped files to make data access fast:

http://www.paulgraham.com/carl.html


> Modern CPUs have the ability to arbitrarily (generally at a 4kB granularity) map memory addresses as visible to a program to physical RAM addresses

While true and exceptionally powerful, TLBs are limited in size and, being essentially a hash table put into your CPU by Intel or AMD, are likely optimised for common-case mapping patterns. Of course, these penalties still pale in comparison to disk I/O... but it's kind of a downer if you're dreaming of something crazy like a linked list where every node lives in a separate random page.


"or try to mmap each region to a well-known start address (and fail if it cannot obtain that address)."

Optimizations aside, beware this approach. ASLR is one of your two best friends (the other is DEP). When you purposely circumvent the protection it provides, a security researcher somewhere will make you the topic of a very pointy blog post.
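For reference, the quoted approach boils down to something like this sketch - request a specific address as a hint and bail out if the kernel gives you anything else (which, as noted above, also means opting that region out of ASLR):

    #include <errno.h>
    #include <stddef.h>
    #include <sys/mman.h>

    /* Map 'len' bytes of 'fd' at exactly 'want', or fail. No MAP_FIXED, so an
       existing mapping is never clobbered; we just refuse the wrong address. */
    void *map_at(void *want, size_t len, int fd)
    {
        void *p = mmap(want, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            return NULL;
        if (p != want) {
            munmap(p, len);
            errno = EEXIST;
            return NULL;
        }
        return p;
    }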


Is it just me, or does this article completely ignore two huge downsides of mmap? Number one is that page faults are expensive; number two is that I/O failures become bad pointer dereferences.


Page faults are no more expensive than any other physical I/O. If the data you want isn't already in cache, a seek()/read() is as expensive as a page fault. More so, really, because the read has to do a buffer copy too.

Fair point about I/O failures though.


TFA says "By mmap'ing a region and then accessing into it, the overhead of creating multitudes of small interlinked items can be reduced hugely."

You do not need to call mmap directly, because malloc() will do it if your allocation exceeds the mmap threshold (M_MMAP_THRESHOLD, 128 KiB by default in glibc). So you get this optimization "for free" - the only important part is that you do not call malloc() for each tiny object.
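If you want smaller allocations to behave that way too, the threshold is tunable; a sketch (the value is illustrative):

    #include <malloc.h>
    #include <stdlib.h>

    int main(void)
    {
        mallopt(M_MMAP_THRESHOLD, 64 * 1024);   /* allocations >= 64 KiB go to mmap */
        char *big = malloc(1 << 20);            /* served by an anonymous mapping */
        free(big);                              /* munmap'd, returned to the OS */
        return 0;
    }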


I think TFA talks about mapping files into memory...


I'm sorry, this is probably a newbish question, but why didn't he use mmap for writes?



