Libsqlfs: A POSIX-style file system on top of an SQLite database (github.com/guardianproject)
100 points by networked on Oct 3, 2014 | 49 comments



Tandem Computers actually did implement files on top of a database. Their OS didn't have files. Their system was a replicated database running directly on the hard disk, with no file system.

There's something to be said for this. Your files get ACID properties. If we were serious about file integrity, we'd have file systems that worked like this:

- Unit files. The unit of data is the entire file. Files are written once, and when closed successfully, the file transaction commits and others can read the file. Any update replaces the entire file as an atomic operation. (Many applications need this, and try to approximate it with various move and rename operations, usually leaving stray files behind if things fail at the wrong moment; a sketch of that idiom appears below.)

- Log files. You can only add at the end. Writes are atomic. In the event of a crash, the file is valid up to some recently completed write. (On many systems, log files can tail off into junk or contain truncated records.)

- Temporary files. When the process or process group exits, they're gone. Random access is OK. (You shouldn't have to clean up junk temporary files.)

- Managed files. These support a database or something with complex structure. There are extra I/O functions for locking, flushing and being sure a write has been committed to disk.

That covers most of the use cases for files. There have been file systems which did some of this, but not in recent years.
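A minimal sketch of the write-then-rename idiom mentioned above, which is roughly what "unit files" would make unnecessary (filenames hypothetical, error handling trimmed):

    /* Write buf to path atomically: write a temp file, fsync it,
       then rename() over the target.  rename() is atomic on POSIX,
       so readers see either the old file or the new one, never a
       partial write. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    static int atomic_replace(const char *path, const char *tmp,
                              const void *buf, size_t len)
    {
        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
            close(fd);
            unlink(tmp);  /* the stray file left behind on failure */
            return -1;
        }
        close(fd);
        return rename(tmp, path);
    }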


OS/400 systems (by IBM, also called AS/400, System i, iSeries, IBM i, etc.) are a database operating system: there are no files, only "libraries". The entire thing sits on top of a customized DB2 database.


From the README: Copyright 2006, Palmsource, Inc., an ACCESS company.

I wonder if this has anything to do with when PalmOS got a 'filesystem' even though the OS originally only gave programs a database to interface with, way back when...


Good catch. It might have something to do with the Palm LifeDrive[1] and its 4 GB limit, or the NVFS[2] of the Palm 650 and other models. Who knows? It seems like it could be used by a software services company to make a thin client with an encrypted operating system that could detect tampering or cloning. I'm guessing that's what the Guardian Project[3] does and why it gets mentioned along with SQLCipher.

[1] http://how-to.wikia.com/wiki/Howto_replace_microdrive_with_c...

[2] http://en.wikipedia.org/wiki/Non-Volatile_File_System

[3] https://guardianproject.info/


From a presentation by Hans Reiser on Reiser4. Reiser had just explained the new Linux VFS layer introduced by Reiser4 and the ability to see each file as a small directory or a set of records.

Audience: Then what is the difference between a DB and a filesystem?

Reiser: Marketing.


I too would like to remove unnecessary terms and concepts. But it seems that splitting things into FS / DB / apps gives each group a way to optimize in its own way, at the cost of communication/reintegration.

ps: I always found the IBM/COBOL record-oriented FS a good idea. It removes a lot of ad-hoc parsing code from loading and writing data.
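A minimal sketch of what record orientation buys you (the record layout here is hypothetical): with fixed-length records, "parsing" reduces to reading whole structs.

    #include <stdio.h>

    /* One fixed-length record; a record-oriented FS stores these
       natively instead of as an undifferentiated byte stream. */
    struct record {
        char name[32];
        int  balance;
    };

    int main(void)
    {
        FILE *f = fopen("accounts.dat", "rb");
        struct record r;
        if (!f)
            return 1;
        while (fread(&r, sizeof r, 1, f) == 1)
            printf("%.32s %d\n", r.name, r.balance);
        fclose(f);
        return 0;
    }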


Time to put a SQLite database on this new filesystem!


lol they created a recursive monster *.+


It would be more fun to see Oracle run on it.


Why not use a local loop mount? You'd still have a filesystem in a single file.

At least on Linux.


Storage overhead. The loopback device is still a block device; if you skip that layer, you don't waste block overhead when storing a file smaller than the block size. If you're going to store a large number of small files, this FUSE <-> SQL bridge could be a win.
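You can see that overhead on any block-based filesystem by comparing a file's logical size with its allocated size (st_blocks is in 512-byte units):

    #include <stdio.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
        struct stat st;
        if (argc < 2 || stat(argv[1], &st) != 0)
            return 1;
        printf("logical size: %lld bytes\n", (long long)st.st_size);
        printf("allocated:    %lld bytes\n", (long long)st.st_blocks * 512);
        /* A 100-byte file on a 4 KiB-block filesystem typically
           reports 4096 allocated bytes; packing records into SQLite
           pages avoids paying that per small file. */
        return 0;
    }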


But this has block overhead too. Why not just use a smaller block size?


It uses blocks but records get packed on disk. (I think; it's certainly capable of it.)

More obviously, the database as a whole is dynamically sized.


Some filesystems can compress away that overhead in this case.


I do not know if the library takes advantage of it, but with an ACID-compliant database as the backend, one could make use of transactions at the filesystem level.
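A hedged sketch of what that could look like with the raw SQLite API (the "meta" and "blocks" tables here are hypothetical; libsqlfs's actual schema may differ):

    #include <sqlite3.h>

    int main(void)
    {
        sqlite3 *db;
        if (sqlite3_open("fs.db", &db) != SQLITE_OK)
            return 1;
        /* Group a data write and its metadata update into one
           transaction: either both land or neither does. */
        sqlite3_exec(db, "BEGIN", 0, 0, 0);
        sqlite3_exec(db, "UPDATE blocks SET data = x'CAFE' "
                         "WHERE path = '/etc/config' AND blockno = 0",
                     0, 0, 0);
        sqlite3_exec(db, "UPDATE meta SET mtime = strftime('%s','now') "
                         "WHERE path = '/etc/config'", 0, 0, 0);
        sqlite3_exec(db, "COMMIT", 0, 0, 0);
        sqlite3_close(db);
        return 0;
    }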


Something that's lacking in the Linux world is an efficient way to determine which files have changed since a given time. Using an SQLite db could be really helpful in this situation.


A good use case for this might be a filesystem with a large number of files, where one could quickly find recently changed files without having to sequentially scan the whole system. Presumably the access and modified times would be stored as separate indexed attributes.
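Something like this, assuming a hypothetical "meta" table with an indexed mtime column (the real libsqlfs schema may differ); the query becomes one B-tree lookup instead of a walk over every inode:

    #include <sqlite3.h>
    #include <stdio.h>

    int main(void)
    {
        sqlite3 *db;
        sqlite3_stmt *stmt;
        if (sqlite3_open("fs.db", &db) != SQLITE_OK)
            return 1;
        sqlite3_exec(db, "CREATE INDEX IF NOT EXISTS idx_mtime "
                         "ON meta(mtime)", 0, 0, 0);
        sqlite3_prepare_v2(db, "SELECT path FROM meta "
                               "WHERE mtime > ?1 ORDER BY mtime",
                           -1, &stmt, 0);
        sqlite3_bind_int64(stmt, 1, 1412294400);  /* Oct 3, 2014 UTC */
        while (sqlite3_step(stmt) == SQLITE_ROW)
            printf("%s\n", (const char *)sqlite3_column_text(stmt, 0));
        sqlite3_finalize(stmt);
        sqlite3_close(db);
        return 0;
    }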


I was going to reply with something along the lines of "modern filesystems should already have metadata indexes"... because I thought I'd read about that in connection with ext4, btrfs and/or zfs. Now I'm no longer certain... all I came up with was the following, which might be of interest:

BabuDB (related to extreemfs.org): http://dl.acm.org/citation.cfm?id=1849822

Spyglass: http://www.ssrc.ucsc.edu/pub/leung09-fast.html

And more from the same researchers: "Scalable File System Indexing": http://www.ssrc.ucsc.edu/proj/fsindexing.html

Perhaps I'm thinking of reiser4?

See also:

TokuFS: https://www.usenix.org/conference/hotstorage12/workshop-prog... https://github.com/esmet/tokufs

I'm still certain I've seen talk of indexing metadata, and I thought it was in an actual, open, system...


Metadata indexing isn't really a filesystem-level problem though, depending on what you're trying to do.

An index can be anywhere, and the data you want to keep is somewhat arbitrary - so why not just use a file on the filesystem anyway?

There's also an important difference: metadata indices can be considered disposable in a lot of cases. File data isn't, which means the constraints are different: with metadata you want to pack it all into the tiniest, most local part of the disk you can.

With file-data (and the actual filesystem) you want to distribute and replicate that data as widely as possible to minimize the chances that a cluster of failures wipes out important structures.


The typical example (lifted from one of the links, I forget which) is looking at a huge filesystem and quickly seeing the files changed since a given date, the files which have the "archive" bit set, querying for files by name with an unknown path (i.e., running "find"), etc.

Indexing on other metadata, like tags for images and music files, can be considered a filesystem-level problem if one considers approaches like Mac (or Amiga) resource forks/info files.

Perhaps I should have said "file metadata" as opposed to "just" metadata... One could of course claim that the only thing the filesystem does is take an exact path and return the data at that path. In that case, you could replace the path+name with a GUID, store the filename and path info in a file, and update that file whenever you accessed a file... and then you end up building some of that into the filesystem interface. So the question really is where the filesystem ends and the "system" starts...


> I'm still certain I've seen talk of indexing metadata...

This reminded me of BFS: https://en.wikipedia.org/wiki/Be_File_System


That could be it... thanks. I went as far afield as Hammer (DragonflyBSD) -- but didn't think to look at BeFS...


Though not exactly the same thing, Tcl can use its VFS extension to provide a "file system" for an application backed by a database or ZIP archive.

The VFS allows bundling up an app as a self-contained executable (a "starpack") or as something runnable with the Tcl/Tk interpreter (aka "tclkit"). FS access is transparent to the app: read/write operations are the same for all FS types.

Some work has been done to use SQLite as a VFS data storage medium. I haven't yet tried it myself, but in principle it's not too hard to accomplish. I'm putting that project on my list...


Interesting project.

Out of curiosity, when/why would someone want to use something like this?

SQLite is a file database, in that the database is literally a file, which means it will reside on another already existing filesystem - so you would have:

    Abstract Filesystem
    -------------------
         SQLite DB
    -------------------
      OS Filesystem


It looks like they wanted to implement an encrypted userspace filesystem, presumably in an environment without something like EncFS.


That's exactly where I thought it might be interesting, too.


A portable, encrypted, transactional "abstract" filesystem that works with existing POSIX semantics and code.

Would be useful where you have a legacy codebase and want to deploy it in new scenarios where POSIX filesystem access is not guaranteed.


DMG (a disk image that can be mounted) is the main way to distribute OS X applications, and something like this could have fulfilled the same use case if Linux weren't as fond of centrally distributed packages.


It's not like Linux lacks disk image formats, though, like cloop (used by Knoppix). It's just that when you don't have to distribute weird metadata like resource forks, what's the point of using a whole filesystem versus a tar file?


Even when you do have weird metadata, it's easy enough to retrofit archive formats to handle it, which Mac OS X did years ago. Disk image distribution is a complete anachronism now, and it's odd how persistent it is.


DMGs still have the advantage that you can customise the Finder window that opens when you double-click one [1].

So, from a UX standpoint, they still make a lot of sense for distributing application bundles, since they compel users to "install" the application by dragging the bundle icon to the appropriate directory, which is conveniently symlinked in that Finder window. A DMG is also easier to produce than a self-installing package.

With a .zip or .tar.gz, users would be left with an app bundle and no idea where to put it. Generally they would just throw it somewhere random in the filesystem. At least, that's what the non-technical Mac users I know happen to do.

[1]: https://support.cdn.mozilla.net/media/uploads/gallery/images...


I think the proper solution to that, if "somewhere random" is actually a problem, is to have apps that move themselves on first launch. It's pretty easy to do, and IMO easier than creating a consistent build process for a good-looking DMG.


The solution is to use a .tar file, but give it a custom extension (like, say, .deb) and so when the user double clicks on it, the system knows it's a software package and installs it in the appropriate directory.


One can still install a program "somewhere random across the filesystem" on most OSes, even Windows. It just so happens that by convention most programs get installed to the "Program Files" directory, but that is not a requirement.


The only reason I can think of is DRM. Sometimes you are allowed to distribute physical media where keys are scrambled outside the regular filesystem, and perhaps that allows for image distribution as well. But I'm guessing here.


It could be very convenient as an alternative to tarfs when you want to package up an entire X into a mount point for distribution.


I could see this being a ridiculously easy way to distribute portable applications.


I would really like it if they did LGPL with a static-link exception. Dynamic linking can be a pain to deal with cross-platform.


Many people say that DBs are not good for large blobs, hence the advice to put large files such as images in the file system, not in the database. What's different here that makes it acceptable to put files in the database?


Reminds me of Mike Olson's Inversion Filesystem from the early '90s: http://db.cs.berkeley.edu/papers/S2K-93-28.pdf


There's a universe of archive formats that do this without the need for a structured query language or FUSE. But I'm sure somebody has a need for this...


Archive formats are for archiving. They barely support writes at all, let alone atomic random writes.


Must port this to Windows - there are Dokan and CbFS (commercial), which are basically FUSE for Windows...


The reports I see say that Dokan is more trouble than it's worth.

And the problem I have with CbFS is that it's fairly expensive for a hobbyist like me to consider developing a file system on top of it. :-/

I wish there were a good, free FUSE for Windows.


Funny, I have just spent most of the morning trying to find a user-mode file system for Windows. The best I can come up with is mounting WebDAV or using a .NET SMB implementation. I wonder if it would be better to just build a software iSCSI target?


Try Dokan, you might like it. It's not perfect, but it's free and might just work for you.


Any performance benchmarks?


Reminds me of the abandoned WinFS project (Windows Future Storage): http://en.wikipedia.org/wiki/WinFS


In the "why use this?" department, I think the answer to the question is that you want to distribute massive amounts of content in a single-file (making content updates easy) while also maintaining a POSIX-level control over the content in the container.. so if you've got a massive SQL database full of content - say a dictionary or wikipedia, or whatever - doing a content transform that spits out a single .db to your paid/registered/updating customers is quite useful..





