Joyent Open Sources SmartOS: Zones, ZFS, DTrace and KVM

mrb · on Aug 15, 2011

Wow, this is huge. Having an OS be able to run both ZFS and KVM enables fantastic things. For example you can store each virtual machine disk template on a dedicated ZFS filesystem, and use "zfs clone" to rapidly deploy VMs (instead of using the KVM-level support for base disk images "qemu-img create -b ..."), as well as "zfs snapshot", "zfs rollback", etc. The main advantage is that these clones and snapshots are possible while using the simple and fast "raw" KVM disk image format instead of the notoriously slower "qcow2" format that was, until today, the best format supporting base images and snapshots.
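
A minimal sketch of the clone-based deployment described above, written as a Python helper that only builds the command lines and does not run them. The dataset names (tank/templates/debian, tank/vms), the snapshot name "gold", and the function name clone_commands are all hypothetical; "zfs snapshot" and "zfs clone" are the real subcommands. It assumes Python 3.8+ for shlex.join.

```python
import shlex


def clone_commands(template, snapshot, vm_names, vm_parent="tank/vms"):
    """Build the zfs commands that clone one template dataset into several VMs.

    Nothing is executed here; the caller decides whether to print or run them.
    All dataset names are illustrative assumptions, not SmartOS defaults.
    """
    origin = f"{template}@{snapshot}"
    # One read-only snapshot of the template serves as the origin of every clone.
    commands = [["zfs", "snapshot", origin]]
    for name in vm_names:
        # Each clone is a writable dataset that shares unmodified blocks with the origin.
        commands.append(["zfs", "clone", origin, f"{vm_parent}/{name}"])
    return commands


if __name__ == "__main__":
    for cmd in clone_commands("tank/templates/debian", "gold", ["vm01", "vm02"]):
        print(shlex.join(cmd))
```

Each clone starts out sharing all of its blocks with the template snapshot, which is why the raw disk image inside it does not have to be copied at deploy time.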

I, for one, have been wanting to use ZFS specifically like that for a while. This does not compare at all to running ZFS on, say, an NFS server or iSCSI SAN serving data to a server running KVM (ZFS data integrity is only verified remotely on the storage server, it is slower, etc).

Who ported KVM to Illumos? I know some old version of QEMU was running on Solaris at some point in the past, but I had no idea they had a full blown KVM port.

bcantrill · on Aug 15, 2011

KVM was ported to illumos by Max Bruning, Robert Mustacchi and me. Details are here: http://t.co/knW1UIJ

slillibri · on Aug 15, 2011

Kind of an off topic rant, but can we at least avoid url shorteners that point to other url shorteners? Here is the original link http://www.slideshare.net/bcantrill/experiences-porting-kvm-...

mambodog · on Aug 16, 2011

Honestly, no one should be using URL shorteners on HN. Use a footnote[1] if necessary.

[1] http://www.google.com/search?q=hi,+this+is+a+footnote

RexRollman · on Aug 16, 2011

I like the footnote form. I use that in e-mails as well.

fanf2 · on Aug 16, 2011

SlideShare is also a world of hate. Try http://www.linux-kvm.org/wiki/images/7/71/2011-forum-porting...

emaste · on Aug 16, 2011

I really like the use of DTrace in #if 0'd out code for identifying the highest priorities on porting. Having faced a similar task that seemed impenetrable at the start I would have really liked to have this technique available.

Nice work and congrats on the port.

djcapelis · on Aug 15, 2011

Uhm, isn't that a violation of the GPL given that you're mixing the complete work with non-compatible (by design) CDDL code? Did you ask for KVM to be relicensed for the port or something?

bcantrill · on Aug 15, 2011

There is not a licensing issue here: both KVM and our KVM port remain licensed under the GPL.

djcapelis · on Aug 15, 2011

Isn't your port of KVM integrated into the Solaris kernel? How are you ensuring they are two independent and separate works under copyright law?

kmavm · on Aug 15, 2011

My (somewhat casual) understanding is that the Solaris kernel has always, by necessity, supported modules under a different license much more deeply than Linux. Commercially significant software like the Veritas filesystem and device drivers have always worked this way. So, e.g., the Solaris kernel has a real, no-foolin' ABI.

justincormack · on Aug 15, 2011

That doesn't mean that linking in a GPL module doesn't make the rest of the kernel GPL. A real ABI might not mean it is not linked. Allowing commercial modules and linking with GPL ones are not the same.

I don't know though, the definition of linking is pretty opaque to me.

nl · on Aug 15, 2011

It doesn't matter - Joyent doesn't own the copyright for Solaris so nothing that they do can change the licencing conditions for Solaris itself.

You can write GPL'ed device drivers for Windows - doesn't make Windows GPL.

bonzini · on Aug 16, 2011

> You can write GPL'ed device drivers for Windows - doesn't make Windows GPL.

The public interface exposed to Windows drivers is _much_ more shallow than the public interface exposed to Linux and Solaris/Illumos modules. Not coincidentally, the Windows driver exports overlap a lot with the interface that Linux allows for usage in proprietary drivers.

Instead, KVM uses a lot of hooks into the kernel innards (into the scheduler, into the MMU, etc.) that are marked as EXPORT_SYMBOL_GPL in the Linux kernel, and the Illumos port does use some of the same hooks.

I am not a lawyer, so I don't have any clear answer. I also work for Red Hat, so even if I had one I would not want to say it (should not say it). But the licensing question is _not_ an idle question.

Also, the licensing question does not detract anything from the awesome work of these guys.

bcantrill · on Aug 16, 2011

Actually, our port does not use these hooks -- there were zero mods to the illumos kernel to support KVM per se.

bonzini · on Aug 16, 2011

You do not get to choose whether it matters that the hooks (such as context ops) were preexisting. Perhaps it does, perhaps it doesn't. But it's the same as trying to define what is a derived work, and neither you nor the KVM copyright holders can do that.

In the meanwhile, just thank whoever said that GPL-incompatibility was an explicit design goal for the CDDL.

bcantrill · on Aug 16, 2011

The definition of a derived work is actually much more crisp than you are making it out to be -- and it is simple fact (and not opinion) that illumos is not a derived work of our KVM module. Indeed, this issue is so clear-cut and your objections so unfounded that this is beginning to sound like you are deliberately trying to place doubt over the legality of our work. So as long as we're playing Pretend Lawyer, consider your employer and look this one up: tortious interference.

bonzini · on Aug 16, 2011

> this is beginning to sound like you are deliberately trying to place doubt over the legality of our work

Oh come on. Again: if anyone has placed doubt on it, it's whoever chose CDDL because it was GPL-incompatible. All I'm saying is the question is not idle. If it was so clear-cut, it would not have popped out in 3 out of 3 places where I read about KVM/Illumos (your blog, LWN, HN).

bcantrill · on Aug 15, 2011

No; the KVM kernel module is in its own repo: https://github.com/joyent/illumos-kvm

sp332 · on Aug 16, 2011

But distributing binaries built with some GPL'd code is a violation of the GPL if you don't publish all the code that went into the binary.

rednaught · on Aug 15, 2011

Can you provide more details of Crossbow? Who maintains it/status? I hadn't heard anything about it since OpenSolaris went dead.

bcantrill · on Aug 15, 2011

Could you be more specific? It's there, it works, it's in shipping illumos-based products (from Joyent and others) and it's supported by those who ship it or stand it up. In that regard, it's like any other aspect of the system...

rednaught · on Aug 15, 2011

Thank you. I guess I just have to go dig through the Illumos/OpenIndiana documents.

jlawer · on Aug 16, 2011

While ZFS adds some advantages, if you were going for snapshots and quick deploy with KVM your best bet was to use LVM as a raw device. A quick LVM snapshot + virt-clone + virsh start will let me have a cloned VM up within seconds.

That's not saying ZFS doesn't add a lot (checksum backed store, heap of other features). But I would actually expect LVM to be faster, precisely because it is "crappier" (no checksumming, very basic address remapping, etc).

The real question will be support: whether they can convince people that this will have enough life to build a user base large enough to support it moving forward. Especially if Oracle stops publishing code under open source licenses.

mrb · on Aug 16, 2011

The problem with LVM snapshots is that you have to manually reserve a portion of the logical volume to store them. And if you calculate your needs incorrectly, LVM runs out of space to preserve them, and they become corrupted (they are reported as "INVALID" in lvdisplay IIRC).

That's why I prefer KVM's built-in base image support (qemu-img create -b ...) to quickly provision VMs, instead of LVM. That's how I architected KVM hosts for my employer, running 100+ lightweight VMs each, with 1000+ disk images available on disk.
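
To make the snapshot-reservation problem above concrete, here is a rough sketch that models a classic LVM snapshot's copy-on-write area filling up. It is a deliberate simplification: it assumes every newly changed MiB on the origin volume consumes one MiB of the reserved area, and it ignores metadata overhead and chunk granularity. The function name and the numbers are hypothetical.

```python
def snapshot_state(reserved_mib, new_changed_mib_per_day, days):
    """Estimate the state of a fixed-size snapshot area after some days of writes.

    reserved_mib: space set aside for the snapshot when it was created.
    new_changed_mib_per_day: origin data overwritten for the first time each day
        (only the first write to a region needs a preserved copy).
    Returns "INVALID" once the preserved data no longer fits, mirroring the
    failure mode described above; otherwise a fill percentage string.
    """
    used_mib = new_changed_mib_per_day * days
    if used_mib > reserved_mib:
        # The snapshot can no longer preserve old blocks, so it is unusable.
        return "INVALID"
    return f"{100 * used_mib / reserved_mib:.0f}% full"


if __name__ == "__main__":
    # Hypothetical guest: 1 GiB reserved, about 100 MiB of fresh changes per day.
    for day in (1, 5, 10, 11):
        print(day, snapshot_state(1024, 100, day))
```

Under these assumed numbers the reservation lasts ten days and fails on the eleventh, which is the kind of sizing guess that has to be made up front for every snapshot.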

amazingman · on Aug 15, 2011

http://smartos.org/2011/08/15/kvm-on-illumos/

nyellin · on Aug 15, 2011

To clarify, Joyent open sourced SmartOS. Zones, ZFS, DTrace, and KVM are maintained by other companies and have been open source for ages.

Edit: That's not to imply Joyent is freeloading. There is another post on HN with an example of the awesome work they're doing: http://news.ycombinator.com/item?id=2887092

strlen · on Aug 15, 2011

This is really great. Joyent has been running this internally for years. The combination is just great: ZFS, DTrace and KVM speak for themselves, but another great ingredient is their use of the NetBSD userland (in place of legacy Solaris one).

Linux is great, but a monoculture benefits no one.

rednaught · on Aug 15, 2011

I'm having a difficult time wrapping my head around this. If Joyent wanted to stick with a UNIX then why not FreeBSD and help implement KVM there? Wasn't XEN already available for Dom0? I just don't see much mindshare in Illumos/SmartOS for keeping up with current server hardware drivers. I really would have liked to see one of the existing BSDs benefit from the time they've put into this project.

Obviously Linux already has KVM, and there are new implementations of DTrace and ZFS. Btrfs is not too many more kernel releases away from being considered stable (fsck being a glaring problem) and if you want container-based virtualization you've got LXC (some prefer OpenVZ).

With that said, who is their target audience?

grey · on Aug 15, 2011

The Linux implementations of ZFS are dog slow, and for the last few weeks I've been evaluating btrfs as a possibility for my home storage server, and I've ruled it out completely.

btrfs has been "Close to a stable release" for years now, but if you follow their mailing list there are still people reporting total loss of large filesystems once or twice a week (this week I saw filesystem corruption from an unexpected power outage, and last week there was a data loss/corruption bug caused by a pool of drives with different sector sizes). I desperately want btrfs to be mature, but it's more than a few releases away from being stable.

dap · on Aug 15, 2011

You're suggesting Joyent should have finished porting DTrace, ZFS, and Zones to FreeBSD, and then port KVM to FreeBSD, instead of just porting KVM to Illumos?

rednaught · on Aug 15, 2011

FreeBSD has already had ZFS and DTrace for years. They have Jails which would have benefited from Solaris' stronger Zones/Containers.

There are already a lot of BSD users. Where are the Illumos users? Are you suggesting that all BSD users go to SmartOS? I used HP-UX and AIX for many years... the one-off UNIX systems just end up recreating each other due to fragmentation. They are not the way forward. The smaller mind share/usage of OpenSolaris/Illumos isn't helping. Wouldn't contribution to a BSD have strengthened the community?

dap · on Aug 15, 2011

Not at all. The beauty of KVM on SmartOS is that Joyent's customers can run BSD, or Linux, or Windows, or whatever they want without even knowing it's running on top of Illumos. That's why the user base isn't as much an issue. But the underlying foundation is critical. I'm actually really glad BSD is adopting DTrace and I'm sure it will be successful, but the BSD documentation says that it's experimental and not yet production ready (http://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/dt...). Solaris DTrace has been in production since at least 2005.

Disclaimer: I work for Joyent, but I do not speak for them.

moe · on Aug 15, 2011

> without even knowing it's running on top of Illumos

What's the advantage of running Linux->KVM->Illumos instead of Linux->KVM->Linux?

acdha · on Aug 15, 2011

DTrace and ZFS are both great and lack competitive alternatives on Linux. For years, the horrid state of Solaris' userland made it hard to justify order-of-magnitude support increases but this lets you use the best parts without having to suffer through things like a system-destroying package updater[1] simply to get a reliable filesystem.

1. At a previous employer, we bought Sun compute nodes. Nice hardware, arrived with Solaris 10 preinstalled. Each time we got a new shipment I'd get them racked and see if the updater wouldn't render the system unbootable after the first run. Second reboot was always to the Debian installer.

moe · on Aug 15, 2011

Well, pretty much the same here. We ran linux on xfires until recently because we loved the hardware, but wouldn't touch solaris with a 10ft pole.

I'm skeptical DTrace and ZFS can justify the risk to invest in a niche platform. Personally I'm holding out for the linux alternatives (btrfs, ceph) to mature.

rednaught · on Aug 15, 2011

I agree the foundation is critical. So would you say that the extensive community use of KVM on SmartOS has decided it is production quality compared to use on Linux?

Freaky · on Aug 15, 2011

There's been a FreeBSD KVM port for years too, albeit more a proof of concept than anything: http://retis.sssup.it/~fabio/freebsd/lkvm/

hello_moto · on Aug 15, 2011

They have had a lot of investment around Solaris (let's not argue about Solaris, OpenSolaris, Illumos) for a few years now. They brought in a lot of ex-Sun engineers, hired open source activists around OpenSolaris, and probably built their infrastructure around it as well.

rednaught · on Aug 15, 2011

Thanks. But didn't they also have a lot of FreeBSD such as during the TextDrive days?

hello_moto · on Aug 15, 2011

Probably customer + talent acquisition as opposed to technology/infrastructure.

lsc · on Aug 16, 2011

last I looked, xen was culled from openindiana (then the only functional Illumos based distro) - this was back in the days when we thought xen was dying, back before it was pushed upstream in Linux.

darklajid · on Aug 15, 2011

The biggest yay for me is in the fine print:

"SmartOS is comprised of the Illumos kernel (with ZFS, DTrace, OS-level virtualization and next-generation KVM) with __BSD package management and a GNU toolchain__."

Oh god, this makes me happy. I gave Solaris a couple of tries, but I've never felt at home there. These two facts mean that I'll try again.

binarycrusader · on Aug 16, 2011

Solaris 11 Express has the GNU toolchain and a modern package management system.

For those that want something like SmartOS but free, and with a modern package management system, I would suggest OpenIndiana. OpenIndiana will be integrating many of the SmartOS changes.

barrkel · on Aug 16, 2011

Try out nexenta too. Debian userland.

wmf · on Aug 15, 2011

They're really going out of their way to not say "Solaris".

po · on Aug 15, 2011

They're really going out of their way to not use Solaris.

SmartOS is comprised of the Illumos kernel (with ZFS, DTrace, OS-level virtualization and next-generation KVM) with BSD package management and a GNU toolchain.

I used Solaris for almost 5 years and when Sun was bought I thought for sure the cool parts of it were doomed. Illumos popped up and I thought that they didn't have a chance, it would wither and die. It's very nice to see that is not the case.

wmf · on Aug 15, 2011

> They're really going out of their way to not use Solaris.

Considering that the Illumos kernel is basically the OpenSolaris kernel, I don't understand what you mean. They are using Solaris, but for some reason they're trying to hide it.

po · on Aug 15, 2011

This project is certainly not using Solaris which is now an Oracle product (and trademark) and not open source. It is based on Illumos (which originally came from an OpenSolaris kernel as you said) but my understanding is that OpenSolaris as a project is dead. I don't think it's hiding something to name-check Illumos instead of OpenSolaris.

nknight · on Aug 15, 2011

Illumos != Solaris, unless you'd like Oracle to beat you with a stick for trademark infringement.

I suppose they could refer to it in technical documents as "the OpenSolaris kernel", but by that same logic, an OS using the DragonFly BSD kernel should say it's using "the FreeBSD kernel".

killerswan · on Aug 15, 2011

Trademark disputes with hyper-litigious competitors would not be fun...

thirdhaf · on Aug 15, 2011

Well since they have a lot of former Sun engineers working for them I don't really think they have to SAY Solaris. As others have mentioned there are trademarks involved here too. Sun/Oracle have been hemorrhaging talent, particularly the former Fishworks team, for a while now and a lot of those people are ending up in places like Joyent.

mkup · on Aug 15, 2011

Better they had ported KVM to FreeBSD.

Who will add support for new hardware to the basically abandoned OpenSolaris kernel? Oracle? No. That small company? No. Leveraging the FreeBSD community to grow the list of supported hardware would be much, much wiser.

kraemate · on Aug 15, 2011

This is simply amazing. Tracing KVM using DTrace is going to throw up some pretty useful and surprising results. And add ZFS on top of that as well, with all its features (esp. dedup+COW) useful for virtualized hosting.

Anyone know if Linux has any variant/clone of DTrace yet? (and no, not systemtap)

otoburb · on Aug 15, 2011

Paul Fox seems to have been working on a Linux DTrace port since 2008 (based on the directory listing). He blogs about his progress on DTrace (and other things) at http://crtags.blogspot.com/

sogrady · on Aug 16, 2011

Here's Bryan Cantrill, one of the original creators of DTrace, on the status of the port:

https://plus.google.com/105843697186982227624/posts/5uk5SHrh...

nathanb · on Aug 15, 2011

"ZFS makes SANs and other expensive, redundant storage systems obsolete"

Yes, because clearly what enterprise-level storage customers really want is to roll their own storage solutions using open source technology.

daeken · on Aug 15, 2011

Many, many companies do want this, yes. Back in 2005-2006, I worked at MP3tunes and among other things I helped design the storage system. After testing a bunch of storage solutions (many of them high-end systems), we ended up rolling our own using Linux + MogileFS. This was scaled up to a couple hundred terabytes before I left, and even higher afterwards. Sometimes the off-the-shelf solutions simply don't work, especially at scale.

mitchty · on Aug 15, 2011

Pretty much. The number of times we've had issues with new storage vendor arrays from a 3 letter company that starts with an E, and ends with a C, is a bit too much to count.

Firmware bugs that affect anything on a fabric, that affect how it distributes its cache slot locks on writes, issues with the entire array acting funny, which are blamed on either the server hardware or os itself until proven to be an issue with the array (this is far too common to be honest, yay for "support" contracts), etc....

Yes you get "support", and I use the term lightly, with big vendors, but you also have to take a machete through their support organization to get to someone that can help you with a problem. At times rolling your own solution will end up being both cheaper and less problematic. The old adage of "Nobody ever got fired for choosing IBM" may soothe managers' minds, but wait until you do end up buying those fancy pants high end arrays and find out how much snake oil turns out not to work on them.

acdha · on Aug 15, 2011

We had great experiences with NetApp's products and support but it was priced accordingly.

toddmorey · on Aug 15, 2011

If it can be an order of magnitude cheaper, why not? Enterprise storage has commanded a high premium for a long time and of course enterprises are storing more data than ever, not all of which needs the same performance characteristics. I welcome the shakeup in this space. The old days where you fretted endlessly over which data to preserve on expensive SANs are fading behind us.

nathanb · on Aug 15, 2011

I think in general the premium goes to pay for support contracts and service as much as for the atoms and bits. Of course, it can always be argued whether those contracts and services are worth the premium, but many enterprise customers (Google being the obvious high-profile exception) seem to think so.

acdha · on Aug 15, 2011

In many cases, yes, because you can only afford gold-plated storage for your most important apps.

toddmorey · on Aug 15, 2011

I'd also argue that you don't always need gold-plated storage. Think about user docs that are infrequently accessed, numerous, and must be retained for long periods. Deploying NetApp or EMC for that use case doesn't seem to make sense with options like this or OpenStack object storage.

nathanb · on Aug 15, 2011

If you mess up your data retention, though, you can get in big legal trouble. Sometimes you have to ask yourself whether you want to be the one who's liable for that.

sciurus · on Aug 15, 2011

Can anyone either explain or link to material that explains the relationships between the different projects and distributions that have sprung up since Oracle discontinued OpenSolaris?

piotrSikora · on Aug 16, 2011

After OpenSolaris OS/Net kernel development was closed, 3 forks were made: Illumos, SchilliX-ON and Stormix (sadly, they all seem quite inactive).

There are 3 distributions based on the first one: OpenIndiana, Nexenta and SmartOS. There is also SchilliX (based on SchilliX-ON) and StormOS (based on Stormix).

This is just based on my existing knowledge, so there might be more forks and/or distributions out there.

binarycrusader · on Aug 16, 2011

Both IllumOS and OpenIndiana are quite active (which shouldn't be surprising since the latter relies on the former).

piotrSikora · on Aug 16, 2011

Is it, really? According to GitHub's commit history, there were 297 commits made since onnv_147, so ~1 commit/day. Prior to the fork, OpenSolaris was getting ~10 commits/day.

To put this into perspective, there were over 8000 commits made to OpenBSD in the same time-frame. I won't even try to compare this with FreeBSD or Linux...
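
For what it's worth, the rates implied by the figures quoted above can be worked out in a few lines. This is only back-of-the-envelope arithmetic: the length of the window is not stated, so it is inferred here from "297 commits at ~1 commit/day", and "over 8000" is treated as exactly 8000, which makes the OpenBSD figure a lower bound.

```python
def implied_days(commits, commits_per_day):
    """Length of the time window implied by a commit count and a daily rate."""
    return commits / commits_per_day


def daily_rate(commits, days):
    """Average commits per day over a window of the given length."""
    return commits / days


# Figures quoted in the comment above; the window length is an inference.
window_days = implied_days(297, 1.0)
openbsd_rate = daily_rate(8000, window_days)
# What the pre-fork pace of ~10 commits/day would have produced in the same window.
opensolaris_equivalent = 10 * window_days

if __name__ == "__main__":
    print(f"window: about {window_days:.0f} days")
    print(f"OpenBSD: at least {openbsd_rate:.1f} commits/day")
    print(f"pre-fork pace over the same window: about {opensolaris_equivalent:.0f} commits")
```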

Don't get me wrong, I love the effort, but at the same time I feel like there wasn't any real progress made since the project's inception.

binarycrusader · on Aug 16, 2011

It depends on your definition of active.

As an open source project, it's definitely active.

Is it as active as the actual Solaris codebase? No. But it also has significantly fewer developers working on it.

piotrSikora · on Aug 16, 2011

Well, it's active as in "not dead", but not as in "actively developed operating system".

I'm aware of the difference in the number of developers, but you have to agree that there are hardly any user-visible changes since onnv_147.

binarycrusader · on Aug 16, 2011

Uh, no, it is definitely an "actively developed operating system".

As for user visible, actually, yes, changes made to dtrace recently for example.

Sparse zones as another example, and so on.

Remember that Solaris (and its derivatives, unlike Linux) is userland + kernel -- not just the kernel. So changes in userland count in my opinion when considering "actively developed operating system".

piotrSikora · on Aug 17, 2011

Apparently our definitions vary a bit ;)

> As for user visible, actually, yes, changes made to dtrace recently for example.

> Sparse zones as another example, and so on.

Those changes came from SmartOS/Joyent and were not made by the IllumOS developers.

Don't get me wrong, I really appreciate their work, but I've already lost hope with IllumOS/OpenIndiana... Fortunately Joyent stepped in, so I hope that things will speed-up a bit now :)

jjm · on Aug 15, 2011

Cool. I see this really helping me when testing my application prior to deployment. Would like to see some more movement with No.de. Have been waiting for a bit to get access.

jkahn · on Aug 16, 2011

I'm struggling to understand why ZFS makes enterprise storage obsolete. One of the benefits of the SAN is that it is shared amongst a number of servers.

How does ZFS help with that? Assuming you install SmartOS as virtualisation host, you'd still need some kind of shared storage.

[Edit:] The reason I point this out is that if you lose a host, you lose all VMs running on that host. And then you won't easily be able to get them back online on another host if you've lost your storage as well.

wmf · on Aug 16, 2011

You can use ZFS+Comstar to create a SAN controller.

organico · on Aug 18, 2011

But you were able to do this before SmartOS... I'm curious about this too, and how Joyent have architected their cloud arrangement. I feel like perhaps hypervisors have local ZFS storage, as opposed to large SAN-backed virtual machines, but I can't really find much useful information on how Joyent/SmartOS implement/architect storage...

kermitthehermit · on Aug 16, 2011

This looks very good.

How can packages be added? How can it be reconfigured? How is one supposed to add management tools (think libvirt, openstack) to it?

I took it for a spin and noticed it had kvm, a whole lot of regular Illumos binaries and that it had a BSD-ish userland.

I couldn't find anything like libvirtd, I didn't get to check if it has it, though.

sgt · on Aug 15, 2011

Fantastic news. I've been excited about Illumos for a long time, and now we're finally getting a decent distro with commercial backing. My servers will be very happy when they hear about this.

RexRollman · on Aug 16, 2011

As an OS geek, I always enjoy the announcement of a new OS, even if it is one I personally have no use for. Choice is good.

foobarbazoo · on Aug 16, 2011

Would be sweet to run SmartOS on Backblaze's storage hardware.

ristretto · on Aug 15, 2011

I'm happy for all the involvement of Joyent in open source, but how about also dropping the prices of their accelerators? I've been using the same servers with the same prices for 3 years with zero upgrades.