Saturday, August 24, 2013

The Linux storage stack: is it ready for prime time yet?

I've been playing with LIO quite a bit since rolling it into production for Viakoo's infrastructure (and at home for my personal experiments). It works quite a bit differently from Intransa's StorStac. StorStac created a separate target for each exported volume, while with LIO you have a single target that exports a LUN for each volume. The underlying Linux kernel functionality is there to create a target per volume, but the configuration infrastructure is oriented around the LUN-per-volume paradigm.
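For the curious, here is roughly what the LUN-per-volume style looks like with the targetcli shell that drives LIO. This is a minimal sketch with hypothetical volume, volume group, and IQN names, not a copy of my production configuration:

    # create a block backstore for an existing LVM volume
    targetcli /backstores/block create name=vol1 dev=/dev/vg0/vol1

    # one iSCSI target for the whole box...
    targetcli /iscsi create iqn.2013-08.com.example:storage1

    # ...and every exported volume becomes another LUN under that target
    targetcli /iscsi/iqn.2013-08.com.example:storage1/tpg1/luns create /backstores/block/vol1

    # let a particular initiator (e.g. the Windows box) log in
    targetcli /iscsi/iqn.2013-08.com.example:storage1/tpg1/acls create iqn.1991-05.com.microsoft:winbox

    # persist the configuration across reboots
    targetcli saveconfig

Every additional volume is another backstore and another LUN under the same target, which is exactly the paradigm difference I'm talking about.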
Not a big deal, you might say. But it does make a difference when connecting with the Windows initiator. With the target-per-volume paradigm, the Windows initiator can show you which volume a particular LUN is connected to (assuming you give your targets descriptive names, which StorStac does). That in turn lets you easily coordinate management of a specific target. For example, to resize a volume you can offline it in Windows, stop exporting it on the storage server, rescan in Windows, expand the volume on the storage server, re-export it, then online it in Windows and grow your filesystem to fill the newly available space. Still, this is not a big deal. LIO performs quite well and has the underlying capabilities to serve the enterprise. So what's keeping the Linux storage stack from prime time? Well, here's what I see:
  1. Ability to set up replication without taking down your filesystems / iSCSI exports. Intransa StorStac had replication built in: you simply set up a volume of the same size on the remote target, told the source machine to replicate the volume to it, and it started replicating. Right now replication in the Linux storage stack is handled by DRBD. DRBD works very well for its problem set -- local-area high-availability replication -- but setting up replication after the fact on an LVM volume simply isn't possible. You have to create a DRBD volume on top of an LVM volume, then copy your data into the new DRBD volume (see the first sketch after this list). One way around this would be to have your storage manager automatically create a DRBD volume on top of each LVM volume, but that adds overhead (and clutters your device table) and presents problems for udev at device assembly time. And it still does not solve the problem of:
  2. Geographic replication: StorStac at one time had the ability to do logged replication across a WAN. That is, assuming your average WAN bandwidth is high enough to handle the writes done during the course of a workday, a log volume collects the writes and ships them across the WAN in the correct order to be applied at the remote end. If you must do a geographic failover because, say, California falls into the sea, you lose at most whatever log entries have not yet been applied at the remote end. Most filesystems will handle that in a recoverable manner as long as the writes are applied in the correct order (which they are). DRBD *sort of* has the ability to do geographic replication via an external program, "drbd-proxy", that functions in much the same way as StorStac replication (that is, it keeps a log of writes in a disk volume and replays them to the remote server), but it's not at all integrated into the solution and is excruciatingly difficult to set up (which is true of DRBD in general).
  3. Note that LVM also has replication (of a sort) built in, via its mirror capability. You can create a replication storage pool on the remote server as an LVM volume, export it via LIO, import it via open-iscsi, create a physical volume on it, then create mirror volumes specifying this second physical volume as the place you want to put the mirror. LVM also does write logging, so it can handle the geographic situation. The problem comes with recovery, since what you have on the remote end is a logical volume that has a physical volume inside it that has one or more logical volumes inside it. The contortions needed to actually mount and use those inner logical volumes are non-trivial; it may in fact be necessary to attach the outer logical volume to a loop device and then run pvscan/lvscan against the loop device to get at them (the second sketch after this list shows the general idea). It is decidedly *not* as easy as with StorStac, where a target is a target, whether it's the target for a replication or for a client computer.
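Here's the first sketch: what retrofitting DRBD onto LVM roughly looks like. Hostnames, addresses, and volume names are hypothetical, and this is the DRBD 8.4-style syntax; the point is that the metadata and the /dev/drbd0 device get wedged in between your LV and your filesystem, which is why you end up copying data.

    # /etc/drbd.d/vol1.res (on both nodes) -- hypothetical names and addresses
    resource vol1 {
        device    /dev/drbd0;
        disk      /dev/vg0/vol1_new;   # a freshly created LV; you can't just layer
                                       # DRBD onto the LV that's already in service
        meta-disk internal;            # DRBD metadata takes the tail end of the LV
        on alpha { address 192.168.10.1:7789; }
        on beta  { address 192.168.10.2:7789; }
    }

    drbdadm create-md vol1             # initialize the metadata
    drbdadm up vol1                    # on both nodes
    drbdadm primary --force vol1       # on the node that will be primary (8.4 syntax)

    mkfs.xfs /dev/drbd0                # new filesystem on the replicated device...
    # ...and now copy everything from the old /dev/vg0/vol1 into it

Workable, but nothing like flipping a replication switch on a volume that's already in service.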
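And the second sketch: the LVM-mirror-over-iSCSI approach, and the contortions on the remote end afterwards. Again, every name, address, and device node here is hypothetical, and I'm glossing over the degraded-mirror and partial-volume-group handling you'd face in a real failover.

    # on the source server: log into the pool volume the remote box exports via LIO
    iscsiadm -m discovery -t sendtargets -p 192.168.10.2
    iscsiadm -m node -T iqn.2013-08.com.example:replica -p 192.168.10.2 --login

    # the new iSCSI disk shows up as, say, /dev/sdx -- turn it into a PV and
    # hang a mirror leg for an existing volume on it
    pvcreate /dev/sdx
    vgextend vg0 /dev/sdx
    lvconvert -m1 --mirrorlog disk vg0/vol1 /dev/sdx

    # on the remote server after a failover: the pool LV contains a PV which
    # contains the mirrored LV, so you have to dig it out by hand
    losetup /dev/loop0 /dev/vg_remote/replica_pool
    pvscan                        # finds the inner PV on /dev/loop0
    vgchange -ay --partial vg0    # activate the (incomplete) inner volume group
    mount /dev/vg0/vol1 /mnt/recovered

It works, but "a target is a target" it is not.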
So clearly replication in the Linux storage stack is a mess, nowhere near the level of ease of use or functionality of the antiquated ten-year-old Intransa StorStac storage stack. The question is, how do we fix it? I'll think about that for a while.

Meanwhile there's another issue: Linux doesn't really know about SES. This is a Big Deal for big servers. SES is the SCSI Enclosure Services protocol implemented by most SAS fanout chips, and it allows control of, amongst other things, the blinky lights that can be used to identify a drive (okay, so mdmonitor told you that /dev/sdax died -- where the heck is that physically located?!). There are basically two variants extant nowadays, for SAS and SAS2 expanders, and they differ very slightly (alas, I had to modify StorStac to talk to the LSI SAS2X24 expander chip, which very slightly changed a mode page that we depended upon to find the slot addresses). The kernel has a bare-bones ses driver and sg3_utils will let you poke an enclosure by hand, but nothing in the Linux storage stack ties that to md or LVM, much less blinks the right light for you when a member drive dies.
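To be concrete about what "poke an enclosure by hand" means, something along these lines -- with a hypothetical enclosure device and slot number -- is about the best you get today:

    # enclosures the kernel's ses driver has noticed, if any
    ls /sys/class/enclosure/

    # each slot directory (names vary by vendor) has locate/fault attributes;
    # writing 1 to locate blinks the identify LED
    echo 1 > /sys/class/enclosure/0:0:12:0/Slot01/locate

    # or talk SES directly with sg_ses from sg3_utils
    sg_ses --index=7 --set=ident /dev/sg12      # blink slot 7's identify LED
    sg_ses --index=7 --clear=ident /dev/sg12    # and turn it back off

Figuring out which slot corresponds to which /dev/sdX is left entirely as an exercise for the administrator, which is exactly the problem.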

And finally, there is the RAID5/RAID6 write hole issue. Right now the only reliable way to have RAID5/RAID6 on Linux is with a hardware RAID controller that has a battery-backed stripe cache. Unfortunately, once you do this you can no longer monitor drives via smartd to catch failures before they happen (yes, I do this, and yes, it works -- I caught several drives in my infrastructure that were doing bad things before they actually failed and replaced them before I had to deal with a disaster recovery situation). You can no longer take advantage of your server's gigabytes of memory to keep a large stripe cache, so that you don't have to keep thrashing the disks to load stripes on random writes (if the stripe is already in cache, you just update the cache and write the dirty blocks back to the drives, rather than reload the entire stripe). And you can no longer take advantage of the much faster RAID stripe computations allowed by modern server hardware (it's amazing how much faster you can do RAID stripe calculations with a 2.4GHz Xeon than with an old embedded MIPS processor running at much slower speeds). In addition, it is often very difficult to manage these hardware RAID controllers from within Linux. For these reasons (and other historical issues not of interest at the moment) StorStac always used software RAID. Historically, StorStac used battery-backed RAM logs for its software RAID to cache outstanding writes and recover from outages, but such battery-backed RAM log devices don't exist for modern commodity hardware such as the 12-disk Supermicro server that's sitting next to my desk. It doesn't matter anyhow, because even if they did exist, there's no provision in the current Linux RAID stack to use them.
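For what it's worth, the software-RAID half of that story is easy enough today. A rough sketch, with hypothetical device names (the tunable and the smartd.conf syntax are the standard ones, the values are just illustrative):

    # software RAID6 across ten of the enclosure's disks
    mdadm --create /dev/md0 --level=6 --raid-devices=10 /dev/sd[b-k]

    # let md keep a much bigger stripe cache in host RAM
    # (units are stripe entries of one page per member device,
    #  so 8192 * 4KiB * 10 disks is roughly 320MB)
    echo 8192 > /sys/block/md0/md/stripe_cache_size

    # and smartd can still watch every member drive, e.g. in /etc/smartd.conf:
    #   /dev/sdb -a -d sat -s (S/../.././02) -m root

What none of that gives you is anything resembling the battery-backed write log, which is the part that actually closes the write hole.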

So what's the meaning of all this? Well, the replication issue is... troubling. I will discuss that more in the future. On the other hand, things like Ceph are handling it at the filesystem level now, so perhaps block level replication via iSCSI or other block-level protocols isn't as important as it used to be. For the rest, it appears that the only thing lacking is a management framework and a utility to handle SES expander chips. The RAID[56] write hole is troublesome, but in reality data loss from that is quite rare, so I won't call it a showstopper. It appears that we can get 90% of what the Intransa StorStac storage stack used to do by using current Linux kernel functionality and a management framework on top of that, and the parts that are missing are parts that few people care about.

What does that mean for the future? Well, your guess is as good as mine. But to answer the question about the Linux storage stack: yes, it IS ready for prime time -- with important caveats, and only if a decent management infrastructure is written to control it (because the current md/lvm tools are a complete and utter fail as anything other than tools to be used by higher-level management tools). The most important caveat is, of course, that no enterprise Linux distribution has yet shipped with LIO (I am using Fedora 18 currently, which is most decidedly *not* what I want to use long-term, for obvious reasons). Assuming that Red Hat 7 / CentOS 7 will be based on Fedora 18, though, the Linux storage stack is as close to being ready for prime time as it's ever been, and proprietary storage stacks are going to end up migrating to the current Linux functionality or else fall victim to being too expensive and fragile to compete.

-ELG

Sunday, August 11, 2013

The killer app for virtualization

The killer application for virtualization is... running legacy operating systems.

This isn't a new thought on my part. When I was designing the Intransa StorStac 7.20 storage appliance platform I deliberately put virtualization drivers into it so that we could run StorStac as a virtual appliance on some future hardware platform not supported by the 2.6.32 kernel. And yes, that works (no joke, I tried it, of course; the only thing that didn't work was the sensors, but if Viakoo ever wants to deliver a virtualized IntransaBrand appliance I know how to fix that). My thought was future-proofing -- I could tell from the layoffs and from the unsold equipment piled up everywhere that Intransa was not long for the world, so I decided to leave whoever bought the carcass a platform that had some legs on it. So it has drivers for the network chips in the X9-series SuperMicro motherboards (Sandy/Ivy Bridge) as well as the virtualization drivers. That gives StorStac a pretty reasonable migration path into the next decade: first migrate it to Sandy/Ivy Bridge physical hardware, then once that's EOL'ed migrate it to running on top of a virtual platform on top of Haswell or its successors.

But what brought it to mind today was ZFS. I need some of the features of the LIO iSCSI stack and some of the newer features of libvirtd for some things I am doing, so I have ended up needing to run a recent Fedora on my big home server (which is now up to 48 gigabytes of memory and 14 terabytes of storage). The problem is that two of those storage drives are offsite backups from work (ZFS replication, duh) and I need to use ZFS to apply the ZFS diffsets that I haul home from work. That was not a problem for Linux kernels up to 3.9, but now Fedora 18/19 have rolled out 3.10, and ZFSonLinux won't compile against the 3.10 kernel. I found that out the hard way when the new kernel came in and DKMS spit up all over the floor because of ZFS.

The solution? Virtualization to the rescue! I rolled up a CentOS 6.4 virtual machine, pushed all the ZFS drives into it, gave it a fair chunk of memory, and voila: one legacy platform that can sit there happily for the next few years doing its thing while the Fedora underneath it changes with the seasons.
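The "pushed all the ZFS drives into it" part is a couple of commands per drive with libvirt. A sketch, with the domain name, disk IDs, and pool name all hypothetical:

    # hand the physical ZFS member disks to the guest as virtio block devices
    virsh attach-disk centos64 /dev/disk/by-id/ata-WDC_WD4000-aaaa vdb --persistent
    virsh attach-disk centos64 /dev/disk/by-id/ata-WDC_WD4000-bbbb vdc --persistent

    # then, inside the CentOS 6.4 guest, where ZFS still builds happily:
    zpool import backup
    zfs receive -F backup/work < /mnt/transfer/weekly-diffset.zfs

The guest neither knows nor cares what kernel the host is running, which is the whole point.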

Of course that is nothing new. A lot of the infrastructure that I migrated from Intransa's equipment onto Viakoo's equipment was virtualized servers dating in some cases all the way back to physical servers that Intransa bought in 2003 when they got their huge infusion of VC money. Still, it's a practical reminder of the killer app for virtualization -- it allows your OS and software to survive while the underlying drivers and architectures change with the seasons. Now you can make your computers faster without changing anything about them at all: just buy a couple of new virtualization servers with the latest, fastest hardware and migrate your virtual machines to them. Quick, easy, and it terrifies OS vendors (especially Microsoft) because you no longer need to buy a new OS to run on new hardware; you can just keep using your old reliable OS forever.

-ELG