Tuesday, August 25, 2015

Where does the future of enterprise storage lie?

I've talked about how traditional block and NAS storage isn't going away for small businesses. So what about enterprise storage? In the past few years we've seen the death of multiple vendors of scale-out block storage, two of which -- Coraid and Intransa -- were of particular interest to me. Both allowed chaining together large numbers of Ethernet-connected nodes to scale out storage across a very large array (the biggest cluster we built at Intransa had 16 nodes and a total of 1.5 petabytes of storage, and the theoretical limits of the technology were significantly higher). The reality is that they had been on life support for years, because the 1990's and 2000's were the decades of NAS, not of block storage. Oh, EMC was still heaving lots of big iron block storage over the wall to power big databases, but most storage applications other than those big corporate data marts were NAS applications, whether on Windows and Linux NAS servers at the low end or NetApp NAS servers at the high end.

NAS was pretty much a necessity back in the era of desktops and individual servers. You could mount people's home directories on a CIFS or NFS share (depending on their OS), and people could share their files with each other by simply copying them to a shared directory. You sometimes saw block storage devices exported to those desktops via iSCSI, but usually block storage was attached to physical servers in the back room on dedicated storage networks that were much faster than the floor networks. The floor networks were fast enough to carry CIFS because CIFS at its core is just putting and getting objects, not blocks; it can operate much more asynchronously than a block device and thus isn't killed by latency the way iSCSI is.

But there are problems too. For one thing, every single device has to be part of a single login realm or domain of some sort, because that's how you secure connections to the NAS. Furthermore, people have to be put into groups, and access to portions of the overall NAS cloud has to be set based on which groups a person belongs to. That was difficult enough in the days when you just had to worry about Linux servers and Windows desktops. But now you have all these devices that have no notion of domains or group membership at all.

Which brings up the second issue with NAS -- it simply doesn't fit into a device-oriented world. Devices operate in a cloud world: they know how to push and pull objects via HTTP, but they don't speak CIFS or NFS, and never will. Increasingly we are operating in a world that isn't file-based, it's object-based. When you go into Google Docs to edit a spreadsheet, you aren't reading and writing a file; you're reading and writing an object. When you run an internal business application, you are no longer loading a physical program and reading and writing files; you're going to a URL for a web app that most likely talks to a back-end database of some kind to load and store objects.

Now, finally, add in what has happened in the server room. You'll still see the big physical iron for things like database servers. But by and large the remainder of the server room has gone away, replaced by a private cloud, or pushed into a public cloud like Amazon's cloud. Now when people want to put up a server to run some service they don't call IT and work up a budget and wait months for an actual server to be procured etc., they work at the speed of the cloud -- they spin up a virtual machine, they attach block storage to it for the base image and for any database they need beyond object storage, and they implement whatever app they need to implement.

What this means is that block storage and object storage integrated with cloud management systems like OpenStack are the future of enterprise storage -- a future that, alas, did not arrive soon enough for the vendors of scale-out block storage that survived the previous decade, who ended up without enough capital to enter this brave new world. NAS won't go away entirely, but it will increasingly be a departmental thing feeding desktops on the floor, not something that anything in the server room uses. And that is, in fact, what you see happening in the marketplace today: traditional Big Iron vendors like HDS are increasingly pushing object storage, and the new solid-state storage vendors such as Pure Storage and SolidFire are predominantly block storage vendors selling into cloud environments.

So what does the future hold? For one thing, lower latencies via hypervisor integration. Exporting a volume via iSCSI and then mounting it via the hypervisor has all of the usual latency issues of iSCSI. Even with 10 gigabit networking now hitting affordability and 25 to 100 gigabit Ethernet on the way, latency is a killer if you're waiting on a full round trip for every write. What if writes were instead cached on a local SSD array, in order, and applied to the back end in order? For 99% of the applications out there this provides all the write consistency that you need. The cache would have to be drained and turned off prior to migrating the virtual machine to a different box, of course -- thus the need for hypervisor integration -- but short of a catastrophic failure (where the virtual machine goes lights-out too and thus doesn't see inconsistent data when it is restarted on another node) you will, at worst, have some minor data loss -- much better than inconsistent data.
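To make the idea concrete, here is a minimal sketch of such an ordered write-back cache -- in Python purely for illustration, since a real implementation would live in the hypervisor's block layer, and everything here (names, log format) is hypothetical. Writes are acknowledged as soon as they are durable in an ordered log on the local SSD, and a background flusher applies them to the remote volume in the original order; log replay after a crash and draining the cache before a live migration are exactly the pieces that need hypervisor integration.

# Sketch of an ordered local-SSD write-back cache for a remote (iSCSI) volume.
# Illustrative only: ignores crash recovery / log replay and error handling.
import collections
import threading

class OrderedWriteCache:
    def __init__(self, remote_dev, local_log_path):
        self.remote = open(remote_dev, "r+b", buffering=0)
        self.log = open(local_log_path, "ab", buffering=0)   # log on the local SSD
        self.queue = collections.deque()                      # (offset, data), oldest first
        self.lock = threading.Lock()

    def write(self, offset, data):
        # Acknowledge the guest's write once it is durable on the local SSD log.
        self.log.write(offset.to_bytes(8, "little") + len(data).to_bytes(4, "little") + data)
        with self.lock:
            self.queue.append((offset, data))

    def flush_one(self):
        # Background thread: apply the oldest outstanding write to the remote
        # volume, preserving the original write order.
        with self.lock:
            if not self.queue:
                return False
            offset, data = self.queue.popleft()
        self.remote.seek(offset)
        self.remote.write(data)
        return True

    def drain(self):
        # Must be called (and the cache disabled) before live-migrating the VM.
        while self.flush_one():
            pass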

So: Block storage with hypervisor and cloud management integration, and object storage. The question then becomes: Is there a place for the traditional dedicated storage device (or cluster of devices) in this brave new world? Maybe I'll talk about that next, because it's an interesting question, with issues of data density, storage usage, power consumption, and then what about that new buzzword, "software defined storage"? Is storage really going to be a commodity in the future where everybody's machine room has a bunch of generic server boxes loaded with someone's software? And what impact, exactly, is solid state storage having? Interesting things to think about there...

-ELG

Saturday, August 1, 2015

The quest for an integrated storage stack

In prior posts I've mentioned the multitude of problems with the standard Linux storage stack. It's inflexible -- once you've set up a stack (usually LV->VG->PV->MD->BLOCK) and opened a filesystem on it, you cannot modify it to, e.g., add a replication layer to the stack. It lacks the ability to do geographic replication in any reasonable fashion. The RAID layer in particular lacks the ability to write to (and replay) a battery-backed RAM cache to deal with the RAID 5 write hole (which, despite its name, also applies to other RAID levels and results in silently corrupted data). Throw iSCSI into this equation to provide block devices to virtual machines and, potentially, to do replication to block devices on other physical machines, and things get even more complex.

One method that has been proposed to deal with these issues is to simply not use a storage stack at all. Thus we have ZFS and BTRFS, which attempt to move the RAID layer and logical volume layers into the filesystem. This certainly solves the problem of corrupted data, but at a significant penalty in terms of performance, especially on magnetic media where the filesystem swiftly becomes fragmented. As a result running virtual machines using "block devices" that are actually files on a BTRFS filesystem results in extremely poor "disk" performance on the virtual machines. A file on a log-based subsystem is simply a poor substitute for an extent on a block device. Furthermore, use of these filesystems for databases has proven to be woefully slow compared to using a normal filesystem like XFS on top of a RAID-10 layer.

The other method that has been proposed is to abandon the Linux storage stack except as a provider of individual block devices and instead layer a distributed system like Ceph on top of it. My tests with Ceph have not been particularly promising. Performance of Ceph block devices at the individual virtual machine level was abysmal. There appear to be three reasons for this: 1) overly pessimistic assumptions about writes on the part of Ceph, 2) the inherent latencies involved in a distributed storage stack, and 3) the fact that Ceph reads and writes via XFS filesystems layered on top of block devices, rather than to extents on raw block devices. For the latter, in my experience you will see *at least* a 10% degradation in virtual machine block device performance if the block device is implemented as a file on top of XFS rather than directly as an LVM extent.
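For what it's worth, that last effect is easy to measure on your own hardware with a quick streaming-write comparison. Here's a rough sketch; the paths are examples, O_DIRECT is used to take the page cache out of the picture, and it will happily overwrite whatever you point it at, so scratch files and devices only:

# Compare streaming write throughput of two backends, e.g. a file on XFS
# versus a raw LVM logical volume. Destructive -- scratch targets only.
import mmap
import os
import time

def stream_write(path, total_mb=1024, block_kb=1024):
    # O_DIRECT bypasses the page cache so we measure the backend, not RAM.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DIRECT)
    buf = mmap.mmap(-1, block_kb * 1024)   # page-aligned buffer, as O_DIRECT requires
    buf.write(b"x" * block_kb * 1024)
    start = time.time()
    for _ in range(total_mb * 1024 // block_kb):
        os.write(fd, buf)
    os.fsync(fd)
    os.close(fd)
    return total_mb / (time.time() - start)

# print("file on XFS:", stream_write("/mnt/xfs/scratch.img"), "MB/s")
# print("LVM extent :", stream_write("/dev/vg0/scratchlv"), "MB/s")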

In both cases, I wonder if we are throwing out the cart because the horse has asthma. I've worked as a software engineer for two of the pioneers of Linux-based storage -- Agami Systems, which built a NAS device with an integrated storage system, and Intransa Inc., which built scalable iSCSI storage systems with an integrated block storage subsystem. Both suffered the usual fate of pioneers -- i.e., face down dead with arrows in the back, though it took longer with Intransa than with Agami. Both wrote storage stacks for Linux which solved most of the problems of the current Linux storage stack, though each solved a different subset of those problems. And there are still a significant number of businesses which do not need the expense and complexity of a full OpenStack data center to solve their problems, but which do need things like logged geographic replication to an offsite location, something Intransa solved ten years ago (but which, alas, died with Intransa), or real-time snapshots of virtual machine block devices at the host level, or ...

In short: despite the creation of distributed systems like Ceph and integrated storage management filesystems like BTRFS, there is a significant need for an integrated storage stack for Linux -- one that allows flexibility in configuring both block devices and network filesystems, that allows for easy scalability and management, that has modern features such as logged geographic replication and battery-backed RAM cache support (or at least fast SSD log device support at the MD layer), and that allows dynamic insertion of components into the software stack, much as you could create a replication layer in Intransa StorStac and have it sync and then replicate to a remote device without ever unmounting a filesystem or making the iSCSI target inaccessible. There are simply a large number of businesses which don't need the expense and complexity of a full OpenStack data center -- which indeed don't need more than a pair of iSCSI / NAS storage appliances (a pair in order to handle replication and snapshotting) -- yet the current Linux storage stack lacks fundamental functionality that was implemented over a decade ago but never integrated into Linux itself. It may not be possible to bring all the concepts that Agami and Intransa created into Linux (though I'll point out that all of Intransa's patents are now owned by a patent entity that allows free use for Open Source software), but we should attempt to bring as many of them as possible into the standard Linux storage stack -- because the cloud is the cloud, but most smaller businesses have no need for the cloud. They just need reliable local storage for their local physical and virtual machines.

-ELG

Saturday, April 4, 2015

Network monitoring is not for amateurs

So, another blogger recently posted that network monitoring was easy. All you needed to do was deploy off-the-shelf network monitoring tools X, Y, and Z, and your network would be monitored and you would be able to sleep well at night knowing it was working well.

At which point I have to stop and say... what... the.... bleep?

I've been doing this a *long* time. In this and prior jobs (including a stint as a staff engineer at a firewall company) I've deployed Nagios, MRTG, Zenoss, OpenNMS, Big Brother, and a variety of proprietary network monitoring tools such as PRTG and combined firewall/IDS tools such as Checkpoint, as well as specialty tools such as smartd, arpwatch and snort. And what I have to say is that monitoring networks is *hard*. Pretty much any of the big network monitoring tools (other than Nagios, whose configuration is utterly non-automated) will take you 80% of the way to effective monitoring of your network with a few mouse-clicks. But to get the extra 20% needed to make sure that you're immediately notified if anything your client relies on stops providing service, you end up having to install plugins, write sensors, and combine multiple tools. It's a lengthy task that can require days of research to solve just 1% of that last 20%.

Here's an example. I was having issues where one of my iSCSI servers was dropping off the network for thirty seconds or so. I'd learn about it because my users would start howling that their Linux virtual machines were no longer functioning properly. I investigated, and found that when writes to a volume fail for long enough, Linux remounts the filesystem on that volume read-only. Which, if the volume is the root volume on a Linux virtual machine, means things go to bleep in a handbasket.

And Nagios reported no problems during all this time. All the normal Nagios sensors for system load, disk space free, memory free, and so forth showed that the virtual machine was operating completely normally. The virtual machine was pingable, it was responding to SNMP queries, it looked completely healthy to any network monitoring tool in the universe. So I ended up writing a Nagios sensor that 'touched' a file in /, and if it couldn't because the filesystem had been marked read-only, reported an error. I deployed it to my virtual machines, added it to my list of sensors to monitor, and from then on, when my iSCSI server dropped offline and the filesystems went read-only, I'd get a Nagios alert. (Said iSCSI server, BTW, is not an Intransa iSCSI server, it's a generic SuperMicro box running Linux with the LIO iSCSI stack, but that's irrelevant to the conversation.) After some time of getting alerted when things went AWOL I managed to zero in on exactly the set of log file entries on the iSCSI server that indicated what was going on at the time things went to bleep, and discovered that I had a failing disk in the iSCSI system that was not triggering smartd's monitoring. Unfortunately, figuring out which of those 24 disks was causing the SAS bus reset loop was non-trivial given that smartctl showed no problems when I looked at them, but once I tracked down and replaced the offending disk, the problem was solved.
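The sensor itself turned out to be nearly trivial -- the hard part was realizing it was needed. Something along these lines (a sketch, not the exact plugin I deployed; the test file path is arbitrary) is all it takes, since Nagios plugins communicate through their exit code, 0 for OK and 2 for CRITICAL:

#!/usr/bin/env python
# check_rootfs_writable: crude Nagios-style check that / is still writable.
# Exit codes follow the Nagios plugin convention: 0 = OK, 2 = CRITICAL.
import os
import sys

TESTFILE = "/.nagios_write_test"   # arbitrary file on the root filesystem

try:
    with open(TESTFILE, "w") as f:
        f.write("ok\n")
    os.remove(TESTFILE)
    print("OK: root filesystem is writable")
    sys.exit(0)
except (IOError, OSError) as e:
    # An EROFS error lands here once the volume has been remounted read-only
    # after the iSCSI back end went away.
    print("CRITICAL: cannot write to /: %s" % e)
    sys.exit(2)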

So let's recap. I had a serious problem with my storage network. smartd, the host-based disk monitoring tool on the iSCSI box, couldn't detect it; all the disks looked fine when examined with the tool it uses, smartctl. Nagios and all its prepackaged sensors couldn't detect it until I wrote a custom sensor. And this other blogger says that monitoring complex networks including clients, storage boxes, video cameras, and so forth is *easy*? Just a matter of installing packages X, Y, and Z and sitting back and relaxing?

Crack. That's my only explanation for anybody making such a statement. Either that, or utter inexperience at monitoring complex heterogeneous networks with multiple points of failure, of which only 80% or so are covered by off-the-shelf network monitoring tools, with the remainder requiring custom sensors to detect domain-specific points of failure. Either way, anybody who listens to that blogger and believes that monitoring a network effectively is easy is a fool. And that's all I will say on that subject.

-ELG

Friday, January 30, 2015

User friendly

Here's a tip:

If your Open Source project requires significant work to install, work that would be easily scriptable by a competent software engineer, I'm not going to use it. The reason I'm not going to use it is because I'm going to assume you're either an incompetent software engineer, or the project is in unfinished beta state not useful for a production environment. Because a competent software engineer would have written those scripts if the product was in a finished production-ready state. Meaning you're either incompetent, or your product is a toy of use for personal entertainment only.

Is this a fair assessment on my part? Probably not. But this is what 20 years of experience has taught me. Ignore that at your peril.

-ELG

Monday, December 15, 2014

The problem with standards

As part of a long rant about programming vs engineering, Peter Welch makes the statement, "standards are unicorns." Uhm, no. Just no. Unicorns are mythical. They don't exist. Standards do exist. They're largely useless, but they exist. I have linear feet of shelf space lined with the bloody things to attest to that.

So what's a better analogy for standards? I have a good one: Hurricane stew.

After a hurricane, the power is out. On every block in hurricane country there's a guy who has a giant vat used to boil crabs and crawfish and shrimp, that sits on a large burner powered by a large propane tank. You know, That Guy. So That Guy pulls out his vat and sets it up on the driveway in front of his carport and tosses in some water and the contents of his refrigerator before they can spoil and go bad. And his neighbors come by and toss in the contents of *their* refrigerators before they can spoil and go bad. And as people finish cleaning out their refrigerators they bring more and more random scraps and throw them in, until what you have is a big glop of soupy strangeness containing a little of every possible food item that could ever exist. And then people subsist on this hurricane stew, this random glop of indistinguishable odds and ends, for the next week or two as it continues to slowly bubble on the driveway of That Guy and continues to occasionally get new glop thrown in as people throw the contents of their freezers into it too.

That's standards -- just a big glop of indistinguishable mess created by everybody under the sun throwing their own scraps and odds and ends into it, in the end being of use to absolutely no one except the people who threw scraps into it, who end up subsisting off of it for the rest of their careers because nobody else has the slightest freakin' idea what's in that opaque bubbling mess with the oddly-colored mist wafting off of it. And every single implementation of this "standard" is different and does things in a different way in the thousands of edge cases that have been thrown into the standard as a possible way to do things, because that was what was in someone's refrigerator, err, code base at the time the standard was created and so they tossed it in before it could spoil -- driving anybody who has to write software that's "standards-compliant" slowly insane, to the point of stabbing a two-foot-tall printout of the standard with a knife repeatedly, over and over, while screaming "Die! Die! Die!" at the top of their lungs.

That's standards. That's what they're good for.

- ELG

Thursday, August 21, 2014

Performance tips for Grails / Hibernate batch processing

So I'm working on an application that does batch processing of records sent by client systems, and Groovy/Grails is the language/framework the application is written in. This is a story of how it failed -- and how it was fixed.

Failure #1: Record sets sent via HTTP take too long to process, causing HTTP timeouts before a response can be returned to the client. Solution: Plop the record sets into a batch queue instead, and process them via a batch queue runner running as a Quartz job.

Failure #2: Hibernate/Grails optimistic locking is, well, overly optimistic. As in, if I have multiple EC2 instances processing batch queues, I have to hope and pray that two different instances don't attempt to process the same set of records at the same time, else they'll both fail and roll back at some point, and my batch queue will never get emptied. Meanwhile, Hibernate fine-grained locking is too fine-grained, and ends up causing deadlocks.

Solution: Create a locking system (via your database or via memcached or whatever, doesn't matter as long as it serializes access) and divide your database records into logical non-overlapping sets. Then lock those logical sets at a higher level prior to processing a batch that touches that particular set. For example, if you're batch processing store records at Walmart central office, a logical set might be an individual store and all its individual inventory items.

Note that this requires *very* careful schema layout to ensure that things that can be changed by the end user interface do not get overwritten by the batch processor, unless you *want* them to get overwritten by the batch processor. But it's doable.
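The application in question is Groovy/Grails, but the locking pattern itself is framework-agnostic. Here's a rough sketch of the idea in Python against Postgres -- the table, the key format, and the psycopg2 usage are purely illustrative, not the production code:

# Assumes a table like:
#   CREATE TABLE set_locks (set_key text PRIMARY KEY, locked_at timestamptz);
# The primary key makes the INSERT fail if another queue runner already
# holds the lock for that logical set (e.g. one store).
import psycopg2

def try_lock(conn, set_key):
    try:
        with conn.cursor() as cur:
            cur.execute("INSERT INTO set_locks (set_key, locked_at) VALUES (%s, now())",
                        (set_key,))
        conn.commit()
        return True
    except psycopg2.IntegrityError:
        conn.rollback()        # someone else holds the lock; try again later
        return False

def unlock(conn, set_key):
    with conn.cursor() as cur:
        cur.execute("DELETE FROM set_locks WHERE set_key = %s", (set_key,))
    conn.commit()

# Usage: lock the whole store before processing any batch that touches it.
# conn = psycopg2.connect("dbname=batch")
# if try_lock(conn, "store:1234"):
#     try:
#         process_batches_for_store(conn, "store:1234")   # hypothetical
#     finally:
#         unlock(conn, "store:1234")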

Failure #3: The Hibernate session consumes all of memory, crashing the application.

Solution: We're doing batch processing, so each record set runs for a significant amount of time (30 seconds or more) with tens of thousands of operations. This means we can let each record set have its own session. For each record set processed by the application, create a new session. Flush that session then destroy it at the end of each record set. For example:

def batch = getNextBatch() // returns non-Hibernate objects, typically parsed from JSON or EDI
while (batch) {
    Store.withNewSession { session ->
        // ... process batch here ...
        session.flush()
    }
    batch = getNextBatch()
}

Failure #4: Multi-threaded performance slammed into a brick wall at the Hibernate query cache.

Solution: In general, the Hibernate caches are a performance hindrance when batch processing. The number of records that you process over the course of running all of your queues is far larger than the amount of memory you have, so any cached database records from the beginning of a queue run are long gone by the time the queue gets re-filled and you start over at the beginning again. Furthermore, the query cache is single-threaded, so if you're running on a modern multi-core processor and using multiple threads to consume its resources, you might as well be running on an 80386 -- performance is going to top out at less than two threads' worth. So disable the caches in the 'hibernate' block in your config/DataSource.groovy file and instead manually cache any items that you need to cache within or across batches:

hibernate {
    cache.use_second_level_cache = false
    cache.use_query_cache = false
      .... other options here ....
}

Failure #5: Lots of small queries kill performance.

For example, a store might send its nightly inventory records. The nightly inventory records update the quantities for each inventory item, which in turn create ordering alerts when inventory has fallen below a certain level. You know ahead of time that a) the number of inventory records is limited (figure 40,000 different items per store), and b) 75% of the items are going to be modified. So: doing things the inefficient way, you'd do:
inventory_batch.each { rec=Inventory.findByStoreAndItemNum(store,it.itemnum) ; rec.quantity=it.quantity; rec.save() }
But that results in 40,000 queries to the database, each of which has an enormous amount of Hibernate overhead associated with it.

Solution: Cache the entire set of items beforehand (using a HashMap and a cache class to wrap it), and fetch them from the cache instead. For example, assuming you've created an 'InvCache' class that caches inventory items:

def rec_set = Inventory.findAllByStore(store)
def inv_cache = new InvCache(rec_set)
inventory_batch.each {
    def rec = inv_cache.findByItemNum(it)  // looks it up in a hashmap; if not there, adds it to the database
    rec.quantity = it.quantity
    rec.save()  // in a real application, you'd check the result of save() and report validation errors
    // in a real application you'd also check quantity against limits and issue an inventory alert if too low
}

Note that rec.save() does not immediately update the record; it merely marks the record as dirty, and the next time Hibernate flushes it will issue a SQL query to do the update. You still end up issuing 30,000 update statements, but that's still better than issuing 40,000 selects plus 30,000 updates, and they're all issued in a single batch rather than via multiple Hibernate calls preparing statements and so forth.

Failure #6: Flushes in big Hibernate sessions kill performance.

Some stores have a big inventory. It can take several seconds to flush the Hibernate session due to Hibernate's extremely inefficient algorithm for determining what needs to be flushed (it tries to trace the entire relationship structure multiple layers deep, so the cost grows exponentially with session size, not linearly). By default the Hibernate session gets flushed before virtually every query that you make to the database, meaning that if you have to do 500 queries against the database in the course of processing to handle things not easily cached as above, you will have 500 flushes. 500 flushes times 5 seconds per flush is roughly 42 minutes worth of flushing. EEP!

Solution #1: Don't use Hibernate's built-in flushing and transaction ordering system. Do your own, because most of what you're doing is either batch appends of log records (which you're never going to query back out again during the batch, so you don't care when they actually get flushed) or updates of records where, again, you really don't care when the flush happens. So: switch the flush mode to 'manual', flush only when necessary to maintain relational ordering, and otherwise flush only at the end of logical batches. For example, if the store manager has added a new InventoryItem, and this new InventoryItem is referenced by a new InventoryAlert to note that this item needs to be ordered, the order of operations is: create the new InventoryItem, use item.save(flush:true) to flush the session, add it to the inventory cache if it's going to be used for other things, then create the new InventoryAlert. There is no need to use flush:true on the InventoryAlert because you don't care when it actually gets flushed; you care only that the InventoryItem gets saved before the InventoryAlert that references it. Hibernate is supposed to handle the dependency order here, if you properly set up your Grails objects... but sometimes it doesn't, as I've previously noted.

Note that setting the flush mode in the hibernate{} block in DataSource.groovy will not set the flush mode to 'manual' in the session we created earlier. It will get set to 'auto' or 'commit' by Grails depending on whether you're in an @Transactional service when you create the new session; Grails ignores the Hibernate setting. You'll need to explicitly set the flush mode when you create the new session:

import org.hibernate.FlushMode
...
Inventory.withNewSession { session ->
    session.setFlushMode(FlushMode.MANUAL)
    // ... do processing here ...
}

Solution #2: In many cases, we are creating new records in batches. For example, cash register logs. So: Create a bunch of new records, flush them to disk, then discard them from the session in order to keep the session size down. For example:

registers.each { register ->
    LogEntry.withSession { session ->
        def entry_list = []
        register.logs.each { logentry ->
            def entry = new LogEntry(logentry)  // creates it from the parsed map
            // ... do any other processing / initialization for entry here ...
            entry.save()  // would validate/check the return value in a real app
            entry_list.add(entry)
        }
        session.flush()  // flush the 5,000 register logs for this register to disk
        entry_list.each { entry ->
            entry.discard()  // get rid of the 5,000 register logs for this register
        }
    }
}

Conclusion

Hibernate has a deserved reputation as an inefficient ORM that is not well suited for high performance operations. This is primarily because its standard settings are appropriate for only a small subset of the possible problem space, and are utterly inappropriate for batch processing. Its session management is incapable of handling sessions with large numbers of objects in a timely manner, and its caches actually make many applications slower rather than faster. However, by applying the above to the application in question, I successfully reduced the processing time for the target largest batch sent to our system from being over 60 minutes to 3 minutes, which is roughly five times faster than it's required to be in order to meet our performance requirements. Yes, a factor of 20 times improvement. You can make Hibernate perform. The batch processor could have been made even faster by dropping down to doing raw SQL in Java, but it would have taken a factor of 20 times longer to write too.

In the end it's all about tradeoffs. Hibernate sucks, but in this case, given the deadlines and time pressures and the fact that this was the back end of a large code base already written with Groovy/Grails/Hibernate, it was the best of a batch of poor solutions. The ideal is sometimes the enemy of the good enough. If we hit a problem set so large that we cannot handle it with the technology we're using, then we'll drop down to lower-level / faster technologies such as raw Java EE and raw SQL (probably via something like MyBatis to intermediate for sanity's sake). In the end, however, in most applications there are other problems worth solving once performance is "good enough". So don't let Hibernate's poor performance scare you off if it's the solution to getting a product out the door in a timely manner. That is, after all, the goal -- and for most applications, Hibernate can be made fast enough.

-ELG

Friday, May 2, 2014

Behold the XKCD Passphrase Generator

Behold the XKCD Passphrase Generator. Copy and paste it into a file on your own Linux machine and run it (assuming you've installed the 'words' package, which is almost always the case). It'll pick five random words and concatenate them together. It should also run on other machines with Python installed, but you may need to find a words file somewhere and edit accordingly. If this were a real program I'd add parameters yada yada, but since it's just a toy...
#!/usr/bin/python
# XKCD passphrase generator.
# See XKCD 936 http://xkcd.com/936/
# You'll have to provide your own 'words' file. One word per line.
# Unix based systems usually have /usr/share/dict/words but you'll need
# to get that from somewhere else for Windows or etc.
# After editing wordsfile, numwords, separator:
# Execute as: python genpf.py  

import os

wordsfile="/usr/share/dict/words"
numwords = 5
separator = ".%*#!|"

def gen_index(n):
    # Build a 24-bit random number from three bytes of os.urandom, then
    # reduce it modulo n. (The modulo introduces a slight bias; fine for a toy.)
    i = (ord(os.urandom(1)) << 16) + (ord(os.urandom(1)) << 8) + ord(os.urandom(1))
    return i % n

f = open(wordsfile)

pf = ""
words = f.readlines()

i = gen_index(len(separator))
c = separator[i]
while numwords > 0:
    i = gen_index(len(words))
    s = words[i].strip()
    if pf == "":
        pf = s
    else:
        pf = pf + c + s
    numwords = numwords - 1

print(pf)

Thursday, February 20, 2014

Hibernate gotchas: ordering of operations

Grails / GORM was throwing a Hibernate error from time to time:

org.hibernate.StaleStateException: Batch update returned unexpected row count from update [0]; actual row count: 0; expected: 1

What was confusing was that there was no update anywhere in the code in question, which was a queue runner. The answer to what was causing this was interesting, and says a lot about Hibernate and its (in)capabilities.

The error in question was being thrown by the transaction flush in a queue runner. The queue lives in a Postgres database. Each site gets locked, then its queue is run, with each queued-up item deleted from the database after it is processed, then the site gets unlocked.

The first problem arose with the lock/unlock code. There was, clearly, a race condition when two EC2 instances tried to lock the same site at the same time. The way it was originally implemented with Hibernate, the first instance would create its lock record, flush the transaction, then re-query to see whether other lockers with a lower ID held a lock on the object; if so, it would release its lock by deleting its lock record. Meanwhile the other instance finished processing that queue and released all locks on that site. So the first instance would go to delete its lock, find that it had already been deleted, and throw that exception.

Once that was resolved, the queue runner itself started throwing the exception occasionally when the transaction was flushed after the unlock. Turning on Hibernate debug logging revealed that Hibernate was re-ordering operations so that the unlock got applied to the database *before* the deletes got applied. So the site would get unlocked, another queue runner would then re-lock the site to itself and start processing the same records that had already been processed, then go to delete those records, and find that they had been deleted out from under it. Bam.

The solution, in this case, was to rewrite the code to use the Groovy SQL API rather than use GORM/Hibernate.

What this does emphasize is that you cannot rely on operations being executed by Hibernate in the same order in which you specified them. For the most part this isn't a big deal, because everything you're operating on is going to be applied in a reasonable order so that the dependencies get satisfied. E.g. if you create a site then a host inside a site, the site record will get created in the database before the host record gets created in the database. But if ordering matters... time to use a different tool. Hibernate isn't it.

-ELG

Thursday, November 21, 2013

EBS, the killer app for Amazon EC2

So I gave Ceph a try. I set up a Ceph cluster in my lab with three storage servers connected via 10 gigabit Ethernet. The data store on each machine is capable of approximately 300 megabytes per second of streaming throughput. So I created a Ceph block device and ran streaming I/O with large writes to it. As in, 1 megabyte writes. Writing to the raw data store, I get roughly 300 megabytes per second of throughput. Writing through Ceph, I get roughly 30 megabytes per second.

"But wait!" the folks on the Ceph mailing list said. "We promise good *aggregate* throughput, not *individual* throughput." So I created a second Ceph block device and striped it with the first. And did my write again. And got... 30 megabytes per second. Apparently Ceph aggregated all I/O coming from my virtual machine and serialized it. And this is what its limit was.

This seems to be a hard limit with Ceph if you're not logging to an SSD. And not just any old SSD -- one optimized for small writes. There seem to be some fundamental architectural issues with Ceph that keep it from performing anywhere near hardware speed; the decision to log everything, regardless of whether your application needs that level of reliability, appears to be one of them. So Ceph simply doesn't solve a problem that's interesting to me. My users are not going to tolerate running at 1/10th the speed they're accustomed to, and my management is not going to appreciate me telling them that we need to buy some very pricey enterprise SSD hardware when the current hardware with the stock Linux storage stack on it runs like a scalded cat without said SSDs. There's no "there" there. It just won't work for me.

So that's Ceph. Looks great on paper, but the reality is that it's 1/10th the speed, without any real benefits for me over just creating iSCSI volumes on my appliances, pointing my virtual machines at them, and then using MD RAID on the virtual machine to do replication between the iSCSI servers. Yes, that's annoying to manage, but at least it works, works *fast*, and my users are happy.

At which point let's talk about Amazon EC2. Frankly, EC2 sucks. First, it isn't very elastic. I have autoscaling alerts set up to spin up new virtual machines when load on my cloud reaches a certain point. Okay, fine. That's elastic, right? But: while an image is booting up and configuring itself, it's using 100% CPU. Which means that your alarm goes off *again* and spins up more instances, ad infinitum, until you hit the upper limit you configured for the autoscaling group, *unless* you put hysteresis in there to wait until the instance is up before you spin up another one. So: it took five minutes to spin up a new instance. That's not acceptable. If the load has kept going up in the meantime, that means you might never catch up. I then created a new AMI that had Tomcat and the other necessary software already pre-loaded. That brought the spin-up time down to three minutes -- one minute for Amazon to actually create the instance, one minute for the instance to boot and run Puppet to pull in the application .war file payload from the puppetmaster, and one minute for Tomcat to actually start up. Acceptable... barely. This is elastic... if elastic is eons in computer time. The net result is to encourage you to spin up new instances before you need them, spin up *multiple* instances at a time when you do, and then wait a while before tearing them back down again once load goes down. Not the best way to handle things by any means, unless you're Amazon and making money by forcing people to keep excess instances hanging around.

Then let's not talk about the fact that CloudFormation is configured via JSON. A format deliberately designed without comments for data interchange between computers, *not* for configuring application stacks. Whoever specified JSON as the configuration language needs to be taken out behind Amazon HQ and beat with a clue stick until well and bloody. XML is painful enough as a configuration language. JSON is pure torture. Waterboarding is too good for that unknown programmer.

And then there's expense. Amazon EC2 is amazingly expensive compared to somebody like, say, Digital Ocean or Linode. My little 15 virtual machine cloud would cost roughly $300/month at Linode, less at Digital Ocean. You can figure four times that amount for EC2.

So why use EC2? Well, my Ceph experiment should clue you in there: it's all about EBS, the Elastic Block Store. See, I have data that I need to store up in the cloud. A *lot* of data. And if you create EBS-optimized virtual machines and striped MD RAID arrays across multiple EBS volumes, your I/O is wicked fast. With just two EBS volumes I can easily exceed 1,000 IOPS and 100 megabytes per second when doing pg_restore of database files. With more EBS volumes I could do even better.
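Concretely, "striped MD RAID across multiple EBS volumes" amounts to something like the following -- sketched here as a small Python wrapper around mdadm, with example device names, and assuming the EBS volumes are already attached to the instance. RAID0 gives you the aggregate IOPS and throughput of the member volumes, and EBS provides its own durability underneath the stripe.

# Stripe an MD RAID0 array across several attached EBS volumes, then put a
# filesystem on it. Device names and mount point are examples only.
import subprocess

ebs_devices = ["/dev/xvdf", "/dev/xvdg"]   # two attached EBS volumes

subprocess.check_call(
    ["mdadm", "--create", "/dev/md0",
     "--level=0",
     "--raid-devices=%d" % len(ebs_devices)] + ebs_devices)

subprocess.check_call(["mkfs.xfs", "/dev/md0"])
subprocess.check_call(["mount", "/dev/md0", "/data"])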

Digital Ocean has nothing like EBS. Linode has nothing like EBS. Rackspace and HP have something like EBS (in public beta for HP right now, so not sure how much to trust it), but they don't have a good instance size match with what I need and if you go the next size up, their pricing is even more ridiculous than Amazon's. My guess is that as OpenStack matures and budget providers adopt it, you're going to see prices come down for cloud computing and you're going to see more people providing EBS-like functionality. But right now OpenStack is chugging away furiously trying to match Amazon's feature set, and is unstable enough that only providers like HP and Linode who have hundreds of network engineers to throw at it could possibly do it right. Each iteration gets better so hopefully the next iteration will be better. (Note from 10 months later: nope. Still not ready for mere mortals). Finally, there's Microsoft's Azure. I've heard good things about it, oddly enough. But I'm still not trusting it too much for Linux hosting, given that Microsoft only recently started giving grudging support to Linux. Maybe in six months or so I'll return and look at it again. Or maybe not. We'll see.

So Amazon's cloud it is. Alas. We look at the Amazon bill every month and ask ourselves, "surely there is a better alternative?" But the answer to that question has remained the same for each of the past six months that I've asked it, and it remains the same for one reason: EBS, the Elastic Block Store.

-ELG

Saturday, August 24, 2013

The Linux storage stack: is it ready for prime time yet?

I've been playing with LIO quite a bit since rolling it into production for Viakoo's infrastructure (and at home for my personal experiments). It works quite a bit differently from the way Intransa's StorStac worked. StorStac created a target for each volume being exported, while with LIO you have a single target that exports a LUN for each volume. The underlying Linux kernel functionality is there to create a target per volume, but the configuration infrastructure is oriented around the LUN-per-volume paradigm.
Not a big deal, you might say. But it does make a difference when connecting with the Windows initiator. With the Windows initiator, the target-per-volume paradigm allows you to see which volume a particular LUN is connected to (assuming you give your targets descriptive names, which StorStac does). This in turn allows you to easily coordinate management of a specific target. For example, to resize it you can offline it in Windows, stop exporting it on the storage server, rescan in Windows, expand the volume on your storage server, re-export it on your storage server, then online it in Windows and expand your filesystem to fill the newly opened up space. Still, this is not a big deal. LIO does perform quite well and does have the underlying capabilities that can serve the enterprise. So what's missing to keep the Linux storage stack from being prime time? Well, here's what I see:
  1. The ability to set up replication without taking down your filesystems / iSCSI exports. Intransa StorStac had replication built in: you simply set up a volume of the same size on the remote target, told the source machine to replicate the volume to the remote target, and it started replicating. Right now replication is handled by DRBD in the Linux storage stack. DRBD works very well for its problem set -- local area high availability replication -- but setting up replication after the fact on an LVM volume simply isn't possible. You have to create a drbd volume on top of an LVM volume, then copy your data into the new drbd volume. One way around this would be to automatically create a drbd volume on top of each LVM volume in your storage manager, but that adds overhead (and clutters your device table) and presents problems for udev at device assembly time. And it still does not solve the problem of:
  2. Geographic replication: StorStac at one time had the ability to do logged replication across a WAN. That is, assuming that your average WAN bandwidth is high enough to handle the number of writes done during the course of a workday, a log volume collects the writes and ships them across the WAN in the correct order to be applied at the remote end (a rough sketch of the idea appears after this list). If you must do a geographic failover due to, say, California falling into the sea, you lose at most whatever log entries have not yet been applied at the remote end. Most filesystems will handle that in a recoverable manner as long as the writes are applied in the correct order (which they are). DRBD *sort of* has the ability to do geographic replication via an external program, "drbd-proxy", that functions in much the same way as StorStac replication (that is, it keeps a log of writes in a disk volume and replays them to the remote server), but it's not at all integrated into the solution and is excruciatingly difficult to set up (which is true of drbd in general).
  3. Note that LVM also has replication (of a sort) built in, via its mirror capability. You can create a replication storage pool on the remote server as an LVM volume, export it via LIO, import it via open-iscsi, create a physical volume on it, then create mirror volumes specifying this second physical volume as the place you want to put the mirror. LVM also does write logging, so it can handle the geographic situation. The problem comes with recovery, since what you have on the remote end is a logical volume that contains a physical volume that contains one or more logical volumes. The circumlocutions needed to actually mount and use those logical volumes inside that physical volume inside that logical volume are non-trivial; it may in fact be necessary to mount the logical volume as a loopback device and then run pvscan/lvscan on the loopback device to get at those volumes. It is decidedly *not* as easy as with StorStac, where a target is a target, whether it's the target for a replication or for a client computer.
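To make item 2 concrete, here is a rough sketch of what logged replication amounts to: every write to the primary also gets appended, in order, to a log volume; a shipper drains that log across the WAN; and the remote end applies entries strictly in sequence, so the target is always a crash-consistent (if slightly stale) copy of the source. The Python and the framing here are purely illustrative -- this is not StorStac's actual wire format.

# Sketch of log-shipping replication: entries are shipped and applied in
# strict write order. Illustrative only; no authentication, retries, or
# persistence of the receive position.
import socket
import struct

HEADER = struct.Struct("<QQI")   # (sequence number, byte offset, data length)

def ship_log(log_entries, remote_host, remote_port=7777):
    # log_entries yields (seq, offset, data) tuples in write order, e.g. read
    # back from the on-disk log volume as WAN bandwidth permits.
    sock = socket.create_connection((remote_host, remote_port))
    for seq, offset, data in log_entries:
        sock.sendall(HEADER.pack(seq, offset, len(data)) + data)
    sock.close()

def apply_log(sock, target_dev):
    # Remote side: apply entries in the order received, refusing gaps, so the
    # target volume always represents some point in time of the source.
    dev = open(target_dev, "r+b", buffering=0)
    expected = 0
    while True:
        hdr = sock.recv(HEADER.size, socket.MSG_WAITALL)
        if not hdr:
            break
        seq, offset, length = HEADER.unpack(hdr)
        assert seq == expected, "out-of-order log entry"
        data = sock.recv(length, socket.MSG_WAITALL)
        dev.seek(offset)
        dev.write(data)
        expected += 1
    dev.close()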
So clearly replication in the Linux storage stack is a mess, nowhere near the level of ease of use or functionality of the antiquated ten-year-old Intransa StorStac storage stack. The question is, how do we fix it? I'll think about that for a while, but meanwhile there's another issue: Linux doesn't know about SES. This is a Big Deal for big servers. SES is the SCSI Enclosure Services protocol that is implemented by most SAS fanout chips and allows control of, amongst other things, the blinky lights that can be used to identify a drive (okay, so mdmonitor told you that /dev/sdax died -- where the heck is that physically located?!). There are basically two variants extant nowadays, SAS and SAS2, that are very slightly different (alas, I had to modify StorStac to talk to the LSI SAS2X24 expander chip, which very slightly changed a mode page that we depended upon to find the slot addresses). Linux itself has no notion that things like SAS disk enclosures even exist, much less any idea how to blink lights in them.

And finally, there is the RAID5/RAID6 write hole issue. Right now the only reliable way to have RAID5/RAID6 on Linux is with a hardware RAID controller that has a battery-backed stripe cache. Unfortunately, once you do this you can no longer monitor drives via smartd to catch failures before they happen (yes, I do this, and yes, it works -- I caught several drives in my infrastructure that were doing bad things before they actually failed, and replaced them before I had to deal with a disaster recovery situation). You can no longer take advantage of your server's gigabytes of memory to keep a large stripe cache so that you don't have to keep thrashing the disks to load stripes on random writes (if the stripe is already in cache, you just update the cache and write the dirty blocks back to the drives, rather than reload the entire stripe). And you can no longer take advantage of the much faster RAID stripe computations allowed by modern server hardware (it's amazing how much faster you can do RAID stripe calculations with a 2.4GHz Xeon than with an old embedded MIPS processor running at much slower speeds). In addition it is often very difficult to manage these hardware RAID controllers from within Linux. For these reasons (and other historical issues not of interest at the moment) StorStac always used software RAID. Historically, StorStac used battery-backed RAM logs for its software RAID to cache outstanding writes and recover from outages, but such battery-backed RAM log devices don't exist for modern commodity hardware such as the 12-disk Supermicro server that's sitting next to my desk. It doesn't matter anyhow, because even if they did exist, there's no provision in the current Linux RAID stack to use them.
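For perspective on why the stripe cache and a fast CPU matter so much: RAID5 parity is just an XOR across the data blocks in a stripe, and a small write to a stripe that is already cached only needs the incremental update shown below, while a write to an uncached stripe forces the whole stripe to be read back from the disks first. A toy illustration:

# RAID5 parity arithmetic in miniature (illustration only).

def full_stripe_parity(data_blocks):
    # parity = d0 XOR d1 XOR ... XOR dn-1, computed across equal-sized blocks
    parity = bytearray(len(data_blocks[0]))
    for block in data_blocks:
        for i, b in enumerate(block):
            parity[i] ^= b
    return bytes(parity)

def update_parity(old_parity, old_data, new_data):
    # Read-modify-write shortcut for updating a single data block:
    # new_parity = old_parity XOR old_data XOR new_data
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))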

So what's the meaning of all this? Well, the replication issue is... troubling. I will discuss that more in the future. On the other hand, things like Ceph are handling it at the filesystem level now, so perhaps block level replication via iSCSI or other block-level protocols isn't as important as it used to be. For the rest, it appears that the only thing lacking is a management framework and a utility to handle SES expander chips. The RAID[56] write hole is troublesome, but in reality data loss from that is quite rare, so I won't call it a showstopper. It appears that we can get 90% of what the Intransa StorStac storage stack used to do by using current Linux kernel functionality and a management framework on top of that, and the parts that are missing are parts that few people care about.

What does that mean for the future? Well, your guess is as good as mine. But to answer the question about the Linux storage stack: yes, it IS ready for prime time -- with important caveats, and only if a decent management infrastructure is written to control it (because the current md/lvm tools are a complete and utter fail as anything other than tools to be used by higher-level management tools). The most important caveat being, of course, that no enterprise Linux distribution has been released yet with LIO (I am using Fedora 18 currently, which is most decidedly *not* what I want to use long-term for obvious reasons). Assuming that Red Hat 7 / CentOS 7 will be based on Fedora 18, though, it appears that the Linux storage stack is the closest to ready for prime time that it's ever been, and proprietary storage stacks are going to end up migrating to the current Linux functionality or else fall victim to being too expensive and fragile to compete.

-ELG