Behold the XKCD Passphrase Generator. Copy and paste it into a file on your own Linux machine, and run (assuming you've installed the 'words' package, which is almost always the case). It'll pick five random words and concatenate them together. Should also run on other machines with Python installed, but you may need to find a words file somewhere and edit accordingly. If this were a real program I'd add paramaters yada yada, but since it's just a toy...#!/usr/bin/python # XKCD passphrase generator. # See XKCD 936 http://xkcd.com/936/ # You'll have to provide your own 'words' file. One word per line. # Unix based systems usually have /usr/share/dict/words but you'll need # to get that from somewhere else for Windows or etc. # After editing wordsfile, numwords, separator: # Execute as: python genpf.py import os wordsfile="/usr/share/dict/words" numwords = 5 separator = ".%*#!|" def gen_index(len): i=(ord(os.urandom(1)) << 16) + (ord(os.urandom(1)) << 8) + (ord(os.urandom(1))) return i%len f=open(wordsfile) pf="" words = f.readlines() i=gen_index(len(separator)) c=separator[i] while (numwords > 0): i=gen_index(len(words)) s=words[i].strip() if (pf==""): pf=s else: pf=pf+c+s pass numwords = numwords - 1 print pf
Friday, May 2, 2014
Thursday, February 20, 2014
org.hibernate.StaleStateException: Batch update returned unexpected row count from update ; actual row count: 0; expected: 1
What was confusing was that there was no update anywhere in the code in question, which was a queue runner. The answer to what was causing this was interesting, and says a lot about Hibernate and its (in)capabilities.
The error in question was being thrown by transaction flush in a queue runner. The queue is in a Postgres database. Each site gets locked in the Postgres database, then its queue run with each queued-up item in the Postgres database deleted after it is processed, then the site gets unlocked.
The first problem arose with the lock/unlock code. There was a race condition, clearly, when two EC2 instances tried to lock the same site at the same time. The way it was originally implemented was with Hibernate, the first would create its lock record, then flush the transaction, then re-query to see whether there was other lockers with a lower ID holding a lock on the object. If so, it'd release the lock by deleting its lock record. Meanwhile the other instance finished processing that queue and released all locks on that site. So the first instance would go to delete its lock, find that it'd already been deleted, and throw that exception.
Once that was resolved, the queue runner itself started throwing the exception occasionally when the transaction was flushed after the unlock. What was discovered by turning on Hibernate debug was that Hibernate was re-ordering operations so that the unlock got applied to the database *before* the deletes got applied to the database. So the site would get unlocked, another queue runner would then re-lock the site to itself and start processing the same records that previously got processed, then go to delete the records, and find that the records had already been deleted out from under it. Bam.
The solution, in this case, was to rewrite the code to use the Groovy SQL API rather than use GORM/Hibernate.
What this does emphasize is that you cannot rely on operations being executed by Hibernate in the same order in which you specified them. For the most part this isn't a big deal, because everything you're operating on is going to be applied in a reasonable order so that the dependencies get satisfied. E.g. if you create a site then a host inside a site, the site record will get created in the database before the host record gets created in the database. But if ordering matters... time to use a different tool. Hibernate isn't it.
Thursday, November 21, 2013
"But wait!" the folks on the Ceph mailing list said. "We promise good *aggregate* throughput, not *individual* throughput." So I created a second Ceph block device and striped it with the first. And did my write again. And got... 30 megabytes per second. Apparently Ceph aggregated all I/O coming from my virtual machine and serialized it. And this is what its limit was.
This seems to be a hard limit with Ceph if you're not logging to a SSD. And not any old SSD. One optimized for small writes. There seems to be some fundamental architectural issues with Ceph that keep it from performing at anywhere near hardware speed. The decision to log everything, regardless of whether your application needs that level of reliability, appears to be one of those decisions. So Ceph simply doesn't solve a problem that's interesting to me. My users are not going to tolerate running at 1/10th the speed they're accustomed to running, and my management is not going to appreciate me telling them that we need to buy some very pricy and expensive enterprise SSD hardware when the current hardware with the stock Linux storage stack on it runs like a scalded cat without said SSD hardware. There's no "there" there. It just won't work for me.
So that's Ceph. Looks great on paper, but the reality for me is that it's 1/10th the speed without any real benefits for me over just creating iSCSI volumes on my appliances, pointing my virtual machines at them, and then using MD RAID on the virtual machine to do replication between the iSCSI servers. Yes that's annoying to manage, but at least it works, works *fast*, and my users are happy.
At which point let's talk about Amazon EC2. Frankly, EC2 sucks. First, it isn't very elastic. I have autoscaling alerts set up to spin up new virtual machines when load on my cloud reaches a certain point. Okay, fine. That's elastic, right? But: While an image is booting up and configuring, it's using 100% CPU. Which means that your alarm goes off *again* and spin up more instances, ad infinitum, until you hit the upper limit you configured for the autoscaling group, *unless* you put hysteresis in there to wait until the instance is up before you spin up another instance. So: It took five minutes to spin up a new instance. That's not acceptable. If the load has kept going up in the meantime, that means you might never catch up. I then created a new AMI that had Tomcat and other necessary software already pre-loaded. That brought the spin up time to three minutes -- one minute for Amazon to actually create the instance, one minute for the instance to boot and run Puppet to pull in the application .war file payload from the puppetmaster, and one minute for Tomcat to actually start up. Acceptable... barely. This is elastic... if elastic is eons in computer time. The net result is to encourage you to spin up new instances before you need them, spin up *multiple* instances at a time when spinning up instances, and then take a while before tearing them back down again once load goes down. Not the best way to handle things by any means, unless you're Amazon and making money by forcing people to keep excess instances hanging around.
Then let's not talk about the fact that CloudFormation is configured via JSON. A format deliberately designed without comments for data interchange between computers, *not* for configuring application stacks. Whoever specified JSON as the configuration language needs to be taken out behind Amazon HQ and beat with a clue stick until well and bloody. XML is painful enough as a configuration language. JSON is pure torture. Waterboarding is too good for that unknown programmer.
And then there's expense. Amazon EC2 is amazingly expensive compared to somebody like, say, Digital Ocean or Linode. My little 15 virtual machine cloud would cost roughly $300/month at Linode, less at Digital Ocean. You can figure four times that amount for EC2.
So why use EC2? Well, my Ceph experiment should clue you in there: it's all about EBS, the Elastic Block Store. See, I have data that I need to store up in the cloud. A *lot* of data. And if you create EBS-optimized virtual machines and striped MD RAID arrays across multiple EBS volumes, your I/O is wicked fast. With just two EBS volumes I can easily exceed 1,000 IOPS and 100 megabytes per second when doing pg_restore of database files. With more EBS volumes I could do even better.
Digital Ocean has nothing like EBS. Linode has nothing like EBS. Rackspace and HP have something like EBS (in public beta for HP right now, so not sure how much to trust it), but they don't have a good instance size match with what I need and if you go the next size up, their pricing is even more ridiculous than Amazon's. My guess is that as OpenStack matures and budget providers adopt it, you're going to see prices come down for cloud computing and you're going to see more people providing EBS-like functionality. But right now OpenStack is chugging away furiously trying to match Amazon's feature set, and is unstable enough that only providers like HP and Linode who have hundreds of network engineers to throw at it could possibly do it right. Each iteration gets better so hopefully the next iteration, Finally, there's Microsoft's Azure. I've heard good things about it, oddly enough. But I'm still not trusting it too much for Linux hosting, given that Microsoft only recently started giving grudging support to Linux. Maybe in six months or so I'll return and look at it again. Or maybe not. We'll see.
So Amazon's cloud it is. Alas. We look at the Amazon bill every month and ask ourselves, "surely there is a better alternative?" But the answer to that question has remained the same for each of the past six months that I've asked it, and it remains the same for one reason: EBS, the Elastic Block Store.
Saturday, August 24, 2013
Not a big deal, you might say. But it does make a difference when connecting with the Windows initiator. With the Windows initiator, the target per volume paradigm allows you to see what volume a particular LUN is connected to (assuming you give your targets descriptive names, which StorStac does). This in turn allows you to easily coordinate management of a specific target. For example to resize it you can offline it in Windows, stop exporting it on the storage service, rescan in Windows, expand the volume on your storage server, re-export it on your storage server, then online it in Windows and expand your filesystem to fill the newly opened up space. Still, this is not a big deal. LIO does perform quite well and does have the underlying capabilities that can serve the enterprise. So what's missing to keep the Linux storage stack from being prime time? Well, here's what I see:
- Ability to set up replication without taking down your filesystems / iSCSI exports. Intransa StorStac had replication built in, you simply set up a volume the same size on the remote target, told the source machine to replicate the volume to the remote target, and it started replicating. Right now replication is handled by DRBD in the Linux storage stack. DRBD works very well for its problem set -- local area high availability replication -- but to set up a replication after the fact on a LVM volume simply isn't possible. You have to create a drbd volume on top of a LVM volume, then copy your data into the new drbd volume. One way around this would be to automatically create a drbd volume on top of each LVM volume in your storage manager, but that adds overhead (and clutters your device table) and presents problems for udev at device assembly time. And still does not solve the problem of:
- Geographic replication: StorStac at one time had the ability to do logged replication across a WAN. That is, assuming that your average WAN bandwidth is high enough to handle the number of writes done during the course of a workday, a log volume will collect the writes and ship them across the WAN in the correct order to be applied at the remote end. If you must do a geographic failover due to, say, California falling into the sea, you lose at most whatever log entries have not yet been applied at the remote end. Most filesystems will handle that in a recoverable manner as long as the writes are being applied in the correct order (which they are). DRBD *sort of* has the ability to do geographic replication via an external program, "drbd-proxy", that functions in much the same way as StorStac replication (that is, it keeps a log of writes in a disk volume and replays them to the remote server), but it's not at all integrated into the solution and is excruciatingly difficult to set up (which is true of drbd in general).
- Note that LVM also has replication (of a sort) built in, via its mirror capability. You can create a replication storage pool on the remote server as a LVM volume, export it via LIO, import it via open-iscsi, create a physical volume on it, then create mirror volumes specifying this second physical volume as the place you want to put the mirror. LVM also does write logging so can handle the geographic situation. The problem comes with recovery, since what you have on the remote end is a logical volume that has a physical volume inside it that has one or more logical volumes inside it. The circumlocutions needed to actually mount and use those logical volumes inside that physical volume inside that logical volume are non-trivial, it may in fact be necessary to mount the logical volume as a loopback device then do pvscan/lvscan on the loopback device to get at those volumes. It is decidedly *not* as easy as with StorStac, where a target is a target, whether it's the target for a replication or for a client computer.
And finally, there is the RAID5/RAID6 write hole issue. Right now the only reliable way to have RAID5/RAID6 on Linux is with a hardware RAID controller that has a battery-backed stripe cache. Unfortunately once you do this, you can no longer monitor drives via smartd to catch failures before they happened (yes, I do this, and yes, it works -- I caught several drives in my infrastructure that were doing bad things before they actually failed and replaced them before I had to deal with a disaster recovery situation), you can no longer take advantage of your server's gigabytes of memory to keep a large stripe cache so that you don't have to keep thrashing the disks to load stripes in the case of random writes (if the stripe is already in cache, you just update the cache and write the dirty blocks back to the drives, rather than have to reload the entire stripe) and you can also no longer take advantage of the much faster RAID stripe computations allowed by modern server hardware (it's amazing how much faster you can do RAID stripe calculations with a 2.4Ghz Xeon than you can with an old embedded MIPS processor running at much slower speeds). In addition it is often very difficult to manage these hardware RAID controllers from within Linux. For these reasons (and other historical issues not of interest at the moment) StorStac always used software RAID. Historically, StorStac used battery-backed RAM logs for its software RAID to cache outstanding writes and recover from outages, but such battery-backed RAM log devices don't exist for modern commodity hardware such as the 12-disk Supermicro server that's sitting next to my desk. It doesn't matter anyhow, because even if it did exist, there's no provision in the current Linux RAID stack to use it.
So what's the meaning of all this? Well, the replication issue is... troubling. I will discuss that more in the future. On the other hand, things like Ceph are handling it at the filesystem level now, so perhaps block level replication via iSCSI or other block-level protocols isn't as important as it used to be. For the rest, it appears that the only thing lacking is a management framework and a utility to handle SES expander chips. The RAID write hole is troublesome, but in reality data loss from that is quite rare, so I won't call it a showstopper. It appears that we can get 90% of what the Intransa StorStac storage stack used to do by using current Linux kernel functionality and a management framework on top of that, and the parts that are missing are parts that few people care about.
What does that mean for the future? Well, your guess is as good as mine. But to answer the question about the Linux storage stack: Yes, it IS ready for prime time -- with important caveats, and only if a decent management infrastructure is written to control it (because the current md/lvm tools are a complete and utter fail as anything other than tools to be used by higher-level management tools). The most important caveat being, of course, that no enterprise Linux distribution has been released yet with LIO (I am using Fedora 18 currently, which is most decidedly *not* what I want to use long-term for obvious reasons). Assuming that Red Hat 7 / Centos 7 will be based on Fedora 18, though, it appears that the Linux storage stack is the closest to being ready for prime time as it's ever been, and proprietary storage stacks are going to end up migrating to the current Linux functionality or else fall victim to being too expensive and fragile to compete.
Sunday, August 11, 2013
This isn't a new thought on my part. When I was designing the Intransa StorStac 7.20 storage appliance platform I deliberately put virtualization drivers into it so that we could run Intransa StorStac as a virtual appliance on some future hardware platform not supported by the 2.6.32 kernel. And yes, that works (no joke, I tried it out of course, the only thing that didn't work was sensors but if Viakoo ever wants to deliver a virtualized IntransaBrand appliance I know how to fix the sensors). My thought was future-proofing -- I could tell from the layoffs and from the unsold equipment piled up everywhere that Intransa was not long for the world, so I decided to leave whoever bought the carcass a platform that had some legs on it. So it has drivers for the network chips in the X9 series SuperMicro motherboards (Sandy/Ivy Bridge) as well as the virtualization drivers. So there's now a pretty reasonable migration path to keep StorStac running into the next decade... first migrate it to Sandy/Ivy Bridge physical hardware, then once that's EOL'ed migrate it to running on top of a virtual platform on top of Haswell or its successors.
But what brought it to mind today was ZFS. I need some of the features of the LIO iSCSI stack and some of the newer features of libvirtd for some things I am doing, so have ended up needing to run a recent Fedora on my big home server (which is now up to 48 gigabytes of memory and 14 terabytes of storage). The problem is that two of those storage drives are offsite backups from work (ZFS replication, duh) and I need to use ZFS to apply the ZFS diffsets that I haul home from work. That was not a problem for Linux kernels up to 3.9, but now Fedora 18/19 have rolled out 3.10, and ZFSonLinux won't compile against the 3.10 kernel. I found that out the hard way when the new kernel came in, and DKMS spit up all over the floor because of ZFS.
The solution? Virtualization to the rescue! I rolled up a Centos 6.4 virtual machine, pushed all the ZFS drives into it, gave it a fair chunk of memory, and voila. One legacy platform that can sit there happily for the next few years doing its thing, while the Fedora underneath it changes with the seasons.
Of course that is nothing new. A lot of the infrastructure that I migrated from Intransa's equipment onto Viakoo's equipment was virtualized servers dating in some cases all the way back to physical servers that Intransa bought in 2003 when they got their huge infusion of VC money. Still, it's just a practical reminder of the killer app for virtualization -- the fact that it allows your OS and software to survive despite underlying drivers and architectures changing with the seasons. Now making your computer work faster can be done without changing anything at all about it -- just buy a couple of new virtualization servers with the very latest fastest hardware and then migrate your virtual machines to them. Quick, easy, and terrifies OS vendors (especially Microsoft) like crazy because now you no longer need to buy a new OS to run on the new hardware, you can just keep using your old reliable OS forever.
Thursday, July 25, 2013
Great idea. Sort of. Except it seems to have two critical issues: 1) it doesn't appear to have any way to handle our staging->production cycle, where the transactions coming into production are replicated to staging during our testing cycle, then eventually staging is rolled to production via mechanisms I won't go into right now, and 2) it doesn't appear to actually work -- it claims that the default security groups that are needed for its version of Chef to work don't exist, and they never appeared later on either. Which isn't because I lack permission to create security groups, because I created one for an earlier prototype of my deployment. This appears to be a sporadic bug that multiple people have reported, where the default security groups aren't created for one reason or another.
Eh. Half baked and not ready for production. Oh well. Amazon *did* say it was Beta, after all. They're right.
Wednesday, July 24, 2013
Thursday, May 30, 2013
The first thing I did was create a test framework with Java 6, Java 7, and a Python re-implementation and validate that yes, the problem was something that Java 7 was doing. Java 6 worked. Python via the cPython engine worked. Java 7 didn't work, it logged in (the first transaction) then the next transaction (actually sending a command) failed. Python via the jYthon engine on top of Java 7 didn't work. I tried on both Oracle's official Java 7 JDK and Red Hat's OpenJDK that came with Fedora 18, neither worked. So it clearly had something to do with differences in the Java 7 SSL stack.
I then went to Oracle's site to look at the change notices on difference between Java 6 and Java 7. There was nothing that looked interesting. Setting sun.security.ssl.allowUnsafeRenegotiation=true and sse.enableSNIExtension=false were two things that looked like they might account for some difference. Maybe the old storage array was using an old protocol. So I ran the program under Java 6 with -Djavax.net.debug=all, found out what protocol it was using, then hardwired my SSL ciphers list to use that protocol as default. I tested again under Java 7, and it still didn't work.
Then I ran the program on the two different environments with -Djavax.net.debug=all on my main method on both Java 6 and Java 7. The Java 6 was the OpenJDK from the Centos 6.4 distribution. The Java 7 was the OPenJDK on Fedora 18. Two screens, one on each computer, later, and the output on both of them was identical -- up until the transmission of the second SSL packet. The right screen (the Fedora 18 screen) split it into *two* SSL packets.
But why? I downloaded the source code to OpenJDK 6 and OpenJDK 7 and extracted them, and set out to figure out what was going on here. The end result: A tiny bit of code in SSLSocketImpl.java called needToSplitPayload() was called from the SSL engine, and if that code says yes, it splits the packet. So... is the cipher a CBC mode? Yes. is it the first app output record? No. Is Record.enableCBCProtection set? Wait, where is that set?! I head over to Record.java, and find that it's set from the property "jsse.enableCBCProtection" at JVM startup.
The end result: I call System.setProperty("jsse.enableCBCProtection", "false"); in my initializer (or set it on the command line) and everything works.
So what's the point of CBC protection? Well, the point is to deal with traffic analysis attacks against web servers. With snooping on the traffic you can figure out what transactions are happening and figure out important details of the session. The problem is that in my application, the receiving storage array apparently wants packets to stay, well, packets, not multiple packets. It's not a problem for the code I wrote, I look for begin and end markers and don't care how many packets it takes to collect a full set of data to shove into my XML parser, but I have no control over the legacy storage array end of things. It apparently does *not* look for begin and end markers, it just grabs a packet and expects it to be the whole tamale.
So all in all, CBC protection is a Good Thing -- except in my application, where a legacy storage array cannot handle it. Or in a wide variety of other applications where it slows down the application to the point of uselessness by breaking up nice big packets into teeny little packets that clog the network. But at least Oracle gave us a way to disable it for these applications. The only thing I'll whine about is this: Oracle apparently implemented this functionality *without* adding it to their list of changes. So I had to actually reverse-engineer the Java runtime to figure out that one obscure system property is apparently set to "false" by default in OpenJDK 6 and set to "true" by default in OpenJDK 7. Which is not The Way It Spozed To Be, but at least thanks to the power of Open Source it was not a fatal problem -- just an issue of reading through source code of the runtime until locating the part that broke up SSL packets, then tracing backwards through the decision tree up to the root. Chalk up one more reason to use Open Source -- if something mysterious is happening, you can pretty much always figure it out. It might take a while, but that's why we make the big bucks, right?
Wednesday, May 1, 2013
So why did this approach work so well for BRU Server, allowing us to concentrate on the business logic rather than the database logic and allowing its current owners to maintain the software for ten years now, while it fails so harshly for atrocities like Hibernate? One word: complexity. Hibernate attempts to handle all possible cases, and thus ends up producing terrible SQL queries while making things that should be easy difficult, but that's because it's a general purpose mapper. The BRU Server team -- all four of us -- understood that if we were going to create a complete Unix network backup solution within the six months allotted to us, complexity was the enemy. We understood the compromises needed between the object model and the relational model, and the fact that Python was capable of expressing sets of objects as easily as it was capable of expressing individual objects meant that the "object=record" paradigm was fairly easy to handle. We wrote as much ORM as we needed -- and no more. In some cases we had to go to raw relational database programming, but because our ORM was so simple we had no problems with that. There were no exceptions being thrown because of an ORM caching data that no longer existed in the database, and the actual objects for things like users and servers could do their thing without worrying about how to actually read and write the database.
In the meantime, I have not run into any Spring/Hibernate project that actually managed to produce usable code performing acceptably well in any reasonable time frame with a team of a reasonable size. I was at one company that decided to use the PHP-based code that three of us had written in four weeks' time as the prototype for the "real" software, which of course was going to be Java and Spring and Hibernate and Restful and all the right buzzwords, because real software isn't written in PHP of course (though our code solved the problem and didn't require advanced degrees to understand). Six months later and a cast of almost a dozen and no closer to release than at the beginning of the project, the entire project was canned and the project team fired (not me, I was on another project, but I had a friend on that project and she was not a happy camper). I don't know how much money was wasted on that project, but undoubtedly it hurried the demise of that company.
But maybe I'm just not well informed. It wouldn't be the first time, after all. So can anybody point me to a Spring/Hibernate project that is, say, around 80,000 lines of code written in under five months' time by a team of four people, that not only does database access but also does some hard-core hardware-level work slinging massive amounts of data around in a three-box client-server-agent architecture with multiple user interfaces (CLI and GUI/Web minimum)? That can handle writing hundreds of thousands of records per hour then doing complex queries on those records with MySQL without falling over? We did this with BRU Server, thanks to Python and choosing just enough ORM for what we needed (not to mention re-using around 120,000 lines of "C" code for the actual backup engine components), and no more (and no less). The ORM took me a whole five (5) days to write. Five. Days. That's it. Granted, half of that is because of Python and the introspection it allows as part of the very definition of the language. But. Five days. That's how much you save by using Spring/Hibernate over using a language such as Ruby or Python that has proper introspection and doing your own object-relational mapping. Five days. And I submit that the costs of Spring/Hibernate are far, far worse, especially for the 20% of projects that don't map well onto the Spring/Hibernate model, such as virtually everything that I do (since I'm all about system level operations).
Saturday, April 27, 2013
The reality is that unlike ESXi, which by default locks the VMDK file so that only a single virtual machine can use it at a time, thus meaning that the same VM set to start up on multiple servers will only start on one (that wins the race), libvirtd by default does *not* include any sort of locking. You have to configure a lock manager to do so. In my case, I configured 'sanlock', which has integration with libvirtd. So on each KVM host configured to access shared VM datastore /shared/datastore :
- yum install sanlock
- yum install libvirt-lock-sanlock
- chkconfig wdmd on
- chkconfig sanlock on
- service wdmd start
- service sanlock start
- cd /shared/datastore
- mkdir sanlock
- chown sanlock:sanlock sanlock
- chmod 775 sanlock
- auto_disk_leases = 1
- disk_lease_dir = /shared/datastore/sanlock
- host_id = 1
- user = "sanlock"
- group = "sanlock"
- lock_manager = "sanlock"
- service libvirtd restart