Wednesday, July 18, 2012

Best Laid Plans: Database Redundancy and the "Cloud"

I haven't worked with databases for very long. Two years ago when I started this gig was my first opportunity to interact with MySQL in a really intimate way, and it was straight into the fire at that. No easing into it with a GUI or MySQL Workbench or anything like that. Pretty scary realizing that getting your > and < mixed up can be all that stands between you and an empty database. Let's not even get into starting to understand the differences between a transactional and non-transactional storage engine, or how hard it can be to actually try and restore a MySQL server. When it came time to address our redundancy issues is when I really got into the internals of MySQL. Let me clear here: I hate database administration. I enjoyed learning about the internals, but I do not and never will enjoy being responsible for their health.

My quest for redundancy took to me MySQL-MMM which I abandoned after I wasn't able to get a reliable setup working in my test environment. In case you're not familiar with it, MySQL-MMM is essentially a set of Perl scripts that allow a monitoring instance to handle connections to two or more database instances by way of scripted checks and a virtual IP. In my test environment the db instances kept flapping. The monitor would essentially move the virtual IP between the two nodes, back and forth because it was constantly seeing some check as failing, until eventually the whole thing would go offline. There were no actual problems with the nodes themselves though. Manual checks didn't show any network problems, permissions problems, nothing. I used the forums, Googled, IRC, and was unable to get any assistance so I eventually shelved it and went with a standard master-slave setup. Not automated, but does the trick...hopefully (knock on wood).

In researching a move into the cloud I had another opportunity to try my hand at doing something a little more advanced and foolproof for database replication.
Both Rackspace and Amazon offer specialized database setups: Amazon has their multi-AZ RDS service, and Rackspace has their DB service, although it's still early access and can only be utilized via their API. I started off with Rackspace for pretty shallow reasons that I mentioned in a previous post, and setting up my servers was a pretty simple affair. It was the database portion that bit me; jumping through hoops to do something as simple as enable root access to MySQL itself, and then not having access to configuration options that couldn't be passed through the MySQL console itself (such as setting the 'lower_case_table_names' parameter). I found out later that this parameter could be worked around, although it would be a pain and involve considerable manual intervention for a large number of databases.

Since their offering didn't work I decided to try installing MySQL-MMM again. This was moving along nicely until I got to the part where I had to assign a virtual IP to float between my database nodes.  Rackspace doesn't offer the ability to do that using their cloud servers. At that point the only option was to go with our standard MySQL master-slave setup or move up into another tier of service all together, RackConnect. RackConnect would allow us to create a hybrid infrastructure of physical database servers and virtual application and web servers. Since our app doesn't benefit from the scalable nature of cloud services, having to essentially lease 2 physical servers to get this going seemed...wasteful and silly. So, I turned to Amazon to see if they could do better.

I soon found out that Amazon's database service was no different. Luckily I had learned my lesson from Rackspace and so I didn't go through the pain of trying to set up RDS before actually reading the documentation looking for specific information about how I could access an RDS instance. I also found through some research that RDS isn't as seamless as presented. While having another database server in another availability zone is a great idea for redundancy, there is still considerable lag-time between a server going down and the secondary coming up. For the money there's no real benefit then.

I checked out options for setting up MySQL-MMM in AWS as well. This time things seemed more promising. You can get a virtual private cloud (VPC) which you can set up with a configurable number of subnets. I figured, if I can have a subnet, then I can use whatever IPs I want. I set up all of my servers, statically assigned them private IPs within the subnets I'd chosen, set up one elastic IP for my sole front-end web instance, arranged routing between the different security groups, the works. I contended from the start that Amazon's cloud was more complicated than Rackspace, and this trial certainly upheld that statement, although I could see some benefits to the complexity as it relates to security. On the other hand, it was a real pain to only be able to SSH to the web server and have to use it as the jump-off point to the other nodes.

At any rate, I got everything set up and started MySQL-MMM. I found that I couldn't connect to the database IP, and after some investigation it was determined that even though I presumably had a 10.0.0.0/16 subnet to work with, I actually could not create a virtual IP. I had an IP address all picked out, and I could have assigned this to any instance I wanted, but I could not use it as a virtual IP. I'm gonna pause for a moment of 'SMH' here.

I could, apparently, create what they call an "Elastic Network Instance" and attach it to an instance, but it defeats the purpose of running MySQL-MMM because I would still have to manually detach the EIN and re-attach it to another instance in the event of node failure. Big...fail. I searched for ways around this because after all, I couldn't be the only person trying to deploy MySQL-MMM in AWS, right? I wasn't, but in typical open source form the documented solutions weren't easy to come by or all that clear. There was some talk about setting up your own nameserver instance and using hostnames instead of IPs to accomplish this, which required the installation of an ns_agent companion piece to MySQL-MMM. At this point though I could see myself dipping one toe into the rabbit hole, and I still have some ASA failover to attend to before I leave this position and start my new adventure.

In the end I've settled for a cloud infrastructure that looks a lot like our existing physical infrastructure, with MySQL running in master-slave replication, except this time both nodes are slaves to each other. So far it appears to be working and I've set up a couple of projects. My remaining concern is really about the database performance running in a pure cloud environment not optimized for I/O. That's why I've only migrated over internal projects and not actual client-facing ones.

From my limited experience with both MySQL administration and Rackspace and Amazon though, I have to conclude that I don't think the "cloud" has stepped up to the plate in addressing the concerns of database availability and replication. The marketing hype sounds nice, but when both solutions seem to either not meet the high availability requirements of today's websites without a lot of tinkering and roll-your-own additions, I have to say it's just not ready. Unless you're going straight-up MySQL cluster, there's not a lot of lure to moving to the cloud if redundancy and high availability are among your goals.

No comments:

Post a Comment