Wednesday, July 25, 2012

HPET Warnings in Messages

You know when you happen to be looking through logs for potential answers to a problem, and you run into messages that indicate another, completely different problem that likely needs your attention?

Yeah. That.

You may remember that part of some recent data center work I did involved replacing a server that had died. I booted it up using a Mondo Rescue image of the original server, and then the Dev dude copied specific data from another server using some script he whipped up at the time. One and done. 

Yesterday I SSH'd in to complete the setup by installing Nagios and OpenLDAP. No, we don't have Puppet or Chef in the mix yet. It was one of many things on my to-do list. Anyway, I got NRPE installed but received an error when doing a basic test on the machine to make sure it worked: 

user@server3:/usr/local/nagios/bin$ !99
sudo /usr/local/nagios/libexec/check_nrpe -H localhost
CHECK_NRPE: Error - Could not complete SSL handshake.

Since SSL isn't even enabled on this box, I wasn't quite sure what was causing it, so I checked /var/log/messages. I found this instead:

Jul 22 12:05:30 JXT3 kernel: [130848.552652] CE: hpet increasing min_delta_ns to 15000 nsec
Jul 22 12:05:30 JXT3 kernel: [130848.552726] CE: hpet increasing min_delta_ns to 22500 nsec

Now, this was a few days ago and I don't see any other mentions of it, but it made me nervous, so I checked it out. It seems to be a known bug affecting some versions of the Linux kernel, though I can't get an exact bead on which ones. We're running 2.6.32-28, which is usually a couple of minor revisions ahead of the versions reported by users online. Also, all of our servers are running the same kernel, and I don't see this error in their logs.

HPET stands for High Precision Event Timer. According to Wikipedia, this timer counts upward at a frequency of at least 10 MHz and has at least three comparators. That explanation didn't do much to tell me what its purpose actually was. I skimmed a paper on Intel's website that made a little more sense (in the beginning, at least) and gave me the general gist of things, which is that it's essentially an internal timer the OS can use to schedule events.
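Out of curiosity, you can check whether the kernel is actually relying on HPET as its clocksource via the /sys interface (Linux-specific; suppressing errors in case the interface isn't there):

```shell
# List the clocksources the kernel knows about, and show which one is active.
cat /sys/devices/system/clocksource/clocksource0/available_clocksource 2>/dev/null
cat /sys/devices/system/clocksource/clocksource0/current_clocksource 2>/dev/null \
  || echo "clocksource interface not available"
```

If hpet shows up as the current clocksource, those log messages are coming from the timer the kernel is actively using, not some idle piece of hardware.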

Why is it giving errors? The thread I found contains some info. It ends with the suggestion to disable HPET, which can apparently be done by adding "hpet=disable" to the kernel line in GRUB. I may try this if these errors keep popping up, as others have described stability issues (freezing, crashing) once min_delta_ns reaches the 100ms mark.
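If I do go the disable route, the change is a kernel boot parameter rather than a config line of its own. A sketch of what that would look like on a GRUB 2 setup (on GRUB legacy you'd append it to the kernel line in /boot/grub/menu.lst instead):

```shell
# Append hpet=disable to the kernel command line (back up the file first).
sudo cp /etc/default/grub /etc/default/grub.bak
sudo sed -i 's/^GRUB_CMDLINE_LINUX="\(.*\)"/GRUB_CMDLINE_LINUX="\1 hpet=disable"/' /etc/default/grub
sudo update-grub   # regenerate /boot/grub/grub.cfg; takes effect on the next reboot
```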

Tuesday, July 24, 2012

More Than I Can Chew

I'm leaving my current job at the end of this week. My job switch has unfortunate timing: a couple of projects were started or requested during the last two months, which is also when I found an opportunity I was interested in pursuing, so the timeline on these projects got compressed in a hurry. The first was getting a test cloud implementation set up to use as a template or guideline going forward. The second was setting up high availability in our data center, i.e. adding a second switch and ASA 5505 to the mix. The cloud project was completed, though it took a little longer than expected. The data center work was scheduled to be completed this past Friday, but I didn't quite get there.

The problem was that I had planned to do more than was actually feasible in the time frame I had. We host clients in the data center, and some of them have events that run well into the evening. That has made it difficult in the past to schedule downtime for any kind of maintenance, and I generally can't start until almost midnight. I'm not the nightbird I used to be now that I have a kid, so late-night tech work has gotten more difficult. When I worked for WFM and we did our PCI conversion for the region, the weeks were filled with nights where I got started at 10pm and didn't leave until the sun came up...after working a full day beforehand. Not so much any more.

I went to the data center at 10pm, along with the Developer who acts as IT backup. The list of things to do:

  • replace a failed server
  • rack a second PDU
  • install second switch and ASA
  • configure switch
  • upgrade existing ASA (both RAM and the software)
  • copy config from new ASA to old one
  • set up HA
  • test

Even writing it now, it looks like a lot more than could or should be attempted in one night, but I was running short on time and really wanted to finish what I'd started. I should also point out that the individual tasks were more involved than the list suggests. For example, the failed server to replace? I had to build it onsite, because the replacement also involved moving RAM from the old server to the new one. The new server (i.e. a refurb we had lying around) didn't have any RAM at all, so it couldn't be prepped ahead of time. Luckily, since I used Mondo Rescue, it was a quicker task than it might have been, but still. Also, I was not only attempting to add redundancy to our infrastructure but also to make it more secure by introducing VLANs and moving away from the flat network topology we have in place. I wanted to put the web servers in a DMZ and keep the database servers inside, standard stuff. Lastly, the newer software for the ASA requires more RAM for the Security Plus license and has a whole new syntax for things like ACLs and NAT, so I had to put in time to re-learn those things to configure the new firewall.

Of course, nothing worked as intended. The Mondo Rescue server build had to be attempted three times before it worked. There were no rack screws included with the PDU we ordered. And although Layer 2 switching is pretty simple in general and I didn't foresee an issue with installing a second switch and setting up VLANs and the necessary trunking, I couldn't get traffic to flow between switches. I added the second switch, configured it, connected it to the new ASA, and couldn't get the two to talk. My PC would talk to the switch, and the ASA would talk to the outside, but I could not get from my PC to the outside via the switch and the ASA. By this time it was approaching 2 in the morning, and my brain was fried and not effectively troubleshooting at all. Big suck.
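For what it's worth, when two switches won't pass traffic, the usual suspect is the trunk link between them. A minimal IOS-style sketch of what I should have verified, with hypothetical interface names and VLAN IDs (illustrative, not our actual config):

```
! On each switch: the inter-switch link must trunk the VLANs in question.
interface GigabitEthernet0/24
 switchport trunk encapsulation dot1q
 switchport mode trunk
 switchport trunk allowed vlan 10,20
!
! Then verify both sides agree on mode, allowed VLANs, and native VLAN:
! show interfaces trunk
```

A native VLAN mismatch, or a VLAN missing from the allowed list on one side, is enough to silently blackhole traffic while everything else looks fine.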

I would have stayed there all night trying to get things working, but the developer convinced me that I needed to simplify my tasks and move on. In other words: screw the VLANs, set everything up as it was, and regroup. In the end I wound up leaving the second ASA and switch installed, but not actually connected to the network. In short, the evening was a disaster and I only got three of the tasks completed. My error was definitely in attempting to do too much. In retrospect, it would have been smarter to leave the existing network topology alone and simply add the new hardware, leaving the security portion to my successor.

Lesson learned. 

Thursday, July 19, 2012

Just Call Me DNS of the Morning, Angel

I use a Windows desktop on a daily basis and attach to a Windows 2003 domain, but these days I've had little need to mess with Windows on an administrative level. A few antivirus updates here, adding an AD account there – simple stuff. Well, last night I moved some of our sub-domains over into the Amazon cloud and edited the DNS entries (hosted by our ISP) to reflect their new home. This was after hours and done at home, and I verified connectivity and functionality of the sites.

This morning, upon arriving at work, I was told that none of the sub-domains in question were accessible. I got that pit-of-the-stomach feeling that comes when it looks like your day is going to be hosed. I quickly logged on to my machine and found that I was indeed getting a 500 error, indicating an issue at the server level. On a hunch, though, I checked DNS resolution (by pinging the sub-domain) and saw that it was resolving to the old IP. No problem; ipconfig /flushdns should do the trick.

Except it didn't.

Now, I had been messing around with my hosts file, so I verified that it had been returned to normal, and it had. I happened to have brought my Mac in today, so I tested the domains on it and, sure enough, all was mostly well. There was a slight problem that I had to log in to the server to fix, but that was something else altogether. The resolution problem was just that: a resolution problem. It worked on my Mac, and it worked on a couple of other non-domain machines. So what's the deal?

I tried ipconfig /displaydns to see what was cached, and saw the entry for the domain in question. I cleared the cache again, ran the displaydns command, and it was empty. Should be good, except when I pinged I was still getting the old IP, and displaying the DNS cache showed that same entry again. To be on the safe side I deleted temp files, emptied the Recycle Bin, and restarted the DNS client (net stop dnscache / net start dnscache). Same results. Okay, so the issue is at the DNS server level. I logged in to the Windows AD machine and launched dnsmgmt.msc. I cleared the cache from the console, and also cleared it from the command line, but Windows still showed the entry afterwards. When I went back a few minutes later (after starting this post), the entry was gone. Hmmm...maybe it just took a bit? At any rate, the problem was solved.
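For next time, here's the whole client-plus-server sequence in one place (dnscmd comes with the Windows DNS server role / Support Tools; this is the sequence as I understand it, not a transcript from the day):

```
:: On the client:
ipconfig /flushdns
net stop dnscache
net start dnscache

:: On the Windows 2003 DNS server (or clear the cache from dnsmgmt.msc):
dnscmd /clearcache
```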

And now I can go and eat my cold Pop Tart.

Wednesday, July 18, 2012

Best Laid Plans: Database Redundancy and the "Cloud"

I haven't worked with databases for very long. Two years ago, when I started this gig, was my first opportunity to interact with MySQL in a really intimate way, and it was straight into the fire at that. No easing into it with a GUI or MySQL Workbench or anything like that. It's pretty scary realizing that getting your > and < mixed up can be all that stands between you and an empty database. Let's not even get into starting to understand the differences between a transactional and non-transactional storage engine, or how hard it can be to actually restore a MySQL server. It was when it came time to address our redundancy issues that I really got into the internals of MySQL. Let me be clear here: I hate database administration. I enjoyed learning about the internals, but I do not, and never will, enjoy being responsible for their health.

My quest for redundancy took me to MySQL-MMM, which I abandoned after I wasn't able to get a reliable setup working in my test environment. In case you're not familiar with it, MySQL-MMM is essentially a set of Perl scripts that allow a monitoring instance to handle connections to two or more database instances by way of scripted checks and a virtual IP. In my test environment, the db instances kept flapping. The monitor would move the virtual IP between the two nodes, back and forth, because it constantly saw some check as failing, until eventually the whole thing went offline. There were no actual problems with the nodes themselves, though. Manual checks showed no network problems, no permissions problems, nothing. I tried the forums, Google, and IRC, but was unable to get any assistance, so I eventually shelved it and went with a standard master-slave setup. Not automated, but it does the trick...hopefully (knock on wood).
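The "standard master-slave setup" boils down to a few config lines and statements. A sketch from memory, with hypothetical IPs, credentials, and binlog coordinates (MySQL 5.x era syntax):

```shell
# In my.cnf on the master: server-id=1 and log-bin=mysql-bin. On the slave: server-id=2.

# 1. On the master, create a replication user and note the current binlog position:
mysql -u root -p -e "GRANT REPLICATION SLAVE ON *.* TO 'repl'@'10.0.0.2' IDENTIFIED BY 'secret'; SHOW MASTER STATUS;"

# 2. On the slave, point it at the master using the File/Position values from step 1:
mysql -u root -p -e "CHANGE MASTER TO MASTER_HOST='10.0.0.1', MASTER_USER='repl', MASTER_PASSWORD='secret', MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=107; START SLAVE;"

# 3. Verify; both Slave_IO_Running and Slave_SQL_Running should say Yes:
mysql -u root -p -e "SHOW SLAVE STATUS\G" | grep -E "Slave_(IO|SQL)_Running"
```

Failover is still a manual promotion of the slave, which is exactly the part MMM was supposed to automate.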

In researching a move into the cloud I had another opportunity to try my hand at doing something a little more advanced and foolproof for database replication.

Tuesday, July 17, 2012

Amazon and Rackspace: A Comparison, Part 2

In my last post I discussed the steps I took to get our application working on AWS. This post will address the steps I used on Rackspace to get a test environment up and going. This isn't an apples-to-apples comparison. I'm sort of biased towards Rackspace for reasons I mentioned previously, so I'm going the full monty in deploying a test environment with them, doing my best to replicate how I envision the final product actually working.

Setting up servers was pretty flawless and easy. Honestly, you simply pick the kind of server you want and deploy it. It did take a little longer for each instance to be up and ready than on AWS, but the creation process was pretty much point-and-click. I had a web server, a Tomcat server, and 3 MySQL servers set up in no time (the 3 MySQL servers were because I was going to try out MySQL-MMM again).

An immediate difference between Amazon and Rackspace was server access. With Amazon you get a public IP that requires private-key SSH authentication to connect to. All other services are locked down until you edit or create a security group, so by default it's pretty secure. With Rackspace, each server you set up also has a public/private IP pair, and you can SSH to the public IP with a username/password combo. The account you use is root, which by default on Ubuntu systems is locked so that you can't log in with it directly, but which for some reason the folks at Rackspace have seen fit to enable. That, and the fact that each server is directly accessible via SSH on a public IP, made me a little nervous. If you don't use RackConnect, which is their offering for setting up a private environment including an ASA, you have to enable the firewall on each box and control access that way. From a security standpoint this introduces a lot of potential holes and requires some serious care and attention to detail. I had no sooner set my boxes up than I saw scans from the usual suspects (China, Spain). So, right off the bat, Rackspace servers require more manual intervention to secure your setup than AWS, unless of course you're going to use their RackConnect feature, which requires more money.
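To be concrete about what "enable the firewall on each box" entails, here's a default-deny iptables sketch (the source address and ports are made up for illustration; ufw would also do the job on Ubuntu):

```shell
# Allow established traffic, loopback, and only the services each box needs.
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -i lo -j ACCEPT
iptables -A INPUT -p tcp -s 203.0.113.10 --dport 22 -j ACCEPT   # SSH from the office IP only
iptables -A INPUT -p tcp --dport 80 -j ACCEPT                   # on the web server only
# Set the default-deny policy last so you don't cut off your own SSH session:
iptables -P INPUT DROP
```

Multiply that by every server, keep the rules in sync by hand, and you can see why RackConnect starts to look attractive despite the cost.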

I got my servers set up in a timely manner and set about working on the load balancer.

Amazon and Rackspace: A Comparison, Part 1

This will likely be an on-going set of posts as I make my way through figuring out which one is the better offering.

My company is failing in matters of redundancy and failover. I've been here for almost 2 years. When I stepped in as a Junior Sys Admin we had 1 of everything: 1 Apache server, 1 MySQL server, 1 Tomcat server, 1 switch, 1 ASA. Now, SaaS is a new frontier for me. Most of my previous sys admin work was spent in environments that had already been running for a while and pretty much served internal users: employees and maybe the occasional contractor. These were small to medium-sized businesses that could get away with a single point of failure, because if their ASA went down they wouldn't be losing revenue. Productivity time, sure, but not actual client-based revenue.

In the SaaS industry, your infrastructure is your revenue. Your clients depend on you to provide the kind of environment that they can't necessarily provide for themselves: backups, high availability, etc. I could go into the history of my company and how it didn't start as a SaaS provider, hence the lack of foresight in building out the infrastructure, but I'll save that for another post. Suffice it to say that we are now, on my watch, playing catch-up and trying to shape our infrastructure into what it needs to be to stay afloat.

I put together a presentation with the options as I saw them. I researched and laid out the pros and cons of 3 options: staying in our data center and expanding physically, i.e. purchasing bigger/better hardware, expanding into multiple/bigger cages, etc.; staying in the data center but essentially creating our own cloud using VMware or XenServer; or moving into someone else's cloud, like Rackspace or Amazon. After going through the various features, costs, and implementation challenges of each, it was decided that the cloud was where we would be going. The only thing left to determine was whether it would be Amazon or Rackspace.

Due to issues involving our in-house application server, I had an opportunity to test out an implementation sooner than I'd anticipated. After the read-only incidents and the failed controller/drives, we had a bunch of our own internal projects that needed to find somewhere else to live. I initially moved them into the data center with our clients just so that we could continue to work, but my plan is to move our development project to Amazon using their free micro instance, and to move the rest of us into Rackspace. I figure this will give me good experience with the ins and outs of setting up either service and help me determine which is a better fit for us. I admit I'm leaning heavily towards Rackspace for the following reasons: their customer service is unparalleled, their offerings are easy to understand, their control panel is simple to navigate, and you can combine physical and virtual systems pretty seamlessly. Amazon, on the other hand, really requires that you read the manual (closely, and many times) to even begin to understand the various services they offer and how they fit together. Just trying to get an estimate of how much an instance would cost was difficult and required metrics that, quite frankly, we don't have. Hell, just connecting to an instance via SSH is unwieldy. So, follow me as I try to set up services using both providers.

Tuesday, July 3, 2012

Or...It Could Be Monday

Monday is my day to stay at home and watch our daughter. My wife and I decided that during her first year we wanted to decrease the amount of stuff we'd miss out on by having her directly in daycare 5 days a week, so we both work compressed weeks and get a day to ourselves to enjoy our little girl and get some quality one-on-one bonding time with her during one of her most formative periods. I've been pretty lucky so far not to have anything blow up on me on a Monday, because as she's gotten older she's also gotten a lot more difficult to corral. My luck ran out yesterday, one week after she turned 1.

It turned out that the server problem I encountered Friday was not fully resolved. While the server stayed up and running over the weekend with nary an error, on Monday it crashed, and crashed hard. The developer who kind of doubles as IT backup rebooted a number of times, ran fsck from Knoppix, etc., and finally got it to stay up and running, although he doesn't know how. He booted the server, it dropped into initramfs, he typed exit, and it booted into the OS. This happened at least twice before the server stayed up.

If It's Friday, Something Must Be Breaking

I had an appointment that required me to leave the office at 3:15 today. My kid was due for her one-year pediatrician appointment, complete with shots, and I was going to meet my wife at the train station so we could all go. We've gone to every ped appointment she's had in her year on this earth as a family save one minor visit. At 2:10, things started to go wrong in the office. So wrong.

First there were reports that our in-house app was freezing. A quick check of some basic server vitals (memory, disk space, etc.) showed everything looked hunky-dory. We've had issues in the past where Java leaked memory or something else programmatic caused the problem. The Tomcat logs showed no appreciable delay in Apache handing off requests, and no errors in Tomcat itself. Since this was a project that affected only internal workers, and there were few of us left at this point, I opted to restart the Tomcat service. Tomcat stopped, and didn't start back up. Why? The file system had mysteriously gone into read-only mode.