Wednesday, April 25, 2012

Apache Performance Issues

So, I'm sitting at my desk, minding my business and working on a risk assessment report (by the way, the Commonwealth of Virginia has a goldmine of information and resources on risk management that you should pilfer if you find yourself in need of doing this) when one of our client managers comes over and tells me that his client is reporting slowness using our application. The report is vague (of course), and the manager hasn't tried to access the application himself, so I try to pull some more information out: what client, what's slow, that kind of thing. I pull up the client's page and notice that it is taking a bit longer than it should to load. I ssh into the server and check out some quick metrics: free -m, top, iostat and everything looks fine. Server doesn't seem to be under load.

As I'm doing this, a few more managers come over to report the same thing. Multiple clients have been contacting them to say that their setup is sluggish. So, we can safely rule out a network issue on the client side since these people are scattered all over the country. My heart drops as I realize we may be facing a larger issue. I check out Nagios, and all is green. I start to log into each of the servers that play a part in delivering content and make sure that all is well with the basic server resources. No one is eating up CPU time, no one is running low on memory, no one shows any kind of disk degradation. MRTG shows pretty low bandwidth utilization overall. So, we've ruled out bandwidth and server resources. I have one of the developers check out MySQL and make sure no one's running some sort of super-large query and that the replication setup hasn't started affecting performance in some way. Everything's good.

I get reports that performance is getting back to normal for people, and indeed one of the sites I was testing earlier loads faster. Everything seemed back to normal. But what are the first two things out of a manager's mouth after something goes wrong or breaks down? "Is it fixed?" and "What happened?" Due to that and my own curiosity, I kept looking into it. It seemed to me that everything had been ruled out except for one common aspect, that being the Apache server. The details of our application are that some services go through Tomcat, some go to the databases directly, but everything goes through Apache, so given the scope of the problems, Apache was really the only other common denominator outside of the network, which had been cleared of wrongdoing.

Nothing was pegging the resources of the server, so I looked at the mod_jk logs to see if there was noticeable lag in Apache handing off requests to the various Tomcat servlets scattered about. I found that during the period in question it took up to 15 seconds in some cases for the transaction to happen. On a normal day this number is more like 0.00* for most clients, so that seemed to be a sure indicator that Apache was having some hiccups. Since the server itself wasn't exhibiting any resource issues, though, I looked into the configuration for clues. I noticed that while I was lessing and grepping away, the shell was starting to respond slowly, intermittently.
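
For anyone who wants to do the same hunt without eyeballing the log, here is a rough Python sketch of the idea. It assumes the mod_jk request log was set up so that the request duration in seconds is the last whitespace-separated field on each line (for example via a JkRequestLogFormat ending in %T); that layout is my assumption, not something every install has, and the sample lines and the `slow_requests` helper are made up for illustration.

```python
def slow_requests(lines, threshold_seconds=1.0):
    """Return (duration, line) pairs slower than the threshold, slowest first.

    Assumption: the last whitespace-separated field of each log line is the
    request duration in seconds. Lines that don't fit that shape are skipped.
    """
    slow = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        try:
            duration = float(fields[-1])
        except ValueError:
            continue  # not a request line in the assumed format
        if duration > threshold_seconds:
            slow.append((duration, line.rstrip("\n")))
    return sorted(slow, reverse=True)


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from a real log.
    sample = [
        "[Wed Apr 25 14:02:11 2012] worker1 example.com 0.003421",
        "[Wed Apr 25 14:02:12 2012] worker1 example.com 15.204118",
        "some line that is not a request entry",
    ]
    for duration, line in slow_requests(sample):
        print(f"{duration:.3f}s  {line}")
```

Against a real log you would pass an open file object instead of the sample list, and probably tune the threshold to whatever "normal" looks like on your setup.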

On a hunch I check the number of processes and see that Apache has ~145 processes running. Apache.conf has MaxClients set to 150 by default. I begin to suspect that my problem may very well be that Apache hit that limit and clients were waiting for free workers to service their requests. I send out a preliminary company-wide email letting everyone know that the problem seems to have been temporary, and that I have an idea of why it happened and will be following up after hours. I start looking into the possibility of increasing MaxClients and making plans to perhaps restart Apache later when I suddenly realize that my ssh session has become unresponsive. A quick check of Nagios shows HTTP server status and Apache status in the red. A quick check of a few client sites shows them down.
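
In hindsight, this is the kind of check that is easy to script. Below is a minimal Python sketch of it, under a few assumptions: Apache is running the prefork MPM (so one process is one client slot), the Apache binary shows up in the process list as "apache2" or "httpd", and `ps -e -o comm=` is available to list bare command names. The function names and the 90% warning ratio are mine, picked for illustration.

```python
import subprocess


def count_workers(ps_output, names=("apache2", "httpd")):
    """Count lines in a command-name listing that are an Apache binary."""
    return sum(1 for line in ps_output.splitlines() if line.strip() in names)


def headroom(worker_count, max_clients=150, warn_ratio=0.9):
    """Compare the running worker count against the MaxClients limit."""
    used = worker_count / max_clients
    return {
        "used_ratio": used,
        "free_slots": max_clients - worker_count,
        "near_limit": used >= warn_ratio,  # warn_ratio is an arbitrary choice
    }


if __name__ == "__main__":
    # "comm=" prints only the command name, with no header line.
    listing = subprocess.run(
        ["ps", "-e", "-o", "comm="], capture_output=True, text=True, check=True
    ).stdout
    workers = count_workers(listing)
    print(workers, headroom(workers))
```

With the numbers from that afternoon, `headroom(145)` reports 5 free slots and flags the server as near its limit. Note that the count includes the parent Apache process, so it slightly overstates the number of busy slots; for a rough early warning that seemed acceptable to me.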

Oh crap.

I shoot another company-wide email out letting them know of the new horror and that I am working on it. Given that the server was unresponsive, I called the data center to ask them to do a hard shutdown on the server. Well, let me backtrack a little. I had to call the customer support number for Earthlink (all of Earthlink) because they don't publish the number to the data center anywhere, and I was told by a bored-sounding young lady that she'd have to put in a trouble ticket. I told her I just needed to be transferred to the data center (and I had done this before, so I knew it was possible) and told her exactly what I needed done (i.e. I didn't need troubleshooting, I needed one finger), but she insisted on following her script and promised me that someone would call me back. Unacceptable. I called our Sales Rep, who was a super-star and got me the number to the data center, who then reset the machine for me. All of this was done before I ever heard back from their call center.

The reset got the machine back online and client pages back up and going. The entire outage was ~30 minutes. In the meantime I had begun building out another server in case it was a hardware issue, using my Mondo ISOs to make an exact replica of the server in question. That saved me some valuable time. I'm ultimately putting the additional server in tomorrow morning so that we have a "hot spare", but I don't think it was a coincidence that I was doing risk management at the time that this happened. It was only a matter of time before our fragile infrastructure had something like this happen. It could have been a lot worse.

One could ask the question "Why was there only one Apache server anyway?" It's certainly a valid one. There are lots of answers, mainly around time and money. As a small business Sys Admin you find yourself balancing those two a lot. Everyone talks about disaster recovery and risk management, but the time it takes tends to be a little daunting for a one-person IT team, and the money it takes tends to be daunting for a small business barely breaking even. Throw in the fact that as a small business you're often initially built out with the now in mind, so that when growth happens you find yourself caught unaware and racing to scale up, and well...there you have your answer.

Hmmm, I think this deserves its own post. More on that later.
