Tuesday, April 26, 2011

Chasing the Elusive Segmentation Fault, aka The Rabbit Hole

This is gonna be a long one, so grab a beer and relax.

We recently replaced our Apache server with another one running a slightly higher version, going from 2.2.9 to 2.2.14. Shortly afterward we started receiving complaints from some of our clients that they were having problems accessing our site. It was never consistent; one day it was fine, the next not. The issues were also varied. Some reported that they intermittently couldn't access their site at all, getting 404 "Page Can Not Be Found" errors, while others were getting authentication errors while interacting with the site. The access logs showed plenty of requests being served to the specified hosts during the time periods of reported inaccessibility, but I did notice that there were loads of segmentation faults in the error log.

[Thu Apr 21 12:38:22 2011] [notice] child pid 18502 exit signal Segmentation fault (11)

Off to Google I went! I found a great debugging guide on the official Apache project page that was actually fairly easy to follow for a change. It identified 2 steps I needed to take to figure out what was going on: 1) get a core dump of the crash, and 2) do a backtrace on the dump file. Seemed simple enough, right?

Taking a dump
I enabled core dumps using the ulimit -c command and created a directory for the files to go in, adding the CoreDumpDirectory directive to apache2.conf. No files ever showed up there, and the error messages never changed to indicate a dump file was being created. More on this later.

The problem persisted so I continued investigating options. I compared the setup on the old server to the new one. I checked memory usage, disk space, and CPU. I found that there was an extra module loaded with this version, reqtimeout. This module apparently controls how long apache will wait for header info to be received from the host, terminating the connection if it takes longer than 10 seconds. I disabled this module using a2dismod, but wasn't able to test it right away since it required restarting Apache.

As an aside, I found an interesting blog post about the difference between reloading and restarting Apache here, since I was curious as to whether I could get away with reloading Apache to make the change.

That being done I continued to look for other avenues. I found http://www.omh.cc/blog/2008/mar/6/fixing-apache-segmentation-faults-caused-php/, which yielded a very interesting little factoid that I had missed and that was likely the cause of the missing core dumps. Apache starts the parent process as root, but forks child processes as the user www-data. While the dump directory I had created had proper permissions (777), I had not necessarily changed the ulimit settings for the user www-data, and if it was this user that needed to create the dump file, that would explain the lack thereof. I was not about to recompile our production Apache server to force the child processes to run as root though. I was also interested in running Apache in debug mode to see if it would be clear there, but again not something I wanted to do on a production server. I wound up adding the ulimit setting to /etc/profile, editing /etc/security/limits.conf as well. Still no core dumps.
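The reason a ulimit change in my own shell probably never reached the Apache children is that resource limits belong to a process and are inherited by whatever that process starts. Here's a minimal Python sketch of that inheritance. It has nothing to do with Apache's own code, and the helper name child_core_limit is made up for illustration. It lowers the soft core-size limit to 0 in the current process, then asks a freshly started child what limit it sees.

```python
import resource
import subprocess
import sys


def child_core_limit():
    """Start a child process and return the soft core-size limit it sees.

    Hypothetical helper for illustration only.
    """
    out = subprocess.check_output(
        [
            sys.executable,
            "-c",
            "import resource; print(resource.getrlimit(resource.RLIMIT_CORE)[0])",
        ]
    )
    return int(out.decode().strip())


# Lower the soft limit in this process; lowering never needs extra privileges.
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
resource.setrlimit(resource.RLIMIT_CORE, (0, hard))

# The child did not set anything itself, yet it reports the parent's limit.
inherited = child_core_limit()
print("child sees soft core limit:", inherited)
```

The same logic runs the other way: raising the limit in a login shell does nothing for a daemon that was started by something else, so the limit has to be set in whatever actually launches Apache.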

I trawled the internet looking for explanations for why this process, which seemed like a very simple thing, was failing so miserably. I posted to forums, read through newsgroups, changed /proc/sys/fs/suid_dumpable to 2 as per http://forums.spry.com/vps-hosting/1654-apache-core-dump-file.html, all to no avail. I eventually found a post that suggested reading /usr/share/doc/apache2.2-common/README.backtrace. Indeed this contained instructions for enabling core dumps as well, and they were slightly different in that they pointed to an already existing directory under /var/cache/apache2. I had taken an image of the Apache server prior to putting it into production, so I used that to create an identical virtual server to test. I followed the instructions from that doc and killed one of the child processes with the kill -11 command, and indeed there it was: a core dump file! I was so excited, and immediately logged in to the production server to try this out. I did it around midnight so as to not inconvenience too many people when restarting Apache. Guess what? Nothing doing. No dumps. Same steps, no joy.
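The kill -11 trick can be tried safely on a throwaway process before touching a real server. The sketch below is only an illustration under my own assumptions (the function name crash_child is invented, and it uses a plain forked Python child rather than an Apache worker). It sends SIGSEGV to a child and then reads the wait status, which says both which signal killed the child and whether the kernel reports that a core was dumped.

```python
import os
import signal
import time


def crash_child():
    """Fork a throwaway child, send it SIGSEGV, and inspect how it died.

    Returns (signal_number_or_None, core_dumped_flag).
    Hypothetical helper for illustration only.
    """
    pid = os.fork()
    if pid == 0:
        # Child: just idle until the parent's signal arrives.
        time.sleep(30)
        os._exit(0)

    # Parent: the equivalent of running `kill -11 <pid>` from a shell.
    os.kill(pid, signal.SIGSEGV)
    _, status = os.waitpid(pid, 0)

    sig = os.WTERMSIG(status) if os.WIFSIGNALED(status) else None
    return sig, os.WCOREDUMP(status)


sig, dumped = crash_child()
print("child killed by signal:", sig)
print("kernel reports core dumped:", dumped)
```

Whether the second line prints True depends on the core-size limit and the rest of the system's dump settings, which is exactly the part that differed between my test box and production.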

The next day I'm back to square one with these dump files. I start looking and reading again, and I don't mind admitting that I was getting a little punchy. I went through the logs again, making note of specific seg faults and trying to correlate any other system occurrence to get some clues. What did I see but the full segmentation error, now featuring an additional part: "possible core dump in /var/cache/apache2". Sweet, I thought. But...no core file in that directory. I searched the system for core files since I had read that for some people the core files were thrown into the server root even though they'd specified a dump directory.
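One thing worth checking when hunting for missing core files on Linux is /proc/sys/kernel/core_pattern, which tells the kernel where a dump goes. I can't say this was the culprit in my case, so treat the following as a hedged sketch. The function names are mine, and the three categories follow the core(5) man page: a leading "|" pipes the dump to a program, a leading "/" is an absolute path, and anything else is relative to the crashing process's working directory.

```python
import os

CORE_PATTERN_FILE = "/proc/sys/kernel/core_pattern"  # Linux-specific path


def classify_core_pattern(pattern):
    """Say roughly where the kernel would send a core dump for this pattern."""
    pattern = pattern.strip()
    if pattern.startswith("|"):
        return "pipe"      # dump is handed to a helper program
    if pattern.startswith("/"):
        return "absolute"  # dump is written to a fixed location
    return "relative"      # dump lands in the crashing process's cwd


def read_core_pattern(path=CORE_PATTERN_FILE):
    """Return the current pattern, or None if it cannot be read."""
    try:
        with open(path) as handle:
            return handle.read().strip()
    except OSError:
        return None


pattern = read_core_pattern()
if pattern is None:
    print("could not read", CORE_PATTERN_FILE)
else:
    print(pattern, "->", classify_core_pattern(pattern))
```

A "relative" result would fit the reports of core files turning up somewhere other than the configured dump directory.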

Nothing.

Time for Plan B. I didn't have a Plan B. I had given up on producing the core dumps and started looking specifically for any clues as to how to debug PHP. My suspicion was all along that the PHP module was the problem, because that's pretty much what everyone else who had seg faults was reporting. The difference is they had proof, whereas I was going on a hunch. Who wants to start trying to troubleshoot something that they're not certain is the actual cause? I was pretty much going nuts and chasing every loose string until I stumbled across a wiki article about apport. I think I had started searching specifically for how to troubleshoot this on Ubuntu as opposed to a generic search on Linux. Turns out Ubuntu/Debian really has a lot of quirky little "special" things. That annoys me.

With this article I installed and ran apport on my virtual server, manually crashed a child process, and got a crash file in /var/crash, and all without too much fuss. What a huge contrast to trying to use the kernel's dumping feature. I repeated the same steps on my production server with success. I finally had a core file.

Now what to do with it? That's the next post.
