Tuesday, April 26, 2011

Chasing the Elusive Segmentation Fault, aka The Rabbit Hole, Part 2

The Analysis

I managed to get a crash file from a child Apache process. I informed one of my co-workers, a developer who participates in the IT side of things (although he claims he doesn't like it), and asked his advice about how to use it. The actual dump data was base64-encoded, so not readable to me. I assumed he knew of some "development" tool that could be used to read it. I was prepared to use gdb given the standard core file that I had been expecting, but wasn't sure what to do with this. He took apport one step further and found apport-unpack, which takes the file apport creates and separates it out into proper dump files and associated information. With that I was able to run gdb against a dump file and get on the road to identifying the problem.
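For reference, the unpacking step is a one-liner; something like this (the crash file name and output directory here are just stand-ins for the real ones):

$ apport-unpack /var/crash/<report>.crash /tmp/apache-crash
$ ls /tmp/apache-crash        # CoreDump is the piece gdb actually wants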

Right away running gdb /usr/bin/apache2 against the core file showed that the issue was with libapache2-mod-php5. Not shocking, but at least now there was proof. The next step was to figure out exactly what was happening. bt didn't give us much more than that. The hunt was on again.
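The gdb session itself was nothing fancy. Roughly (paths approximate):

$ gdb /usr/bin/apache2 /tmp/apache-crash/CoreDump
(gdb) bt          # backtrace of the crashed child; frames pointed into the PHP module
(gdb) bt full     # same thing with local variables, for whatever extra context it gives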

I went back to my old pal at http://www.omh.cc/blog/2008/mar/6/fixing-apache-segmentation-faults-caused-php/ and skipped the whole section about putting Apache into single process mode and jumped right into his steps for using gdb to debug PHP. I created a file, gdbinit, and ran the command

(gdb) dump_bt executor_globals.current_execute_data

This essentially did nothing. It complained that dump_bt was an undefined command. I searched on this, and eventually found my way to a gdb guide. Apparently gdb reads a .gdbinit file from your home directory first and then from the working directory (if it's different). That didn't seem to be working for my file, but you can also point gdb at a file explicitly using the source command, which is what I did. That worked beautifully.
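In other words, instead of relying on the auto-load behavior, you hand gdb the macro file yourself. Something along these lines (the path to my gdbinit is from memory):

(gdb) source /root/gdbinit
(gdb) dump_bt executor_globals.current_execute_data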

Well, almost.

Now gdb is reading gdbinit properly, but it's telling me "Attempt to extract a component of a value that is not a structure".

Have I mentioned that I'm getting a little weary of all the Googling?

I come across this guy's really cool blog (I am now a fan) and try a few of his suggestions, but I'm getting nowhere. At long last I have a stroke of inspiration. Maybe the issue is that I am missing debug symbols. I'd already had a go-round with debug symbols as they relate to Apache. The instructions I'd found in /usr/share/doc/apache2.2-common/README.backtrace had indicated that I needed to install apache2-dbg, but I found this package to be completely unavailable via package management and couldn't locate it anywhere online for manual installation. I decided to try my luck with the php5-dbg package, and it was in fact available and downloadable. Unfortunately, it didn't work. The symbols weren't loaded according to php -i, and it didn't remove the error message I was getting in gdb. I found my way to this article and installed the libapache2-php5-dbgsym package according to the instructions there. I did this on my virtual server, not the live server.
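For anyone following along, the dbgsym part boils down to enabling Ubuntu's debug-symbol (ddebs) repository and installing from there. Roughly this, with the caveat that you may also need to import the repository's signing key and that the exact package names may differ on your release:

$ echo "deb http://ddebs.ubuntu.com $(lsb_release -cs) main restricted universe multiverse" | sudo tee /etc/apt/sources.list.d/ddebs.list
$ sudo apt-get update
$ sudo apt-get install php5-dbg libapache2-php5-dbgsym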

I took the captured dump from the live server, transferred it to the virtual server, ran gdb, and successfully found the exact lines in the PHP code that caused the seg faults. Now it's up to the developers to figure out why the code is trying to access inappropriate areas of memory.

Chasing the Elusive Segmentation Fault, aka The Rabbit Hole

This is gonna be a long one, so grab a beer and relax.

We recently replaced our Apache server with another one running a slightly newer version, going from 2.2.9 to 2.2.14. Shortly afterward we started receiving complaints from some of our clients that they were having problems accessing our site. It was never consistent; one day it was fine, the next not. The issues were also varied. Some reported that they intermittently couldn't access their site at all, getting 404 "Page Can Not Be Found" errors, while others were getting authentication errors while interacting with the site. The access logs showed plenty of requests being served to the specified hosts during the time periods of reported inaccessibility, but I did notice that there were loads of segmentation faults in the error log.

[Thu Apr 21 12:38:22 2011] [notice] child pid 18502 exit signal Segmentation fault (11)

Off to Google I went! I found a great debugging guide on the official Apache project page that was actually fairly easy to follow for a change. It identified two steps I needed to take to figure out what was going on: 1) get a core dump of the crash, and 2) do a backtrace on the dump file. Seemed simple enough, right?

Taking a dump
I enabled core dumps using the ulimit -c command and created a directory for the files to go in, adding the CoreDumpDirectory directive to apache2.conf. No files ever showed up there, and the error messages never changed to indicate a dump file was being created. More on this later.
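For the record, that setup was about as basic as it gets; something like this (the dump directory is just the one I happened to create, so treat the path as a placeholder):

$ ulimit -c unlimited
$ sudo mkdir -p /var/core-dumps && sudo chmod 777 /var/core-dumps

# in /etc/apache2/apache2.conf
CoreDumpDirectory /var/core-dumps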

The problem persisted so I continued investigating options. I compared the setup on the old server to the new one. I checked memory usage, disk space, and CPU. I found that there was an extra module loaded with this version, reqtimeout. This module apparently controls how long Apache will wait for header info to be received from the client, terminating the connection if it takes longer than 10 seconds. I disabled this module using a2dismod, but wasn't able to test it right away since it required restarting Apache.
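Disabling it is a one-liner, with the restart saved for a maintenance window (the service command is the Ubuntu way; apache2ctl works just as well):

$ sudo a2dismod reqtimeout
$ sudo service apache2 restart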

As an aside, I found an interesting blog post about the difference between reloading and restarting Apache here, since I was curious as to whether I could get away with reloading Apache to make the change.
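For the curious, on Ubuntu those boil down to something like the following (as I understand it, reload is a graceful restart under the hood):

$ sudo service apache2 reload     # re-reads the config, lets children finish their current requests
$ sudo service apache2 restart    # full stop and start of the parent and all children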

That being done, I continued to look for other avenues. I found http://www.omh.cc/blog/2008/mar/6/fixing-apache-segmentation-faults-caused-php/, which yielded a very interesting little factoid that I had missed and that was likely the cause of the missing core dumps. Apache starts the parent process as root, but forks child processes as the user www-data. While the dump directory I had created had proper permissions (777), I had not necessarily changed the ulimit settings for the user www-data, and if it was this user that needed to create the dump file, that would explain the lack thereof. I was not about to recompile our production Apache server to force the child processes to run as root, though. I was also interested in running Apache in debug mode to see if the problem would be clearer there, but again, not something I wanted to do on a production server. I wound up adding the ulimit setting to /etc/profile and editing /etc/security/limits.conf as well. Still no core dumps.
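For completeness, the limits change amounted to just a few lines (values from memory, so take them as approximations):

# /etc/security/limits.conf
www-data  soft  core  unlimited
www-data  hard  core  unlimited

# /etc/profile
ulimit -c unlimited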

I trawled the internet looking for explanations for why this process, which seemed like a very simple thing, was failing so miserably. I posted to forums, read through newsgroups, changed /proc/sys/fs/suid_dumpable to 2 as per http://forums.spry.com/vps-hosting/1654-apache-core-dump-file.html, all to no avail. I eventually found a post that suggested reading /usr/share/doc/apache2.2-common/README.backtrace. Indeed this contained instructions for enabling core dumps as well, and they were slightly different in that they pointed to an already existing directory under /var/cache/apache2. I had taken an image of the Apache server prior to putting it into production, so I used that to create an identical virtual server to test on. I followed the instructions from that doc and killed one of the child processes with the kill -11 command, and indeed there it was: a core dump file! I was so excited, and immediately logged in to the production server to try this out. I did it around midnight so as to not inconvenience too many people when restarting Apache. Guess what? Nothing doing. No dumps. Same steps, no joy.
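Pieced together, the test on the virtual server looked roughly like this (the PID is whatever child process you pick on):

$ echo 2 | sudo tee /proc/sys/fs/suid_dumpable
$ ps -u www-data -o pid,cmd | grep apache2     # grab a child PID
$ sudo kill -11 <pid>
$ ls /var/cache/apache2                        # on the test box, the core file showed up here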

The next day I'm back to square one with these dump files. I start looking and reading again, and I don't mind admitting that I was getting a little punchy. I went through the logs again, making note of specific seg faults and trying to correlate any other system occurrence to get some clues. What did I see but the full segmentation error, now featuring an additional part: "possible core dump in /var/cache/apache2". Sweet, I thought. But...no core file in that directory. I searched the system for core files since I had read that for some people the core files were thrown into the server root even though they'd specified a dump directory.

Nothing.

Time for Plan B. I didn't have a Plan B. I had given up on producing the core dumps and started looking specifically for any clues as to how to debug PHP. My suspicion all along was that the PHP module was the problem, because that's pretty much what everyone else who had seg faults was reporting. The difference is they had proof, whereas I was going on a hunch. Who wants to start trying to troubleshoot something that they're not certain is the actual cause? I was pretty much going nuts and chasing every loose thread until I stumbled across a wiki article about apport. I think I had started searching specifically for how to troubleshoot this on Ubuntu as opposed to a generic search on Linux. Turns out Ubuntu/Debian really has a lot of quirky little "special" things. That annoys me.

With this article I installed and ran apport on my virtual server, manually crashed a child process, and got a crash file in /var/crash -- and all without too much fuss. What a huge contrast to trying to use the kernel's dumping feature. I repeated the same steps on my production server with success. I finally had a core file.
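The whole apport dance, for the record, went more or less like this (the enable step is the part the wiki article walks you through; details may vary by release):

$ sudo apt-get install apport
# set enabled=1 in /etc/default/apport, then:
$ sudo service apport start
$ sudo kill -11 <apache child pid>
$ ls /var/crash                    # the .crash file lands here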

Now what to do with it? That's the next post.

Tuesday, April 5, 2011

Paying for Mistakes

Mistakes happen in IT. Human beings are at the helm, and we are fallible creatures prone to distraction, forgetfulness, and plain old everyday carelessness. It happens to everyone, and you just hope it doesn't happen too often.

We have a ton of Android phones in our organization and I was asked to look into a security solution and draw up a policy around it. I did some research and came upon WaveSecure. WaveSecure seemed like a pretty good solution overall. It allowed for remote wiping, GPS tracking, and backing up contacts and such. We went through a process of vetting and I tested it out on my own phone at my own cost and liked it. I got permission to go forward and set out to buy it. Unfortunately, in the time it took for us to say "yea", WaveSecure got bought by McAfee. This was an immediate red flag as I don't, in general, care for any of McAfee's products. I hate when a company I don't trust or like gets their hands on a product that I do like and trust. I mean, good for tenCube and all, and I'm sure they're a little happier and better off, but yuck for consumers. Reminds me of AT&T and T-Mobile.

But I digress.

I went to make my first purchase. There isn't an option to bulk buy a mass of licenses, which, although lamentable, makes sense as this isn't hyped as an enterprise solution. I got my first user's phone, put in all the info, checked it twice, and hit the button. And I waited. And waited. I got a confirmation email about the purchase, but I never got the SMS message to the phone with activation information. I downloaded the app from the Market, and it asked for a PIN. I had no PIN. Something was wrong. I looked over my info again; it looked flawless. I inspected the phone...and realized that the number I thought was the handset number was actually the last outgoing phone number.

Whoopsies.

No big, I thought. I'll just contact McAfee and let them know about this little mishap, and they'll help me right out. I went to the WaveSecure site and found a link to Customer Support. There was no chat or call option, although there was a web form. I filled that out and waited. And waited some more. I received no confirmation of my request, no indication that it had even been received. Radio silence.

I waited the weekend, and checked back again Monday. Still nothing. At this point I'm more than a little disgruntled. I look all over the website for some alternate means of contacting them and find nothing. I finally start a chat session with regular McAfee support. I'm pretty thorough with my information-sharing, so I told the rep I was chatting with everything you've read above. His/her response? "Here's a link to the WaveSecure support site. Fill out this form." Did you not hear the part where I'd already done that and was taking step 2 now? And then they went on autopilot, and that was the only response I could get. I was not happy.

I moved on to the phone, thinking maybe I just wasn't good at communicating over chat. I went through the whole story again, in my bestest, calmest voice, and the end result was "go to the website and fill out the form." Okay, I was pretty ticked off at that point. I still kept my cool, though, and asked one final series of questions. First: does McAfee actually offer any support for this product they purchased or not? It seemed like he was looking through some documentation or something to answer this simple question. I asked again, in a different way: if I purchased this software for my company, are you telling me that my only support option is via the website and that web form? That I can never speak to a live person who may actually know something about this product? His answer was yes.

I have submitted one final email to WaveSecure. If there is no success from there I will have to ask my boss to open a dispute with AMEX, and I will be killing any ideas to go forward with using this software. I can't work with a vendor that doesn't offer customer support. If they can't help me with something as simple as the initial purchase, what happens when something more complicated comes down the line?

I may end up paying $20 for my mistake, but McAfee/WaveSecure will pay a lot more for their lack of customer support in the end.

Friday, April 1, 2011

OMSA Network Mystery Solved...?

No root cause analysis here, but it's working and, quite frankly, at this point that's good enough for me. I made the very common mistake of throwing multiple solutions at this problem at once, so I'm not entirely sure which one resolved it, but it was either down/upgrading the OMSA version (I downgraded to 6.3, and not a day later 6.5 was announced, so I tried that out as well) or actually rebooting the server instead of starting the datamgr daemon manually. I hope it was the version, because it would be pretty crappy if you actually had to reboot the server to get OMSA to work properly. Linux servers are known for not needing a reboot after every update/patch/fix, and having to do so here runs pretty contrary to that.

I am now using OMSA and the check_openmanage plugin to monitor the server hardware through Nagios. Not too shabby. Now, on to getting Nagios to send notifications for alerts.
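For anyone curious what the Nagios side looks like, the wiring is just a command definition and a service definition; something like the following sketch (the plugin path and host name are placeholders, not my actual config):

define command {
    command_name  check_openmanage
    command_line  /usr/lib/nagios/plugins/check_openmanage
}

define service {
    use                  generic-service
    host_name            dell-server            ; placeholder host name
    service_description  Hardware health (OMSA)
    check_command        check_openmanage
}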