Tuesday, March 22, 2011

Monitoring with Nagios and OMSA

I'm more experienced with Red Hat and its variants (CentOS) and had only heard of Ubuntu in the context of being a home/desktop system, so I was surprised when I joined my current employer and discovered that they use Ubuntu Server across the board. I'm pretty careful not to be one of those IT people who automatically assume that any solution that doesn't conform with prior experience is inferior. I mean, as much as I would have liked to have swooped in and converted everyone to RH, it's not the best way to go about things. You observe, you take notes, and you make a decision based on real performance metrics and values, not personal preference alone. This is especially true when you're talking about a company that develops its own SaaS offering. The developers are used to a specific platform, so you'd better be sure it's worth it to shake that up for them.


So, let me get this out of the way. Ubuntu as a distro isn't bad, although its wacky way of installing things from the repo is a little challenging. Why Debian needs to rearrange apps like Apache and Tomcat and put them in /etc instead of /opt, or spread them out between /etc and /usr (as in the case of Tomcat), is beyond me. If you want the directory structure to conform to what everyone else in the *Nix universe is using, you have to install from source. This is what I wound up doing when I set up a Nagios server, because it gets pretty frustrating trying to follow well-established instructions when the paths don't match, and the instructions you find in the Ubuntu documentation are just not comprehensive enough. I was using a great resource, Ramesh Natarajan's Nagios Core 3 ebook, but it quickly got a little complicated translating Ubuntu's directory structure to the much more sensible layout of the compiled version.

Anywho, Ubuntu is fine except for that, but it lacks the kind of vendor support that a server distro meant for production environments should have. Case in point: these are Dell R300s. Dell has a pretty neat tool, OpenManage Server Administrator (OMSA), that can be used to monitor the server. You get detailed information on the storage controllers, hard drives, firmware, etc. It's good stuff and I have used it in every other environment. In Windows it's easy to install, and I got it working on RH systems without too much fuss as well. Unfortunately, there isn't an official, Dell-supported OMSA release for Ubuntu/Debian. There is a repo available now at http://linux.dell.com/repo/community/deb, which was worked on by some engineers from both Dell and Canonical, but it isn't flawless and support is limited to a listserv.

I installed this software on my 10.04 x64 server, a test server that I'm preparing to deploy to our colo. I started with version 6.4, and it installed without too much trouble. I had to add an entry under /etc/apt/sources.list.d/ for this unofficial repository and then run apt-get update. I installed the main components and started the dataeng service. I immediately lost my SSH session and was unable to get back in. I ultimately had to log in at the console (good thing the server was onsite) and restart networking services. This worked for a while, but I lost connectivity again after a bit. I uninstalled OMSA and everything went back to the way it was. Weird. I tried it again with fewer components, just the barebones, and had the same series of incidents. I could find nothing online about why this would be interfering with my network connection.

I then tried installing an older version, 6.3. The same thing happened, and I restarted networking again. So far it's been up and running for almost an hour since that restart, I'd guess, so I started to experiment with some of the commands you can use with omreport. The ultimate goal is to use this with check_dell_openmanage. I wanted to test the commands locally and make sure OMSA worked properly before starting the next step of integrating it with Nagios. Good thing I did.


admin@Server1:~$ omreport chassis hwperformance
Error! No Hardware Peformance probes found on this system.
admin@Server1:~$ omreport chassis memory
Memory Information

Error : Memory object not found

So, the NIC is still functional right now, but I am unable to actually get any information off of the system. Coincidentally, I also found a posting that describes the problem I've been having with the network card: http://lists.us.dell.com/pipermail/linux-poweredge/2011-February/044224.html. That poster was also using 6.4 so maybe it's specific to that release.
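For what it's worth, the by-hand check I did above could be scripted before Nagios ever gets involved. This is only a rough sketch, not anything that ships with OMSA or check_dell_openmanage: the function names classify_omreport and check_omreport are made up for illustration, the "Error" pattern is based solely on the two failure messages shown above, and the return values follow the usual Nagios plugin convention (0 = OK, 2 = CRITICAL, 3 = UNKNOWN).

```shell
#!/bin/sh
# Hypothetical helpers for sanity-checking omreport output locally.
# Return codes follow the Nagios plugin convention:
#   0 = OK, 2 = CRITICAL, 3 = UNKNOWN

# Classify a block of omreport output passed as the first argument.
classify_omreport() {
    output="$1"
    # No output at all tells us nothing, so treat it as UNKNOWN.
    if [ -z "$output" ]; then
        return 3
    fi
    # Assumption: failures appear as a line starting with "Error!" or
    # "Error :", as in the two messages in the session above.
    if printf '%s\n' "$output" | grep -Eq '^Error ?[!:]'; then
        return 2
    fi
    return 0
}

# Run one omreport subcommand (e.g. "chassis memory") and print a status line.
check_omreport() {
    if ! command -v omreport >/dev/null 2>&1; then
        echo "UNKNOWN: omreport not found in PATH"
        return 3
    fi
    output=$(omreport "$@" 2>&1)
    status=0
    classify_omreport "$output" || status=$?
    case $status in
        0) echo "OK: omreport $*" ;;
        2) echo "CRITICAL: omreport $* reported an error" ;;
        *) echo "UNKNOWN: omreport $* returned no output" ;;
    esac
    return $status
}
```

With those functions sourced, running `check_omreport chassis memory` on this server should report CRITICAL given the output above, which is a quick way to confirm OMSA is actually returning data before pointing a real plugin at it.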

Now I'm off to try and solve this new mystery. There is definitely something to be said for using a distribution with popular support. I understand that the only way for a distro like Ubuntu to get that kind of mainstream support is for more people to adopt it, and Dell and other major OEMs won't provide that if it's not being used in production server environments, but it's definitely challenging to be on the back edge of that movement. 
