Tuesday, July 3, 2012

Or...It Could Be Monday

Monday is my day to stay at home and watch our daughter. My wife and I decided that during her first year we wanted to decrease the amount of stuff we'd miss out on by having her directly in daycare 5 days a week, so we both work compressed weeks and get a day to ourselves to enjoy our little girl and get some quality one-on-one bonding time with her during one of her most formative periods. I've been pretty lucky so far not to have anything blow up on me on a Monday, because as she's gotten older she's also gotten a lot more difficult to corral. My luck ran out yesterday, one week after she turned 1.

It turned out that the server problem I encountered Friday was not fully resolved. While the server stayed up and running over the weekend with nary an error, Monday it crashed and crashed hard. The developer who kind've doubles as IT backup rebooted a number of times, ran fsck from Knoppix, etc., and finally got it to stay up and running, although he doesn't know how. He booted the server, it loaded into initramfs, he typed exit, and it booted into the OS. This happened at least twice before the server stayed up and on.


I checked out the logs and found a number of errors. Some samples:


Jul 2 10:09:46 amjxt2 kernel: [242948.040068] mptscsih: ioc0: attempting task abort! (sc=ffff8800bf803f00)
Jul 2 10:09:46 amjxt2 kernel: [242948.049480] sd 2:1:0:0: [sda] CDB: Write(10): 2a 00 08 bb a8 d8 00 00 08 00
Jul 2 10:09:47 amjxt2 kernel: [242948.190614] mptscsih: ioc0: task abort: FAILED (sc=ffff8800bf803f00)
Jul 2 10:09:47 amjxt2 kernel: [242948.200146] mptscsih: ioc0: attempting task abort! (sc=ffff880037fb5600)
Jul 2 10:09:47 amjxt2 kernel: [242948.209697] sd 2:1:0:0: [sda] CDB: Write(10): 2a 00 08 bb a8 f8 00 00 08 00


and

Jul 2 10:10:16 amjxt2 kernel: [242977.664275] mptbase: ioc0: Initiating recovery
Jul 2 10:10:39 amjxt2 kernel: [243001.010038] mptscsih: ioc0: host reset: SUCCESS (sc=ffff8800bf803f00)
Jul 2 10:11:51 amjxt2 kernel: [243072.692286] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
Jul 2 10:11:51 amjxt2 kernel: [243072.713450] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
Jul 2 10:11:51 amjxt2 kernel: [243072.734841] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
Jul 2 10:11:59 amjxt2 kernel: [243080.196520] mptbase: ioc0: RAID STATUS CHANGE for PhysDisk 1 id=8
Jul 2 10:11:59 amjxt2 kernel: [243080.207432] mptbase: ioc0: PhysDisk is now missing
Jul 2 10:11:59 amjxt2 kernel: [243080.218163] mptbase: ioc0: RAID STATUS CHANGE for PhysDisk 1 id=8
Jul 2 10:11:59 amjxt2 kernel: [243080.228950] mptbase: ioc0: PhysDisk is now missing, out of sync
Jul 2 10:11:59 amjxt2 kernel: [243080.239516] mptbase: ioc0: RAID STATUS CHANGE for VolumeID 0
Jul 2 10:11:59 amjxt2 kernel: [243080.249851] mptbase: ioc0: volume membership of PhysDisk 255 has changed
Jul 2 10:11:59 amjxt2 kernel: [243080.260359] mptbase: ioc0: RAID STATUS CHANGE for VolumeID 0
Jul 2 10:11:59 amjxt2 kernel: [243080.260361] mptbase: ioc0: volume membership of PhysDisk 255 has changed
Jul 2 10:11:59 amjxt2 kernel: [243080.260364] mptbase: ioc0: RAID STATUS CHANGE for VolumeID 0
Jul 2 10:11:59 amjxt2 kernel: [243080.260366] mptbase: ioc0: volume is now degraded, enabled
Jul 2 10:11:59 amjxt2 kernel: [243080.264242] end_device-2:0: mptsas: ioc0: removing sata device: fw_channel 0, fw_id 8, phy 0,sas_addr 0x1221000000000000
Jul 2 10:11:59 amjxt2 kernel: [243080.264246] phy-2:0: mptsas: ioc0: delete phy 0, phy-obj (0xffff880037b77800)
Jul 2 10:11:59 amjxt2 kernel: [243080.264259] port-2:0: mptsas: ioc0: delete port 0, sas_addr (0x1221000000000000)
Jul 2 10:11:59 amjxt2 kernel: [243080.264449] scsi target2:0:0: mptsas: ioc0: delete device: fw_channel 0, fw_id 8, phy 0, sas_addr 0x1221000000000000
Jul 2 10:12:36 amjxt2 kernel: [243117.330140] mptbase: ioc0: WARNING - IOC is in FAULT state (8064h)!!!
Jul 2 10:12:36 amjxt2 kernel: [243117.341301] mptbase: ioc0: WARNING - Issuing HardReset from mpt_fault_reset_work!!
Jul 2 10:12:36 amjxt2 kernel: [243117.362947] mptbase: ioc0: Initiating recovery
Jul 2 10:12:36 amjxt2 kernel: [243117.373835] mptbase: ioc0: WARNING - IOC is in FAULT state!!!
Jul 2 10:12:36 amjxt2 kernel: [243117.384672] mptbase: ioc0: WARNING - FAULT code = 8064h
Jul 2 10:12:39 amjxt2 kernel: [243120.800057] mptbase: ioc0: Recovered from IOC FAULT
Jul 2 10:13:05 amjxt2 kernel: [243146.710053] mptbase: ioc0: WARNING - Issuing Reset from mpt_config!!
Jul 2 10:13:05 amjxt2 kernel: [243146.720824] mptbase: ioc0: Attempting Retry Config request type 0x13, page 0x0, action 0
Jul 2 10:13:37 amjxt2 kernel: [243178.330073] mptbase: ioc0: WARNING - mpt_fault_reset_work: HardReset: success


And so the analysis begins. It's pretty clear from the output that something happened that took one of the physical drives out of the equation. basic drive failure, right? That's why you have RAID, isn't it? Except instead of failing a disk and keepin' on keepin' on, the whole thing went South.

Couple things I had to figure out. What is mptbase, mptsas, etc.? lspci shows

02:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08)
MPT modules are the drivers behind LSI HBAs. At the very least I guess you get mptsas involved although apparently there can be a number of mpt* modules installed for a card. The card itself is a very simple RAID controller, apparently common in 2-drive 1U server setups like the Dell PowerEdge R200 in question here. So, mptbase is some kind of kernel driver for the LSI card.  ioc is actually the name of the controller within Linux, so ioc0 is the first (and only in this case) controller in my system. I don't have Dell OMSA installed on this server and I didn't want to go through those shenanigans so I opted for trying 2 other utilities: mpt-status and lsiutil.

mpt-status unfortunately also installs Postfix, and frankly I didn't feel like I wanted to do more than necessary to get the info I needed. I opted for lsiutil instead then since it only needed for me to update sources.list. I referenced the LSIUtil Configuration Utility User's Guide to navigate the menus. Through a series of commands I found the following info:


  1. I have two physical disks, SATA, in a RAID 1 volume. I knew this already though. 
  2. The volume is degraded. One of the physical disks is simply not there any more. Specifically, PhysDisk 1 has gone bad. That much seemed obvious, but it was good to verify which actual disk had gone bad. 
What I did not get out of this was why a failing hard drive would cause my file system to mount read-only, and in fact crash the entire system. I can only guess that either both hard drives happened to be bad, or the controller card was failing. Another option is that for some reason the RAID defaults to read-only to stop the data from getting too far out of sync, but I've never seen that happen before. I did also see reference to being able to set ext3/4 to flip to read only whenever an error is encountered. The system is down now so there's no way for me to verify it on this system, but I will definitely take a look on another. 


Sources:
http://hwraid.le-vert.net/



No comments:

Post a Comment