Wednesday, January 4, 2017

EC2 Not Booting Main Course, With a Little LVM Sauce on the Side

One of our devs came to me to report that he couldn't log in to an EC2 instance that was acting as our FTP server. He'd tried to log in to investigate a customer's report that they couldn't upload something. Before anything else, there's a big failure here: the customer should never be the first person to tell you that your system is broken or down. That means your monitoring has failed.

I checked the monitoring tab for the instance (after verifying that I also could not SSH to the instance) and saw that CPU was pegged solidly at 100%—for like 3 days.

Way bad.

This is an EC2 instance that time forgot, so little information was known about it. My first thought was to create an AMI before we did anything else so that we wouldn't lose everything. I started the AMI creation but it hung for a long time. I deregistered it to stop the AMI creation process, and tried rebooting the instance. A reboot failed, but a stop/start worked and AWS showed the instance as running. It was still inaccessible via SSH though. One more stop/start to be sure, and time to dig deeper.

I was at a loss as to why the instance was inaccessible. I eventually found the system log in AWS and saw an entry reading mountall: Event failed. This told me that the issue was with mounting some storage, though it was still unclear which. The instance had 3 EBS volumes attached: an 8GB root volume as /dev/sda1, a 60GB volume as /dev/sdf, and a 1TB volume as /dev/sdg. Following Amazon's very useful guide for attaching secondary volumes, I detached /dev/sda1 from the instance and attached it to another instance, where I could inspect /etc/fstab. I found two entries: one mounting /dev/sdb and one mounting an LVM volume. I commented both out, reattached the volume to the original instance, and tried again. Boot successful.
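
The fstab edit can be scripted rather than done by hand. The following is only a sketch: the function name and the /mnt/rescue mount point in the usage note are hypothetical, and it assumes the broken root volume is already mounted on the rescue instance. It comments out every entry whose device field matches one of the given devices and keeps a .bak copy of the original file.

```shell
# Comment out every fstab entry whose first field (the device) matches one of
# the given device paths. The untouched original is kept as FILE.bak.
# Usage: disable_fstab_entries FILE DEVICE [DEVICE...]
disable_fstab_entries() {
    fstab=$1
    shift
    cp "$fstab" "$fstab.bak" || return 1
    for dev in "$@"; do
        # Prefix matching lines with '#', pass everything else through.
        awk -v dev="$dev" '$1 == dev { print "#" $0; next } { print }' \
            "$fstab" > "$fstab.tmp" && mv "$fstab.tmp" "$fstab" || return 1
    done
}
```

With the volume mounted at a hypothetical /mnt/rescue, the call would look something like disable_fstab_entries /mnt/rescue/etc/fstab /dev/sdb /dev/customer-ebsVG/custLV, run as root.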

Now able to log in to the instance, I next had to reattach the volumes in question. I tried a simple sudo mount -a which failed with the message "mount: special device /dev/customer-ebsVG/custLV does not exist". That tells me that the problem lies with the LVM volume. I haven't touched LVM in years, so I had to refresh myself on commands and syntax.


ubuntu@host:~$ sudo lvdisplay
  Couldn't find device with uuid 'p9JTHn-YoKf-64IE-UnY5-K4v3-u1OQ-0GwFg3'.
  --- Logical volume ---
  LV Name                /dev/customer-ebsVG/custLV
  VG Name                customer-ebsVG
  LV UUID                xvFLZV-EltK-cubd-bh4s-VGec-oM2a-Vk1AV3
  LV Write Access        read/write
  LV Status              NOT available
  LV Size                1.39 TiB
  Current LE             364534
  Segments               2
  Allocation             inherit
  Read ahead sectors     auto

This confirms that LVM is in play, and tells me a little about the overall config including how big it should be.
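
The "Couldn't find device" warning repeats in every LVM command, so it can help to boil it down to the distinct UUIDs LVM is complaining about. A small sketch, with a hypothetical function name; it assumes the warning wording shown above and also tolerates the UUID appearing without quotes.

```shell
# Read LVM tool output on stdin and print each distinct UUID that LVM
# reports as missing, one per line.
missing_pv_uuids() {
    sed -n "s/.*Couldn't find device with uuid '\{0,1\}\([A-Za-z0-9-]\{1,\}\).*/\1/p" | sort -u
}
```

Something like sudo pvdisplay 2>&1 | missing_pv_uuids would then print the one UUID that matters here.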


ubuntu@host:~$ sudo pvdisplay
  Couldn't find device with uuid 'p9JTHn-YoKf-64IE-UnY5-K4v3-u1OQ-0GwFg3'.
  Couldn't find device with uuid 'p9JTHn-YoKf-64IE-UnY5-K4v3-u1OQ-0GwFg3'.
  --- Physical volume ---
  PV Name               /dev/sdg
  VG Name               customer-ebsVG
  PV Size               1.00 TiB / not usable 4.00 MiB
  Allocatable           yes (but full)
  PE Size               4.00 MiB
  Total PE              262143
  Free PE               0
  Allocated PE          262143
  PV UUID               9OrOr5-Jm7u-DYay-8Xbb-L4Kz-6b84-Z3LwvL

  Couldn't find device with uuid 'p9JTHn-YoKf-64IE-UnY5-K4v3-u1OQ-0GwFg3'.
  --- Physical volume ---
  PV Name               unknown device
  VG Name               customer-ebsVG
  PV Size               399.97 GiB / not usable 2.00 MiB
  Allocatable           yes (but full)
  PE Size               4.00 MiB
  Total PE              102391
  Free PE               0
  Allocated PE          102391
  PV UUID               p9JTHn-YoKf-64IE-UnY5-K4v3-u1OQ-0GwFg3

  "/dev/sdf" is a new physical volume of "60.00 GiB"
  --- NEW Physical volume ---
  PV Name               /dev/sdf
  VG Name
  PV Size               60.00 GiB
  Allocatable           NO
  PE Size               0
  Total PE              0
  Free PE               0
  Allocated PE          0
  PV UUID               Wgzu2H-JF0Y-BgsV-ONI3-ONkT-pt9f-1RceL9

This tells me which physical devices are included in the LVM volume group, and how large they are. As you can see, the volume group in question, customer-ebsVG, has two disks of different sizes allocated to it, but only one of them is being seen. A 400GB device is missing.
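
The numbers in the two listings can be cross-checked with a little arithmetic. This sketch uses only the extent counts and the 4 MiB extent size printed above (the variable and function names are made up for illustration). It shows that the two physical volumes together account for exactly the logical volume's extents, so the LV spans both disks and cannot be complete without the missing one.

```shell
# Values copied from the lvdisplay/pvdisplay output above.
pe_mib=4               # PE Size in MiB
pv_sdg_pe=262143       # Total PE on /dev/sdg
pv_missing_pe=102391   # Total PE on the "unknown device"
lv_le=364534           # Current LE of custLV

# Convert an extent count to GiB with two decimals (awk does the float math).
extents_to_gib() {
    awk -v n="$1" -v pe="$2" 'BEGIN { printf "%.2f\n", n * pe / 1024 }'
}

total_pe=$((pv_sdg_pe + pv_missing_pe))
echo "PV extents: $total_pe, LV extents: $lv_le"
echo "missing PV: $(extents_to_gib "$pv_missing_pe" "$pe_mib") GiB"
echo "whole LV:   $(extents_to_gib "$lv_le" "$pe_mib") GiB"
```

The missing volume's extents come to just under 400 GiB, which is the size to look for among unattached disks.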

I looked through the EBS volumes in the AWS GUI to see if perhaps one was simply not attached. I found a few 400GB volumes but after attaching them determined that they were not the droids I was looking for. Here's where it gets a little weird (shout-out to "Stuff They Don't Want You To Know"): remember in fstab there was also an entry for mounting a device /dev/sdb? So, I hadn't mounted it yet, figuring I needed to concentrate on the LVM device. At a loss in trying to find this mysterious VG member, I went ahead and manually mounted /dev/sdb. It shows up in df -h:

ubuntu@host:~$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             7.9G  2.3G  5.3G  30% /
none                  1.9G  124K  1.9G   1% /dev
none                  1.9G     0  1.9G   0% /dev/shm
none                  1.9G   60K  1.9G   1% /var/run
none                  1.9G     0  1.9G   0% /var/lock
/dev/sdb              394G  199M  374G   1% /mnt

Yep, /dev/sdb happens to also be 400GB. Is this the missing volume? If so, why is it mounted as a formatted disk instead of part of the LVM? Alright, whatever, I run pvdisplay again to see if this has magically solved the problem. It hasn't. Next I try activating the volume group explicitly:

sudo vgchange -a y customer-ebsVG

The hope is that the volume group was simply marked as inactive for some reason and that this will change that. It doesn't. I wonder then if maybe /dev/sdb is not supposed to be part of the volume group any more. Maybe it was removed at some point and LVM needs to be cleaned up. I try sudo vgchange -a y customer-ebsVG --partial, which activates the volume group, but there is no data. It seems like everything I need is on the missing device.

At this point I started looking into ways to repair or recover LVM. I found the backup files in /etc/lvm/backup and from there could verify that the missing device was indeed /dev/sdb. sudo blkid showed me the UUIDs of the devices again, and I considered that perhaps /dev/sdb had a different UUID than LVM was expecting and that I might need to change it. I started Googling how to do this. At the same time I reached out to a colleague to get his input. Sometimes explaining the situation to someone else clarifies things for you, and sometimes they'll see something you missed.
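
The metadata backup is plain text, so the UUID-to-device mapping can be pulled out with awk instead of read by eye. A sketch only: the function name is hypothetical, and it assumes the usual backup layout of a physical_volumes block containing pvN blocks, each with an id line followed by a device hint line.

```shell
# Print "pvN UUID device-hint" for each physical volume recorded in an LVM
# metadata backup file (the device is only the hint LVM stored at backup time).
list_backup_pvs() {
    awk '
        $1 == "physical_volumes" && $2 == "{" { in_pvs = 1; next }
        $1 == "logical_volumes"  && $2 == "{" { in_pvs = 0 }
        in_pvs && $1 ~ /^pv[0-9]+$/ && $2 == "{" { name = $1 }
        in_pvs && $1 == "id"     { gsub(/"/, "", $3); id = $3 }
        in_pvs && $1 == "device" { gsub(/"/, "", $3); print name, id, $3 }
    ' "$1"
}
```

Assuming the backup file is named after the volume group, which is LVM's default, this would be run as root with something like list_backup_pvs /etc/lvm/backup/customer-ebsVG. The line carrying the missing UUID shows which device LVM last saw it on.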

I found some very basic instructions online for changing the UUID. I hesitated to pull the trigger because if I made things worse I wasn't sure how we'd recover, and I wasn't entirely sure what damage I would do. I talked with the dev guy to go over a recovery scenario, where we'd have to create a new volume and recreate the directory structure for our clients, which we luckily had a roadmap for. Recreating the directory structure would take less than an hour, so it seemed like we were going down that road.

My colleague, in the meantime, had found a Red Hat page that ultimately had the same instructions I'd found, but in more detail. He hesitated as I did, but ultimately pulled the trigger and ran pvcreate --uuid "p9JTHn-YoKf-64IE-UnY5-K4v3-u1OQ-0GwFg3" --restorefile /etc/lvm/archive/ /dev/sdb which put the LVM back in order and brought the volume group back up.
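
For a destructive step like this, it can be reassuring to print the full plan and read it over before running anything. The sketch below only echoes commands and never executes them. The function name is hypothetical, the metadata file is a parameter you must supply yourself, and the vgcfgrestore and vgchange steps are the follow-up commonly documented for this kind of recovery, not something the account above says was needed.

```shell
# Print (do not run) a typical recovery sequence for a physical volume whose
# LVM label was lost. Review the output, then run the commands by hand as root.
# Usage: print_pv_recovery_plan UUID RESTORE_FILE DEVICE VG
print_pv_recovery_plan() {
    if [ $# -ne 4 ]; then
        echo "usage: print_pv_recovery_plan UUID RESTORE_FILE DEVICE VG" >&2
        return 1
    fi
    uuid=$1 restore_file=$2 device=$3 vg=$4
    # 1. Recreate the PV label with the UUID the volume group expects.
    echo "pvcreate --uuid \"$uuid\" --restorefile \"$restore_file\" \"$device\""
    # 2. Restore the volume group metadata from the same file.
    echo "vgcfgrestore --file \"$restore_file\" \"$vg\""
    # 3. Activate the volume group.
    echo "vgchange -a y \"$vg\""
}
```

Printing first keeps the decision separate from the typing: the UUID, the metadata file, and the target device are all visible on one screen before anything touches the disk.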

So, not only did I get a little refresher course in LVM due to our issue (as well as troubleshooting some AWS issues), I have yet another reminder about the importance of sometimes just pulling the trigger and not being so afraid of being wrong or messing up. I had the solution to the problem in my hand, but my fear of making a mistake stopped me from seeing it through.
