Monday, July 28, 2014

You Touched It Last, And Now You Own This Cluster @#$% (XenServer Dom0 Internal Error)

I was asked to spin up a VM and install WordPress. Piece of cake: LAMP, then WP, then done (with a solemn nod to my Golden Rule). Except...when I tried to create the VM, I got this message:

[Screenshot: XenServer's "Internal error" dialog about a device attached to dom0]

Oh.

This is a four-host XenServer cluster that I have barely touched other than to do what I was trying to do here: spin up a VM. Back in the heady days of yore, when we were a bursting-at-the-seams team of five, this cluster belonged to Someone Else. Someone Else is gone, and it's now one of those un-loved, you-touched-it-you-own-it systems. You know what I'm talking about.

So, great. I hit the Googles because, while I knew enough to get the gist of the error message (some block device for my VM was still attached to dom0, the privileged control domain, Ground Zero for Xen), I had no idea what to do about it. This is apparently a common error, as I found a bunch of posts about it. Unfortunately, in most cases the OP would come back and say, "Hey, I rebooted it! All good now!" This made me smack my head and groan, because ideally there'd be a solution that did not require a reboot, this being a prod server and all. I wasn't alone in this, but no one had a clear answer.

The few posts that did attempt to fix the error didn't go into enough detail for someone not intimately familiar with the inner workings of Xen's virtualization stack. For example, there were a lot of suggestions to simply run xenstore-ls and use that output to delete the offending device. Have you ever run xenstore-ls? There's a lot of info, and finding anything online that tells you what exactly you're looking at is pretty much impossible.
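
If you're curious, the xenstore-ls advice generally boils down to something like the following. This is a sketch, assuming a stock Xen layout where dom0's block-device backends live under /local/domain/0/backend/vbd:

# Dump only dom0's block-device backend entries instead of the whole tree
xenstore-ls /local/domain/0/backend/vbd

# Or dump everything with full paths and grep for likely suspects
xenstore-ls -f | grep vbd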

I tried cobbling together a bunch of commands from various sites and skimming the admin guide, but eventually gave up under the sheer weight of information overload and started considering the reboot. I looked into live migration (or whatever Xen's version of that is) because, in theory, we should be able to move guests around for just this kind of thing. This is what I saw:

[Screenshot: the migration attempt throwing yet another error]

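For reference, XenServer's version of live migration is XenMotion, and the CLI form is roughly the following (the VM and host names are placeholders, and the hosts need shared storage for this to work):

xe vm-migrate vm=<VM name or UUID> host=<destination host> live=true
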
You know what I didn't want to do? Start trying to solve yet another problem in the midst of the one I was currently struggling with. That meant a reboot was definitely going to take some prod VMs down. As a last-ditch effort, I hit the Citrix forums.

I lucked out. Sometimes you can go to these forums and wait forever for someone to respond to a post, but I got a response fairly quickly here. Score 1 for Citrix. The poster suggested I do the following:

xe vbd-list vm-uuid=<UUID of VM>                   # list the VM's virtual block devices (VBDs)
xe vbd-unplug uuid=<UUID from previous command>    # detach the stuck VBD
xe vbd-destroy uuid=<UUID from first command>      # delete the VBD record entirely
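
(Bringing the VM back up afterwards should just be the usual start command:)

xe vm-start uuid=<UUID of VM>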


I ran those commands and tried to start the VM up again. Guess who came back to join the party?

[Screenshot: the same dom0 internal error, back for round two]

Looks like I'm diving down that rabbit hole after all. 

I checked the Networking tab for the VM as per this guide and found that, in comparison to other VMs, this one had three Networks attached instead of two. I deleted the third and tried again. This time I received a bootloader error: "The bootloader for this VM returned an error -- did the VM installation succeed? Error from bootloader: no bootable disk". This makes sense if I think about it, because I did run a command with the word "destroy" in it a few minutes ago.
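
As for deleting that third network: I did it through XenCenter, but the CLI equivalent should be something like listing the VM's virtual interfaces and destroying the extra one:

xe vif-list vm-uuid=<UUID of VM>               # shows each VIF's uuid and which network it's on
xe vif-destroy uuid=<UUID of the extra VIF>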

I deleted the VM and started over. This time I only got the network error message, not the Dom0 message, so that's a positive sign. I deleted Network 3 again (I will probably have to find out what it is and why it's being added to new VMs), and I finally got the VM to start up. Yay. Now I can actually, you know...do what I was supposed to be doing all this time. Progress!
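
When I do go digging into what Network 3 actually is, the starting point will probably be something like this (those params are the standard xe network fields):

xe network-list params=uuid,name-label,name-description,bridge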
