Wednesday, January 4, 2017

EC2 Not Booting Main Course, With a Little LVM Sauce on the Side

One of our devs came to me to report that he couldn't log in to an EC2 instance that was acting as our FTP server. He'd tried to do so to investigate a report from a customer that they couldn't upload something. Right away there's a big failure here: the customer should never be the first person to tell you that your system is broken or down. If they are, your monitoring has failed.

I checked the monitoring tab for the instance (after verifying that I also could not SSH to the instance) and saw that CPU was pegged solidly at 100%—for like 3 days.

Way bad.

This is an EC2 instance that time forgot, so little information was known about it. My first thought was to create an AMI before we did anything else so that we wouldn't lose everything. I started the AMI creation but it hung for a long time. I deregistered it to stop the AMI creation process, and tried rebooting the instance. A reboot failed, but a stop/start worked and AWS showed the instance as running. It was still inaccessible via SSH though. One more stop/start to be sure, and time to dig deeper.

I was at a loss as to why the instance was inaccessible. I eventually found the system log in AWS and saw an entry of mountall: Event failed. That told me the issue was with mounting some storage, though it wasn't yet clear which device. The instance had three EBS volumes attached: an 8GB root volume at /dev/sda1, a 60GB volume at /dev/sdf, and a 1TB volume at /dev/sdg. I followed Amazon's very useful guide for attaching secondary volumes, detached the root volume from the instance, and attached it to another instance so I could inspect /etc/fstab. I found two entries of interest: one mounting /dev/sdb and one for an LVM mount. I commented both out, reattached the volume to the original instance, and tried again. Boot successful.
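
For reference, the two entries looked something like this (a reconstruction; the mount points and filesystem types are hypothetical since I didn't save the original file):

/dev/sdb                    /mnt    auto  defaults  0  0
/dev/customer-ebsVG/custLV  /data   ext4  defaults  0  0

Commenting them out was just a matter of prefixing each line with #. As an aside, on Upstart-era Ubuntu (this box was using mountall) adding the nobootwait option to entries like these should keep a missing device from hanging the whole boot.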

Now able to log in to the instance, I next had to reattach the volumes in question. I tried a simple sudo mount -a which failed with the message "mount: special device /dev/customer-ebsVG/custLV does not exist". That tells me that the problem lies with the LVM volume. I haven't touched LVM in years, so I had to refresh myself on commands and syntax.

lvdisplay

ubuntu@host:~$ sudo lvdisplay
  Couldn't find device with uuid 'p9JTHn-YoKf-64IE-UnY5-K4v3-u1OQ-0GwFg3'.
  --- Logical volume ---
  LV Name                /dev/customer-ebsVG/custLV
  VG Name                customer-ebsVG
  LV UUID                xvFLZV-EltK-cubd-bh4s-VGec-oM2a-Vk1AV3
  LV Write Access        read/write
  LV Status              NOT available
  LV Size                1.39 TiB
  Current LE             364534
  Segments               2
  Allocation             inherit
  Read ahead sectors     auto

This confirms that LVM is in play, and tells me a little about the overall config including how big it should be.

pvdisplay

ubuntu@host:~$ sudo pvdisplay
  Couldn't find device with uuid 'p9JTHn-YoKf-64IE-UnY5-K4v3-u1OQ-0GwFg3'.
  Couldn't find device with uuid 'p9JTHn-YoKf-64IE-UnY5-K4v3-u1OQ-0GwFg3'.
  --- Physical volume ---
  PV Name               /dev/sdg
  VG Name               customer-ebsVG
  PV Size               1.00 TiB / not usable 4.00 MiB
  Allocatable           yes (but full)
  PE Size               4.00 MiB
  Total PE              262143
  Free PE               0
  Allocated PE          262143
  PV UUID               9OrOr5-Jm7u-DYay-8Xbb-L4Kz-6b84-Z3LwvL

  Couldn't find device with uuid 'p9JTHn-YoKf-64IE-UnY5-K4v3-u1OQ-0GwFg3'.
  --- Physical volume ---
  PV Name               unknown device
  VG Name               customer-ebsVG
  PV Size               399.97 GiB / not usable 2.00 MiB
  Allocatable           yes (but full)
  PE Size               4.00 MiB
  Total PE              102391
  Free PE               0
  Allocated PE          102391
  PV UUID               p9JTHn-YoKf-64IE-UnY5-K4v3-u1OQ-0GwFg3

  "/dev/sdf" is a new physical volume of "60.00 GiB"
  --- NEW Physical volume ---
  PV Name               /dev/sdf
  VG Name
  PV Size               60.00 GiB
  Allocatable           NO
  PE Size               0
  Total PE              0
  Free PE               0
  Allocated PE          0

  PV UUID               Wgzu2H-JF0Y-BgsV-ONI3-ONkT-pt9f-1RceL9

This tells me which physical devices make up the volume group and how large they are. As you can see, the volume group in question, customer-ebsVG, has two disks of different sizes allocated to it, but only one of them is being seen. A 400GB device is missing.

I looked through the EBS volumes in the AWS GUI to see if perhaps one was simply not attached. I found a few 400GB volumes, but after attaching them determined that they were not the droids I was looking for. Here's where it gets a little weird (shout-out to "Stuff They Don't Want You To Know"): remember that fstab also had an entry for mounting a device /dev/sdb? I hadn't mounted it yet, figuring I needed to concentrate on the LVM device. At a loss trying to find this mysterious VG member, I went ahead and manually mounted /dev/sdb. It shows up in df -h:

ubuntu@host:~$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             7.9G  2.3G  5.3G  30% /
none                  1.9G  124K  1.9G   1% /dev
none                  1.9G     0  1.9G   0% /dev/shm
none                  1.9G   60K  1.9G   1% /var/run
none                  1.9G     0  1.9G   0% /var/lock
/dev/sdb              394G  199M  374G   1% /mnt

Yep, /dev/sdb happens to also be 400GB. Is this the missing volume? If so, why is it mounted as a formatted disk instead of part of the LVM? Alright, whatever, I run pvdisplay again to see if this has magically solved the problem. It hasn't.

sudo vgchange -a y customer-ebsVG

The hope is that the volume group was simply marked as inactive for some reason and this will change that. It doesn't. I wonder then if maybe /dev/sdb isn't supposed to be part of the volume group any more; maybe it was removed and LVM just needs to be cleaned up. I tried sudo vgchange -a y customer-ebsVG --partial, which activated the volume group, but there was no data there. It seemed like everything I needed was on the missing device.

At this point I started looking into ways to repair or recover LVM. I found the backup files in /etc/lvm/backup and from there could verify that the missing device was indeed /dev/sdb. sudo blkid showed me the UUIDs of the devices again, and I considered that perhaps /dev/sdb had a different UUID than LVM was expecting, and that I might need to change it. I started Googling how to do this. At the same time I reached out to a colleague for his input. Sometimes explaining the situation to someone else helps clarify things, and sometimes they'll see something you missed.
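
If you need to make the same comparison, it looks roughly like this (a sketch; the grep patterns are just one way to pull the interesting fields out of the backup file):

ubuntu@host:~$ sudo blkid | grep -i lvm
ubuntu@host:~$ sudo grep -E 'id =|device =' /etc/lvm/backup/customer-ebsVG

The device = lines in the backup file are hints recording which block device each PV UUID was last seen on, which is how I could confirm the missing PV had lived on /dev/sdb.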

I found some very basic instructions for changing the UUID here. I hesitated to pull the trigger because if I made things worse I wasn't sure how we'd recover, and I wasn't entirely sure what damage I might do. I talked with the dev to go over a recovery scenario, in which we'd create a new volume and recreate the directory structure for our clients, which we luckily had a roadmap for. Recreating the directory structure would take less than an hour, so it seemed like we were headed down that road.

My colleague in the meantime had found this Red Hat page, which ultimately had the same instructions I'd found but in more detail. He hesitated as I did, but ultimately pulled the trigger and ran pvcreate --uuid "p9JTHn-YoKf-64IE-UnY5-K4v3-u1OQ-0GwFg3" --restorefile /etc/lvm/archive/customer-ebsVG_00007.vg /dev/sdb, which put the LVM metadata back in order and brought the volume group back up.
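
For the record, the full recovery follows the Red Hat procedure and boils down to something like this (a sketch; the archive file and volume group names are from our case, and you'd want a snapshot of the volumes before trying any of it):

ubuntu@host:~$ sudo pvcreate --uuid "p9JTHn-YoKf-64IE-UnY5-K4v3-u1OQ-0GwFg3" \
    --restorefile /etc/lvm/archive/customer-ebsVG_00007.vg /dev/sdb
ubuntu@host:~$ sudo vgcfgrestore customer-ebsVG
ubuntu@host:~$ sudo vgchange -a y customer-ebsVG
ubuntu@host:~$ sudo fsck /dev/customer-ebsVG/custLV
ubuntu@host:~$ sudo mount -a    # after restoring the fstab entries commented out earlier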

So, not only did I get a little refresher course in LVM out of this issue (along with some practice troubleshooting AWS), I also got yet another reminder about the importance of sometimes just pulling the trigger and not being so afraid of being wrong or messing up. I had the solution to the problem in my hand, but my fear of making a mistake stopped me from seeing it through.

Wednesday, April 22, 2015

Proxy ARP and the ASA

We got an ASA 5545-X to replace our 5505 (finally!). My task is to do the upgrade. I thought my biggest challenge would be translating the old-school NAT statements (ASA 8.2 and older) into the new syntax of the 8.3+ releases.



So my big plan was to simply configure the new ASA the same as the old one with a few differences—namely the IPs assigned to the inside and outside interfaces. That way I could add the new ASA to the network, hanging off of our core switch just like the old one, and test that the config (L2L and IPSec VPN, NAT translation, routing, etc.) worked. It'd look a little like this when I was done:


I tend to really, really dwell on things before I implement them though. Sometimes it's a bad thing, as in it takes me longer to complete a task, and other times it saves my bacon. This would be one of those times, because before I just threw the firewall on the network I stopped and thought, In what way could this backfire and cause disruption to the production network? More specifically I was wondering if there was any way that the new ASA would/could respond to requests meant for the old ASA. That's when I began to think about proxy arp, and developed a splitting headache (totally coincidental, I'm sure). I popped onto the Cisco support forums to craft a lengthy post, and when I didn't get responses fast enough for my liking I went to the Cisco IRC channel and asked. I got the following replies: 

Random1: never use proxy arp
for anything

Random2: turn it off

You win some, you lose some on IRC, amirite? I lost, and got the IRC equivalent of "Nick Burns, Your Company's Computer Guy". I tried to ask some clarifying questions and got radio silence, so I went out and did a little more reading on my own. It's not a hard concept in the end, but few people seem able to clearly explain how it works and its implications, so I'm gonna try, because I can do soooo much better. That was sarcasm.

Here's the rundown on proxy arp:

It exists on networking devices to let them answer ARP requests on behalf of hosts that live on interfaces/networks other than the one the request came in on. Confusing? Yeah, try reading about it on Cisco's site. Basically, if for some reason your host at 192.168.1.10 ARPs for its destination host at 12.2.3.4 directly instead of for the default gateway, the router will see that ARP request, and if it has a route toward 12.2.3.4 out another interface it will respond with its own MAC address and forward the traffic along.

ASAs do the same sort of thing, but instead of being based on the routing table it's based on NAT statements, whether static or dynamic, and aliases. In a lot of standard ASA configurations, especially for smaller businesses, you'll have an IP address on your outside interface that also serves as your global NAT address for internal IPs. That's a scenario in which you need proxy arp enabled. I checked our existing ASA config with the show run all command and sure enough, noproxyarp is disabled, which means proxy arp is enabled. That makes sense, of course. We're running a standard config, and if we didn't have proxy arp on we wouldn't be able to pass traffic to any of our publicly accessible hosts (like web servers), and VPN wouldn't work either. Clearly the suggestion to "turn it off" was pretty short-sighted.

Now, this is not the case in every situation. If your upstream routes to your ASA—say, your ISP routes your public block to your outside interface instead of ARPing for each address—then you can turn proxy arp off. If your NAT addresses live on a different network from the IP of your interface, you can turn it off then as well. That's not the case in our house, though.
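
If you want to check or change this on your own box, it looks roughly like this (a sketch; the interface name here is the usual "outside" and may differ in your config):

! see whether proxy arp has been disabled on any interface
show running-config all sysopt

! disable proxy arp on the outside interface (only sensible if your upstream
! routes to you and your NATs don't live on the interface subnet)
sysopt noproxyarp outside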

What does this mean for me? It means my very simple plan is no longer so simple. I can turn off proxy arp, leave out the NAT statements on the new ASA, and test that I can at least reach it from outside, but that won't help with the more complex scenarios like testing NAT translation and routing to the internal network and across networks. I'm either going to have to configure it and pull the trigger, hoping it works, or figure something else out. Maybe it's time to check out GNS3 again?

Thursday, March 5, 2015

Wget Mystery

This is just a quick snippet of some one-off weirdness. I was trying to deploy the Community Edition of Aerospike via Chef. I created a recipe and tested it out with Vagrant successfully before pushing it up to our Chef server. I then attempted to deploy the recipe to our 8-node Aerospike cluster and got error messages on the extract step. The recipe looks like this:

aerospike_tar_directory = "#{Chef::Config[:file_cache_path]}"
aerospike_tar = "#{aerospike_tar_directory}/aerospike-server-community-3.5.3-el6.tgz"
aerospike_temp_dir = Dir.mktmpdir

remote_file aerospike_tar do
  source node[:aerospike][:url]
  mode 00644
end

directory aerospike_temp_dir do
  owner 'root'
  group 'root'
  mode '0755'
end

execute 'extract_aerospike' do
  command "tar zxf #{aerospike_tar}"
  cwd aerospike_temp_dir
  action :run
end

user 'aerospike' do
end

directory node[:aerospike][:dir] do
  owner 'aerospike'
  mode '0755'
  recursive true
end

script 'move_aerospike' do
  action :run
  cwd aerospike_temp_dir
  interpreter 'bash'
  code <<-EOH
  mv #{aerospike_temp_dir}/aerospike-server-community-*/* #{node[:aerospike][:dir]}
  EOH
end

execute "install_aerospike" do
command "cd #{node[:aerospike][:dir]}; ./asinstall; chown -R aerospike:aerospike #{node[:aerospike][:dir]}; rm #{node[:aerospike][:dir]}/*.rpm"
end

service "aerospike" do
action [ :start, :enable ]
end


The error message I was receiving indicated that the downloaded file, aerospike.tgz, was not in fact a tarball. Sure enough, I tested it and it was a text file with HTML inside, saying the resource wasn't found. I hit the same URL directly from my local machine and confirmed that on my Mac I got a properly formatted .tgz file. I then ran wget on the server and was confused when it did not download the file I was expecting; instead it downloaded a file with a different name altogether. I repeated this process a couple of times to verify that I wasn't hallucinating.

A little research led me to the flag --trust-server-names. This flag matters when there are multiple redirects on the resource (which there were in this case). By default wget names the downloaded file after the last component of the original URL; with --trust-server-names it uses the last component of the final redirection URL instead. Thus, if I ran wget http://aerospike.com/download/server/latest/artifact/el6 without the flag, wget would drop a file named el6 in my directory instead of the expected aerospike-server-community-*.tgz file.
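
In other words (a sketch; the exact final filename depends on where the redirects end up):

# default: the file is named after the request URL's last component
wget http://aerospike.com/download/server/latest/artifact/el6
# -> saves as "el6"

# name the file after the final redirect instead
wget --trust-server-names http://aerospike.com/download/server/latest/artifact/el6
# -> saves as whatever the final URL points at, e.g. aerospike-server-community-<version>-el6.tgz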

My Chef recipe still isn't exactly working, but I've solved one small problem at least.

Thursday, September 25, 2014

Troubleshooting Haproxy

I received an email from our R&D team stating that they were seeing fewer HTTP requests than expected for a particular client/campaign, on the order of ~30% fewer. He had looked at the haproxy stats page and recorded that, for a given time period, the frontend was logging 1,688,298 sessions vs 1,186,079 sessions on the backend. I was asked to look into the cause of this discrepancy.

I took a look at the logs and didn't feel like they were giving me much insight into what was happening between the frontend and backend. The logs were also being sent to /var/log/messages, so they were pretty noisy to look through. I edited haproxy.cfg to send the logs to their own file and also added the log global parameter to the backend section. After a reload of haproxy I went back to the logs. They looked the same, just in a different spot. After making some inquiries on the haproxy IRC channel I learned that haproxy was indeed logging information about the backend connections; the log lines were just in a format I wasn't accustomed to, and there is actually a very detailed section on logging in the haproxy documentation that explains what I was looking at.
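
The logging changes amounted to something like this (a sketch, assuming haproxy logs through the local rsyslog on the local0 facility; the backend name is the one from our config, and your facility and file names may differ):

# /etc/haproxy/haproxy.cfg
global
    log 127.0.0.1 local0    # rsyslog must be listening for UDP syslog locally

defaults
    option httplog

backend webservers
    log global

# /etc/rsyslog.d/haproxy.conf -- give local0 its own file and keep it
# out of /var/log/messages
local0.*    /var/log/haproxy.log
& ~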

Here's an example of the numbers I was working with (this is after the reload):

Frontend
- 3434009 requests total
- 79263 3xx errors
- 1283482 4xx errors
- 30396 5xx errors

Backend
- 79306 3xx errors
- 136 4xx errors
- 30396 5xx errors

Those are numbers gathered from the stats page. I compared them against numbers I was getting from grepping haproxy.log, and they didn't match up across the board.
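
One quick way to pull per-status counts out of the log is something like this (a sketch; the status code is the 11th whitespace-separated field in the default HTTP log format as it lands in the file, so adjust the field number if your syslog prefix differs):

awk '{ print $11 }' /var/log/haproxy.log | sort | uniq -c | sort -rn

That gave me: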


Status   Count
206      6
304      221635
400      230
404      11
405      60
408      21
500      15058
503      30396

The 503 count matched what I was seeing on the stats page: 30,396. That number makes sense in terms of what a 503 means. According to w3.org, a 503 error means:

The server is currently unable to handle the request due to a temporary overloading or maintenance of the server. The implication is that this is a temporary condition which will be alleviated after some delay. If known, the length of the delay MAY be indicated in a Retry-After header. If no Retry-After is given, the client SHOULD handle the response as it would for a 500 response.

The 503 lines from the logs look like this:

Sep 24 16:39:21 localhost haproxy[8562]: someip:54196 [24/Sep/2014:16:39:13.668] public webservers/srv1 8/8268/-1/-1/8276 503 109 - - SC-- 454/454/168/5/+5 0/37 {Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36|http://abunchofstuff} "GET /?requesthere HTTP/1.1"

According to the documentation, the flags in that SC-- field describe the session state at termination: the first character is what caused the session to end (S means the server aborted or refused the connection), and the second is the state the session was in at that point (C means haproxy was still waiting for the connection to the server to be established). In other words, these requests died trying to connect to the backend, which lines up with the 503s.

Monday, July 28, 2014

You Touched It Last, And Now You Own This Cluster @#$% (XenServer Dom0 Internal Error)

I was asked to spin up a vm and install WordPress. Piece of cake, LAMP then WP then done (with a solemn nod to my Golden Rule). Except...when I tried to create the VM I got this message:



Oh.

This is a four-host XenServer cluster that I have barely touched other than to do what I was trying to do here: spin up a vm. Back in the heady days of yore when we were a bursting at the seams team of five, this cluster belonged to Someone Else. Someone Else is gone, and it's now one of those un-loved, you-touched-it-you-own-it systems. You know what I'm talking about.

So, great. I hit the Googles because, while I know enough to get the gist of the error message (some block device for my vm is attaching to dom0, which is the control domain, Ground Zero for Xen), I had no idea what to do about it. This is apparently a common error message, as I found a bunch of posts about it. Unfortunately, in most cases the OP would come back and say, "Hey, I rebooted it! All good now!". This made me smack my head and groan, because ideally there'd be a solution that did not require a reboot, this being a prod server and all. I wasn't alone in this, but no one had a clear answer. The few posts that did attempt to fix the error didn't go into enough detail for someone not intimately familiar with the inner workings of Xen's virtualization stack. For example, there were a lot of suggestions to simply run xenstore-ls and use that output to delete the offending device. Have you ever run xenstore-ls? There's a lot of output, and finding anything online that tells you what exactly you're looking at is pretty much impossible.

I tried cobbling together a bunch of commands from various sites and skimming the admin guide, but eventually gave up under the sheer weight of information overload and started considering the reboot. I looked into live migration (or whatever Xen's version of that is called), because in theory we should be able to move guests around for just this kind of thing. This is what I saw:



You know what I didn't want to do? Start trying to solve yet another problem in the midst of the one I was currently struggling with. That meant a reboot was definitely going to take some prod vms down. As a last ditch effort I hit the Citrix forums.

I lucked out. Sometimes you can go to these forums and wait forever for someone to respond to a post, but I got a response fairly quickly here. Score 1 for Citrix. The poster suggested I do the following:

xe vbd-list vm-uuid=<UUID of VM>
xe vbd-unplug uuid=<UUID from previous command>
xe vbd-destroy uuid=<UUID from first command>
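
If you're in the same boat, the flow looks roughly like this (a sketch; the VM name label is hypothetical, and the params/--minimal options are just there to trim xe's output):

# find the VM's UUID from its name label
xe vm-list name-label=my-wordpress-vm params=uuid --minimal

# list its virtual block devices and whether they're still attached
xe vbd-list vm-uuid=<vm-uuid> params=uuid,device,currently-attached

# unplug and destroy the stuck VBD
xe vbd-unplug uuid=<vbd-uuid>
xe vbd-destroy uuid=<vbd-uuid>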


I ran those commands and tried to start the vm up again. Guess who came back to join the party?



Looks like I'm diving down that rabbit hole after all. 

I checked the Networking tab for the VM as per this guide and found that, in comparison to other vms, this one had three Networks attached instead of two. I deleted the third and tried again. This time I received a bootloader error: "The bootloader for this VM returned an error -- did the VM installation succeed? Error from bootloader: no bootable disk". That makes sense if I think about it, because I did run a command with the word "destroy" in it a few minutes ago.

I deleted the vm and started over. This time I only got the network error message, not the Dom0 message, so that's a positive sign. I deleted Network 3 again (I will probably have to find out what this is and why it's being added to new VMs), and I finally got the VM to start up. Yay. Now I can actually, you know...do what I was supposed to be doing all this time. Progress!

Tuesday, April 1, 2014

Zabbix Performance

Email and monitoring are two services that I believe can and should be outsourced if you don't have the people power to dedicate to them fully. Like so many technologies they can be fairly simple to get up and running, but how often is the out-of-the-box install sufficient for your business? You can install Exchange fairly easily, and there are countless tutorials and articles out there to guide you through it, but what happens when it breaks or when some new functionality needs to be added? Monitoring systems love to advertise with catchy slogans like "Get your entire infrastructure monitored in less than 10 minutes". That's true...if all you're monitoring are basic Linux stats like disk space, cpu, memory, etc. Need to add JMX queries, or monitor a custom app? Now it's time to roll up your sleeves and get to work.

We were using LogicMonitor to monitor our entire infrastructure when I came onboard almost two years ago (my Old Boss had brought it in). Every device, Java apps, file sizes, whatever custom queries we needed to keep an eye on our production platform, was handled by this app. My New Old Boss (who started about 6 months ago after my Old Boss quit, and has since left as well) came in and wanted to chuck the system. He was a big advocate of "if you can do it in-house, do it". LogicMonitor wasn't perfect, and we had our share of problems with it, but in hindsight a large part of that pain was that we hadn't followed through with set up. We were using default values and triggers for a lot of things, and they created noise. We didn't tune the metrics we received from LogicMonitor's setup team to match our environment. Rather than invest the time to learn the system we had, we scrapped it and went with Zabbix.

I hate Zabbix. I hated it from the beginning. We didn't have much of a burn-in of the product; it got installed, we started "testing" it, and suddenly it was our production monitoring platform for all of the new equipment we were putting into production. This was part of a major rollout as we were introducing a new platform, so it all became one and the same. One of my biggest complaints about Zabbix was that it wasn't user-friendly. Adding a host and its metrics to Zabbix was a pretty unnatural multi-step process, as was adding users/groups and setting up alerts. For example, to set up a host with alerting on particular metrics you have to first add the host, then create items, then create triggers for the items, then create actions for the triggers. You can create host groups to aggregate hosts of like purpose, and you can create a template to group items, but you can't apply a template to a host group—you still have to apply it to hosts individually. When I raised concerns about the complexity of its front-end—because really, a GUI should not be more difficult to use than, say, creating the same thing from the command line in Nagios—New Old Boss explained that Zabbix was "expert-friendly".

Months later, New Old Boss has moved on and I've inherited Zabbix, which I had pretty much let him handle while he was here, and what a bag of trouble it has turned into. "Bag of trouble" is of course polite blog speak for what I really want to call it.

Monday, February 3, 2014

What's Up With RAID?

Recently we received an alert that we were running out of disk space on our Exchange server. The first thought was "add more space". This is an older box, an HP DL360 G5 with a hardware RAID controller. After looking at its configuration more closely we found that the chassis was actually full. We were using a fairly standard setup: two drives in RAID 1 for the OS, and four drives for Exchange itself, three of them in a RAID 5 array with the fourth assigned as a "hot spare". So there was no way to really expand the size of the array without doing some very intrusive and time-intensive things like swapping in larger drives one by one and rebuilding the array. There are no backups.

Now, let me be clear: I know that lack of backups is crazy. What you see here is something that I think happens at a lot of smaller companies that are mainly based on Linux. Operations, IT—whatever you want to call it—focuses on hiring Linux admins, and Linux admins treat Windows as the plague. No one wants to be in charge of it, no one wants to touch it. This was the case at my old company, and remains the case here.

Our team made the decision to convert the hot spare into a member of the array and expand it. Since we were using an HP Smart Array P400i, the controller supported doing this, so we did. Our manager was not pleased when he found out. He was concerned that we had taken away the hot spare, leaving us open to potential disaster if a disk failed. We rushed to assure him that a RAID 5 array can lose one disk without the whole server falling over, but he was still not exactly psyched that we no longer had that spare.

In the meantime, while the expansion and extension of the array had gone fine (the HP RAID utility now showed the additional drive space as available), the new space was not visible to the OS. This should not have been the case; the additional disk space should have shown up as unallocated in Disk Management, and we should have been able to extend the logical drive in Windows to use the remaining free space. Up until that point I had been going on historical knowledge of HP, the P400i card in particular, and the utilities HP provides for managing the array. Now I had to dig deeper into the technology, and what was supposed to be a quick fix turned into an interesting discovery.

Apparently, while I was traipsing around in the land of Linux here, where we use mainly commodity hardware with LVM and software RAID for all of our servers except the legacy HPs (which are only used for Windows services), RAID 5 had fallen out of favor. In my searches I happened across a Spiceworks thread wherein one of the posters opined loudly and often that RAID 5 was THE WORST choice for modern systems and likely to cause problems instead of mitigate them. This came as news to me. RAID 5 was the de facto standard for every implementation I did as a consultant with the outsourced IT provider I worked for. They had a blueprint for how systems were to be deployed, and it was non-negotiable. That was only a few years ago. In light of this I followed some links to find out more.

The basic workings of RAID 5 aren't a mystery. Parity is the key to its resiliency. For every stripe of data, a parity block is calculated using an XOR operation and written to one of the disks; the parity rotates across the disks, so any one disk can fail and the array can rebuild from the remaining data and parity once you add a new disk. This all sounds good on paper, so why the dire warnings? Well, it apparently has more to do with the reliability of the drives themselves, UREs, and failure rates. URE is an acronym for unrecoverable read error. I'll admit I was not versed in this particular concept, so let me explain it here in case you need a refresher.
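
Before we get to UREs, the XOR trick itself is easy to see with a toy example (three made-up data bytes standing in for three data disks):

$ a=0xA5; b=0x3C; c=0xF0                              # data blocks on disks 1-3
$ parity=$(( a ^ b ^ c ))                             # parity block on disk 4
$ printf 'parity    = 0x%X\n' "$parity"
parity    = 0x69
$ printf 'rebuilt b = 0x%X\n' $(( parity ^ a ^ c ))   # "lose" disk 2, rebuild it
rebuilt b = 0x3C

Now, on to why that rebuild step is where things can go wrong.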

Disks fail. Over time a standard spindle-based disk (as opposed to an SSD, which has its own wear specs) will occasionally fail to read back a sector. In general you don't notice this, especially in some kind of RAID setup, because the spot is marked as bad and the world keeps on spinning. No pun intended. The figure folks seem to agree on for consumer SATA drives is about one unrecoverable read error per 10^14 bits read. 10^14 bits is roughly 12.5TB of data, so if you have a 2TB array and read all of it six times over the course of its existence, statistically you can expect to hit at least one URE. The idea, then, is that when a disk dies in a RAID 5 array, the rebuild has to read every remaining disk in full, and that's exactly when you're likely to encounter a URE. Best case, you simply lose some data; worst case, the entire rebuild borks and you're left without an array, praying that you have a backup.
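
Here's the back-of-the-envelope math (a rough sketch using that commonly cited 10^14 spec, which is a statistical average, not a guarantee):

$ awk 'BEGIN { printf "data per expected URE: %.1f TB\n", 10^14 / 8 / 1e12 }'
data per expected URE: 12.5 TB
$ awk 'BEGIN { printf "expected UREs after 6 full reads of 2TB: %.2f\n", (2e12 * 8 * 6) / 10^14 }'
expected UREs after 6 full reads of 2TB: 0.96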

Now, these warnings seem to pertain to large-capacity (2TB+) SATA drives more than anything else. From what I've read, SAS drives are less prone (rated around one URE per 10^16 bits), and the more drives you have in your array, the more likely you are to hit a URE during a RAID recovery. On top of that, a hot spare is apparently not always a good thing. I had sort of pssh'd the idea of a hot spare with RAID 5; this is in fact the first setup I've encountered where anyone actually designated one. If a disk fails and you have monitoring and alerting set up properly, the chance of a second one failing in the time it takes you to replace the first seemed trivial. Another article I read actually called the concept of a hot spare dangerous when combined with RAID 5 and URE factors. If a system automatically starts rebuilding an array onto a spare, as is the case with our P400i, you can end up with a complete array meltdown without even knowing about it. In that author's scenario, a drive fails in a 3-drive array, the hot spare comes online and the rebuild starts, then it hits a URE and the rebuild fails altogether even though no second drive ever actually failed. So now not only do you have a down server, you had no warning about it and no chance to intervene. Intervening in this case would mean taking a backup of your data (or verifying an existing backup) before starting the rebuild, and being prepared to restore data if necessary.

Of course this makes me start to second-guess our setup and evaluate the options. In general I think the warnings about using RAID 5 aren't quite as serious as they're made out to be, at least in our situation. We are in fact using RAID 5 with four drives (or are in the process of it), but they're all 146GB SAS drives, nowhere near the big capacities being cited in the examples out there. So while the alarms aren't ringing off the hook about the risks of our particular setup, it was valuable to learn about some of the other gotchas possible with RAID 5, and to consider that a future migration to another RAID configuration—especially as we look at increasing disk capacity for growth—may be worth it.