Monday, February 3, 2014

What's Up With RAID?

Recently we received an alert that we were running out of disk space on our Exchange server. The first thought was "add more space". This is an older box, an HP DL360 G5 with a RAID HBA. After looking at its configuration more closely, we found out that the chassis was actually full. We were using a standard setup: two RAID 1 drives for the OS and four RAID 5 drives for Exchange itself, with the fourth drive designated as a "hot spare". That meant there was no way to really expand the size of the array without doing something very intrusive and time-intensive, like adding larger drives one by one and rebuilding the array. There are no backups.

Now, let me be clear: I know that the lack of backups is crazy. What you see here is something that I think happens at a lot of smaller companies that are mainly based on Linux. Operations, IT, whatever you want to call it, focuses on hiring Linux admins, and Linux admins treat Windows like the plague. No one wants to be in charge of it, no one wants to touch it. This was the case at my old company, and it remains the case here.

Our team made the decision to convert the hot spare into a member of the array and expand it. Since we were using an HP P400i, which supports this kind of expansion, we went ahead and did it. Our manager was not pleased when he found out. He was concerned that we had taken away the hot spare, leaving us open to potential disaster if a disk failed. We rushed to assure him that a RAID 5 array can lose one disk without the whole server falling over, but he was still not exactly psyched that we no longer had that spare.

In the meantime, while the expansion and extension of the array had gone fine (the HP RAID utility now showed the additional drive space as available), the new space was not visible to the OS. This should not have been the case; the additional disk space should have shown up as unallocated in Disk Management, and we should have been able to expand the logical drive in Windows to use the remaining free space. Up until that point I had been going on historical knowledge of HP, the P400i card in particular, and the utilities HP provides for managing the array. Now I had to delve deeper into the technology, and what was supposed to be a quick fix turned into an interesting discovery.

Apparently, while I was traipsing around in the land of Linux here, where we used mainly commodity hardware and made use of LVM and software RAID for all of our servers except the legacy HPs (which were only being used for Windows services), RAID 5 had fallen out of favor. In my searches I happened across a Spiceworks thread wherein one of the posters opined loudly and often that RAID 5 was THE WORST choice for modern systems, likely to cause problems instead of mitigating them. This came as news to me. RAID 5 was the de facto standard for every implementation I did as a consultant with the outsourced IT provider I worked for. They had a blueprint for how systems were to be deployed, and it was non-negotiable. That was only a few years ago. In light of this I followed some links to find out more.

The basic workings of RAID 5 aren't a mystery. Parity is the key to its resiliency. For every stripe of data (not literally every bit), parity is calculated using an XOR operation and written to a disk. The parity is spread across the disks in the array, so that any one disk can fail and the array can rebuild from the remaining data and parity when you add a new disk in. This all sounds good on paper, so why the dire warnings? Well, it apparently has more to do with the reliability of the drives themselves, UREs, and failure rates. URE is an acronym for unrecoverable read error. I'll admit that I was not versed in this particular concept, so let me explain it here in case you need a refresher.
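
To make the parity idea concrete, here's a toy sketch in Python (my own illustration, nothing a real controller actually runs) showing how XOR parity lets you rebuild a missing block from whatever is left:

    # Toy RAID 5 parity: XOR the data blocks together to get the parity block,
    # then reconstruct any single missing block by XOR-ing everything that remains.
    def xor_blocks(blocks):
        result = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                result[i] ^= byte
        return bytes(result)

    # One stripe spread across three "data disks", plus its parity block.
    d1, d2, d3 = b"AAAA", b"BBBB", b"CCCC"
    parity = xor_blocks([d1, d2, d3])

    # Pretend the disk holding d2 died: rebuild it from the survivors.
    rebuilt = xor_blocks([d1, d3, parity])
    assert rebuilt == d2

In a real RAID 5 array the parity block for each stripe rotates across the disks, but the XOR trick is the same.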

Disks fail. Over time a standard spindle-based disk (as opposed to an SSD, which has its own wear specs) will start to have trouble reading some of its data. In general you don't notice this, especially in some kind of RAID setup, because the bad spot is marked and the world keeps on spinning. No pun intended. The rate folks seem to agree on for this kind of error is one URE per 10^14 bits read. 10^14 bits is roughly 12.5TB of data, so if you have a 2TB array and read that data 6 times over the course of its existence, you can statistically expect to hit a URE. The idea, then, is that if a RAID 5 array loses a disk to failure, you're likely to hit a URE on one of the surviving disks during the resilvering process. In the best-case scenario you simply lose some data; in the worst-case scenario the entire rebuild borks and you are left without an array, praying that you have a backup.
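
Here's that back-of-the-envelope math in Python, just to sanity-check the numbers (the 10^14 figure is the commonly quoted spec for consumer SATA drives):

    # One URE per 10^14 bits read is the commonly quoted consumer SATA spec.
    URE_RATE_BITS = 10**14
    TB = 10**12              # decimal terabytes, the way drive vendors count

    print(URE_RATE_BITS / 8 / TB)        # 12.5TB of reads per expected URE

    bits_read = 2 * TB * 6 * 8           # a 2TB array read six times over
    print(bits_read / URE_RATE_BITS)     # ~0.96 expected UREs, i.e. roughly one

So "read 12.5TB, expect one URE" is a statistical average rather than a guarantee, but it's close enough to make the point.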

Now, these warnings seem to pertain to large-capacity (2TB+) SATA drives more than anything else. From what I've read, SAS drives are far less prone (with a rate of one URE per 10^16 bits), and the more drives you have in your array, the higher your chances of encountering a URE during a RAID recovery. On top of that, a hot spare is apparently not exactly a good thing. I had sort of pssh'd the idea of a hot spare with RAID 5; this is in fact the first setup I've encountered where anyone actually designated one. If a disk fails and you have monitoring and alerting set up properly, the chances of a second one failing in the time it takes you to replace the first seemed trivial. But another article I read actually called the concept of a hot spare dangerous when combined with RAID 5 and URE factors. If a system automatically starts rebuilding an array onto a spare, as is the case with our P400i, you can end up with a complete array meltdown without even knowing it was at risk. In that author's scenario, a drive fails in a 3-drive array, the hot spare comes online, the array starts to rebuild, then hits a URE and the rebuild fails altogether, even though no second drive ever failed. So now not only do you have a down server, you had no warning about it and no chance to intervene. Intervening in this case would mean taking a backup of your data (or verifying an existing backup) before starting the rebuild, and being prepared to restore data if necessary.
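
To put some rough numbers on the "bigger drives, more drives, worse odds" point, here's a quick estimate that treats UREs as independent events occurring at exactly the rated frequency (a big simplification of real drive behavior, but it shows the trend). It models a rebuild as reading every surviving drive end to end:

    # Chance of hitting at least one URE while rebuilding a degraded RAID 5 array,
    # assuming UREs are independent and occur at exactly the rated frequency.
    def rebuild_ure_probability(surviving_drives, drive_size_tb, ure_rate_bits):
        bits_read = surviving_drives * drive_size_tb * 10**12 * 8
        return 1 - (1 - 1 / ure_rate_bits) ** bits_read

    SATA = 10**14    # typical consumer SATA spec
    SAS = 10**16     # typical enterprise SAS spec

    print(rebuild_ure_probability(3, 2, SATA))   # 2TB SATA, 4-drive array: ~38%
    print(rebuild_ure_probability(7, 2, SATA))   # 2TB SATA, 8-drive array: ~67%
    print(rebuild_ure_probability(3, 2, SAS))    # 2TB SAS, 4-drive array: ~0.5%

The exact percentages depend on assumptions you can argue about all day, but the direction is clear: big SATA arrays are the danger zone.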

Of course, all of this made me start to second-guess our setup and evaluate the options. In general I think the warnings about RAID 5 laid out in this post are not quite as serious as they're made out to be, at least in our situation. We are in fact using RAID 5, and using (or in the process of using) 4 drives, but they are all 146GB SAS drives, nowhere near the big capacities being cited in the examples out there. So while the alarm bells aren't exactly ringing over the risks of our particular setup, it was valuable to learn about some of the other gotchas possible with RAID 5, and to keep in mind that a migration to another RAID configuration, especially as we look at increasing disk capacity for growth, may be worth the effort down the road.
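
For what it's worth, running the same kind of estimate against our own array (assuming the usual 10^16 spec applies to these SAS drives) backs up that gut feeling:

    # Rebuilding a 4 x 146GB SAS RAID 5 set means reading the 3 surviving drives.
    # Assume the usual SAS spec of one URE per 10^16 bits read.
    bits_read = 3 * 146 * 10**9 * 8      # about 3.5e12 bits to read the survivors
    print(bits_read / 10**16)            # ~0.00035 expected UREs, about a 0.035% chance

Nothing like the double-digit percentages you get with an array full of 2TB SATA disks.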