Reliability of Other System Components
As described in this section, the reliability of a system is a function of the reliability of the various components that make it up. The more components in a system, the less reliable the system will be. Furthermore, in terms of reliability, the chain is truly only as strong as its weakest link. When dealing with a PC, there are a number of critical components without which the system will not function; if one of these hardware pieces fails then your array will go down, regardless of the number of disks you have or how well they are manufactured. This is an important point that too few people consider carefully enough when setting up a RAID box.
One unreliable component can severely drag down the overall reliability of a system, because the MTBF of a system will always be lower than the MTBF of the least reliable component. Recall the formula for reliability of a system:
System MTBF = 1 / ( 1/MTBF1 + 1/MTBF2 + ... + 1/MTBFN )
Also recall that if the MTBF values of all the components are equal (i.e., MTBF1 = MTBF2 = ... = MTBFN) then this boils down to:
System MTBF = Component MTBF / N
This means that if we have four components with an MTBF of 1,000,000 hours each, the MTBF of the system is 250,000 hours. But what if we have four components, of which three have an MTBF of 1,000,000 hours and the fourth has an MTBF of 100,000 hours? In this case, the MTBF of the system drops to only about 77,000 hours, less than one-third of the previous value.
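The arithmetic above can be checked with a short script. This is only a minimal sketch: it assumes every component is critical (the system fails as soon as any one of them fails), so the failure rates, 1/MTBF, simply add. The function name system_mtbf is hypothetical and chosen just for this illustration.

```python
def system_mtbf(component_mtbfs):
    """Return the MTBF (in hours) of a system in which every component is critical.

    Assumption for this sketch: the failure of any single component brings
    the whole system down, so the component failure rates (1/MTBF) add up.
    """
    if not component_mtbfs:
        raise ValueError("at least one component is required")
    if any(m <= 0 for m in component_mtbfs):
        raise ValueError("MTBF values must be positive")
    # System failure rate is the sum of the individual failure rates.
    total_failure_rate = sum(1.0 / m for m in component_mtbfs)
    return 1.0 / total_failure_rate


# Four equally reliable components.
equal = system_mtbf([1_000_000] * 4)
# Three good components plus one weak one.
mixed = system_mtbf([1_000_000] * 3 + [100_000])

print(f"four equal components: {equal:,.0f} hours")
print(f"one weak component:    {mixed:,.0f} hours")
print(f"weak / equal ratio:    {mixed / equal:.2f}")
```

Running it reproduces the two figures quoted above (250,000 hours and roughly 77,000 hours) and shows how a single weak component dominates the result.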
What this all means is that you can have the greatest hard disks in the world, and use multiple RAID levels to protect against drive failure, but if you put it all in a system with lousy support components, you're not going to have a reliable, high-availability system. In fact, the situation is actually worse than that. While RAID reduces reliability (more drives mean more components that can fail), it improves fault tolerance; however, most of the other components in the system have no fault tolerance. This means that the failure of any one of them will bring down the PC. Of particular concern are components that affect all the drives in a system, and which generally have a reputation for problems or relatively low reliability.
To increase the reliability of the PC as a whole, systems using RAID are usually designed to use high-quality components. Many systems go beyond this, however, by introducing fault tolerance into other key components in the system that often fail. Since many of the most common problems with PCs are related to power, and since without the power supply nothing in the PC will operate, many high-end RAID-capable systems come equipped with redundant power supplies. These supplies are essentially a pair of supplies in one, either of which can operate the PC. If one fails then the other can handle the entire load of the PC. Most also allow hot swapping, just like hard disk hot swapping in a RAID array. See this section for more.
Another critical issue regarding support hardware relates to power protection--your PC is completely dependent on the supply of electricity to the power supply unless you use a UPS. In my opinion, any application important enough to warrant the implementation of a fault-tolerant RAID system is also important enough to justify the cost of a UPS, which you can think of as "fault tolerance for the electric grid". In addition to allowing the system to weather short interruptions in utility power, the UPS also protects the system against being shut down abruptly while in the middle of an operation. This is especially important for arrays using complex techniques such as striping with parity; having the PC go down in the middle of writing striped information to the array can cause a real mess. Even if the battery doesn't last through a power failure, a UPS lets you shut the PC down gracefully in the event of a prolonged outage.
How about components like motherboards, CPUs, system memory and the like? They certainly are critical to the operation of the system: a multiple-CPU system can handle the loss of a processor (though that's a pretty rare event in any case) but no PC around will function if it has a motherboard failure. Most systems running RAID do not provide any protection against failure of these sorts of components. The usual reason is that there is no practical way to protect against the failure of something like a motherboard without going to (very expensive) specialty, proprietary designs. If you require protection against the failure of any of the components in a system, then you need to look beyond fault-tolerance within the PC and configure a system of redundant machines.