8. ANALYSIS OF NOVEL RELIABILITY MODELS

In this chapter, the new reliability models of Chapter 7, the reliability effects of the proposed scanning algorithms, and a delayed disk array repair process are studied in comparison with traditional disk array reliability models and repair algorithms. The goal is to quantify how much the scanning algorithm improves disk array reliability by detecting latent sector faults, and how much the reliability decreases when the repair process is either delayed or obstructed. Other reliability scenarios are also studied.

This chapter is divided into four parts: validation of the reliability models, sensitivity analysis of the parameters, accuracy of the approximation models, and reliability scenarios. The first part verifies that the derived equations provide the same results as the previous studies with the same input parameters. The second part studies how stable the equations are with respect to different input parameters. The third part estimates the accuracy of the approximation models. Finally, various reliability scenarios are evaluated in the fourth part.

8.1 Validation of novel reliability models

In Appendix B, the MTTDL figures are illustrated for various parameter combinations. In this part, the corresponding equations of the technical literature (here called the Traditional Markov Model, TMM) [Schwarz 1994, Hillo 1993, Gibson 1991] are compared with the results of EMM1 and EMM2A. The main objective of this validation is to verify that the new equations agree with the results of the previous studies.
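As background, the closed-form MTTDL approximation commonly quoted in that literature for an array of N disks tolerating a single disk failure has the general form shown below. It is included here only as a reminder of the shape of such models; the exact TMM equations used in the comparisons are those of the cited works, and the symbols MTTF_disk and MTTR_disk are generic placeholders rather than the notation of this thesis.

    \[
      MTTDL_{TMM} \approx \frac{MTTF_{disk}^{2}}{N\,(N-1)\,MTTR_{disk}}
    \]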

The validation is divided into two parts. The first part compares the equations with exactly the same parameters while the second part compares with different values. In the first three comparisons, some of the parameters, such as sector fault rate, are ignored (i.e., those values are set so low/high that they have no effect). In the last three comparisons, it is checked that the new models give reasonable results also when the new features are included.

Comparison with identical values

Figures B-1, B-2, and B-3 of Appendix B illustrate the comparison of the MTTDL values of TMM and the MTTDL values of EMM1 and EMM2A models as presented in this thesis with comparable parameters.

Figure B-1 compares TMM with EMM1 and EMM2A with no sector faults as a function of the reliability of the disk unit. The MTTDL values of TMM and EMM1 give the same results while EMM2A has a small error when the disk unit reliability is low. The approximation error is studied later in this chapter.

Figure B-2 compares TMM with EMM1 and EMM2A with no sector faults as a function of the mean time to repair a disk unit. The MTTDL values of TMM and EMM1 give the same results, while EMM2A has an error of similar magnitude to the previous comparison when the repair time is long. Again, this is because of the approximation that is used in EMM2A.

Figure B-3 compares TMM with EMM1 and EMM2A with no sector faults as a function of the number of disks in the array. The MTTDL values of TMM, EMM1, and EMM2A give the same results in all three cases with the entire range of the number of disks.

Comparison with sector faults included

Figures B-4, B-5, and B-6 of Appendix B illustrate the comparison of the MTTDL values of TMM and the MTTDL values of EMM1 and EMM2A as presented in this thesis when the sector faults are not ignored.

Figure B-4 compares TMM with EMM1 and EMM2A with sector faults as a function of the reliability of the disk unit. The MTTDL values of TMM are somewhat poorer than those of EMM1 and EMM2A (with sector fault detection). This is because they all have the same probability of having the first fault in the array (either a sector or a disk unit fault), but EMM1 and EMM2A have a lower probability of the second fault because the failure rate in the sector fault states is lower than in the disk unit fault states. On the other hand, the MTTDL of EMM1 and EMM2A drops dramatically below the values of TMM if the sector faults are included but not detected. This is well in line with what is expected because, due to the latent faults, the system is then mainly in a sector fault state.

Figure B-5 compares TMM with EMM1 and EMM2A with sector faults as a function of the mean time to repair a disk unit. Here, both sector and disk unit repair times are varied simultaneously. The MTTDL values of TMM are somewhat worse than those of EMM1 and EMM2A with sector fault detection because of the same reason as above in Figure B-4. MTTDL of EMM1 and EMM2A without sector fault detection is significantly lower as the reliability is totally dominated by the undetected sector faults.

Figure B-6 compares TMM with EMM1 and EMM2A with sector faults as a function of the number of disks in the array. The MTTDL values of TMM are somewhat worse than those of EMM1 and EMM2A with sector fault detection for the same reason as above in Figures B-4 and B-5. Correspondingly, the MTTDL values of EMM1 and EMM2A with no sector fault detection are significantly poorer, as the reliability of the array is affected by the undetected sector faults and the growing number of disks.

Mission success probabilities

Some of the mission success probabilities of the above comparisons are listed in Table 10. These mission success probabilities are based on the default values of the parameters listed in Appendix B. The results in all four cases are almost the same when the same parameters are used. The approximation methods (EMM1A and EMM2A) show a marginal underestimation of the mission success probabilities.

Table 10. Sample mission success probabilities for TMM, EMM1, EMM1A, and EMM2A (the values of the parameters are listed in third and fourth columns of Table B-1 in Appendix B)

Comparison   Mission   TMM     EMM1                 EMM1A                EMM2A

B-3          M1        0.987   0.987                0.987                0.987
B-3          M3        0.961   0.961                0.960                0.960
B-3          M10       0.876   0.875                0.873                0.873
B-6          M1        0.987   0.351 *) 0.990 **)   0.117 *) 0.990 **)   0.117 *) 0.990 **)
B-6          M3        0.961   0.011 *) 0.971 **)   0.002 *) 0.970 **)   0.002 *) 0.970 **)
B-6          M10       0.876   0.000 *) 0.905 **)   0.000 *) 0.905 **)   0.000 *) 0.905 **)

*)  with no sector fault detection
**) with sector fault detection
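The mission success probabilities can be related to MTTDL through a simple exponential approximation. The sketch below assumes that the time to data loss is approximately exponentially distributed and uses a purely illustrative MTTDL of 670 000 hours (not a value taken from Appendix B); under these assumptions the one-, three-, and ten-year missions M1, M3, and M10 come out close to the B-3 row of Table 10.

    import math

    def mission_success(mttdl_hours, years):
        # Success probability of a mission of the given length, assuming an
        # exponentially distributed time to data loss. This is only an
        # approximation of the underlying Markov models, not their exact solution.
        return math.exp(-years * 8760.0 / mttdl_hours)

    for years in (1, 3, 10):
        print(f"M{years} = {mission_success(670_000, years):.3f}")
    # prints approximately 0.987, 0.962, 0.877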

Conclusions of validation of novel reliability models

When EMM1 and EMM2A are compared with TMM the following observations can be made:

• with the same input parameters (i.e., the sector faults ignored), EMM1 provides exactly the same results as TMM;

• with the same input parameters (i.e., the sector faults ignored), EMM2A provides very good approximation of TMM;

• when the sector faults are included but not detected, EMM1 and EMM2A indicate significantly worse reliability than TMM with no sector faults; and

• when the sector faults are included and detected, EMM1 and EMM2A indicate slightly better reliability than TMM with no sector faults.

It can be concluded that the novel reliability models comply with the old model, since EMM1 provides the same results as TMM with the same input parameters and EMM2A provides a good approximation of TMM.

8.2 Sensitivity analysis

In the sensitivity analysis of the reliability models, the parameters of EMM1 and EMM2A are studied. In Appendix C, the MTTDL of disk arrays is illustrated for various parameter combinations. The sensitivity analysis is divided into three main parts: effect of the number of disks, effect of the failure rates, and effect of the repair rates.

The sensitivity analysis is done so that only two parameters are varied at a time. The primary parameter is varied from one extreme to the other, while the secondary parameter usually takes only three values: minimum, typical, and maximum. The rest of the parameters are kept at their default values. With this configuration, it is possible to analyze the effects of the parameters with a limited set of combinations.
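The sensitivity figures of Appendix C are produced from the EMM1 and EMM2A equations of Chapter 7. As a generic illustration of how an MTTDL value can be computed numerically from a Markov model of this kind, the sketch below solves a deliberately simplified two-state model (all disks good, one disk failed; no sector faults, no spare handling). The state space, rates, and disk count are assumptions made for the example only and do not reproduce the EMM1 state space.

    import numpy as np

    def mttdl(Q, start=0):
        # Q is the generator matrix restricted to the transient (non-data-loss)
        # states: Q[i, j] is the rate from state i to state j (i != j) and
        # Q[i, i] is minus the total outgoing rate of state i, including the
        # rate into the absorbing data-loss state. The expected times to
        # absorption t satisfy Q t = -1.
        return np.linalg.solve(Q, -np.ones(Q.shape[0]))[start]

    lam = 1.0 / 200_000          # assumed disk unit failure rate [1/h]
    mu = 1.0 / 8                 # assumed disk unit repair rate [1/h]
    N = 50                       # assumed number of disks in the array

    Q = np.array([[-N * lam,               N * lam],
                  [      mu, -(mu + (N - 1) * lam)]])
    print(f"MTTDL = {mttdl(Q):,.0f} hours")   # about 2e6 hours for these rates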

The following abbreviations are used:

• MTBDF: Mean Time Between Disk unit Failures;

• MTBSF: Mean Time Between Sector Faults (in a disk);

• MTTRDF: Mean Time To Repair Disk unit Failure;

• MTTRSF: Mean Time To Repair Sector Fault;

• MTBSDF: Mean Time Between Second Disk unit Failures; and

• MTTOSD: Mean Time To Order and replace Spare Disk.

8.2.1 Sensitivity to the number of disks

Figures C-1 to C-5 illustrate the reliability effect of the number of disks. Five scenarios are studied in combination with the number of disks in the array:

• disk unit failure rate;

• sector failure rate;

• disk unit repair rate;

• sector repair rate; and

• disk unit repair rate that is relative to the number of disks in the array.

Disk unit failure rate

Figure C-1 illustrates the reliability of the disk array as a function of the number of disks in the array and the disk unit reliability. Regardless of the number of disks in the array, the system MTTDL improves by slightly over one decade when MTBDF improves by one decade (from 200 000 hours to 2 million hours). Similarly, MTTDL drops by over one decade when MTBDF drops by one decade (from 200 000 hours to 20 000 hours), regardless of the number of disks in the array. The drop is larger because the probability of a second disk fault leads to a higher probability of data loss, while the increase in reliability is limited by the other parameters in the system.

Sector failure rate

Figure C-2 illustrates the reliability of the disk array as a function of the number of disks in the array and the sector reliability. Regardless of the number of disks in the array, the system MTTDL improves by a factor of three when MTBSF improves by one decade (from 200 000 hours to 2 million hours). Similarly, MTTDL drops by almost one decade when MTBSF drops by one decade (from 200 000 hours to 20 000 hours), regardless of the number of disks in the array. The reliability change is smaller when MTBSF is increased, as the reliability is then limited by the other components, while decreasing MTBSF makes the sector failure rate the reliability bottleneck.

These results are better than those obtained when varying the disk unit reliability because the probability of a data loss after a sector fault is smaller than after a disk unit fault. After a sector fault, only a disk unit fault or a corresponding sector fault in another disk causes data loss. In contrast, after a disk unit fault, any sector fault or any disk unit fault causes data loss.

Disk unit repair rate

Figure C-3 illustrates the reliability of the disk array as a function of the number of disks in the array and the disk unit repair rate. Regardless of the number of disks in the array, the system MTTDL improves by about 50% when the mean disk unit repair time drops from 8 hours to 2 hours. Correspondingly, the system MTTDL is reduced to one half (one quarter) when the mean disk unit repair time is increased from 8 hours to 24 (72) hours, regardless of the number of disks in the array.

Sector fault repair rate

Figure C-4 illustrates the reliability of the disk array as a function of the number of disks in the array and the sector fault detection and repair rate. Regardless of the number of disks in the array, the system MTTDL improves by about 20% (40%) when the sector faults are detected and repaired in 12 hours instead of 24 hours for EMM1 (EMM2A). The system MTTDL drops significantly when the sector fault repair takes a longer time (78, 254, or 824 hours), leading to up to one decade worse MTTDL over the whole range of disks.

Disk unit repair rate relative to the number of disks

Figure C-5 illustrates the reliability of the disk arrays as a function of the number of disks in the array when the repair time is related to the number of disks in the array. It is assumed that it takes 15 minutes to read/write an entire disk at the normal repair rate (i.e., about 1 MB/s average transfer rate for a disk of 1 GB capacity). In addition, there is a minimum startup delay of 8 (2) hours in EMM1 (EMM2A). When the number of disks in the array is small, there is no significant difference in the system MTTDL between the different repair activities, as the difference in the repair times is only a few tens of minutes. On the other hand, the effect on the system MTTDL is almost one decade when there are many disks in the array. For example, when there are 100 disks in the array, the difference in the repair time between the normal and 10% repair speeds is 225 hours. This explains the significant difference also in the MTTDL values.
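The 225-hour figure can be reproduced with a short calculation. The sketch below mirrors the assumptions stated above (15 minutes per disk at the normal repair rate and an 8-hour startup delay as in EMM1); the function name and parameters are illustrative only.

    def repair_time(n_disks, per_disk_hours=0.25, speed=1.0, startup_hours=8.0):
        # Disk unit repair time when the rebuild reads one disk's worth of data
        # from every disk in the array: startup delay plus one pass over the
        # array, scaled by the chosen relative repair speed.
        return startup_hours + n_disks * per_disk_hours / speed

    normal = repair_time(100, speed=1.0)   # 33 hours at the normal repair speed
    slow = repair_time(100, speed=0.1)     # 258 hours at 10% repair speed
    print(slow - normal)                   # 225 hours, as quoted for Figure C-5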

Conclusions of the number of disks

Figures C-1 to C-5 illustrate that the relative reliability of the disk arrays (as expressed with MTTDL) is not significantly related to the number of disks in the array. Naturally, the absolute reliability improves when the number of disks decreases, but the relative difference between configurations remains quite the same regardless of the number of disks in the array.

Only if the disk unit repair time depends on the number of disks (as illustrated in Figure C-5) does the number of disks in the array play a significant role in the reliability estimates.

In conclusion, this means that it is possible to use a fixed number of disks in the array (in the further analysis, 50 disks is used as the default), and the relative improvement in MTTDL will apply reasonably well also to disk arrays with a larger or smaller number of disks.

8.2.2 Failure rates

Four different failure rates are analyzed: disk unit, sector, second disk unit, and spare disk unit failure rates.

Disk unit failure rate

Figure C-6 illustrates the reliability of the disk array as a function of the reliability of disks and the disk unit repair rate. When MTBDF is over 100 000 hours, MTTDL increases linearly with MTBDF. Below that, MTTDL decreases faster than MTBDF as the second disk unit fault becomes the reliability bottleneck.

Sector failure rate

Figures C-7 and C-8 illustrate the reliability of the disk array as a function of the reliability of the disk sectors and the sector fault detection rate. When MTBSF is in its normal range (10 000h - 1 000 000h), MTTDL increases with MTBSF, but not linearly. With higher MTBSF values, MTTDL is limited by the disk unit failure rate (two disk units failing one after another). Similarly, MTTDL has a lower limit, as the probability of having two sector faults at corresponding sectors is so marginal that the lower bound of the reliability is eventually also limited by the disk unit reliability (a disk unit fault is needed after a sector fault).

If the disk unit reliability were not the limiting factor, MTTDL in Figures C-7 and C-8 would behave the same way as in Figure C-6.

Second disk failure rate

Figure C-9 illustrates the reliability of the disk array as a function of the probability of interrelated disk faults (the second disk fault being more probable than the first one) and the reliability of the disk units. The system MTTDL is almost constant when the second disk unit failure rate is about the same as or smaller than the first disk unit failure rate. When the second disk unit failure rate increases (i.e., the probability of related disk unit faults grows), MTTDL approaches the case where no redundancy is provided.

Spare disk failure rate

Figure C-10 illustrates the reliability of the disk array as a function of the spare disk reliability. The disk array reliability seems to be almost independent of the spare disk reliability. Only a marginal drop is noticeable when the spare disk reliability decreases. This is in line with Gibson's observation that the need for a second spare disk is marginal [Gibson 1991].

Conclusions of failure rates

Figures C-6 to C-10 illustrate that the relative reliability of the disk array (as expressed with MTTDL) is not significantly related to failure rates when the failure rates are in the conventional range of modern disks (100 000 - 1 million hours). Naturally, the absolute reliability improves when the failure rate decreases, but the relative difference between configurations remains quite the same regardless of the failure rate. Hence, 200 000 hours can be used for both MTBSF and MTBDF in the further analysis.

8.2.3 Repair rates

Three different repair rates are analyzed: disk unit, sector, and spare disk unit repair rates.

Disk unit fault repair rate

Figure C-11 illustrates the reliability of the disk array as a function of the mean disk unit repair time and the reliability of the disks. When MTTRDF is in its practical range (from a few hours to a few hundred hours), MTTDL depends linearly on the mean repair time. The upper bound of MTTDL is limited by the other components in the system (such as sector faults). When the mean disk unit repair time approaches infinity, the lower bound of MTTDL is also limited (i.e., the array acts like a system that has no spare disks, but can tolerate one fault).

There is a significant difference between the MTTDL of EMM1 and EMM2A when the repair time is short and MTBDF is 20 000 hours. This is because it takes a longer time (at least 24 hours) in EMM2A to install a new spare disk in the array, and the probability of a second disk failing soon after the first one is then high.

Sector fault repair rate

Figure C-12 illustrates the reliability of the disk array as a function of the mean sector fault repair time and the reliability of the disks. When MTTRSF is in its practical range (from a few hours to a few hundred hours), MTTDL depends linearly on the mean sector repair time. The upper bound of MTTDL is limited by the other components in the system (such as the disk unit reliability). When the mean sector repair time approaches infinity, the lower bound of MTTDL is also limited (i.e., when the sector faults are not detected, a latent sector fault together with a disk unit fault causes the data loss).

Spare disk fault repair rate

Figure C-13 illustrates the reliability of the disk array as a function of the spare disk fault detection and repair rate. When this rate is in its practical range (from a few hours to a few hundred hours), MTTDL is quite independent of the spare disk fault detection and repair rate. Only when the disk reliability is low or the repair takes hundreds of hours does the repair rate play a significant role.

Figure C-14 illustrates the reliability of the disk array as a function of the spare disk replacement rate. When this rate is in its practical range (from a few hours to a hundred hours), MTTDL is quite independent of the spare disk replacement rate. Only when the disk reliability is low or the replacement takes hundreds of hours does the replacement rate play a significant role.

Conclusions of repair rates

Figures C-11 to C-14 illustrate that the relative reliability of the disk arrays (as expressed with MTTDL) is not significantly related to the repair rates when the repair rates are in the conventional range of modern disks (a few hours to a few hundred hours). Naturally, the absolute reliability improves when the repair rate increases, but the relative difference between configurations remains much the same regardless of the repair rate. Thus, repair times from a few hours to a hundred hours can be used in the further analysis with no risk of instability of the equations.

8.3 Accuracy of approximations

In the accuracy analysis, the approximation methods of EMM1A and EMM2A are compared with the analytical approach EMM1. In Appendix D, MTTDL of disk arrays is illustrated for various parameter combinations. The accuracy analysis is done by varying the main parameters of the disk array: the number of disks, disk unit failure rate, sector failure rate, second disk failure rate, disk unit repair rate, and sector repair rate.
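The error percentages quoted below are understood here as the relative deviation of the approximate MTTDL from the analytical (EMM1) value; this definition is an assumption, as the text does not spell it out.

    def relative_error(mttdl_approx, mttdl_exact):
        # Relative deviation of an approximate MTTDL from the analytical one.
        return abs(mttdl_approx - mttdl_exact) / mttdl_exact

    print(f"{relative_error(1.88e6, 2.0e6):.0%}")   # 6% for these made-up values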

Number of disks in the array

Figure D-1 illustrates the accuracy of the approximation of EMM1A and EMM2A when compared with EMM1 as a function of the number of disks in the array. Both approximations provide very accurate results. When the number of disks is varied from one to 100 and the disk repair time is varied while the other parameters are kept at their default values, the maximum error (6%) in this figure occurs when the number of disks is 100 and MTTRDF is 72 hours. In most cases, the error is less than 1%.

Disk unit failure rate

Figure D-2 illustrates the accuracy of the approximation of EMM1A and EMM2A when compared with EMM1 as a function of disk unit reliability. Both approximations provide very accurate results. The maximum error (22%) in this figure is achieved when MTBDF is 10 000 hours and MTTRDF is 72 hours. When MTBDF is better than 100 000 hours, the error is less than 5% and in most cases less than 1%.

Sector failure rate

Figure D-3 illustrates the accuracy of the approximation of EMM1A and EMM2A when compared with EMM1 as a function of sector reliability. Both approximations provide quite accurate results when the sector reliability is in its practical range (MTBSF is greater than 10 000 hours). Then, the maximum error is about 7% and in most cases less than 1%. When MTBSF is below 10 000 hours, over 30% error in MTTDL is experienced. This is because the failure rates in the array are no longer significantly smaller than repair rates.

Second disk unit failure rate

Figure D-4 illustrates the accuracy of the approximation of EMM1A and EMM2A when compared with EMM1 as a function of second disk unit failure probability. Both approximations provide accurate results only when the MTBSDF is over 10 000 hours. Then, the maximum error is about 10% and in most cases less than 1%. When MTBSDF decreases below 10 000 hours, the error grows dramatically. The reason for this is exactly the same as above in Figure D-3. In some cases, the disk array failure rate is much larger than the repair rate which contradicts the assumptions of the approximation.

Disk unit fault repair rate

Figure D-5 illustrates the accuracy of the approximation of EMM1A and EMM2A when compared with EMM1 as a function of the disk unit repair rate. When MTBDF is at least 100 000 hours and the mean repair time is less than a few hundred hours, both approximations provide quite accurate results. Then, the maximum error is 10% and in most cases less than 1%. When the repair time is long or the disks are unreliable, a significant error results. Again, the approximation is erroneous when the repair rates are no longer significantly larger than the failure rates.

Sector fault repair rate

Figure D-6 illustrates the accuracy of the approximation of EMM1A and EMM2A when compared with EMM1 as a function of the sector repair rate. Both approximations provide quite accurate results when the sector repair time is less than a few hundred hours. Then, the maximum error is 10% and in most cases less than 1%. When the repair time is long, a significant error results. Again, the approximation is erroneous when the repair rates are no longer significantly larger than the failure rates.

Conclusions of approximation models

Figures D-1 to D-6 illustrate that the approximation models provide accurate results when the repair rates are significantly higher than the failure rates. In practical systems this is usually the case, as the mean time between failures is typically over 100 000 hours and the mean time to repair less than a hundred hours, while the number of disks is less than a hundred. Then, the failure rate is at most one tenth of the repair rate.

Both EMM1A and EMM2A provide the same results. It can therefore be assumed that both models approximate the actual reliability model reasonably well.

The only case in which it is not possible to use the approximation is the study of interrelated disk unit faults, as shown in Figure D-4. However, this is not a serious problem because this case can be studied analytically, as shown later in this chapter in Scenario 5.

In all cases when there is an error in comparison with EMM1, both approximation methods, EMM1A and EMM2A, underestimate the reliability. This agrees with the prediction in Chapter 6.

8.4 Reliability scenarios

Table 11 lists nine different comparison scenarios. The columns in the table indicate whether the emphasis is more on the repair process, the sector faults, or the scanning algorithm. These scenarios illustrate various aspects of reliability related to the sector faults or the new reliability models. The main objective here is to analyze the scenarios to achieve deeper understanding of the behavior of the disk arrays.

The default parameters for the scenarios are listed in Table 12. These values are used in all scenarios unless otherwise stated.

Effect of the disk access patterns

The analysis uses the user access patterns listed in Chapter 6, namely Uniform, Single-80/20, Double-80/20, and Triple-80/20. The access patterns are relevant here only for specifying the sector fault detection rate. Besides the user access pattern, the activity of the user read requests relative to the scanning requests determines the relative sector fault detection rate.

Table 11. Different scenarios for the reliability analysis

Scenario                                      Repair process method   Sector faults   Scanning algorithm

SC1: Effect of the sector faults         a)   normal                  ignored         ignored
                                         b)   normal                  included        ignored
                                         c)   normal                  included        only by user requests

SC2: Effect of scanning algorithm        a)   normal                  included        ignored
                                         b)   normal                  included        included

SC3: Delayed disk repair                 a)   normal                  included        ignored
                                         b)   delayed                 included        ignored

SC4: Delayed disk repair with            a)   normal                  included        ignored
     scanning algorithm                  b)   delayed                 included        included

SC5: Related faults                           normal                  included        included

SC6: Hot swap vs. hot spare                   normal                  included        included

SC7: Effect of the spare disk                 normal                  included        included

SC8: Percentage of sector faults              normal                  included        included

SC9: RAID-1 vs. RAID-5                        normal                  included        included

Disk array configurations

The analysis is mainly done for a RAID-5 disk array, and in some cases a RAID-1 disk array is also used. From the point of view of the reliability analysis, the main differences between these two array configurations are the repair times and the number of disks involved. In practice, a RAID-1 array with one pair of mirrored disks has the same reliability as a RAID-5 array with two disks. Hence, the RAID-1 array can be treated as a special case of RAID-5, and EMM1, EMM1A, and EMM2A can be used for both RAID-1 and RAID-5 disk array configurations. The reliability models are also applicable to RAID-3 and RAID-4 disk arrays since RAID-3, RAID-4, and RAID-5 have identical failure models.

Table 12. Default parameters for the scenarios

[The parameter symbols were rendered as graphics in the source and cannot be reproduced here. The default values in models EMM1, EMM1A, and EMM2A are, in the order listed in the source:
50; 1 000 000; 200 000h; 200 000h; 200 000h * S; 2 000 000h; 8h; 24h; 44h; 32h]

In the case of a larger RAID-1 array (such as ten pairs of mirrored disks), the results can be derived from a single pair of disks by treating the pairs as independent units. It is infeasible to have a hot spare disk for every pair; instead, a common spare disk is used. This means that only one such pair can be repaired at any time. This is considered to cause only a minor error, as it has been stated that even with a disk array of 32 disks there is very little use for more than one on-line spare disk [Gibson 1991].

8.4.1 Scenario 1: Effect of sector faults

The first scenario examines the effect of sector faults in general (i.e., how much worse is the disk array reliability if the sector faults are included but not detected efficiently, compared with the conventional disk array model where sector faults are totally ignored?). The results are illustrated in Figure 26. To make the reliability figures comparable, it is assumed that 50% of the faults in EMM1 are sector faults and the remaining 50% are disk unit faults. This corresponds to the view of the distribution of faults between sector and disk unit faults in modern disks [Räsänen 1996]. Thus, the total failure rate of each disk is the same, but the cause is different.

When the sector faults are included and detected slowly, the disk array reliability (as expressed with MTTDL) lags significantly behind the MTTDL predicted by TMM. This is because the sector faults can remain latent in the array for a long time, and only one disk unit fault is then needed to cause data loss. The lower limit of the reliability is reached when the sector faults are not detected at all. In that case, MTTDL is only about 2% of the MTTDL predicted by TMM when the average disk lifetime is about 200 000 hours.

On the other hand, when the sector faults are detected faster, the reliability of the disk array is at the same level as TMM predicts. Actually, even better reliability can be achieved, as shown in Figure 26, if the average sector fault detection time is 12 or 24 hours. The reasoning for this was given earlier in this chapter (Section 8.1) when the new reliability models were validated.


Figure 26. Effect of the sector faults

Conclusions of Scenario 1

This scenario points out two issues. First, the reliability drop is dramatic if the sector faults are included but not detected. However, the same level of reliability can be achieved even when the sector faults are included, provided that they are detected efficiently. Second, the disk requests of a typical user are not very suitable for detecting latent sector faults because of their uneven access patterns. Instead, efficient scanning algorithms that can scan the entire disk in a matter of hours (typically once or twice per day) can provide good latent fault detection and therefore good reliability.

8.4.2 Scenario 2: Effect of scanning algorithm

The second scenario studies the effect of the scanning algorithm and its activity. The efficiency of the scanning algorithm depends on the following aspects:

• access patterns of the user disk requests;

• relative activity of the scanning algorithm (compared with the user disk requests); and

• absolute activity of the scanning algorithm.

An example of a situation in which the scanning algorithm provides a significant improvement in latent fault detection is when the user access pattern is uneven (such as the Triple-80/20 distribution) and the scanning algorithm is running at the same order of magnitude of activity as the user disk requests. Then, the sector fault detection rate is significantly higher with the scanning algorithm. The ratio of the sector fault detection of the scanning algorithm and the user disk access pattern is

[equation shown only as a graphic in the source]. (103)

The results of fast scanning can be seen in Figure 26 of Scenario 1. The reliability of the disk array can be improved significantly (MTTDL can improve tenfold) when an efficient scanning algorithm is used.

On the other hand, if the scanning algorithm has significantly lower activity than the user requests and the user's disk access pattern is distributed evenly, the effect of the scanning algorithm is quite minimal. For example, if the Uniform access pattern is used and the scanning activity is only 5% of the user read requests, then the ratio of the fault detection of the scanning algorithm and the user disk access pattern is

[equation shown only as a graphic in the source]. (104)

Then, the scanning algorithm would improve the sector fault detection rate by 10%. This would have only a marginal effect on the reliability.

The practical ratio of the fault detection of the scanning algorithm and the user access pattern lies somewhere in between. For example, a 5% scanning activity and the Triple-80/20 access pattern lead to the ratio

[equation shown only as a graphic in the source]. (105)

This would mean that the scanning algorithm can still improve MTTDL by 100% as shown by Figure 26.

RAID-5 parity sectors

An important point in the RAID-5 array architecture is that if the array is used only for reading data and data is never (or hardly ever) written into the array, there is no need to access the parity sectors. If write operations are also performed on the disk array, they can either detect the latent sector faults or mask them.

If there is a latent sector fault in a parity area of the RAID-5 array, the user disk read requests will never detect it. On the other hand, the scanning algorithm has no problem detecting sector faults in the parity areas, as it can access information on the disk regardless of the array configuration.

Depending on the size of the array, a certain percentage of the sectors remains unaccessed by the user requests. For example, if the RAID-5 array consists of 10 (50) disks, 10% (2%) of the disk space is inaccessible to user read requests because that space is used for storing parity information. Those sectors can be checked non-destructively only by the scanning algorithm.
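For a single-group RAID-5 array, the rotated parity occupies one disk's worth of capacity, so the fraction of sectors that only the scanning algorithm can check non-destructively is simply 1/N. A minimal sketch (function name illustrative):

    def parity_fraction(n_disks):
        # Fraction of a RAID-5 array's capacity that holds rotated parity and
        # is therefore never read by ordinary user read requests.
        return 1.0 / n_disks

    print(f"{parity_fraction(10):.0%}, {parity_fraction(50):.0%}")   # 10%, 2%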

Conclusions of Scenario 2

This scenario shows clearly that the proposed scanning algorithm can improve the reliability of the disk array in two ways. First, the scanning algorithm can also access sectors (such as RAID-5 parity sectors) that are not accessible by the normal user read requests. Second, if the user access pattern is uneven, the scanning algorithm can significantly expedite the latent sector fault detection even when the activity of the scanning algorithm is low.

8.4.3 Scenario 3: Delayed disk unit repair process

The third scenario compares the normal disk array repair algorithm with a delayed repair algorithm, where the repair process is either slowed down or postponed to a more suitable time. Figures 27 and 28 illustrate the effect of the delayed disk repair process as a function of the average disk lifetime and the number of disks in the array.


Figure 27. MTTDL in Scenario 3 as a function of average repair time and average disk lifetime

Both figures show that by delaying the repair process from 2 (8) hours to 8 (24) hours, the reliability (as expressed with MTTDL) drops by about 30% (50%) when MTBDF is 200 000h. The mission success probabilities are illustrated in Table 13. There is no significant impact on reliability if the repair time is delayed from two hours to eight hours. If the repair time is further delayed to 24 hours, the ten-year mission success probability starts to degrade significantly.

The main benefit of delaying the repair is the possibility of reducing the performance degradation during the repair process. For example, if the repair time is extended from two to eight hours, the load caused by the repair process is reduced by 75%, as the repair time is four times longer. This can be done if there is a hot spare disk in the array that can be used to start the repair immediately but slowly.

Table 13. Mission success probabilities of Scenario 3

Average disk repair time   M1 (1 year)   M3 (3 years)   M10 (10 years)

2h                         98.8%         96.5%          88.7%
8h                         98.3%         95.0%          84.3%
24h                        97.0%         91.3%          73.7%


Figure 28. MTTDL in Scenario 3 as a function of average repair time and the number of disks in the array

Conclusions of Scenario 3

This scenario shows that it is possible to delay the repair process at the expense of reduced reliability. This is needed if, for example, the performance requirements must be met also during the repair. This analysis assumes that the faults are not related and that the probability of a second fault occurring just after the first one is therefore typically low.

Related faults are studied further in Scenario 5.

8.4.4 Scenario 4: Effect of combined scanning algorithm and delayed repair

The fourth scenario studies the combined effect of the scanning algorithm and a delayed repair process. In Figure 29, the reliability of a disk array is illustrated as a function of the activity of the scanning algorithm relative to the user requests. In this figure, it is assumed that the user access pattern is Triple-80/20 and the scanning activity is adjusted relative to it.

The effect of the combined scanning algorithm and the delayed repair can be illustrated with the following two examples:

1. If the average disk repair time is two hours and the relative activity of the scanning algorithm is 5%, then MTTDL of the disk array is, according to Figure 29, 90 000 hours. However, the same reliability can be achieved:

• using the average disk repair time of eight hours and the scanning activity about 5.5%;

• using the average disk repair time of 24 hours and the scanning activity 7.5%; or

• using the average disk repair time of 72 hours and the scanning activity 30%.


Figure 29. MTTDL as a function of relative activity of the scanning algorithm

2. If the average disk repair time is two hours and the relative activity of the scanning algorithm is 20%, then the MTTDL of the disk array is, according to Figure 29, 270 000 hours. However, the same reliability can be achieved:

• using the average disk repair time of eight hours and the scanning activity about 25%; or

• using the average disk repair time of 24 hours and the scanning activity 100% leaving no capacity for user disk requests.

In the second example, the latter alternative is not feasible, as the additional load caused by the scanning algorithm is very likely much higher than what the repair process would generate (at least when compared with the former alternative, where the repair process is spread over eight hours but the scanning activity is increased only from 20% to 25%). Hence, this is a typical optimization problem.

Conclusions of Scenario 4

This scenario points out that it is possible to delay the disk repair algorithm and compensate for the decreased reliability by increasing the activity of the scanning algorithm to detect the latent faults. The result is that the reliability can be kept at the same level while the performance degradation due to the high speed disk recovery process can be eliminated.

8.4.5 Scenario 5: Effect of related faults


Figure 30. Effect of the related disk unit faults as a function of second disk unit failure rate

The fifth scenario studies the effect of related faults in the disk array. For example, if the disks come from the same manufacturing batch or are located in the same cabinet, they are more likely to fail at the same time. Thus, their faults are no longer independent.

Figure 30 illustrates the effect of related faults. If the faults are independent, it is very unlikely that the second disk fails soon after the first one. Then, the risk of data loss is quite small. On the other hand, if the faults are related, there is a significantly higher probability of a second fault occurring soon and causing data loss. As shown in Figure 30, the drop in reliability is much larger than in Figure 27 of Scenario 3. Here, MTTDL drops to one third if the disk repair time is delayed from two hours to eight, or from eight to 24 hours.

An even more dramatic effect on the reliability can be seen in Figure 31. In Scenario 4 it was shown that by combining delayed repair with an expedited scanning algorithm it is possible to maintain the same level of reliability. If the average disk unit lifetime after the first disk has failed is 20 000 hours or 2 000 hours, it is much more difficult to compensate for the delayed repair with an expedited scanning algorithm. Especially in the latter case, the scanning algorithm is no longer capable of compensating for the reliability drop caused by the delayed disk repair process.


Figure 31. Effect of the related disk unit faults as a function of the scanning algorithm activity

The lower limit of the reliability is reached when the second disk is assumed to fail immediately after the first disk. Then, the estimated MTTDL of the disk array can be expressed as the reliability of a single disk divided by the number of disks in the array. For example, in an array of 50 disks where the average disk unit lifetime is 200 000 hours, the MTTDL of the array is only 4 000 hours.
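Restated as a formula (a sketch of the reasoning in the paragraph above, using the abbreviation MTBDF from Section 8.2 and N for the number of disks):

    \[
      MTTDL_{\min} \approx \frac{MTBDF}{N} = \frac{200\,000\ \mathrm{h}}{50} = 4\,000\ \mathrm{h}
    \]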

The situation with related faults can be even more serious than described above. The reliability can drop even further if the first disk unit failure rate is also much higher than expected due to, for example, increased temperature in the disk cabinet.

Conclusions of Scenario 5

Scenario 4 showed that it is possible to compensate for an obstructed repair process by increasing the sector fault detection rate so that the reliability level is maintained. As Scenario 5 shows, this is not the case when the possibility of related faults is considered. Hence, the disk repair process should be completed as quickly as possible to minimize the risk of interrelated faults that would dramatically drop the reliability.

The worst case of related faults is when the first failing disk causes the second disk to fail immediately. Then, the D+1 reliable disk array turns into an array of D+1 disks with no redundancy (just like RAID-0). The reliability of the disk array can decrease even further if the cause of the related faults also degrades the disk units from the beginning.


Figure 32. MTTDL as a function of spare disk replacement time

8.4.6 Scenario 6: Effect of hot swap or hot spare disks

The sixth scenario studies the effect of the spare disk replacement time. In Figure 32, the reliability of the disk array is illustrated as a function of the spare disk replacement time. Here, EMM2A is compared with EMM1 and EMM1A. Figure 32 shows that the reliability is almost the same whether the spare disk replacement time is 8, 24, or 72 hours. This means that if the disk array has an online spare disk, the time to order another spare disk after a disk failure is not so important. Only if ordering the spare disk takes one week or one month is there a significant effect on the disk array reliability.

More significant effects can be found if the disk unit faults are related. For example, Figure 33 illustrates the case where the disk unit faults are related and the disk unit failure rate is ten times higher after the first disk failure. In this case, the reliability drops more as the spare disk order and replacement time increases.


Figure 33. MTTDL as a function of spare disk replacement time when MTBSDF is 20 000 hours

Conclusions of Scenario 6

This scenario shows the benefit of hot spare disks. When the disk array is repaired with a hot spare disk, it is not so critical when a new spare disk is added to the array. If the new spare disk is added within 24 hours there is no significant effect on the reliability even if the disk unit faults are related.

8.4.7 Scenario 7: Effect of spare disk reliability

The seventh scenario illustrates the effect of the spare disk reliability. In Figure 34, the reliability of the disk array is represented as a function of the spare disk reliability. As can be seen, the reliability of the spare disk plays an insignificant role in the total reliability. Only a marginal drop is noticed when the spare disk reliability decreases.

Conclusions of Scenario 7

This scenario, together with Scenario 6, shows that it is possible to use the simpler model (EMM1) of the disk array with no significant error when the disk array repair time is the same and the new spare disk is assumed to be added soon enough (e.g., within 24 hours after the disk failure).


Figure 34. MTTDL as a function of spare disk reliability

This is a significant observation because, in the next chapter where performability is studied, the analysis can then be done using only the analytical approach (EMM1) to analyze both the EMM1 and EMM2 models.

8.4.8 Scenario 8: Effect of the percentage of sector faults

The eighth scenario studies the effect of the relative amounts of sector and disk unit faults in the array. Figure 35 illustrates the disk array reliability as a function of the percentage of sector faults among all faults. The reliability of the disk array does not change very much if the percentage of sector faults is changed from the default value (50%) down to 0%. The effect is only 25% (15%) for an array that has an average disk unit fault repair time of 24h (8h) and an average sector fault repair time of 24h (12h).

On the other hand, if the percentage of sector faults increases significantly, the reliability of the disk array also increases radically. This is because having two sector faults at corresponding sectors is not very probable, as described earlier in this chapter. Eventually, if all faults were sector faults, the disk array reliability would be extremely high, as a data loss would require two sector faults at corresponding sectors.


Figure 35. MTTDL as a function of the percentage of sector faults in the array

Conclusions of Scenario 8

This scenario shows that it is quite safe to assume that 50% of all faults in a disk are sector faults, when the disk unit and sector repair times are comparable. If the percentage of sector faults is in the range of 0% to 50%, the effect on the disk array reliability is around 15-25%.

8.4.9 Scenario 9: RAID-1 vs. RAID-5

The ninth scenario compares two different array architectures: RAID-1 and RAID-5. The goal is to compare the reliability of an array built with D+1 redundancy (RAID-5) with that of an array built with D+D redundancy (RAID-1). Figure 36 illustrates the reliability of the disk array as a function of the number of data disks in the array. It is clearly seen that RAID-1 provides significantly higher reliability than RAID-5 with the same number of data disks. However, this comes at the expense of an increased cost of the array, as almost double the number of disks is needed.


Figure 36. Comparison of RAID-1 and RAID-5 disk arrays

One additional point that increases the difference between the RAID-1 and RAID-5 reliability estimates is the repair time. As all disks in RAID-5 are involved in the disk array repair, the repair time is much longer and the performance of the crippled array is lower. If a RAID-5 array with 50 data disks can be repaired in 8 (24) hours, then a RAID-1 array with 50 data disks can easily be repaired in 2 (8) hours, because in the latter case basically only two disks are involved in the repair process, not the entire array. Hence, RAID-1 can provide 30 times higher MTTDL than RAID-5 with 50 data disks, whereas the same repair time would have provided only 24 times higher MTTDL.

Conclusions of Scenario 9

This scenario shows that RAID-1 can provide significantly higher reliability than a RAID-5 array, but at the cost of an increased number of disks. Hence, there are almost twice as many failed disks in RAID-1, but because of the smaller number of disks in a disk group and the faster repair process, the probability of data loss is still smaller.

8.4.10 Summary of scenarios

The conclusions of the scenarios are summarized in the following list.

• Reliability of the disk array depends heavily on the latent sector faults and their detection rate.

• Typical user disk accesses are not very good at detecting latent sector faults because of their uneven access patterns. User access patterns are also not capable of detecting all latent faults (such as faults in RAID-5 parity blocks).

• A special scanning algorithm should be used for detecting latent sector faults. The proposed scanning algorithm can improve the reliability significantly.

• Performance degradation of a crippled disk array can be reduced by delaying or obstructing the repair process at the expense of reduced reliability. Within certain limits, the reliability can be regained by expediting the latent (sector) fault detection.

• Related faults dramatically decrease the reliability of the disk array. In the worst case, the reliability of a RAID-5 array with D+1 disks is as bad as the reliability of RAID-0 with D+1 disks (having no redundancy).

• A hot spare disk provides better reliability than a hot swap disk, mainly because the repair process can start earlier. It is not so important when a new spare disk is added to the disk array after a disk failure: there is no big difference whether the new spare disk is added immediately or, on average, within 8 or 24 hours.

• It is possible to model a disk array with a hot spare disk using EMM1, provided that the repair time is as fast as in the hot spare case and the new spare disk is added within 24 hours after the disk failure. Then, the same analytical approach (EMM1) can be used for both hot swap and hot spare arrays in Chapter 9.

• When the percentage of sector faults (of all faults in the disk) is in the range 0-50%, the analysis provides similar results for the reliability. Thus, the analysis made in Scenarios 1-7 and in Scenario 9 should be quite independent of the percentage of sector faults. Only when the majority of the faults are sector faults is there a significant effect on the reliability.

• RAID-1 provides significantly better reliability than RAID-5 for the same data capacity, at the expense of a larger number of disks and therefore also a higher number of faulty disks.

8.5 Mission success probabilities

In this final reliability study, the mission success probabilities of RAID-1 and RAID-5 arrays are compared. Table 14 illustrates the reliability of disk arrays with 1, 5, and 50 data disks. With one and five data disks, all three mission success probabilities are very close to 100%, as the reliability of the disk arrays is very high, as shown by MTTDL. There is no significant difference between the RAID-1 and RAID-5 arrays. On the contrary, there is a significant difference in the mission success probabilities when there are 50 data disks in the disk array. The RAID-1 configuration can still provide good reliability while RAID-5 shows significantly degraded reliability.

If the faults are related, the reliability drops dramatically, as shown earlier in this chapter in Scenario 5. Then, the reliability would be only a fraction of the values listed in Table 14. However, RAID-1 would still provide significantly better reliability for three main reasons. First, the mirroring approach is more suitable for high reliability. Second, the repair process is much faster, as only two disks are involved and the repair process is simpler (as described in the next chapter). Finally, the RAID-1 architecture is less sensitive to related faults because it is constructed of pairs of disks, unlike the RAID-5 architecture where a disk group has several disks. In conclusion, the RAID-1 architecture can repair disk faults much faster and therefore tolerates related faults better.

Table 14. Reliability of RAID-1 and RAID-5 arrays with 1, 5, and 50 disks

Disks   Array configuration   MTTDL [hours]   M1 [%]    M3 [%]    M10 [%]

1       EMM1: RAID-1          3.34E+08        99.997    99.992    99.974
1       EMM1: RAID-5          3.34E+08        99.997    99.992    99.974
5       EMM1: RAID-1          6.67E+07        99.987    99.961    99.869
5       EMM1: RAID-5          2.23E+07        99.961    99.882    99.607
50      EMM1: RAID-1          6.67E+06        99.869    99.607    98.695
50      EMM1: RAID-5          2.66E+05        96.772    90.610    71.973
