HERA: Heterogeneous Extension of RAID
Roger Zimmermann and Shahram Ghandeharizadeh
r.zimmermann@ieee.com, shahram@pollux.usc.edu
Department of Computer Science
University of Southern California
Los Angeles, California 90089
Abstract
A number of recent technological trends have made data intensive applications such as continuous
media (audio and video) servers a reality. These servers store and retrieve a large volume of data using
magnetic disks. Servers consisting of multiple nodes and large arrays of heterogeneous disk drives have
become a fact of life for several reasons. First, magnetic disks might fail. Failed disks are almost always
replaced with newer disk models because the current technological trend for these devices is one of annual
increase in both performance and storage capacity. Second, storage requirements are ever increasing,
forcing servers to be scaled up progressively.
In this study we present HERA, a framework that enables parity-based data protection for heteroge-
neous storage systems. We describe the tradeoffs associated with three alternative HERA techniques:
independent subservers, dependent subservers, and disk merging. The novel disk merging approach
provides a low cost solution for systems that require highly available secondary storage.
1 Introduction
Applications that utilize digital continuous media, such as video and audio clips, require a vast amount of
storage space [BGM95]. Large archives may consist of hundreds, if not thousands, of disks to satisfy both
the bandwidth and storage requirements of the working sets imposed by different applications. Although
a single disk is fairly reliable, with a large number of disks, the aggregate rate of disk failures can be too
high. At the time of this writing, the mean time to failure (MTTF) of a single disk is on the order of
1,000,000 hours; this means that the MTTF of some disk in a 1,000 disk system is on the order of 1,000
hours (approximately 42 days).
With those servers that assume a hierarchical storage structure, a disk failure may not result in loss of
data. This is because these systems render the entire database tertiary resident [GS93, BGM95, GVK+95].
The disks cache the most frequently accessed objects in order to minimize the number of references to
the tertiary. Data redundancy at disk level continues to be important because it is undesirable for a single
disk failure to impact all the active displays¹. Even with redundant data, multiple failures might force the
system to terminate some active displays. We use the mean time to service loss (MTTSL) to quantify the
fault-tolerant characteristics of the algorithms discussed in this paper.
This research was supported in part by the National Science Foundation under grants IRI-9203389, IRI-9258362 (NYI award),
and CDA-9216321, and a Hewlett-Packard unrestricted cash/equipment gift.
¹ With no data redundancy and a disk failure, depending on the organization of data, all requests might be forced to retrieve the
missing data from tertiary which we assume to have a much lower bandwidth than the aggregate bandwidth of the disk subsystem.
This would diminish the number of simultaneous displays supported by the system.
A common technique to protect against both data and service loss is to add redundancy to the system,
either by mirroring data or adding parity information [BG88, PGK88]. There is a vast body of literature
analyzing techniques in support of homogeneous disk subsystems [CLG+94], but very few are concerned
with heterogeneous disk subsystems [DS95, CRS95]. From a practical perspective, these techniques must
be extended to support heterogeneous subsystems. This is due to the current technological trends in the area
of magnetic disks, namely, the annual 40% to 60% increase in performance and 40% decrease in storage
cost [Pat93, Gro97]. Consequently, it is very likely that failed disks will be replaced by newer models. In
addition, with scalable storage subsystems, a system might evolve to consist of several disk models.
With HERA, there are multiple ways of configuring the hardware and organizing data. These choices
represent a tradeoff among MTTSL, cost, and the need for detective techniques that dissolve bottlenecks by replicating
the read-only data. It is beyond the focus of this workshop paper to investigate all possible organizations.
Instead, we provide a preliminary study of three alternative organizations:
1. Independent subservers [GS93, DS95, WYS95]: With this organization, a heterogeneous collection
of disks is organized into a collection of subservers, each consisting of a homogeneous array of disk
drives. A file (e.g., a movie) is assigned to one subserver, see Figure 1(a). Hot read-only files
(e.g., popular movies) might be replicated across the subservers to avoid formation of hot spots and
bottlenecks. The configuration may employ a detective technique to detect hot spots and replicate the
read-only data to dissolve these bottlenecks.
2. Dependent subservers: Similar to the previous technique, this technique constructs a collection of
subservers consisting of homogeneous disks; however, it stripes a file across the subservers in order to
distribute the load of a sequential retrieval across all subservers, see Figure 1(b). (This is an extension
of Streaming RAID [TPBG93] to a heterogeneous collection of subservers.) When compared with
technique 1, this strategy prevents the formation of bottlenecks and no longer replicates read-only
data. However, its MTTSL is inferior due to data dependence among the subservers.
3. Disk merging [ZG97, Zim98]: This technique constructs a logical collection of disk drives from an
array of heterogeneous disk drives. The logical disks appear homogeneous to the upper software layers
of the system. A logical disk drive might be realized using a fraction of the bandwidth and storage
space provided by several physical disk drives. For example, in Figure 2, logical disk number 2 ($d^l_2$) is realized using a fraction of the bandwidth provided by physical disk drives 0, 1, and 2 ($d^p_0$, $d^p_1$, and $d^p_2$). When compared with dependent subservers (technique 2), this technique minimizes the amount
of memory required by continuous media and hence results in the lowest cost per stream. Moreover, it
provides effective support for diverse configurations. For example, if a system administrator extends
the storage subsystem with a single Quantum disk drive, this paradigm can utilize both the storage
space and bandwidth of this disk drive. With the other two techniques, it would be difficult to form
a subserver that consists of a single disk drive (whose failure would result in reorganization of data).
On the down side, the MTTSL of this technique is inferior to the other two strategies.
With the homogeneous view created by disk merging, a parity-based redundancy scheme such as RAID
can be applied, as long as it is modified to handle the following constraint: a physical disk drive should not contribute multiple logical disks to a single parity group. Otherwise, the failure of this physical disk would result in loss of data, i.e., an entire parity group would become inoperative at the logical level. In Section 2,
we describe analytical models to compute the MTTSL of each technique. Next, Section 3 quantifies the
performance tradeoffs associated with each strategy. These results quantify the qualitative tradeoffs discussed
in this section. Conclusions are contained in Section 4.
[Figure: Hawk 1LP and Barracuda 4LP physical disks ($d^p_0$ through $d^p_6$) organized into parity groups 0 and 1; data blocks of objects X and Y and their parity blocks (marked .P) are shown for each technique.]
(a) Technique 1: Independent subservers. (b) Technique 2: Dependent subservers.
Figure 1: Partitioning techniques 1 and 2.
[Figure: six physical disks ($d^p_0$ through $d^p_5$: Hawk 1LP, Barracuda 4LP, Hawk 1LP, Hawk 1LP, Barracuda 4LP, Hawk 1LP, with $p_i$ values of 2.2 for the Hawks and 3.6 for the Barracudas) mapped to sixteen logical disks $d^l_0$ through $d^l_{15}$, which are assigned to parity groups 0 through 3.]
Figure 2: Technique 3: physical (heterogeneous) and logical (homogeneous) view of a multi-disk storage server employing disk merging. Six physical disks are mapped to sixteen logical disks. In addition, a possible parity group assignment to create non-overlapping parity groups is shown. All of a physical disk's logical disks must map to different parity groups ($G_i$).
2 Parity-based Redundancy
To support data intensive applications we assume a multi-node server architecture. Each node has a set of
local disks attached to it through an I/O bus, such as the popular SCSI (Small Computer System Interface,
see [ANS94]). The nodes are linked with each other through a high-speed interconnection network. While
we recognize the importance of investigating both disk failures and node failures, we limit the focus of this
paper to only disk failures.
With redundant data, a single failure does not lead to data loss. Furthermore, if the failed disk is replaced
with a new device, then the data on the new disk can be rebuilt. In sum, a server operates in one of three modes: (a) normal mode, i.e., all nodes are fully operational, (b) degraded mode, i.e., some
disk or node has failed, and (c) rebuild mode, i.e., the data on a repaired disk or node is being restored. All of
these three modes must be considered when designing reliability techniques to mask disk and node failures,
such that the service to the applications can be continued.
Term        Definition
MTTF        Mean time to failure; mean lifetime of an individual, physical disk with failure rate $\lambda$
MTTSL       Mean time to service loss
MTTR        Mean time to repair of a physical disk with repair rate $\mu$
$d^p_i$     Physical disk drive $i$
$d^l_i$     Logical disk drive $i$
$p_i$       Number of logical disks that map to physical disk $i$
$D$         Number of physical disk drives
$D^l$       Number of logical disk drives
$G$         Parity group size
$G_i$       Parity group $i$; a data stripe allocated across a set of $G$ disks and protected by a parity code
$R(t)$      Reliability function
Table 1: List of terms used repeatedly in this study and their respective definitions.
Disk Series      MTBF (power-on hours)
Hawk 1LP         500,000
Barracuda 4LP    1,000,000
Cheetah 4LP      1,000,000
Table 2: Single disk reliability for three commercial disk drives. In these examples, the manufacturer, Seagate Technology, Inc., reported the reliability as mean-time-between-failures (MTBF). A simple approximation for MTBF is $MTBF = MTTF + MTTR$, where MTTR denotes the mean-time-to-repair [SS82, pp. 206]. Because failed disks are typically replaced and not repaired, the notion of MTTR refers to replacing the disk and rebuilding its content onto a new disk. This process can usually be completed within several hours. Hence $MTTR \ll MTTF$, and thus for most practical purposes the approximation $MTTF \approx MTBF$ can be used.
2.1 Parity Group Assignment
In parity-based systems the disks are partitioned into parity groups. The parity information is computed
across the blocks in a parity group, most commonly with an XOR function. The large storage space overhead
of mirroring is reduced since for a parity group size of $G$ only one $G$-th of the space is dedicated to data
redundancy. In the basic case, the disks of a storage system are partitioned into non-overlapping parity
groups. This scheme can tolerate one disk failure per parity group. However, when a parity group operates
in degraded mode, each access to the failed disk triggers the retrieval of blocks from all of the disks within
this parity group to reconstruct the lost data. Thus, the load on all operational disks increases by 100% under
failure, making this parity group a hot spot for the entire system. To distribute the additional load more evenly,
parity groups may be rotated such that they overlap [TKKD96]. Further improvements can be achieved by
assigning blocks pseudo-randomly to parity groups [ORSS96]. To provide a focused presentation we will
concentrate on the simple scenario of non-overlapping parity groups.
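The reconstruction rule behind this degraded-mode behavior is simply that any single block of a stripe equals the bytewise XOR of all the remaining blocks, data and parity alike. The following minimal sketch (our own illustration, not code from the report; block contents and the group size are made up) demonstrates the rule in Python.

```python
# Minimal sketch of XOR-based parity: compute the parity block of a stripe and
# rebuild one missing block from the G-1 surviving blocks.
from functools import reduce

def parity_block(blocks: list[bytes]) -> bytes:
    """Bytewise XOR of all given blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def reconstruct(surviving_blocks: list[bytes]) -> bytes:
    """The lost block of a stripe is the XOR of all surviving blocks (data and parity)."""
    return parity_block(surviving_blocks)

# Hypothetical stripe with three data blocks and one parity block (G = 4).
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = parity_block(data)
assert reconstruct([data[0], data[2], parity]) == data[1]  # block 1 recovered
```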
When the logical disks that were created with the disk merging technique are assigned to parity groups,
the following constraint must be considered. Some logical disks may map to the same physical device and
hence be dependent on each other, i.e., a failure at the physical level may cause multiple logical disks to
become unavailable simultaneously. Consequently, two dependent logical disks cannot be assigned to the
same parity group. The reason is that with a traditional XOR-based parity computation exactly one data
block of a stripe can be reconstructed as long as all the other blocks of that particular stripe are available.
Consequently, the number of independent parity groups, $D^l / G$, needs to be larger than any number of logical disks that map to a single physical disk. This can be formally expressed with the following parity group size constraint

$$G \leq \frac{D^l}{p_i} \quad \text{for } 0 \leq i < D \quad (1)$$

where $G$ denotes the parity group size, $D^l$ represents the total number of logical disks, and $p_i$ denotes the number of logical disks that map to physical disk $d^p_i$ (for example, the $p_i$ values shown in Figure 2).
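As a quick cross-check of Equation 1 (our own sketch, not part of the original report), the largest admissible parity group size can be computed directly from the $p_i$ values; the ordering of the values below is an assumption, and only their multiset matters.

```python
import math

def max_parity_group_size(p: list[float], num_logical_disks: int) -> int:
    """Largest G that satisfies Equation 1, i.e., G <= D^l / p_i for every physical disk i."""
    return min(math.floor(num_logical_disks / p_i) for p_i in p)

# Figure 2 example: six physical disks with p_i logical disks each,
# sixteen logical disks in total.
p_values = [2.2, 3.6, 2.2, 2.2, 3.6, 2.2]
print(max_parity_group_size(p_values, 16))  # -> 4
```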
Figure 2 shows an example storage system with six physical disk drives that map to sixteen logical
disks. The parity group size $G$ is required to be less than or equal to both $\left\lfloor \frac{D^l}{2.2} \right\rfloor = \left\lfloor \frac{16}{2.2} \right\rfloor = 7$ and $\left\lfloor \frac{D^l}{3.6} \right\rfloor = \left\lfloor \frac{16}{3.6} \right\rfloor = 4$. Hence, the maximum parity group size equals 4, which can be accommodated by creating four or more parity groups $G_i$. For illustration purposes, we will use a simple, non-overlapping parity group scheme. One possible assignment of the sixteen logical disks $d^l_0, \ldots, d^l_{15}$ to four parity groups $G_0, \ldots, G_3$ is illustrated in Figure 2, with no two logical disks of the same parity group mapping to the same physical disk.
From the above parity group assignment we can further determine which physical disks participate in each of the parity groups ("$\rightarrow$" denotes "maps to"). Two of the parity groups map to all six physical disks in the sample system, i.e., $G_i \rightarrow \{d^p_0, d^p_1, d^p_2, d^p_3, d^p_4, d^p_5\}$ (Equations 2 and 3), while the remaining two groups each map to four of the physical disks (Equations 4 and 5).
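The report does not prescribe an algorithm for producing such an assignment (the Conclusion lists it as an open issue); the greedy sketch below is only one possible illustration, and the logical-to-physical mapping in the example is hypothetical.

```python
# Greedy sketch: place each logical disk into the first parity group that is not
# full and that shares no physical disk with it (the HERA independence constraint).
def assign_parity_groups(logical_to_physical: dict[int, set[int]], group_size: int) -> list[list[int]]:
    groups: list[list[int]] = []        # logical disk ids per parity group
    used_physical: list[set[int]] = []  # physical disks already represented in each group
    for logical_id, physical_set in logical_to_physical.items():
        for g, members in enumerate(groups):
            if len(members) < group_size and not (used_physical[g] & physical_set):
                members.append(logical_id)
                used_physical[g] |= physical_set
                break
        else:
            groups.append([logical_id])
            used_physical.append(set(physical_set))
    return groups

# Hypothetical mapping: logical disks 0 and 1 both touch physical disk 0,
# so they must end up in different parity groups.
print(assign_parity_groups({0: {0}, 1: {0, 1}, 2: {2}, 3: {3}}, group_size=2))  # [[0, 2], [1, 3]]
```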
2.2 Basic Reliability Modeling
With the help of data replication or parity coding, the reliability of a disk array can be greatly improved.
Several studies have quantified the reliability of homogeneous disk arrays in the context of continuous media
servers with mirroring [Mou95], parity coding schemes (RAID) [PGK88, ML90, Gib91, CLG+94],
or a combination of both [BB97]. The concepts of reliability or fault tolerance involve a large number of
issues concerning software, hardware (mechanics and electronics), and environmental (e.g., power) failures.
A large body of work already exists in the field of reliable computer design and it will provide the foundation
for this section. Because it is beyond the scope of this workshop paper to cover all the relevant issues, we
will restrict our presentation to the reliability aspects of the disk hardware.
2.2.1 Analytical Model for Reliability
The reliability function of a system, denoted $R(t)$, is defined as the probability that the system will perform
satisfactorily from time zero to time t, given that it is initially operational [SS82, pp 7]. When the higher
failure rates of a component at the beginning (infant mortality or burn-in period) and at the end (wear-out
period) of its lifetime are excluded, then there is strong empirical evidence that the failure rate during its
normal lifetime is approximately constant [SS82, Gib91]. This is equivalent to an exponential distribution
of the product's lifetime and gives rise to a reliability function $R(t)$ that can be expressed as:

$$R(t) = e^{-\lambda t} \quad (6)$$
[Figure: state-transition diagram; states are labeled with the number of failed disks, and transitions are labeled with the individual failure rates $\lambda_i$ and repair rates $\mu_i$.]
Figure 3: Markov model for a heterogeneous disk array (one parity group). The labels in each state denote the number of disk failures encountered by the array. State 2 is a trapping state, i.e., the probability of exiting is zero, meaning the array has failed.
Perhaps the most commonly encountered measure of a system’s reliability is its mean time to failure
(MTTF) which is defined as follows:
$$MTTF = \int_0^\infty R(t)\, dt \quad (7)$$

With the exponential lifetime distribution $R(t)$ substituted from Equation 6 the MTTF simply becomes:

$$MTTF = \int_0^\infty e^{-\lambda t}\, dt = \frac{1}{\lambda} \quad (8)$$

In a heterogeneous storage environment it is possible for each physical disk drive $d^p_i$ to have its own $MTTF_i$ and hence its own failure rate $\lambda_i$ (see Table 2).
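Equations 6 through 8 reduce to a single exponential once a disk's MTTF is known. The short sketch below is our own illustration; the 8,760 hours-per-year conversion is an assumed convention.

```python
import math

HOURS_PER_YEAR = 8760  # assumed conversion factor

def failure_rate(mttf_hours: float) -> float:
    """Constant failure rate lambda = 1 / MTTF of an exponentially distributed lifetime."""
    return 1.0 / mttf_hours

def reliability(mttf_hours: float, t_hours: float) -> float:
    """R(t) = exp(-lambda * t): probability that the disk survives until time t."""
    return math.exp(-failure_rate(mttf_hours) * t_hours)

# Probability that a single Barracuda 4LP (MTTF = 1,000,000 hours, Table 2)
# survives one year of continuous operation: roughly 0.991.
print(reliability(1_000_000, HOURS_PER_YEAR))
```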
The mean lifetime of a system can be greatly prolonged if it contains redundant components that can be
repaired when a failure occurs. This is the case for parity based storage systems, where a parity group can run
in degraded mode until a failed disk is replaced and its data is recovered. The mean time to repair (MTTR)² is often difficult to model analytically, and it must usually be estimated or measured [SS82, pp. 205]. If the operating environment of the system is known, then an estimate can be given. For example, if spare disks are at hand and service personnel are part of the staff at the site location, an MTTR of a few hours should be realistic. If spare parts must be ordered or service personnel called in, then the MTTR may be on the order of days or even weeks. For the analytical models the repair rate is commonly denoted $\mu$, and for exponential distributions $MTTR = 1/\mu$. In a heterogeneous environment the repair rate may not be the same for all disk types, hence in the most general case each disk has its own repair rate $\mu_i$.
² It is customary to refer to MTTR as the mean-time-to-repair even though for most practical purposes a failed magnetic disk will be replaced and not repaired. In such a case the MTTR should include the time to rebuild the lost data on the new disk.
2.2.2 Markov Model for a Single Parity Group
Markov models provide a powerful tool for basic reliability modeling if the system is composed of several
processes (such as a failure process and a repair process) [SS82, Gib91]. Each component of such a model is
assumed to be independent. However, logical disks produced by the disk merging technique are not a priori
independent, because several of them may map to the same physical disk drive. Conversely, fractions of a
single logical disk may be spread across several physical disks. In Section 2.2.3 we will derive the mean
time to failure for each individual logical disk in a storage system, such that the necessary independence for
the Markov model is preserved.
Figure 3 shows the Markov model for a single parity group comprised of G independent disk drives.
The states are labeled with the number of disk failures that the array is experiencing. For example, in state
0 all disks are operational. Then with probability $\lambda_i$ disk $d_i$ becomes unavailable and a transition is made to one of the states labeled 1. With probability $\mu_i$ repairs are completed and the array is again in state 0. Or, with probability $\lambda_j$ a second disk fails and the transition to state 2 indicates that an unrecoverable failure has occurred. As illustrated, the number of states of the model grows quadratically with respect to the parity group size $G$. The evaluation of such a model is computationally quite complex, especially for larger values of $G$. Hence we propose the following two simplifying assumptions:
1. The repair rate $\mu_i$ is the same for all the disks in the system, i.e., $\mu_0 = \mu_1 = \ldots = \mu_{G-1} = \mu$. It is likely that the time to notify the service personnel is independent of the disk type and will dominate the actual repair time. Furthermore, disks with a higher storage capacity are likely to exhibit a higher disk bandwidth, leading to an approximately constant rebuild time.
2. The probability of a transition from any of the states labeled 1 to state 2 is $\sum_{j=0}^{G-1} \lambda_j$ minus the one failure rate that led to the transition from state 0 to 1. We propose to always subtract the smallest failure rate $\lambda_{min}$ to establish a closed form solution. Hence, by ordering (without loss of generality) the $\lambda_j$ according to decreasing values, $\lambda_0 \geq \lambda_1 \geq \ldots \geq \lambda_{G-1}$, we can express the probability of a transition from state 1 to 2 to be $\sum_{j=0}^{G-2} \lambda_j$. This approximation will lead to a conservative estimate of the overall MTTF of a system.
With these assumptions the Markov model is simplified to three states as shown in Figure 4 and can be
solved using a set of linear equations and Laplace transforms [SS82, Gib91].
[Figure: three-state model; state 0 transitions to state 1 with rate $\sum_{i=0}^{G-1} \lambda_i$, state 1 returns to state 0 with repair rate $\mu$ and transitions to the trapping state 2 with rate $\sum_{j=0}^{G-2} \lambda_j$.]
Figure 4: Simplified Markov model for a heterogeneous disk array.
If the expression of $R(t)$ does not need to be obtained, $MTTSL$ can be found more easily [Gib91].
Beginning in a given state i, the expected time until the first transition into a different state j can be expressed
as
$$E[\text{state } i \text{ to state } j] = E[\text{time in state } i \text{ per visit}] + \sum_{k \neq i} P(\text{transition state } i \text{ to state } k) \cdot E[\text{state } k \text{ to state } j] \quad (9)$$

where

$$E[\text{time in state } i \text{ per visit}] = \frac{1}{\sum \text{rates out of state } i} \quad (10)$$

and

$$P(\text{transition state } i \text{ to state } k) = \frac{\text{rate of transition to state } k}{\sum \text{rates out of state } i} \quad (11)$$
The solution to this system of linear equations includes an expression for the expected time beginning
in state 0 and ending on the transition into state 2, that is, for $MTTSL$. For the Markov model in Figure 4, this system of equations is

$$E[\text{state 0 to state 2}] = \frac{1}{\sum_{i=0}^{G-1} \lambda_i} + \frac{\sum_{i=0}^{G-1} \lambda_i}{\sum_{i=0}^{G-1} \lambda_i} \cdot E[\text{state 1 to state 2}] \quad (12)$$

$$E[\text{state 1 to state 2}] = \frac{1}{\mu + \sum_{j=0}^{G-2} \lambda_j} + \frac{\mu}{\mu + \sum_{j=0}^{G-2} \lambda_j} \cdot E[\text{state 0 to state 2}] + \frac{\sum_{j=0}^{G-2} \lambda_j}{\mu + \sum_{j=0}^{G-2} \lambda_j} \cdot E[\text{state 2 to state 2}] \quad (13)$$

$$E[\text{state 2 to state 2}] = 0 \quad (14)$$
The resulting mean lifetime of a single parity group of independent heterogeneous disks is shown in
Equation 15.
$$MTTSL = \frac{\sum_{i=0}^{G-1} \lambda_i + \sum_{j=0}^{G-2} \lambda_j + \mu}{\sum_{i=0}^{G-1} \lambda_i \cdot \sum_{j=0}^{G-2} \lambda_j} \quad (15)$$
Under most circumstances the repair rate $\mu = 1/MTTR_{disk}$ will be much larger than any of the failure rates $\lambda_i = 1/MTTF_{disk_i}$. Thus, the numerator of Equation 15 can be simplified as presented in Equation 16 (see [Gib91, pp. 141]).

$$MTTSL \approx \frac{\mu}{\sum_{i=0}^{G-1} \lambda_i \cdot \sum_{j=0}^{G-2} \lambda_j} = \frac{1}{\sum_{i=0}^{G-1} \frac{1}{MTTF_{disk_i}} \cdot \sum_{j=0}^{G-2} \frac{1}{MTTF_{disk_j}} \cdot MTTR_{disk}} \quad (16)$$
If all the failure rates $\lambda_i$ are equal, i.e., $\lambda_0 = \lambda_1 = \ldots = \lambda_{G-1} = \lambda$, then Equation 16 will, happily, correspond to the derivation for a homogeneous disk RAID level 5 array with group size $G$ [PGK88]:

$$MTTSL_{homogeneous} = \frac{MTTF_{disk}^2}{G \cdot (G-1) \cdot MTTR_{disk}} \quad (17)$$
where $MTTF_{disk}$ is the mean-time-to-failure of an individual disk, $MTTR_{disk}$ is the mean-time-to-repair of a single disk, and $G$ is the parity group size. Both Equations 16 and 17 capture the probability of an independent double disk failure within a single parity group.
Example 2.1: Consider a parity coded disk array consisting of 4 Hawk 1LP and one Cheetah 4LP disk drives ($G = 5$). The mean-time-to-failure of the Hawk 1LP series is $MTTF_{disk} = 500{,}000$ hours and for the Cheetah 4LP series it is $MTTF_{disk} = 1{,}000{,}000$ hours (see Table 2). If we assume an average repair time of $MTTR_{disk} = 6$ hours, then the $MTTSL$ approaches a stunning 264,000 years. (By comparison, a homogeneous array consisting of five Hawk 1LP disks would have a mean lifetime of approximately 238,000 years.)
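The arithmetic of Example 2.1 can be reproduced directly from Equation 16. The sketch below is our own cross-check, not the authors' code; the hours-per-year conversion is an assumption.

```python
# Evaluate Equation 16 for one heterogeneous parity group of G independent disks.
HOURS_PER_YEAR = 8760

def mttsl_parity_group(mttf_hours: list[float], mttr_hours: float) -> float:
    """Approximate MTTSL of a parity group: the second sum omits the smallest
    failure rate, which keeps the estimate conservative."""
    rates = sorted((1.0 / m for m in mttf_hours), reverse=True)  # lambda_0 >= ... >= lambda_{G-1}
    first_sum = sum(rates)        # i = 0 .. G-1
    second_sum = sum(rates[:-1])  # j = 0 .. G-2
    return 1.0 / (first_sum * second_sum * mttr_hours)

# Example 2.1: four Hawk 1LP drives and one Cheetah 4LP drive, MTTR = 6 hours.
group = [500_000] * 4 + [1_000_000]
print(mttsl_parity_group(group, 6) / HOURS_PER_YEAR)  # roughly 264,000 years
```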
2.2.3 Logical Disk Independence
The Markov model of the previous section assumes that all the disks are independent. This assumption is
not guaranteed for disk merging, because one logical disk may map to several physical disks and vice versa.
Hence, to be able to apply Equation 16 at the logical level of a disk merging storage system, we will need to
derive the mean time to failure of each individual, logical disk. Two cases are possible: (1) a logical disk
maps to exactly one physical disk and (2) a logical disk maps to multiple physical disks. Consider each case
in turn.
If a logical disk maps completely to one physical disk then its life expectancy is equivalent to the lifetime
of that physical disk. Consequently, it inherits the mean time to failure of that disk. For example, the mean
lifetimes of logical disks in Figure 2 that map entirely to a single Hawk 1LP drive are 500,000 hours each.
If, on the other hand, a logical disk depends on multiple physical disks, then it will fail whenever any
one of the physical disks fails. Hence, we can apply the harmonic sum for failure rates of independent
components [CLG+94]:

$$MTTF_{d^l} = \frac{1}{\frac{1}{MTTF_{d^p_1}} + \frac{1}{MTTF_{d^p_2}} + \ldots + \frac{1}{MTTF_{d^p_k}}} \quad (18)$$
As an example, consider applying the above two observations to the four logical disks of a parity group in Figure 2. Each logical disk that resides entirely on a single physical disk inherits that disk's mean lifetime (500,000 hours for a Hawk 1LP, 1,000,000 hours for a Barracuda 4LP), while a logical disk that is spread over several physical disks obtains its mean lifetime from Equation 18. For instance, the logical disk $d^l_2$, which is realized on the physical disks $d^p_0$, $d^p_1$, and $d^p_2$, has $MTTF_{d^l_2} = 1/(\frac{1}{500{,}000} + \frac{1}{1{,}000{,}000} + \frac{1}{500{,}000}) = 200{,}000$ hours.
Recall that, if multiple logical disks map to the same physical disk, then a failure of that physical drive will
concurrently render all its logical drives unavailable. For the aforementioned reason all logical disks that
map to the same physical disk must be assigned to different parity groups (see Figure 2).
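A small sketch of Equation 18 (our own illustration) makes the harmonic combination concrete; the three-disk composition mirrors a logical disk that is spread over two Hawk 1LP drives and one Barracuda 4LP drive.

```python
def logical_disk_mttf(physical_mttfs_hours: list[float]) -> float:
    """Equation 18: a logical disk spanning several physical disks fails as soon
    as any of them fails, so its MTTF is the harmonic combination of theirs."""
    return 1.0 / sum(1.0 / m for m in physical_mttfs_hours)

# Two Hawk 1LP drives (500,000 h each) and one Barracuda 4LP drive (1,000,000 h).
print(logical_disk_mttf([500_000, 500_000, 1_000_000]))  # 200,000 hours
```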
2.2.4 Multiple Parity Groups
Large storage systems are typically composed of several parity groups. All of the groups need to be
operational for the system to function properly. If the groups are assumed to be independent, then the system
can be thought of as a series of components. The overall reliability of such a configuration is
$$R_{Series}(t) = \prod_{i=1}^{D_G} R_i(t) \quad (19)$$

where $D_G$ is the number of parity groups and $R_i(t)$ is the individual reliability function of each group.
The overall mean lifetime $MTTSL$ of a series of groups can then be derived from the harmonic sum of the individual, independent failure rates of all the components as shown in Equation 20 [CLG+94].

$$MTTSL_{System} = \frac{1}{\frac{1}{MTTSL_1} + \frac{1}{MTTSL_2} + \ldots + \frac{1}{MTTSL_{D_G}}} \quad (20)$$
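Equation 20 is the same harmonic combination applied one level up, over the parity groups. The sketch below is illustrative only; the group lifetimes in the example are made-up values.

```python
def system_mttsl(group_mttsls: list[float]) -> float:
    """Equation 20: the system is down as soon as any parity group is down, so
    the system MTTSL is the harmonic combination of the group MTTSLs."""
    return 1.0 / sum(1.0 / m for m in group_mttsls)

# Hypothetical storage system with four parity groups (lifetimes in years).
print(system_mttsl([600_000, 600_000, 200_000, 200_000]))  # 75,000 years
```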
Technique                                      1. Independent   2. Dependent   3. Disk
                                               Subservers       Subservers     Merging
Number of subservers                           3                3              1
Number of simultaneous streams @ 3.5 Mb/s      96               210            210
Memory^a [MB]                                  388              3,860          674
Storage capacity^b [GB]                        24.0             55.3           49.4
MTTSL [years]                                  680,000^c        160,000        17,000
Cost per stream^d [$]                          160              90             75
^a Memory size based on double-buffering.
^b The total capacity of all the disks is 72.3 GB. Approximately 20% is used for parity information.
^c Reliability based on triple modular redundancy [SS82, pp. 215], i.e., two of three subservers must function.
^d Cost based on a price of $500 per disk drive and $1 per MB of memory.
Table 3: Analytical results for three techniques with the same disk storage system consisting of 30 disk drives (10 Hawk 1LP, 10 Barracuda 4LP, and 10 Cheetah 4LP). The parity group size is G = 5 and MTTR = 6 hours for all three configurations.
Consider the simple example shown in Figure 2. Four parity groups are formed from 16 logical disks,
which in turn have been constructed from 6 physical devices. The physical $MTTF_{disk}$ values are assumed to be 500,000 hours ($d^p_0$, $d^p_2$, $d^p_3$, and $d^p_5$), respectively 1,000,000 hours ($d^p_1$ and $d^p_4$), corresponding to Hawk 1LP and Barracuda 4LP devices (see Table 2). Each logical disk inherits its $MTTF_{disk}$ from the physical drive(s) it is mapped to. For example, a logical disk that resides entirely on one of the Hawk 1LP drives has an $MTTF_{disk}$ of 500,000 hours.
The resulting mean lifetimes for the two parity groups that span all six physical disks are approximately 211,400 years each, while the two groups whose logical disks reside entirely on two Hawk 1LP and two Barracuda 4LP drives have an $MTTSL_{G_i}$ of 634,237 years each (Equation 15, with $MTTR_{disk} = 6$ hours). The mean time to service loss for the whole storage system is therefore

$$MTTSL_{System} \approx 79{,}300 \text{ years} \quad (21)$$
Such an extraordinarily high mean lifetime will minimize the chance of failure during any time interval.
3 Analytical Results
We compared the three HERA techniques, independent subservers, dependent subservers, and disk merging
with each other based on a disk subsystem configuration consisting of 10 Hawk 1LP (model ST31200WD),
10 Barracuda 4LP (ST32171WD), and 10 Cheetah 4LP (ST34501WD) disk drives.
Table 3 summarizes the results obtained from the analytical models of Section 2. The ten Hawk disks
provided a total of 9.8 GB of storage, the Barracudas 20.1 GB, and the Cheetahs 42.4 GB, for a total of
72.3 GB. For all techniques a parity group size of G = 5 was chosen, resulting in a maximum usable
aggregate capacity of 80% or 57.8 GB. The total raw bandwidth for all the 30 disk drives, not considering
any overhead, is approximately 244 MB/s. Once seek operations, rotational latencies, etc., are factored
in and the most cost-effective operating point is selected, a maximum of 420 simultaneous streams can be supported³. Because the load doubles within a parity group that experiences a disk failure, we limit the
system utilization to 50% or 210 streams. Hence only a double-disk failure in a parity group will cause the
termination of some streams and therefore a loss of service.
³ The bandwidth of a stream is assumed to be 3.5 Mb/s, e.g., MPEG-2.
For technique 1, the independent subserver paradigm, we based our observations on a triple modular
redundancy (TMR) scheme, where two out of three units need to function for continued operation (we
replicated movies accordingly) [SS82, pp 215]. Because the Hawk and Barracuda subservers provide less
than half of the total bandwidth and storage capacity, only 96 streams can be supported by this configuration.
However, this decision results in the highest MTTSL and cost per stream among the three techniques. With
technique 2, dependent subservers, a much higher utilization of bandwidth and storage space can be achieved.
However, the mean lifetime is also considerably lower. The disk merging system provides a high storage
capacity and achieves the same throughput as technique 2, but it minimizes the amount of memory required
for continuous media which, in turn, minimizes the cost per stream. The MTTSL of this technique is the
lowest among the three paradigms, but it is still sufficiently high for most practical applications. While not
quantified, we are almost certain that both dependent subservers and disk merging result in a higher startup
latency when compared with independent subservers. This is because the system must distribute the load
evenly across the parity groups. Increasing the utilization of disk storage (techniques 2 and 3) minimizes the
number of references to the tertiary. This is especially true if the working set of an application exhausts the
available disk cache space.
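The cost-per-stream row of Table 3 can be reproduced from the pricing assumptions given in its footnotes ($500 per disk drive, $1 per MB of memory). The sketch below is our own cross-check and assumes that cost is simply the disk plus memory price divided by the number of supported streams.

```python
def cost_per_stream(num_disks: int, memory_mb: float, streams: int,
                    disk_price: float = 500.0, memory_price_per_mb: float = 1.0) -> float:
    """Cost model implied by the footnotes of Table 3."""
    return (num_disks * disk_price + memory_mb * memory_price_per_mb) / streams

print(round(cost_per_stream(30, 388, 96)))    # technique 1: ~160 $/stream
print(round(cost_per_stream(30, 3860, 210)))  # technique 2: ~90 $/stream
print(round(cost_per_stream(30, 674, 210)))   # technique 3: ~75 $/stream
```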
4 Conclusion and Future Directions
In this study we investigated parity-based fault tolerance techniques for heterogeneous storage systems. We
introduced HERA, a framework that extends RAID to allow a mix of physical disk drives to be protected
from data and service loss. The examples provided were specific to non-overlapping parity groups. For
the disk merging technique, we have specified both design rules and algorithms that provide the necessary
independence of logical disks among each other and across parity groups for the successful application of
parity-based data redundancy.
There remain several open research issues within the HERA framework that we plan to address in the
future. For example, for a storage system with a large collection of logical disks, an algorithm to map logical
disks to parity groups is needed. Also, a configuration planner could be directed to find the best logical
to physical disk assignment that produces the most reliable disk merging configuration for a given set of
physical disk drives.
References
[ANS94] American National Standard of Accredited Standards Committee X3, 11 West 42nd Street, New York, NY
10036. Small Computer System Interface (SCSI-2), ANSI X3.131 - 199x, March 1994.
[BB97] Ernst W. Biersack and Christoph Bernhardt. A Fault Tolerant Video Server Using Combined Raid 5
and Mirroring. In Proceedings of Multimedia Computing and Networking 1997 Conference (MMCN’97),
pages 106–117, San Jose, California, February 1997.
[BG88] Dina Bitton and Jim Gray. Disk shadowing. In Proceedings of the International Conference on Very Large
Databases, pages 331–338, September 1988.
[BGM95] S. Berson, L. Golubchik, and R. R. Muntz. Fault Tolerant Design of Multimedia Servers. In Proceedings
of the ACM SIGMOD International Conference on Management of Data, pages 364–375, 1995.
[CLG+94] P. M. Chen, E. K. Lee, G. A. Gibson, R. H. Katz, and D. A. Patterson. RAID: High-Performance, Reliable
Secondary Storage. ACM Computing Surveys, 26(2):145–185, June 1994.
[CRS95] Ling Tony Chen, Doron Rotem, and Sridhar Seshadri. Declustering Databases on Heterogeneous Disk
Systems. In Proceedings of the 21st International Conference on Very Large Data Bases, pages 110–121,
Zürich, Switzerland, September 1995.
11
[DS95] Asit Dan and Dinkar Sitaram. An Online Video Placement Policy based on Bandwidth to Space Ratio
(BSR). In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages
376–385, San Jose, May 1995.
[Gib91] Garth A. Gibson. Redundant Disk Arrays: Reliable, Parallel Secondary Storage. Ph.D. Dissertation,
University of California at Berkeley, Berkeley, California, December 1991. Also available from MIT
Press, 1992.
[Gro97] Ed Grochowski. Internal (media) data rate trend, 1997. IBM Almaden Research Center, San Jose,
California. URL: http://www.storage.ibm.com/storage/technolo/grochows.
[GS93] S. Ghandeharizadeh and C. Shahabi. Management of Physical Replicas in Parallel Multimedia Information
Systems. In Proceedings of the Foundations of Data Organization and Algorithms (FODO) Conference,
October 1993.
[GVK+95] D. J. Gemmell, H. M. Vin, D. D. Kandlur, P. V. Rangan, and L. A. Rowe. Multimedia Storage Servers: A
Tutorial. IEEE Computer, May 1995.
[ML90] Richard R. Muntz and John C.S. Lui. Performance Analysis of Disk Arrays Under Failure. In Proceedings
of the 16th Very Large Databases Conference, pages 162–173, Brisbane, Australia, 1990.
[Mou95] Antoine Mourad. Reliable Disk Striping in Video-On-Demand Servers. In Proceedings of the 2nd
IASTED/ISMM International Conference on Distributed Multimedia Systems and Applications, pages
113–118, Stanford, CA, August 1995.
[ORSS96] Banu Özden, Rajeev Rastogi, Prashant Shenoy, and Avi Silberschatz. Fault-tolerant Architectures for Con-
tinuous Media Servers. In Proceedings of the ACM SIGMOD International Conference on Management
of Data, pages 79–90, June 1996.
[Pat93] David A. Patterson. Terabytes >> Teraflops (Or why work on processors when I/O is where the action
is?), May 13, 1993. Keynote address at the ACM SIGMETRICS Conference in Santa Clara, CA.
[PGK88] David A. Patterson, Garth Gibson, and Randy H. Katz. A Case for Redundant Arrays of Inexpensive
Disks (RAID). In Proceedings of the ACM SIGMOD International Conference on Management of Data,
pages 109–116, June 1988.
[SS82] Daniel P. Siewiorek and Robert S. Swarz. The Theory and Practice of Reliable Systems Design. Digital
Press, Bedford, Massachusetts, 1982. ISBN 0-932376-13-4.
[TKKD96] R. Tewari, R. P. King, D. Kandlur, and D. M. Dias. Placement of Multimedia Blocks on Zoned Disks. In
Proceedings of IS&T/SPIE Multimedia Computing and Networking, San Jose, January 1996.
[TPBG93] F.A. Tobagi, J. Pang, R. Baird, and M. Gang. Streaming RAID-A Disk Array Management System for
Video Files. In Proceedings of the First ACM Conference on Multimedia, pages 393–400, Anaheim, CA,
August 1993.
[WYS95] J.L. Wolf, P.S. Yu, and H. Shachnai. DASD Dancing: A Disk Load Balancing Optimization Scheme for
Video-on-Demand Computer Systems. In Proceedings of the ACM SIGMETRICS, Ottawa, Canada, May
1995.
[ZG97] Roger Zimmermann and Shahram Ghandeharizadeh. Continuous Display Using Heterogeneous Disk-
Subsystems. In Proceedings of the Fifth ACM Multimedia Conference, pages 227–236, Seattle, Washing-
ton, November 9-13, 1997.
[Zim98] Roger Zimmermann. Continuous Media Placement and Scheduling in Heterogeneous Disk Storage
Systems. Ph.D. Dissertation, University of Southern California, Los Angeles, California, December 1998.