USC Computer Science Technical Reports, no. 799 (2003)
BroadScale: Heterogeneous Scaling of Randomly Labeled Disks*
Shu-Yuen Didi Yao, Cyrus Shahabi, Roger Zimmermann
Computer Science Department
University of Southern California
Los Angeles, CA 90089
{didiyao, shahabi, rzimmerm}@usc.edu

* This research has been funded in part by NSF grants EEC-9529152 (IMSC ERC), IIS-0082826 (ITR), and IIS-0238560 (CAREER), and unrestricted cash gifts from Okawa Foundation and Microsoft. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Abstract
Scalable storage architectures allow for the addition or
removal of disks to increase storage capacity and band-
width or retire older disks. We introduce a random place-
ment scheme for data blocks across a group of disks. Our
objective is to redistribute a minimum number of blocks af-
ter disk scaling. In addition, after scaling, a balanced load
should be maintained and blocks should be retrievable in
one disk access, with low computational complexity. Past
work has only addressed these requirements for scaling
homogeneous disks. Although maximizing the resources of
heterogeneous disks has been previously studied, dynamic
scaling was not considered. Heterogeneous disks must of-
ten be used for scaling since they are faster and more cost-
effective. Moreover, old homogeneous disks may no longer
be available. We propose an algorithm termed BroadScale,
based on Random Disk Labeling, to scale heterogeneous
disks by distributing blocks according to disk weights. We
show through experiments that BroadScale results in an
even load, requires few block moves, and maintains fast
block access for heterogeneous disk scaling.
1 Introduction
Computer applications typically require ever-increasing
storage capacity to meet the demands of their expanding
data sets. Because storage requirements oftentimes exhibit
varying growth rates, current storage systems may not re-
serve a sufficient amount of excess space for future growth.
Meanwhile, large up-front costs should not be incurred for
a storage system that might only be fully utilized in the dis-
tant future. Therefore, a storage system that accommodates
incremental growth would have major cost benefits. Incre-
mental growth translates into a highly scalable storage sys-
tem where the amount of overall disk storage space and ag-
gregate bandwidth can expand according to the growth rate
of the content and/or application performance needs.
Our proposed scalable storage algorithm can be gener-
alized to mapping any set of objects to a group of storage
units. Some applications include Web proxy servers, file
systems, and continuous media (CM) servers. We assume
CM servers in our discussions and use the terms data block
and disk in place of object and storage unit throughout this
paper.
Our technique to achieve a highly scalable CM server
begins with the placement of data on storage devices such
as magnetic disk drives [10, 16]. More specifically, we
break CM files (e.g., video or audio) into individual fixed-
size blocks and apply a random placement [11] of these
blocks across a group of homogeneous disks. Determin-
ing an optimal block size can improve disk performance
(e.g., [18]), which is orthogonal to the topic of this paper.
Since any block can be accessed with an almost equal prob-
ability, the random striping scheme allows the disks to be
load balanced where their aggregate bandwidth and capac-
ity are maximized when accessing CM files. We actually
use a pseudo-randomized placement of file object blocks,
as in [6, 16], so that blocks have roughly equal probabilities
of residing on each disk. With pseudo-random distribution,
blocks are placed onto disks in a random, but reproducible,
sequence.
The placement of block i is determined by its signature X_i, which is simply an unsigned integer computed from a pseudo-random number generator, p_r. p_r must produce repeatable sequences for a given seed. One way to derive the seed is from (StrToL(filename) + i), which is used to initialize the pseudo-random function to compute X_i.
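As a small illustration of this reproducible signature scheme, the sketch below uses a CRC as a stand-in for StrToL and Python's seeded generator for p_r; both choices are assumptions, not the paper's implementation.

    import random
    import zlib

    def block_signature(filename: str, i: int) -> int:
        """Compute a reproducible signature X_i for block i of a file.

        StrToL is assumed to be a deterministic string-to-integer conversion;
        zlib.crc32 stands in for it.  p_r is any pseudo-random generator that
        repeats for a given seed."""
        seed = zlib.crc32(filename.encode()) + i      # (StrToL(filename) + i)
        p_r = random.Random(seed)                     # repeatable for this seed
        return p_r.getrandbits(32)                    # X_i: an unsigned integer

    # The same (file, block) pair always yields the same signature.
    assert block_signature("movie.mpg", 7) == block_signature("movie.mpg", 7)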
The storage system of a CM server requires that disks
be scaled (i.e., added or removed) in which case the striped
objects need to be redistributed to maintain a balanced load.
Disks can be added to the system to increase overall band-
width and capacity or removed due to failure or space con-
servation. We use the notion of disk group as a group of
disks that is added or removed during a scaling operation.
Without loss of generality, a scaling operation on a storage
system with D disks either adds or removes one disk group.
Scaling up will increase the total number of disks and
will require a fraction of all blocks to be moved onto the
added disks in order to maintain load balancing across
disks. Likewise, when scaling down, all blocks on a re-
moved disk should be randomly distributed across remain-
ing disks to maintain load balancing. These block moves
are the minimum needed to maintain an even load.
We have previously developed the Random Disk Label-
ing algorithm to assign blocks to homogeneous disks using
block signatures [22]. However, a homogeneous disk group
may not be available at the time of scaling due to advance-
ments in storage technology [7]. Thus, larger, faster, and
more cost-effective heterogeneous disks must be used when
scaling to increase the overall bandwidth and capacity char-
acteristics of the storage system. The number of blocks on
each disk should be proportional to both these characteris-
tics. Load balancing according to just bandwidth may over-
flow some disks earlier than others since a disk with twice
the bandwidth may not necessarily have twice the capacity.
In this paper, we propose the BroadScale algorithm for
the disk assignment and scaling of heterogeneous disks. In
addition to block signatures, disk weights are assigned to
each disk depending on its capacity and bandwidth. We will
show that the system is load balanced after blocks are allo-
cated according to both block signatures and disk weights.
As disks are added to and removed from the system, the
location of a block may change. Our objective of course is
to quickly compute the current location of a block, regard-
less of how many scaling operations have been performed.
Moreover, we must ensure an even load on the disks and
minimal block movement during a scaling operation. We
summarize the requirements more clearly as follows.
Requirement 1 (even load): If there are B blocks stored on D disks, maintain the load so that the expected number of blocks on disk d is approximately \frac{w_d}{\sum_{j=0}^{D-1} w_j} \times B, where w_d is the weight of disk d.
Requirement 2 (minimal data movement): During the addition of n disks to a system with D disks storing B blocks, the expected number of block moves is \frac{\sum_{j=D}^{D+n-1} w_j}{\sum_{j=0}^{D+n-1} w_j} \times B. During the removal of n disks, \frac{\sum_{w_j \in R} w_j}{\sum_{j=0}^{D-1} w_j} \times B blocks are expected to move, where R is the set of disk weights of the disks to be removed.
Requirement 3 (fast access): The location of a block is
computed by an algorithm with space and time complexity
of at most O(D) and requiring no disk I/O. Furthermore,
the algorithm is independent of the number of scaling oper-
ations.
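As a quick numeric illustration of Requirements 1 and 2 (the weights and block count below are hypothetical, not from the paper):

    # Illustrative check of Requirements 1 and 2 with hypothetical weights.
    B = 750_000                       # total number of blocks (example value)
    weights = [1, 1, 2]               # w_0, w_1, w_2 for D = 3 existing disks
    added = [4]                       # weight of n = 1 disk being added

    # Requirement 1: expected blocks on disk d is w_d / sum(w_j) * B.
    expected = [w / sum(weights) * B for w in weights]    # [187500, 187500, 375000]

    # Requirement 2: expected moves on an add is sum(new w) / sum(all w) * B.
    moves = sum(added) / (sum(weights) + sum(added)) * B  # 375000.0 blocks move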
We will show that the proposed BroadScale algorithm
solves the problem of scaling heterogeneous disks while up-
holding Requirements 1, 2, and 3. With BroadScale, disk
weights could result in fractional weight values. Since
BroadScale operates on integer weight values, the fractional
portions, termed weight fragments, are wasted and cause
load imbalance. These weight fragments can be reclaimed
through two techniques, disk clustering and fragment clus-
tering. These techniques lead to less weight fragmentation
even though additional block moves are incurred when scal-
ing disks. However, we show through experimentation that
this additional movement is marginal.
The remainder of this paper is organized as follows. Sec-
tion 2 gives background on our Random Disk Labeling al-
gorithm which is the basis for BroadScale. In Section 3,
we describe our BroadScale algorithm. In Section 4, we in-
troduce the concept of disk clustering and how it reduces
inefficiencies in BroadScale. In Section 5, we describe an-
other technique called fragment clustering. Section 6 de-
scribes related work. In Section 7, we describe our experi-
ments. Finally, Section 8 concludes this paper and discusses
future research.
2 Background: Random disk labeling
In this section, we provide background on our Random
Disk Labeling (RDL) algorithm for the scaling of homoge-
neous disks. The full details of RDL are discussed in [22].
We adapt a hashing technique called double hashing to
solve our problem of efficient redistribution of data blocks
during disk scaling. Generally speaking, double hashing ap-
plies to hash tables where keys are inserted into buckets. We
view this hash table as an address space, that is, a memory-
resident index table used to store a collection of slots. Each
slot can either be assigned a disk or be empty. Some slots
are left empty to allow for room to add new disks. We can
think of block IDs as keys and slots as buckets.
We design our address space for P slots (labeled 0, ..., P-1) and D disks, where P is a prime number, D is the current number of disks, and D ≤ P. For this approach, we use a random allocation of disks where we randomly place the D disks among the P slots. We can simply think of D disks which are labeled with random slots in the range 0, ..., P-1, but we use the concept of disks occupying slots to help visualize our algorithm.
As explained in Section 1, each block has a signature, X_i, generated by a pseudo-random number function, p_r1. To determine the initial placement of blocks, we use a block's signature, X_i, as the seed to a second function, p_r2, to compute a random start position, sp, and a random step length, sl, for each block. We want to probe slots until a slot containing a disk is found. The sp value, in the range 0, ..., P-1, indicates the first slot to be probed. The sl value, in the range 1, ..., P-1, is the slot distance between the current slot and the next slot to be probed. We probe by the same amount, sl, in order to guarantee that we search all slots in at most P probes. As long as P is relatively prime to sl, this holds true [9]. The first slot in the probe sequence that contains a disk becomes the address for that block.
Figure 1. Placement of two blocks. Block 0 initially hits and block 1 initially misses (D = 5, P = 101).
Example 2.1: In Figure 1, assume we have 5 disks randomly assigned to 101 slots (D = 5, P = 101). Using the blocks' signatures X_i, we compute sp and sl for blocks 0 and 1. For block 0, sp = 3 and sl = 76. Slot 3 has a disk, so this becomes the address for block 0. For block 1, sp = 46 and sl = 20. Slot 46 does not contain a disk, so we traverse block 1's probe sequence, probing by 20 slots and wrapping around if necessary, until we arrive at slot 5.
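The probing just illustrated can be sketched as follows; deriving sp and sl from the signature with a seeded generator, and the particular disk-to-slot assignment, are illustrative assumptions rather than the paper's exact p_r2.

    import random

    def rdl_locate(signature: int, slots: list, P: int) -> int:
        """Return the slot that holds a block under Random Disk Labeling.

        `slots` is the address space: slots[s] is a disk id, or None if the
        slot is empty."""
        p_r2 = random.Random(signature)
        sp = p_r2.randrange(P)            # start position, in 0..P-1
        sl = p_r2.randrange(1, P)         # step length, in 1..P-1
        s = sp
        for _ in range(P):                # at most P probes since P is prime
            if slots[s] is not None:
                return s                  # first occupied slot is the address
            s = (s + sl) % P
        raise RuntimeError("address space contains no disks")

    # A setting like Example 2.1: P = 101 slots, 5 disks in randomly chosen
    # slots (the slot numbers here are illustrative).
    P = 101
    slots = [None] * P
    for disk_id, slot in enumerate([3, 5, 19, 53, 71]):
        slots[slot] = disk_id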
For an addition operation, n disks are added to n randomly chosen empty slots. Then each block is considered in sequence (0, ..., B-1) and, without actually accessing the blocks, we compute X_i, sp, sl, and the new location for block i. If the new location is an old disk, then the block must already lie on this disk so no moving is necessary. Clearly in this case, the probe length remains the same as before. However, if the new location is a new disk, then we continue with the probe sequence to find the block at its current location in order to move it to the new disk. In this case, the probe length becomes shorter since, before this add operation, this new location was also probed but was empty so probing continued.
Example 2.2: Figure 2 shows an example of adding a new disk to a set of 5 disks. The disk is added to the randomly chosen slot 10. Here, sp = 63 and sl = 24, so the probe sequence is 63, 87, 10, 34, 58, 82, 5 and block i belongs to the disk in slot 5. After scaling operation j, a disk is added to slot 10 and block i moves from the disk in slot 5 to the disk in slot 10 since slot 10 appears earlier in the probe sequence. The resulting probe sequence is 63, 87, 10.
Figure 2. Probe sequence of block i before and after a disk add operation j. Block i moves from disk 5 to disk 10 after disk 10 is added.
Without loss of generality, disks are randomly chosen
for removal. For removal operations, we first mark the disks
which will be removed. Then, for each block stored on these
disks, we continue with its probe sequence until we hit an
unmarked disk to which we move the block. The probe
length is now longer (but no longer than P trials) to find
the new location. This can be illustrated as the reverse of
Example 2.2.
In all cases of operations, the probe sequence of each
block stays the same. It is the probe length that changes
depending on whether the block moves or not after a scal-
ing operation. Hence, the scaling operation and the exis-
tence of disks will dictate where along the probe sequence
a block will reside. After any scaling operation, the block
distribution will be identical to what the distribution would
have been if the disks were initially placed that way. The
amount of block movement is minimized since blocks only
move from old disks to new disks. For any sequence of
scaling operations, RDL will result in a uniform distribu-
tion of blocks across homogeneous disks since blocks have
an equal chance of falling into any slot. Lastly, blocks
are quickly accessible since locating blocks only requires
a maximum of P probes.
2.1 The Filter Method applied to RDL
In a heterogeneous disk system with different capacities
and bandwidths, certain disks will tend to be favored more
than others. If a disk has, for example, twice the bandwidth
and capacity of the others, we want twice as many blocks
hitting it. This means that the block assignments will
not be uniform and must follow some weighting function,
where each disk has an associated weight. We can achieve
this by applying the filter method to RDL so that blocks do
not end up on the first disk they hit in their probe sequence.
Instead, they probe disks one-by-one until the filter method
finds a target disk, based on the disk weight. The higher the
weight, the more likely its corresponding disk will be a hit.
We describe this method below as well as discuss its main
drawback of extra block moves.
Given any block i, let the following denote its probe sequence: P = {d_0, d_1, ..., d_{D-1}}, where d_j is a disk and is unique within the probe sequence. Moreover, disk weights are assigned based on their bandwidth and capacity. We give details on determining disk weights in Section 3. The corresponding weights for the disks that are probed are given by: W = {w_0, w_1, ..., w_{D-1}}.
We now use the filter method for placing block i on a disk along its probe sequence. We define a filter value for each of block i's probes: F = {f_0, f_1, ..., f_{D-1}}, where:

    f_j = \frac{w_j}{\sum_{k=j}^{D-1} w_k}    (1)

It is easy to see that 0 < f_j ≤ 1 for all j and f_{D-1} = 1. In order to determine which disk this particular block belongs to, we use its signature X_i and the disk identifier, d_j, as seeds to a multi-seeded pseudo-random number function to generate a pseudo-random number r_j between 0 and 1. Starting with j = 0 to D-1, we find the first disk d_j of P where r_j ≤ f_j, and place block i on it.
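A minimal sketch of this filter method follows; seeding a generator on the pair (signature, disk identifier) is an assumed stand-in for the multi-seeded pseudo-random function above.

    import random

    def filter_place(signature: int, probe_disks: list, weights: list) -> int:
        """Place a block on a disk along its probe sequence via the filter method.

        probe_disks[j] is the j-th disk id in the block's probe sequence and
        weights[j] its weight w_j."""
        remaining = sum(weights)                       # sum of w_k for k >= j
        for d_j, w_j in zip(probe_disks, weights):
            f_j = w_j / remaining                      # Equation 1
            r_j = random.Random(f"{signature}:{d_j}").random()
            if r_j <= f_j:
                return d_j                             # block lands on disk d_j
            remaining -= w_j
        return probe_disks[-1]                         # f_{D-1} = 1, so never reached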
We can now easily apply the filter method directly to
RDL with varying disk weights by using filter values com-
puted from Equation 1. However, block movement after
scaling will not be minimized in this way. The movement
can only be minimized (moving only from old to new disks)
if disks are added to or removed from the front of the probe
sequence. Since every block has a different probe sequence,
this will not be possible so some blocks will move from old
disks to other old disks.
Thus, this characteristic of the filter method is unde-
sirable and violates Requirement 2 (minimal data move-
ment). Example 2.3 illustrates the additional amount of
block movement when disks are not added to the front of
the probe sequence using the filter method with RDL.
Figure 3. Fig. 3a shows an initial group of 2 disks. Disk 2 is added to the front of the probe sequence in Fig. 3b and to the middle in Fig. 3c. When a disk is added to a position other than the front of the probe sequence, block movement from an old disk to another old disk will occur; blocks that fall in the affected filter range are evenly distributed between disks 2 and 1.
Example 2.3: Consider a homogeneous case of the filter method with RDL where initially D = 2. In Figure 3a, P = {0, 1}, W = {1, 1}, and F = {0.5, 1}. Using X_i and d_0 = 0 as seeds, we compute a pseudo-random number, r_0. If r_0 ≤ 0.5 we place block i on disk 0, or on disk 1 otherwise. In Figure 3b, we add disk 2 to the front of the probe sequence. Here, P = {2, 0, 1}, W = {1, 1, 1}, and F = {0.33, 0.5, 1}. We recompute r_0 using X_i and d_0 = 2 as seeds. If r_0 ≤ 0.33 we place the block on disk 2; otherwise we compute r_1 using X_i and d_1 = 0 as seeds. Now if r_1 ≤ 0.5, block i is placed on disk 0, or on disk 1 otherwise. In this case, if block i moves, it only moves from an old disk to a new disk. However, in Figure 3c, if disk 2 is added between disks 0 and 1, then block i could move from disk 0 to disk 1, an old disk to an old disk.
Even though the filter method does not work directly
with RDL for heterogeneous disks, we will use a similar fil-
ter method as a component of our BroadScale algorithm de-
scribed later in Sections 4 and 5. In Section 3, we introduce
BroadScale, an algorithm similar to RDL, which maps mul-
tiple slots to a single disk to support heterogeneous disks.
3 Disk weights
We have shown how RDL [22] can scale the size of a
multi-disk system consisting of homogeneous disks using
a random placement of the data. With the introduction of
heterogeneous disks, a uniform distribution of blocks from
RDL will not enable the disks to be fully utilized, assum-
ing that all blocks are equally popular. In general, larger
and faster disks should hold more blocks. Using the filter
method with RDL attempts to achieve this, but leads to the
undesirable characteristic of additional block moves.
In this section, we will describe our technique called
BroadScale which extends RDL for the support of hetero-
geneous disks. BroadScale is based on RDL but the main
difference is that each disk can be mapped to multiple slots
depending on the weight value of the disk. In Section 3.1,
we describe how to compute disk weights assuming a static
group of disks that have different bandwidth to space ratios
(BSR). In Section 3.2, we describe how to compute disk
weights for a dynamically growing group of disks.
3.1 Disk weights for a static disk group
Instead of using the filter method directly with RDL,
BroadScale assigns multiple slots to a single disk. The more
slots assigned to a disk, the more blocks this disk will con-
tain. We call the number of slots assigned to a particular
disk the weight of the disk. Each disk may or may not have a
different weight depending on its two characteristics: band-
width¹ and capacity. Clearly, a disk of weight 10 will get
twice as many blocks assigned to it as a disk of weight 5.

¹ Since multi-zoned disks have various bandwidth characteristics, we use the average bandwidth for simplicity.
The weight of disk d is unit-less and is computed by its normalized bandwidth B_d/B_MAX or normalized capacity C_d/C_MAX or a combination of both. When both bandwidth and capacity are considered, a system administrator can set the weight to w'_d according to:

    w'_d = \frac{B_d}{B_{MAX}} \times \beta + \frac{C_d}{C_{MAX}} \times (1 - \beta)    (2)

where β is the percentage of bandwidth contribution to w'_d. B_MAX and C_MAX can be set to estimated future maximum bandwidth and capacity values. Since the w'_d's could be fractional numbers, we can divide them by w'_G, which is the Greatest Common Factor (GCF) of the w'_d's, to obtain integer values for the weights. Hence, the disk weight, w_d, is computed using:

    w_d = \frac{w'_d}{w'_G}    (3)

Note that this assumption of w_d being an integer value may change when new disks are added, resulting in weight fragmentation. Later, we reduce this fragmentation in Sections 4 and 5.
Example 3.1: Suppose we have 2 disks where B_0 = 10 MB/s, C_0 = 20 GB, B_1 = 20 MB/s, and C_1 = 40 GB. If the disk weights should only depend on bandwidth (β = 1), then w'_G = 10, w_0 = 1, and w_1 = 2.
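As a sketch of Equations 2 and 3, the following computes integer weights for a static disk group; representing the w'_d values as exact fractions so that a GCF is well defined is an implementation choice, not something the paper specifies.

    from fractions import Fraction
    from functools import reduce
    from math import gcd

    def disk_weights(B, C, B_MAX, C_MAX, beta):
        """Integer disk weights per Equations 2 and 3.

        B and C are per-disk bandwidth and capacity lists; beta is the
        bandwidth contribution."""
        w_prime = [Fraction(b, B_MAX) * Fraction(beta) +          # Equation 2
                   Fraction(c, C_MAX) * (1 - Fraction(beta))
                   for b, c in zip(B, C)]
        # GCF of fractions: gcd of numerators over lcm of denominators.
        num = reduce(gcd, (w.numerator for w in w_prime))
        den = reduce(lambda a, b: a * b // gcd(a, b),
                     (w.denominator for w in w_prime))
        w_G = Fraction(num, den)
        return [w / w_G for w in w_prime]                         # Equation 3

    # Example 3.1: bandwidth-only weights (beta = 1) give w = [1, 2].
    print(disk_weights(B=[10, 20], C=[20, 40], B_MAX=20, C_MAX=40, beta=1))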
When computing disk weights, inefficiencies arise when the bandwidth to space ratios (BSR) of the disks are not all identical. If β = 1 then the number of blocks on a disk depends solely on its bandwidth, where w_d = B_d / w'_G. The storage capacity utilized on each disk will be equivalent to the capacity of the disk with the highest BSR. Thus, some capacity will be left unused on those disks with lower BSRs. However, a restriction may occur on disks with higher BSRs. These disks will fill up more quickly than other disks since they have proportionally less capacity. In this case, the bandwidth of these disks will not be fully utilized. To create more room on these disks, more disks need to be added.

In Figure 4a, w_d = 1 for d = 0, 1, 2, so each disk is assigned to one slot and receives an equal number of blocks. Disks 0 and 1 have under-utilized storage capacities because β = 1, indicating that the aggregate bandwidth should be fully utilized. Hence, the maximum aggregate amount of useful storage in the system is the capacity of the disk with the highest BSR multiplied by D.

On the other hand, if β = 0 then the number of blocks on a disk depends solely on its capacity, so larger disks will contain more blocks, even if they have little bandwidth. Here, w_d = C_d / w'_G. The bandwidth of disks with lower BSRs will be more stressed since they have proportionally less bandwidth than disks with higher BSRs. In this case, disks are restricted by their bandwidths since they might be slowed considerably. Figure 4b shows an example of this case where disks 0 and 1 contain more blocks than disk 2 but all have the same bandwidth. Since all blocks have an equal chance of being accessed, more block requests will be delivered to disks 0 and 1, creating potential bottlenecks.

Figure 4. Fig. 4a shows unused capacity in disks 0 and 1 when β = 1 (C_0 = 148 GB, C_1 = 74 GB, C_2 = 37 GB; w_d = 1 for all disks). Fig. 4b shows potential bottlenecks at disks 0 and 1 when β = 0 (w_0 = 4, w_1 = 2, w_2 = 1). B_d = 45.5 MB/s for all disks.
Therefore, we can determine the weight of disk d us-
ing Equation 3 to obtain an integer weight value that can
map slots to disks. Next, we explore how to determine disk
weights for a dynamically growing disk group.
3.2 Disk weights for a dynamic disk group
Since we allocate slots (and therefore blocks) to disks according to the weight of each disk, dividing the weight by a factor will have the effect of changing the number of slots allocated to the disk. Let us use w'_F to denote this dividing factor in general. The trade-off is that small w'_F values lead to larger w_d values, which tend to make any fractional value of the weight relatively insignificant but will require more slots. Having more slots increases the storage requirements of the address space but, more significantly, increases the total amount of probing to locate blocks [22]. On the other hand, larger w'_F values lead to smaller w_d values, which could result in under-utilized disk resources due to the more significant fractional value of the weight.

Trend reports, such as [7], of the growth rate of magnetic disk technology allow us to estimate the characteristics of near-future disks. We can use these estimations to help us determine w'_F. For example, if a system administrator anticipates adding new disks one year from now, w'_F can be computed from the characteristics of the existing disks and from estimations of those to be added in one year.

When disks are added to the system in the far future, their characteristics (and therefore w'_F) are harder to estimate. This could lead to more fractional values and under-utilized new disks. We do not want to frequently update w'_F, which affects the disk weights, since this will require a reorganization of the data.
We can better utilize unpredictable far-future disks without using an over-abundance of slots by using an estimation of w'_F, which we call the Estimated Common Factor (ECF) of the total weight, or w'_{E,α}. The w'_{E,α} of a new disk group is computed such that the combined bandwidth and capacity usage will be at least α percent. w'_{E,α} is computed by Equation 4:

    \frac{\sum_{j=0}^{D-1} (w'_j \bmod w'_{E,\alpha})}{\sum_{j=0}^{D-1} w'_j} = 1 - \frac{\alpha}{100}    (4)

where D is the total number of disks in the new group and the numerator adds up all the fractional portions of the weights.
Figure 5. Total bandwidth of 4 disks divided into ECF-sized portions. The total shaded region (the weight fragments) is 10% of the total disk bandwidth; the total unshaded region is 90%. In this case, w'_{E,90} is the ECF.
Therefore, w'_{E,90} gives the ECF of the aggregate weights such that the utilization is at least 90%. Figure 5 shows the largest value of w'_{E,90} that still achieves at least 90% utilization. Since far-future additions most likely involve disks with higher bandwidth and larger capacity, the weights of the new disks will be larger, and thus the fractional weight portion will be smaller. For the rest of this paper, we refer to fractional weight values as the weight fragmentation of a group of disks. Fragmented weights, caused by the shaded regions in Figure 5, lead to an under-utilization, or waste, of the disk weight (e.g., a weight of 3.5 has 0.5 of wasted weight).

We arrive at Equation 5, which maintains at least α percent utilization for near- and far-future scaling operations:

    w_d = \frac{w'_d}{w'_{E,\alpha}}    (5)

Note that w_d may not be an integer value when a heterogeneous disk d is added. This is a problem since w_d is the number of slots assigned to disk d, which of course cannot be fractional. In Sections 4 and 5, we describe two approaches that BroadScale uses to reduce the waste associated with weight fragmentation.
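The paper does not prescribe how w'_{E,α} is solved for, so the sketch below simply grid-searches for the largest candidate whose summed weight fragments stay within the 1 - α/100 budget of Equation 4, and then applies Equation 5; the search granularity and the example weights are arbitrary assumptions.

    def estimated_common_factor(w_prime, alpha, step=0.01):
        """Largest divisor whose total weight fragments (w'_j mod w'_E)
        waste at most (1 - alpha/100) of the total weight (Equation 4)."""
        total = sum(w_prime)
        budget = (1 - alpha / 100) * total
        best = step
        e = step
        while e <= max(w_prime):
            waste = sum(w % e for w in w_prime)
            if waste <= budget:
                best = e                  # still at least alpha% utilized
            e = round(e + step, 10)
        return best

    # Equation 5: divide each w'_d by the ECF; the floors give slot counts.
    w_prime = [3.5, 2.9, 3.8, 6.0]        # illustrative weights
    w_E = estimated_common_factor(w_prime, alpha=90)
    weights = [w / w_E for w in w_prime]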
4 Disk clustering
The disk weights, as described in the previous section,
might be fragmented and hence cannot be mapped directly
to an integer number of slots in RDL’s address space. In this
section, we explore how to map fragmented disk weights to
slots using disk clustering where clusters have almost inte-
ger weights. For example, a disk weight of 1:5 cannot be
mapped to 1:5 slots, but it can be clustered with another
disk of weight2:5 and mapped together to4 slots. The idea
is to try to reduce the fractional portion of the aggregated
disk weights in each cluster as much as possible. Then each
cluster is assigned to one or more slots instead of each disk
being assigned to slots. Actually, now the slots have no
knowledge of the disks at all. The higher the cluster weight,
the more slots it is assigned to and the more blocks that will
be assigned to the cluster. A question remains: after a block
is assigned to a cluster, which disk within the cluster should
it reside on?
For the rest of this section, we first describe the simple
case of single-disk clusters. Then we show that using multi-
ple disks per cluster can reduce the waste of disk resources.
Finally, we describe how to locate a disk within a disk clus-
ter for block assignment. We are not concerned with the
terms disk bandwidth and capacity in our discussions in
Sections 4 and 5 since they have both been translated into
the concept of disk weights.
4.1 Single disk clusters
One method to accommodate disks with fragmented weights is to simply use ⌊w_d⌋ as the weight of disk d. This can be viewed as disk clusters each containing only one disk. For each cluster, the maximum amount of waste would then be less than one unit of weight.

If the weights of new disks are relatively high, then the percentage of waste may not be significant. However, some low disk weights may actually be less than 1, in which case they cannot be mapped to any slots and are, in effect, unusable. The single-disk cluster solution may suffice for steady or increasing disk weights as disks are added, but is clearly inadequate for fragmented, low disk weights.
4.2 Multiple disk clusters
By logically clustering multiple disks together, we can
reclaim the fractional portions of the disk weights and re-
duce the amount of waste. Instead of using individual disk
weights, cluster weights map a cluster of one or more disks
to the appropriate number of randomly chosen slots.
The fragmentation of a cluster’s weight can decrease as
disks are added to the cluster. The cluster’s weight is the
sum of its disks' weights. The objective is for the cluster weight, w_c, of cluster c to be as close to ⌊w_c⌋ as possible. Since it may be hard for w_c to be exactly equal to ⌊w_c⌋, the cluster is said to be full when (w_c - ⌊w_c⌋) ≤ (1 - ε). ε is specified by the user to indicate when a cluster is full (i.e., ε × 100 percent full). Thus, when ε = 0.95, a cluster is full when the fractional portion of w_c is less than 0.05. A new disk, d, is either added to an existing non-full cluster or it is added to a new empty cluster by itself. The disk is added to a cluster such that the fractional portion of w_c is reduced after the inclusion of w_d. In other words, disk d is added in such a way that the overall waste of the storage system is reduced.
To decide where a new disk is added, we can generalize
our problem to the classical bin packing [4] problem. The
objective of bin packing is to pack variable-sized items in as
few bins as possible; thus, each bin becomes as full as possi-
ble. Our main objective is also to pack each cluster as full as
possible with disks. For our problem, we will only consider
the fractional portion of the disk and cluster weights. Disks
are items and clusters are bins, but the fractional portions of
the disk weights are the item sizes and the clusters (bins)
are of size 1. A slight difference to traditional bin packing
is that a disk can only be packed into a cluster if it reduces
the fractional portion of the sum of the disk sizes in that
cluster. An easy way to translate this back to traditional bin
packing is to use Equation 6 to compute disk sizes:

    s(w_d) = 1 - (w_d - ⌊w_d⌋)    (6)

where w_d is a disk weight and s(w_d) is the disk size. Hence, by packing disks into clusters of size 1, the weight fragmentation of clusters is reduced.
With traditional bin packing, all items (disks) are known
beforehand so an optimal packing arrangement does exist
even though it cannot be found in polynomial time (an NP-
hard problem). However, since we have no prior knowledge
of how many future disks there are, we must optimally rear-
range the entire packing for each new disk, requiring most
blocks to be moved each time. Obviously, this solution is
infeasible. Therefore, we use a heuristic such as the Best
Fit [4] algorithm to optimize the placement of just the next
disk to be added. Using Best Fit, disk d should be placed in a cluster that has current size closest to, but not exceeding, 1 - s(w_d). If this results in multiple clusters, then the one with the lowest index is chosen as the tiebreaker. If disk d does not "fit" into any of the clusters, a new cluster is created with disk d. Thus, Best Fit searches all the clusters and picks the best one to which the disk should be added.
Large-scale storage systems may have on the order of
1,000 disks, so an exhaustive search to find the best cluster
to place a disk may be computationally intensive. In these
cases, the First Fit [4] algorithm may be more appropriate
since it simply picks the first cluster in which the disk fits.
First Fit and Best Fit are both good approximations to the
optimal solution of traditional bin packing since they re-
quire at most 70% more bins than the optimal number of
bins [4].
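The Best Fit placement just described can be sketched as follows, working only with the fractional parts of the weights; the data structures and helper names are illustrative, not the authors' implementation.

    import math

    def frac(x):
        return x - math.floor(x)

    def best_fit_cluster(clusters, w_d, epsilon=0.95):
        """Choose a cluster for a new disk of weight w_d using Best Fit.

        `clusters` holds the current cluster weights w_c.  The disk's item
        size is s(w_d) = 1 - frac(w_d) (Equation 6); a cluster accepts the
        disk only if doing so reduces its weight fragmentation, and Best Fit
        picks the cluster left closest to full."""
        size = 1 - frac(w_d)                              # Equation 6
        best, best_fill = None, -1.0
        for idx, w_c in enumerate(clusters):
            is_full = frac(w_c) <= (1 - epsilon)          # epsilon-full cluster
            fill = 1 - frac(w_c)                          # bin content of cluster c
            if not is_full and fill <= 1 - size and fill > best_fill:
                best, best_fill = idx, fill               # tightest fit so far
        if best is None:                                  # disk fits nowhere:
            clusters.append(w_d)                          # open a new cluster
            return len(clusters) - 1
        clusters[best] += w_d
        return best

    # Example: a disk of weight 1.5 joins the cluster holding 2.5, giving 4.0.
    clusters = [2.5, 3.2]
    best_fit_cluster(clusters, 1.5)      # clusters becomes [4.0, 3.2]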
Figure 6. 4 clusters of disks. Each cluster is mapped to 1 or more slots. The disks are heterogeneous but do not appear so, in order to simplify the figure.
After disk d is added to cluster c, the number of slots assigned to the cluster is ⌊w_c⌋. If this represents an increase in slots, then more empty slots are randomly assigned to the cluster. Figure 6 shows 4 clusters of disks, each mapped to 1 or more slots. Without loss of generality, removing a disk from cluster c reduces w_c, and slots are randomly chosen to be unassigned from this cluster. Using RDL and the concept of disk clusters, blocks are uniformly distributed across the slots and are thus proportionally distributed across the clusters based on ⌊w_c⌋. Now we can find which cluster a particular block belongs to using the slot-to-cluster mapping. In the next section, we describe how to find a particular disk within a cluster for the block.
4.3 Locating a disk within a cluster
Once a disk cluster is found for a block using RDL, we
must locate a disk within the cluster for the block to reside. The likelihood that a block will land on disk d within a cluster c is w_d / w_c, since this is the percentage of the weight of this disk among all the disks in the cluster. To achieve this distribution, we use the filter method described in Section 2.1, except that the probe sequence consists only of the disks within the cluster, with the first disk of the sequence being the most recently added one. More importantly, the probe sequence is now the same for all blocks.
To locate a disk, we first logically arrange the disks within the cluster in reverse order by their disk identifiers, d. Note that these disk numbers may not be contiguous within a cluster since new disks could have been added to different clusters at different times. Then, starting with the most recent disk (with the highest d), we use the signature, X_i, of block i and the global disk identifier, d, as seeds to a pseudo-random number function to generate a random value, r_0, in the range 0...1. One example of a well-performing function², as suggested by [20], is:

    srand(d);               /* seed with the global disk identifier d       */
    srand(rand() ^ X);      /* re-seed, mixing in the block signature X_i   */
    r = rand() / R;         /* R is the range of rand(), e.g. RAND_MAX in C */

where R is the range of rand(). Next, if r_0 is less than or equal to the filter value, f_0, then block i should reside on the 0-th disk of cluster c's probe sequence. If r_0 > f_0, then we compare r_1 and f_1 to determine whether the block should reside on the 1-st disk, and so on. This filter value is computed using Equation 7, similar to Equation 1:

    f_j = \frac{w_j}{\sum_{k=j}^{D_c - 1} w_k}    (7)

where D_c is the number of disks in cluster c and w_j is the weight of the j-th disk in the cluster's probe sequence.

² There are other types of pseudo-random number functions to consider, but finding a good one is hard [12].
Figure 7. After disk 8 is added, a new slot is mapped to cluster 3 since w_3 increases. Some blocks in clusters 0, 1, and 2 are moved to disks 8, 6, and 4.
Since new disks are always added to the front of the
probe sequence for cluster c, we can guarantee that blocks
will only move from the old disks to the new disks within
the cluster. However, some blocks from outside the cluster
may move to the old disks within the cluster as shown in
Figure 7. This occurs when adding a disk to a cluster in-
creases the cluster weight, w_c, by more than 1 and causes
an increase in slots that are mapped to the cluster. Thus,
RDL will move some blocks from every old slot to the new
slots. Blocks that are assigned to the new slots all have a
chance of landing on any disk in the cluster, so some blocks
may end up on an old disk of the cluster. We will show in
Section 7 that the amount of this additional movement is
not significant. Note that the movement from old disks to
old disks will not cause any unevenness in the block distri-
bution, only the consumption of additional disk bandwidth
and possibly additional network bandwidth if the disks are
separated across a network. We will also explain in Sec-
tion 7 that having larger clusters requires more computation
to locate disks within a cluster. Thus, the trade-off is be-
tween less computation and less wasted disk weight.
5 Fragment clustering
As described in the previous section, disk clustering is
a way to reduce the overall weight fragmentation by se-
lectively adding disks to clusters, thereby reducing clus-
ter weight fragmentation. Another approach to reducing
weight fragmentation is fragment clustering, where only the
fractional portions of the disk weights are clustered to-
gether. With fragment clustering, physical disks are mapped
to 1 or more randomly chosen slots. Then the fractional disk
weight portions are grouped together into unit-sized logical
disks and each logical disk is mapped to 1 randomly chosen
slot. The details of this approach are described as follows.
Figure 8. A disk of weight 3.5 is mapped to 3 slots. The .5 fractional value, along with a pointer to the disk, is stored as an entry in the PLD. The PLD is activated into an ALD when it is full.
When a new heterogeneous disk d is added to the storage system, it is mapped to ⌊w_d⌋ randomly chosen slots in the virtual address space. If the fractional portion of w_d, namely w_d - ⌊w_d⌋, is greater than 0, its fractional value, along with a pointer to disk d, is appended as an entry in the unit-sized Pending Logical Disk (PLD). Figure 8 shows an example of mapping a disk of weight 3.5 to 3 slots with the fractional weight .5 appended as an entry in the PLD. The fragment clustering algorithm maintains only one PLD, which stores the currently unutilized weight fragments. The PLD becomes full when the sum of its fractional values is greater than or equal to 1.0. If the sum is greater than 1.0, the excess value, along with the pointer to d, is stored as an entry in the subsequent PLD. Once the PLD is full, it becomes an Active Logical Disk (ALD) and a new logical disk is allocated to assume the role of the PLD. This ALD is now mapped to a randomly chosen slot. Of course, the number of ALDs increases as more disks are added.
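A sketch of this bookkeeping for a single disk addition is shown below; the slot_map, pld, and alds structures, and the way an over-full fragment is split across PLDs, are assumptions about one way the description above could be realized.

    import math

    def add_disk(disk_id, w_d, slot_map, pld, alds, free_slots):
        """Map a new disk's weight to slots and bank its fragment in the PLD.

        slot_map maps slot -> physical disk id or ("ALD", index); pld is a
        list of (fraction, disk_id) entries; alds collects completed PLDs;
        free_slots is a list of unassigned (randomly shuffled) slot indices."""
        for _ in range(math.floor(w_d)):              # floor(w_d) whole slots
            slot_map[free_slots.pop()] = disk_id
        fragment = w_d - math.floor(w_d)
        if fragment > 0:
            pld_sum = sum(f for f, _ in pld)
            take = min(fragment, 1.0 - pld_sum)       # fill the PLD up to 1.0
            pld.append((take, disk_id))
            if pld_sum + take >= 1.0:                 # PLD is full: activate it
                alds.append(list(pld))                # it becomes an ALD ...
                slot_map[free_slots.pop()] = ("ALD", len(alds) - 1)
                pld.clear()                           # ... and a fresh PLD starts
                if fragment - take > 0:               # excess spills over
                    pld.append((fragment - take, disk_id))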
Once the disks are mapped to slots in the manner just de-
scribed, data blocks can be assigned to disks using RDL as
normal by probing the slots. Figure 9 illustrates the place-
ment of block 1 directly onto a physical disk and block 2
onto a physical disk via a logical disk (i.e., an ALD). When
a slot mapped to a physical disk is probed, that block is as-
signed to the disk. However, when a slot mapped to an ALD
is probed, further computation must be done to determine which physical disk this block eventually resides on. To accomplish this, we use the filter method from Sections 2.1 and 4.3 on the ALD, which contains entries of fractional values summing up to 1.0. Once an entry is found using the filter method, the pointer within that entry is followed to the physical disk where the block should be placed.

Figure 9. Block 1 probes slots using RDL until it lands on the disk with weight 3.5. Block 2 probes slots and lands on an ALD. Within the ALD, the filter method determines that Block 2 hits the entry with value .6 and is forwarded to the disk with weight 3.8. Note that the pointer of the entry in the PLD is not yet active since the PLD is not yet full.
Since adding new disks will never cause updates to ALD
entries (only the PLD entries), we do not need a specific
initial ordering (i.e., least recently added to most recently
added) of the entries. However, once an initial entry order-
ing is decided from the construction of the PLD, this order-
ing must remain the same after conversion to ALDs.
Intuitively, using fragment clustering, the overall amount
of weight fragmentation at any given time will always be
less than 1.0. This fragmentation is contributed only
by the fractional values in the PLD. When the PLD fills to
capacity (i.e., 1.0), it becomes active and its weight frag-
ments are utilized. We show that this is true in Section 7.
In sum, BroadScale is a technique which first involves
computing a weight for each disk based on bandwidth
and/or capacity as described in Section 3. If these weights
are not integer values, then we have weight fragmentation
and wasted disk resources will arise. BroadScale reduces
weight fragmentation through two approaches, disk clus-
tering and fragment clustering. Disk clustering strategically
clusters disks together using the Best Fit algorithm to reduce
fragmentation. Fragment clustering is another approach
where the fractional portion of the weights are grouped as
logical disks. A comparison of disk clustering with frag-
ment clustering is discussed in Section 7 along with benefits
and drawbacks of each.
6 Related work
We describe related work on two categories of appli-
cations to which we can apply our BroadScale algorithm.
These categories are redistributing CM blocks on CM server
disks and remapping Web objects on Web proxy servers.
Previous literature on CM servers has discussed ar-
eas such as distributed architectures and retrieval schedul-
ing [10, 16]. The topic of homogeneous and heterogeneous
disk scaling in CM servers has been the focus of a few past
studies. One study mixes popular (“hot”) and unpopular
(“cold”) CM data objects together on heterogeneous disks
with different BSRs [3]. Their objective is to maximize the
utilization of both bandwidth and capacity while maintain-
ing the load balance. However, the popularity of the ob-
jects needs to be known ahead of time to place them properly.
Moreover, their popularity might change over time (e.g.,
new movies tend to be accessed more frequently) so the ob-
jects may need to be moved depending on their current pop-
ularity. Other techniques stripe fixed-size object blocks, de-
scribed below, as opposed to storing them in their entirety.
Disk scaling with round-robin data striping is discussed
in [5]. With round-robin, almost all blocks need to be re-
located when scaling. The overhead of such block move-
ment may be amortized over a period of time but it is,
nevertheless, significant and wasteful. Wang and Du [21]
describe a technique which assigns weights to disks based
on bandwidth and capacity. However, they also distribute
data blocks in a round-robin fashion, requiring large block
movement overhead when scaling. Another technique
called Disk Merging [23] merges a static group of hetero-
geneous physical disks into homogeneous logical disks to
maximize bandwidth and capacity for striped data. This
technique is not intended for dynamic scaling since the sys-
tem must be taken off-line and reconfigured, potentially
reshuffling many blocks.
While traditional constrained placement techniques such
as round-robin placement allow for deterministic service
guarantees, random placement techniques are modeled sta-
tistically. The RIO project demonstrated the advantages of
random data placement such as single access patterns and
asynchronous access cycles to reduce disk idleness [11].
However, they did not consider the dynamic rearrangement
of data due to disk scaling. Although they do not require
prior knowledge of object popularity for full utilization of
heterogeneous disks’ aggregate bandwidth, their solution
requires data replication for short and long term load bal-
ancing [14]. In one scenario, they require at least 34% block
replication for 100% bandwidth utilization. Another study
focused on the trade-off between striping and replication
for load balancing [2]. For large systems, the extra storage
needed for replication becomes more significant. In gen-
eral, random placement, or pseudo-random in our case, in-
creases the flexibility to support various applications while
maintaining a competitive performance [15].
We developed a prior technique called SCADDAR to re-
distribute data blocks after homogeneous disk scaling in a
CM server by mapping the block signatures to a new set of
signatures for an even, randomized distribution [6]. SCAD-
DAR adheres to the requirements of Section 1 except that
computation of block locations becomes incrementally more
expensive. Finding a block’s location requires the compu-
tation of that block’s location for every past scaling opera-
tion, so a history log of operations must be maintained. In
comparison, our RDL and BroadScale algorithms are fast in
computation even though they are limited by the total num-
ber of disks (i.e., P).
Several past works have considered mapping Web ob-
jects to proxy servers using requirements similar to those
described in Section 1. Below we describe two relevant
techniques called Highest Random Weight (HRW) and con-
sistent hashing along with their drawbacks.
HRW was developed to map Web objects to a group of
proxy servers [20]. Using the object name and the server
names, each server is assigned a random weight. The ob-
ject is then mapped to the highest weighted server. After
adding or removing servers, objects must be moved if they
are no longer on the highest weighted server. The draw-
back here is that the redistribution of objects after server
scaling requires B × D random weight function calls, where
B is the total number of objects and D is the total num-
ber of proxy servers. A simple heterogeneous extension to
HRW is described in [13], but suffers from the same com-
putational complexity. We show in [22] that in some cases
HRW is several orders of magnitude slower than our RDL
technique. An optimization technique for HRW involves
storing the random weights in a directory, but the directory
size will increase as B and D increase, causing the algorithm
to become impractical.
Consistent hashing is another technique used to map
Web objects to proxy servers [8]. Here objects are only
moved from two old servers to the newly added server. A
variant of consistent hashing used in a peer-to-peer lookup
server, Chord, only moves objects from one old server to the
new server [19]. In both cases, the result is that objects may
not be uniformly distributed across the servers after server
scaling since objects are not moved from all old servers to
the new server. With Chord, a uniform distribution can be
achieved by using virtual servers, but this requires a consid-
erable amount of routing meta-data [1].
7 Experiments
In this section, we describe our simulation experiments
to validate our BroadScale algorithm. First, we show that
data blocks are distributed across the disks according to the
disk weights. The higher the weight, the more blocks will
reside on the corresponding disk. Next, we measure the
amount of weight fragmentation from disk clustering and
fragment clustering. With disk clustering, varying the size
of the clusters affects the amount of fragmentation. Then,
we show that the additional amount of block movement us-
ing disk clustering is not significant compared to the overall
number of moves. This movement is even lower with frag-
ment clustering. Finally, the average and maximum number
of probes is shown for disk and fragment clustering.
For all of our experiments, we distributed approximately 750,000 blocks across 10 initial disks, which is a realistic starting point. We set the total number of slots to 1,511 (i.e., P = 1,511) because we need room to add disks and multiple slots are mapped to each disk depending on the disk weight. We computed disk weights for a dynamic disk group using Equation 5 by setting β = 1, so that the number of blocks on a disk depends solely on disk bandwidth. We set α = 90% so that at least 90% of the aggregated disk weight is utilized to determine the number of blocks per disk. When simulating disk scaling, we assume a 10-disk add operation is performed every 6 months. For this time period, industry trends suggest that disk bandwidth increases 1.122× [7], and disk capacity increases 1.26× following Moore's Law. Our added disks follow these trends.
Figure 10. The number of blocks per disk follows the same trend as the disk weight (x-axis: disk number 0-9; left axis: # of blocks per disk and # blocks / weight; right axis: disk weight).
The disk weights are used to indicate how many blocks
should reside on a disk relative to other disks. Since the
number of slots assigned to a disk is roughly equal to the
disk weight, more slots assigned to the disk will result in
more blocks for the disk. Figure 10 shows blocks dis-
tributed across 10 disks by BroadScale. For illustration
purposes, these disks vary widely in bandwidth and, there-
fore, in weight. After distributing the blocks, the trend of
the amount of blocks per disk follows the trend of the disk
weights. The blocks per disk (w.r.t. the left axis) and the
disk weights (w.r.t. the right axis) are overlaid together on
the same figure to show their similarity. Moreover, as ex-
pected, Figure 10 shows that the normalized curve (w.r.t.
the left axis) is quite uniform across disks. The normalized
curve is computed as (blocks on disk d / weight of disk d).
Figure 11. The overall amount of weight fragmentation using disk clustering (cluster sizes 1, 2, and 3) and fragment clustering (x-axis: # of disks after 10-disk add operations; y-axis: aggregated weight fragmentation).
Disk clustering and fragment clustering were two tech-
niques introduced in Sections 4 and 5. The purpose of clus-
tering is to reduce the fragmentation of the disk weights,
thus reducing the waste, so that each disk will hold a more
accurate number of blocks. Figure 11 shows the aggregated waste of disk weights using both techniques as the storage system is scaled by adding 10 disks at a time with 10 initial disks. For disk clustering, we want to show that the total amount of unutilized disk weight decreases as clusters increase in size. When the maximum cluster size, K_MAX, is 1, the effect is that there is no clustering. Here, disk d is assigned to ⌊w_d⌋ slots and the amount of wasted disk weight is significant. However, increasing K_MAX to 2 gives us much better weight utilization since the clusters are combining fragmented weights. Furthermore, setting K_MAX = 3 leads to even greater improvement. We found that disk clusters became full at 3 disks, which is the expected value of the cluster size, so increasing K_MAX beyond 3 gave no improvement.
Using the Best Fit algorithm for disk clustering, the expected value, E(D_c), of the number of disks on cluster c can be determined by analyzing the fractional part of the disk weights. Given a disk weight, the expected value of the fractional portion is .5. The second weight must reduce the fractional portion when summed with the first, so the expected value becomes .25. Each time a weight is added in this way, the expected fractional value is halved, so we have:

    0.5^{E(D_c)} = 1 - p    (8)

where p is the precision of E(D_c), since 0.5^{E(D_c)} will never equal 0. For example, with a precision of 0.94, E(D_c) = 4 disks. Solving for E(D_c), we arrive at the following:

    E(D_c) = \log_{0.5}(1 - p)    (9)
Fragment clustering demonstrates the best performance
since the maximum amount of total fragmentation is al-
ways less than 1.0. This is attributed to the fractional val-
ues stored in the PLD. However, with fragment clustering,
newly added disks are almost never fully utilized since the
PLD contains weight fragments only from these recently
added disks. Nevertheless, this may become insignificant
as the disk weights increase.
There exists a trade-off between low computation and
low weight fragmentation for disk clustering, since finding the disk within a cluster on which a block resides requires less computation for smaller clusters. To find a block located in cluster c using the filter method, on average, the pseudo-random function is invoked for half of the disks in c. Hence,
finding a disk within small clusters requires less computa-
tion, but results in more weight fragmentation. However,
since the expected number of disks per cluster, from Equa-
tion 9, is low and we observe low weight fragmentation
in Figure 11 with small clusters (of size 3), high compu-
tation is not required. For fragment clustering, we cannot
change the size of the PLD, but the number of PLD entries
is low so finding a particular entry using the filter method is
not costly. Below, we observe the overall amount of block
movement of disk and fragment clustering.
In Section 4.3, we explained that adding disks could
cause more blocks to be moved than the minimum that is
required to fill the new disks. This is true of both clustering
approaches. With disk clustering, the additional moves de-
pend on the cluster size. If a disk is added to a new empty
cluster, no extra moves are incurred. If a disk is added to
a non-empty cluster, some blocks will be moved to the old
disks in that cluster in addition to the new disks. Similarly,
fragment clustering will result in these redundant moves
when adding a disk causes the conversion of a PLD to an
ALD. Since the PLD contains fractional entries from old
disks, activating the PLD to an ALD will redistribute data
from old disks to these old entries. However, the amount of
these data moves is low since one ALD is a small compo-
nent of the entire storage system.
Figure 12 shows the total amount of block movement
when scaling disks 10 at a time with 10 initial disks. Here
β = 1 so disk weights only represent the disk bandwidth, which grows 1.122× every scaling operation. We observe that the percentage of block moves from an old disk to another old disk is on average 13% of the total moves for disk clustering. Since we can expect cluster sizes to be small, from Equation 9, and using small clusters is effective, the additional movement will not be a significant percentage of the total. We notice a decreasing trend in total block moves since the 10 disks that are added each time require fewer and fewer blocks to fill them, assuming the number of blocks is constant. A similar test on fragment clustering results in only around 3% redundant moves since a new ALD is small compared to the actual added disks.

Figure 12. The additional block movement (old disk to old disk) for disk clustering represents a small fraction of the total block movement (x-axis: # of disks after 10-disk add operations; y-axis: total # of block moves).
Figure 13. The average percentage of redundant (old disk to old disk) block movement for various growth rates, for disk clustering and fragment clustering (x-axis: rate of bandwidth growth per scaling operation (%); y-axis: % of old disk to old disk block movement).
Figure 12 employs the industry growth rate of disk band-
width. For other growth rates, the percentage of old disk to
old disk block movement decreases as higher growth rate
disks are added for both clustering techniques. This is due
to the fractional weight portion being proportionally smaller
than the whole weight of these growing disks. Figure 13
shows the percentage of this redundant block movement
with respect to the growth rate. For each growth rate value,
the average percentage of redundant movement is measured
as disks are scaled. The average percentage is calculated
from 25 trials of each growth rate value, with each trial using
a different randomness factor to slightly vary the disk char-
acteristics. From Figure 13, fragment clustering exhibits
less redundant movement than disk clustering. Redundant
moves result from blocks being redistributed into old disks
of a cluster and old fractional entries of an ALD in disk
and fragment clustering, respectively. Fragment clustering
has fewer redundant moves since it isolates these moves to
just one ALD, whereas disk clustering spreads these moves
across an entire disk cluster.
Lastly, a higher growth rate when scaling disks should
lead to less probing. The reason is that disks with larger
weights will require more slots. This causes probing to be
more successful in general since there are fewer empty slots
and misses are less frequent. Figure 14 shows the average
and maximum number of total probes as disks are
scaled 10 at a time, beginning with 10 disks. The probing
results of disk clustering and fragment clustering are similar
and indistinguishable in the figure since the number of
cluster groupings in each technique is similar. Figure 14a
shows that the average number of probes is lower when scaling
disks at a bandwidth growth rate of 45% than at a growth
rate of 5%. Similar results are shown in Figure 14b for the
maximum number of probes.
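As an illustration of why denser slot occupancy shortens probe sequences, the following sketch simulates probing; it is only a stand-in: the disk weights and slot layout are hypothetical, and the actual probe sequence is the one defined by Random Disk Labeling [22], not the block-seeded pseudo-random sequence used here.

import random

# Simplified probing illustration (hypothetical weights; the real probe
# sequence is defined by Random Disk Labeling [22]): higher-weight disks
# occupy more of the P slots, so a block's probe sequence hits an occupied
# slot sooner, which matches the trend in Figure 14.
def probes_for_block(block_id, slot_to_disk, P):
    rng = random.Random(block_id)      # block-specific pseudo-random sequence
    probes = 0
    while True:
        probes += 1
        slot = rng.randrange(P)
        if slot in slot_to_disk:       # occupied slot found: this disk holds the block
            return probes, slot_to_disk[slot]

# Build a toy slot table: disk d claims int(w_d) randomly chosen slots.
P = 10000
weights = {0: 12.4, 1: 17.6, 2: 24.9}  # hypothetical disk weights
layout_rng = random.Random(42)
free_slots = list(range(P))
layout_rng.shuffle(free_slots)
slot_to_disk = {}
for disk_id, w in weights.items():
    for _ in range(int(w)):            # only the integer portion claims slots
        slot_to_disk[free_slots.pop()] = disk_id

print(probes_for_block(7, slot_to_disk, P))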
Fragment clustering appears to be superior to disk clustering
in terms of weight fragmentation and redundant block moves.
However, one drawback of fragment clustering is the ad-
ditional bookkeeping required for the ALD entry pointers
to physical disks. Moreover, within the ALDs and the PLD,
the fractional values must be stored with these pointers. An-
other drawback is that newly added disks may not be fully
utilized since their fractional weight portions are stored in
the PLD and not yet activated.
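One way to picture this bookkeeping (an illustrative sketch, not the data structure used by BroadScale): every ALD or PLD entry carries a pointer back to a physical disk together with the fractional weight it represents, and a PLD becomes an ALD only once it has accumulated enough fragments.

from dataclasses import dataclass, field
from typing import List

# Illustrative bookkeeping for fragment clustering (not the paper's actual
# structure): each logical-disk entry records which physical disk owns the
# fragment and how much fractional weight that fragment represents.
@dataclass
class FragmentEntry:
    disk_id: int       # pointer to the physical disk owning this fragment
    fraction: float    # fractional weight contributed by that disk

@dataclass
class LogicalDisk:
    entries: List[FragmentEntry] = field(default_factory=list)
    active: bool = False               # False while a PLD, True once an ALD

    def total_weight(self) -> float:
        return sum(entry.fraction for entry in self.entries)

# A PLD collecting fragments from three physical disks; once enough fractional
# weight has accumulated (here, one full slot's worth), it can be activated.
pld = LogicalDisk([FragmentEntry(0, 0.4), FragmentEntry(1, 0.35), FragmentEntry(2, 0.25)])
if pld.total_weight() >= 1.0:
    pld.active = True                  # the PLD is converted into an ALD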
8 Conclusions
BroadScale scales heterogeneous disks in a storage system.
Weights are assigned to disks depending on their bandwidth
and capacity characteristics, and blocks are distributed
among the disks in proportion to these weights. Since only
the integer portions of the weight values can be used to
direct block placement, the fractional portions are wasted.
However, these wasted portions, or weight fragments, can
be strategically combined using either our disk clustering
or fragment clustering approach. BroadScale satisfies
our scaling requirements: an even load according to disk
weights, a minimal amount of data movement when scaling
disks, and fast retrieval of data before and after scaling.
Figure 14. Average and maximum probes for bandwidth growth rates of 5%, 25%, and 45% (P = 10,000). (Figure 14a plots the average number of total probes and Figure 14b the maximum number of total probes against the number of disks.)
We have shown through experimentation that blocks are
distributed proportionally to the disk weights using BroadScale.
Disk scaling could lead to wasted disk weight (i.e.,
weight fragmentation), but this waste can be substantially
reduced through clustering. We observed significant improvement
when using larger cluster sizes in disk clustering. However, our
fragment clustering technique is superior in overall weight
fragmentation as well as in the average percentage of redundant
block moves, with a few minor drawbacks such as some
extra bookkeeping. Although fragment clustering outperforms
disk clustering, the additional block moves in either
case were not significant compared to the total moves.
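As a small worked illustration of the integer/fractional weight split summarized above (the weight values below are hypothetical), the integer part of each disk's weight can be read as the number of slots the disk claims directly, while the fractional remainder is what the clustering schemes pool:

import math

# Sketch of the integer/fractional weight split that BroadScale's placement
# relies on (illustrative values only): the integer portion directs block
# placement; the fractional portion is pooled by disk or fragment clustering.
def split_weights(weights):
    slots, fragments = {}, {}
    for disk_id, w in weights.items():
        slots[disk_id] = math.floor(w)            # integer portion -> slot count
        fragments[disk_id] = w - math.floor(w)    # fractional portion -> clustering
    return slots, fragments

slots, fragments = split_weights({0: 3.7, 1: 5.2, 2: 8.9})
print(slots)      # {0: 3, 1: 5, 2: 8}
print(fragments)  # {0: 0.7..., 1: 0.2..., 2: 0.9...} -> combined by clustering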
For future work, BroadScale can be extended to allow
scaling beyond a total of P disks by using our previous
algorithm, SCADDAR [6]. For heterogeneous scaling
with SCADDAR, we could use a similar weight function
and assign disk d to ⌊w_d⌋ slots.
We believe BroadScale can be generalized to map any
set of objects to a group of scalable storage units. These ob-
jects might also require a redistribution scheme to maintain
a balanced load. Examples of other applications include
Web proxy servers and extent-based file systems. Scala-
bility in integrated file systems that support heterogeneous
applications [17] may also benefit from BroadScale.
Finally, we wish to investigate how BroadScale could be
applied to storage systems that need to efficiently store a
high influx of data streams such as those generated by large-
scale sensor networks. We also want to explore data re-
trieval in large, scalable peer-to-peer systems or distributed
hash tables. This requires a distributed implementation of
BroadScale on top of these peer-to-peer search techniques.
References
[1] J. Byers, J. Considine, and M. Mitzenmacher. Simple Load
Balancing for Distributed Hash Tables. In Proceedings of the
2nd International Workshop on Peer-to-Peer Systems (IPTPS
’03), February 2003.
[2] C.-F. Chou, L. Golubchik, and J. C. S. Lui. Striping Doesn’t
Scale: How to Achieve Scalability for Continuous Media
Servers with Replication. In Proceedings of the Interna-
tional Conference on Distributed Computing Systems, pages
64–71, April 2000.
[3] A. Dan and D. Sitaram. An Online Video Placement Pol-
icy based on Bandwidth to Space Ratio (BSR). In Proceed-
ings of the ACM SIGMOD International Conference on Man-
agement of Data, pages 376–385, San Jose, California, May
1995.
[4] M. R. Garey and D. S. Johnson. Computers and Intractability:
A Guide to the Theory of NP-Completeness, chapter 6, pages
124–127. W. H. Freeman and Company, New York, 1979.
[5] S. Ghandeharizadeh and D. Kim. On-line Reorganization of
Data in Scalable Continuous Media Servers. In 7th International
Conference and Workshop on Database and Expert
Systems Applications (DEXA '96), September 1996.
[6] A. Goel, C. Shahabi, S.-Y. D. Yao, and R. Zimmermann.
SCADDAR: An Efficient Randomized Technique to Reorga-
nize Continuous Media Blocks. In Proceedings of the 18th
International Conference on Data Engineering, pages 473–
482, February 2002.
[7] J. Gray and P. Shenoy. Rules of Thumb in Data Engineering.
In Proceedings of the 16th International Conference on Data
Engineering, pages 3–10, February 2000.
[8] D. Karger, E. Lehman, T. Leighton, M. Levine, D. Lewin,
and R. Panigrahy. Consistent Hashing and Random Trees:
Distributed Caching Protocols for Relieving Hot Spots on the
World Wide Web. In Proceedings of the 29th ACM Sympo-
sium on Theory of Computing (STOC), pages 654–663, May
1997.
[9] D. E. Knuth. The Art of Computer Programming, volume 3.
Addison-Wesley, 1998.
[10] C. Martin, P. S. Narayan, B. Özden, R. Rastogi, and A. Silberschatz.
The Fellini Multimedia Storage Server. In S. M.
Chung, editor, Multimedia Information Storage and Management,
chapter 5. Kluwer Academic Publishers, Boston,
August 1996. ISBN: 0-7923-9764-9.
[11] R. Muntz, J. Santos, and S. Berson. RIO: A Real-time Multi-
media Object Server. In ACM Sigmetrics Performance Eval-
uation Review, volume 25, September 1997.
[12] S. K. Park and K. W. Miller. Random Number Generators:
Good Ones Are Hard to Find. Communications of the ACM,
31(10):1192–1201, October 1988.
[13] K. W. Ross. Hash-Routing for Collections of Shared Web
Caches. IEEE Network Magazine, 11(6):37–44, Novem-
ber/December 1997.
[14] J. R. Santos and R. R. Muntz. Performance Analysis of the
RIO Multimedia Storage System with Heterogeneous Disk
Configurations. In ACM Multimedia, pages 303–308, Bris-
tol, UK, September 1998.
[15] J. R. Santos, R. R. Muntz, and B. Ribeiro-Neto. Comparing
Random Data Allocation and Data Striping in Multimedia
Servers. In SIGMETRICS, Santa Clara, California, June 17-
21 2000.
[16] C. Shahabi, R. Zimmermann, K. Fu, and S.-Y. D. Yao. Yima:
A Second Generation Continuous Media Server. IEEE Com-
puter, pages 56–64, June 2002.
[17] P. Shenoy, P. Goyal, and H. M. Vin. Architectural Considera-
tions for Next Generation File Systems. Multimedia Systems,
8(4):270–283, 2002.
[18] P. Shenoy and H. M. Vin. Efficient Striping Techniques for
Variable Bit Rate Continuous Media File Servers. Perfor-
mance Evaluation Journal, 38(3), December 1999.
[19] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Bal-
akrishnan. Chord: A Scalable Peer-to-peer Lookup Service
for Internet Applications. In Proceedings of the 2001 ACM
SIGCOMM Conference, pages 149–160, May 2001.
[20] D. G. Thaler and C. V. Ravishankar. Using Name-Based
Mappings to Increase Hit Rates. IEEE/ACM Transactions
on Networking, 6(1):1–14, February 1998.
[21] Y. Wang and D. H. C. Du. Weighted Striping in Multimedia
Servers. In Proceedings of the IEEE International Confer-
ence on Multimedia Computing and Systems (ICMCS ’97),
pages 102–109, June 1997.
[22] S.-Y. D. Yao, C. Shahabi, and P.-Å. Larson. Disk Labeling
Techniques: Hash-Based Approaches to Disk Scaling.
Technical Report, University of Southern California, 2003.
ftp://ftp.usc.edu/pub/csinfo/tech-reports/papers/03-785.pdf.
[23] R. Zimmermann. Continuous Media Placement and
Scheduling in Heterogeneous Disk Storage Systems. Ph.D.
Dissertation, University of Southern California, Los Ange-
les, California, December 1998.