USC Computer Science Technical Reports, no. 785 (2003)
Disk Labeling Techniques: Hash-Based Approaches to Disk Scaling*

Shu-Yuen Didi Yao, Cyrus Shahabi
University of Southern California, Computer Science Department, Los Angeles, CA 90089
{didiyao, shahabi}@usc.edu

Per-Åke Larson
Microsoft Corporation, One Microsoft Way, Redmond, WA 98052
palarson@microsoft.com

Abstract

Scalable storage architectures allow for the addition or removal of disks to increase storage capacity and bandwidth or to retire older disks. Assuming random placement of data blocks across multiple disks of a disk array, our optimization objective is to redistribute a minimum number of blocks after disk scaling. In addition, a uniform distribution, and hence a balanced load, should be ensured after redistribution. Moreover, the redistributed blocks should be retrieved efficiently during the normal mode of operation: in one disk access and with low-complexity computation. To achieve this, we propose an algorithm called Random Disk Labeling (RDL), based on double hashing, where disks can be added or removed without any increase in complexity. We compare RDL with other proposed techniques and demonstrate its effectiveness through experimentation.

1 Introduction

Computer applications typically require ever-increasing storage capacity to meet the demands of their expanding data sets. Because storage requirements oftentimes exhibit varying growth rates, current storage systems may not reserve a great amount of excess space for future growth. Also, large up-front costs should not be incurred for a storage system that might only be fully utilized in the distant future. A storage system that accommodates incremental growth would have major cost benefits. Incremental growth translates into a highly scalable storage system where the amount of overall disk storage space and aggregate bandwidth can expand according to the growth rate of the content.

Our technique to achieve a highly scalable storage system begins with the placement of data on storage devices such as magnetic disk drives.
More specifically, we break data objects into individual blocks and apply a random placement of these blocks across a group of disks. The goal here is that the block placement allows the disks to be load balanced, so that their aggregate disk bandwidth can be maximized for accessing large file objects even after more disks are added to the storage system.

* This research has been funded in part by NSF grants EEC-9529152 (IMSC ERC) and IIS-0082826, and unrestricted cash gifts from NCR, Microsoft and the Okawa Foundation.

1.1 Assumed architecture

In this paper, our proposed scalable storage algorithm can be generalized to mapping any set of objects to a group of storage units. Furthermore, the objects are striped independently of each other across all of the storage units for load-balancing purposes; that is, any block can be accessed with almost equal probability. This group of storage units has the quality that more units can be either added or removed, in which case the striped objects need to be redistributed to maintain a balanced load. Given these generalizations, our algorithm can be applied to three main categories which share a common architecture, as shown in Figure 1.

Figure 1: Common architecture of CM servers, file systems, and Web proxy servers: a CM client, file retrieval thread, or Web client makes random accesses to CM objects, blocks/extents, or Web objects through a CM server, file system, or Web proxy manager, backed by logical storage units (each a single magnetic disk, RAID device, or proxy server).

The first category is continuous media (CM) servers, where object blocks of large CM files (e.g., video or audio) are striped across the disk array of a CM server [15]. Each fixed-size block is placed in its entirety on separate disks using a striping scheme. This differs from traditional designs such as RAID, where each block is declustered across all disks of the disk array. That design would not scale in throughput when increasing the number of disks [4].
In our scenario, each disk represents a logical storage unit and can potentially be a RAID device itself. The block placement allows load balancing of the CM storage system, where the aggregated capacity and bandwidth are achieved when accessing CM files. Disks can be added to the storage system to increase overall capacity, or removed due to failure or space conservation.

The second category is file systems, where large file objects are stored on a disk array. More specifically, the blocks of the file are striped across the disks. For extent-based file systems, each block may be as large as a disk extent, so retrieving a block would only require an extent retrieval. No additional I/O overhead is imposed in this case. Again, the objective is to aggregate disk capacity and bandwidth. From the file system perspective, and in the presence of multiple applications, the probability of each block/extent access is almost equal. This also differs from RAID designs, which decluster each block across the disks.

The third category is Web proxy servers, where Web objects are each mapped to a server within a group of Web proxy servers. Here, Web objects and proxy servers are analogous to file blocks and disks, respectively. Proxy servers may intermittently go on- or off-line. Web objects need to be redistributed to on-line proxy servers to ensure load balancing across proxies and object availability.

1.2 Problem statement

Pseudo-random placement: File objects are split into fixed-size blocks and distributed over a group of homogeneous disks such that each disk carries an approximately equal load. We use pseudo-randomized placement of file object blocks so that a block has roughly equal probabilities of residing on each disk. With pseudo-random distribution, blocks are placed onto disks in a random, but reproducible, sequence. We will show in Section 7 that load balancing is achieved through a uniform distribution.

The placement of a block is determined by its signature X, which is simply an unsigned integer.
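As a purely illustrative sketch of such a signature scheme (the paper specifies only that the seed is derived from the file object's name plus the block index i; the name-to-integer conversion and the choice of generator below are assumptions):

```python
import random

def block_signature(filename: str, i: int) -> int:
    """Signature X of block i of a file object: seed a repeatable
    pseudo-random generator from the file name plus the block index."""
    seed = sum(ord(c) for c in filename) + i  # assumed name-to-int scheme
    rng = random.Random(seed)                 # p_r: repeatable for a given seed
    return rng.randrange(2**32)               # an unsigned-integer signature

# Repeatability: the same (file, block) pair always yields the same signature.
assert block_signature("movie.mpg", 0) == block_signature("movie.mpg", 0)
```

Because the generator is reseeded from the same inputs at every lookup, the signature can be regenerated on demand instead of being stored in a directory.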
We use a pseudo-random number generator p_r to compute the signature of a block of an object. p_r must produce repeatable sequences for a given seed. We compute the seed from the name of the file object plus i, and use it to initialize the pseudo-random number generator to compute the signature of block i. Several placement algorithms will be described in subsequent sections.

Scaling operation: We use the notion of a disk group as a group of n disks that is added or removed during a scaling operation. Without loss of generality, a scaling operation on a file server with D disks either adds or removes one disk group.

Scaling up will increase the total number of disks and will require a fraction, n/(D + n), of all blocks to be moved onto the added disks in order to maintain load balancing across disks. Likewise, when scaling down, all blocks on a removed disk should be randomly distributed across the remaining disks to maintain load balancing. The number of block movements just described is the minimum needed to maintain an even load.

As disks are added and removed in this way, the location of a block may change. The problem, of course, is to come up with an algorithm that quickly computes the current location of a block, regardless of how many scaling operations have been performed, while at the same time ensuring an even load on the disks and movement of a minimal number of blocks during a scaling operation. We state the requirements more clearly as follows.

Requirement 1 (even load): If there are B blocks stored on D disks, maintain the load so that the expected number of blocks on each disk is approximately B/D.

Requirement 2 (minimal data movement): During the addition of n disks on a system with D disks storing B blocks, the expected number of blocks to move is B × n/(D + n). During the removal of n disks, B × n/D blocks are expected to move.
Requirement 3 (fast access): The location of a block is computed by an algorithm with space and time complexity of at most O(D) and requiring no disk I/O. Furthermore, the algorithm is independent of the number of scaling operations.

We propose Sequential Disk Labeling (SDL) and Random Disk Labeling (RDL), two variations of a hash-based approach, for frequent disk scaling such that the block striping scheme is maintained before and after each scaling operation.

The remainder of this paper is organized as follows. Section 2 describes related work. Section 3 gives background on two hashing techniques we use to solve our problem. In Section 4, we propose a hash-based solution called SDL. In Section 5, we propose a variation of SDL called RDL that improves on the non-uniformity of SDL. Then, in Section 6, we propose a random probing technique that allows the degree of uniformity to be adjusted. In Section 7, we observe that RDL outperforms SDL in the uniformity of block distribution and in total probes. Finally, Section 9 concludes this paper and presents future research directions.

2 Related work

Several naive approaches to block redistribution during disk scaling have been discussed in [5]. A directory structure could be used as a simple bookkeeping solution where block addresses are stored. However, frequent scaling operations would cause frequent updates to a table containing on the order of millions of records. Accessing this directory may also require disk I/O. Also, a distributed server architecture might require distributing the directories, resulting in a need to keep these frequently updated, large data structures consistent.

Another naive approach is a complete redistribution of data blocks after every scaling operation, which would result in an even and random distribution. This is clearly unrealistic because of the large number of block moves.

To analyze the problem of block reorganization during scaling, we draw an analogy to hash tables.
The goals of a hash table are to evenly distribute keys into buckets and to quickly access these keys. These goals are also desirable properties when storing blocks on a set of disks. Moreover, disks can be treated as hash buckets while blocks are similar to keys. Collisions are also similar in that more disks need to be added to handle disk overflows, just as the number of buckets needs to be increased during bucket overflows.

Several dynamic hashing techniques could be applied, but they all have drawbacks. With extendible hashing, an overflow event causes bucket splits which might double the number of buckets. We cannot restrict disk scaling operations to double the number of disks for every disk overflow. Also, linear hashing [8] does not address load balancing of keys in buckets after bucket splits.

In Section 3, we describe two hashing techniques which we later show to be quite adaptable to our problem of block reorganization during disk scaling.

We now describe related work on the three categories of applications where we can apply our scaling algorithm. The redistribution of randomly placed data has been considered under the CM server and proxy server categories, but no prior work on such techniques, to our knowledge, has appeared for file systems.

CM servers have been the focus of several past studies. Data placement and retrieval scheduling specifically for CM objects are described in [9, 15]. One study has addressed the redistribution of data blocks after disk scaling with round-robin data striping [3]. Inherently, such a technique requires that almost all blocks be relocated when adding or removing disks. The overhead of such block movement may be amortized over a certain amount of time, but it is, nevertheless, significant and wasteful. Traditional constrained placement techniques such as round-robin data placement allow for deterministic service guarantees, while random placement techniques are modeled statistically.
The RIO project demonstrated the advantages of random data placement, such as single access patterns and asynchronous access cycles to reduce disk idleness [1, 11, 13]. However, they did not consider the rearrangement of data due to disk scaling.

In general, random placement increases the flexibility to support various applications while maintaining competitive performance [14]. We assume a slight variation of random placement, pseudo-random placement, in order to locate a block quickly at retrieval time, without the overhead of maintaining a directory. This is achieved by the fact that we can regenerate the sequence of numbers, each one a block signature, via a pseudo-random generator function when we use the original seed.

We developed a technique called SCADDAR to enable the redistribution of data blocks after disk scaling in a CM server by mapping the block signatures to a new set of signatures, resulting in an even, randomized distribution [5]. SCADDAR adheres to the requirements of Section 1.2 except that computation of block locations becomes incrementally more expensive. Finding a block's location requires the computation of that block's location for every past scaling operation, so a history log of operations needs to be maintained. In other words, the number of computations is equal to the number of scaling operations. The application domain here is for infrequent scaling operations to be performed while the server is online. Extremely frequent operations will cause the performance of block accesses to suffer since the computation of block locations accumulates with every operation. The history log can be reset by a complete block reorganization.

Several past works have considered mapping Web objects to proxy servers using requirements similar to those described in Section 1.2. Below we describe two relevant techniques, called Highest Random Weight (HRW) and consistent hashing, along with their drawbacks. HRW is a technique developed to map Web objects to a group of proxy servers [17].
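The HRW mapping (each object goes to the server with the highest object/server weight) can be sketched as follows; the hash-based weight function here is an assumption for illustration, not the one defined in [17]:

```python
import hashlib

def hrw_server(obj_name: str, servers: list) -> str:
    """Map an object to the highest-weighted server (HRW sketch)."""
    def weight(server: str) -> int:
        # Assumed weight function: hash the (object, server) pair.
        digest = hashlib.md5(f"{obj_name}:{server}".encode()).hexdigest()
        return int(digest, 16)
    return max(servers, key=weight)

servers = ["proxy0", "proxy1", "proxy2", "proxy3"]
winner = hrw_server("index.html", servers)
assert winner in servers

# Minimal disruption: removing a *losing* server never remaps the object,
# since the maximum over any subset containing the winner is unchanged.
losers = [s for s in servers if s != winner]
assert hrw_server("index.html", [winner] + losers[:2]) == winner
```

Note that every lookup evaluates the weight function once per server, which is the source of the B × D cost during redistribution discussed below.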
Using the object name and the server names, each server is assigned a random weight. The object is then mapped to the highest-weighted server. After adding or removing servers, objects must be moved if they are no longer on the highest-weighted server. The drawback here is that the redistribution of objects after server scaling requires B × D random-weight function calls, where B is the total number of objects and D is the total number of proxy servers. We show in Section 7 that in some cases HRW is several orders of magnitude slower than our proposed RDL technique. An optimization technique for HRW involves storing the random weights in a directory, but the directory size will increase as B and D increase, causing the algorithm to become impractical. In addition, this optimization has the drawbacks of a directory-based approach discussed at the beginning of this section.

Consistent hashing is another technique used to map Web objects to proxy servers [6]. Here, objects are only moved from two old servers to the newly added server. A variant of consistent hashing used in a peer-to-peer lookup server, Chord, only moves objects from one old server to the new server [16]. In both cases, the result is that objects may not be uniformly distributed across the servers after server scaling, since objects are not moved from all old servers to the new server. With Chord, a uniform distribution can be achieved by using virtual servers, but this requires a considerable amount of routing meta-data [2].

3 Background: double hashing and random probing

In this section, we briefly describe two hashing techniques used as part of our solution: double hashing and random probing.

Double hashing (i.e., open addressing with double hashing) scans for available buckets when resolving collisions [7]. When inserting key k, double hashing uses two hash functions, h_1() and h_2(), to determine a probe sequence.
h_1(k) determines a bucket address in the range 0, ..., P−1, where P is the total number of buckets. This is the initial bucket to be probed. If this bucket is full, then we need another hash function, h_2(k), to resolve the collision. h_2(k) produces a value in the range 1, ..., P−1, which is the number of buckets that are skipped for all subsequent probes. If this value is relatively prime to P, then after P probes, every bucket will be probed exactly once.

Random probing uses an infinite sequence of independent hash functions to handle collisions [10]. Each hash function calculates an address for a key in the range 0, ..., P−1, so the first hash function computes the first random address to be probed. If a collision occurs with the first address, then the second hash function computes the second random address to be probed, and so on. The first available address in this sequence becomes that key's address.

In the following sections, we will apply double hashing and random probing to handle collisions for our block placement algorithms.

4 Sequential disk labeling (SDL) algorithm

We adapt the double hashing technique to our problem of efficient redistribution of data blocks during disk scaling. Generally speaking, double hashing applies to hash tables where keys are inserted into buckets. We view this hash table as an address space, that is, a memory-resident index table used to store a collection of slots. Each slot can either be assigned a disk or be empty. Some slots are left empty to allow room to add new disks. We can think of block IDs as keys and slots as buckets.

The main difference between a hash table and our address space is the method in which collisions are handled. In double hashing, a collision occurs when a full bucket is probed, resulting in the probing of other buckets until an available bucket is found. With our address space, a collision occurs when an empty slot is probed. When this happens, other slots are probed until a slot with a disk is probed.
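The permutation property that both variants rely on — with P prime, any step in 1, ..., P−1 is relatively prime to P, so P probes visit every slot exactly once — can be checked with a small sketch (Python is used here purely for illustration):

```python
def probe_sequence(h1: int, h2: int, P: int) -> list:
    """Open addressing with double hashing: the slot visited at probe s
    is (h1 + s*h2) mod P, for s = 0, 1, ..., P-1."""
    return [(h1 + s * h2) % P for s in range(P)]

P = 101  # prime, so every step length 1..P-1 is relatively prime to P
for step in (1, 20, 76, 100):
    seq = probe_sequence(46, step, P)
    assert sorted(seq) == list(range(P))  # every slot probed exactly once
```

If P were composite, a step sharing a factor with P would revisit a strict subset of the slots, which is why the address space is sized to a prime.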
In both cases, we use the same probing sequence to decide which buckets (disks) to probe.

We now examine what happens when a disk is added to an empty slot in our address space. When new blocks are added, collisions will not occur here if this slot is probed. However, every block that previously collided with this slot must be moved here so that searching for these blocks succeeds. Equivalently, with a hash table, collisions would no longer occur on a full bucket if all its keys were suddenly removed. But if this happens, then the keys which previously collided at this bucket need to be moved here from other buckets until it is filled. If these keys are not moved here, a search for them will not be successful. We can anticipate when the current disks will overflow with blocks and resolve this by adding disks to slots in our address space at any time.

We design our address space with P slots (labeled 0, ..., P−1) and D disks, where P is a prime number, D is the current number of disks, and D ≤ P. For this approach, we have D disks that occupy slots 0, ..., D−1. We can simply think of D disks which are labeled 0 through D−1, but we use the concept of disks occupying slots to help visualize our algorithm. We call this the Sequential Disk Labeling (SDL) algorithm.

As explained in Section 1.2, each block has a signature, X, generated by a pseudo-random number generator function, p_r1. To determine the initial placement of blocks, we use a block's signature, X, as the seed to a second function, p_r2, to compute two additional random numbers: the start position, sp, and the step length, sl, for each block. Because some slots contain disks and some do not, we want to probe slots until a disk is found. The sp value, in the range 0, ..., P−1, indicates the first slot to be probed. The sl value, in the range 1, ..., P−1, is the slot distance between the current slot and the next slot to be probed. sl should never be 0, to avoid repeated probes into the same slot.
We probe by the same amount, sl, in order to guarantee that we search all slots in at most P probes. The first slot in the probe sequence that contains a disk becomes the address for that block. Thus, sp and sl combine to make up the probe sequence, where p(s) is the slot address at the s-th probe iteration, as defined in Eq. 1.

p(s) = (sp + s × sl) mod P, where s = 0, 1, 2, ..., P−1   (1)

Figure 2: Placement of two blocks (D = 10, P = 101). Block 0 (sp = 3, sl = 76) initially hits; block 1 (sp = 46, sl = 20) initially misses and probes slots 46, 66, 86, 5.

Example 4.1: As shown in Figure 2, assume we have 10 disks and 101 slots (D = 10, P = 101). We want to compute the block signature X, sp, and sl for blocks 0 and 1. Using the blocks' filename (converted to an integer) as the seed, we compute the block signatures, where X = p_r1(filename) = 5749 for block 0. We use 5749 as the seed to p_r2 to compute block 0's probe sequence, where sp = p_r2(5749) mod P = 3 and sl = p_r2(5749) mod (P−1) + 1 = 76. Recall that sl is in the range 1, ..., P−1, which is why 1 is added in the computation of sl. Slot 3 holds a disk, so we have found the address for block 0. The next call to p_r1 gives X = 29934 for block 1. Similarly, we find sp = 46 and sl = 20. Slot 46 does not contain a disk, so we traverse block 1's probe sequence, probing by 20 slots, until we arrive at slot 5.

In general, finding disks for any given block may require multiple wrap-arounds. Because we set P to a prime number, the probe sequences are guaranteed to be a permutation of 0, 1, ..., P−1; that is, every slot is probed exactly once in the worst case. As long as P is relatively prime to sl, this holds true [7].

Next, we want to be able to add or remove disks in our address space. A scaling operation is the addition or removal of n disks. For block i, d_{i,j−1} is its address after scaling operation j−1 and d_{i,j} is its address after operation j. d_{i,j−1} may or may not equal d_{i,j}, depending on whether block i moves after operation j.
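The probe-sequence lookup of Eq. 1 can be reproduced with a short sketch (Python here is an assumption; the paper's own implementation is not shown). The asserts replay Example 4.1 above and the disk addition of Example 4.2 below:

```python
def sdl_address(sp: int, sl: int, occupied: set, P: int) -> int:
    """Return the first slot of the probe sequence p(s) = (sp + s*sl) mod P
    (Eq. 1) that holds a disk."""
    for s in range(P):
        slot = (sp + s * sl) % P
        if slot in occupied:
            return slot
    raise ValueError("no disk in any slot")  # unreachable if any disk exists

# Example 4.1: D = 10 disks occupying slots 0..9, P = 101 slots.
P, occupied = 101, set(range(10))
assert sdl_address(3, 76, occupied, P) == 3    # block 0: initial hit
assert sdl_address(46, 20, occupied, P) == 5   # block 1: 46 -> 66 -> 86 -> 5

# Example 4.2 (below): a block with sp = 63, sl = 24 resides on disk 5;
# adding a disk to slot 10 moves it there, because slot 10 now appears
# earlier in the block's (unchanged) probe sequence.
assert sdl_address(63, 24, occupied, P) == 5
occupied.add(10)
assert sdl_address(63, 24, occupied, P) == 10
```

The probe sequence itself never changes across scaling operations; only the slot at which it first hits a disk does.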
For addition operations, the n disks are added sequentially to slots D, ..., (D + n − 1), as long as (D + n) ≤ P. Without loss of generality, and to illustrate the concept of probing for disk removals, we assume disks are always removed from slots (D − 1 − n), ..., (D − 1).

To perform an addition operation, first we add disks to the appropriate slots. Then we consider each block in sequence (0, ..., B−1) and, without actually accessing the blocks, we compute X, sp, sl, and d_{i,j} for block i. If d_{i,j} is an old disk (d_{i,j} = d_{i,j−1}), then the block must already lie on this disk, so no moving is necessary. Clearly, in this case, the probe length remains the same as before. However, if d_{i,j} is a new disk (d_{i,j} ≠ d_{i,j−1}), then this should be the new location for the block. We continue with the probe sequence to find d_{i,j−1}, which is the current address, and move the block to d_{i,j}. In this case, the probe length becomes shorter since, before this add operation, the slot d_{i,j} was also probed but was empty, so probing continued.

Figure 3: Probe sequence of block i (sp = 63, sl = 24) before and after a disk add operation j. Block i moves from disk 5 to disk 10 after disk 10 is added (D = 10, P = 101).

Example 4.2: Figure 3 shows an example of adding a new disk to slot 10. Before scaling operation j, block i's probe sequence is sp, (sp + sl) mod P, (sp + 2×sl) mod P, ..., (sp + 6×sl) mod P. Here, d_{i,j−1} = (sp + 6×sl) mod P. For this example, sp = 63 and sl = 24, so the probe sequence is 63, 87, 10, 34, 58, 82, 5, and block i belongs to disk 5. After scaling operation j, a disk is added to slot 10 and block i moves from disk 5 to disk 10, since slot 10 appears earlier in the probe sequence. The resulting probe sequence is 63, 87, 10, and d_{i,j} = (sp + 2×sl) mod P.

For removal operations, we first mark the disks which will be removed.
Then, for each block stored on these disks, we continue with the probe sequence until we hit an unmarked disk, d_{i,j}, to which we move the block. The probe length is now longer (but no longer than P trials) to find the new location. This can be illustrated as the reverse of Example 4.2.

In all cases of operations, the probe sequence of each block stays the same. It is the probe length that changes, depending on whether or not the block moves after a scaling operation. So the scaling operation and the existence of disks dictate where along the probe sequence a block will reside. After any scaling operation, the block distribution will be identical to what the distribution would have been if the disks were initially placed that way.

We were surprised to find that the SDL algorithm does not lead to an even distribution of blocks across the disks. As shown in the next section, the distribution appears "bowl" shaped, where the first few and last few disks contain more blocks than the center disks. We then provide a superior algorithm, called Random Disk Labeling, in Section 5.

4.1 Non-uniformity of the sequential disk labeling algorithm

To understand why a bowl-shaped distribution is observed, we intuitively determine how likely blocks are to fall on certain disks. We examine four cases which, in sum, result in a bowl-shaped distribution of blocks. The bowl shape is more pronounced when D is much smaller than P, so we use this assumption in our analysis.

Figure 4: Three regions of the address space: Region A (slots 0, ..., D−1), Region C (the next (P−D) mod sl slots), and Region B (the remaining slots, up to P−1).

We split our address space into Regions A, B, and C in Figure 4. Region A contains slots 0, ..., D−1. Region B contains slots (D + (P−D) mod sl), ..., P−1. Region C contains slots D, ..., (D + (P−D) mod sl − 1). For Case 1, probing is successful on the first try, so sp is in Region A. The remaining cases involve an initial probe miss, where sp falls in Region B or C. The only difference among these cases is the range of sl. For Case 2, 1 ≤ sl ≤ D. For Case 3, (P−D) ≤ sl ≤ (P−1).
And for Case 4, D < sl < (P−D).

Case 1: If sp falls in Region A, each disk has an equal chance of receiving the block. These blocks are uniformly distributed among the disks. The value of sl is irrelevant since no hopping is required. This case does not contribute to the bowl-shaped distribution.

Figure 5: Three regions of the address space with Region B reduced (Region B_red, of length one sl).

Case 2: If sp falls in Region B or C and 1 ≤ sl ≤ D, then the distribution will be skewed to the left, creating the left edge of the bowl. To illustrate, we first reduce Region B in Figure 4 to the length of one sl and call it Region B_red, as in Figure 5. This is possible since there is an equal probability that sp will land in any slot in Region B and will eventually hop, by length sl, to the rightmost sl group.

Now, if sp falls in Region B_red, it is clear that the block will land among slots 0, ..., sl−1 after only one hop of sl. This hop will wrap around from the end of the table to the beginning. Moreover, since sl ≤ D, the block will definitely hit a disk after only one wrap-around. To compute the probability of a block landing on a specific disk, we look at different sizes of sl. If sl = 1, then the block will hit Disk 0 with 100% probability. If sl = 2, then there is a 50% chance of hitting Disk 0 and a 50% chance of hitting Disk 1. If sl = 3, then 33% each on Disks 0, 1, and 2. In general, we can find the probability that a block initially landing in Region B_red hits any given disk using Equation 2, where Pr_d is the probability of hitting disk d among D total disks.

Pr_d = (1/D) Σ_{i=d+1}^{D} (1/i)   (2)

Equation 2 shows that with smaller d values, a block has a higher probability of landing on Disk d. Since the disks with smaller d values are located towards the left of Region A, more blocks are assigned to these disks. Thus the left edge of the bowl is formed.

Since Region C is always smaller than sl, and usually much smaller than Region B, few blocks will initially hit here.
The contribution of blocks landing in this region does not greatly affect the bowl distribution, so we ignore this case.

Case 3: If sp falls in Region B or C of Figure 4 and (P−D) ≤ sl ≤ (P−1), then the distribution will be skewed to the right, causing the right edge of the bowl. Because sl ≥ (P−D), one hop of length sl wraps around the table, and the new slot will be at most D slots to the left of the original slot. Essentially, in this case we are hopping to the left by D or fewer slots each time. This behavior is the mirror image of Case 2, and the probability of hitting any disk in this case can be computed using Equation 3.

Pr_d = (1/D) Σ_{i=P−d}^{D} (1/i)   (3)

Similar to Equation 2, Equation 3 shows that with larger d values, a block has a higher probability of landing on Disk d. The right edge of the bowl is formed since disks with larger d values lie towards the right of Region A.

Case 4: If sp falls in Region B or C and D < sl < (P−D), then the distribution will have a slight, shallow bowl effect. It appears that this case can be further broken into sub-cases where left- or right-edge contributions can be isolated. However, we have shown that Cases 2 and 3 already account for the major contributions to the left and right edges. Therefore, we do not further investigate this case.

The total probability of a block landing on a disk is the sum of the three cases above. Initial misses — blocks whose sp falls in Region B or C — cause the bowl-shaped distribution. Figure 6 shows the distribution of blocks resulting from the SDL algorithm with D = 100, P = 10,007, and approximately 7.5 million blocks.

In Figure 7, we observe the distribution when we add disks from 1,000 to 10,000 in 1,000-disk increments. The y-axis shows the number of blocks that are loaded on the first 10% of the disks, the second 10% of the disks, and so on. The distribution becomes more uniform as we increase the number of disks because, as D approaches P, the probability that sp hits a disk, D/P, approaches 1.
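The left-edge skew predicted by Equation 2 can be checked numerically with exact rational arithmetic (a sketch, assuming sl drawn uniformly from 1, ..., D): the per-disk probabilities form a proper distribution and decrease strictly with d.

```python
from fractions import Fraction

def pr_hit(d: int, D: int) -> Fraction:
    """Eq. 2: probability that a block entering Region B_red
    (sl uniform on 1..D) lands on disk d."""
    return Fraction(1, D) * sum(Fraction(1, i) for i in range(d + 1, D + 1))

D = 100
probs = [pr_hit(d, D) for d in range(D)]
assert sum(probs) == 1                                     # probabilities sum to 1
assert all(probs[d] > probs[d + 1] for d in range(D - 1))  # skewed toward disk 0
```

The strict decrease in Pr_d with d is exactly the left edge of the bowl; by the mirror-image argument of Case 3, the right edge behaves symmetrically.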
However, we want to initially set D to be less than P to allow for disk additions.

Figure 6: Number of blocks on disks using SDL (D = 100, P = 10,007).

Figure 7: Normalized distribution as the number of disks increases (P = 10,007).

5 Random disk labeling (RDL) algorithm

Our Random Disk Labeling (RDL) algorithm is similar to the SDL algorithm except that we use a random allocation of disks: we randomly place the D disks among the P slots, instead of placing them sequentially in slots 0, ..., D−1 as with SDL. Likewise, during disk scaling, we add a disk to a randomly chosen empty slot, or remove a randomly selected disk, in order to maintain the overall random allocation scheme. Essentially, we use double hashing on disks labeled with random slots in the range 0, ..., P−1.

The RDL algorithm results in a much more even distribution of blocks. Figure 9 shows that each disk has an approximately equal number of blocks, where D = 100 and P = 10,007. The x-axis indicates the disk number, ranging from the 0-th disk to the 99-th disk. Although each disk is assigned to a random slot (e.g., Disk 0 in Slot 423, Disk 1 in Slot 29, etc.), we only show the disk number in Figure 9. Because the disks are not sequentially clustered together, the placement of blocks will not favor any particular disk. Here, by using a fixed step length, sl, to probe along disks that are randomly spaced apart, we are in essence probing sequentially arranged disks using a sequence of random sl's. Probing by random sl's would not cause any disks to be favored and thus leads to an even distribution. Here, sp still has a D/P chance of hitting any disk, but on initial sp misses, probing by a fixed sl will not contribute to a bowl-shaped distribution, because of the random labeling. Also, termination of probing when finding a block is still guaranteed within at most P probes, since probe sequences are still permutations of 0, 1, ..., P−1. Later, in Section 7, we give a direct comparison of the improvement of RDL over SDL.

Figure 8: Number of blocks on disks using SDL-RP and RDL-RP. (a) SDL-RP (K = 285, P = 10,007); (b) RDL-RP (K = 150, P = 1,009).

Figure 9: Number of blocks on disks using RDL (D = 100, P = 10,007).

6 Random probing

We applied double hashing to sequentially labeled disks (SDL) and randomly labeled disks (RDL). We now show that random probing can further improve the uniformity of the block distribution. In this section, we present a two-phase algorithm consisting of a random probing phase (Phase 1) and a double hashing phase (Phase 2). Phase 2 is initiated only if Phase 1 fails to find a non-empty slot within K probes. K is the maximum number of probes that Phase 1 performs before switching to Phase 2. This two-phase algorithm can be applied to disks which are either sequentially labeled or randomly labeled, which we refer to as SDL-RP and RDL-RP, respectively, for the remainder of this paper.

To find a block's address using random probing, we repeatedly call a pseudo-random number generator, with block i's signature as the seed, to produce a sequence of values in the range 0, ..., P−1. The first slot in this probe sequence that is occupied by a disk becomes the address for block i. We are, of course, only probing the slots of our memory-resident address space.
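An RDL-RP lookup, combining the random labeling of Section 5 with the two-phase probing just described, might look as follows (a sketch, assuming Python's seeded standard generator stands in for the paper's pseudo-random functions; Phase 2 terminates within P probes because P is prime):

```python
import random

def rdl_layout(D: int, P: int, seed: int = 42) -> set:
    """RDL: place D disks in D randomly chosen slots out of P."""
    return set(random.Random(seed).sample(range(P), D))

def rdl_rp_address(signature: int, occupied: set, P: int, K: int) -> int:
    rng = random.Random(signature)     # repeatable probe sequence per block
    for _ in range(K):                 # Phase 1: up to K random probes
        slot = rng.randrange(P)
        if slot in occupied:
            return slot
    sp = rng.randrange(P)              # Phase 2: double hashing fallback
    sl = rng.randrange(1, P)           # 1..P-1, relatively prime to prime P
    for s in range(P):
        slot = (sp + s * sl) % P
        if slot in occupied:
            return slot

occupied = rdl_layout(D=100, P=1009)   # P = 1,009 as in Figure 8b
assert len(occupied) == 100
assert rdl_rp_address(5749, occupied, P=1009, K=150) in occupied
```

Setting K = 0 degenerates to plain RDL; larger K trades extra probes for better uniformity, which is the tuning question addressed next.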
Each probe simply involves selecting a random value, sp, in the range 0, ..., P−1 using a pseudo-random number generator, until a disk is hit. Intuitively, it is clear that this leads to a uniform distribution of blocks, assuming a well-performing pseudo-random generator. However, there is no guarantee of termination: the probability of continuously hitting empty slots exists, though it is very small. Our two-phase algorithm solves this. In Phase 1, we perform a maximum of K random probing trials. If no disk is hit during Phase 1, we enter Phase 2 and perform double hashing, where termination within P trials is again guaranteed.

We show the number of blocks on 100 disks using this approach in Figure 8. With SDL-RP using a maximum of K = 285 random probing trials, we see in Figure 8a that a much more even distribution is achieved compared to SDL. Even though RDL produces a more even distribution than SDL, it can still benefit from random probing. One such case is shown in Figure 8b, where D = 100 and P = 1,009. With RDL-RP, using a maximum of K = 150 random probes, the level of uniformity improves. Since RDL in general shows better uniformity than SDL (Figures 9 and 6, respectively), RDL-RP requires fewer random probes than SDL-RP to achieve similar levels of uniformity. In Section 7, we compare the levels of uniformity achieved by SDL-RP and RDL-RP and show that RDL-RP never requires more random probes than SDL-RP. Even though Phase 2 of SDL-RP still introduces a small amount of the bowl effect, Figure 8a shows that the combination of the two phases virtually eliminates it. Using K random probes in either SDL-RP or RDL-RP, we can now find a block in at most K + P trials.

A question that remains is when to switch from Phase 1 to Phase 2 (i.e., what should the value of K be)? If we know the desired level of uniformity, then we can calculate K for SDL-RP using a closed-form formula, given in Appendix A.
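The two-phase lookup can be sketched as follows. This is a minimal illustration rather than the authors' implementation: using Python's `random.Random` seeded by the block's signature, and deriving the Phase 2 start slot and step length from the same generator, are assumptions. It relies only on P being prime (as the paper's choices of P, e.g. 1,009 and 10,007, suggest) so that every double-hashing probe sequence is a permutation of the slots.

```python
import random

def find_block_slot(signature: int, occupied: set[int], P: int, K: int) -> int:
    """Two-phase lookup: K random probes (Phase 1), then double hashing (Phase 2).

    `occupied` is the memory-resident set of slots that currently hold disks.
    P must be prime so the Phase 2 probe sequence visits every slot once.
    """
    rng = random.Random(signature)      # seeded by the block's signature
    # Phase 1: up to K random probes over the slot space.
    for _ in range(K):
        sp = rng.randrange(P)
        if sp in occupied:
            return sp
    # Phase 2: double hashing with start slot sp and step length sl
    # (derived from the same seeded generator -- an assumed derivation).
    sp = rng.randrange(P)
    sl = 1 + rng.randrange(P - 1)       # step length in 1..P-1
    for i in range(P):
        slot = (sp + i * sl) % P
        if slot in occupied:
            return slot
    raise RuntimeError("no disks present")
```

With K = 0 the function degenerates to plain double hashing (SDL or RDL, depending on how the occupied slots were chosen), and the total probe count is bounded by K + P, matching the bound stated above.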
However, we were unable to derive a similar formula to compute K for RDL-RP. Instead, we provide some insight on potential K values in our experiments in Section 7.

7 Experiments

In this section, we first compare the computation time of RDL with a prior work called Highest Random Weight (HRW). Then, we compare the load uniformity of RDL and SDL. Finally, we compare the amount of probing of RDL-RP and SDL-RP when varying the maximum number of probes, K, and the number of slots, P.

HRW, described in Section 2, attempts to redistribute Web objects (data blocks in our case) residing on a scalable group of proxy servers (disks). HRW does in fact satisfy our first two requirements (even load and minimal data movement) described in Section 1.2; however, it fails Requirement 3 (fast access). The time to redistribute all the blocks is the summation of the access times of each block. HRW's time to redistribute up to D disks is actually similar to SCADDAR's time to redistribute up to D disks for the case where SCADDAR's previous D−1 operations are all 1-disk adds. In this case, the number of disks equals the number of scaling operations. Recalling that HRW requires B×D pseudo-random function calls, for B blocks and D disks, during scaling, we show that this is significantly more time consuming than RDL. In some cases, the computation time of HRW is several orders of magnitude higher than RDL's. Note that the time includes only CPU computation time, not block transfer time, which would be similar for both RDL and HRW since they transfer similar numbers of blocks.

Figure 10: Computation time of HRW and RDL with P = 1,259 and P ≈ 1.25×D.

We simulated the initial placement of 1MB data blocks using RDL and HRW onto a group of 75GB disks and measured the running time. A Pentium III 933MHz PC with 256MB of memory was used to run the simulation, written in C.
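For contrast, HRW scores every disk for every block and keeps the highest weight, which is where the B×D cost comes from. A minimal sketch of this rendezvous-hashing idea, with an assumed MD5-based weight function (the paper's exact hash is not shown here):

```python
import hashlib

def hrw_disk(block_id: str, disks: list[int]) -> int:
    """Highest Random Weight: score every disk for this block, keep the maximum.

    One weight computation per disk per block, hence B*D hash evaluations
    to (re)locate B blocks on D disks after a scaling operation.
    """
    def weight(disk: int) -> int:
        # Assumed weight function: hash of the (block, disk) pair.
        h = hashlib.md5(f"{block_id}:{disk}".encode()).digest()
        return int.from_bytes(h, "big")
    return max(disks, key=weight)
```

Adding a disk changes a block's location only if the new disk wins the maximum, which is why HRW achieves even load and minimal movement; the price is the D weight computations per block measured in Figure 10.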
Since the block layout on D disks from an initial placement is identical to the block layout after scaling up to D disks, we are really measuring the cost of a scaling operation. Figure 10 shows the time needed to scale from any arbitrary number of disks to D disks. Note that, for a more realistic simulation, we increase the number of blocks as we increase the disks instead of keeping the number of blocks constant. Thus, about 75,000 blocks reside on each disk as the system scales. However, if we kept the total number of blocks constant, the percentage improvement of RDL over HRW would remain the same. We compare the time of HRW with RDL when P is fixed at 1,259 and when P is about 25% higher than D. The x-axis is logarithmic and represents the number of disks, ranging from 10 to 1,000. The y-axis, also logarithmic, is the computation time in seconds. HRW requires much more time to compute all the block locations than RDL. For example, with 1,000 disks, HRW requires 31,492 seconds whereas RDL requires only 27 seconds. The large time difference arises because RDL performs only B pseudo-random function calls, plus some probing, for each block, compared to B×D calls for HRW.¹

Next, we conducted several experiments to compare how well our SDL and RDL algorithms reorganize data blocks after successive disk scaling operations. We need a metric that measures the uniformity of the block distribution in order to gauge the load balancing of the set of disks after scaling. We use a "goodness of fit" statistic, χ², as the metric for comparing how well the distributions produced by the SDL and RDL algorithms match a perfectly even distribution (i.e., an equal number of blocks on all disks). The χ² statistic is defined as

\chi^2 = \sum_{i=1}^{D} \frac{(x_i - \mu)^2}{\mu} \qquad (4)

where μ is the expected number of blocks per disk and x_i is the number of blocks on disk i [12].

For each set of experiments, we simulated disk scaling from 10 disks to 10,007 disks using roughly 7.5 million data blocks. Each scaling operation is an addition of 10 disks.
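The χ² statistic of Equation 4 is simple to compute from the per-disk block counts; a perfectly even distribution gives χ² = 0, and larger values indicate worse balance:

```python
def chi_squared(blocks_per_disk: list[int]) -> float:
    """Goodness-of-fit statistic of Equation 4: sum over disks of
    (x_i - mu)^2 / mu, where mu is the average number of blocks per disk."""
    mu = sum(blocks_per_disk) / len(blocks_per_disk)
    return sum((x - mu) ** 2 / mu for x in blocks_per_disk)
```

For example, `chi_squared([100, 100, 100, 100])` is 0.0, while the mildly skewed `[90, 110, 100, 100]` yields 2.0.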
¹Each pseudo-random function call in HRW's algorithm actually requires two srand() and rand() calls, whereas RDL's algorithm requires only one call of each.

For SDL, the first 10 disks are placed in slots 0, ..., 9 and each subsequent 10-disk addition occupies the next 10 available slots (i.e., 10, ..., 19, then 20, ..., 29, etc.). For RDL, the first 10 disks are placed in 10 randomly chosen empty slots and each subsequent 10-disk addition goes into 10 slots randomly chosen from the remaining empty slots. Here, P = 10,007 and we observe the trend of the χ² values as D approaches P.

Figure 11: χ² values of SDL and RDL (P = 10,007).

Figure 11 shows the χ² values of SDL and RDL as we scale the number of disks toward P. The χ² values for the SDL algorithm actually decrease as disks are added, since the bowl-shaped distribution becomes wider, causing the nonuniformity to become less apparent. RDL shows significantly lower χ² values than SDL and thus has a more uniform distribution. Only as D approaches P do the χ² values become similar. The curves converge to the same χ² value when D = P because, when there are no empty slots, the block distribution is identical for SDL and RDL.

For SDL, when adding one disk there is only one possible slot for it: the next available slot. Thus, for every value of D only one χ² value is possible. For RDL, however, there are (P choose D) possible combinations of slots in which the disks can reside, since the slots are chosen randomly, and hence (P choose D) possible χ² values. Because Figure 11 shows the χ² value for just one combination at each D value, we want to find the mean and standard deviation of the χ² values over all (P choose D) combinations for each D value.
However, as we scale D toward P, (P choose D) becomes so large that it is computationally intractable to find a χ² value for every combination. Therefore, we use a random sampling technique to approximate the mean and the standard deviation of the χ² values over all possible combinations for a specified D and P.

Given a D and a P, we take a random sample from a population of size (P choose D) to estimate the population mean, μ. The estimate of the mean is represented as a confidence interval; a 95% confidence interval says that μ falls within the interval with 95% probability [12]. A confidence interval can be computed from parameters of the random sample using Equation 5:

\text{confidence interval} = \bar{X} \pm z\!\left(\frac{1 - \beta/100}{2}\right) \times \frac{s}{\sqrt{n}} \qquad (5)

where β is the percentage of confidence, X̄ is the sample mean, s is the sample standard deviation, n is the sample size, and z(α) is a z-score value. A standard z-score table can be used to look up the z-score for a particular α [12].

We do not display our confidence interval results graphically since the intervals are too small. Instead, we provide the interval values in tabular format. Table 1 shows 95% confidence intervals of random samples taken for D = 1,000, 5,000, and 8,000. Each row in the table is a separate random sample. Table 1 also shows the sample standard deviations, which are good estimators of the population standard deviations [12]. When D = 10,007 there is only one possible χ² value: the one where every slot contains a disk. We set n = 100 for each sample; in other words, 100 χ² values, or sample members, were randomly selected for each random sample from the (P choose D) possible combinations. A sample size where n is greater than 25 or 30 is usually large enough [12]. For each sample member, we chose D unique slots to which we assigned D disks. Table 1 illustrates that the confidence interval size is insignificant: a very small percentage of the sample mean. Therefore, Figure 11 is a good representation for RDL.
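Equation 5 can be evaluated directly; the sketch below substitutes the standard normal inverse CDF for the paper's z-score table lookup:

```python
import math
from statistics import NormalDist

def confidence_interval(sample_mean: float, sample_std: float,
                        n: int, beta: float = 95.0) -> tuple[float, float]:
    """Equation 5: X-bar +/- z((1 - beta/100)/2) * s / sqrt(n)."""
    alpha = (1 - beta / 100) / 2           # e.g. 0.025 for a 95% interval
    z = NormalDist().inv_cdf(1 - alpha)    # ~1.96 for beta = 95
    half_width = z * sample_std / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width
```

For the D = 1,000 row of Table 1 (sample mean ≈ 10,410.4, s = 475.0, n = 100), this reproduces an interval of roughly (10,317, 10,504), matching the table.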
The distribution uniformity of SDL and RDL can be improved with random probing, but these additional probes increase the overall number of probes. We empirically measured the amount of probing required by SDL-RP and RDL-RP to find blocks. Recall that for these algorithms, we perform at most K random probes in Phase 1 and double hashing in Phase 2. When K = 0, SDL-RP is equivalent to SDL and RDL-RP is equivalent to RDL. We found that varying the value of K causes the average number of probes required by SDL-RP to vary, while the average for RDL-RP remains relatively constant. We set the number of disks, D, to 100 and the number of slots, P, to 10,007 for these experiments. Figure 12a shows that with a low K, SDL-RP requires almost 1.5 times more probes on average than RDL-RP, and there is a higher chance that probing terminates in Phase 2 (double hashing) for both algorithms. Termination in Phase 2 for SDL-RP leads to more probing than termination in Phase 2 for RDL-RP. This results from the large gap of consecutive empty slots in SDL-RP, since all the disks are clustered together. Using a higher K causes SDL-RP's average probe count to converge with that of RDL-RP, and Phase 1 termination becomes more likely, probabilistically leading to P/D average probes for both algorithms.

Clearly, setting K to a lower value results in better probing performance for RDL-RP than for SDL-RP. Although the amount of probing required by the two algorithms in Figure 12a seems similar when using large K's, the maximum probe length of RDL-RP is shorter than that of SDL-RP. We show in Figure 12b that when setting K = 550, SDL-RP had a maximum probe length of 9,930 while RDL-RP's maximum probe length was 1,206. In general, for either algorithm, setting K to a large value should be avoided since it leads to higher maximum total probe lengths.

The choice of the number of slots, P, affects the average number of probes for SDL-RP and RDL-RP.
We set the number of disks, D, to 100 and use K = 50 random probes for Phase 1 to show the average number of probes as P varies. Figure 13a shows that, for both algorithms, the average number of probes increases as P increases, since there are more empty slots to probe. The maximum probe lengths also increase, as shown in Figure 13b. However, as P increases, SDL-RP requires more probes than RDL-RP since the size of the gap of consecutive empty slots increases.

Table 1: Estimation of the χ² means and standard deviations.

D       Sample size (n)   95% confidence interval for pop. mean   Sample std. dev. (s)   Interval size×100/X̄ (%)
1,000   100               (10,317.3, 10,503.6)                    475.0                  1.79%
5,000   100               (29,700.5, 29,976.8)                    704.8                  0.93%
8,000   100               (32,700.6, 32,957.7)                    656.0                  0.78%
10,007  1                 23,509.5 (actual average)               0                      n/a

Figure 12a: Average number of total probes. Figure 12b: Maximum number of total probes. Figure 12: Varying the maximum number of random probes, K (D = 100, P = 10,007).

Figure 13a: Average number of total probes. Figure 13b: Maximum number of total probes. Figure 13: Varying the number of slots, P (D = 100, K = 50).

We want to set P to a large enough value to allow for more scale-up room, since the maximum number of disks that the storage system can scale up to is P. However, a large P requires more probing, so, ideally, the growth of the storage system should be gauged beforehand to determine a good P, since it cannot be altered later without causing a complete reorganization of data blocks.
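Since P is fixed for the life of the system, it is worth choosing it carefully. The helper below is a sketch, assuming (as the paper's choices of 101, 1,009, 1,259, and 10,007 suggest) that P is picked as a prime with some headroom over the largest anticipated D, so that double-hashing probe sequences remain permutations of the slots:

```python
def is_prime(n: int) -> bool:
    """Trial division; adequate for slot counts of this magnitude."""
    if n < 2:
        return False
    for f in range(2, int(n ** 0.5) + 1):
        if n % f == 0:
            return False
    return True

def choose_p(max_disks: int, headroom: float = 1.25) -> int:
    """Smallest prime P >= headroom * max_disks; P bounds future scale-up."""
    p = int(max_disks * headroom)
    while not is_prime(p):
        p += 1
    return p
```

For example, `choose_p(1000)` returns 1,259, the fixed P used in Figure 10.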
An advantage of RDL-RP here is that even with a large P value, RDL-RP requires fewer average and maximum probes than SDL-RP, as shown in Figure 13.

RDL produces a more uniform distribution of blocks than SDL, as seen in Figure 11. When using a low maximum number of random probes, K, Figure 12a shows that RDL-RP requires fewer probes than SDL-RP when locating blocks. Figure 13a shows that with a large number of slots, P, RDL-RP results in fewer probes. Moreover, a large P value allows for a more scalable disk set. In all cases, RDL-RP performs the same as or better than SDL-RP.

8 Implementation issues

In this section we explore several implementation issues with the disk scaling techniques described in this paper. These issues include continuous file availability during block reorganization and providing block redundancy to improve load balancing.

8.1 Availability

When disk scaling is performed and block reorganization is occurring, the availability of all files should be maintained. In general, if a block is identified and designated to move from a source disk to a target disk, a copy of the block should first be placed on the target disk. Then, eventually, depending on the type of reorganization approach described below, the block copy becomes visible and the original block is deleted. A question that remains is how to identify the sequence of blocks to be moved. We address this question with two approaches: a file-by-file approach and a disk-by-disk approach.

With file-by-file reorganization, we reorganize an entire file before updating system configurations to indicate that the file is now in accordance with the new disk layout. Once the entire file is reorganized, the original blocks are deleted and the new block layout becomes visible to all file requests. We repeat this for every file on the storage system. The downside of this approach is that disks are accessed multiple times during reorganization, since each disk will most likely contain blocks from every file.
Another approach is reorganizing blocks disk-by-disk. This approach avoids repeated disk accesses and reduces seek overhead, since we reorganize an entire disk at a time. Here, we evaluate every block from a list of blocks on a disk without actually accessing the blocks until they are copied to a target disk. For every block on the block list, we can use the block's name to determine whether the block should be copied or not. Since we store each block as a file, the block's name actually contains the name of the file object that the block belongs to and the block number. For example, a block named "TopGun 311.blk" indicates that this block is the 311th block of file object TopGun. Now, we can use "TopGun"+311 as the seed to the pseudo-random number generator to determine whether the block needs to be copied out. Once all the disks are reorganized, the system is updated so that all requests switch to the new disk layout. The original blocks are then deleted. The drawback here is that the new disk layout is realized only after all the blocks are reorganized. Also, when deciding whether blocks should move, every block in the list causes repeated iterations through the files, which adds to the computational complexity.

8.2 Block replication

With random data placement, short-term load imbalance is statistically possible: some disks could be temporarily overloaded while other disks sit idle. In [14], the authors suggest the use of block replication to improve load balancing in these cases. Even with only 25% of the blocks replicated, load balancing is significantly improved. To apply a 25% block replication scheme to our disk labeling technique, we consider each block and decide with 25% probability whether to replicate it or not. If a block is to be replicated, we decide where the block copy should reside based on the probe sequence. Recall that the original block resides on the first disk along the probe sequence.
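The naming convention above can be sketched in code. The seed encoding (object name concatenated with the block number) follows the "TopGun"+311 example; the probe-sequence generator and the rule that a replica occupies the next occupied slot of the sequence are assumptions for illustration:

```python
import random

def block_seed(block_name: str) -> str:
    """'TopGun 311.blk' -> 'TopGun311' (object name + block number)."""
    obj, num = block_name.rsplit(".", 1)[0].rsplit(" ", 1)
    return obj + num

def placement(block_name: str, occupied: set[int], P: int, replicate: bool):
    """First occupied slot along the probe sequence holds the original block;
    if the block is replicated, the next distinct occupied slot holds the copy.
    Assumes at least two occupied slots when replicate is True."""
    rng = random.Random(block_seed(block_name))
    found: list[int] = []
    while len(found) < (2 if replicate else 1):
        slot = rng.randrange(P)                  # assumed probe generator
        if slot in occupied and slot not in found:
            found.append(slot)
    return found
```

Because the probe sequence is re-derivable from the name alone, the copy/no-copy decision for a whole disk's block list can be made without reading any block data.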
Similarly, the block copy should reside on the next disk along the probe sequence. Note that the block copy will not reside on the original block's disk, since the probe sequence does not contain any duplicate disks. During reorganization due to a disk addition, if an original block is moved, then its copy moves to the disk of the original block. Likewise, reorganization due to a disk removal causes the original block to move to the block copy's location and the block copy to move to the next disk along the probe sequence. Some optimizations can be applied here, but they are beyond the scope of this paper.

9 Concluding discussions and future work

The three requirements from Section 1.2 are satisfied by RDL and RDL-RP. SDL-RP also satisfies these requirements with enough random probes.

For the SDL-RP, RDL, and RDL-RP algorithms, after every scaling operation, the placement of the blocks is identical to the placement that would have been achieved had we initially placed the blocks on those disks. Therefore, no history information about scaling operations needs to be maintained to find block locations. This is because, for a specific block, the probe sequence always remains the same; scaling operations only affect the point along the probe sequence where the block resides. Because scaling operations always bring the block distribution back to an initial state, and all initial states are uniform, uniformity is maintained. This provides load balancing of disks and satisfies Requirement 1.

The amount of block movement during scaling operations is minimized, so Requirement 2 is satisfied. Blocks only move to new disks during addition operations and never from an old disk to another old disk. Also, blocks to be moved are randomly chosen from the old disks. Similarly, for removal operations, blocks are moved off of removed disks and randomly redistributed across the remaining disks.

Finally, a maximum of K + P probes is guaranteed for SDL-RP and RDL-RP to find any particular block.
This satisfies Requirement 3, since access complexity is measured by the number of probes needed to locate a block. Again, probing is performed only on the slots of our memory-resident address space. In practice, even though we rarely observe the worst-case maximum probe lengths, we must take measures to reduce them by using as few random probes as possible.

We will continue to refine RDL-RP. Scalability would be improved if we could grow beyond P disks without requiring a complete reorganization of all the data blocks. One possibility is to use another scaling technique, such as SCADDAR [5], once the bound of P is reached. Or, in certain situations, a gradual complete reorganization may be feasible. We are also planning to incorporate RDL-RP into the storage subsystem of an actual real-time server system as well as non-real-time servers.

Throughout this paper, we have assumed that our disk scaling techniques operate on a set of homogeneous physical disks. However, oftentimes when retiring an old disk, the new replacement disk has improved characteristics, since the older disk models are no longer available. So we need to extend our techniques to operate on heterogeneous physical disks. Without modifying our algorithms, we can accomplish this by using a technique called Disk Merging, described in [18], to build a layer of homogeneous logical disks above the heterogeneous physical disks. Our disk labeling techniques can then operate on these logical disks as usual.

Data mirroring may be a solution for fault tolerance with RDL. Mirrored blocks can be placed at a fixed offset determined by a function f(D); for example, f(D) could return D/2 as an offset. Also, each storage unit can in fact be a RAID device, and scaling would occur by adding or removing these devices.

We also wish to address the problem of dynamic partitioning of a set of logical disks. Although the total number of disks does not change, the partitioning of the disks could change based on user requests.
Some partitions will be scaled up while, at the same time, other partitions are scaled down. In both cases, data blocks sitting on the partitions must be redistributed in order to maintain a balanced load.

We also wish to investigate how these scaling techniques could be applied to storage systems that need to efficiently store a high influx of data streams, such as those generated by a large population of sensors. Another interesting issue to explore is efficient data location in large, scalable peer-to-peer systems.

References

[1] S. Berson, R. R. Muntz, and W. R. Wong. Randomized Data Allocation for Real-Time Disk I/O. In COMPCON, pages 286–290, 1996.
[2] J. Byers, J. Considine, and M. Mitzenmacher. Simple Load Balancing for Distributed Hash Tables. In Proceedings of the 2nd International Workshop on Peer-to-Peer Systems (IPTPS '03), February 2003.
[3] S. Ghandeharizadeh and D. Kim. On-line Reorganization of Data in Scalable Continuous Media Servers. In 7th International Conference and Workshop on Database and Expert Systems Applications (DEXA '96), September 1996.
[4] S. Ghandeharizadeh and S. H. Kim. Striping in Multi-disk Video Servers. In Proceedings of the SPIE High-Density Data Recording and Retrieval Technologies Conference, pages 88–102, October 1995.
[5] A. Goel, C. Shahabi, S.-Y. D. Yao, and R. Zimmermann. SCADDAR: An Efficient Randomized Technique to Reorganize Continuous Media Blocks. In Proceedings of the 18th International Conference on Data Engineering, pages 473–482, February 2002.
[6] D. Karger, E. Lehman, T. Leighton, M. Levine, D. Lewin, and R. Panigrahy. Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web. In Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing (STOC), pages 654–663, May 1997.
[7] D. E. Knuth. The Art of Computer Programming, volume 3. Addison-Wesley, 1998.
[8] P.-Å. Larson. Dynamic Hash Tables. Communications of the ACM, 31(4), April 1988.
[9] C. Martin, P. S. Narayan, B. Özden, R. Rastogi, and A. Silberschatz. The Fellini Multimedia Storage Server. In S. M. Chung, editor, Multimedia Information Storage and Management, chapter 5. Kluwer Academic Publishers, Boston, August 1996. ISBN: 0-7923-9764-9.
[10] R. Morris. Scatter Storage Techniques. Communications of the ACM, 11(1):38–44, January 1968.
[11] R. Muntz, J. Santos, and S. Berson. RIO: A Real-time Multimedia Object Server. In ACM Sigmetrics Performance Evaluation Review, volume 25, September 1997.
[12] J. A. Rice. Mathematical Statistics and Data Analysis. Duxbury Press, 1995.
[13] J. R. Santos and R. R. Muntz. Performance Analysis of the RIO Multimedia Storage System with Heterogeneous Disk Configurations. In ACM Multimedia, pages 303–308, 1998.
[14] J. R. Santos, R. R. Muntz, and B. Ribeiro-Neto. Comparing Random Data Allocation and Data Striping in Multimedia Servers. In SIGMETRICS, Santa Clara, California, June 17–21, 2000.
[15] C. Shahabi, R. Zimmermann, K. Fu, and S.-Y. D. Yao. Yima: A Second Generation Continuous Media Server. IEEE Computer, pages 56–64, June 2002.
[16] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications. In Proceedings of the 2001 ACM SIGCOMM Conference, pages 149–160, May 2001.
[17] D. G. Thaler and C. V. Ravishankar. Using Name-Based Mappings to Increase Hit Rates. IEEE/ACM Transactions on Networking, 6(1):1–14, February 1998.
[18] R. Zimmermann and S. Ghandeharizadeh. Continuous Display Using Heterogeneous Disk-Subsystems. In Proceedings of the Fifth ACM Multimedia Conference, pages 227–236, Seattle, Washington, November 9–13, 1997.

A Minimizing the random probing trials for SDL-RP

We can control the amount of uniformity resulting from the SDL-RP algorithm by varying the maximum number of random probing trials before switching to Phase 2.
Different levels of uniformity may be better suited for different applications. An application that is not as sensitive to a uniform block distribution may benefit from fewer probes when finding block addresses.

A perfectly uniform block distribution results in a 1/D fraction of all the blocks on every disk. The worst-case distribution, the bowl-shaped one, has more blocks on the first and last disks and the fewest blocks on the center disks. We use the bowl bottom, b, to measure the shape of the bowl resulting from SDL with D disks. More specifically, the bowl bottom is the load percentage of the ⌊D/2⌋-th disk. We will show that the maximum number of random probing trials, K, can be estimated given a desired bowl bottom, b_des, where b ≤ b_des ≤ 1/D.

The random probing phase of the SDL-RP algorithm produces a uniform distribution where, by the law of large numbers, b_des is expected to be 1/D, which is ideal. This is observed using SDL-RP when K = ∞. On the other hand, the SDL phase of SDL-RP results in a bowl-shaped distribution with an expected bowl bottom, b. This happens with K = 0 using SDL-RP. So increasing (decreasing) K will increase (decrease) b_des. A high K value results in a more uniform distribution but requires more probes. Given a K value, the SDL-RP algorithm is as follows:

0) Allocate disks sequentially.
1) Phase 1: Perform random probing trials until either a disk is found (END), or K trials have been performed (go to Step 2).
2) Phase 2: Perform double hashing until a disk is found. END. (A disk will be found in under P probes.)

We can achieve the best and worst-case block distributions using K = ∞ and K = 0, respectively, but how do we compute an exact K value to achieve some given b_des in the range b, ..., 1/D? To do this, we find the probability, α, of termination (finding a disk) in the random probing phase, Phase 1, within K trials.
If K unsuccessful random probing trials have occurred, then termination will occur in Phase 2, which is equivalent to performing SDL alone. The probability of Phase 2 termination is then 1 − α. Since each of these phases results in either a 1/D load percentage (Phase 1) or a b load percentage (Phase 2), we can find the expected load of the bowl bottom, E(b), in Equation 6:

E(b) = \alpha \times \frac{1}{D} + (1 - \alpha) \times b \qquad (6)

We substitute b_des for E(b), since the expected bowl bottom value is our desired value, and simplify this equation into Equation 7 in order to solve for α:

\alpha = \frac{b_{des} - b}{\frac{1}{D} - b} \qquad (7)

The distribution of the number of disk hits we make within K trials is approximated by a normal distribution, where we expect to hit a disk on average μ = K × D/P times with a standard deviation of σ = √(K × (D/P) × (1 − D/P)). This approximation is dictated by the Central Limit Theorem and becomes more accurate with a large number of iterations of the SDL-RP algorithm. In other words, we can assume a normal distribution and use normal distribution tables when trying to find home disks for a large number of blocks.

Using α from Equation 7, we can look up its z-score in a standard z-score table for normal distributions [12]. We denote this z(α); it is the number of standard deviations between the mean value, μ, and a certain value, x. K can be found using this z(α) and the z-score formula in Equation 8:

z(\alpha) = \frac{\mu - x}{\sigma} \implies z(\alpha) = \frac{K\frac{D}{P} - x}{\sqrt{K\frac{D}{P}\left(1 - \frac{D}{P}\right)}} \qquad (8)

Now we can solve for K since we have μ and σ. We set x = 1 since we want the probability, α, of 1 disk hit occurring. Solving the resulting quadratic in √K and taking the positive root gives Equation 9:

K = \left( \frac{z(\alpha)\sqrt{\frac{D}{P}\left(1 - \frac{D}{P}\right)} + \sqrt{z(\alpha)^2 \frac{D}{P}\left(1 - \frac{D}{P}\right) + 4\frac{D}{P}}}{2\frac{D}{P}} \right)^2 \qquad (9)

Figure 14: The minimum number of random probing trials, K, needed to reach the desired bowl bottom, b_des (D = 10, P = 101).

Figure 14 shows the number of trials needed to achieve a bowl bottom of b_des.
A greater number of trials produces a more uniform distribution. The maximum, or most uniform, b_des is 1/D, while the minimum b_des is the expected value of the bowl bottom, b. In this case, we can achieve good uniformity by using a K value of ⌊P/2⌋ trials. We can set K to the number of trials that gives us a fairly high b_des which is, for example, within 3% of 1/D, as shown in Figure 14. In other words, by fixing b_des = 1/D − (0.03 × (1/D − b)) we can easily calculate K; we call this value K_3%. However, K_3% depends on D and P, since these values affect the curve in Figure 14.

Figure 15: The upper bound, K_U, for D = 10, 20, and 50 (P = 101).

Figure 15 shows that K_3% decreases as we increase D with P fixed at 101. This is because the chance of hitting a disk, D/P, becomes greater as D increases, so fewer random probing trials are needed. Using the SDL-RP algorithm, we are not able to change the value of K_3% as we scale up disks, since blocks that required K_3% probes during the random probing phase would not be found if we later used a smaller K_3%. So, to perform disk scale-ups, we fix K_3% to a high value according to a low D:P ratio, since a high K_3% value can still be used as D increases.
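The chain from b_des to K (Equations 7 through 9) can be evaluated mechanically. The sketch below replaces the z-score table with the standard normal inverse CDF and takes the positive root of the quadratic in √K; the bowl-bottom value b is an input measured from SDL's distribution (its formula appears earlier in the paper):

```python
import math
from statistics import NormalDist

def estimate_k(d: int, p: int, b: float, b_des: float) -> int:
    """Maximum random-probing trials K for a desired bowl bottom b_des,
    with b < b_des < 1/d (Equations 7-9)."""
    alpha = (b_des - b) / (1 / d - b)     # Eq. 7: Phase 1 termination probability
    z = NormalDist().inv_cdf(alpha)       # z-score lookup via inverse normal CDF
    q = d / p                             # chance a single random probe hits a disk
    # Eq. 9: positive root of the quadratic in sqrt(K) from Eq. 8 with x = 1.
    root = (z * math.sqrt(q * (1 - q))
            + math.sqrt(z * z * q * (1 - q) + 4 * q)) / (2 * q)
    return math.ceil(root ** 2)
```

With D = 10, P = 101, and a bowl bottom of roughly b ≈ 0.085 read off Figure 14 (an assumed value), a b_des within 3% of 1/D yields K ≈ 51, consistent with the ⌊P/2⌋ trials cited above.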
Shu-Yuen Didi Yao, Cyrus Shahabi, Per-Ake Larson. "Disk labeling techniques: Hash-based approaches to disk scaling." Computer Science Technical Reports (Los Angeles, California, USA: University of Southern California. Department of Computer Science) no. 785 (2003).