USC Computer Science Technical Reports, no. 628 (1996)
Mitra: A Scalable Continuous Media Server
Shahram Ghandeharizadeh, Roger Zimmermann, Weifeng Shi, Reza Rejaie, Doug Ierardi, Ta-Wei Li
Computer Science Department
University of Southern California
Los Angeles, California
February
Abstract
Mitra is a scalable storage manager that supports the display of continuous media data types, e.g., audio and video clips. It is a software-based system that employs off-the-shelf hardware components. Its present hardware platform is a cluster of multi-disk workstations connected using an ATM switch. Mitra supports the display of a mix of media types. To reduce the cost of storage, it supports a hierarchical organization of storage devices and stages the frequently accessed objects on the magnetic disks. For the number of displays to scale as a function of additional disks, Mitra employs staggered striping. It implements three strategies to maximize the number of simultaneous displays supported by each disk. First, the EVEREST file system allows different files (corresponding to objects of different media types) to be retrieved at different block size granularities. Second, the FIXB algorithm recognizes the different zones of a disk and guarantees a continuous display while harnessing the average disk transfer rate. Third, Mitra implements the Grouped Sweeping Scheme (GSS) to minimize the impact of disk seeks on the available disk bandwidth.
In addition to reporting on implementation details of Mitra, we present performance results that demonstrate the scalability characteristics of the system. We compare the obtained results with theoretical expectations based on the bandwidth of participating disks. Mitra attains between to of the theoretical expectations.
This research was supported in part by a Hewlett-Packard unrestricted cash/equipment gift and the National Science Foundation under grants IRI, IRI (NYI award), and CDA.
Introduction
The past few years have witnessed many design studies describing different components of a server that supports continuous media data types, such as audio and video. The novelty of these studies is attributed to two requirements of continuous media that are different from traditional textual and record-based data. First, the retrieval and display of continuous media are subject to real-time constraints that impact both (a) the storage, scheduling, and delivery of data, and (b) the manner in which multiple users may share resources. If the real-time constraints are not satisfied, then a display might suffer from disruptions and delays that result in jitter with video and random noises with audio. These disruptions and delays are termed hiccups. Second, objects of this media type are typically large in size. For example, a two-hour MPEG-encoded video requiring Megabits per second (Mbps) for its display is Gigabyte in size. Three minutes of uncompressed CD-quality audio with a Mbps bandwidth requirement is Megabyte (MByte) in size. The same audio clip in MPEG-encoded format might require Mbps for its display and is MByte.
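The size figures above follow from multiplying a constant display bandwidth by the duration of the clip. A minimal sketch of that arithmetic is given below; the function name and the example bandwidth are illustrative assumptions and are not values taken from the report.

```python
def object_size_bytes(bandwidth_mbps: float, duration_seconds: float) -> float:
    """Size of a constant-bit-rate object: bandwidth (bits/s) * duration / 8."""
    bits = bandwidth_mbps * 1_000_000 * duration_seconds
    return bits / 8


# Hypothetical example: a two-hour clip displayed at an assumed 4 Mbps.
two_hour_clip = object_size_bytes(bandwidth_mbps=4.0, duration_seconds=2 * 3600)
two_hour_clip_gb = two_hour_clip / 1e9  # decimal gigabytes
```

Because size grows linearly in both factors, doubling either the bandwidth or the duration doubles the storage requirement.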
Mitra is a realization of several promising design concepts described in the literature. Its primary contributions are two-fold: to demonstrate the feasibility of these designs, and to achieve the non-trivial task of gluing these together into a system that is both high performance and scalable. Mitra is a software-based system that can be ported to alternative hardware platforms. It guarantees simultaneous display of a collection of different media types as long as the bandwidth required by the display of each media type is constant (isochronous). For example, Mitra can display both CD-quality audio with a Mbps bandwidth requirement and an MPEG-encoded stream with a Mbps bandwidth requirement (two different media types) at the same time, as long as the bandwidth required by each display is constant. Moreover, Mitra can display those media types whose bandwidth requirements might exceed that of a single disk drive (e.g., uncompressed ATV-quality video objects requiring Mbps for their display) in support of high-end applications that cannot tolerate the use of compression techniques.
Due to their large size, continuous media objects are almost always disk resident. Hence, the limiting resource in Mitra is the available disk bandwidth, i.e., the traditional I/O bottleneck phenomenon. Mitra is scalable because it can service a higher number of simultaneous displays as a function of additional disk bandwidth. The key technical idea that supports this functionality is to distribute the workload imposed by each display evenly across the available disks using staggered striping [BGMJ] to avoid the formation of hot spots and bottleneck disks.
Mitra is high performance because it implements techniques that maximize the number of displays supported by each disk. These techniques both minimize the number of seeks (using EVEREST [GIZ]) and the amount of time attributed to each seek (GSS [YCK]), or maximize the transfer rate of multi-zone disks (FIXB [GKSZ]). Mitra's file system is EVEREST. As compared with other file systems, EVEREST provides two functionalities. First, it enables Mitra to retrieve different files at different block size granularities. This minimizes the percentage of disk bandwidth that is wasted when Mitra displays objects that have different bandwidth requirements. Second, it avoids the fragmentation of disk space when supporting a hierarchy of storage devices [CHL] where different objects are swapped in and out of the available disk space over time. GSS minimizes the amount of time attributed to each seek by optimizing the disk scheduling algorithm. Finally, FIXB in combination with EVEREST enables Mitra to guarantee a continuous display while harnessing the average transfer rate of multi-zone disks [RW, GSZ]. FIXB enables Mitra to strike a compromise between the percentage of wasted disk space and how much of its transfer rate is harnessed. With each of these techniques, there are tradeoffs associated with the choices of values for system parameters. Although these tradeoffs have been investigated using analytical and simulation studies, Mitra's key contribution is to demonstrate that these analyses hold true in practice. It shows that one does not have to rewrite software to support diverse applications with different performance objectives (startup latency versus throughput versus wasted disk space). Instead, there is a single system where different choices of parameters support different applications.
Parameter      Definition
               Number of media types
$R_C(M_i)$     Bandwidth required to display objects of media type i
$R_D$          Bandwidth of a disk
$B(M_i)$       Block size for media type i
D              Total number of disks
d              Number of disks that constitute a cluster
C              Number of clusters recognized by the system
g              Number of groups with GSS
k              Stride with staggered striping
N              Number of simultaneous displays supported by the system
S              Maximum height of sections with EVEREST
$\omega$       Number of contiguous buddies of section height i that form a section of height i + 1
Table: Parameters and their definition.
Several related studies have described the implementation of continuous media servers. (We do not report on commercial systems due to the lack of their implementation detail; see [Nat] for an overview of these systems.) These can be categorized into single-disk and multi-disk systems. The single-disk systems include [AOG, LS, RC, GBC]. These pioneering studies were instrumental in identifying the requirements of continuous media. They developed scheduling policies for retrieving blocks from disk into memory to support a continuous display. (Mitra employs these policies as detailed in Section.) Compared with Mitra, most of them strive to be general purpose and support traditional file system accesses in addition to a best-effort delivery of continuous media. Thus, none strives to maximize the number of displays supported by a disk using alternative disk scheduling policies, techniques that harness the average transfer rate of disk zones, or strategies that constrain the physical file layout. The multi-disk systems include Streaming RAID [TPBG], Fellini [ORS], and Minnesota's VOD server [HLL]. None claims to support either the display of a mix of media types or a hierarchical storage structure, nor do they describe the implementation of a file system that ensures contiguous layout of a block on the disk storage medium. (The authors of Fellini identify the design of a file system such as the one developed for Mitra as an important research direction in [ORS].) Moreover, all three systems employ disk arrays where the number of disks that are treated as a single logical disk is predetermined by the hardware. Mitra differs in that the number of disks that are treated as one logical disk is not hardware dependent. Instead, it is determined by the bandwidth requirement of a media type. Indeed, if one analyzes two different displays with each accessing a different media type, one display might treat two disks as one logical disk while the other might treat five disks as one logical disk. This has a significant impact on the number of simultaneous displays supported by the system, as detailed in Section.
Streaming RAID implements GSS to maximize the bandwidth of a disk array and employs memory sharing to minimize the amount of memory required at the server. It develops analytical models similar to [GRa, GK] to estimate the performance of the system with alternative configuration parameters. Fellini analyzes constrained placement of data to enhance the performance of the system with multi-zone disks. The design appears to be similar to FIXB. Fellini describes several designs to support VCR features such as Fast Forward and Rewind. (We hint at Mitra's designs to support this functionality in Section and do not detail them due to lack of space.) Neither Fellini nor Streaming RAID presents performance numbers from their system. Minnesota's VOD server differs from both Mitra and the other two multi-disk systems in that it does not have a centralized scheduler. Hence, it cannot guarantee a continuous display. However, [HLL] presents performance numbers to demonstrate that a mass storage system can display continuous media.
The rest of this paper is organized as follows. In Section, we provide an overview of the software components of Mitra and its current hardware platform. Section describes the alternative components of the system (EVEREST, GSS, FIXB, and staggered striping) and how they interact with each other to guarantee a continuous display. Section presents experimental performance results obtained from Mitra. As a yardstick, we compare these numbers with theoretical expectations based on analytical models that determine the performance of the system based on the available disk bandwidth [GK, GKS]. The obtained results demonstrate the scalability of the system and show that Mitra attains between to of the theoretical expectations. Our future research directions are presented in Section.
Mbps             Megabits per second.
Block            Amount of data retrieved per time period on behalf of a PM displaying an object of media type i. Its size varies depending on the media type and is denoted as $B(M_i)$.
Fragment         Fraction of a block assigned to one disk of a cluster that contains the block. All fragments of a block are equi-sized.
Time period      The amount of time required to display a block at a station. This time is fixed for all media types, independent of their bandwidth requirement.
Page             Basic unit of allocation with EVEREST, also termed sections of height 0.
Startup latency  Amount of time elapsed from when a PM issues a request for an object to the onset of the display.
Table: Defining terms.
An Overview of Mitra
Mitra employs a hierarchical organization of storage devices to minimize the cost of providing online access to a large volume of data. It is currently operational on a cluster of HP 9000/735 workstations. It employs a HP Magneto Optical Jukebox as its tertiary storage device. Each workstation consists of a 125 MHz PA-RISC CPU, MByte of memory, and four Seagate STW magnetic disks. Mitra employs the HP-UX operating system version and is portable to other hardware platforms. While disks can be attached to the fast and wide SCSI bus of each workstation, we attached four disks to this chain because additional disks would exhaust the bandwidth of this bus. It is undesirable to exhaust the bandwidth of the SCSI bus for several reasons. First, it would cause the underlying hardware platform to not scale as a function of additional disks. Mitra is a software system, and if its underlying hardware platform does not scale then the entire system would not scale. Second, it renders the service time of each disk unpredictable, resulting in hiccups.
Mitra consists of three software components:
1. Scheduler: this component schedules the retrieval of the blocks of a referenced object in support of a hiccup-free display at a PM. In addition, it manages the disk bandwidth and performs admission control. Currently, Scheduler includes an implementation of EVEREST, staggered striping, and techniques to manage the tertiary storage device. It also has a simple relational storage manager to insert and retrieve information from a catalog. For each media type, the catalog contains the bandwidth requirement of that media type and its block size. For each presentation, the catalog contains its name, whether it is disk resident (if so, the name of EVEREST files that represent this clip), the cluster and zone that contains its first block, and its media type.
2. mass storage Device Manager (DM): Performs either disk or tertiary read/write operations.
3. Presentation Manager (PM): Displays either a video or an audio clip. It might interface with hardware components to minimize the CPU requirement of a display. For example, to display an MPEG clip, the PM might employ either a program or a hardware card to decode and display the clip. The PM implements the PM-driven scheduling policy of Section to control the flow of data from the Scheduler.
[Figure: Hardware and software organization of Mitra. Labels: PM processes (Audio, MPEG-1, and MPEG-2 players); Scheduler with EVEREST, Catalog, Rel. DB, and User Interface; DM 0 to DM 14 (DM: Disk Manager); EVEREST Volumes 1 to 12; N: HP-NOSE in each process; ATM Switch; HP 9000/735 workstations (125 MHz PA-RISC); SCSI-2 fast and wide (160 Mbps); SCSI-2 fast (80 Mbps); HP Magneto-Optical Disk Library (2 drives, 32 platters).]
Mitra uses UDP for communication between the process instantiation of these components. UDP is an unreliable transmission protocol. Mitra implements a light-weight kernel, named HP-NOSE. HP-NOSE supports a window-based protocol to facilitate reliable transmission of messages among processes. In addition, it implements the threads with shared memory, ports that multiplex messages using a single HP-UX socket, and semaphores for synchronizing multiple threads that share memory. An instantiation of this kernel is active per Mitra process.
For a given configuration, the following processes are active: one Scheduler process, a DM process per mass storage read/write device, and one PM process per active client. For example, in our twelve-disk configuration with a magneto optical jukebox, there are sixteen active processes: fifteen DM processes and one Scheduler process (see Figure). There are two active DM processes for the magneto jukebox because it consists of two read/write devices (and optical platters that might be swapped in and out of these two devices).
The combination of the Scheduler with DM processes implements asynchronous read/write operations on a mass storage device (that is otherwise unavailable with HP-UX). This is achieved as follows.
When the Scheduler intends to read a block from a device (say a disk), it sends a message to the DM that manages this disk to read the block. Moreover, it requests the DM to transmit its block to a destination port address (e.g., the destination might correspond to the PM process that displays this block) and issue a done message to the Scheduler. There are several reasons for not routing data blocks to active PMs using the Scheduler. First, it would waste the network bandwidth with multiple transmissions of a block. Second, it would cause the CPU of the workstation that supports the Scheduler process to become a bottleneck with a large number of disks. This is because a transmitted data block would be copied many times by different layers of software that implement the Scheduler process (HP-UX, HP-NOSE, and the Scheduler).
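The message flow just described can be sketched with in-memory queues standing in for HP-NOSE ports. This is only an illustration under stated assumptions: the class names, the `step` method, and the tuple formats are hypothetical and do not reflect Mitra's actual message layout.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class ReadRequest:
    block_id: str
    dest_port: str  # port of the process that should receive the data (e.g., a PM)


class DeviceManager:
    """Hypothetical stand-in for a DM process that owns one device."""

    def __init__(self, blocks):
        self.blocks = blocks   # block_id -> bytes stored on this device
        self.inbox = deque()   # requests queued by the Scheduler

    def step(self, ports, scheduler_inbox):
        # Serve one request: the data goes straight to the destination port,
        # and only a small "done" notice travels back to the Scheduler.
        req = self.inbox.popleft()
        ports[req.dest_port].append((req.block_id, self.blocks[req.block_id]))
        scheduler_inbox.append(("done", req.block_id))


ports = {"pm-1": deque()}
scheduler_inbox = deque()
dm = DeviceManager({"X0": b"\x00" * 16})

dm.inbox.append(ReadRequest("X0", "pm-1"))  # the Scheduler does not block here
dm.step(ports, scheduler_inbox)             # the DM performs the read later
```

The block payload never passes through the Scheduler's queue, which mirrors the two reasons given above for not routing data through the Scheduler.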
While the interaction between the different processes and threads is interesting, we do not report on them due to lack of space.
Continuous Display with Mitra
We start by describing the implementation techniques of Mitra for a configuration that treats the d available disks as a single disk drive. This discussion introduces EVEREST [GIZ] (Mitra's file system) and motivates a PM-driven scheduling paradigm that provides feedback from a PM to the Scheduler to control the rate of data production. Subsequently, we discuss an implementation of the staggered striping [BGMJ] technique.
One Disk Configuration
To simplify the discussion and without loss of generality, conceptualize the d disks as a single disk with the aggregate transfer rate of d disks. When we state that a block is assigned to the disk, we imply that the block is declustered [GRAQ, BGMJ] across the d disks. Each piece of this block is termed a fragment. Moreover, when we state a DM reads a block from the disk, we imply that d DM processes are activated simultaneously to produce the fragments that constitute the block.
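Declustering a block into d equi-sized fragments can be sketched as follows. The function names are hypothetical, and the requirement that the block size be a multiple of d is an assumption made here to keep all fragments the same size.

```python
def decluster(block: bytes, d: int) -> list[bytes]:
    """Split a block into d equi-sized fragments, one per disk of the cluster."""
    if d <= 0 or len(block) % d != 0:
        raise ValueError("block size must be a positive multiple of d")
    size = len(block) // d
    return [block[i * size:(i + 1) * size] for i in range(d)]


def reassemble(fragments: list[bytes]) -> bytes:
    """Concatenate the fragments, in disk order, back into the block."""
    return b"".join(fragments)
```

Reading the block then corresponds to d reads issued in parallel, one per fragment, followed by reassembly at the receiver.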
To display an object X of media type $M_i$ (say CD-quality audio) with bandwidth requirement $R_C(M_i)$ Mbps, Mitra conceptualizes X as consisting of r blocks: $X_0, X_1, \ldots, X_{r-1}$. Assuming a block size of $B(M_i)$, the display time of a block, termed a time period [GKS], equals $B(M_i) / R_C(M_i)$. Assuming that the system is idle when a PM references object X, the Scheduler performs two tasks. First, it issues a read request for $X_0$ to the DM. It also provides the network address of the PM, requesting the DM to forward $X_0$ directly to the PM. Second, after a pre-specified delay, it sends a control message to the PM to initiate the display of $X_0$. This delay is due to the implementation of both GSS [YCK] (detailed below) and FIXB [GKSZ] (described in Section). Once the PM receives a block, it waits for a control message from the Scheduler before initiating the display. The Scheduler requests the DM to transmit the next block of X (i.e., $X_1$) in the next time period to the PM. This enables the PM to provide for a smooth transition between the two blocks to provide for a hiccup-free display. With the current design, a PM requires enough memory to cache at least two blocks of data.
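The relationship between block size, bandwidth, and time period, together with the resulting retrieval and display timeline, can be sketched as below. The function names are hypothetical, the decimal unit conversion is an assumption, and the one-period lag between retrieval and display is a simplification of the pre-specified delay mentioned above.

```python
def time_period_seconds(block_size_bytes: int, bandwidth_mbps: float) -> float:
    """Display time of one block: B(M_i) / R_C(M_i)."""
    return (block_size_bytes * 8) / (bandwidth_mbps * 1_000_000)


def schedule(r: int, lag_periods: int = 1):
    """(block index, retrieval period, display period) for r consecutive blocks.

    Assumes one block is retrieved per time period and that display lags
    retrieval by a fixed number of periods (an illustrative simplification).
    """
    return [(j, j, j + lag_periods) for j in range(r)]
```

With a lag of one period, the PM displays one block while the next is arriving, which is why it needs room for at least two blocks.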
Given a database that consists of different media types (say MPEG and CD-quality audio), the block size of each media type is determined such that the display time of a block (i.e., the duration of a time period at the Scheduler) is fixed for all media types. This is done as follows. First, one media type $M_i$ (say CD-quality audio) with bandwidth requirement $R_C(M_i)$ Mbps defines the base block size $B(M_i)$ (say KByte). The block size of other media types is a function of their bandwidth, $R_C(M_i)$, and $B(M_i)$. For each media type $M_j$, its block size is:
$$B(M_j) = \frac{R_C(M_j)}{R_C(M_i)} \times B(M_i)$$
In our example, the block size for MPEG Mbps objects would be KByte.
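The block size rule can be checked numerically: scaling the base block size by the ratio of bandwidths keeps the display time of a block identical across media types. The sketch below uses hypothetical bandwidths and a hypothetical base block size, not the values of the report.

```python
import math


def block_size_kb(base_block_kb: float, base_bw_mbps: float, bw_mbps: float) -> float:
    """B(M_j) = R_C(M_j) / R_C(M_i) * B(M_i)."""
    return bw_mbps / base_bw_mbps * base_block_kb


# Hypothetical values: a base type at 1.5 Mbps with 96 KByte blocks,
# and a second type at 4.5 Mbps.
base_bw, base_block, other_bw = 1.5, 96.0, 4.5
other_block = block_size_kb(base_block, base_bw, other_bw)

# Both media types consume one block in the same amount of time.
same_period = math.isclose(base_block / base_bw, other_block / other_bw)
```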
In an implementation of a file system, the physical characteristics of a magnetic disk determine the granularity for the size of a block. With almost all disk manufacturers, the granularity is limited to KByte. Mitra rounds up the block size of each object of a media type to the nearest KByte. Thus, in our example, the block size for an MPEG object would be KByte. However, Mitra does not adjust the duration of a time period to reflect this rounding up. This forces the system to produce more data on behalf of a display as compared to the amount that the display consumes per time period. The amount of accumulated data is dependent on both the number of blocks that constitute a clip and what fraction of each block is not displayed per time period. The amount of accumulated data is expected to be insignificant. For example, with a two-hour MPEG video object, a display would have accumulated KByte of data at the end of the display. Section describes a scheduling paradigm that prevents the Scheduler from producing data should the amount of cached data become significant.
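The effect of rounding up without stretching the time period can be sketched as follows. The granularity and block sizes are hypothetical values chosen for illustration; the function names are also assumptions.

```python
import math


def round_up_kb(block_kb: float, granularity_kb: float) -> float:
    """Round a theoretical block size up to the next multiple of the granularity."""
    return math.ceil(block_kb / granularity_kb) * granularity_kb


def accumulated_kb(theoretical_kb: float, granularity_kb: float, num_blocks: int) -> float:
    """Data produced but not yet displayed after num_blocks time periods."""
    surplus_per_period = round_up_kb(theoretical_kb, granularity_kb) - theoretical_kb
    return surplus_per_period * num_blocks
```

For instance, with an assumed granularity of 0.5 KByte and a theoretical block of 100.25 KByte, each period leaves 0.25 KByte undisplayed, so a clip of 1000 blocks ends with 250 KByte cached at the PM.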
Mitra supports the display of N objects by multiplexing the disk bandwidth among N block retrievals. Its admission control policy ensures that the service time of these N block retrievals does not exceed the duration of a time period. The service time of the disk to retrieve a block of media type i is a function of $B(M_i)$, the disk transfer rate, rotational latency, and seek time. Mitra opens each disk in RAW mode [Hew]. We used the SCSI commands to interrogate the physical characteristics of each disk to determine its track sizes, seek characteristics, number of zones, and transfer rate of each zone. (To gather this information, one requires neither specialized hardware nor the use of the assembly programming language; see [GSZ] for a detailed description of these techniques.) The Scheduler reads this information from a configuration file during its startup.
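A minimal sketch of this admission test is shown below. It assumes a worst-case seek and rotational latency per block and a single transfer rate, which ignores the seek savings of GSS and the zone effects handled by FIXB; the `Disk` fields, function names, and decimal KByte conversion are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Disk:
    transfer_rate_mbps: float
    max_seek_ms: float
    max_rotational_latency_ms: float


def service_time_ms(block_kb: float, disk: Disk) -> float:
    """Worst-case time to retrieve one block: seek + rotation + transfer."""
    transfer_ms = (block_kb * 8_000) / (disk.transfer_rate_mbps * 1_000_000) * 1_000
    return disk.max_seek_ms + disk.max_rotational_latency_ms + transfer_ms


def admit(active_block_kbs: list, new_block_kb: float, disk: Disk,
          time_period_ms: float) -> bool:
    """Admit the new display only if all retrievals still fit in one time period."""
    blocks = active_block_kbs + [new_block_kb]
    return sum(service_time_ms(b, disk) for b in blocks) <= time_period_ms
```

With a hypothetical disk (40 Mbps, 20 ms seek, 10 ms rotational latency) and 500 KByte blocks, each retrieval takes about 130 ms, so a 1000 ms time period accommodates seven displays and rejects the eighth.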
With the buffered interface of the HP-UX file system, one might read and write a single byte. This functionality is supported by a buffer pool manager that translates this byte read/write to a KByte read/write against the physical device.
The Scheduler maintains the duration of a time period using a global variable and supports a linked list of requests that are currently active. In addition to other information, an element of this list records the service time of the disk to retrieve a block of the file referenced by this display. Mitra minimizes the impact of seeks incurred when retrieving blocks of different objects by implementing the GSS algorithm. With GSS, a time period might be partitioned into g groups. In its simplest form, GSS is configured with one group ($g = 1$). With $g = 1$, a PM begins to consume the block that was retrieved on its behalf during a time period at the beginning of the next time period. This enables the disk scheduling algorithm to minimize the impact of seeks by retrieving the blocks referenced during a time period using a scan policy. Mitra implements this by synchronizing the display of the first block of an object ($X_0$) at the PM with the end of the time period that retrieved $X_0$. Once the display of $X_0$ is synchronized, the display of the other blocks is automatically synchronized due to a fixed duration for each time period. The synchronization of $X_0$ is achieved as follows. A PM does not initiate the display of $X_0$ until it receives a control message from the Scheduler. The Scheduler generates this message at the beginning of the time period that retrieves $X_1$.
With $g > 1$, Mitra partitions a time period into g equi-sized intervals. The Scheduler assigns a display to a single group, and the display remains with this group until its display is complete. The retrieval of blocks assigned to a single group employs the elevator scheduling algorithm. This is implemented as follows. Assuming that group $G_i$ retrieves a block of X per time period, the display of $X_0$ is synchronized with the beginning of group $G_{i+1}$.
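The two ingredients of GSS described above, scan ordering within a group and equi-sized group intervals, can be sketched as follows. The request tuples, cylinder numbers, and function names are hypothetical, and the display start rule assumes that a display begins when the interval of its retrieving group ends.

```python
def elevator_order(requests):
    """Scan policy within one group: serve blocks in increasing cylinder order.

    Each request is a (display_id, cylinder) pair.
    """
    return sorted(requests, key=lambda r: r[1])


def group_start_times(time_period_ms: float, g: int):
    """Start offsets of the g equi-sized intervals inside one time period."""
    return [i * time_period_ms / g for i in range(g)]


def display_start_ms(group_index: int, time_period_ms: float, g: int) -> float:
    """Offset at which display may begin for a block retrieved by this group."""
    return (group_index + 1) * time_period_ms / g
```

With g = 1 the display starts a full time period after retrieval begins; larger g shortens that wait at the cost of sorting fewer requests per sweep.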
File System Design
The current implementation of Mitra assumes that a PM only displays a stream, i.e., it does not perform complex operations such as Fast-Forward, Fast-Rewind, or Pause operations. Upon the arrival of a request for object X belonging to media type $M_X$, the admission control policy of the Scheduler is as follows. First, the Scheduler checks to see if another scheduled display is beginning the display of X, i.e., references $X_0$. If so, these two new requests are combined with each other into one. This enables Mitra to multiplex a single stream among multiple PMs. If no other stream is referencing $X_0$, starting with the current active group, the Scheduler locates the group with sufficient idle time to accommodate the retrieval of a block of size $B(M_i)$. The implementation details of this policy are contained in Appendix A. If no group can accommodate the retrieval of this request, the Scheduler queues this request and examines the possibility of admitting it during the next time period. (The performance results of Section are obtained based on a configuration of the Scheduler that does not merge several streams referencing the same object into one.)
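The group search can be sketched as a circular scan that starts at the current active group. This is not the policy of Appendix A, whose details are not reproduced here; the idle-time bookkeeping, function names, and queueing behavior below are illustrative assumptions.

```python
def find_group(idle_ms, current_group, needed_ms):
    """Return the first group, scanning circularly from the current active
    group, whose idle time can absorb the new retrieval; None if no group fits."""
    g = len(idle_ms)
    for step in range(g):
        i = (current_group + step) % g
        if idle_ms[i] >= needed_ms:
            return i
    return None


def admit_request(idle_ms, current_group, needed_ms, wait_queue, request_id):
    """Reserve time in a group, or queue the request for the next time period."""
    i = find_group(idle_ms, current_group, needed_ms)
    if i is None:
        wait_queue.append(request_id)
        return None
    idle_ms[i] -= needed_ms
    return i
```

Starting the scan at the current active group favors the group that will run soonest, which tends to reduce startup latency for the admitted display.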
With media types, Mitra's file system might be forced to manage different block sizes. Moreover, the blocks of different objects might be staged from the tertiary storage device onto magnetic disk storage on demand. A block should be stored contiguously on disk. Otherwise, the disk would incur seeks when reading a block, reducing disk bandwidth. Moreover, it might result in hiccups because the retrieval time of a block might become unpredictable. To ensure a contiguous layout of a block, we considered four alternative approaches: disk partitioning, extent-based [ABCea, CDKK, GRb], multiple block sizes, and an approximate contiguous layout of a file. We chose the final approach, resulting in the design and implementation of the EVEREST file system. Below, we describe each of the other three approaches and our reasons for abandoning them.
With disk partitioning, assuming media types with different block sizes, the available disk space is partitioned into regions, one region per media type. A region i corresponds to media type i. The space of this region is partitioned into fixed-size blocks, corresponding to $B(M_i)$. The objects of media type i compete for the available blocks of this region. The amount of space allocated to a region i might be estimated as a function of both the size and frequency of access of objects of media type i [GI]. However, partitioning of disk space is inappropriate for a dynamic environment where the frequency of access to the different media types might change as a function of time. This is because when a region becomes cold, its space should be made available to a region that has become hot. Otherwise, the hot region might start to exhibit a thrashing [Den] behavior that would increase the number of retrievals from the tertiary storage device. This motivates a re-organization process to rearrange disk space. This process would be time consuming due to the overhead associated with performing I/O operations.
With an extent-based design, a fixed contiguous chunk of disk space, termed an extent, is partitioned into fixed-size blocks. Two or more extents might have different page sizes. Both the size of an extent and the number of extents with a pre-specified block size (i.e., for a media type) are fixed at system configuration time. A single file may span one or more extents. However, an extent may contain no more than a single file. With this design, an object of a media type i is assigned one or more extents with block size $B(M_i)$. In addition to suffering from the limitations associated with disk partitioning, this approach suffers from internal fragmentation with the last extent of an object being only partially occupied. This would waste disk space, increasing the number of references to the tertiary storage device.
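The internal fragmentation of an extent-based layout is the unused tail of the last extent. A minimal sketch of that calculation, with hypothetical sizes and function names, is:

```python
import math


def extents_needed(object_kb: int, extent_kb: int) -> int:
    """Number of whole extents an object occupies (an extent holds one file only)."""
    return math.ceil(object_kb / extent_kb)


def internal_fragmentation_kb(object_kb: int, extent_kb: int) -> int:
    """Space left unused in the last, partially occupied extent."""
    return extents_needed(object_kb, extent_kb) * extent_kb - object_kb
```

For example, an object of 2500 KByte stored in 1000 KByte extents occupies three extents and wastes 500 KByte, space that no other file may use.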
[Figure: Physical division of disk space into pages and the corresponding logical view of the sections with an example base of 2. The disk is drawn as pages 0 to 15; the section view shows heights 0 to 4, with buddies marked.]
With the Multiple Block Size approach (MBS), the system is configured based on the media type with the lowest bandwidth requirement, say $M_1$. MBS requires the block size of each media type j to be a multiple of $B(M_1)$, i.e., $B(M_j) = \lceil B(M_j) / B(M_1) \rceil \times B(M_1)$. This might simplify the management of disk space to avoid its fragmentation and ensure the contiguous layout of each block of an object. However, MBS might waste disk bandwidth by forcing the disk to retrieve more data on behalf of a PM per time period due to rounding up of block size, and remain idle during other time periods to avoid an overflow of memory at the PM. These are best illustrated using an example. Assume two media types, MPEG-1 and MPEG-2 objects, with bandwidth requirements of Mbps and Mbps, respectively. With this approach, the block size of the system is chosen based on MPEG-1 objects. Assume it is chosen to be KByte, $B(\text{MPEG-1}) =$ KByte. This implies that $B(\text{MPEG-2}) =$ KByte. MBS would increase $B(\text{MPEG-2})$ to equal KByte. To avoid an excessive amount of accumulated data at a PM displaying an MPEG-2 clip, the Scheduler might skip the retrieval of data one time period every nine time periods using the PM-driven scheduling paradigm of Section. The Scheduler may not employ this idle slot to service another request because it is required during the next time period to retrieve the next block of the current MPEG-2 display. If all active requests are MPEG-2 video clips and a time period supports nine displays with $B(\text{MPEG-2}) =$ KByte, then with $B(\text{MPEG-2}) =$ KByte the system would support ten simultaneous displays ( improvement in performance). In summary, the block size for a media type should approximate its theoretical value in order to minimize the percentage of wasted disk bandwidth.
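The rounding and the resulting idle slots of MBS can be worked through with exact fractions. The bandwidths (1.5 and 4 Mbps) and the base block size (512 KByte) below are assumptions chosen so that the skip cycle comes out to one period in nine, as in the example above; they are not values stated in the text.

```python
import math
from fractions import Fraction


def mbs_block_kb(theoretical_kb: Fraction, base_kb: int) -> int:
    """Round a block size up to the next multiple of the base block size."""
    return math.ceil(theoretical_kb / base_kb) * base_kb


base_kb = 512                                   # assumed B(M_1)
bw_low, bw_high = Fraction(3, 2), Fraction(4)   # assumed bandwidths in Mbps

theoretical = bw_high / bw_low * base_kb        # block size that matches the time period
rounded = mbs_block_kb(theoretical, base_kb)    # block size MBS actually retrieves
surplus = rounded - theoretical                 # extra data delivered per retrieval

# Retrievals needed to bank one full period of display, plus the skipped period.
cycle_periods = theoretical / surplus + 1
```

Under these assumptions the rounded block is 1536 KByte, and eight retrievals deliver exactly the data consumed in nine time periods, so every ninth period is idle for that display.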
The final approach, and the one used by Mitra, employs the buddy algorithm to approximate a
contiguous layout of a file on the disk without wasting disk space. The number of contiguous chunks that
constitute a file is a fixed function of the file size and the configuration of the buddy algorithm. Based on
this information, Mitra can either prevent a block from overlapping two non-contiguous chunks, or allow
a block to overlap two chunks and require the PM to cache enough data to hide the seeks associated
with the retrieval of these blocks. Currently, Mitra implements the first approach. To illustrate the second
approach, if a file consists of five contiguous chunks then at most four blocks of this file might span two
different chunks. This implies that the retrieval of four blocks will incur seeks, with at most one seek per
block retrieval. To avoid hiccups, the Scheduler should delay the display of the data at the PM until it
has cached enough data to hide the latency associated with four seeks. The amount of cached data is
not significant. For example, assuming a maximum seek time of milliseconds, with MPEG objects
( Mbps) the PM should cache KByte to hide each seek. However, this approach complicates the
admission control policy because the retrieval of a block might incur either one or zero seeks.
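The amount of data needed to hide a seek follows from the display rate: while the disk head is repositioning, the PM keeps consuming data at the bandwidth of the media type. A minimal sketch of this arithmetic is shown below. The function name and the sample values (a 20 ms seek and a 4 Mbps stream) are illustrative assumptions, not the figures used in the text.

```python
def cache_bytes_to_hide_seeks(seek_time_ms, display_rate_mbps, seeks=1):
    """Bytes a PM must buffer so that display continues during `seeks` seeks.

    Hypothetical helper: the display drains display_rate_mbps megabits per
    second while the disk spends seek_time_ms milliseconds on each seek.
    """
    bits_per_second = display_rate_mbps * 1_000_000
    seconds_seeking = seeks * seek_time_ms / 1000.0
    return bits_per_second * seconds_seeking / 8


# Illustrative values only: one 20 ms seek against a 4 Mbps stream.
per_seek = cache_bytes_to_hide_seeks(20, 4)
# Four chunk boundaries in a file of five chunks, as in the example above.
four_seeks = cache_bytes_to_hide_seeks(20, 4, seeks=4)
```

Because the buffer grows linearly with both the seek time and the display rate, it stays small relative to a block for the media types discussed here, which is consistent with the remark that the amount of cached data is not significant.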
EVEREST
With EVEREST, the basic unit of allocation is a page, also termed sections of height 0. EVEREST
organizes these sections as a tree to form larger, contiguous sections. As illustrated in Figure , only
sections of $size(page) \times \omega^i$ (for $i \ge 0$) are valid, where the base $\omega$ is a system
configuration parameter. If a section consists of $\omega^i$ pages then i is said to be the height of the
section. $\omega$ height-i sections that are buddies (physically adjacent) might be combined to construct a
height $i+1$ section.

Footnote: The size of a page has no impact on the granularity at which a process might read a section. This is
detailed below.

To illustrate, the disk in Figure consists of 16 pages. The system is configured with $\omega = 2$. Thus,
the size of a section may vary from 1 up to 16 pages. In essence, a binary tree is imposed upon
the sequence of pages. The maximum height, computed by
$S = \lceil \log_\omega ( \lfloor Capacity / size(page) \rfloor ) \rceil$, is 4. With this
organization imposed upon the device, sections of height $i \ge 0$ cannot start at just any page number, but
only at offsets that are multiples of $\omega^i$. This restriction ensures that any section, with the
exception of the one at height S, has a total of $\omega - 1$ adjacent buddy sections of the same size at all
times. With the base 2 organization of Figure , each section has one buddy.

With EVEREST, a portion of the available disk space is allocated to objects. The remainder, should
any exist, is free. The sections that constitute the available space are handled by a free list. This free list
is actually maintained as a sequence of lists, one for each section height. The information about an unused
section of height i is enqueued in the list that handles sections of that height. In order to simplify object
allocation, the following bounded list length property is always maintained: for each height $i \le S$, at
most $\omega - 1$ free sections of height i are allowed. Informally, this property implies that whenever there
exists sufficient free space at the free list of height i, EVEREST must compact these free sections into sections
of a larger height.
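The section geometry and the bounded list length property described above can be sketched in a few lines. This is an illustration under assumed names (`max_height`, `buddies`, `violates_bounded_length`, and free lists stored as one Python list of page offsets per height), not Mitra's actual data structures.

```python
from typing import List


def max_height(num_pages: int, base: int) -> int:
    """Smallest S such that base**S >= num_pages (integer arithmetic only)."""
    height, size = 0, 1
    while size < num_pages:
        size *= base
        height += 1
    return height


def buddies(offset: int, height: int, base: int) -> List[int]:
    """Page offsets of the base-1 buddies of the section at `offset`."""
    size = base ** height
    if offset % size != 0:
        raise ValueError("a height-i section must start at a multiple of base**i")
    parent = offset - offset % (size * base)  # start of the enclosing section
    return [parent + n * size for n in range(base) if parent + n * size != offset]


def violates_bounded_length(free_lists: List[List[int]], base: int) -> bool:
    """True if some height holds more than base-1 free sections."""
    return any(len(sections) > base - 1 for sections in free_lists)
```

With a base of 2, `buddies` returns exactly one offset, which matches the observation that each section has a single buddy in a binary organization.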
The materialization of an object is as follows. The first step is to check whether the total number of
pages in all the sections on the free list is either greater than or equal to the number of pages (denoted
$no\text{-}of\text{-}pages(o_x)$) that the new object $o_x$ requires. If this is not the case then one or more
victim objects are elected and deleted. (The procedure for selecting a victim is based on heat [GIKZ]. The
deletion of a victim object is described further below.) Assuming enough free space is available at this
point, $o_x$ is divided into its corresponding sections as follows. First, the number
$m = no\text{-}of\text{-}pages(o_x)$ is converted to base $\omega$. For example, if $\omega = 2$ and
$no\text{-}of\text{-}pages(o_x) = 13_{10}$ then its binary representation is $1101_2$. The full
representation of such a converted number is
$m = d_{j-1} \times \omega^{j-1} + \cdots + d_2 \times \omega^2 + d_1 \times \omega^1 + d_0 \times \omega^0$.
In our example, the number $1101_2$ can be written as
$1 \times 2^3 + 1 \times 2^2 + 0 \times 2^1 + 1 \times 2^0$. In general, for every digit $d_i$ that is
non-zero, $d_i$ sections are allocated from height i of the free list on behalf of $o_x$. In our example,
$o_x$ requires 1 section from height 0, no sections from height 1, 1 section from height 2, and 1 section
from height 3. For each object, the number $\nu$ of contiguous pieces is equal to the number of ones in the
binary representation of m, or with a general base $\omega$, $\nu = \sum_{i=0}^{j-1} d_i$ (where j is the
total number of digits). Note that $\nu$ is always bounded by $\omega \lceil \log_\omega m \rceil$. For any
object, $\nu$ defines the maximum number of sections occupied by the object. (The minimum is 1 if all $\nu$
sections are physically adjacent.)

Footnote: To simplify the discussion, assume that the total number of pages is a power of $\omega$. The general
case can be handled similarly and is described below.

Footnote: A lazy variant of this scheme would allow these lists to grow longer and do compaction upon demand,
i.e., when large contiguous pages are required. This would be complicated as a variety of choices might exist
when merging pages. This would require the system to employ heuristic techniques to guide the search space of
this merging process. However, to simplify the description we focus on an implementation that observes the
invariant described above.

A complication arises when no section at the right height exists. For example, suppose that a section of
size $\omega^i$ is required, but the smallest section larger than $\omega^i$ on the free list is of size
$\omega^j$ ($j > i$). In this case, the section of size $\omega^j$ can be split into $\omega$ sections of size
$\omega^{j-1}$. If $j - 1 = i$, then $\omega - 1$ of these are enqueued on the list of height i and the
remainder is allocated. However, if $j - 1 > i$ then $\omega - 1$ of these sections are again enqueued at
level $j - 1$, and the splitting procedure is repeated on the remaining section. It is easy to see
that, whenever the total amount of free space on these lists is sufficient to accommodate the object, then
for each section that the object occupies, there is always a section of the appropriate size, or larger, on
the list. This splitting procedure will guarantee that the appropriate number of sections, each of the right
size, will be allocated, and that the bounded list length property is never violated.
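The two steps above, converting the page count to the configured base and splitting a larger section when the right height is unavailable, can be sketched as follows. The function names and the representation of the free list (one Python list of page offsets per height) are assumptions made for illustration; the sketch does not model victim selection or the disk itself.

```python
from typing import List


def sections_per_height(num_pages: int, base: int) -> List[int]:
    """Digits of num_pages in `base`; entry i is the number of height-i sections."""
    digits = []
    while num_pages > 0:
        num_pages, digit = divmod(num_pages, base)
        digits.append(digit)
    return digits


def allocate_section(free_lists: List[List[int]], height: int, base: int) -> int:
    """Remove and return the offset of one free section of `height`.

    free_lists[i] holds the page offsets of the free height-i sections. If the
    list at `height` is empty, the smallest larger section is split repeatedly.
    """
    j = height
    while j < len(free_lists) and not free_lists[j]:
        j += 1
    if j == len(free_lists):
        raise MemoryError("no free section of the requested height or larger")
    offset = free_lists[j].pop(0)
    while j > height:
        j -= 1
        size = base ** j
        # Keep the first piece for further splitting (or allocation) and
        # enqueue the other base-1 pieces on the list of height j.
        free_lists[j].extend(offset + n * size for n in range(1, base))
    return offset
```

For a 13-page object and a base of 2, `sections_per_height(13, 2)` yields one section each at heights 0, 2, and 3. Because a split only happens when every list between the requested height and the split height is empty, each of those lists receives at most base-1 sections, so the sketch preserves the bounded list length property.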
When the system elects that an object must be materialized and there is insufficient free space, then
one or more victims are removed from the disk. Reclaiming the space of a victim requires two steps for
each of its sections. First, the section must be appended to the free list at the appropriate height. The
second step ensures that the bounded list length property is not violated. Therefore, whenever a section is
enqueued in the free list at height i and the number of sections at that height is equal to or greater than
$\omega$, then $\omega$ sections must be combined into one section at height $i + 1$. If the list at $i + 1$
now violates the bounded list length property then once again space must be compacted and moved to section
$i + 2$. This procedure might be repeated several times. It terminates when the length of the list for a
higher height is less than $\omega$.

Compaction of $\omega$ free sections into a larger section is simple when they are buddies; in this case, the
combined space is already contiguous. Otherwise, the system might be forced to exchange one occupied
section of an object with one on the free list in order to ensure contiguity of an appropriate sequence of
$\omega$ sections at the same height. The following algorithm achieves space contiguity among $\omega$ free
sections at height i:

1. Check if there are at least $\omega$ sections for height i on the free list. If not, stop.
2. Select the first section (denoted $s_j$) and record its page number (i.e., the offset on the disk drive).
   The goal is to free $\omega - 1$ sections that are buddies of $s_j$.
3. Calculate the page numbers of $s_j$'s buddies. EVEREST's division of disk space guarantees the
   existence of $\omega - 1$ buddy sections physically adjacent to $s_j$.
4. For every buddy $s_k$, $0 \le k \le \omega - 1$, $k \ne j$, if it exists on the free list then mark it.
5. Any of the $s_k$ unmarked buddies currently store parts of other objects. The space must be
   rearranged by swapping these $s_k$ sections with those on the free list. Note that for every buddy section
   that should be freed there exists a section on the free list. After swapping space between every
   unmarked buddy section and a free list section, enough contiguous space has been acquired to create
   a section at height $i + 1$ of the free list.
6. Go back to Step 1.

Figure: Deallocation of an object. The example sequence shows the removal of one object from the initial
set of three disk resident objects. Base two. Each panel shows blocks 0 to 15, the depth of the section
tree (0 to 4), and the free list.
(a) Two sections are on the free list already (7 and 14) and the victim object is deallocated.
(b) Sections 7 and 13 should be combined, however they are not contiguous.
(c) The buddy of section 7 is 6. Data must move from 6 to 13.
(d) Sections 6 and 7 are contiguous and can be combined.
(e) The buddy of section 6 is 4. Data must move from 4, 5 to 14, 15.
(f) Sections 4 and 6 are now adjacent and can be combined.
(g) The final view of the disk and the free list after removal of the victim object.

To illustrate, consider the organization of space in Figure (a). The initial set of disk resident objects
consists of three objects and the system is configured with $\omega = 2$. In Figure (a), two sections are on
the free list at heights 0 and 1 (addresses 7 and 14, respectively), and one object is the victim that is
deleted. Once page 13 is placed on the free list in Figure (b), the number of sections at height 0 is
increased to 2 and it must be compacted according to the steps above. As sections 7 and 13 are not
contiguous, section 13 is elected to be swapped with section 7's buddy, i.e., section 6 (Figure (c)). In
Figure (d), the data of section 6 is moved to section 13 and section 6 is now on the free list. The
compaction of sections 6 and 7 results in a new section with address 6 at height 1 of the free list. Once
again, a list of length two at height 1 violates the bounded list length property, and pages 4 and 5 are
identified as the buddy of section 6 in Figure (e). After moving the data in Figure (f) from pages 4 and 5
to pages 14 and 15, another compaction is performed, with the final state of the disk space emerging as in
Figure (g).
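The compaction steps can be traced with a short in-memory sketch. It operates only on the free lists and records which data moves would be needed; the function name, the tuple layout of a move, and the choice of the highest-offset spare section as the swap target are assumptions for illustration. It also assumes, as the simplified algorithm does, that a buddy that is not free is occupied as a whole and can be moved as one unit.

```python
from typing import List, Tuple


def compact(free_lists: List[List[int]], base: int) -> List[Tuple[int, int, int]]:
    """Restore the bounded list length property in place.

    free_lists[i] holds page offsets of free height-i sections. Returns the
    data moves as (source offset, destination offset, height) tuples.
    """
    moves = []
    for i in range(len(free_lists) - 1):
        while len(free_lists[i]) >= base:
            size = base ** i
            first = min(free_lists[i])                 # lowest page number
            parent = first - first % (size * base)
            group = [parent + n * size for n in range(base)]
            spare = sorted(s for s in free_lists[i] if s not in group)
            for buddy in group:
                if buddy not in free_lists[i]:
                    target = spare.pop()               # a free non-buddy section
                    moves.append((buddy, target, i))
                    free_lists[i].remove(target)       # target now holds data
            # Every section of the group is now free: merge into height i+1.
            free_lists[i] = [s for s in free_lists[i] if s not in group]
            free_lists[i + 1].append(parent)
    return moves


# Free lists per height for a 16-page disk with base 2 (illustrative state).
state = [[7, 13], [14], [], [], []]
planned_moves = compact(state, 2)
```

Running the sketch on free sections 7 and 13 at height 0 and 14 at height 1 plans two moves (section 6 to 13, then the height-1 section at 4 to 14) and leaves a single free section of height 2 at offset 4.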
Once all sections of a deallocated object are on the free list, the iterative algorithm above is run on
each list, from the lowest to the highest height. The previous algorithm is somewhat simplified because it
does not support the following scenario: a section at height i is not on the free list; however, it has been
broken down to a lower height (say $i - 1$) and not all subsections have been used. One of them is still on
the free list at height $i - 1$. In these cases, the free list for height $i - 1$ should be updated with care
because those free sections have moved to new locations. In addition, note that the algorithm described above
actually performs more work than is strictly necessary. A single section of a small height, for example,
may end up being read and written several times as its section is combined into larger and larger sections.
This is eliminated in the following manner. The algorithm is first performed virtually, that is, in main
memory, as a compaction algorithm on the free lists. Once completed, the entire sequence of operations that
have been performed determines the ultimate destination of each of the modified sections. The Scheduler
constructs a list of these sections. This list is inserted into a queue of housekeeping I/Os. Associated with
each element of the queue is an estimated amount of time required to perform the task. Whenever the
Scheduler locates one or more idle slots in the time period, it analyzes the queue of work for the element
that can be processed using the available time. (Idle slots might be available with a workload that has
completely utilized the number of idle slots due to the PM-driven scheduling paradigm of Section .)

The value of $\omega$ impacts the frequency of preventive operations. If $\omega$ is set to its minimum value
(i.e., $\omega = 2$), then preventive operations would be invoked frequently because every time a new section
is enqueued there is a chance for a height of the free list to consist of two sections (which violates the
bounded list length property). Increasing the value of $\omega$ will therefore relax the system because it
reduces the probability that an insertion to the free list would violate the bounded list length property.
However, this would increase the expected number of bytes migrated per preventive operation. For example, at
the extreme value of $\omega = n$ (where n is the total number of pages), the organization of blocks will
consist of two levels, and for all practical purposes, EVEREST reduces to a standard file system that manages
fixed-sized pages.

Figure: Zone characteristics of the Seagate STW magnetic disk (x-axis: Disk Capacity [MB]; y-axis: Transfer
Rate [Mbps]).
The design of EVEREST suffers from the following limitation: the overhead of its preventive operations
may become significant if many objects are swapped in and out of the disk drive. This happens when the
working set of an application cannot become resident on the disk drive.

In our implementation of EVEREST, it was not possible to fix the number of disk pages as an exact
power of $\omega$. The most important implication of an arbitrary number of pages is that some sections may
not have the correct number of buddies ($\omega - 1$ of them). However, we can always move those sections to
one end of the disk, for example, to the side with the highest page offsets. Then, instead of choosing
the first section in the selection step of the object deallocation algorithm, Mitra chooses the one with the
lowest page number. This ensures that the sections towards the critical end of the disk (that might not have
the correct number of buddies) are never used in the corresponding steps of the algorithm.
Our implementation enables a process to retrieve a file using block sizes that are at the granularity of
KByte. For example, EVEREST might be configured with a KByte page size. One process might read
a file at the granularity of KByte blocks, while another might read a second file at the granularity
of KByte.

The design of EVEREST is related to the buddy system proposed in [Kno, LD] for an efficient
main memory storage allocator (DRAM). The difference is that EVEREST satisfies a request for b pages
by allocating a number of sections such that their total number of pages equals b. The storage allocator
algorithm, on the other hand, will allocate one section that is rounded up to $2^{\lceil \lg b \rceil}$
pages, resulting in fragmentation and motivating the need for either a re-organization process or a garbage
collector [GRb].
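The contrast with a classic binary buddy allocator can be made concrete with a small calculation. The function names below are hypothetical, and the sketch only counts pages; it does not model either allocator's bookkeeping.

```python
def buddy_system_pages(b: int) -> int:
    """Pages handed out by a classic binary buddy allocator for b >= 1 pages."""
    return 1 << (b - 1).bit_length()      # 2 ** ceil(lg b)


def everest_pages(b: int) -> int:
    """EVEREST allocates sections whose sizes sum to exactly b pages."""
    return b


# Internal fragmentation for an illustrative 13-page request.
wasted = buddy_system_pages(13) - everest_pages(13)
```

For a 13-page request the rounding allocator hands out 16 pages, so 3 pages are lost to internal fragmentation, whereas allocating sections of 8, 4, and 1 pages wastes none.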
The primary advantage of the elaborate object deallocation technique of EVEREST is that it avoids internal
and external fragmentation of space as described for traditional buddy systems (see [GRb]).

Figure: Memory requirement with FIXB (x-axis: time in seconds, divided into the periods $T_{MUX}(Z_i)$ of one
$T_{scan}$; y-axis: memory at the PM, with the maximum required memory marked).

Multi-Zone Disks
A trend in the area of magnetic disk technology is the concept of zoning. It increases the storage capacity
of each disk. However, it results in a disk with variable transfer rates, with different regions of the disk
providing different transfer rates. Figure shows the transfer rate of the different zones that constitute
each of the Seagate disks. (Techniques employed to gather these numbers are reported in [GSZ].) A file
system that does not recognize the different zones might be forced to assume the bandwidth
of the slowest zone as the overall transfer rate of the disk in order to guarantee a continuous display.

In [GKSZ], we described two alternative techniques to support continuous display of audio and video
objects using multi-zone disks, namely, FIXed Block size (FIXB) and VARiable Block size (VARB). These
two techniques harness the average transfer rate of zones. Mitra currently implements FIXB. It organizes
an EVEREST file system on each region of the disk drive. Next, it assigns the blocks of each object to
the zones in a round-robin manner. The blocks of each object that are assigned to a zone are stored as a
single EVEREST file. In the catalog, Mitra maintains the identity of each EVEREST file that constitutes
a clip, its block size, and the zone that contains the first block of this clip.

Footnote: We intend to implement VARB in the near future.

The Scheduler scans the disk in one direction, say starting with the outermost zone moving inward. It
recognizes m different zones; however, only one zone is active per time period (denoted as a global variable
$Z_{Active}$). The bandwidth of each zone is multiplexed among all active displays. Once the disk reads data
from the innermost zone, it is repositioned to the outermost zone to start another sweep. The time to
perform one sweep is denoted as $T_{scan}$. The block size is chosen such that the amount of data produced by
Mitra for a PM during one $T_{scan}$ equals the amount of data consumed at the PM. This requires the faster
zones to compensate for the slower zones. As demonstrated in Figure , data accumulates at the PM when
outermost zones are active at the Scheduler and decreases when reading blocks from the innermost zones.
In this figure, $T_{MUX}(Z_i)$ denotes the duration of a time that a zone is active. It is longer for the
innermost zones due to their low transfer rate. In essence, FIXB employs memory to compensate for the slow
zones using the transfer rate of the fastest zones, harnessing the average disk transfer rate.
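The balance between fast and slow zones can be illustrated with a simplified model. It is not the analytical model of [GKSZ]: it assumes one block per display per zone, a fixed per-block positioning overhead, and that a zone stays active for exactly the time needed to read one block for each display. The function names and the zone rates in the usage lines are assumptions for illustration.

```python
from typing import List


def fixb_block_size(rates: List[float], consume: float,
                    displays: int, overhead: float) -> float:
    """Block size (bytes) for which one sweep produces what one sweep consumes.

    rates: transfer rate of each zone in bytes/s, outermost first.
    consume: display bandwidth in bytes/s. overhead: seconds lost per block.
    Solves m * B = consume * displays * sum(B / R_i + overhead) for B.
    """
    m = len(rates)
    inverse_sum = sum(1.0 / r for r in rates)
    denominator = m - consume * displays * inverse_sum
    if denominator <= 0:
        raise ValueError("too many displays for the average transfer rate")
    return consume * displays * m * overhead / denominator


def buffer_trace(rates: List[float], consume: float, displays: int,
                 overhead: float, block: float) -> List[float]:
    """Net data buffered at one PM after each zone of a sweep (bytes)."""
    level, trace = 0.0, []
    for rate in rates:
        t_mux = displays * (block / rate + overhead)   # time this zone is active
        level += block - consume * t_mux               # one block in, t_mux drained
        trace.append(level)
    return trace


# Illustrative three-zone disk: 4, 3 and 2 MB/s, ten 200 KB/s displays.
zone_rates = [4e6, 3e6, 2e6]
block = fixb_block_size(zone_rates, 2e5, 10, 0.02)
trace = buffer_trace(zone_rates, 2e5, 10, 0.02, block)
```

In this example the buffer level rises while the two faster zones are active and falls back to its starting point after the slowest zone, which is the behaviour the figure describes. Its peak indicates how much memory a PM needs under these assumptions.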
If the first block of an object X is assigned to a zone other than the outermost one (say $Z_X$), then its
display may not start at the end of the time period that retrieves this block (i.e., $T_{MUX}(Z_X)$). This is
because both the retrieval and display of data on behalf of a PM is synchronized relative to the transfer
rate of the outermost zone to ensure that the amount of data produced during one sweep is equivalent to that
consumed. If the display is not delayed, then the PM might run out of data and incur hiccups. By delaying
the display at a PM, the system can avoid hiccups. In [GKSZ], we detail analytical models to compute the
duration of a delay based on the identity of $Z_X$.

A drawback of recognizing a large number of zones is a higher startup latency. Mitra can reduce the
number of zones by logically treating one or more adjacent zones as a single logical zone. This is achieved
by overlaying a single EVEREST file system on these zones. Mitra assumes the transfer rate of the slowest
participating zone as the transfer rate of the logical zone to guarantee hiccup-free displays.
PM-driven Scheduling

The disk service time required to retrieve N blocks of N active displays might be shorter than a time
period. This is because of the concept of a logical zone, where the transfer rate of this zone is assumed
to equal the transfer rate of the slowest participating physical zone, and the N blocks reside in the fastest
physical zone (by luck) of the logical zone. When this happens, the Scheduler may either busy-wait
until the end of the time period, or proceed with the retrieval of blocks that should be retrieved during
the next time period. With the latter, the Scheduler starts to produce data at a faster rate on behalf of a
PM. This motivates an implementation of a PM-driven scheduling paradigm where the Scheduler accepts
skip messages from a PM when the PM starts to run out of memory.

With this paradigm, a PM maintains a data buffer with a low and a high water mark. These two water
marks are a percentage of the total memory available to the PM. Once the high water mark is reached, the
PM generates a skip message to inform the Scheduler that it should not produce data on behalf of this PM
for a fixed number of time periods (say Y time periods). Y must be a multiple of the number of logical
zones recognized on a disk; otherwise, Y is rounded to $\lfloor Y / m \rfloor$. This is due to the round-robin
assignment of blocks of each object to the zones, where a display cannot simply skip one zone when $m > 1$.
The number of time periods is dependent on the amount of data that falls between the low and high water
marks, i.e., the number of blocks cached. It must correspond to at least one sweep of the zones ($T_{scan}$)
to enable the PM to issue a skip message. During the next Y time periods, the Scheduler produces no data on
behalf of the PM while the display consumes data from buffers local to the PM. After Y time periods, the
Scheduler starts to produce data for this PM.

The choice of a value for the low and high water marks at the PM is important. The difference
between the total available memory and the high water mark should be at least one block. This is because
of the race condition that might arise due to networking delays between the PM and the Scheduler. For
example, the Scheduler might produce a block for the PM at the same time that the PM is generating the
skip message. Similarly, the low water mark should not be zero (its minimum value must be one block).
This would eliminate the possibility of the PM running out of data (resulting in hiccups) due to networking
delays.
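The water mark rules can be captured in a toy buffer model. The class and method names are hypothetical, buffer levels are counted in whole blocks, and the rounding of Y down to a whole number of sweeps (a multiple of the number of zones) is one reading of the rule above, stated here as an assumption.

```python
class PMBuffer:
    """Toy model of a PM's buffer, measured in blocks (hypothetical class)."""

    def __init__(self, capacity: int, low: int, high: int, zones: int):
        # Leave at least one block of headroom above the high mark and keep
        # the low mark at one block or more, as the text recommends.
        if not (1 <= low < high <= capacity - 1):
            raise ValueError("need 1 <= low < high <= capacity - 1")
        if high - low < zones:
            raise ValueError("marks must span at least one sweep of the zones")
        self.capacity, self.low, self.high, self.zones = capacity, low, high, zones
        self.level = 0

    def skip_periods(self) -> int:
        """Periods to skip: the blocks between the marks, in whole sweeps."""
        return ((self.high - self.low) // self.zones) * self.zones

    def receive_block(self) -> int:
        """Store one block; return the Y of a skip message, or 0 for none."""
        self.level += 1
        return self.skip_periods() if self.level >= self.high else 0

    def consume_block(self) -> None:
        """Display one block; an empty buffer would be a hiccup."""
        if self.level == 0:
            raise RuntimeError("hiccup: buffer underflow")
        self.level -= 1


pm = PMBuffer(capacity=10, low=1, high=8, zones=3)
signals = [pm.receive_block() for _ in range(8)]
```

In the usage lines the eighth block reaches the high water mark and the PM asks the Scheduler to skip six periods, that is, two full sweeps of three zones. The headroom of one block above the high mark absorbs a block that may already be in flight when the skip message is sent.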
When the Scheduler receives a skip message from a PM, it might employ the idle slot to perform
housekeeping activities, i.e., migrate sections in support of the bounded list length property. If no such
activity is required, the idle slot of a time period might be handled in two possible ways: either busy-wait
for the duration of the skipped time slot, or proceed with the retrieval of blocks that should be retrieved
during the next time period. The second strategy would cause the Scheduler to produce data at a faster
rate for the other active PMs, forcing them to generate skip messages. This minimizes the average startup
latency of the system, as detailed in Section .

Staggered Striping
Staggered striping was originally presented in [BGMJ, GK]. This section describes its implementation
in Mitra. With staggered striping, Mitra does not treat all the available disks (say D disks) as a single
logical disk. Instead, it can construct many clusters of disks, with each treated as a single logical disk.
Assuming that the database consists of a number of media types, Mitra registers for each media type $M_i$ the
number of disks that constitute a cluster with this media type, termed $d(M_i)$, and the block size for
$M_i$, i.e., $B(M_i)$. The tradeoff associated with alternative values for $d(M_i)$ and $B(M_i)$ is reported
in Section .

Mitra constructs logical clusters (instead of physical ones) using a fixed stride value k. This is achieved
as follows. When loading an object (say X) of media type $M_X$, the first block of X recognizes a
cluster as consisting of $d(M_X)$ adjacent disks starting with an arbitrary disk, say $disk_a$. Mitra
declusters this block into $d(M_X)$ fragments and assigns each fragment to a disk starting with $disk_a$:
$disk_a, disk_{(a+1) \bmod D}, \ldots, disk_{(a + d(M_X) - 1) \bmod D}$. For example, in Figure , the first
block of X is declustered into three fragments ($d(M_X) = 3$) and assigned to a logical cluster starting with
disk . It places the remaining blocks of X such that the first disk that contains the first fragment of block
$X_j$ is k disks apart from that of block $X_{j-1}$. Thus, in our example, the placement of the second block
would start with $disk_b$ where $b = (a + k) \bmod D$. The placement of the third block starts with
$disk_{(b + k) \bmod D}$. In Figure , each subsequent block of X is therefore declustered across a group of
disks shifted by k relative to the previous block.

Figure: Staggered striping for two media types.

With m zones per disk, the assignment of blocks to the zones of clusters continues to follow a round-robin
assignment. For example, if a block of X is assigned to zone $Z_i$ of disks a to
$(a + d(M_X) - 1) \bmod D$, the next block is assigned to zone $Z_{(i+1) \bmod m}$ of the cluster of disks
starting with disk b. This process repeats until all blocks of X are assigned to disks and zones. One EVEREST
file contains all fragments of X assigned to zone i of disk j. Thus, a total of $D \times m$ files might
represent object X. Once object X is loaded, Mitra registers with the catalog the following information: the
disk and zone that the assignment of the first block of X started with, X's media type, and the identity of
each file that contains different fragments of X.

While the value of $d(M_i)$ might differ for the alternative media types, k is a constant for all media
types. For example, in Figure , the media type of object X requires the bandwidth of three disks while
that of Y requires four disks. However, the value of k is identical for both objects.

To display an object X, the Scheduler looks up the catalog to determine X's media type (i.e., the
value of $d(M_X)$ for this object) and the disk that contains the first fragment of the first block of X, say
$disk_a$. Once $d(M_X)$ disks starting with $disk_a$ (i.e.,
$disk_a, disk_{(a+1) \bmod D}, \ldots, disk_{(a + d(M_X) - 1) \bmod D}$) have sufficient bandwidth
to retrieve the fragments of the first block, the retrieval of this block is scheduled. During the next time
period, this display shifts k disks to the right to retrieve the next block. This process repeats until all
blocks of X have been retrieved and transmitted to the PM.
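The placement rule for staggered striping reduces to modular arithmetic on disk and zone indices. The sketch below assumes that blocks are numbered from zero and that disks and zones are identified by integers; the function names and the six-disk example are illustrative and not part of Mitra.

```python
from typing import List


def fragment_disks(block_index: int, first_disk: int, degree: int,
                   stride: int, num_disks: int) -> List[int]:
    """Disks holding the fragments of one block under staggered striping.

    first_disk: disk of the first fragment of block 0. degree: d(M_X), the
    number of disks in a cluster. stride: k. num_disks: D.
    """
    start = (first_disk + block_index * stride) % num_disks
    return [(start + fragment) % num_disks for fragment in range(degree)]


def block_zone(block_index: int, first_zone: int, num_zones: int) -> int:
    """Zone of a block when zones are assigned round-robin."""
    return (first_zone + block_index) % num_zones


# Illustrative configuration: D = 6 disks, d(M_X) = 3, k = 1, start at disk 4.
layout = [fragment_disks(j, 4, 3, 1, 6) for j in range(3)]
```

The same two functions serve both loading and display: the Scheduler needs only the starting disk and zone from the catalog to compute where every later block of the object resides.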
Figure: Characteristics of the CD audio clips. (a) Number of votes for each clip (x-axis: Clip Id, 1 to 22;
y-axis: Total # of Votes). (b) Length in seconds of each clip (x-axis: Clip Id, 1 to 22; y-axis: Length [sec]).
Performance Evaluation

This section presents performance numbers that demonstrate the scalability characteristics of Mitra. We
start with an overview of the experimental design employed for this evaluation. Next, we focus on a
single disk configuration of Mitra to demonstrate the tradeoff associated with its alternative optimization
techniques. Finally, we present the performance of Mitra as a function of the number of disks in the system
and their logical organization as clusters. In all experiments, the entire system was dedicated to Mitra
with no other users accessing the workstations.

Experimental Design

A problem when designing this evaluation study was the number of variables that could be manipulated:
block size, number of groups with GSS, mix of media types, mix of requests, the number of participating
disks, the number of disks that constitute a cluster per media type, the bandwidth of each disk as a function
of the number of participating disks, closed versus open evaluation, the role of the tertiary storage device,
the size of database, frequency of access to objects that constitute the database, etc. We spent weeks
analyzing alternative ways of conducting this study. It was obvious that we had to reduce the number
of manipulated parameters to obtain meaningful results. As a starting point, we decided to ignore
the role of tertiary storage device and focus on the performance of Mitra during a steady state where all
referenced objects are disk resident, and to focus on a single media type. Moreover, we partitioned this
study into two parts. While the first focused on the performance of a single disk and the implementation
techniques that enhance its performance, the second focuses on the scalability characteristics of Mitra as
a function of additional disks.
Seagate STW
    Capacity (gigabyte)
    Revolutions per minute
    Maximum seek time (millisecond)
    Maximum rotational latency (millisecond)
    Number of zones (see Figure )
Database Characteristics
    CD Quality Audio
    Sampling rate (per second)
    Resolution (bits)
    Channels (stereo)
    Bandwidth requirement (Mbps)

Table: Fixed Parameters
The target database and its workload were based on a WWW page that ranks the top fifty songs
every week. We chose the top songs of January to construct both the benchmark database and its
workload. We could not use all fifty because the total size of the top audio clips exhausted the storage
capacity of one disk Mitra configuration. Figure (a) and (b) shows the frequency of access to the clips and
the size of each clip in seconds, respectively. The size of the database was fixed for all experiments.

We employed a closed evaluation model with zero think time for our evaluation. With this model, a
workload generator process is aware of the number of simultaneous displays supported by a configuration
of Mitra (say N). It dispatches N requests for object displays to Mitra. (Two or more requests may
reference the same object, see below.) As soon as Mitra is done with the display of a request, the workload
generator issues another request to the Scheduler (zero think time). The distribution of request references
to clips is based on Figure (a). This is as follows. We normalized the number of votes to the clips as a
function of the total number of votes for these objects. The workload generator employs this distribution to
construct a queue of requests that reference the clips. This queue of requests is randomized to result in
a non-deterministic reference pattern. However, it might be the case that two or more requests reference
the same clip (e.g., the popular clip) at the same time. Unless noted otherwise, Mitra was not configured
to multiplex a single stream to service these requests.
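The construction of the request queue can be sketched as follows. The function name, the dictionary of vote counts, and the use of a seeded pseudo-random shuffle are assumptions for illustration; the actual workload generator is not described at this level of detail.

```python
import random
from typing import Dict, List, Optional


def build_request_queue(votes: Dict[str, int], length: int,
                        seed: Optional[int] = None) -> List[str]:
    """Queue of clip references whose frequencies follow the vote shares.

    Each clip appears round(length * votes / total) times, so the queue may
    differ slightly from `length` when the shares do not divide evenly.
    """
    total = sum(votes.values())
    queue: List[str] = []
    for clip, count in votes.items():
        queue.extend([clip] * round(length * count / total))
    random.Random(seed).shuffle(queue)    # randomize the reference pattern
    return queue


# Made-up vote counts for three clips.
queue = build_request_queue({"clip1": 6, "clip2": 3, "clip3": 1}, 20, seed=7)
```

In a closed model with zero think time, the generator would keep N requests outstanding by taking the next entry from this queue each time a display completes.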
This experimental design consists of three states: warm up, steady state, and shutdown. During the
system warm up (shutdown), Mitra starts to become fully utilized (idle). In our experiments, we focused
on the performance of Mitra during a steady state by collecting no statistics during both system warm up
and shutdown.

Footnote: This web site is maintained by Daniel Tobias (http://www.softdisk.com/comp/hits). The ranking of the
clips is determined through voting by the Internet community via E-mail.
One Disk Configuration

We analyzed the performance of Mitra with a single disk to observe the impact of alternative modes
of operation with the PM-driven scheduling paradigm, block size, different number of groups with
GSS, and multiple zones. In the first experiment, we configured the system with a KByte block size,
g , a single zone, and the low and high water marks set to and , respectively. In theory, the number
of guaranteed simultaneous displays supported by our target disk is thirteen. This is computed based on the
transfer rate of the slowest zone (i.e., Mbps) to capture the worst case scenario where all blocks retrieved
during a time period reside in this zone. Mitra realized the theoretical expectations successfully. However,
during a time period, the referenced blocks might be scattered across the disk surface, causing the system
to observe the average disk transfer rate ( Mbps). This results in a number of idle slots per time period.
As discussed in Section , the PM-driven scheduling paradigm may treat these idle slots in two ways:
either busy-wait to exhaust the current time period, or proceed with the retrieval of blocks that
should be retrieved during the next time period. The second paradigm results in a lower average latency as
compared to the first ( seconds as compared with seconds). This paradigm enhances the probability
of a new request locating an idle slot during the current time period. Note that this requires more memory
at the PM as compared with the first. The amount of required memory can be controlled by the choice of
a value for the high water mark.
In the second experiment, we changed the block size from KByte to , , and KByte. The
remaining parameters are unchanged as compared with the first experiment. As the block size increases,
Mitra supports a higher number of simultaneous displays ( , , and displays, respectively). The
maximum number of simultaneous displays supported by the available disk bandwidth is and can be
realized with a block size of KByte. The explanation for this is as follows. With magnetic disks, the
block size impacts the percentage of wasted disk bandwidth attributed to seek and rotational delays. As
the block size increases, the impact of these delays becomes less significant, allowing the disk to support a
higher number of simultaneous displays [GVK].
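The effect of block size on the number of displays can be reproduced with a common simplified model: a time period lasts as long as it takes one display to consume a block, and each block retrieval costs a fixed positioning overhead plus its transfer time. The function name and the numeric values (a 1.4 Mbps stream, a 20 Mbps disk, 30 ms of overhead per block) are illustrative assumptions and are not the measured parameters of the experiment.

```python
import math


def max_displays(block_bytes: int, consume_bps: float,
                 disk_bps: float, overhead_s: float) -> int:
    """Displays a disk can serve per time period under a simple model.

    The time period is the time one display needs to consume a block; each
    retrieval costs overhead_s (seek plus rotation) and the transfer time.
    """
    block_bits = block_bytes * 8
    period = block_bits / consume_bps
    service = overhead_s + block_bits / disk_bps
    return math.floor(period / service)


# Illustrative sweep over block sizes from 32 KByte to 512 KByte.
counts = [max_displays(kb * 1024, 1.4e6, 20e6, 0.03) for kb in (32, 64, 128, 256, 512)]
```

As the block grows, the fixed overhead is amortized over more data and the count approaches the ratio of disk bandwidth to display bandwidth, which is the upper limit in this model.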
The number of groups (g) with GSS impacts the seek times incurred by the disk when retrieving blocks
during a time period. In general, small values of g minimize the seek time. The number of groups has
an impact with small block sizes, where the seek time is significant. This impact becomes negligible with
large block sizes. For example, with a KByte block size, Mitra supports displays with six groups,
displays with three groups, and displays with one group. However, with a KByte block, Mitra
supports displays with eleven groups and displays with one group. With this block size, the impact
of a choice of value for g is only one display because the disk seek time has been rendered insignificant
relative to the retrieval time of a large block.
Thirteen is computed based on the bandwidth of the innermost zone, the consumption rate of CD-quality audio, and the maximum
seek and rotational latency times.
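The effect of the number of GSS groups on seek time can be sketched as follows. The sketch assumes that the displays are divided evenly among the groups and that the n blocks of a group are CYL/n cylinders apart, as the admission-control appendix assumes. The seek model is a hypothetical square-root curve, not the model of [GSZ], and the function names are illustrative only.

```python
import math


def seek_time(cylinders):
    """Hypothetical seek model: fixed settle time plus a square-root term."""
    if cylinders <= 0:
        return 0.0
    return 0.003 + 0.0005 * math.sqrt(cylinders)


def seek_per_time_period(displays, groups, total_cyl):
    """Total seek time per time period with `displays` split evenly over `groups`."""
    if displays % groups != 0:
        raise ValueError("this sketch assumes groups divide displays evenly")
    per_group = displays // groups
    # Each group scans its blocks, assumed total_cyl / per_group cylinders apart.
    return groups * per_group * seek_time(total_cyl / per_group)


for g in (1, 3, 6, 12):
    print(g, round(seek_per_time_period(12, g, 4000), 4))
```

With one group, all blocks are retrieved in a single scan and each seek is short. With as many groups as displays, every retrieval may pay a full-stroke seek, which is why the choice of g matters most when blocks are small.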
In a final experiment, EVEREST was configured to recognize all the zones of the disk. The block size
was KByte to guarantee a continuous display with FIXB. In this case, Mitra can store only twelve clips
instead of on the disk because, once the storage capacity of the smallest zone is exhausted, no additional
clips can be stored due to a round-robin assignment of blocks to zones. With this configuration, Mitra
supports displays with an average startup latency of seconds. The higher number of simultaneous
displays, as compared to in the previous experiments, is due to the design of FIXB, which enables Mitra to
harness the average disk transfer rate. The higher startup latency is because a display must wait until the
zone containing its first block is activated.
The number of logical zones recognized by Mitra is a tradeoff
between the number of displays supported by the system, the average startup latency, and the percentage
of wasted disk space. We now report on several experiments that demonstrate this tradeoff. In the first
experiment, we configured EVEREST to recognize two logical zones: the first logical zone consists of zones
Z to Z, while the second consists of the remaining physical zones. In this case, Mitra can store
clips on the disk. With this configuration, while the number of simultaneous displays is reduced to, the
average startup latency is reduced to seconds. In a second experiment, we configured EVEREST
to recognize one logical zone consisting of only the nine outermost zones, eliminating the remaining
innermost zones. With this configuration, Mitra can store twelve clips on the disk. This increases the
transfer rate of the disk drive from a logical perspective, allowing Mitra to support displays with an
average startup latency of seconds. The higher startup latency is due to a longer duration of a time
period. In [GKSZ], we detail a planner that determines system parameters to satisfy the performance
objectives of an application: its desired throughput and the maximum startup latency tolerated by its clients.
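The storage limit imposed by a round-robin assignment of blocks to zones can be sketched with a small calculation. It assumes that each clip spreads its blocks evenly over all recognized zones, so the smallest zone fills first. The zone capacities and clip size are hypothetical integers in MByte, not the parameters of the disk used with Mitra.

```python
def clips_storable(zone_capacities_mb, clip_size_mb):
    """Clips that fit when every clip places an equal share in each zone."""
    zones = len(zone_capacities_mb)
    # Each clip needs clip_size_mb / zones in every zone; integer arithmetic
    # avoids floating-point rounding in the floor division.
    return min(zone_capacities_mb) * zones // clip_size_mb


def wasted_space_fraction(zone_capacities_mb, clip_size_mb):
    """Fraction of the recognized capacity left unused once the smallest zone is full."""
    used = clips_storable(zone_capacities_mb, clip_size_mb) * clip_size_mb
    return 1 - used / sum(zone_capacities_mb)


all_zones = [300, 250, 200, 100]   # hypothetical capacities, outermost first
print(clips_storable(all_zones, 50), round(wasted_space_fraction(all_zones, 50), 3))
print(clips_storable(all_zones[:3], 50), round(wasted_space_fraction(all_zones[:3], 50), 3))
```

In this made-up example, the smallest zone caps the number of clips even though the larger zones still have free space. How the count changes when zones are merged or eliminated depends on the actual zone capacities.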
Multi-disk Configuration
In these experiments, the following system parameters are fixed: the block size is KByte, GSS is
configured with a single group (g = 1), and a single logical zone spans all physical zones of each disk.
We analyzed the performance of Mitra as a function of additional disks by varying D from to
and. For each configuration, we analyzed the performance of Mitra as a function of the number of disks
that constitute a cluster, i.e., d. In all experiments, the stride (k) equals d. For example, with a
disk configuration D, a cluster may consist of two disks (d = 2). With this configuration, the stride would
also equal two (k = 2). Obviously, the choice of d and k has a significant impact on the obtained results.
We analyze the performance of Mitra for those values of d and k that are reasonable.
(However, the results are presented such that one can estimate the performance of the system with an unreasonable choice of
d and k values.) For example, with
D, it would be unreasonable to configure Mitra with d = k because it would force the bandwidth of
four disks to sit idle, as the database consists of a single media type. With d = k, the performance of
Mitra with D would be reduced to that with D. Figure a presents the number of simultaneous
displays supported by Mitra as a function of D and d. In this figure, the number of disks available to
Mitra is varied on the y-axis, the number of disks that constitute a cluster is varied on the x-axis, and the
throughput of the system is reported on the z-axis.
[Figure: Performance of Mitra as a function of D and d (k = d). (a) Number of simultaneous displays. (b) Percentage difference between theoretical expectations and obtained results from Mitra. Horizontal axes: no. of disks per cluster (d) and no. of disks (D); vertical axis: no. of displays (N) in (a), % difference in (b).]
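The roles of D, d, and k can be made concrete with a minimal sketch of staggered striping placement. It assumes the usual formulation in which subobject i of an object occupies d consecutive disks and starts k disks after subobject i - 1, wrapping around modulo D. The function name and the `first_disk` parameter are hypothetical and are not taken from Mitra's implementation.

```python
def cluster_disks(subobject_index, num_disks, cluster_size, stride, first_disk=0):
    """Disks holding the fragments of one subobject under staggered striping."""
    start = (first_disk + subobject_index * stride) % num_disks
    return [(start + j) % num_disks for j in range(cluster_size)]


# Hypothetical configuration: 12 disks, clusters of 2 disks, stride equal to d.
for i in range(7):
    print(i, cluster_disks(i, num_disks=12, cluster_size=2, stride=2))
```

With k = d the clusters are disjoint, and a display visits them round-robin, returning to its first cluster after D/d subobjects.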
As the number of disks in the system (D) increases from to with d = k, the throughput of
the system increases super-linearly: the throughput of Mitra with D is fourteen times higher than
that with D. This is because the average transfer rate of each disk increases as a function of D. The
explanation for this is as follows. In this experiment, the size of the database is fixed and the EVEREST
file system organizes files on a disk starting with the outermost zone, i.e., the fastest zone. The amount of data
assigned to each disk shrinks as D increases. With D, the innermost zone of the disk contains data,
while with D only the three outermost zones contain data. The average transfer rate of the three
outermost zones is higher than the average transfer rate of all zones of a disk (see Figure).
In Figure b, for a given hardware platform (fixed D), the throughput of Mitra drops as d increases.
For example, with D, Mitra's throughput drops from streams to as d increases from to. This is because
the percentage of wasted disk bandwidth increases as d increases in value [GK]. To
observe this, note that both the maximum seek time and rotational latency are fixed. Moreover, they
waste disk bandwidth. The percentage of wasted disk bandwidth is a function of these two values along
with the amount of data read from each disk drive per time period. As d increases in value, the amount
of data retrieved from each disk decreases because a block is declustered across a larger number of disks.
This wastes a higher percentage of disk bandwidth, resulting in a lower throughput.
[Figure: Startup latency and memory requirement of a PM with Mitra. (a) Average startup latency (no. of time periods). (b) Maximum amount of memory required at a PM (MB). Horizontal axes: no. of disks per cluster (d) and no. of disks (D).]
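The argument about declustering can be checked with a short calculation. The sketch assumes that a block of fixed size is split evenly across the d disks of a cluster and that each disk pays the same fixed seek and rotational overhead per fragment. The function name and parameter values are hypothetical.

```python
def wasted_percentage(block_bytes, cluster_size, disk_rate_bytes_per_s, overhead_s):
    """Percentage of each disk's service time lost to seek and rotational delay."""
    fragment_bytes = block_bytes / cluster_size      # data read from each disk
    transfer_s = fragment_bytes / disk_rate_bytes_per_s
    return 100.0 * overhead_s / (overhead_s + transfer_s)


# Hypothetical values: 512 KByte block, 4 MByte/s disks, 30 ms overhead.
for d in (1, 2, 4, 8, 12):
    print(d, round(wasted_percentage(512 * 1024, d, 4e6, 0.03), 1))
```

Because the overhead is fixed while the fragment shrinks, the wasted percentage grows with d.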
For each choice of D, we located the slowest participating zone of the disks that contains data. This
zone is the same for all disks due to the round-robin assignment of blocks of each object to disks. We
computed the expected performance of Mitra as a function of this zone's transfer rate for each configuration
using the analytical models of [GKS, GK]. Next, we examined how closely Mitra approximates these
theoretical expectations. Figure b presents the percentage difference between the measured results and
theoretical expectations. Each value of this figure is computed based on
$100 \times \left(1 - \frac{\text{Measured}}{\text{Theory}}\right)$.
With D, the
system approximates the theoretical expectation with accuracy. With D, Mitra's performance
is anywhere from to lower than its theoretical expectations. Part of this is due to loss of network
packets using UDP and their retransmission with HP-NOSE. However, there are other factors (e.g., SCSI
bus, software overhead, system bus arbitration, HP-UX scheduling of processes, etc.) that might contribute
to this difference. These delays are expected with a software-based system, based on previous experience
with Gamma [DGS]
and Omega [GCKL], because the system does not have complete control over the
underlying hardware.
A limitation associated with values of d smaller than D is that the placement of data is constrained
with staggered striping. This results in a higher average startup latency (see Figure a). In addition, it
increases the amount of memory required at each PM, even with the PM-driven scheduling paradigm (see
Figure b). Consider each observation in turn. The average startup latency is higher because a display
must wait until the cluster containing its first subobject has sufficient bandwidth to retrieve its referenced
block. Similarly, each PM requires a larger amount of memory because the Scheduler cannot simply skip
one time period on its behalf: its next block resides on the cluster adjacent to the currently active cluster.
Assuming the system consists of C clusters, a PM must cache enough data so that the Scheduler skips
multiples of C time periods on behalf of this PM.
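Why skips must be multiples of C can be sketched with modular arithmetic. The sketch assumes that C = D/d, that a display's time slot advances by one cluster per time period, and that a PM consumes one block per time period. These are simplifying assumptions, and the function names are hypothetical.

```python
def num_clusters(num_disks, cluster_size):
    """Number of disjoint clusters, assuming cluster_size divides num_disks."""
    return num_disks // cluster_size


def aligned_after_skip(skipped_periods, clusters):
    """True if, after the skip, the slot is back on the cluster holding the next block."""
    # The slot moves one cluster per period; the pending block does not move.
    return skipped_periods % clusters == 0


def blocks_to_cache(clusters, multiples=1):
    """Blocks a PM consumes while the Scheduler skips multiples * C periods."""
    return multiples * clusters


c = num_clusters(12, 2)
print(c, [s for s in range(1, 13) if aligned_after_skip(s, c)], blocks_to_cache(c))
```

With d = D there is a single cluster, so skipping one time period keeps the display aligned. With smaller d, the shortest usable skip grows to C periods, and the memory needed at the PM grows with it.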
Conclusion and Future Research Directions
Mitra is a scalable storage manager that supports the display of a mix of continuous media data types. Its
primary contribution is a demonstration of several design concepts and how they are glued together to attain
high performance. Its performance demonstrates that an implementation can approximate its theoretical
expectations.
As part of our future research direction, we are extending Mitra in several novel directions. First, we
have introduced techniques to support on-line reorganization of data when additional disks are introduced
to a system that has been in operation for a while [GK]. These techniques modify the placement of data
to incorporate new disks (both their storage and bandwidth) without interrupting service. Second, we are
investigating several designs based on request migration and object replication to minimize the startup
latency of the system [GK]. Third, we are evaluating techniques that speed up the rate of display to
support VCR functionalities such as fast-forward and fast-rewind. These techniques are tightly tied to
those of the second objective that minimize the startup latency of a display. Finally, we are investigating
distributed buffer pool management techniques to facilitate sharing of a single stream among multiple
PMs that are displaying the same presentation. The buffer pool is distributed across the available DMs.
However, its content is controlled by the Scheduler.
References
[ABCea] M. M. Astrahan, M. W. Blasgen, D. D. Chamberlin, et al. System R: Relational Approach to Database Management. ACM Transactions on Database Systems, June.
[AOG] D. P. Anderson, Y. Osawa, and R. Govindan. Real-Time Disk Storage and Retrieval of Digital Audio/Video Data. IEEE Transactions on Computer Systems.
[BGMJ] S. Berson, S. Ghandeharizadeh, R. Muntz, and X. Ju. Staggered Striping in Multimedia Information Systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data.
[CDKK] H. T. Chou, D. J. DeWitt, R. Katz, and T. Klug. Design and implementation of the Wisconsin Storage System. Software-Practices and Experience.
[CHL] M. Carey, L. Haas, and M. Livny. Tapes Hold Data, Too: Challenges of Tuples on Tertiary Storage. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages.
[Den] P. J. Denning. Working Sets Past and Present. IEEE Transactions on Software Engineering, SE, January.
[DGS] D. DeWitt, S. Ghandeharizadeh, D. Schneider, A. Bricker, H. Hsiao, and R. Rasmussen. The Gamma database machine project. IEEE Transactions on Knowledge and Data Engineering, March.
[GBC] J. Gemmell, H. Beaton, and S. Christodoulakis. Delay sensitive multimedia on disks. IEEE Multimedia.
[GCKL] S. Ghandeharizadeh, V. Choi, C. Ker, and K. Lin. Omega: A Parallel Object-based System. In Proceedings of the International Conference on Parallel and Distributed Information Systems, Dec.
[GI] S. Ghandeharizadeh and D. Ierardi. Management of Disk Space with REBATE. Proceedings of the Third International Conference on Information and Knowledge Management (CIKM), November.
[GIKZ] S. Ghandeharizadeh, D. Ierardi, D. H. Kim, and R. Zimmermann. Placement of Data in Multi-Zone Disk Drives. Technical Report USC-CS-TR, USC.
[GIZ] S. Ghandeharizadeh, D. Ierardi, and R. Zimmermann. An Algorithm for Disk Space Management to Minimize Seeks. To appear in Information Processing Letters.
[GK] S. Ghandeharizadeh and S. H. Kim. Striping in Multi-disk Video Servers. In SPIE Symposium on Photonics Technologies and Systems for Voice, Video, and Data Communications, October.
[GK] S. Ghandeharizadeh and S. H. Kim. An Analysis of Striping in Scalable Multi-Disk Video Servers. Technical Report USC-CS-TR, USC.
[GK] S. Ghandeharizadeh and D. Kim. On-line Reorganization of Data in Continuous Media Servers. Multimedia Tools and Applications, January.
[GKS] S. Ghandeharizadeh, S. H. Kim, and C. Shahabi. On Configuring a Single Disk Continuous Media Server. In Proceedings of the ACM SIGMETRICS.
[GKSZ] S. Ghandeharizadeh, S. H. Kim, C. Shahabi, and R. Zimmermann. Placement of Continuous Media in Multi-Zone Disks. In S. Chung, editor, Multimedia Information Storage and Management. Kluwer.
[GRa] S. Ghandeharizadeh and L. Ramos. Continuous retrieval of multimedia data using parallelism. IEEE Transactions on Knowledge and Data Engineering, August.
[GRb] J. Gray and A. Reuter. Transaction Processing: Concepts and Techniques, pages. Morgan Kaufmann.
[GRAQ] S. Ghandeharizadeh, L. Ramos, Z. Asad, and W. Qureshi. Object Placement in Parallel Hypermedia Systems. In Proceedings of the International Conference on Very Large Databases.
[GSZ] S. Ghandeharizadeh, J. Stone, and R. Zimmermann. Techniques to Quantify SCSI Disk Subsystem Specifications for Multimedia. Technical Report USC-CS-TR, USC.
[GVK] D. J. Gemmell, H. Vin, D. D. Kandlur, P. Rangan, and L. Rowe. Multimedia Storage Servers: A Tutorial. IEEE Computer, May.
[Hew] Hewlett-Packard Co. How HP-UX Works: Concepts for the System Administrator.
[HLL] J. Hsieh, M. Lin, J. Liu, D. Du, and T. Ruwart. Performance of a Mass Storage System for Video-on-Demand. Journal of Parallel and Distributed Computing.
[Kno] K. C. Knowlton. A fast storage allocator. Communications of the ACM, October.
[LD] H. R. Lewis and L. Denenberg. Data Structures & Their Algorithms, chapter, pages. Harper Collins.
[LS] P. Lougher and D. Shepherd. The Design and Implementation of a Continuous Media Storage Server. Network and Operating System Support for Digital Audio and Video. In Proceedings of the 3rd International Workshop, La Jolla, CA, pages. Springer Verlag.
[Nat] K. Natarajan. Video Servers Take Root. In IEEE Spectrum, pages, April.
[ORS] B. Ozden, R. Rastogi, and A. Silberschatz. Fellini: a file system for continuous media. Technical Report, AT&T Bell Laboratories, Murray Hill.
[ORS] B. Ozden, R. Rastogi, and A. Silberschatz. The Storage and Retrieval of Continuous Media Data. In V. S. Subrahmanian and S. Jajodia, editors, Multimedia Database Systems. Springer.
[RC] R. Rooholamini and V. Cherkassky. ATM Based Multimedia Servers. IEEE Multimedia.
[RW] C. Ruemmler and J. Wilkes. An Introduction to Disk Drive Modeling. IEEE Computer, March.
[TPBG] F. A. Tobagi, J. Pang, R. Baird, and M. Gang. Streaming RAID: A Disk Array Management System for Video Files. In First ACM Conference on Multimedia, August.
[YCK] P. S. Yu, M. S. Chen, and D. D. Kandlur. Grouped sweeping scheduling for DASD-based multimedia storage management. Multimedia Systems, January.
A Admission Control with GSS
This appendix details the implementation of the Scheduler's admission policy with GSS. A building
component is a function, termed seek(#cyl), that estimates the disk seek time. Its input is the number of
cylinders traversed by the seek operation. Its output is an estimate of the time required to perform the
seek operation using the models of [GSZ]. Assuming CYL cylinders for the disk and n displays assigned
to a group $G_i$, we assume that the n blocks are $\frac{CYL}{n}$ cylinders apart.
The Scheduler maintains the amount of idle time left for each group $G_i$. With a new request for object
X, the Scheduler retrieves from the catalog the record corresponding to X to determine its media type,
$M_X$. Next, it retrieves from the catalog the record corresponding to media type $M_X$ to determine $B(M_X)$.
Starting with the current group $G_i$, the Scheduler compares the idle time of $G_i$ with the disk service time
to retrieve a block of size $B(M_X)$. The disk service time with $G_i$ is:
$$S_{disk}(G_i) = \frac{B(M_X)}{R_D} + \text{max rotational latency} + seek(CYL)$$
It assumes the maximum seek time, i.e., seek(CYL), because the blocks to be retrieved during $G_i$ have
already been scheduled and the new request cannot benefit from the scan policy. Assuming that $G_i$ is
servicing $n - 1$ requests and its idle time can accommodate $S_{disk}(G_i)$, its idle time is reduced by $S_{disk}(G_i)$.
Prior to initiating the retrieval of blocks that belong to group $G_i$, the Scheduler adjusts the idle time of
group $G_i$ to reflect that the active requests can benefit from the scan policy. Thus, the idle time of
$G_i$ is adjusted as follows:
$$idle(G_i) = idle(G_i) - \left[ seek(CYL) + (n - 1) \times seek\left(\frac{CYL}{n - 1}\right) \right] + n \times seek\left(\frac{CYL}{n}\right)$$
The subtracted portion reflects the maximum seek time of the request that was just scheduled and the
seek time of $n - 1$ other active requests. The added portion reflects the n seeks incurred during the next
time period by this group, with each $\frac{CYL}{n}$ cylinders apart.
If the current group $G_i$ has insufficient idle time, the Scheduler proceeds to check the idle time of other
groups $G_j$, where $j = (i + 1) \bmod g$, $0 \le j < g$, and $j \ne i$. Assuming that $G_j$ is servicing $n - 1$ active
requests, the disk service time with $G_j$ is:
$$S_{disk}(G_j) = \frac{B(M_X)}{R_D} + \text{max rotational latency} + n \times seek\left(\frac{CYL}{n}\right) - (n - 1) \times seek\left(\frac{CYL}{n - 1}\right)$$
If the idle time of $G_j$ is greater than $S_{disk}(G_j)$, then the new request is assigned to $G_j$ and its idle time is
reduced by $S_{disk}(G_j)$.
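The admission test described in this appendix can be sketched as follows. This is an illustrative outline under stated assumptions, not the Scheduler's actual code. The seek model is a hypothetical square-root curve standing in for the models of [GSZ], and the other groups are assumed to be examined in round-robin order after the current one. The later adjustment of the current group's idle time for the scan policy is omitted. All function names and parameter values are made up for the example.

```python
import math


def seek_time(cylinders):
    """Hypothetical seek model (a stand-in for the models of [GSZ])."""
    if cylinders <= 0:
        return 0.0
    return 0.003 + 0.0005 * math.sqrt(cylinders)


def scan_seek_total(n, total_cyl):
    """Seek time for n blocks assumed total_cyl / n cylinders apart."""
    return n * seek_time(total_cyl / n) if n > 0 else 0.0


def service_time_current(block_bytes, rate, max_rot, total_cyl):
    """Service time in the current group: charged the maximum seek, seek(CYL)."""
    return block_bytes / rate + max_rot + seek_time(total_cyl)


def service_time_other(block_bytes, rate, max_rot, total_cyl, active):
    """Service time in another group that already has `active` requests."""
    n = active + 1
    return (block_bytes / rate + max_rot
            + scan_seek_total(n, total_cyl) - scan_seek_total(n - 1, total_cyl))


def admit(idle, active, current, block_bytes, rate, max_rot, total_cyl):
    """Return the group that accepts the request, or None if it is rejected.

    idle[i] is the idle time left in group i and active[i] its request count;
    both lists are updated in place when the request is admitted.
    """
    s = service_time_current(block_bytes, rate, max_rot, total_cyl)
    if idle[current] >= s:
        idle[current] -= s
        active[current] += 1
        return current
    groups = len(idle)
    for step in range(1, groups):
        j = (current + step) % groups   # assumed visiting order
        s = service_time_other(block_bytes, rate, max_rot, total_cyl, active[j])
        if idle[j] > s:
            idle[j] -= s
            active[j] += 1
            return j
    return None


idle, active = [0.0, 0.5, 0.5], [3, 0, 2]
print(admit(idle, active, 0, 512 * 1024, 4e6, 0.0167, 4000), idle, active)
```

In the usage line, `rate` is in bytes per second and all times are in seconds. A request that no group can accommodate is rejected by returning `None`.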
Description
Shahram Ghandeharizadeh, Roger Zimmermann, Weifeng Shi, Reza Rejaie, Doug Ierardi, Ta-Wei Li. "Mitra: A scalable continuous media server." Computer Science Technical Reports (Los Angeles, California, USA: University of Southern California. Department of Computer Science) no. 628 (1996).
Asset Metadata
Creator
Ghandeharizadeh, Shahram
(author),
Ierardi, Doug
(author),
Li, Ta-Wei
(author),
Rejaie, Reza
(author),
Shi, Weifeng
(author),
Zimmermann, Roger
(author)
Core Title
USC Computer Science Technical Reports, no. 628 (1996)
Alternative Title
Mitra: A scalable continuous media server (
title
)
Publisher
Department of Computer Science,USC Viterbi School of Engineering, University of Southern California, 3650 McClintock Avenue, Los Angeles, California, 90089, USA
(publisher)
Tag
OAI-PMH Harvest
Format
30 pages
(extent),
technical reports
(aat)
Language
English
Unique identifier
UC16270894
Identifier
96-628 Mitra A Scalable Continuous Media Server (filename)
Legacy Identifier
usc-cstr-96-628
Rights
Department of Computer Science (University of Southern California) and the author(s).
Internet Media Type
application/pdf
Copyright
In copyright - Non-commercial use permitted (https://rightsstatements.org/vocab/InC-NC/1.0/
Source
20180426-rozan-cstechreports-shoaf
(batch),
Computer Science Technical Report Archive
(collection),
University of Southern California. Department of Computer Science. Technical Reports
(series)
Access Conditions
The author(s) retain rights to their work according to U.S. copyright law. Electronic access is being provided by the USC Libraries, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
USC Viterbi School of Engineering Department of Computer Science
Repository Location
Department of Computer Science. USC Viterbi School of Engineering. Los Angeles\, CA\, 90089
Repository Email
csdept@usc.edu
Inherited Values
Title
Computer Science Technical Report Archive
Description
Archive of computer science technical reports published by the USC Department of Computer Science from 1991 - 2017.
Coverage Temporal
1991/2017