USC Computer Science Technical Reports, no. 826 (2004)
Wavelet Disk Placement for Efficient Querying of Large
Multidimensional Data Sets
Cyrus Shahabi
University of Southern California
Department of Computer Science
Los Angeles, CA, USA
email: shahabi@usc.edu
Rolfe S. Schmidt
Cedars-Sinai Medical Center
Louis Warschaw Prostate Cancer Center
Los Angeles, CA, USA
email: rolfe.schmidt@cshs.org
March
Abstract
New data-intensive applications operate on diverse types of data with new characteristics
in querying the data. In particular, the data set is large and multidimensional (popular
examples are spatial and temporal data, as well as sensor data streams); the queries are
complex, asking for trends or outliers in data, correlation between different dimensions,
or aggregation of one or more measure attributes given a bounded multidimensional space
(termed range-aggregate queries); and the applications are online and interactive, requiring
fast response time, and hence the results can be approximate and/or progressively become
exact. These characteristics lead us to believe that the wavelet transform will become a
likely tool for future database query processing. Although a straightforward adoption of
wavelets is to use them to reduce the data size at different resolutions, unfortunately data
compression methods are only effective on datasets that compress well and on queries that
require the reconstruction of the entire signal.
Therefore, we are taking a fundamentally different approach. Our approach is a
data-independent approximation technique that is based on query approximation rather than
data compression. Many common queries, including relational algebra expressions and a large
class of aggregate queries, can be conceptualized as performing a linear transformation on
the density distribution of an input relation to produce the density of an output relation.
This observation leads us to the following generic progressive query evaluation plan:
decompose the query transformation into a series of small, cheaply computable
subtransformations, and evaluate the most important subtransformations first. In this paper
we discuss the details of this evaluation plan for polynomial range-sum queries and show how
it can be extended to support a batch of several submitted queries. In addition, we show that
typical queries on wavelet data require a distinct access pattern, which we describe and
exploit to design a disk placement strategy for wavelet data that yields best-possible I/O
complexity for point and range query evaluation. We conclude by discussing some open problems
in dealing with wavelets from a database perspective, such as how to perform I/O-efficient
wavelet transformation.
This research has been funded in part by NSF grants EEC (IMSC ERC) and IIS, a NASA/JPL
contract, and unrestricted cash gifts from the Okawa Foundation, Microsoft, and NCR.
Introduction
Since the introduction of the fast wavelet transform, multiresolution analysis has proven
to be a powerful tool for a wide range of applications. Wavelets often bring to mind
applications like signal and image processing, not databases. But in order to scale these
applications to very large datasets, or to provide access to multiple users, it is important
to treat the storage and retrieval of wavelet data as a database problem.
Actually, recent work suggests even stronger reasons for us to be interested in the
management of multiresolution data. By storing a wavelet representation of a relation or data
cube instead of a tabular representation, one can provide fast approximate, exact, and
progressive range aggregate query support. So multiresolution data may play a role in
general-purpose database systems. This should not be surprising: since their beginning,
Online Analytical Processing (OLAP) systems have used dimension hierarchies to store what
amounts to a multiresolution view of a data cube.
Traditionally, data transformation techniques such as wavelets have been used to compress
data. The idea is to transform the raw data set to an alternative form in which many data
points (termed coefficients) become zero or small enough to be negligible, exploiting the
inherent redundancy in the raw data set. Consequently, the negligible coefficients can be
dropped, and the rest are sufficient to reconstruct the data later with minimum error; hence
the compression of data.
However, there is a major difference between the main objective of signal processing and
compression applications using wavelets and that of database applications. With compression
applications, the main objective is to compress data in such a way that one can reconstruct
the data set in its entirety with as little error as possible. Consequently, at data
generation time one can decide which wavelets to keep and which to drop. In contrast, with
database queries, each range-sum query is interested in some bounded area (i.e., a subset) of
the data. The reconstruction of the entire signal is only one of the many possible queries.
Hence, for database applications, one cannot optimally specify at data generation (or
population) time which coefficients to keep or drop. We believe that for queries to observe
lower variance in the accuracy of their results, the decision of which coefficients are
important must be delayed until query time. This, of course, means that all the data
coefficients must be kept, with no compression at data population time. This is justified
since the price of storage is low and decreasing. Even the traditional OLAP compression
techniques were not concerned with saving space; their implicit objective was to improve
query response time by dealing with smaller datasets. Towards this end, we propose a
technique that uses wavelets to approximate incoming queries rather than the underlying data.
Note that the data set is still transformed using wavelets; however, it is not approximated,
since we keep all the coefficients.
Prior Work
For the past two years we have been investigating efficient techniques to support range-sum
queries on large multidimensional data sets. We have introduced a technique that can support
any polynomial range-sum query (up to a degree specified when the database is populated)
using a single set of precomputed aggregates. This extra power comes with little extra cost:
the query, update, and storage costs are comparable to the best known techniques. We achieve
this by observing that polynomial range-sums can be translated and evaluated in the wavelet
domain. When the wavelet filter is chosen to satisfy an appropriate moment condition, most of
the query wavelet coefficients vanish, making the query evaluation faster. We made this
observation practical by introducing the lazy wavelet transform, an algorithm that translates
polynomial range-sums to the wavelet domain in polylogarithmic time.
Note that with our lazy wavelet transform, the cost of range-sum query evaluation becomes
independent of the size of the range. To illustrate the main intuition behind our lazy
wavelet transform, assume Haar wavelets and a one-dimensional count query on a domain of
size $N$. Here, at any recursive step in the Haar transform of a constant count function, the
summary coefficients are constant within an interval, zero outside, and "interesting" on at
most two boundary points. Also, there are at most two nonzero detail coefficients. Each of
the $\log N$ steps can be carried out in constant time, allowing us to perform the entire
transform in time and space $\Theta(\log N)$. By contrast, the standard DWT algorithm would
require time and space $\Theta(N)$. Because there are at most two nonzero detail coefficients
per step, the resulting transform has fewer than $2 \log N$ nonzero terms. In earlier work we
have formalized these arguments and extended them to deal with general $d$-dimensional
polynomial range-sums. The important features of the Haar transformation noted above are
(1) the recursive step of the DWT can be made in constant time, and (2) the number of nonzero
wavelet coefficients of the query function is $O(\log N)$. We have shown that with
appropriate restrictions and choices of wavelet filters, we can obtain both of these features
for general polynomial range-sums.
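The sparsity claim above can be checked numerically. The following is a minimal sketch, not the lazy wavelet transform itself: it runs the standard $\Theta(N)$ averaging/differencing Haar transform on the indicator vector of a count query and counts the nonzero detail coefficients at each level. The function names and the unnormalized (average, half-difference) convention are assumptions made for illustration.

```python
def haar_dwt(signal):
    """Unnormalized Haar DWT of a list whose length is a power of two.

    Returns the overall average and a list of detail-coefficient lists,
    ordered from the finest level to the coarsest.
    """
    summary = list(signal)
    details_per_level = []
    while len(summary) > 1:
        pairs = list(zip(summary[0::2], summary[1::2]))
        details_per_level.append([(a - b) / 2 for a, b in pairs])
        summary = [(a + b) / 2 for a, b in pairs]
    return summary[0], details_per_level


def count_query_vector(n, lo, hi):
    """Indicator vector of the range lo <= x <= hi on the domain 0..n-1."""
    return [1.0 if lo <= x <= hi else 0.0 for x in range(n)]


def nonzero_details(n, lo, hi):
    """Number of nonzero detail coefficients at each level of the transform."""
    _, details = haar_dwt(count_query_vector(n, lo, hi))
    return [sum(1 for d in level if d != 0) for level in details]


if __name__ == "__main__":
    print(nonzero_details(1024, 37, 700))
```

Under these assumptions every level reports at most two nonzero detail coefficients, which is the property the lazy transform exploits to avoid touching the other entries at all.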
By using our exact polynomial range-sum technique, but using the largest query wavelet
coefficients first, we are able to obtain accurate, data-independent query approximations
after a small number of I/Os. This approach naturally leads to a progressive algorithm. We
brought these ideas together by introducing ProPolyne (Progressive Polynomial Range-Sum
Evaluator), a polynomial range-sum evaluation method which:
- Treats all dimensions, including measure dimensions, symmetrically and supports range-sum
  queries where the measure is any polynomial in the data dimensions (not only COUNT, SUM,
  and AVERAGE, but also VARIANCE, COVARIANCE, and more). All computations are performed
  entirely in the wavelet domain.
- Uses the lazy wavelet transform to achieve query and update cost comparable to the best
  known exact techniques.
- By using the most important query wavelet coefficients first, provides excellent
  approximate results and guaranteed error bounds with very little I/O and computational
  overhead, reaching low relative error far more quickly than analogous data compression
  methods.
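The "largest query coefficients first" idea in the list above can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the ProPolyne implementation: it uses an orthonormal one-dimensional Haar transform (so inner products are preserved), a plain SUM query, and an in-memory list standing in for the stored data coefficients. All function names are hypothetical.

```python
import math


def haar_orthonormal(vec):
    """Orthonormal Haar transform of a list whose length is a power of two.

    Output order: scaling coefficient, then details from coarsest to finest.
    """
    details = []
    cur = list(vec)
    root2 = math.sqrt(2.0)
    while len(cur) > 1:
        pairs = list(zip(cur[0::2], cur[1::2]))
        details = [(a - b) / root2 for a, b in pairs] + details
        cur = [(a + b) / root2 for a, b in pairs]
    return cur + details


def progressive_range_sum(data, lo, hi):
    """Yield successive estimates of sum(data[lo..hi]).

    Each step uses one more data coefficient, visiting the query's
    wavelet coefficients in order of decreasing magnitude.
    """
    query = [1.0 if lo <= x <= hi else 0.0 for x in range(len(data))]
    q_hat = haar_orthonormal(query)
    d_hat = haar_orthonormal(data)  # stands in for coefficients stored on disk
    order = sorted(
        (i for i, q in enumerate(q_hat) if abs(q) > 1e-12),
        key=lambda i: -abs(q_hat[i]),
    )
    estimate = 0.0
    for i in order:
        estimate += q_hat[i] * d_hat[i]
        yield estimate


if __name__ == "__main__":
    for step, value in enumerate(progressive_range_sum([3, 1, 4, 1, 5, 9, 2, 6], 2, 6), 1):
        print(step, round(value, 4))
```

Because the ordering depends only on the query coefficients, the approximation schedule is independent of the data, and the final estimate equals the exact range sum.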
In a later publication we focused on applications that require simultaneous evaluation of a
batch of range aggregate queries. It is possible to evaluate a batch of range aggregate
queries by repeatedly applying any exact, approximate, or progressive technique designed to
evaluate individual queries. While flexible, this approach has two major drawbacks:
1. I/O and computational overhead are not shared between queries.
2. Approximate techniques designed to minimize single-query error cannot control structural
   error in the result set. For example, it is impossible to minimize the error of the
   difference between neighboring cell values.
Both of these issues require treating the batch as a single query, not as a set of unrelated
queries.
Consequently, we have extended ProPolyne to exploit I/O sharing across queries in a batch to
provide fast exact results for the entire batch. Another major contribution of that work has
been the introduction of structural error for batch queries and the definition of structural
error penalty functions. This generalizes common error measures such as the sum of squared
errors (SSE) and $L^p$ norms. As a result, we have developed a progressive version of
ProPolyne for batch queries that can accept any penalty function specified at query time and
minimize the worst-case and average penalty at each step of the computation.
ProPolyne has been used on several empirical data sets, such as ground sensor data,
atmospheric data, petroleum sale data, and immersive sensor data. Our experimental results on
these empirical datasets showed that the approximate results produced by ProPolyne are very
accurate long before the exact query evaluation is complete. These experiments also showed
that the performance of wavelet-based data approximation methods varies wildly with the
dataset, while query-approximation-based ProPolyne delivers consistent, and consistently
better, results.
New Contributions
As mentioned before, ProPolyne is for applications that deal with large amounts of wavelet
data that cannot fit in main memory. Hence, the challenge is how to optimally store the
wavelets on secondary storage, i.e., magnetic disk drives. Thanks to the principle of
locality of reference, we often find that when an application needs to access one datum on a
disk block, it is likely to need to access other data on the same block. By designing
applications to take advantage of this, we can amortize the cost of disk access over multiple
reads, significantly reducing the total I/O cost. Note that in order to realize these
savings, we must both design the consumer to be disk-block aware and ensure that we lay the
data out on the disk in a way that makes the principle of locality of reference hold.
Is there a principle of locality of reference for wavelet data? Or, more precisely, is there
a way we can store wavelet data to create such a principle? In this paper we show that we
can, and that for common access patterns we have a much stronger principle. It turns out that
for ProPolyne point and range queries, if a wavelet coefficient is retrieved, we are
guaranteed that all of its dependent coefficients will also be retrieved. The challenge is
that distinct coefficients will have common dependents. In order to make applications that
rely on access to wavelet data scalable, we must take full advantage of this unique access
pattern. This is the goal of our paper.
In this paper we describe the access patterns required for processing point and range
queries on wavelet data. We study the space of all possible non-redundant allocations of
these data to disk blocks. Our major contributions, each stated as a theorem, include:
- An allocation of wavelet coefficients to disk blocks of size $B$ such that if at least one
  item on the block is needed to answer a point query, then at least $\lfloor \lg B \rfloor$
  items on the block will be needed.
- A proof that this allocation is essentially optimal: for any disk block of size $B$, if the
  block must be retrieved to answer a query, the expected number of needed items on the block
  is less than $1 + \lg B$.
- An extension of these results to multidimensional data and range queries.
- The definition of a query-dependent importance function on disk blocks, which allows us to
  perform the most valuable I/Os first and deliver excellent approximate results
  progressively during query evaluation.
We find these results and their analysis theoretically satisfying, but the work also solves a
very practical problem that arose in our development of the ProPolyne system. We present
experimental results that demonstrate quantitatively how important block allocation is for
efficient ProPolyne query answering, and that some apparently natural strategies perform
three to four times worse.
The remainder of this paper is organized as follows. We first discuss the related work and
then briefly provide the required background on ProPolyne multiresolution queries. We then
discuss our techniques assuming one-dimensional data, and subsequently show how the
techniques can be generalized to multiple dimensions. Next, we explain how ProPolyne can be
extended to work with disk blocks rather than individual coefficients, and report on our
experimental results. Finally, we conclude and discuss some open problems.
Related Work
Extensive research has been done to find efficient ways to evaluate range aggregates. The
prefix-sum method publicized the fact that careful pre-aggregation can be used to evaluate
range aggregate queries in time independent of the range size. This led to a number of new
techniques that provide similar benefits with different query/update cost tradeoffs.
Hierarchical Cubes build on these ideas to provide a configurable tradeoff between query and
update cost, as do Iterative Data Cubes (IDC). In fact, IDC generalizes these earlier
techniques and certain forms of the Hierarchical Cubes. IDC is the pre-aggregation technique
most closely related to our work: one of the variations of ProPolyne, with fixed measure,
falls into this framework, but our density-based algorithms do not. Even the fixed-measure
version of ProPolyne is more than an IDC: it also provides data-independent progressive query
evaluation.
Approximate query answering has been proposed as a way to obtain even faster results.
Histograms and sampling for selectivity estimation have a rich literature and provide
adaptive, optimizable data compression techniques for query answering. There has also been
work on modeling for approximate query answering. In one such approach, clustering and mixed
Gaussian estimators are used to approximate the data density function for approximate query
support. The density function has also been exploited elsewhere, and the flexible measures
supported by this approach inspire its use in this study. Recently, researchers have
addressed progressive query answering. These techniques, whether tree based or
multiresolution-analysis based, share a common strategy: answer queries quickly using
low-resolution information, and recursively refine the result with higher-resolution data. At
each resolution level, these techniques must retrieve all summary nodes that overlap the
boundary of the range. This makes the final worst-case complexity proportional to the surface
area of the range, limiting their utility as exact algorithms. ProPolyne is fundamentally
different: it starts by extracting information from the resolution levels that are most
relevant for the submitted query. ProPolyne also has the benefit that at each resolution
level it only needs to retrieve summary statistics with domains overlapping the corners of
the range, giving it excellent performance as an exact algorithm.
Recently, wavelets have emerged as a powerful tool for approximate answering of aggregate
and relational algebra queries. Streaming algorithms for approximate population of a wavelet
database are also available, making wavelet coefficients a powerful approximate data storage
format. Most of the wavelet query evaluation work has focused on using wavelets to compress
the underlying data, reducing the size of the problem. A notable exception is a method that
approximates the function mapping ranges to the corresponding range-sum, simultaneously
approximating all SUM queries for a given measure. This method is the closest in spirit to
the techniques we present. Besides supporting a different class of queries, our technique
differs by approximating individual queries at the time of submission, rather than
approximating all queries at the time of database population.
The wavelet disk placement work arose out of our recent efforts to use wavelet-based
operator approximation for approximate query answering (a.k.a. ProPolyne). While studying
these methods, it became clear to us that efficient disk access would be necessary for any
practical system. All of the data approximation techniques discussed above assume that the
compressed dataset will fit in main memory or be scanned from disk in its entirety. To our
knowledge, efficient disk placement of wavelet data has not been explored before this work.
Another area of seemingly related work involves the use of space-filling curves,
z-ordering, and Gray codes to place a one-dimensional index on multidimensional data that
still clusters data spatially. In fact, this is more closely related to our future work on
storage of sparse wavelet data. In this paper we study storage and retrieval of dense wavelet
data and have no need for index structures. With this said, we do compare the performance of
our techniques with a domain-slicing block allocation scheme that is very similar to, and
sometimes identical to, space-filling-curve-based allocations. In our experiments we find
that our optimal block placement strategy requires three to four times less disk access than
the domain-slicing method.
ProPolyne Background
We begin by making our problem precise. Haar wavelets provide an orthonormal basis for the
vector space of all functions on a data domain. We refer the reader to the literature for a
thorough introduction to wavelets and related algorithms, and for a concise introduction to
compactly supported wavelets such as the Haar wavelets used in this paper. It is worth noting
that our results hold for any wavelet filter that satisfies certain technical conditions.
However, to simplify the presentation and to focus on the main contributions, we use Haar
wavelets throughout the paper.
We denote the data domain by $D$ and assume that it is a $d$-dimensional lattice. When we
store wavelet coefficients, we are really storing a representation of a function. This makes
a particular type of query very natural.
Definition. A point query on a dataset containing a representation of a function
$f : D \to \mathbb{R}$ specifies a point $x \in D$ and receives $f(x)$ as an answer.
A relation can be represented by its data density or measure density function. When this is
done, a large class of range aggregate queries are seen to be inner products of query
functions with data functions in the function space. Specifically, it is possible to support
traditional aggregate functions such as COUNT, SUM, and AVERAGE, as well as less traditional
aggregates including COVARIANCE, REGRESSION-SLOPE, and KURTOSIS, as long as we can support
the following basic range query.
Definition. A polynomial range-sum on a dataset containing a representation of a function
$f : D \to \mathbb{R}$ specifies a range $R \subset D$ and a polynomial $p$ on the
coordinates of points in $D$, and receives
$\langle f, p \chi_R \rangle = \sum_{x \in R} p(x) f(x)$
as an answer.
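To make the definition concrete, the sketch below evaluates polynomial range-sums directly from a one-dimensional density function and combines them into COUNT, SUM, AVERAGE, and VARIANCE. It is a brute-force illustration of the query semantics only (no wavelets are involved); the function names and the example density are assumptions made for illustration.

```python
def polynomial_range_sum(f, lo, hi, p):
    """Return the sum of p(x) * f[x] over the range lo <= x <= hi."""
    return sum(p(x) * f[x] for x in range(lo, hi + 1))


def range_statistics(f, lo, hi):
    """Derive common aggregates from three polynomial range-sums."""
    count = polynomial_range_sum(f, lo, hi, lambda x: 1)       # p(x) = 1
    total = polynomial_range_sum(f, lo, hi, lambda x: x)       # p(x) = x
    squares = polynomial_range_sum(f, lo, hi, lambda x: x * x) # p(x) = x^2
    mean = total / count
    variance = squares / count - mean * mean
    return count, total, mean, variance


# Density of the multiset {2, 2, 3, 5, 5, 5, 7} on the domain 0..7:
# density[x] is the number of tuples whose attribute value is x.
density = [0, 0, 2, 1, 0, 3, 0, 1]

if __name__ == "__main__":
    print(range_statistics(density, 2, 5))
```

Each aggregate is a fixed combination of range-sums with low-degree polynomials, which is why supporting the single query type in the definition is enough.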
Which wavelet coefficients do we need to retrieve from storage in order to answer these
queries? For the point query the answer is simple: the only wavelet coefficients needed to
reconstruct the value of a function at a point $x$ are those corresponding to wavelets that
are not zero at $x$. In other words, we are only interested in wavelets whose support
overlaps $x$. One of the fundamental observations of ProPolyne is that when certain technical
conditions are satisfied, the only wavelets that are relevant for answering a polynomial
range-sum query are those whose support overlaps a corner of the range $R$. Thus, in a
$d$-dimensional domain, a polynomial range-sum query requires the same disk access as $2^d$
point queries. In this paper we are only interested in efficient disk access, not in the
computations that occur afterwards. From this perspective, both point and range queries can
be distilled to a more fundamental selection query.
Definition. The wavelet overlap query on a dataset of wavelet coefficients specifies a point
$x \in D$ and returns the set of all wavelet coefficients for wavelets whose support overlaps
$x$. Denoting this query by $WQ(x)$, we write
$WQ(x) = \{ a_{j,k} \mid \xi_{j,k}(x) \neq 0 \}$
where $a_{j,k} = \langle f, \xi_{j,k} \rangle$ denotes the wavelet coefficient of the stored
function $f$ at resolution level $j$ and offset $k$.
This is the fundamental query of interest when storing data in any wavelet basis, but for the
particular case of one-dimensional Haar wavelets we have an explicit definition of the set:
$WQ(x) = \{ a_{j,k} \mid 2^j k \le x < 2^j (k+1) \}$
Our goal is to evaluate these queries with as little disk access as possible.
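For one-dimensional Haar wavelets, the index set of the wavelet overlap query can be computed with a few bit shifts. The sketch below assumes an indexing convention that the extracted text does not fully preserve: the wavelet at level $j$ and offset $k$ is taken to be supported on the half-open interval $[2^j k, 2^j (k+1))$, with $j = 1, \ldots, \lg N$. The function name is hypothetical.

```python
def wavelet_overlap_query(x, n):
    """Return the (j, k) index pairs of Haar wavelets whose support contains x.

    Assumes (for illustration) that the wavelet at level j and offset k is
    supported on [k * 2**j, (k + 1) * 2**j), for j = 1 .. lg(n), where the
    domain is 0..n-1 and n is a power of two.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("domain size must be a power of two, at least 2")
    if not 0 <= x < n:
        raise ValueError("x must lie in the domain")
    levels = n.bit_length() - 1  # lg(n)
    return [(j, x >> j) for j in range(1, levels + 1)]


if __name__ == "__main__":
    print(wavelet_overlap_query(5, 8))
```

Exactly one wavelet per resolution level overlaps a given point, so a point query touches $\lg N$ wavelet coefficients plus the overall average.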
A Framework for One Dimension
Before we can find the best possible disk allocation, we need a precise notion of what a disk
allocation is and how access patterns for wavelet overlap queries determine which blocks will
be retrieved. The purpose of this section is to make these notions precise.
In particular, we observe that our access patterns can be captured by the wavelet dependency
graph, a directed acyclic graph whose leaf nodes correspond to points in the data domain and
whose internal nodes correspond to wavelets. The key observation is that to answer $WQ(x)$ we
need to retrieve exactly the wavelet coefficients corresponding to nodes on the graph
reachable from the leaf $x$. With this framework, an allocation of wavelet data to disk
blocks corresponds to a tiling of the internal nodes of the dependency graph, and the I/O
cost of a wavelet overlap query is just the number of tiles reachable from a leaf node in the
graph.
Throughout this paper we assume that a disk block holds $B$ wavelet coefficients and has a
unique identifier (e.g., its physical address). With this we define:
Definition. A block allocation for a collection of wavelet coefficients is a $B$-to-one
function from wavelets to disk block identifiers. In other words, it is an assignment of
wavelets to disk blocks.
To help us reason about different block allocations, we try to capture the essence of our
access patterns with the following.
Figure: The dependency graph for one-dimensional Haar wavelets on a domain of size eight.
The internal nodes ξ0 through ξ7 are wavelets; the leaves 0 through 7 are the points of the
domain.
Definition. The dependency graph for Haar wavelets on a domain $D$ is the directed acyclic
graph $G = (V, E)$ defined as follows. Let $\hat{D}$ denote the set of all Haar wavelets on
$D$, and let $V = D \cup \hat{D}$. In other words, the vertices of the graph are either
points in the domain or wavelets on the domain. The edges are defined by the following rules:
- For two wavelets $\xi, \xi' \in \hat{D}$, the pair $(\xi, \xi') \in E$ if and only if
  $\xi(x) \neq 0$ implies that $\xi'(x) \neq 0$ (the interval where $\xi$ lives is contained
  in the interval where $\xi'$ lives).
- For $x \in D$ and $\xi \in \hat{D}$, the pair $(x, \xi) \in E$ if and only if
  $\xi(x) \neq 0$ and there is no wavelet $\xi'$ such that $(\xi', \xi) \in E$.
- For two points $x, y \in D$, $(x, y) \notin E$.
For a one-dimensional domain this graph is a tree.
This definition is cumbersome but explicit. To make this more concrete, an example
dependency graph is depicted in the figure above for Haar wavelets on a domain of size eight.
At each node in the figure we depict the graph of the corresponding wavelet. The topmost node
corresponds to the constant function, and wavelet coefficients for this node are just
averages over the entire domain. At the next level we have a low-frequency wavelet whose
support covers the entire domain. As we move to higher resolution levels, the size of the
support is halved and the frequency is doubled. In one dimension, the dependency graph is
isomorphic to the error tree. There is another characterization of this graph which is very
interesting to us.
Claim. The dependency graph for Haar wavelets is the minimal graph $G$ with vertices
$V = D \cup \hat{D}$ such that if $\xi$ is needed to answer $WQ(x)$ and $\xi'$ is reachable
from $\xi$, then $\xi'$ is also needed to answer the query. In particular,
$WQ(x) = \{ a_\xi \mid \text{there exists a path from } x \text{ to } \xi \}$.
In other words, using this graph turns our wavelet overlap query into a reachability query
on a dag. We can also replace block allocations with vertex tilings.
Definition. A tiling for a graph $G = (V, E)$ with tiles of size $B$ is a partition of the
vertices into disjoint tiles $T_i$ with $|T_i| \le B$.
There is clearly a one-to-one correspondence between tilings of the internal nodes of the
dependency graph and block allocations. Moreover, this gives us a simple definition of the
I/O cost of a wavelet overlap query.
Claim. For a given block allocation, the number of disk blocks that must be retrieved to
answer a wavelet overlap query $WQ(x)$ is equal to the number of tiles reachable from $x$ in
the tiling corresponding to the block allocation.
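The claim can be illustrated with a small sketch. It assumes a heap-style numbering of the one-dimensional dependency tree that the text does not state explicitly: node 0 is the constant (scaling) function, node 1 is the coarsest wavelet, node $v$ has children $2v$ and $2v + 1$, and domain point $x$ hangs below node $(N + x) \operatorname{div} 2$. The function names and the naive allocation are hypothetical.

```python
def reachable(x, n):
    """Wavelet nodes reachable from leaf x in a domain of size n (a power of two).

    Assumes heap numbering: node 1 is the coarsest wavelet, node v has
    children 2v and 2v + 1, and node 0 is the constant function on top.
    """
    node = (n + x) // 2
    path = []
    while node >= 1:
        path.append(node)
        node //= 2
    path.append(0)  # the constant (scaling) function
    return path


def io_cost(x, n, block_of):
    """Number of distinct disk blocks touched by the wavelet overlap query at x."""
    return len({block_of[v] for v in reachable(x, n)})


if __name__ == "__main__":
    # A naive allocation: consecutive node numbers share a block of size 3.
    naive = {v: v // 3 for v in range(8)}
    print([io_cost(x, 8, naive) for x in range(8)])
```

The cost of a query depends only on how many tiles its root-to-leaf path crosses, which is the quantity the rest of the analysis tries to minimize.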
So to reduce the I/O cost of answering wavelet overlap queries, we should find tilings that
are minimally reachable. Intuitively, it seems that we will be doing well if we can simply
ensure that whenever a tile is reachable from a point, it has many vertices that are
reachable from that point. The number of vertices reachable from a point is independent of
the tiling. If each reachable tile covers a large number of the reachable vertices, then all
reachable vertices will be covered with a small number of tiles. So it seems that a
particular tile is good if it has a high usage rate.
Definition. For a dag $G = (V, E)$, a tile $T \subset V$, and a source $s \in V$, the usage
rate of $T$ at $s$ is
$u(T, s) = |\{ v \in T \mid v \text{ is reachable from } s \}|$.
Denote the number of sources that can reach $T$ by $S(T)$. The average usage of $T$ is
$u(T) = \frac{1}{S(T)} \sum_{s \in \mathrm{sources}(G)} u(T, s)$.
So if wavelet overlap queries are generated randomly with the uniform distribution over the
data domain, the average usage of a tile corresponds to the expected number of items to be
used on a disk block, given that the block had to be retrieved. Example tiles with usage
rates can be seen in the accompanying figure. There is a closely related function that
indicates whether a tile is reachable from a particular source.
Definition. For a dag $G = (V, E)$, a tile $T \subset V$, and a source $s \in V$, the
requirement function $r(T, s)$ is 1 if some element of $T$ is reachable from $s$ (that is,
$u(T, s) > 0$) and is zero otherwise.
Notice that
$S(T) = \sum_{s \in \mathrm{sources}(G)} r(T, s)$.
Figure: Examples of straight-line and optimal tiles of size three, along with usage rates, on
a domain of size eight (wavelet nodes ξ0 through ξ7, domain points x = 0, ..., 7).
(a) A straight-line tile T: u(T, x) = 3, 3, 2, 2, 1, 1, 1, 1 for x = 0, ..., 7; u(T) = 1.75.
(b) An optimal tile T: u(T, x) = 2, 2, 2, 2, 0, 0, 0, 0 for x = 0, ..., 7; u(T) = 2.
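The usage rates reported in the figure can be reproduced with a short sketch. It assumes a heap-style numbering of the tree (node 1 is the coarsest wavelet, node $v$ has children $2v$ and $2v + 1$, node 0 is the constant function), and it assumes that the two tiles in the figure are {ξ1, ξ2, ξ4} and {ξ2, ξ4, ξ5}; both are readings of the figure residue rather than facts stated in the text. Function names are hypothetical.

```python
def reachable(x, n):
    """Wavelet nodes reachable from leaf x (heap numbering, n a power of two)."""
    node = (n + x) // 2
    path = []
    while node >= 1:
        path.append(node)
        node //= 2
    path.append(0)  # the constant function
    return path


def usage(tile, x, n):
    """u(T, x): how many vertices of the tile are reachable from leaf x."""
    return len(set(tile) & set(reachable(x, n)))


def average_usage(tile, n):
    """u(T): mean usage over the sources that can reach the tile at all."""
    rates = [usage(tile, x, n) for x in range(n)]
    reaching = [r for r in rates if r > 0]  # sources with r(T, s) = 1
    return sum(reaching) / len(reaching)


straight_line = {1, 2, 4}  # assumed tile of panel (a)
dense_subtree = {2, 4, 5}  # assumed tile of panel (b)

if __name__ == "__main__":
    for tile in (straight_line, dense_subtree):
        print([usage(tile, x, 8) for x in range(8)], average_usage(tile, 8))
```

Under these assumptions the straight-line tile is reached by all eight sources but mostly through a single vertex, while the subtree tile is reached by only four sources, each of which uses two of its three vertices.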
While designing tiles with maximal usage seems like a good intermediate goal, eventually we
want to minimize the average I/O cost for query answering. This corresponds to minimizing the
average number of tiles reachable from a randomly selected source. Thus we define the cost of
a source $s$ for a tiling $\mathcal{T}$ as the number of tiles reachable from $s$:
$c(s) = \sum_{T \in \mathcal{T}} r(T, s)$.
We will actually be able to show that maximizing usage is the same as minimizing cost, a
result that relies on the following equation, which is valid for any source $s$ in the
one-dimensional Haar wavelet dependency graph:
$1 + \lg |D| = \sum_{T \in \mathcal{T}} u(T, s)$.
This holds because both sides represent the total number of non-source vertices reachable
from $s$.
Optimal Placement in One Dimension
With this terminology in hand, we can start solving problems. We begin by computing the
usage rates of two tile types.
Example (Straight-line Tiles). When answering a wavelet overlap query for a point
$x \in D$, it would be ideal if every tile we reached were used entirely. Recalling that for
the one-dimensional case the dependency graph is a binary tree, we see that we can achieve
this by choosing straight-line tiles of the form
$T = \{ v, \mathrm{parent}(v), \mathrm{parent}(\mathrm{parent}(v)), \ldots, \mathrm{parent}^{B-1}(v) \}$
so that if there is a path from $s$ to $v$, then $s$ reaches every element in the tile and
$u(T, s) = B$. An example of a straight-line tile of size three in a domain of size eight can
be seen in the figure above. Note that by including $v$ in this tile we do very well in
answering point queries at the points that reach $v$, but do quite poorly when answering
point queries at points that reach the tile only through its higher vertices. We can
formalize this observation.
Notice that the number of sources that can reach $v$ is exactly one half of the number of
sources that can reach $\mathrm{parent}(v)$, because $\mathrm{parent}(v)$ is a wavelet that
lives on an interval double the size of that of $v$. Thus for every source that gets the
optimal usage out of $T$, there is another one that obtains one less than the optimal usage.
So far this may not seem discouraging, but the same argument says that exactly one half of
all sources that reach $T$ reach it only at the highest node, $\mathrm{parent}^{B-1}(v)$, and
obtain the worst possible usage rate of one. This lets us write a recursion for the usage
rate of straight-line tiles of (now variable) length $B$:
$U_{\mathrm{straightline}}(B+1) = 1 + \frac{1}{2} U_{\mathrm{straightline}}(B)$.
Thus, no matter what the tile size, if $T$ is a straight-line tile then $u(T) < 2$. Unless we
find something better, it will be very difficult to provide scalable access to wavelet data.
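The bound $u(T) < 2$ follows from solving the recursion in closed form. The derivation below is a sketch that assumes the base case $U_{\mathrm{straightline}}(1) = 1$ (a single-vertex tile is always fully used) and reads the recursion as $U(B+1) = 1 + \frac{1}{2} U(B)$; the exact indexing in the original is not recoverable from the extracted text.

```latex
\begin{align*}
U_{\mathrm{straightline}}(1) &= 1, \\
U_{\mathrm{straightline}}(B+1) &= 1 + \tfrac{1}{2}\, U_{\mathrm{straightline}}(B)
  && \text{(half the sources see only the top vertex)} \\
\Longrightarrow\quad
U_{\mathrm{straightline}}(B) &= 2 - 2^{\,1-B} \;<\; 2
  && \text{(by induction on } B\text{)} \\
\text{Check, } B = 3:\quad
U_{\mathrm{straightline}}(3) &= 2 - 2^{-2} = 1.75 .
\end{align*}
```

The value for $B = 3$ agrees with the average usage of 1.75 shown for the straight-line tile of size three, and the limit of 2 shows that enlarging the block does not help this tile shape.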
Example (Dense Subtree Tiles). Because the straight-line tiles worked so poorly, let us go
to the other extreme and look at connected tiles that are as short as possible. For
convenience, assume that $B = 2^k - 1$ for some natural number $k$. Then for any vertex $v$
that is at least $\lg B$ steps away from any source in the dependency graph, we can consider
the tile
$T = \{ v' \in V \mid v = \mathrm{parent}^j(v'),\ j < \lg B \}$
which is a dense binary tree. An example tile of size three can be seen in the figure above.
It is particularly easy to compute the average usage of this tile. In fact, the usage rate of
this tile is the same from any source that reaches it and is equal to
$\lceil \lg B \rceil$. Thus $u(T) = \lceil \lg B \rceil$. If $B$ is not of the special form
assumed above, we may not be able to complete the bottom level of the dense binary tree, but
we can be assured that our usage rate will always be at least $\lfloor \lg B \rfloor$.
It is easy to cover any given dependency graph with these dense connected subtiles (perhaps
running out of room at the bottom and settling for connected-as-possible tiles).
An alternativ e approac h is to map this problem to the problem of nding an optimal B par titioning ofaw eigh ted graphtree where B is the size of a disk blo c k In this
case w eigh t of should b e assigned to eac h no de of the dep endency graph and the w eigh ts
on its edges should capture u T One w ayto ac hiev e this is b y assigning eac h edge the
w eigh t of the n um ber of lea v es reac hable do wn of that edge Regardless of ho ww e generate
the co v ering it leads directly to a pro of of Theorem Theorem A binary tr eewith N le af no des has a tiling such that for every sour c e s V
and every tile T which do es not c ontain a sour ceno de has u T s b lg B c A useful corollary is
Corollary. Using the block allocation of the preceding theorem on a domain of size N, one must retrieve no more than $\lg_B N$ disk blocks to answer a wavelet overlap or point query. Range aggregate queries require no more than twice this amount.
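A dense-subtree tiling and the resulting block count for a point query can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the Haar detail coefficients are stored in heap order (root at index 1, parent of i is i // 2), the scaling coefficient at offset 0 is ignored, tiles are formed by grouping k consecutive levels so that each tile holds up to B = 2^k - 1 coefficients, and the helper names are hypothetical.

```python
def tile_root(node, k):
    """Root of the dense-subtree tile that holds `node` (heap index, root = 1).

    Levels are grouped into bands of k consecutive depths, so each tile is a
    complete binary subtree with up to 2**k - 1 nodes.
    """
    depth = node.bit_length() - 1
    return node >> (depth % k)


def blocks_for_point_query(x, N, k):
    """Tiles touched by a point query at x in a domain of size N = 2**n."""
    node = (N + x) // 2            # finest wavelet whose support contains x
    tiles = set()
    while node >= 1:
        tiles.add(tile_root(node, k))
        node //= 2
    return tiles


N, k = 4096, 3                     # lg N = 12 levels, B = 2**3 - 1 = 7
counts = {len(blocks_for_point_query(x, N, k)) for x in range(N)}
print(counts)                      # {4}
```

Every point query here touches ceil(lg N / k) tiles, which grows logarithmically in N. A range aggregate query needs the paths for its two boundary points, so at most twice as many tiles.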
Recall that, as discussed in an earlier section, the cost of a range query is independent of the size of the range, since we only need to compute wavelets at the boundaries of the range.

While this is clearly a better situation than we saw with our straight-line blocks, we still might hope for more. Logarithmic usage rates seriously restrict our ability to amortize I/O cost by using large granularity. Unfortunately, there is no way to do better.

Theorem. For any tile T with B vertices on the dependency graph for one dimensional Haar wavelets, we must have $u(T) \le 1 + \lg B$.
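The bound can be sanity-checked by exhaustive search on a small tree. The sketch below is not from the paper. It assumes the bound reads u(T) <= 1 + lg B, that u(T) is the average, over the sources that reach T, of the number of tile vertices each source reaches, and that the dependency graph is a heap-ordered complete binary tree whose leaves are the sources. Tiles are allowed to be arbitrary vertex subsets of size B.

```python
from itertools import combinations
from math import log2


def average_usage(tile, levels):
    """Average number of tile nodes reached, over the leaves that reach the tile."""
    first_leaf = 2 ** levels
    hits_per_source = []
    for leaf in range(first_leaf, 2 * first_leaf):
        node, hits = leaf // 2, 0
        while node >= 1:
            hits += node in tile
            node //= 2
        if hits:                       # only sources that reach the tile count
            hits_per_source.append(hits)
    return sum(hits_per_source) / len(hits_per_source)


def max_average_usage(B, levels):
    """Largest average usage over all tiles with exactly B internal nodes."""
    nodes = range(1, 2 ** levels)
    return max(average_usage(set(t), levels) for t in combinations(nodes, B))


levels = 3                             # domain of size 8, seven wavelet nodes
for B in range(1, 2 ** levels):
    worst = max_average_usage(B, levels)
    assert worst <= 1 + log2(B) + 1e-12
    print(B, worst, 1 + log2(B))
```

The search is exponential in the number of nodes, so it is only practical for very small trees; it is meant as a check of the statement, not as a replacement for the proof.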
Proof. We proceed by induction on the tile size B. The base case of B = 1 is trivial: clearly $u(T) = 1 = 1 + \lg B$.

Now we make the inductive step and assume that the result is true for all tiles $T'$ with $|T'| < B$. Consider any tile $T \subset V$ with $|T| = B$. Let $r \in V$ be the root of the subtree generated by T, and denote the parts of T that lie in the left and right subtrees of r by $T_L$ and $T_R$ respectively. Notice that we must have $|T_R|, |T_L| < B$; therefore the inductive hypothesis applies.
First we will assume that both $|T_L|$ and $|T_R|$ are nonzero; the exceptional case will be handled below. Denote $|T_R| = \alpha |T|$ and let $p$ be the probability that a randomly selected source that lies below T is on the right side of r. Then

$$u(T) = p\,u(T_R) + (1-p)\,u(T_L) \le p\,\bigl(1 + \lg \alpha|T|\bigr) + (1-p)\,\bigl(1 + \lg (1-\alpha)|T|\bigr) = 1 + \lg|T| + p\lg\alpha + (1-p)\lg(1-\alpha) \le 1 + \lg|T|.$$
Now we catch the exception. If one of $T_L$ and $T_R$ is empty, then we must have $r \in T$, since r is the root of the smallest subtree containing T. In this situation clearly $p = 1/2$, and the size of the nonempty subtile (say it is $T_R$) is $|T_R| = |T| - 1$. Thus

$$u(T) \le 1 + \tfrac{1}{2}\,u(T_R) \le 1 + \tfrac{1}{2}\bigl(1 + \lg(|T| - 1)\bigr) \le 1 + \lg|T|$$

when $|T| \ge 2$. So the exception has been handled and the theorem proved.

It seems that this limit on the average utility of a tile must also place limits on the cost of a wavelet overlap query. This is the content of the next result.
Corollary. For all tilings with tiles of size B on a domain of size N, sampling sources uniformly, the average cost of a wavelet overlap query is at least $\Omega(\lg_B N)$.
[Figure: (a) Two dependency trees, one per each dimension. In each tree the root (0,0) is at array offset 0, the first detail (1,0) is at offset 1, nodes (2,0) and (2,1) are at offsets 2 and 3, and nodes (3,0) through (3,3) are at offsets 4 through 7. (b) 2-D block: an 8 x 8 array with both axes labeled 0 to 7, in which nine points are marked with an x.]
Figure. An example of nine points (marked with x) stored on a 2-D disk block as the product of two 1-D virtual blocks (each marked with a triangle).
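Proof.

$$E_s[c(s)] = \frac{1}{N}\sum_{s \in D} c(s) = \frac{1}{N}\sum_{T \in \mathcal{T}} S(T) \ge \frac{1}{N}\sum_{T \in \mathcal{T}} \sum_{s \in D} \frac{u(T, s)}{1 + \lg B} = \sum_{s \in D} \frac{\lg N}{N\,(1 + \lg B)} = \Omega(\lg_B N).$$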
Multiple Dimensions
The optimal disk block allocation for one dimensional wavelets can be used to construct optimal allocations for (tensor product) multivariate wavelets. We simply decompose each dimension into optimal virtual blocks, and take the Cartesian products of these virtual blocks to be our actual blocks. To illustrate, consider the following example.

Example. Consider a two dimensional domain (d = 2) where the size of each dimension is N = 8. The Haar dependency tree for each dimension is depicted in panel (a) of the figure. For each node we also illustrate its offset in a one-dimensional array. Suppose the size of 2-D disk blocks is 9. Thus we need to choose two 1-D virtual blocks of size 3 and generate the product. For example, as shown in panel (a), the wavelets at one set of three offsets make one optimal virtual block for one dimension, and the wavelets at a second set of three offsets make another optimal virtual block for the other dimension. Panel (b) illustrates what the product of these two blocks looks like in a 2-D array, marking with an x the points that belong to the product block.
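The product construction in this example can be sketched in a few lines. The virtual blocks below are hypothetical: we assume each one is the dense subtree made of array offsets 1, 2, and 3 (the first detail node and its two children), which need not be the offsets used in the figure.

```python
from itertools import product


def product_block(*virtual_blocks):
    """All d-dimensional coefficient indices in the Cartesian product block."""
    return set(product(*virtual_blocks))


# Hypothetical 1-D virtual blocks of size 3 over array offsets 0..7.
rows = [1, 2, 3]
cols = [1, 2, 3]
block = product_block(rows, cols)

# Draw the 8 x 8 coefficient array, marking members of the product block.
grid = [["x" if (r, c) in block else "." for c in range(8)] for r in range(8)]
print("\n".join(" ".join(line) for line in grid))
```

The same function accepts one virtual block per dimension, so a d-dimensional block built from virtual blocks of sizes B_1, ..., B_d holds B_1 * ... * B_d coefficients.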
This multidimensional block arrangement allows us to immediately obtain the following result.

Theorem. In a d-dimensional domain of size $N^d$, when wavelet coefficients are stored using the Cartesian product optimal block allocation, the I/O cost of the multidimensional wavelet overlap query is $O(\lg_B^d N)$. Furthermore, if the dimensions all have the same size, then for any Cartesian product block strategy the average I/O cost will be $\Omega(\lg_B^d N)$.

We conjecture that this lower bound holds for all block allocations, not just Cartesian product allocations. In any case, this allows us to answer polynomial range-sum queries quickly.
Corollary. When wavelet coefficients are stored using the Cartesian product optimal block allocation, the I/O cost of polynomial range-sum query evaluation is $O(2^d \lg_B^d N)$. If the dimensions all have size N, then the I/O cost will also have the lower bound $\Omega(2^d \lg_B^d N)$.
Extending ProPolyne

The ProPolyne progressive query answering strategies presented in our prior work are based on the observation that some wavelet coefficients are much more important for a query answer than others. This allows us to compute the importance of each record to be retrieved and fetch the most important items first. The approximate answers returned progressively minimize both average and worst case error.

The results above show that we can achieve considerable savings by placing wavelet data on disk blocks in an appropriate way and retrieving coefficients at block-level granularity. In order to obtain progressive query results in this setting, we need to compute the importance of disk blocks rather than individual wavelets.
When evaluating a single range-sum query, the importance of a data wavelet coefficient can be defined as the square of the magnitude of the corresponding query wavelet coefficient. In other words, denoting the importance of a wavelet $\xi$ by $\iota(\xi)$ and denoting the wavelet coefficients of a query vector q by $\hat{q}$, we can take $\iota(\xi) = |\hat{q}(\xi)|^2$.

An inspection of the proofs of the corresponding theorems in that work shows that there are two natural ways to extend this definition to disk blocks: we can either choose the importance function to minimize worst-case error, or we can choose it to minimize average error. In particular, we define two block-importance functions

$$\iota_{\Sigma}(B) = \sum_{\xi \in B} |\hat{q}(\xi)|^2, \qquad \iota_{\max}(B) = \max_{\xi \in B} |\hat{q}(\xi)|^2,$$

and adapt the arguments in that work to prove the following result.

Theorem (Informal). By fetching disk blocks in decreasing order of $\iota_{\max}$ importance, we progressively minimize the worst case error of our query approximation. By fetching disk blocks in decreasing order of $\iota_{\Sigma}$ importance, we progressively minimize the average square error of our query approximation.
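The two block orderings can be illustrated with a small sketch. It assumes a sum-based and a max-based block importance built from squared query coefficient magnitudes, as we read the definitions above; the dictionary layout and function names are hypothetical.

```python
def block_importances(q_hat, blocks):
    """Return (sum-based, max-based) importance for every block.

    q_hat  maps a coefficient id to its query wavelet coefficient.
    blocks maps a block id to the list of coefficient ids stored in it.
    """
    sum_importance = {}
    max_importance = {}
    for block_id, coeffs in blocks.items():
        energies = [abs(q_hat.get(c, 0.0)) ** 2 for c in coeffs]
        sum_importance[block_id] = sum(energies)
        max_importance[block_id] = max(energies)
    return sum_importance, max_importance


def fetch_order(importance):
    """Block ids sorted by decreasing importance."""
    return sorted(importance, key=importance.get, reverse=True)


q_hat = {0: 3.0, 1: 0.0, 2: 2.5, 3: 2.5}
blocks = {"A": [0, 1], "B": [2, 3]}
by_sum, by_max = block_importances(q_hat, blocks)
print(fetch_order(by_sum))   # ['B', 'A']
print(fetch_order(by_max))   # ['A', 'B']
```

The toy query is chosen so that the two orderings disagree: block A holds the single largest coefficient, while block B holds more total query energy.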
[Figure: (a) Query screen with four dimensional grid cells. (b) Result screen as a pivot table.]
Figure. An implementation of ProPolyne over temperature data stored as wavelet disk blocks.
These results extend in a natural way to provide control of structural error in batch query answers, and we have implemented a batch query answering system that retrieves blocks in order to minimize the mean square error. We present experimental results using this system in the experimental section, but we defer the detailed theorems to the full paper.
Experimental Results

We have implemented ProPolyne based on the ideas presented in this paper, and are using it to develop web service access to climate grid datasets provided to us by our colleagues at NASA's Jet Propulsion Laboratory. The system is implemented in C# using an ODBC connection to a SQL server to store and access wavelet data (see the implementation figure). This section provides an overview of how block allocation choices affect the performance of this system.

All experiments were performed on a sample dataset with at least a million records and four dimensions: altitude, latitude, longitude, and time. The dataset records temperature over a grid of altitudes, latitudes, and longitudes at regular time points during March and April. Temperatures are measured in degrees Kelvin. Our system is designed to efficiently answer batches of aggregate queries over ranges that partition the data domain, providing users with a summary view of the entire data set.

The results reported here are based on a randomly generated workload of batch queries, where each batch query partitions the latitude dimension into a fixed number of ranges and partitions the altitude, longitude, and time dimensions into a (possibly different) fixed number of ranges. The range-sum queries request the average temperature per each range. Dimensions were partitioned into ranges uniformly, so that all possible partitions are equally likely. We have experimented with different batch sizes and found similar results, with the following exceptions. For smaller batches, all methods produced very accurate progressive results quickly, even though the I/O cost varied dramatically. This happens because the ranges in a small batch are large, and queries are well approximated by a small number of low resolution wavelets. For large batches, the optimal placement technique still provides much better progressive estimates, but there is very little difference in I/O performance: for a large batch you are essentially scanning the entire dataset, so disk placement cannot be of great benefit.

Note that these experiments are very different from those reported previously, because the underlying storage system maintains wavelets on disk blocks rather than individual coefficients.
Comparisons

The problem addressed in this paper is novel, and thus we lack appropriate direct comparisons. However, there are two common sense block allocations which provide enlightening benchmarks. One of these allocations is obtained by essentially storing the data in row-major order on disk. The other captures the locality effects of a space-filling curve.

The first technique is the most naive: the multidimensional wavelet coefficients can be laid out naturally in a one dimensional array (in fact, this is how we compute them in the first place). We simply cut this array into contiguous subarrays of equal size to obtain our disk blocks. Specifically, if we denote the value of the ith item in the bth block by v(b, i), and denote the array of wavelet coefficients by w, then we have

$$v(b, i) = w[B\,b + i]$$

for blocks of size B. This is approximately the allocation we would obtain by default when we store the wavelet data as a one dimensional array on disk and let the file system determine the block allocation. We call this the naive allocation.

Our next technique has more respect for the multidimensional structure. Now we think of the wavelet coefficients as lying in a multidimensional array, and slice the data domain into cubes by slicing each dimension into virtual blocks. Each cell in this slicing will correspond to a disk block, and every wavelet coefficient in the cell will be stored in the block. This gives us a natural multi-index for our disk blocks, and a natural multi-index for wavelet coefficients in each block. Again denoting the value at location $i = (i_1, i_2, \ldots, i_d)$ in block $b = (b_1, b_2, \ldots, b_d)$ by v(b, i), we have

$$v(b, i) = w[B_1 b_1 + i_1,\ B_2 b_2 + i_2,\ \ldots,\ B_d b_d + i_d]$$

where the slices are of size $B_i$ in dimension i. We call this the slice-and-dice allocation.

Another obvious tool to use for block allocations is a space-filling curve. However, for our application we have no need to place a one dimensional index structure on our coefficients, and the slice-and-dice allocation captures spatial locality at least as well as any space-filling curve. We use the slice-and-dice strategy as a proxy for space-filling curves.
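The two benchmark allocations amount to simple index arithmetic. The following sketch shows one way to map a coefficient's multi-index to a (block, position) pair under each scheme; it assumes row-major flattening for the naive allocation, and the function names are hypothetical.

```python
def row_major_offset(index, shape):
    """Offset of a multi-index in a row-major one dimensional layout."""
    offset = 0
    for x, n in zip(index, shape):
        offset = offset * n + x
    return offset


def naive_location(index, shape, B):
    """(block b, position i) such that the coefficient is w[B * b + i]."""
    return divmod(row_major_offset(index, shape), B)


def slice_and_dice_location(index, slice_sizes):
    """(block multi-index b, position multi-index i), x_j = B_j * b_j + i_j."""
    b = tuple(x // s for x, s in zip(index, slice_sizes))
    i = tuple(x % s for x, s in zip(index, slice_sizes))
    return b, i


# A coefficient at (5, 2) in an 8 x 8 array, with 16 coefficients per block.
print(naive_location((5, 2), (8, 8), 16))          # (2, 10)
print(slice_and_dice_location((5, 2), (4, 4)))     # ((1, 0), (1, 2))
```

In the naive scheme a block is a run of consecutive rows, while in the slice-and-dice scheme it is a 4 x 4 cell of the array; neither takes the wavelet dependency structure into account.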
I/O Reduction

When answering these batch queries, the number of disk blocks fetched to answer a query using the wavelet-optimal allocation is substantially less than the number retrieved using alternate methods. In the figure below we show the average number of blocks needed to evaluate the batch queries in our generated workload. Results are shown for the optimal, slice, and naive methods for a variety of block sizes. The block sizes were chosen so that we could produce optimal blocks consisting of complete subtree tiles.

[Figure: bar chart of Mean Number of Blocks (axis from 0 to 3000) for block sizes 189, 441, 1029, 2025, and 4725, with series Optimal, Slice, and Naive.]
Figure. Mean number of blocks needed for exact query answer for a variety of block sizes and allocation techniques.
The most important observation is that our optimal strategy outperforms the other two strategies, often by a factor of three or four. The theory suggests that the optimal strategy should always give the best results, but this evidence demonstrates that the difference is substantial. For all strategies, the number of blocks retrieved is only a small fraction of the total number of blocks in the database. This is because wavelets provide very efficient exact answers to ad hoc range-sum queries.

Another interesting and unexpected observation is that the naive block allocation strategy consistently outperforms the slicing strategy. This could be due to the fact that the naive blocks typically contain both high and low resolution data, while many of the domain slicing blocks contain only spatially localized high resolution data. When these high resolution blocks are retrieved, there are typically only a small number of useful items. Low resolution data are more likely to be useful.
Progressive Query Approximation

Our earlier work on ProPolyne has shown that wavelets provide excellent approximations of the matrices that arise in evaluation of batch range-sum queries. We noted above that, for any allocation to disk blocks, we can compute the importance of a block so that by performing the most important I/O first we minimize either worst case or average error.
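A progressive evaluation loop of this kind can be sketched as follows. The sketch assumes that a range-sum answer is the inner product of query and data wavelet coefficients, that blocks are fetched in decreasing order of a sum-of-squares importance, and that every nonzero query coefficient lives in exactly one block. All names are hypothetical, and the data are toy values rather than results from the system described here.

```python
def progressive_answers(q_hat, d_hat, blocks):
    """Yield (block id, fraction of query energy seen, running estimate)."""
    energy = sum(v * v for v in q_hat.values())
    importance = {
        b: sum(q_hat.get(c, 0.0) ** 2 for c in coeffs)
        for b, coeffs in blocks.items()
    }
    seen, estimate = 0.0, 0.0
    for b in sorted(blocks, key=importance.get, reverse=True):
        for c in blocks[b]:                    # one simulated block fetch
            estimate += q_hat.get(c, 0.0) * d_hat.get(c, 0.0)
        seen += importance[b]
        yield b, seen / energy, estimate


q_hat = {0: 2.0, 1: 1.0, 2: 0.0, 3: 1.0}       # query wavelet coefficients
d_hat = {0: 5.0, 1: -1.0, 2: 4.0, 3: 2.0}      # data wavelet coefficients
blocks = {"a": [0, 2], "b": [1, 3]}

steps = list(progressive_answers(q_hat, d_hat, blocks))
for block_id, fraction, estimate in steps:
    print(block_id, round(fraction, 3), estimate)
```

The estimate becomes exact once every block that holds a nonzero query coefficient has been read; the fraction of query energy seen is the data independent quality measure.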
[Figure: (a) Cumulative Importance: Fraction of Query Energy (0 to 1) versus Number of Blocks Retrieved (1 to 27) for Optimal Blocks, Slice Blocks, and Naïve Blocks. (b) Mean relative error for average temperature queries: Mean Relative Error (log scale, 1E-10 to 100) versus Number of Blocks Retrieved (1 to 901) for Naïve Blocks, Slice Blocks, and Optimal Blocks.]
Figure. Effectiveness of different allocation strategies for progressive and approximate query answering.
Although we have proven that our dependency graph based allocation is the best possible for exact query answering, other strategies may provide better approximate answers early in the computation. The results in this section suggest that this is not the case, and that the optimal placement gives us superior approximate results from the first block onward.

We measure the approximation quality in two ways. Our first approach is data independent. We use the importance function that minimizes mean square error, and normalize so that the total importance of all columns in each query is one. We measure the cumulative importance of all retrieved disk blocks as the query progresses, aggregate the results over our randomly generated query workload, and report the results in panel (a) of the figure.

The optimal strategy dominates from the beginning, but it is interesting to note how poor a job the naive approximation does compared to the others, even though it is substantially better than the slicing method for exact query answering. As with the exact query analysis, we performed these tests for a variety of block sizes but saw no qualitative differences.

Our second measure of the approximation quality measures the accuracy of query answers on a real data set. We progressively evaluated the queries on the temperature dataset described above, and report the mean relative error as a function of the number of disk blocks retrieved in panel (b). The results are not surprising in light of panel (a). The optimal placement performs extremely well, providing a low mean relative error after retrieving only a small number of disk blocks. The slice blocks also perform quite well, with low mean relative error after a limited number of disk block retrievals. The naive approach performs terribly: its mean relative error does not fall to a comparable level until the query evaluation is almost complete. Still, we see that after a certain number of disk blocks have been retrieved, the error for the naive approach falls off rapidly. This is because the naive approach requires only a limited number of blocks on average to answer the query exactly. The reduction comes almost entirely from the fact that as query results become exact, the relative error becomes zero.
Conclusion

In this paper we first briefly reviewed our prior work on ProPolyne, which is an efficient wavelet-based technique for exact, approximate, and progressive evaluation of range aggregate queries. Next, we identified a unique access pattern that arises when evaluating point and range queries on wavelet data, with ProPolyne in particular but even more importantly with any other wavelet-based data management technique in general. We proved that these query types place a serious limit on our ability to effectively cluster wavelet data. Hence, we designed an allocation strategy with performance close to this theoretical limit, and showed that it leads to significant improvements in performance of an implemented query answering system. Finally, we extended ProPolyne, exploiting our wavelet disk placement technique, in order to design disk-block-aware progressive query answering strategies that deliver excellent approximate results for hard queries after a small number of disk accesses.

We note that there are many other forms of multiresolution and preaggregated data that may have access patterns similar to wavelets. We believe that the techniques proposed in this paper will also produce good disk placement techniques in these areas.
Future Work

Future work is plentiful. Below is a non-exhaustive list of our future plans in this area. This list shows that we have just scratched the surface, and that this very fruitful research area still has several other interesting and open problems for investigation.

Sparse wavelets. This paper deals with dense wavelet data, avoiding the problems solved by techniques such as space-filling curve based indices. Real data will often be sparse, and large wavelet synopses of massive datasets may not fit well in main memory. Thus we need to use the observations of this paper to find ways to handle sparse wavelet data. Our first effort will be on developing dependency-graph-filling curves.
I/O efficient wavelet transformation and updates. Our current implementation of ProPolyne operates on datasets that have been transformed offline and prepared for the system. Moreover, no incremental update of the data set is supported. For our large datasets, the transformation sometimes takes days to complete. To make ProPolyne practical, its web services should be extended with transformation and update web services. This, however, is more than a simple implementation effort. The challenge is that current wavelet transformation techniques assume data can fit into main memory during the transformation. However, for large datasets an I/O efficient transformation technique is required. We intend to build upon the lifting scheme and extend it with an appropriate buffer management technique, so that the intermediate wavelet arrays can be broken into chunks that can be read and written optimally to minimize I/O during transformation. Similarly, as new data sets become available, the update web service should be able to update the coefficients without requiring a full transformation. We currently have one approach where this can be achieved if the total size of the data set is known a priori. We need to extend this work for cases where the ultimate size is not known.
Error bound guarantees and optimal ordering of coefficients. In order to support approximate queries for scientific applications, it is desirable for the user to be informed of an error bound for any given approximate query result, or during the progression of the result. For example, the user may trade off time for accuracy; i.e., prior to the query execution, given an error tolerance, the user can see the approximate response time for a query, or vice versa. Current utilization of wavelets for data compression assumes that the entire signal always needs to be reconstructed, and hence suggests dropping low energy coefficients to obtain the best $L^2$ norm error. However, once wavelets are used to approximate range queries, depending on the range we may observe a huge variation in the error. The error depends heavily on whether the corresponding coefficients for a given range query are kept or not. A very recent study proposes a new approach to dropping wavelet coefficients that would result in guaranteed error bounds for range and point queries. We intend to build upon this work by modifying it for ProPolyne. The idea is, instead of dropping coefficients, to use the technique to sort the corresponding data coefficients for a given ProPolyne query. This way the approximate and progressive query features of ProPolyne will be realized while an error bound can be provided along the way. Recall that ProPolyne suggests an ordering for query coefficients. We plan to modify the technique proposed in that study in order to suggest a priority or weight for sets of data coefficients. The combination of the ProPolyne query coefficient ordering and the weights of their corresponding data coefficients would provide us with a new ordering that takes into consideration the importance of both data and query. Hence, we believe that this new framework would not only provide us with error bounds but also result in better query approximations. The other advantage of this framework is that by ignoring the ProPolyne query ordering we can mimic the traditional wavelet data compression techniques, and by ignoring the data coefficient weights we mimic the pure ProPolyne technique.
Hybridization of ProPolyne. We intend to generalize the mechanism underlying ProPolyne by looking beyond pure wavelets to find other bases which may be more effective on a particular dataset or for a particular query workload. Not only do query evaluation algorithms need to be developed in this setting, but there is also a need for best-basis (or at least good-basis) algorithms that efficiently select an appropriate basis from a library of possibilities. As a first step in this direction, we want to develop a hybrid version of ProPolyne which uses the standard basis in a subset of the dimensions (the standard dimensions) and uses wavelets in all other dimensions. To illustrate, consider a database of ground-station data with schema (groundstation-id, latitude, longitude, time, height-variation). Suppose the stations have fixed locations (i.e., lats and longs). Hence if we project away the time and height-variation dimensions, we will have a relatively small result set. Consequently, we may want to use the standard basis (i.e., no transform) on the small relation (groundstation-id, latitude, longitude) and use wavelets on the others. Given this decomposition of the dimensions, relational selection and aggregation operators can be used in the standard dimensions to accumulate the results of ProPolyne queries in the other dimensions. Clearly the best choice of hybridization will perform at least as well as a pure relational algorithm or pure ProPolyne. Our preliminary analysis indicates that for many realistic datasets and query patterns, hybridizations can perform dramatically better. The challenge here is making the correct choice of standard dimensions. We would like to develop one algorithm which efficiently identifies good dimension decompositions as part of the database population process, and a complementary algorithm which selects the most appropriate available basis to use for evaluation of a particular query. The basis library used by this hybrid algorithm is a subset of the full wavelet packet basis library (DWPT). DWPT is a generalization of the wavelet transform that includes wavelet coefficients as well as summaries and details of details at different levels. Hence, by recursively applying a summary and a detail filter on both summaries and details, DWPT quickly computes a large amount of information about the space and frequency characteristics of a function at different scales. Not only will the techniques developed here be valuable in practice, our understanding of this simplified problem will provide a foundation for future use of the full wavelet packet transform (DWPT).
ProPolyne for general relational algebra operators. Finally, we intend to generalize the applicability of the principles underlying ProPolyne. While range aggregate queries are useful, linear algebraic approximation can be used for much more general types of queries. Towards this end, we intend to extend our work on batches of range queries, which require the simultaneous evaluation of multiple related range aggregates. These queries are very common and include SQL group-by queries, drill-down queries, or general MDX expressions. The key observation there is that these queries act as linear maps, where range queries act as linear functionals. Thus, where we approximate a vector to estimate a range query result, we must approximate a matrix to estimate a general query result. We proposed novel techniques to select bases in which these matrices are very sparse, giving natural query evaluation algorithms with low computational complexity. With these bases in hand, we have developed query evaluation algorithms which share I/O maximally and retrieve the most important data first in order to provide fast approximate results. This work on batch queries has helped us in understanding the mechanics of matrix approximation for approximate query answering; at the same time it has provided insight into appropriate error measures. Relational algebra operators also have matrix representations, and once we have a thorough understanding of how matrix approximation works in the simpler setting described above, we will be prepared to develop and analyze fundamentally novel exact, progressive, and approximate evaluation strategies for relational algebra queries.
Acknowledgments

The authors would like to thank the following students for their help in the development of ProPolyne: Mehrdad Jahangiri, Dimitris Scharadis, and Mehdi Sharifzadeh.
References
A l CH A Jensen R ipples in Mathematics The Discr ete Wavelet T r ansform Springer S Ac hary a V P o osala and S Ramasw am y Selectivit y estimation in spatial databases In A Delis
C F aloutsos and S Ghandeharizad eh editors SIGMOD Pr o c e e dings A CM SIGMOD Interna
tional Confer enc e on Management of Data June Philadephia Pennsylvania USA pages
A CM Press J L Am bite C Shahabi R R Sc hmidt and A Philp ot F ast appro ximate ev aluation of OLAP queries
for in tegrated statistical data In Natl Conf for Digital Government R ese ar ch L os A ngelesMa y D Barbar a and X W u Data cub es in dynamic en vironmen ts IEEE Data Engine ering Bul letin N Bruno S Chaudh uri and L Gra v ano STHoles A m ultidimensi ona l w orkload a w are histogram
In Pr o c A CM SIGMOD pages K Chakrabarti M N Garofalakis R Rastogi and K Shim Appro ximate query pro cessing using
w a v elets In VLDB Pr o c e e dings of th Internationa l Confer enceon V ery L ar ge Data Bases pages CY Chan and Y E Ionnidis Hierarc hical cub es for rangesum queries In VLDB Pr o c of th
International Confer enceon V ery L ar ge Data Bases pages S Chaudh uri and R Mot w ani On sampling and relational op erators IEEE Data Engine ering Bul letin S Chaudh uri R Mot w ani and V R Narasa yy a Random sampling for histogram construction Ho w
m uc h is enough In L M Haas and A Tiw ary editors SIGMOD Pr o c e e dings A CM SIGMOD
International Confer enc e on Management of Data June Se attle Washington USA pages
A CM Press S Chaudh uri R Mot w ani and V R Narasa yy a On random sampling o v er joins In A Delis
C F aloutsos and S Ghandeharizad eh editors SIGMOD Pr o c e e dings A CM SIGMOD Interna
tional Confer enc e on Management of Data June Philadephia Pennsylvania USA pages
A CM Press I Daub ec hies Orthonormal bases of compactly supp orted w a v elets Comm Pur e and Appl Math A Deshpande M Garofalakis and R Rastogi Indep endence is go o d Dep endencybased histogram
synopses In Pr o c A CM SIGMOD pages M N Garofalakis and P B Gibb ons W a v elet synopses with error guaran tees In SIGMOD A CM Press S Gener D Agra w al and A E Abbadi The dynamic data cub e In EDBT th Internationa l
Confer enc e on Extending Datab ase T e chnolo gyv olume of L e ctur e Notes in Computer Scienc e pages Springer S Gener D Agra w al A E Abbadi and T Smith Relativ e prex sums An ecien t approac h for
querying dynamic OLAP data cub es In Pr o c e e dings of the th International Confer enc e on Data
Engine ering pages IEEE Computer So ciet y L Geto or B T ask ar and D Koller Selectivit y estimation using probabili sti c mo dels In Pr o c A CM
SIGMOD pages A C Gilb ert Y Kotidis S Muth ukrishnan and M J Strauss Optimal and appro ximate computa
tion of summary statistics for range aggregates In PODS Pr o c e e dings of the Twente enth A CM
SIGA CTSIGMODSIGAR T Symp osium on Principles of Datab ase Systems pages A C Gilb ert Y Kotidis S Muth ukrishnan and M J Strauss Surng w a v elets on streams Onepass
summaries for appro ximate aggregate queries In VLDB D Gunopulos G Kollios V J Tsotras and C Domeniconi Appro ximating m ultidimensi on al aggre
gate range queries o v er real attributes In SIGMOD Pr o c e e dings A CM SIGMOD Internationa l
Confer enc e on Management of Data pages
C. Ho, R. Agrawal, N. Megiddo, and R. Srikant. Range queries in OLAP data cubes. In SIGMOD, Proceedings ACM SIGMOD International Conference on Management of Data, pages. ACM Press.
H. Jagadish, J. Hui, B. C. Ooi, and K.-L. Tan. Global optimization of histograms. In Proc. ACM SIGMOD, pages.
H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. C. Sevcik, and T. Suel. Optimal histograms with quality guarantees. In A. Gupta, O. Shmueli, and J. Widom, editors, VLDB, Proceedings of rd International Conference on Very Large Data Bases, August, New York City, New York, USA, pages. Morgan Kaufmann.
S. Kundu and J. Misra. A linear tree partitioning algorithm. SIAM Journal of Computing.
I. Lazaridis and S. Mehrotra. Progressive approximate aggregate queries with a multi-resolution tree structure. In SIGMOD, Proceedings ACM SIGMOD International Conference on Management of Data, pages.
J. A. Lukes. Efficient algorithm for the partitioning of trees. IBM Journal of Research and Development.
S. Mallat. Multiresolution approximation and wavelet orthonormal bases of L^2(R). Transactions of the American Mathematical Society.
Y. Matias, J. S. Vitter, and M. Wang. Dynamic maintenance of wavelet-based histograms. In VLDB, Proc. of th Intl. Conf. on Very Large Data Bases, pages. Morgan Kaufmann.
B. Moon, H. Jagadish, C. Faloutsos, and J. H. Saltz. Analysis of the clustering properties of the Hilbert space-filling curve. Transactions on Knowledge and Data Engineering.
V. Poosala and V. Ganti. Fast approximate answers to aggregate queries on a data cube. In SSDBM, th International Conference on Scientific and Statistical Database Management, pages. IEEE Computer Society.
V. Poosala and Y. E. Ioannidis. Selectivity estimation without the attribute value independence assumption. In M. Jarke, M. J. Carey, K. R. Dittrich, F. H. Lochovsky, P. Loucopoulos, and M. A. Jeusfeld, editors, VLDB, Proceedings of rd International Conference on Very Large Data Bases, August, Athens, Greece, pages. Morgan Kaufmann.
W. Press, S. Teukolsky, W. Vetterling, and B. Flannery. Numerical Recipes in C. Cambridge Univ. Press.
M. Riedewald, D. Agrawal, and A. E. Abbadi. pCube: Update-efficient online aggregation with progressive feedback. In SSDBM, Proceedings of the th International Conference on Scientific and Statistical Database Management, pages.
M. Riedewald, D. Agrawal, and A. E. Abbadi. Space-efficient datacubes for dynamic environments. In Proc. of Conf. on Data Warehousing and Knowledge Discovery (DaWaK), pages.
M. Riedewald, D. Agrawal, and A. E. Abbadi. Flexible data cubes for online aggregation. In ICDT, th International Conference on Database Theory, pages.
R. R. Schmidt and C. Shahabi. Wavelet based density estimators for modeling OLAP data sets. In SIAM Workshop on Mining Scientific Datasets, Chicago, April. Available at http://infolab.usc.edu/publication.html
R. R. Schmidt and C. Shahabi. How to evaluate multiple range-sum queries progressively. In PODS. ACM.
R. R. Schmidt and C. Shahabi. ProPolyne: A fast wavelet-based technique for progressive evaluation of polynomial range-sum queries. In EDBT, Lecture Notes in Computer Science. Springer.
C. Shahabi. AIMS: An Immersidata Management System. In VLDB First Biennial Conference on Innovative Data Systems Research (CIDR), Asilomar, CA, January. Available at http://infolab.usc.edu/publication.html
J. Shanmugasundaram, U. Fayyad, and P. Bradley. Compressed data cubes for OLAP aggregate query approximation on continuous dimensions. In Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August.
J. S. Vitter and M. Wang. Approximate computation of multidimensional aggregates of sparse data using wavelets. In SIGMOD, pages. ACM Press.
J. S. Vitter, M. Wang, and B. R. Iyer. Data cube approximation and histograms via wavelets. In CIKM, Proceedings of the th International Conference on Information and Knowledge Management, pages. ACM.
M. V. Wickerhauser. Adapted Wavelet Analysis: From Theory to Software. IEEE Press.
Y.-L. Wu, D. Agrawal, and A. E. Abbadi. Using wavelet decomposition to support progressive and approximate range-sum queries over data cubes. In CIKM, Proceedings of the th International Conference on Information and Knowledge Management, pages. ACM.
Y.-L. Wu, D. Agrawal, and A. E. Abbadi. Applying the golden rule of sampling for query estimation. In Proc. ACM SIGMOD, pages.
Description
Cyrus Shahabi, Rolfe S. Schmidt. "Wavelet disk placement for efficient querying of large multidimensional data sets." Computer Science Technical Reports (Los Angeles, California, USA: University of Southern California, Department of Computer Science) no. 826 (2004).
Asset Metadata
Creator
Schmidt, Rolfe S. (author), Shahabi, Cyrus (author)
Core Title
USC Computer Science Technical Reports, no. 826 (2004)
Alternative Title
Wavelet disk placement for efficient querying of large multidimensional data sets (title)
Publisher
Department of Computer Science, USC Viterbi School of Engineering, University of Southern California, 3650 McClintock Avenue, Los Angeles, California, 90089, USA (publisher)
Tag
OAI-PMH Harvest
Format
25 pages (extent), technical reports (aat)
Language
English
Unique identifier
UC16270317
Identifier
04-826 Wavelet Disk Placement for Efficient Querying of Large Multidimensional Data Sets (filename)
Legacy Identifier
usc-cstr-04-826
Rights
Department of Computer Science (University of Southern California) and the author(s).
Internet Media Type
application/pdf
Copyright
In copyright - Non-commercial use permitted (https://rightsstatements.org/vocab/InC-NC/1.0/)
Source
20180426-rozan-cstechreports-shoaf (batch), Computer Science Technical Report Archive (collection), University of Southern California. Department of Computer Science. Technical Reports (series)
Access Conditions
The author(s) retain rights to their work according to U.S. copyright law. Electronic access is being provided by the USC Libraries, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
USC Viterbi School of Engineering Department of Computer Science
Repository Location
Department of Computer Science, USC Viterbi School of Engineering, Los Angeles, CA, 90089
Repository Email
csdept@usc.edu
Inherited Values
Title
Computer Science Technical Report Archive
Description
Archive of computer science technical reports published by the USC Department of Computer Science from 1991 - 2017.
Coverage Temporal
1991/2017