USC Computer Science Technical Reports, no. 721 (2000)
An Adaptive Probe-based Technique to Optimize Join
Queries in Distributed Internet Databases
Latifur Khan, Dennis McLeod, and Cyrus Shahabi
Integrated Media Systems Center, Department of Computer Science
University of Southern California
Los Angeles, California
[latifurk, mcleod, shahabi]@usc.edu
Abstract
An adaptive probe-based optimization technique is developed and demonstrated in the context of an Internet-based distributed database environment. More and more common are database systems which are distributed across servers communicating via the Internet, where a query at a given site might require data from remote sites. Optimizing the response time of such queries is a challenging task due to the unpredictability of server performance and network traffic at the time of data shipment; this may result in the selection of an expensive query plan using a static query optimizer. We constructed an experimental setup consisting of two servers running the same DBMS connected via the Internet. Concentrating on join queries, we demonstrate how a static query optimizer might choose an expensive plan by mistake. This is due to the lack of a priori knowledge of the run-time environment, inaccurate statistical assumptions in size estimation, and neglecting the cost of remote method invocation. These shortcomings are addressed collectively by proposing a probing mechanism. Furthermore, we extend our mechanism with an adaptive technique that detects sub-optimality of a plan during query execution and attempts to switch to the cheapest plan while avoiding redundant work and imposing little overhead. An implementation of our run-time optimization technique for join queries was constructed in the Java language and incorporated into an experimental setup. The results demonstrate the superiority of our probe-based optimization over a static optimization.
Introduction
A distributed database is a collection of partially independent databases that share a common schema and coordinate processing of non-local transactions. Processors communicate with one another through a communication network [SKS, YM]. We focus on distributed database systems with sites running homogeneous software (i.e., database management system, DBMS) on heterogeneous hardware (e.g., PC and Unix workstations) connected via the Internet. Internet databases are appropriate for organizations consisting of a number of almost independent sub-organizations, such as a university with many departments or a bank with many branches. The idea is to partition data across multiple geographically or administratively distributed sites, where each site runs an almost autonomous database system.
(This research was supported in part by gifts from Informix, Intel, NASA/JPL Contract no. [...], and NSF grants EEC-[...] (IMSC ERC) and MRI-[...].)
In a distributed database system, some queries require the participation of multiple sites, each processing part of the query as well as transferring data back and forth among themselves. Since usually there is more than one plan to execute such a query, it is crucial to obtain the cost of each plan, which highly depends on the amount of participation by each site as well as the amount of data shipment between the sites. Assuming a private/dedicated network and servers, this cost can be computed a priori due to the predictability of servers and network conditions and the availability of effective network bandwidth. However, in the Internet environment, which is based on a best-effort service, there are a number of unpredictable factors that make the cost computation complicated [PF]. A static query optimizer that does not consider the characteristics of the environment, or only considers the a priori knowledge on the run-time parameters, might end up choosing expensive plans due to these unpredictable factors. In the following paragraph, we explain some of these factors via simple examples.
Participating sites (or servers) of Internet database systems might have different processing powers. One site might be a high-end multiprocessor system while the other is a low-end PC running, say, Windows NT. In addition, since most queries are I/O intensive, a site having faster disk drives might observe a better performance. In an Internet-based environment, these sites might be dedicated to a single application or to multiple simultaneous applications. For example, one site might only run a database server while the other is a database server, a web server, and an email server. Moreover, the workload on each server might vary over time. A server running overnight backup processes is more loaded at night as compared to a server running [...]am-[...]pm office transactions. Due to time differences, a server in New York might receive more queries at [...]am in Pacific Standard Time as compared to those received by a server in Los Angeles. The network traffic is another major factor. It is not easy to predict network delay in the Internet due to the variability of effective network bandwidth among the sites. A query plan which results in fewer tuple shipments might or might not be superior to the one preferring extensive local processing, depending on the network traffic and server load at the time of query processing. Briefly, there is just too much uncertainty and a very dynamic behavior in an Internet-based environment, which makes the cost estimation of a plan a very sophisticated task.
Although we believe our probe-based run-time optimization technique is applicable to multidatabases with sites running heterogeneous DBMS, we do not consider such a complex environment in order to focus on the query processing and optimization issues (see Sec. [...]). There has been extensive research in query processing and optimization in both distributed databases and multidatabases [ABF, AHY, BGW, BRJ, BRP, CY, EDNO, KYY, RK, ZL]. Among those, only a few considered run-time parameters in their optimizers. We distinguish these studies from ours in the Related Work section. Briefly, most of these studies propose a detective approach to compensate for the lack of run-time information, while our approach is first predictive, preventing the selection of expensive queries at run-time, and then becomes adaptive to cope with run-time variations. In this paper, we demonstrate the importance and effectiveness of an adaptive probe-based optimization technique for join queries in Internet databases. We focused on join queries because the join operation is not only frequently used but also expensive [YM].
In order to demonstrate the importance of run-time optimization, we implemented an experimental distributed database system connected through the Internet. Our setup consists of two identical servers, both running the same object-relational DBMS (i.e., Informix Universal Server [Inf]), connected via the Internet. (By spawning a number of auxiliary processes on one of the servers, we emulated an environment with heterogeneous servers.) We then split the BUCKY database (from the BUCKY benchmark [CDN]) across the two sites. We implemented a probe-based run-time optimization module for join queries in the Java language. The optimizer first issues two probe queries, each striving to estimate the cost of either the semi-join or the simple join plan. Consequently, the cheapest plan will be selected. The query optimizer of a distributed database system can be extended with our probe queries to capture the run-time behavior of the environment. Furthermore, as a byproduct, the result of the probe queries can be utilized for estimating the size of intermediate relations in a join plan. This estimation is shown to be less sensitive to statistical anomalies as compared to that of static optimizers. Finally, the probe-based technique identified some hidden costs (e.g., the cost of remote invocation of methods with RMI) that should be considered in order to select the cheapest plan. That is, our probing mechanism can capture any surprises associated with specific implementations (e.g., RMI in our case), which can never be accounted for by static optimizers. The experiments show that for expensive queries (processing many tuples) the response time can be improved on average by [...] over a static optimizer, while the probing overhead only results in an average of [...] increase in response time. We also discuss an enhanced version of our optimizer which reduces the overhead by an average of [...] (i.e., observing [...] increase in response time due to overhead) by utilizing the results of the probe query. Obviously, these numbers depend on the number of tuples sampled by the probe queries and the size of relations. In addition, we propose an adaptive optimization technique that copes with sudden changes of the run-time environment on the fly during the execution of a query. We show that adaptive optimization incurs either no overhead or a little overhead only in a few cases.
The remainder of this paper is organized as follows. The next section covers some related work on query processing and optimization in both distributed databases and multidatabases. The section on run-time optimization states the problem, reviews a conventional solution, and finally explains our proposed extensions to capture run-time parameters and utilize them to improve the optimizer. A performance study then compares the performance of our run-time optimization technique with that of a static optimizer. Finally, the concluding section summarizes the paper and provides an overview of our future plans.
Related Work
There have been various studies on query processing and optimization in distributed, federated, and multidatabase systems [AHY, BGW, BPR, CY, KYY, RK]. What distinguishes us from these studies is our consideration of the run-time environment in order to optimize the queries. There are, however, other studies considering the run-time environment [ABF, UFA, BRJ, BRP, EDNO, ONK, ML, CG, ZL, Ant]. Here, we discuss those studies in more detail and distinguish them from this study.
In [CG], a dynamic query optimization model was proposed for a centralized database system in order to solve the problem of unknown run-time bindings for host variables in embedded queries. In [Ant], several plans in a centralized database system are executed simultaneously for a short time, and finally all plans but the best are terminated. The purpose of these simultaneous runs is to capture the cost function instability for a single table access. However, in a distributed database system, these simultaneous runs at a site will compete with each other over resources, and they do not correctly capture the run-time environment.
In [EDNO, ONK], assuming a multidatabase system, the authors realized the importance of run-time optimization (they used the term dynamic optimization) and proposed a weight function to capture the workload and transmission cost for each participating site. The objective is to choose the sites whose cost functions are less than a certain threshold to participate in the query execution. They use a technique similar to our probing mechanism to capture the workload; however, the communication cost is calculated as proportional to the size of the tuples transmitted. Due to network bandwidth variability over the Internet, it is not possible to estimate the communication cost a priori. In addition, among different sites over the Internet, network bandwidth may vary significantly. Finally, we show that other factors, such as server load and choice of implementation, also impact the communication cost.
In [ZL], a query sampling technique was proposed to estimate the cost parameters of an autonomous local database system in order to perform global query optimization in a multidatabase system. Their objective is to estimate the local costs off-line so that a global query optimizer can later use them to determine good execution plans for a series of queries. Therefore, the overheads of sample queries are not important. In addition, their approach does not address when and how the sampling should be invoked to capture a changing environment at run-time. In contrast, our probing mechanism cannot afford to do a complex statistical analysis on samples, because it is invoked per query and hence needs to impose a low overhead on the system.
In [ABF, UFA], the authors also realized the inadequacy of static query optimization and proposed a detective technique to identify sites with delays higher than expected during query execution. Subsequently, they stop waiting for problematic sites and reschedule the plan for other sites in order to hide delays by pushing the delayed sites as far back in the optimizer plan as possible. In this case, their technique might generate incomplete results if the problematic sites never recover. While their approach detects the problem and tries to resolve it, our approach is predictive and tries to avoid the problem altogether. Although a predictive approach results in an initial overhead, we show that in some cases we can minimize the overhead by utilizing the results of our probing query. Furthermore, for expensive queries the overhead is marginal. Similar to previous studies, in their simulation model they assume communication cost is proportional to the size of data transfer in bytes.
In [BRJ, BRP], various types of adaptive query execution techniques are discussed. The idea is to monitor the execution of a plan and, if the performance is lower than what was estimated, to correct the plan utilizing the newly captured information. Again, this is a detective technique trying to compensate for a wrong decision by re-planning. Finally, [BRJ, BRP] also assume communication costs are directly proportional to the volume of data transferred.
In [ML], the query optimizer estimates the total cost of a plan by summing up the CPU cost, I/O cost, message passing cost, and communication cost. The last two costs are computed based on heuristics. Communication cost is estimated as the product of the number of bytes transferred and the effective bandwidth available between the two sites. Over the Internet, it is not trivial to obtain the effective bandwidth between two sites. Furthermore, effective bandwidth changes frequently due to the network dynamics, and it is hard to maintain up-to-date effective bandwidth information.
Run-Time Optimization (RTO) for Join Queries
In this section, we start by defining the problem of query optimization for join queries in distributed databases. Subsequently, we briefly describe a conventional solution to the problem. Finally, we propose our probing mechanism and compare it with the conventional solution. Note that query optimization within a database site is beyond the scope of this paper, and our techniques rely on each site for local query optimizations.
Problem Statement
Suppose there are two relations: $R_l$ at local site $S_l$ and $R_r$ at remote site $S_r$. Consider the query that joins $R_l$ and $R_r$ on attribute $A$ and requires the final result to be at $S_l$. The objective is to minimize the query response time. A straightforward plan, termed simple join plan ($P_j$), is to send relation $R_r$ to site $S_l$ and perform a local join at $S_l$. This approach observes one data transfer and one join operation. The second plan employs semi-join and is termed semi-join plan ($P_{sj}$). This strategy incurs two data transfers and also performs join twice. Utilization of semi-joins to reduce the size of the intermediate relations has received a great deal of attention [BGW, YM]. The decision between choosing one plan over the other is not straightforward and depends on a number of parameters such as the size and cardinality of relations $R_l$ and $R_r$. Therefore, the problem is how to decide which plan to choose in order to minimize the response time of a certain join query. It is the responsibility of a query optimizer to assign a cost to each plan and then choose the cheaper plan. Intuitively, if $|R_l|$ and $|R_r|$ are the cardinalities of relations $R_l$ and $R_r$, respectively, then when $|R_l| \ll |R_r|$ the semi-join plan seems promising, and vice versa.
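The two plans can be contrasted with a small in-memory sketch. This is only an illustration under simplifying assumptions, not the implementation used in the paper: relations are modeled as Python lists of dictionaries, the "network" is reduced to a counter of shipped tuples, and the function names (`hash_join`, `simple_join_plan`, `semijoin_plan`) are hypothetical.

```python
def hash_join(left, right, key):
    """Join two lists of dict tuples on `key`; returns (left, right) pairs."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [(l, r) for l in left for r in index.get(l[key], [])]


def simple_join_plan(local_rel, remote_rel, key):
    """Plan P_j: ship the whole remote relation, then join locally."""
    shipped = len(remote_rel)  # one transfer: every remote tuple
    return hash_join(local_rel, remote_rel, key), shipped


def semijoin_plan(local_rel, remote_rel, key):
    """Plan P_sj: ship join values out, reduce remotely, ship matches back."""
    join_values = {t[key] for t in local_rel}      # projection on A
    reduced = [t for t in remote_rel if t[key] in join_values]  # remote join
    shipped = len(join_values) + len(reduced)      # two transfers
    return hash_join(local_rel, reduced, key), shipped  # final local join


if __name__ == "__main__":
    local = [{"A": 1, "name": "x"}, {"A": 2, "name": "y"}]
    remote = [{"A": i % 50, "val": i} for i in range(1000)]
    _, shipped_simple = simple_join_plan(local, remote, "A")
    _, shipped_semi = semijoin_plan(local, remote, "A")
    print(shipped_simple, shipped_semi)
```

With a small local relation and a large remote one, the semi-join plan ships far fewer tuples while producing the same result, which matches the intuition stated above. The tuple count alone ignores the run-time factors that the rest of the paper is concerned with.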
Static Query Optimizer (SQO)
In this section, we explain a conventional method [BGW, CP, AHY] to estimate the costs associated with both simple join and semi-join plans. Since the parameters used for this cost estimation are all known a priori, before the execution of the plans, this query optimizer is termed Static Query Optimizer (SQO). For the remainder of this paper, we focus on the same exact scenario (a join of $R_l$ and $R_r$ with the result required at the local site), without loss of generality.
Given the number of tuples in $R_r$ as $N_r$ and the size of a tuple in $R_r$ as $S_{R_r}$, the cost of simple join is trivially computed as follows:
$$Cost(P_j) = C_0 + C_1 \cdot S_{R_r} \cdot N_r$$
where $C_0$ is the cost to start up a new connection and $C_1$ is the communication cost per byte transferred. The computation of the cost of the semi-join plan is more complicated (for a detailed description of semi-join, consult [CP, YM]):
(a) Let us denote the size of the common attribute $A$ as $S_{R_l.A}$ and the number of distinct values for attribute $A$ in local relation $R_l$ as $N_l$. The cost to transfer $\pi_A(R_l)$ from $S_l$ to $S_r$ is $C_0 + C_1 \cdot S_{R_l.A} \cdot N_l$.
(b) Now $\pi_A(R_l)$ is joined with $R_r$ at $S_r$ with a zero cost: $R' = \pi_A(R_l) \bowtie R_r$.
(c) Suppose $|R'|$ is the cardinality of relation $R'$; the cost of sending $R'$ to $S_l$ is $C_0 + C_1 \cdot S_{R_r} \cdot |R'|$.
(d) Finally, $R'$ is joined with $R_l$ at $S_l$ with a zero cost: $Res = R' \bowtie R_l$.
Therefore, the overall cost for the semi-join plan is:
$$Cost(P_{sj}) = 2 C_0 + C_1 \cdot (S_{R_l.A} \cdot N_l + S_{R_r} \cdot |R'|)$$
SQO will choose the semi-join plan if $Cost(P_{sj}) < Cost(P_j)$, or if:
$$C_0 + C_1 \cdot (S_{R_l.A} \cdot N_l + S_{R_r} \cdot |R'|) < C_1 \cdot S_{R_r} \cdot N_r$$
SQO can examine the above inequality accurately only if it has all the required information (e.g., $N_l$, $S_{R_r}$) a priori. Note that $C_1$ is simply the reciprocal of network bandwidth. However, over the Internet, effective network bandwidth between two sites is extremely difficult to estimate because it changes frequently. Finally, SQO needs to estimate the size of intermediate results, i.e., $|R'|$. One estimation is as follows:
$$|R'| = domain(A) \cdot sel(R_l.A) \cdot sel(R_r.A)$$
where $sel(R_l.A)$ and $sel(R_r.A)$ are the selectivity of attribute $A$ in relations $R_l$ and $R_r$, respectively. This estimate is based on two assumptions. First, it assumes that the domain of $A$ is discrete and can be considered as $A$'s sample space. Second, tuples are distributed between $R_l$ and $R_r$ independent of the values of $A$. That is, there is no correlation between $R_l$ and $R_r$ based on the join attribute $A$. Later, we show that as a byproduct of our probing technique, we do not need to make any of these assumptions.
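The static cost model can be written down as a few small functions. The sketch below is an illustration only; the function and parameter names are hypothetical, and it assumes the reading of the formulas in which the startup cost is paid once per transfer, the per-byte cost is a constant known in advance, and the size estimate is the product of the domain size and the two selectivities.

```python
def cost_simple_join(c0, c1, tuple_size_r, n_r):
    """Static cost of P_j: one connection startup plus shipping all of R_r."""
    return c0 + c1 * tuple_size_r * n_r


def cost_semijoin(c0, c1, attr_size, n_l, tuple_size_r, reduced_card):
    """Static cost of P_sj: two transfers (join values out, matches back)."""
    return 2 * c0 + c1 * (attr_size * n_l + tuple_size_r * reduced_card)


def estimate_reduced_cardinality(domain_size, sel_l, sel_r):
    """Selectivity-based size estimate of the intermediate relation."""
    return domain_size * sel_l * sel_r


def sqo_prefers_semijoin(c0, c1, attr_size, n_l, tuple_size_r, n_r, reduced_card):
    """True when the static model rates the semi-join plan as cheaper."""
    semi = cost_semijoin(c0, c1, attr_size, n_l, tuple_size_r, reduced_card)
    return semi < cost_simple_join(c0, c1, tuple_size_r, n_r)


if __name__ == "__main__":
    # Made-up inputs: 0.1 s startup, 1 microsecond per byte, 4-byte join
    # attribute, 100-byte remote tuples.
    print(cost_simple_join(0.1, 1e-6, 100, 100_000))
    print(cost_semijoin(0.1, 1e-6, 4, 1_000, 100, 500))
```

Every input here is a constant supplied before execution, which is exactly the property that makes the optimizer static: if the real per-byte cost or the real intermediate size differs at run-time, the comparison can point to the wrong plan.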
Our Proposed Solution
We now describe our run-time optimization (RTO) technique, which is an extension to SQO. To summarize, RTO first submits two probe queries to estimate the run-time costs corresponding to plans $P_j$ and $P_{sj}$ by measuring the response time observed by each probe query. Subsequently, it replaces $C_1 \cdot S_{R_l.A}$ and $C_1 \cdot S_{R_r}$ in the SQO inequality (the condition under which SQO chooses the semi-join plan) by the estimated costs. In addition, RTO analyzes the results of the probe queries to estimate the size of $R'$ more accurately. This last step of RTO is, of course, identical to the concept of sampling. For the time being, we assume that there would be no sudden changes in the behavior of the run-time environment between the time that a probe query is submitted and the time that the original query will be executed. This assumption is relaxed in a later section.
For the remainder of this section, we first describe the probe queries and how their measured performance values are incorporated into the SQO inequality. Next, we propose an enhanced version of RTO to reduce the overhead of probing by utilizing its results to support the original query. Later, we argue how our modification to the SQO inequality can capture run-time behavior and estimate the size of intermediate relations more accurately. Finally, we show how our optimizer copes with sudden changes of the run-time environment.
Probe Queries
Our main objective is to modify the SQO main equation (the inequality under which SQO selects the semi-join plan) in order to take the run-time parameters into consideration. To achieve this, we submit the following two probe queries to collect some parameters at run-time.
Probe Query A: The first probe query strives to replace the term $C_1 \cdot S_{R_l.A}$ of the SQO inequality with a more accurate estimation. This is because $C_1 \cdot S_{R_l.A}$ is based on the simplistic assumption that communication cost is a linear function of the amount of data transferred and that the network bandwidth ($C_1$) is also available. This probe sends the $A$ attribute of $X$ tuples of $R_l$ (denoted as $R_{XA}$) from local site $S_l$ to remote site $S_r$, joins $R_{XA}$ with $R_r$ at remote site $S_r$, and receives back the size of the result, denoted as $X_j$. The time to execute this probe query is measured and then normalized by dividing it by $X$. The result is the cost of this probe and is denoted by $C_{r1}$. To illustrate the costs that have been captured by $C_{r1}$, consider the following equation:
$$C_{r1} = \frac{S(X) + RIC_Q + RIC(X) + JC_r}{X}$$
Here, $S(X)$ is the cost to ship $X$ tuples (each tuple consists of only one attribute, $A$) from $S_l$ to $S_r$, $RIC_Q$ is the remote invocation cost for the join operation at $S_r$, $RIC(X)$ is the remote invocation cost to insert $X$ tuples into $S_r$, and $JC_r$ is the cost to perform the join operation at $S_r$. (We ignored the cost of returning $X_j$ to $S_l$, since $X_j$ is only a single integer.) Note that due to the stateless nature of HTTP, which is the protocol used within our setup to access remote sites [...]. Observe that, as a byproduct, $R'$ can now be estimated more accurately, because $X_j$ is the number of tuples in $R'$ if $R_l$ had $X$ tuples. Now that $R_l$ has $N_l$ tuples, the size of $R'$ can be estimated as:
$$Sample\_estimate(R') = \frac{X_j \cdot N_l}{X}$$
Probe Query B: The second probe query strives to replace the term $C_1 \cdot S_{R_r}$ of the SQO inequality with a more accurate estimation. It receives $X$ tuples of $R_r$ (denoted as $R_X$) from remote site $S_r$, joins $R_X$ with $R_l$ at local site $S_l$, and measures the time to complete this process. This time is then normalized by dividing it by $X$ and is the cost of this probe, denoted by $C_{r2}$. To illustrate the costs that have been captured by $C_{r2}$, consider the following equation:
$$C_{r2} = \frac{RIC(X) + S(X) + JC_l}{X}$$
Here, $S(X)$ is the cost to ship $X$ tuples from $S_r$ to $S_l$, $JC_l$ is the cost to perform the join operation at $S_l$, and $RIC(X)$ is the remote invocation cost to request $X$ tuples from $S_r$. In both probe cost equations, $S(X)$ captures the following run-time parameters:
$$S(X) = Delay_{send}(X) + Delay_{network}(X) + Delay_{receive}(X)$$
where $Delay_{send}(X)$ is the time required at the sender site to emit $X$ tuples, $Delay_{receive}(X)$ is the time required at the receiver site to receive $X$ tuples, and $Delay_{network}(X)$ is the network delay.
It is important to note that shipment cost, remote invocation cost, and join cost are intermixed in $C_{r1}$ and $C_{r2}$. This is not an obstacle in our case, since it is not required to estimate each of these costs separately. Now we can modify the SQO inequality as follows:
$$N_l \cdot C_{r1} + Sample\_estimate(R') \cdot C_{r2} < N_r \cdot C_{r2}$$
In this inequality, the terms $C_1 \cdot S_{R_l.A}$ and $C_1 \cdot S_{R_r}$ of the SQO inequality are replaced by $C_{r1}$ and $C_{r2}$, and $|R'|$ is computed using the sample estimate instead of the selectivity-based estimate.
Selection of X tuples: Both probe queries transfer $X$ tuples for their estimations. Therefore, the value of $X$ (i.e., the number of tuples transferred) has an impact on the accuracy of the estimations. Trivially, the larger the value of $X$, the more accurate the estimation. Moreover, the amount of data transferred for $X$ should be large enough to exercise the network's TCP connection beyond its slow start. However, a large value of $X$ results in more overhead observed by the probe queries. In our experiments, we varied $X$ from [...] to [...] of $N_l$. Besides the value of $X$, the way that the $X$ tuples are selected impacts the estimated size of $R'$. This sampling should be done in a way that the $X$ tuples are a good representative of $R_l$. This can be achieved by random selection of tuples from the relation $R_l$. There are alternative techniques described in the literature for random selection of tuples from a relation, such as heap scan, index scan, and an index sampling technique [Olk, HS]. There are many issues in obtaining a good random representative, especially when there are index structures on the relation. The details of sampling are beyond the scope of this paper.
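The probe-based decision can be sketched as follows. This is a minimal illustration rather than the paper's Java module: the function names are hypothetical, the per-tuple probe costs are passed in as already-measured numbers (called `c_r1` and `c_r2` here for the costs of probe queries A and B), and uniform random sampling via `random.sample` stands in for whichever sampling method a real system would use.

```python
import random


def draw_sample(relation, x, seed=None):
    """Pick x tuples uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(relation, x)


def sample_estimate(x_j, n_l, x):
    """Scale the probe's join result size x_j from x tuples up to n_l tuples."""
    return x_j * n_l / x


def rto_prefers_semijoin(n_l, n_r, c_r1, c_r2, x_j, x):
    """Probe-based test: semi-join cost estimate < simple join cost estimate.

    c_r1 and c_r2 are the measured per-tuple costs of probe queries A and B.
    """
    reduced = sample_estimate(x_j, n_l, x)
    return n_l * c_r1 + reduced * c_r2 < n_r * c_r2


if __name__ == "__main__":
    local = list(range(10_000))
    sample = draw_sample(local, 100, seed=7)
    print(len(sample))
    print(sample_estimate(30, 10_000, 100))
    print(rto_prefers_semijoin(10_000, 50_000, 0.001, 0.002, x_j=30, x=100))
```

Note that no bandwidth figure, tuple size, or selectivity appears among the inputs: everything comes from the two probe timings and the probe's result size, which is the point of the modification described above.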
Scalability: Although we describe our probe queries for joins between two relations (i.e., 2-way join), the technique is indeed generalizable to k-way join. When joining k relations on a common attribute, the k-way join can be considered as a series of 2-way joins. The purpose of these joins is to reduce the size of relations and determine which tuples of relations are participating in the final result. Finally, all processed relations are transmitted to a final site where joins are performed and the answer to the query is obtained [CL]. Hence, the optimization challenge in the reducing phase is to identify the optimal execution order of the k-way join. Static optimizers for distributed databases address this challenge by sorting the k relations in ascending order of their volumes [AHY]. Assuming the communication cost is independent of the network load and is linearly proportional to the volume of transferred data, this sorted order specifies the optimal execution order. But over the Internet, network bandwidth among the sites varies significantly. This variable network load should be taken into account to identify the optimal plan. Therefore, our probe-based technique can be utilized in a similar way to estimate the communication cost among all the k participating sites (assuming one relation per site). As a result, a number of probe queries that grows quadratically with k is generated among the k sites. One can argue that our technique is not scalable due to the extensive increase in the number of probe queries in a k-way join optimization. However, it is important to note that these probe queries are independent of each other and thus can be executed in parallel. In our experiments, we utilized Java multi-threading primitives [Ree] to perform probe queries concurrently. Therefore, the overhead observed for k-way join optimization is equal to the maximum delay incurred among all the probe queries. After estimating the communication costs from site to site, the optimal execution order is determined by the ascending order of the number of tuples transferred multiplied by the communication cost between the corresponding sites. It is important to note that our probing technique does not require knowledge of the network bandwidth among the sites. On the fly, it inherently captures the network bandwidth and takes into consideration the sites' loads and remote invocation costs. Currently, we are investigating the extension of our probe-based technique to support k-way join within our experimental setup.
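The parallel probing and the ordering rule can be sketched with a thread pool. The paper's experiments used Java multi-threading; the Python version below is only an analogous illustration, and the names `run_probes_in_parallel` and `order_transfers`, the site labels, and the fake probes are all hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_probes_in_parallel(probes):
    """Run independent probes concurrently.

    `probes` maps a (source_site, destination_site) pair to a zero-argument
    callable that returns the measured per-tuple cost for that pair.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(probes))) as pool:
        futures = {pair: pool.submit(fn) for pair, fn in probes.items()}
        return {pair: fut.result() for pair, fut in futures.items()}


def order_transfers(transfers, per_tuple_cost):
    """Sort (source, destination, n_tuples) by n_tuples * measured cost."""
    return sorted(transfers, key=lambda t: t[2] * per_tuple_cost[(t[0], t[1])])


def fake_probe(cost, delay=0.01):
    """Stand-in for a real probe query: sleeps, then reports a fixed cost."""
    def probe():
        time.sleep(delay)
        return cost
    return probe


if __name__ == "__main__":
    probes = {
        ("s1", "s2"): fake_probe(0.004),
        ("s2", "s3"): fake_probe(0.001),
        ("s1", "s3"): fake_probe(0.002),
    }
    costs = run_probes_in_parallel(probes)
    transfers = [("s1", "s2", 1000), ("s2", "s3", 5000), ("s1", "s3", 1500)]
    print(order_transfers(transfers, costs))
```

Because the probes run concurrently, the wall-clock overhead is governed by the slowest probe rather than by the sum of all probes, which is the argument made above for why the growing number of probes need not hurt scalability.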
Enhanced RTO
One major problem with our RTO is the overhead associated with probing queries. This overhead can be alleviated by a simple enhancement. Recall that during the first step of probe query A, $X$ tuples of $R_l$ (each consisting of the single common attribute $A$) are transferred to $S_r$. The idea is to keep that relation $R_{XA}$ at $S_r$ and not discard it. Therefore, if $P_{sj}$ is selected by RTO as the superior plan, it will not be required to send those $X$ tuples to $S_r$ again. This results in saving both $S(X)$ and $RIC(X)$. We evaluated the impact of this enhancement in our performance evaluation, and an average of [...] reduction in overhead has been observed for a given value of $X$.
Analysis and Comparison
In this section, we analyze why the modified inequality of RTO can now capture run-time behavior and estimate the size of intermediate relations more accurately than SQO.
Communication Cost: Almost all the previous studies on distributed query optimization (see the Related Work section) assumed communication cost is proportional to the size of data transferred. They also assume network bandwidth information is available to the system and remains constant. This is reasonable for a private/dedicated network. The same assumptions have also been made by the static query optimization technique discussed in this paper (see the Static Query Optimizer section). However, researchers in the network community [Pax] demonstrate that over the Internet it is hard to estimate the effective network bandwidth. In addition, network bandwidth between two sites varies significantly with time due to the Internet dynamics. In our experiments, however, we observed that the communication cost is indeed a linear function of the number of tuples transferred (see our technical report; eliminated from citations due to the double-blind reference policy). This is because the granularity of data transfer in our experiments was in tuples. With RTO, $C_{r1}$ and $C_{r2}$ are the linear extrapolation of the time to move $X$ tuples, and hence are based on the number of tuples moved at the time of query execution between the two participating sites. In addition, the size of tuples is also taken into consideration by measuring the actual time to transfer $X$ tuples of size $S_{R_l.A}$ and $S_{R_r}$. By doing this, we are inherently capturing the available network bandwidth between two sites at run-time. Note that the same argument holds if the granularity of data transfer is in blocks instead of tuples. However, the probe queries must be modified to extrapolate on the number of block movements as opposed to tuple movements. This is a straightforward extension.
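The normalization and linear extrapolation described here amount to timing a probe and scaling the result. The sketch below is an illustration under the stated linearity assumption; `per_tuple_cost` and `extrapolate` are hypothetical names, and the injectable `clock` parameter is an assumption made only so the example can be exercised without a real network.

```python
import time


def per_tuple_cost(probe, x, clock=time.perf_counter):
    """Time one probe that moves x tuples and normalize by x."""
    start = clock()
    probe()
    elapsed = clock() - start
    return elapsed / x


def extrapolate(cost_per_tuple, n):
    """Linear extrapolation: estimated time to move n tuples."""
    return cost_per_tuple * n


if __name__ == "__main__":
    # A fake clock that reports 10.0 s, then 12.0 s: the probe "took" 2 s.
    ticks = iter([10.0, 12.0])
    cost = per_tuple_cost(lambda: None, x=100, clock=lambda: next(ticks))
    print(cost, extrapolate(cost, 5000))
```

The same two functions would work unchanged for block-granularity transfers if `x` and `n` counted blocks instead of tuples.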
Remote Invocation Cost: As discussed elsewhere in this paper, in our experimental setup Remote Method Invocation (RMI) was employed in order to access a remote server. An interesting distinction between the simple join and semi-join plans is that, in general, the semi-join plan uses remote invocation more often than the simple join plan. To illustrate, $P_{sj}$ utilizes remote invocation $N_l$ times to insert tuples into $S_r$, one time to execute the join remotely at $S_r$, and $|R'|$ times to fetch the semi-join results back to $S_l$. Meanwhile, $P_j$ utilizes RMI only $N_r$ times to fetch the remote tuples into $S_l$. Obviously, this hidden RMI cost has not been captured by SQO because this cost is very specific to our implementation and experimental setup. The interesting observation, however, is that this cost has automatically been captured by $C_{r1}$ and $C_{r2}$ (see the probe cost equations). Therefore, a general conclusion is that our run-time probing mechanism can capture any surprises associated with specific implementations (e.g., RMI in our case), which can never be accounted for by the static optimizer. Note that other alternative implementations will also observe some overheads similar to that of RMI. For example, if Java Database Connectivity (JDBC) is employed to connect to the database servers, remote sites can be accessed in three alternative ways depending on the JDBC driver implementation [Ree]: distributed objects implemented in RMI, a message passing technique, or Common Object Request Broker Architecture (CORBA) [Far]. Trivially, all three methods introduce some overheads when accessing remote sites. Hence, $C_{r1}$ and $C_{r2}$ automatically capture these varying overheads regardless of the different implementations of JDBC.
Load Cost: From the SQO cost inequality, it is obvious that SQO does not consider the time to process different operations such as project, join, and semi-join, which are impacted by server workload. This is because it assumes that communication cost is the dominant factor in estimating the cost of a plan. However, in our technical report, we show the important impact of the load in choosing the best plan. On the other hand, with RTO, it is trivial from the probe cost equations that the workload of the server can be captured by $C_{r1}$ and $C_{r2}$ due to the following terms: $JC_r$, $JC_l$, $Delay_{send}$, and $Delay_{receive}$. Hence, another distinction between $P_{sj}$ and $P_j$ can be captured by our RTO. That is, semi-join performs two light joins, one at the remote site and the other at the local site, while simple join only performs one heavy but local join operation. Besides these operations that are highly dependent on the server workload, there are other dependencies as well. A heavily loaded server also impacts the communication cost, since it sends and receives tuples slower than a lightly loaded server (i.e., $Delay_{send}$ and $Delay_{receive}$). Consequently, it is not straightforward to model the impact of load on the cost of a plan. This is exactly where our probing mechanism helps: it automatically captures these chaotic behaviors and aggregates them within the two simple terms $C_{r1}$ and $C_{r2}$.
Statistical Assumptions: Regarding the statistical assumptions, RTO has two major advantages over SQO. First, RTO does not rely on remote profiles. Accessing metadata from the remote sites is not easy because statistical profiles change frequently, and hence the process of collecting and updating the statistical information about the remote site is expensive. Recall that while SQO needs the values of $S_{R_r}$ and $sel(R_r.A)$ for its computations, RTO relies on neither of these values. Second, RTO is less sensitive to statistical anomalies than SQO. SQO makes two major assumptions in order to estimate the size of $R'$ in Eq.: the domain of $A$ is discrete and can be considered as $A$'s sample space, and there is no correlation between $R_\ell$ and $R_r$. Instead, RTO estimates the size of $R'$ by sampling (see Eq.) and thus is independent of both of these assumptions. That is, with RTO, $A$'s sample space is $R_\ell$; moreover, it utilizes the entire $R_r$, which is $R_r$'s best possible sample. In addition, if there is a correlation between the two relations, it will impact $X_j$ in Eq. accordingly. Therefore, a positive correlation results in a higher value of $X_j$, and vice versa.
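The sampling idea can be illustrated with a minimal Python sketch. It assumes, as in our experiments, that the local relation has no duplicate join values; the function name `sample_estimate`, the `seed` parameter, and the toy key lists are hypothetical and are not taken from the implementation.

```python
import random


def sample_estimate(local_keys, remote_keys, sample_size, seed=0):
    """Estimate the semijoin result size from a sample of the local relation.

    A random sample of the local join values is matched against the entire
    remote relation, and the match count is scaled up to the full local size.
    Assumes local_keys contains no duplicates.
    """
    rng = random.Random(seed)
    sample = set(rng.sample(local_keys, sample_size))
    # Every remote tuple is examined, so the remote side is not sampled.
    matched = sum(1 for key in remote_keys if key in sample)
    return matched * len(local_keys) / sample_size


local_keys = list(range(100))        # hypothetical join values at the local site
remote_keys = list(range(50, 150))   # hypothetical join values at the remote site
estimate = sample_estimate(local_keys, remote_keys, sample_size=20)
```

Because the match count is observed rather than derived from a selectivity formula, any correlation between the two relations is reflected in the estimate directly.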
An Adaptive Optimization Technique
In Sec., we made the simplifying assumption that there would be no sudden changes in the run-time environment between the time the probe queries are submitted and the time the original query is executed. However, in some cases a single probing may not be enough to predict the run-time environment during the execution of the original query. This is because some queries might take minutes to execute, and hence the run-time parameters may change. Therefore, it is necessary to examine, during query execution, whether the selected plan still provides the optimal solution. If not, the optimizer should discard the plan and choose a new one. Moreover, the new plan should be intelligent enough to avoid redundant work that has already been done by the earlier plans.
With our adaptive optimization technique, we partition a join query into a series of $K$ smaller joins. Subsequently, for each smaller join we re-evaluate the run-time parameters and decide either to continue with the current plan or to switch to another plan. Our technique, however, does not treat each smaller join in isolation: it ensures that no smaller join performs redundant work that has already been done by the previous joins in the series. Briefly, the optimizer collects statistics to update the cost model at each re-evaluation point, termed a cost-update point. Using the updated cost model, the costs of the different plans to complete the query are estimated, and the optimizer chooses the least expensive one. To achieve this, we need to recompute $C_{r1}$ and $C_{r2}$ at each cost-update point. In most cases, our adaptive technique can estimate $C_{r1}$ and $C_{r2}$ by just timing the execution of the plan as it progresses; hence, new probe queries need not be sent explicitly. In other cases, it needs to submit new probe queries. The number of probe queries submitted explicitly for a series of $K$ joins is shown in Table. These extra probe queries can either be issued at the cost-update point or be executed in the background during the execution of the plan. Both approaches have advantages and disadvantages. The former incurs an overhead in cases where the next selected plan is not $P_{sj}$ (assuming our enhanced RTO). The latter does not incur this overhead, but it may overestimate the parameters because the probe itself might overload the system. To explain how our technique decides on a plan for each smaller join and how it avoids redundant work in case of a switch of plans, we need to define some terms.
Definition: Let there be $K$ cost-update points $cu_1, cu_2, \ldots, cu_k$; then a plan $P_{cu_i}$ is selected by the optimizer at $cu_i$, where $P_{cu_i} \in \{P_j, P_{sj}\}$.

Definition: If $N(P_{cu_i})$ tuples are transmitted from one site to another at $cu_i$, then for $P_{cu_i} = P_j$, $N(P_{cu_i})$ tuples of the relation $R_r$ are sent from the remote site to the local site. However, if $P_{sj}$ is selected at $cu_i$, then $N(P_{cu_i})$ tuples of the relation $R_\ell$ over the common attribute $A$ are sent from the local site to the remote site.

In order to avoid redundant tuple transfers, the $N(P_{cu_i})$ tuples are chosen in plan $P_{cu_i}$ such that none of these tuples have been transmitted before from the local site to the remote site by $P_{cu_m}$, where $m < i$ and $P_{cu_i} = P_{sj}$. Let $P_{cu_{i-1}} = P_{sj}$ and $P_{cu_i}$ be the plans at $cu_{i-1}$ and $cu_i$, respectively; then for $P_{cu_i} = P_{sj}$, the number of tuples of $R_\ell$ that are required to transfer from the local site to the remote site is

$$N_\ell - \sum_{m=1,\; P_{cu_m} = P_{sj}}^{i-1} N(P_{cu_m}).$$
These tuples are joined with $R_r$ at the remote site, and finally the expected number of tuples that are further required to ship from the remote site to the local site is

$$\left(1 - \frac{\sum_{m=1,\; P_{cu_m} = P_{sj}}^{i-1} N(P_{cu_m})}{N_\ell} - \frac{\sum_{m=1,\; P_{cu_m} = P_j}^{i-1} N(P_{cu_m})}{N_r}\right) \times \mathit{Sample\_estimate}(R').$$

The subtracted terms represent the expected number of tuples that have already been transferred. At $cu_i$, $C_{r1}$ and $C_{r2}$ are updated with the recent $P_{sj}$ cost estimate for $P_{cu_{i-1}} = P_{sj}$. Note that for the recent $P_{sj}$ cost estimate, $\frac{N(P_{cu_{i-1}})}{N_\ell} \times \mathit{Sample\_estimate}(R')$ tuples were expected to ship from the remote site to the local site. This fact is taken into account during the estimation of $C_{r2}$. Hence, the overall cost for the $P_{sj}$ plan to perform the join for the rest of the tuples at $cu_i$ is

$$Cost(P_{sj}) = \left(N_\ell - \sum_{m=1,\; P_{cu_m} = P_{sj}}^{i-1} N(P_{cu_m})\right) \times C_{r1} + \left(1 - \frac{\sum_{m=1,\; P_{cu_m} = P_{sj}}^{i-1} N(P_{cu_m})}{N_\ell} - \frac{\sum_{m=1,\; P_{cu_m} = P_j}^{i-1} N(P_{cu_m})}{N_r}\right) \times \mathit{Sample\_estimate}(R') \times C_{r2}.$$

Let $P_{cu_{i-1}} \in \{P_j, P_{sj}\}$ and $P_{cu_i}$ be the plans at $cu_{i-1}$ and $cu_i$, respectively; then for $P_{cu_i} = P_j$, the number of tuples of $R_r$ that are required to transfer from the remote site to the local site is

$$N_r - \frac{\sum_{m=1,\; P_{cu_m} = P_{sj}}^{i-1} N(P_{cu_m})}{N_\ell} \times \mathit{Sample\_estimate}(R') - \sum_{m=1,\; P_{cu_m} = P_j}^{i-1} N(P_{cu_m}).$$

Therefore, the cost of the $P_j$ plan to perform the join for the rest of the tuples at $cu_i$ is

$$Cost(P_j) = \left(N_r - \frac{\sum_{m=1,\; P_{cu_m} = P_{sj}}^{i-1} N(P_{cu_m})}{N_\ell} \times \mathit{Sample\_estimate}(R') - \sum_{m=1,\; P_{cu_m} = P_j}^{i-1} N(P_{cu_m})\right) \times C_{r2}.$$

Note that if $P_{cu_{i-1}} = P_j$, $C_{r2}$ is updated at $cu_i$. In this case, $C_{r1}$ cannot be estimated unless either a new probe query A is issued at $cu_i$, or probe query A has already been issued in the background during the execution of $P_j$. The overhead of probe query A can be entirely avoided if $P_{cu_i} = P_{sj}$ (see Sec.). Finally, if $P_{cu_{i-1}} = P_{sj}$, $C_{r1}$ and $C_{r2}$ are updated at $cu_i$ accordingly, and no extra probe query is required. Hence, RTO with adaptive optimization chooses the $P_{sj}$ plan if $Cost(P_{sj}) < Cost(P_j)$; otherwise, the optimizer switches to $P_j$.
It is important to realize that once a tuple of $R_r$ is sent by a certain plan $P_{cu_m}$ from the remote site to the local site, that tuple is not sent again even if it is selected in a plan $P_{cu_i}$, where $m < i$, $P_{cu_i} \in \{P_{sj}, P_j\}$, and $P_{cu_m} \in \{P_{sj}, P_j\}$. Therefore, in a plan $P_{cu_i}$ we select tuples from $R_r$ which were not sent to the local site in $P_{cu_m}$, $m < i$. In order to do this, an SQL operation will be executed remotely; hence $JC_r$ now becomes more expensive due to the additional condition (see Eq.). Finally, for each $P_{cu_i}$, after gathering tuples from the remote site, the final join is carried out at the local site between $R_\ell$ and the shipped data set.
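The plan comparison at a cost-update point can be sketched in Python as follows. This is a rough illustration under one reading of the cost expressions above, in which the fractions of work already shipped by earlier semijoin and simple join batches are subtracted from one; the names `Batch`, `remaining_costs`, and `choose_plan`, and all numeric values, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Batch:
    plan: str    # "sj" for the semijoin plan, "j" for the simple join plan
    tuples: int  # number of tuples transmitted by that batch


def remaining_costs(history, n_local, n_remote, sample_estimate, c_r1, c_r2):
    """Estimate the cost of finishing the query with each plan.

    history lists the batches executed before the current cost-update point;
    c_r1 and c_r2 are the per-tuple costs measured by probing or timing.
    """
    sent_sj = sum(b.tuples for b in history if b.plan == "sj")
    sent_j = sum(b.tuples for b in history if b.plan == "j")

    # Semijoin: send the local tuples not yet sent, then ship back the
    # part of the expected semijoin result that has not arrived yet.
    fraction_left = 1 - sent_sj / n_local - sent_j / n_remote
    cost_sj = (n_local - sent_sj) * c_r1 + fraction_left * sample_estimate * c_r2

    # Simple join: fetch the remote tuples not yet present at the local site.
    remote_left = n_remote - (sent_sj / n_local) * sample_estimate - sent_j
    cost_j = remote_left * c_r2
    return cost_sj, cost_j


def choose_plan(history, n_local, n_remote, sample_estimate, c_r1, c_r2):
    cost_sj, cost_j = remaining_costs(
        history, n_local, n_remote, sample_estimate, c_r1, c_r2)
    return "sj" if cost_sj < cost_j else "j"


history = [Batch("sj", 100), Batch("j", 50)]
costs = remaining_costs(history, 500, 500, 400, c_r1=2.0, c_r2=3.0)
plan = choose_plan(history, 500, 500, 400, c_r1=2.0, c_r2=3.0)
```

In this made-up example the simple join plan is cheaper for the remaining tuples, so the optimizer would switch to it at this point.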
There is a trade-off in determining the frequency of cost-update points. Checking too many points for cost update can lead to an unacceptably high overhead. In contrast, too few cost-update points may result in the loss of some optimization opportunities. For $K$ cost-update points, the overheads for probe queries A and B are depicted in Table. Trivially, $K$ is a function of $N(P_{cu_i})$, the number of plan switches, and their execution order. For now, we assume equal values of $N(P_{cu_i})$ for different $cu_i$, and we fix $N(P_{cu_i})$ at $X$ (see Sec.). However, we are investigating how to choose $K$ in order to strike a compromise between these trade-offs and impose a minimum overhead on the system.
Plan | Number of Probe Query A | Number of Probe Query B
$P_{sj}$ plan observed for all $K$ cost-update points
$P_j$ plan observed for all $K$ cost-update points | $K$
$P_j$ and $P_{sj}$ plans observed alternately
Table: Probe query overheads for adaptive optimization
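A minimal Python sketch of the equal-batch assumption is given below: the tuples are split into $K$ near-equal batches, a plan is picked at each cost-update point, and the number of plan switches is counted. The `pick_plan` callback stands in for the cost comparison and is purely hypothetical, as are the other names.

```python
def batch_sizes(total_tuples, k):
    """Split total_tuples into k batches whose sizes differ by at most one."""
    base, extra = divmod(total_tuples, k)
    return [base + (1 if i < extra else 0) for i in range(k)]


def run_adaptive(total_tuples, k, pick_plan):
    """Assign a plan to each batch and count how often the plan changes.

    pick_plan(i) returns "sj" or "j" for the i-th cost-update point; in a real
    optimizer it would compare the re-estimated costs of the two plans.
    """
    plans = [(pick_plan(i), size)
             for i, size in enumerate(batch_sizes(total_tuples, k))]
    switches = sum(1 for a, b in zip(plans, plans[1:]) if a[0] != b[0])
    return plans, switches


# Hypothetical run: semijoin for the first two batches, then simple join.
plans, switches = run_adaptive(9, 3, lambda i: "sj" if i < 2 else "j")
```

A larger `k` gives the optimizer more chances to react to changes, at the price of more re-evaluations and possibly more explicit probe queries.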
Performance Evaluation
As we argued in Sec., the run-time behavior is too unpredictable and sophisticated to be captured and analyzed by analytical models or simulations. Hence, we decided to implement a real experimental setup. We conducted a number of experiments to demonstrate the superiority of RTO over SQO for join queries. In these experiments, we first varied the workload on the two servers in order to simulate a heterogeneous environment and/or variable run-time behavior of the environment. Our experiments verified that RTO can adapt itself to workload changes and always chooses the best plan, while SQO's decision is static: a specific plan is always chosen independent of the load on the servers. Second, our experiments showed that even in the case of a balanced load, RTO outperforms SQO because it correctly captures both the communication cost and the overhead attributed to a specific implementation setup (e.g., RMI cost). We did not report experiments for variable network load because both of our join plans utilize the network almost identically, and hence a congested or free network will not result in preferring one plan to the other. We plan to do more experiments with other sorts of queries that give rise to plans utilizing the network differently. Finally, we did some experiments to investigate the overhead associated with probe queries and to quantify the reduction in overhead obtained by employing our enhanced version of RTO. In these experiments, we did not vary the run-time environment during query execution. Therefore, more experiments are required to study the effectiveness of our adaptive optimization technique.

Figure: Experimental setup (components shown: client, run-time optimizer, local server and remote server, each with an Informix server and the BUCKY database accessed through the Java API, RMI connections, and the Internet).
Experimental Setup
Fig. depicts our experimental setup, which consists of two sites that are not within a LAN but within the campus area network (CAN). The sites are Unix boxes with an identical hardware platform (a SUN Sparc Ultra model with MBytes of main memory and clock ticks/second speed). The buffer pool was kept at MB for the system. We intentionally chose not to have a large buffer pool, to prevent the database from becoming memory resident. This is because we wanted to study the effect of load over communication cost. Note that in our experiments we degrade the performance of one server by loading it with additional processes, thereby emulating an environment with heterogeneous servers. Each process increases server disk I/O by repeatedly running the Unix find command. The additional load is quantified by the number of these processes spawned on a server.
Each site runs an Informix Universal Server (IUS), which is an object-relational DBMS. The run-time optimizer and its different plans are implemented in Java. The run-time optimizer communicates with the database servers through the Java API, which is a library of Java classes provided by Informix. It provides access to the database and methods for issuing queries and retrieving results. From applications running on one site, Remote Method Invocation (RMI) is used to open a connection to the database server residing on the other site. The Credential class of RMI has a public constructor that specifies enough information to open a connection to a database server. Two types of credentials are used: Direct Credentials for local applications, and Remote Credentials to access the remote database server using typical HTTP credentials. The BUCKY database, from the BUCKY benchmark [CDN], was distributed across the two sites.
The queries are submitted to the local site and might require data to be shipped from the remote site. RTO resides at the local site and employs RMI and its HTTP credentials to access the remote site. We concentrate on two relations of BUCKY, TA and PROFESSOR. The TA relation (or $R_\ell$) resides at the local site, and the PROFESSOR relation (or $R_r$) resides at the remote site. For example, in a real-world university application, the information on faculty is kept at a site in human resources (the remote site), while the TA information is kept at, say, the computer science department's site (the local site). The number of tuples per relation residing on each site has been varied for our experiments. We fixed the total number of tuples, i.e., $N_\ell + N_r$. Without loss of generality and to simplify the experiments, we assumed no duplications in the relations.
The join query is "Find the Name, Street, City, State, Zipcode for every TA and his/her advisor"; in SQL:

    SELECT T.Name, T.Street, T.City, T.State, T.Zipcode,
           P.Name, P.Street, P.City, P.State, P.Zipcode
    FROM TA T, PROFESSOR P
    WHERE T.advisor = P.id
The size of the join attribute id/advisor (i.e., $S_{R.A}$) is bytes, and the sizes of the attributes Name, Street, City, State, and Zipcode of the PROFESSOR relation are and bytes, respectively. When the query is submitted through an interface (a Java applet running at the local site), the query optimizer consults the metadata to identify the location of the TA and PROFESSOR relations. RTO will then decide, using probe queries A and B, which plan to choose. We varied the number of tuples per relation (the X-axis of the reported graphs) and measured the response time of the join query in milliseconds for each tuple distribution (the Y-axis of the reported graphs). The X-axis is the percentage of the total number of tuples that belong to the TA relation residing at the local site, i.e., $N_\ell / (N_\ell + N_r)$. For comparison purposes, we also measured the response times of SQO, semijoin, and simple join for each experiment.
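To make the two candidate plans concrete, the following Python sketch mimics them with two in-memory SQLite databases standing in for the local and remote sites. It is not the Java/Informix implementation used in the experiments: the two-column schemas, the helper names, and the toy rows are all hypothetical simplifications.

```python
import sqlite3


def make_site(ddl, insert_sql, rows):
    """Create an in-memory database standing in for one site."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn


def local_join(local, shipped):
    """Final join at the local site between TA and the shipped tuples."""
    local.execute("CREATE TEMP TABLE shipped (id, name)")
    local.executemany("INSERT INTO shipped VALUES (?, ?)", shipped)
    rows = local.execute(
        "SELECT T.name, P.name FROM TA T, shipped P "
        "WHERE T.advisor = P.id ORDER BY T.name"
    ).fetchall()
    local.execute("DROP TABLE shipped")
    return rows


def simple_join_plan(local, remote):
    # Ship every PROFESSOR tuple to the local site, then join locally.
    shipped = remote.execute("SELECT id, name FROM PROFESSOR").fetchall()
    return local_join(local, shipped), len(shipped)


def semijoin_plan(local, remote):
    # Send the join values to the remote site, join there, and ship back
    # only the matching PROFESSOR tuples for the final local join.
    keys = local.execute("SELECT DISTINCT advisor FROM TA").fetchall()
    remote.execute("CREATE TEMP TABLE sent_keys (advisor)")
    remote.executemany("INSERT INTO sent_keys VALUES (?)", keys)
    reduced = remote.execute(
        "SELECT P.id, P.name FROM PROFESSOR P, sent_keys S "
        "WHERE P.id = S.advisor"
    ).fetchall()
    remote.execute("DROP TABLE sent_keys")
    return local_join(local, reduced), len(reduced)


local_site = make_site(
    "CREATE TABLE TA (name, advisor)", "INSERT INTO TA VALUES (?, ?)",
    [("ann", 1), ("bob", 2), ("cy", 1)])
remote_site = make_site(
    "CREATE TABLE PROFESSOR (id, name)", "INSERT INTO PROFESSOR VALUES (?, ?)",
    [(1, "p1"), (2, "p2"), (3, "p3"), (4, "p4")])

rows_j, shipped_j = simple_join_plan(local_site, remote_site)
rows_sj, shipped_sj = semijoin_plan(local_site, remote_site)
```

Both plans return the same join result; they differ in how many tuples cross between the sites and in how many remote operations are needed, which is what the run-time cost terms are meant to capture.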
Results
For the first set of experiments, we compared the performance of SQO and RTO when the two servers are equally loaded. In this case, one expects to see similar performance for SQO and RTO. However, as seen in Fig., RTO (the dotted line) always chooses the correct plan by switching from the semijoin to the simple join plan at tuple distribution. Instead, SQO (the solid line) wrongly continues preferring semijoin to simple join until of tuple distribution. That is, when the difference between $N_\ell$ and $N_r$ is significant, both optimizers can correctly determine the best plan. The decision becomes more challenging when $N_\ell$ and $N_r$ have values with marginal differences. SQO prefers semijoin because it overestimates the communication cost of simple join due to Eq. RTO, however, realizes that communication cost is affected not only by the amount of data shipped but also by other factors, and hence simple join, which ships a larger volume of data, might not be as bad as expected. In this situation, the cost of remote invocation (see Sec.) impacts semijoin more than simple join. By capturing these facts and amortizing the associated cost by incorporating $C_{r1}$ and $C_{r2}$ into its equations, RTO detects the superiority of simple join to semijoin after tuple distribution. Note that switching at the point of tuple distribution cannot be generalized by a static optimizer, because it is very much dependent on our experimental setup and the participating BUCKY relations. This is exactly why a run-time optimizer is required.
Figure: Response time of a join query for different plans. X-axis: $N_\ell/(N_\ell+N_r)$, fraction of tuples on the local site; Y-axis: response time of a join query (milliseconds, $\times 10^5$); curves: Semi-join, Simple join, SQO Plan, RTO Plan.
For this experiment, RTO outperformed SQO by an average of. Meanwhile, RTO incurred an average of extra delay as compared to the optimal plan due to the overhead of probe queries. We further reduced this marginal overhead of RTO by employing our enhanced RTO (see Sec.). As expected, when the optimal plan is simple join, the overhead cannot be avoided and both versions of RTO behaved almost identically. However, an average reduction of in overhead was observed for the cases where semijoin was the optimal plan.
In the second set of experiments, we spawned some processes performing I/Os in cycles on the local server. Fig. a and b demonstrate the performance of the different optimizers when and processes are activated on the local site, respectively. Recall that simple join performs one heavy join operation at the local site. Therefore, as the local site becomes more loaded, simple join becomes a less attractive plan. This behavior is illustrated in Fig. a and b, where the switching point (the point at which simple join starts to outperform semijoin) shifts to the right (also see Fig.). Since SQO does not take the server workload into consideration, it always performs identically, independent of the load. RTO, on the other hand, captures the server load and hence switches to the superior plan exactly at the switching points (see Fig.). In Fig., observe how the query response time increases as we activate more processes on the local server. Note that the variable load on servers can also be interpreted as if the local server is a low-end system compared to a high-end remote server. Therefore, RTO can also capture the heterogeneity of servers.

Figure: Impact of load on the local server. Panels: (a) processes running on local site, (b) processes running on local site. X-axis: $N_\ell/(N_\ell+N_r)$, fraction of tuples on the local site; Y-axis: response time of a join query (milliseconds, $\times 10^6$); curves: Semi-join, Simple join, SQO Plan, RTO Plan.
Figure: Adaptation of RTO to workload changes. X-axis: $N_\ell/(N_\ell+N_r)$, fraction of tuples on the local site; Y-axis: response time of a join query (milliseconds, $\times 10^5$); curves: RTO (No Process), RTO (10 Processes), RTO (15 Processes), with break points (BP) marked.
Finally, to show that the impact of load on the remote server and the local server is not symmetrical, we activated some processes on the remote server (see Fig.). The first impression is that, since semijoin utilizes the remote server more than simple join, the switching point should shift to the left (the reverse behavior compared to the previous set of experiments). That is, as one increases the load on the remote server, the simple join plan should outperform semijoin sooner. However, as illustrated in Fig., this is not the case. The reason is that an overloaded remote server will send data to the local server at a lower rate (this is due to the impact of the $Delay_{send}(X)$ and $Delay_{receive}(X)$ factors in Eq.). Therefore, the simple join plan will suffer as well. The beauty of our technique is that RTO does not need to take all these arguments into consideration in order to decide which plan to choose. The probe queries, by measuring $C_{r1}$ and $C_{r2}$, automatically capture all these behaviors. Therefore, as depicted in Fig., RTO can still choose the optimal plan.
Figure: Impact of load on the remote server. X-axis: $N_\ell/(N_\ell+N_r)$, fraction of tuples on the local site; Y-axis: response time of a join query (milliseconds, $\times 10^5$); curves: SQO Plan and RTO Plan with no process, 5 processes, and 10 processes.
Conclusions and Future Directions
By implementing a sample distributed database system consisting of heterogeneous servers running a homogeneous DBMS and connecting them via the Internet, the importance and effectiveness of run-time optimizations have been demonstrated. Our run-time join optimizer (RTO) issues two probe queries, striving to estimate the costs of the semijoin and simple join plans. By measuring the performance of the probe queries and analyzing the results, RTO selects an optimal plan, taking into account the run-time behavior of the environment at the time of query execution. We demonstrated through analysis and experiments that our RTO can capture the communication delay, server workload, and other hidden costs specific to a certain implementation (i.e., RMI cost in our case). This is achieved without making any assumptions or attempts to model the chaotic behavior of the Internet-based environment. As a by-product, our RTO is less sensitive to statistical anomalies than SQO. Furthermore, RTO relies less on the remote relation profiles than SQO, since most of this information is captured during the probing process as a by-product. Therefore, it becomes a better candidate for query optimization in multidatabase systems, where the profiles resident on one site are not readily accessible to other sites. Finally, we proposed an adaptive optimization technique with RTO that captures sudden changes of the run-time environment during the execution of a query.

We intend to extend this work in four directions. First, we want to extend our experimental setup to multiple sites in order to extend RTO to support k-way joins. Second, we want to populate our object-relational database with multimedia data types in order to compare CPU-intensive plans with communication-intensive ones. We expect that here network congestion would have a high impact on preferring one plan to the other. Third, we would like to implement our adaptive optimization technique (see Sec.) in order to capture sudden changes in run-time behavior. Finally, we want to run other DBMS software (e.g., DB2 and Oracle) on some of the sites in order to study our optimizer in a multidatabase environment.
References
[ABF] L. Amsaleg, P. Bonnet, M. Franklin, A. Tomasic, and T. Urhan. Improving Responsiveness for Wide-area Data Access. Data Engineering, September.
[AHY] P. Apers, A. Hevner, and S. B. Yao. Algorithms for Distributed Queries. IEEE Transactions on Software Engineering, January.
[Ant] Gennady Antoshenkov. Dynamic Query Optimization in Rdb/VMS. In Proc. of IEEE Intl. Conference on Data Engineering (ICDE).
[BGW] P. Bernstein, N. Goodman, E. Wong, C. Reeve, and J. Rothnie. Query Processing in a System for Distributed Databases. ACM Transactions on Database Systems (TODS), December.
[BPR] P. Bodorik, J. Pyra, and J. Riordon. Correcting Execution of Distributed Queries. In Proceedings Second Intl. Symp. on Databases in Parallel and Distributed Systems, July.
[BRJ] P. Bodorik, J. Riordon, and C. Jacob. Dynamic Distributed Query Processing Techniques. In Proc. of ACM CSC Conference, February.
[BRP] P. Bodorik, J. Riordon, and J. Pyra. Deciding to Correct Distributed Query Processing. IEEE Transactions on Knowledge and Data Engineering, June.
[CDN] Michael Carey, David DeWitt, Jeffrey Naughton, Mohammad Asgarian, Paul Brown, Johannes Gehrke, and Dhaval Shah. The BUCKY Object-Relational Benchmark. In Proc. of ACM SIGMOD, May.
[CG] Richard L. Cole and Goetz Graefe. Optimization of Dynamic Query Evaluation Plans. In Proc. of ACM SIGMOD.
[CL] A. L. P. Chen and Victor O. K. Li. Improvement Algorithms for Semijoin Query Processing Programs in Distributed Database Systems. IEEE Transactions on Computers, November.
[CP] S. Ceri and G. Pelagatti. Distributed Databases: Principles and Systems. McGraw-Hill.
[CY] M. Chen and P. Yu. Interleaving a Join Sequence with Semijoins in Distributed Query Processing. IEEE Transactions on Parallel and Distributed Systems, September.
[EDNO] C. Evrendilek, A. Dogac, S. Nural, and F. Ozcan. Multidatabase Query Optimization. Journal of Distributed and Parallel Databases, January.
[Far] Jim Farley. Java Distributed Computing. O'Reilly.
[HS] Peter J. Haas and Arun N. Swami. Sampling-based selectivity estimation for joins using augmented frequent value statistics. In Proc. of IEEE Intl. Conference on Data Engineering (ICDE), March.
[Inf] Informix. Informix Universal Server: Informix Guide to SQL: Syntax.
[KYY] Y. Kambayashi, M. Yoshikawa, and S. Yajima. Query Processing for Distributed Databases using Generalized Semijoin. In ACM SIGMOD International Conference on Management of Data, San Jose, CA, May.
[ML] Lothar F. Mackert and Guy M. Lohman. R* optimizer validation and performance evaluation for distributed queries. In Proc. of Intl. Conference on Very Large Data Bases (VLDB), August.
[Olk] F. Olken. Random Sampling from Databases. PhD thesis, University of California, Berkeley.
[ONK] F. Ozcan, Sena Nural, Pinar Koksal, C. Evrendilek, and Asuman Dogac. Dynamic Query Optimization in Multidatabases. Data Engineering, September.
[Pax] Vern Paxson. Measurements and Analysis of End-to-End Internet Dynamics. PhD thesis, University of California, Berkeley, April. http://www-nrg.ee.lbl.gov/nrg-papers.html
[PF] V. Paxson and S. Floyd. Why We Don't Know How to Simulate the Internet. In Proc. of the Winter Simulation Conference, September.
[Ree] George Reese. Database Programming with JDBC and Java. O'Reilly.
[RK] N. Roussopoulos and H. Kang. A pipe n-way join algorithm based on the 2-way semijoin program. IEEE Transactions on Knowledge and Data Engineering.
[SKS] A. Silberschatz, H. Korth, and S. Sudarshan. Database System Concepts. McGraw-Hill.
[UFA] T. Urhan, M. Franklin, and L. Amsaleg. Cost-based Query Scrambling for Initial Delays. In Proc. of ACM SIGMOD, June.
[YM] Clement Yu and Weiyi Meng. Principles of Database Query Processing for Advanced Applications. Morgan Kaufmann.
[ZL] Qiang Zhu and Per-Åke Larson. A query sampling method for estimating local cost parameters in a multidatabase system. In Proc. of IEEE Intl. Conference on Data Engineering (ICDE), February.
Description
Latifur Khan, Dennis McLeod, and Cyrus Shahabi. "An adaptive probe-based technique to optimize join queries in distributed internet databases." Computer Science Technical Reports (Los Angeles, California, USA: University of Southern California. Department of Computer Science) no. 721 (2000).
Asset Metadata
Creator
Khan, Latifur
(author),
McLeod, Dennis
(author),
Shahabi, Cyrus
(author)
Core Title
USC Computer Science Technical Reports, no. 721 (2000)
Alternative Title
An adaptive probe-based technique to optimize join queries in distributed internet databases (
title
)
Publisher
Department of Computer Science,USC Viterbi School of Engineering, University of Southern California, 3650 McClintock Avenue, Los Angeles, California, 90089, USA
(publisher)
Tag
OAI-PMH Harvest
Format
22 pages
(extent),
technical reports
(aat)
Language
English
Unique identifier
UC16270554
Identifier
00-721 An Adaptive Probe-based Technique to Optimize Join Queries in Distributed Internet Databases (filename)
Legacy Identifier
usc-cstr-00-721
Rights
Department of Computer Science (University of Southern California) and the author(s).
Internet Media Type
application/pdf
Copyright
In copyright - Non-commercial use permitted (https://rightsstatements.org/vocab/InC-NC/1.0/
Source
20180426-rozan-cstechreports-shoaf
(batch),
Computer Science Technical Report Archive
(collection),
University of Southern California. Department of Computer Science. Technical Reports
(series)
Access Conditions
The author(s) retain rights to their work according to U.S. copyright law. Electronic access is being provided by the USC Libraries, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
USC Viterbi School of Engineering Department of Computer Science
Repository Location
Department of Computer Science. USC Viterbi School of Engineering. Los Angeles\, CA\, 90089
Repository Email
csdept@usc.edu
Inherited Values
Title
Computer Science Technical Report Archive
Coverage Temporal
1991/2017