APPROXIMATE QUERY ANSWERING IN UNSTRUCTURED PEER-TO-PEER DATABASES

by

Farnoush Banaei-Kashani

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

May 2007

Copyright 2007 Farnoush Banaei-Kashani

Epigraph

"An approximate answer to the exact problem is worth far more than the exact answer to an approximate problem."
— John Wilder Tukey, Statistician, 1915-2000

Dedication

To My Parents

Contents

Epigraph
Dedication
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Motivation and Problem Statement
    1.1.1 Peer-to-Peer Databases
    1.1.2 Approximate Query Answering in Peer-to-Peer Databases
      1.1.2.1 Set-Valued Queries
      1.1.2.2 Aggregate Queries
  1.2 Digest: An Approximate Query Answering System
    1.2.1 Set-Valued Query Answering Component
    1.2.2 Aggregate Query Answering Component
  1.3 Road Map

Chapter 2: Related Work
  2.1 Set-Valued Queries
  2.2 Aggregate Queries

Chapter 3: Approximate Set-Valued Query Answering
  3.1 Partial Query Processing Engine
    3.1.1 Main Design
      3.1.1.1 Query and Service Model
      3.1.1.2 Partial Query Processing
    3.1.2 Extensions
      3.1.2.1 Absolute Approximate Query Model
      3.1.2.2 Progressive Query Answering Service Model
  3.2 Partial Read Operation: DBParter
    3.2.1 Definitions
      3.2.1.1 Communication Graph
      3.2.1.2 Efficiency Measures for Sampling
    3.2.2 DBParter: Partial Read by Epidemic Dissemination
      3.2.2.1 Epidemic Dissemination Based on the SIR Disease Spreading Model
      3.2.2.2 Percolation Model for Epidemic Dissemination
      3.2.2.3 Tuning Epidemic Dissemination
        3.2.2.3.1 Definitions
        3.2.2.3.2 Analysis
    3.2.3 A Real-World Example of DBParter
      3.2.3.1 Network Topology
      3.2.3.2 Analysis
      3.2.3.3 Algorithm
    3.2.4 Variants of DBParter
  3.3 Empirical Study
    3.3.1 Methodology
    3.3.2 Results
  3.4 Future Work

Chapter 4: Approximate Aggregate Query Answering
  4.1 Sample-based Query Processing Engine
    4.1.1 Approximate Continuous Query
    4.1.2 Overview
    4.1.3 Sample-based Query Evaluation
      4.1.3.1 Continual Querying
      4.1.3.2 Approximate Querying
        4.1.3.2.1 Independent Sampling
        4.1.3.2.2 Repeated Sampling
  4.2 Random Sampling Operator: DBSampler
    4.2.1 Algorithm
    4.2.2 Forwarding Probabilities
    4.2.3 Convergence Time
  4.3 Overcoming the Limitations of Conventional Sampling Design
    4.3.1 Snapshot Approximate Aggregate Query
    4.3.2 Conventional Sampling
    4.3.3 Sampling with Unknown Parameters
      4.3.3.1 Double Sampling
      4.3.3.2 Sequential Sampling
    4.3.4 Sampling with Skewed Data Distribution
      4.3.4.1 Cluster Sampling
      4.3.4.2 Inverse Sampling
    4.3.5 Universal Sampling
  4.4 Empirical Study
    4.4.1 Query Answering
      4.4.1.1 Experimental Methodology
      4.4.1.2 Experimental Results
        4.4.1.2.1 Effect of the Extrapolation Algorithm
        4.4.1.2.2 Effect of the Repeated Sampling Algorithm
        4.4.1.2.3 Overall Efficiency of Digest
    4.4.2 Sampling Designs
      4.4.2.1 Experimental Methodology
      4.4.2.2 Experimental Results
        4.4.2.2.1 Query Answering with Unknown Parameters
        4.4.2.2.2 Query Answering with Skewed Data Distribution
  4.5 Future Work

References

Appendix: Similarity Search in Structured Peer-to-Peer Databases
  A Formal Definition of the Problem
    A.1 Data and Query Model
    A.2 Efficiency Measures for Peer-to-Peer Database Access Methods
  B SWAM: Small-World Access Methods
    B.1 Small-World as an Index Structure
    B.2 SWAM Family
    B.3 SWAM-V: A Voronoi-based SWAM

List of Tables

4.1 Regular and Regression Estimators at the 2nd Occasion
4.2 Regular and Regression Estimators at the k-th Occasion
4.3 Parameters of the Datasets
4.4 Parameters of the Datasets
List of Figures

1.1 Two-tier query answering framework for peer-to-peer databases: distributed data retrieval from the network (bottom tier) followed by local processing of the retrieved data at the querying node (top tier)
1.2 Digest: Two-tier approximate query answering system for peer-to-peer databases
3.1 Digest User Interface
3.2 Answering Partial Queries with Digest
3.3 Parameter Mapping Process for Tuning a Generic Dissemination Mechanism
3.4 Communication graph
3.5 State Diagram for the SIR Spreading Model
3.6 Site Percolation Problem
3.7 Bond Percolation Problem
3.8 Critical Probability in Power-Law Peer-to-Peer Databases
3.9 Verification of the Analytical Results
3.10 DBParter vs. Random Walk
3.11 DBParter vs. Scope-Limited Flooding
4.1 Fixed-Precision Approximate Continuous Aggregate Query
4.2 Two-Tier Architecture of Digest
4.3 Computing $t^u_{i+1}$ by Polynomial Extrapolation: at $t = t^u_{i+1}$, we have $|\Delta P_n[t^u_{i+1}]| + |R_n[t^u_{i+1}]| > \delta$
4.4 A number of snapshot approximate aggregate queries ($\hat{Q}_{t_i}$) and their corresponding exact queries ($Q_{t_i}$) over a sequence of times $t_i$. The value of $p$ for the approximate queries is not shown. Note that the result of the approximate query $\hat{Q}_{t_7}$ lies outside the desired confidence interval, suggesting that if $p < 1$, there is a positive probability that the estimated result exceeds the confidence limits.
4.5 Query Evaluation by Double Sampling
4.6 Query Evaluation by Sequential Sampling
4.7 Query Evaluation by Cluster Sampling
4.8 Query Evaluation by Inverse Sampling
4.9 Query Evaluation by Universal Sampling
4.10 Effect of the Extrapolation Algorithm
4.11 Effect of the Repeated Sampling Algorithm
4.12 Efficiency of Digest in Number of Samples
4.13 Efficiency of Digest in Communication Cost
4.14 Efficiency of the Digest Sampling Designs
5.1 Reducing the General Peer-to-Peer Database Model
5.2 Small-World Model
5.3 Partitioning of Key Space
5.4 SWAM-V Index Structure

Abstract

Peer-to-peer networks are considered the new generation of distributed databases, termed peer-to-peer databases.
Given the very large size, open architecture, and extreme dynamism and autonomy of these databases, approximate query answering is arguably the most promising approach for query answering in peer-to-peer databases. To enable approximate query answering, in this dissertation we propose a set of universal sampling operations specifically designed for probing the data in peer-to-peer databases. We complement these operations at the bottom tier with a set of approximate query processing techniques at the top tier to develop a two-tier system for answering both set-valued and aggregate queries in peer-to-peer databases.

Chapter 1: Introduction

1.1 Motivation and Problem Statement

1.1.1 Peer-to-Peer Databases

A peer-to-peer network consists of a variable set of autonomous nodes that federate to pool their data and share the data among themselves. Each individual node dynamically creates, collects, and/or stores its own data, and communicates with other nodes via an evolving overlay network that interconnects the nodes. It is through this overlay network that a node can query the peer-to-peer database, i.e., the pool of data distributed among all the nodes (see Figure 1.1). The query is disseminated through the network to retrieve the relevant data (bottom tier), and the retrieved data is processed locally at the querying node to generate the final answer (top tier).

[Figure 1.1: Two-tier query answering framework for peer-to-peer databases: distributed data retrieval from the network (bottom tier) followed by local processing of the retrieved data at the querying node (top tier)]

In such a database, each node is both a provider and a user of the data; hence, nodes are peers in functionality. Peer-to-peer file-sharing networks such as Kazaa [Sha05] are familiar (but simplistic) examples of peer-to-peer databases. With dynamic data-set, node-set, and topology, peer-to-peer databases are considered the new generation of distributed databases, with significantly less restrictive assumptions than their ancestors.

Our description of peer-to-peer databases above partly overlooks the so-called structured peer-to-peer databases, which are enabled for in-network query processing [HLL+03]. Structured peer-to-peer databases are organized with distributed index structures such as Distributed Hash Tables [RFH+01, SMK+01, RD01] that allow distributed query processing within the network. With such a capability, the local processing at the querying node (top tier in Figure 1.1) can be entirely delegated to the network (bottom tier), merging the two tiers into one distributed data retrieval and processing tier. However, in many applications of peer-to-peer database systems, the considerable churn and autonomy of the nodes make structuring the database by constructing and maintaining distributed index structures inefficient and/or impossible [LHH+04, She04b]. (An overview of our previous work on distributed query processing in structured peer-to-peer databases is included as an appendix to this dissertation for reference.) In this dissertation, we focus on unstructured peer-to-peer databases, which forgo any global structure. In unstructured peer-to-peer databases the data is not indexed; therefore, in analogy with a sequential scan in regular unindexed databases, a query is inevitably executed by disseminating it throughout the network of nodes and evaluating it in situ at each visited node to retrieve the relevant data for further processing at the querying node. Hereafter in this manuscript, "peer-to-peer databases" refers to unstructured peer-to-peer databases, unless otherwise stated.
1.1.2 Approximate Query Answering in Peer-to-Peer Databases

We argue that exact query answering is neither efficient nor necessary in peer-to-peer databases, and instead advocate approximate query answering as the most promising approach for query answering. By definition, a peer-to-peer database is an open computing environment characterized by dynamism and large scale. The former results in uncertainty in the availability and reachability of the data, and the latter increases the time and cost of data access. Respectively, in such an environment exact query answering, i.e., answering a query by returning a complete/precise result, is either logically impossible or extremely inefficient in terms of query time and query cost (e.g., the communication cost of disseminating the query to the entire network). On the other hand, since peer-to-peer databases are inherently open and dynamic environments, the conventional querying scheme for these databases is exploratory querying. Users often issue several back-to-back queries, each time revising and enhancing the query based on cursory observation of the results of the previous query, just to explore the unknown content of the database and narrow down their search for available useful data. Even when they find their desired formulation of the query, an exact result is most often unnecessary and redundant. Approximate query answering eliminates this redundancy to achieve efficiency.

With approximate query answering, it is assumed that queries (now called approximate queries) are satisfied by an approximate result. Users can specify the required completeness or precision (for set-valued queries and aggregate queries, respectively) of the approximate result that satisfies the query. Correspondingly, the database system uses approximate query answering techniques to reduce the load of data retrieval and query processing to the least that suffices to satisfy the approximate query, effectively enhancing the efficiency of query answering by eliminating the redundancy of the result. We argue that approximate query answering is not only sufficient to answer peer-to-peer database queries but, considering the enormous size of peer-to-peer databases and the abundance of their users, an inevitable approach to developing an efficient, scalable query answering system for peer-to-peer databases. It is important to note that approximate query answering is a generalization of exact query answering and does not preclude exact queries where they are required: an exact query is a specific and extreme case of an approximate query, with a complete/precise result.

According to the two-tier query answering framework (see Figure 1.1), approximate query answering consists of two parts: approximate data retrieval at the bottom tier, for partial retrieval of the relevant data, and approximate query processing at the top tier, for further processing of the retrieved approximate data to generate the final approximate result. Unlike in traditional databases, the main challenge with approximate query answering in peer-to-peer databases is with approximate data retrieval at the bottom rather than approximate query processing on top.
In traditional database systems, data is retrieved from a local secondary storage device such as a disk, which trivially supports approximate data retrieval requests received from the query processing engine on top (e.g., "retrieve 20% of the tuples in relation $R$"). Consequently, with these databases most research efforts on approximate query answering are focused on developing approximate query processing techniques. A large body of such techniques has been developed, and with adequate modifications some of them can be adopted for approximate query processing (i.e., the top tier) in peer-to-peer databases. Conversely, with peer-to-peer databases, since data are to be retrieved from a network with a dynamic, distributed set of nodes, fulfilling even the most primitive approximate data retrieval requests poses a challenge. We term the approximate data retrieval operations for peer-to-peer databases the data sampling operations, or simply sampling operations, as they are used to sample (probabilistically or nonprobabilistically, as discussed below) the data content of the peer-to-peer database.

In this dissertation, our agenda is to design essential sampling operations that enable approximate query answering in peer-to-peer databases. Moreover, we complement these operations at the bottom tier with a set of approximate query processing techniques at the top tier, in order to develop an inclusive two-tier approximate query answering system for peer-to-peer databases. We consider the two main types of queries, i.e., both set-valued queries and aggregate queries. In the rest of this section, we discuss the challenges in answering each type of query in more detail.

1.1.2.1 Set-Valued Queries

Assuming a relational data model for peer-to-peer databases, a set-valued query is defined as a query that takes one or more relations as input and returns a single relation (which is a set of tuples, hence the term "set-valued") as the result. In peer-to-peer databases set-valued queries are often one-time queries, i.e., they are issued and executed once without anticipation of further repetition. An approximate set-valued query is an arbitrary set-valued query with a user-defined completeness ratio $\epsilon \in [0,1]$; any fraction $\epsilon$ of the complete result-set $X$, denoted $X^{(\epsilon)}$, is sufficient to satisfy the approximate query. An exact query is a specific case of an approximate query with $\epsilon = 1$.

An approximate set-valued query is also called a partial query. To answer a partial query, first the peer-to-peer database should be sampled to retrieve a sufficiently large fraction $\alpha_i$ of each input relation $R_i$. Subsequently, the partial input relations $R_i^{(\alpha_i)}$ are processed using partial query processing techniques (described in Sections 2.1 and 3.1.1.2) to generate the final partial result-set $X^{(\epsilon)}$. As mentioned above, the main challenge with answering partial queries is devising a sampling operation that, given $\alpha \in [0,1]$, retrieves a fraction $\alpha$ of an input relation $R$ (i.e., a partial relation $R^{(\alpha)}$), where $R$ is horizontally fragmented among the nodes of the peer-to-peer database. To distinguish this sampling operation from a probability-sampling operation, we term this operation the partial read operation. Partial read is a nonprobability-sampling operation, because it is not expected to guarantee randomness of the sample, but only the correct size of the sample. Probability-sampling is required for answering aggregate queries (see the next section).
To be applicable, the partial read operation must be both correct and efficient. A partial read is correct if $|R'|/|R| \geq \alpha$, where $R'$ is the retrieved portion of the input relation $R$. This is a sufficient condition for correctness of the partial read; ideally, the redundancy of the data retrieval is minimal, i.e., $|R'|/|R| = |R^{(\alpha)}|/|R| = \alpha$. The peer-to-peer database is a fragmented database which is randomly (and not necessarily uniform-randomly) distributed among a variable set of nodes with unknown population, where each node may autonomously decide to leave the network or refrain from participating in the distributed data retrieval at any time. Under such circumstances, to be correct the partial read operation must implement a distributed query dissemination scheme that visits a sufficient number of nodes and retrieves a sufficient number of tuples, such that in expectation the collective set of retrieved tuples $R'$ satisfies the correctness condition with high probability. On the other hand, the efficiency of the partial read operation is defined in terms of the sampling time and sampling cost, which map to the total communication time and communication cost of disseminating the query for distributed data retrieval. There is a trade-off between sampling time and sampling cost, with higher cost required to achieve shorter time and vice versa. An efficient partial read operation satisfies the ideal case of the correctness condition with an optimally balanced sampling time versus sampling cost.

1.1.2.2 Aggregate Queries

While set-valued queries allow access to exemplar data items, aggregate queries enable study of the aggregate characteristics of the peer-to-peer database (both the data and the network) as a whole. Aggregate queries quantify these aggregate characteristics by evaluating corresponding aggregate functions over the database. Typical aggregate functions range from the basic database aggregate functions (AVERAGE, COUNT, SUM, MIN, and MAX) to more complex functions such as statistical functions (e.g., STANDARD DEVIATION, VARIANCE, MOMENT), ranking functions (e.g., MEDIAN, QUANTILE), and summaries (e.g., HISTOGRAM, CUBE). Unlike set-valued queries, in peer-to-peer databases aggregate queries are often continuous (and/or repeated) queries mainly used for monitoring purposes. A continuous query is a long-running query that is repeatedly evaluated to reflect changes of the database in the result. Therefore, the result of the aggregate query is a so-called running aggregate value, which is continuously updated. With an approximate aggregate query, the running result of the query is an estimate of the actual (exact) running aggregate value, such that at any point in time the absolute error of the estimate is bounded within the confidence interval $[-\epsilon, \epsilon]$ with probability $p$. The probability $p$ determines the so-called confidence level of the estimate, and $\epsilon$ defines the upper and lower confidence limits.

There are various approaches for answering approximate aggregate queries (see Section 2.2 for a brief review). One of the most fundamental approaches, which enables other approaches and best applies to peer-to-peer databases, is sample-based query answering. With this query answering approach, the database is represented by a proper random sample retrieved from the database. (Here, by a "random" sample we refer to the general definition of a probability-sample: each element of the sample space is included in the retrieved sample with a known, required probability.)
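To make the precision contract above concrete, consider how $\epsilon$ and $p$ translate into a sample size. The following sketch is our illustration, not part of Digest: it assumes simple random sampling, a known population standard deviation $\sigma$, and a normal approximation, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def required_sample_size(epsilon: float, p: float, sigma: float) -> int:
    """Smallest n such that P(|sample mean - true mean| <= epsilon) >= p,
    under a normal approximation with known population std. dev. sigma."""
    z = NormalDist().inv_cdf((1 + p) / 2)   # two-sided critical value
    return max(1, math.ceil((z * sigma / epsilon) ** 2))
```

For instance, with $\sigma = 10$, $\epsilon = 1$, and $p = 0.95$, the sketch yields $n = 385$. Digest's actual sampling designs (Chapter 4) go further precisely because, in practice, $\sigma$ and related parameters are unknown.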
Subsequently, at the query processing phase, sample-based estimation techniques can be applied to estimate the aggregate characteristics of the database, based on the corresponding characteristics of the sample, with guaranteed precision (or confidence). However, to guarantee the precision of the estimation these techniques strictly rely on the randomness of the sample (uniform or nonuniform, as required). Therefore, unlike the partial read operation for answering set-valued queries, the sampling operation for answering aggregate queries must be a probability-sampling operation. We call a probability-sampling operation designed for sampling from peer-to-peer databases the random sample operation.

The random sample operation must be both correct and efficient. A random sample operation is correct if it retrieves samples that 1) are of the designated size, and 2) include each element of the sample space with the required probability. Unlike a partial read sample, a random sample is of absolute size (e.g., 1000 tuples rather than 10% of the tuples), which is selected mostly independently of the size of the database; hence, even though the peer-to-peer database is dynamically evolving, ensuring the correct sample size is straightforward. However, considering the distributed and dynamic nature of peer-to-peer databases, guaranteeing the correct randomness of the sample (i.e., whether, for various sample designs, each data element in the sample space is visited and included in the sample with the proper likelihood) is challenging. As far as efficiency is concerned, since with the randomness requirement the sampling time of a random sample operation is intrinsically high, the efficiency of the random sample operation is mainly evaluated based on its sampling cost (communication cost).

1.2 Digest: An Approximate Query Answering System

In this dissertation we propose Digest, a two-tier system for answering approximate set-valued queries and approximate aggregate queries in peer-to-peer databases (see Figure 1.2). For each of these two types of queries, we devise appropriate sampling operations for data retrieval at the bottom tier, and complement them with a suite of query processing techniques at the top tier to develop a comprehensive system for approximate query answering.

[Figure 1.2: Digest, the two-tier approximate query answering system for peer-to-peer databases. At the top tier, the set-valued component comprises a partial query processing engine and the aggregate component a sample-based query processing engine; at the bottom tier, they rely on the partial read operations (DBParter) and the random sampling operations (DBSampler), respectively, over the peer-to-peer network.]

In the rest of this section, we present an overview of the two query answering components of Digest, i.e., the set-valued query answering component and the aggregate query answering component.

1.2.1 Set-Valued Query Answering Component

For partial set-valued query processing at the top tier, we adopt and extend the partial query processing techniques originally developed for traditional databases. Details of the customized query processing techniques are presented in Section 3.1. For partial read at the bottom tier, we propose the sampling operation DBParter. To sample the database, DBParter spreads sampler agents throughout the network, beginning from the originator of the set-valued query.
While spreading, the agents inspect the nodes of the network to locate and retrieve a fraction $\alpha$ of the tuples of the input relation $R$. With DBParter, the spread of the agents is modelled on the epidemic dissemination of diseases in social networks. (In the literature, gossip-based or rumor-based spreading techniques are also sometimes termed epidemic techniques [DGH+87, KSSV00, KDG03, BGPS05]; here, we are not referring to such many-to-many communication techniques, but specifically to disease spreading models such as SIR (Susceptible-Infected-Removed) and SIS (Susceptible-Infected-Susceptible).) With epidemic dissemination, agent spreading is probabilistic: when a node receives a sampler agent, it replicates and forwards the agent to each of its neighbors with a forwarding probability $p$ (where $0 \leq p \leq 1$). Therefore, a node may forward replicas of the agent to zero or more neighbors. Such an agent forwarding algorithm subsumes agent forwarding by both flooding [Lim05] and random walk [LCC+02]. The communication graph of the epidemic dissemination (i.e., the subgraph of the network which is covered by the disseminated agents) grows with larger values of $p$, such that with $p = 1$ the epidemic dissemination is equivalent to regular flooding, which covers the entire network.

DBParter specifically implements SIR (Susceptible-Infected-Removed), a classic epidemic dissemination model [Het00]. Our main contribution with DBParter is the derivation of a closed-form formula that, given a partial read request, maps the value of the completeness ratio $\alpha$ to an appropriate value of the forwarding probability $p$ such that the request is correctly satisfied. Leveraging this derivation, DBParter tunes $p$ on the fly, per read request, based on $\alpha$, such that the communication graph grows just large enough to cover a fraction of the database that satisfies the partial read request. For partial read requests with small $\alpha$ the communication graph is sparse, and as $\alpha$ increases the graph becomes denser. Since both the communication cost and the communication time of the sampling increase in proportion to the size of the communication graph, partial read requests with higher completeness ratios are more expensive, as expected.

DBParter satisfies the ideal case of the correctness condition for the partial read operation while the size of the peer-to-peer network is unknown, the set of available nodes is dynamically changing, and the data are randomly distributed (not necessarily uniform-randomly) among the nodes. First, assume a static network where data are uniformly distributed among the nodes. As we show in Section 3.2.2.3, in such a network, with each value of $p$ DBParter covers a certain fixed fraction of the network nodes, independent of the size of the network. In other words, for the same value of $p$, the size of the communication graph is always proportional to the size of the entire network, such that its relative size (i.e., the covered fraction of the network) is fixed. Intuitively, this occurs because, unlike flooding and random walk, with which a sampler agent never dies unless it is explicitly terminated (e.g., when its TTL expires), with epidemic dissemination agent forwarding is probabilistic, and with some non-zero probability each replica of the agent may naturally die at each step. The dissemination terminates whenever all replicas of an agent die. The larger the network, the longer it takes for the dissemination to die out and, therefore, the larger the communication graph of the dissemination.
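A minimal sketch of this forwarding rule may help fix the idea. It is our illustration, not the DBParter implementation: it assumes synchronous rounds and an adjacency-list graph, and all names are hypothetical.

```python
import random

def epidemic_disseminate(graph, origin, p, rng=random.Random(0)):
    """SIR-style dissemination: each newly visited node forwards the agent
    to each not-yet-visited neighbor independently with probability p,
    exactly once, and is then removed. Returns the set of covered nodes.
    `graph` maps each node to a list of its neighbors."""
    visited = {origin}           # infected-or-removed nodes
    frontier = [origin]          # nodes infected in the current round
    while frontier:
        next_frontier = []
        for node in frontier:
            for neighbor in graph[node]:
                if neighbor not in visited and rng.random() < p:
                    visited.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier  # the old frontier is now removed
    return visited
```

With $p = 1$ this degenerates to regular flooding of the connected component; with smaller $p$, each replica can die before being forwarded, which is exactly what bounds the covered fraction.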
Moreover, we introduce a number of variants of the basic case of DBParter discussed above: 1) to account for the dynamics of the nodes, adjusting our derivation of $p$ accordingly so that an $\alpha$ fraction of the nodes is covered in dynamic peer-to-peer databases; 2) to account for nonuniform distribution of the data among the nodes, biasing the agent forwarding probability such that an $\alpha$ fraction of the data (and not of the nodes) is covered; and 3) to account for the case where some nodes refrain from forwarding the agents, again adjusting our derivation such that the read request is satisfied. These variants of DBParter can be combined to simultaneously handle the cases that apply to a particular peer-to-peer database.

DBParter is also efficient, in that it strikes a balance between the communication cost and the communication time of the sampling. Since epidemic dissemination is essentially a flood-based technique, as we show in Section 3.3, its communication time is low and comparable with that of regular flooding. On the other hand, due to the phase transition phenomenon associated with the SIR epidemic model, for the common case of partial read requests the communication cost of DBParter is up to two orders of magnitude less than that of regular flooding and comparable with the low communication cost of random walk. Intuitively, with epidemic dissemination the dense communication graph of regular flooding, whose numerous loops represent a large amount of redundant and duplicate agent forwarding, is reduced to a sparse communication graph. With fewer loops, the sparse graph contains fewer redundant paths and, therefore, causes fewer duplicate agents, while covering almost the same set of nodes. Hence, epidemic dissemination can be tuned such that the overhead of flooding is effectively eliminated while its reachability and communication time are preserved. It is also important to note that DBParter is simple to implement and, since it is a randomized (not deterministic) mechanism, it is inherently reliable to use with dynamic peer-to-peer databases. We perform an empirical study via simulation to compare the efficiency of DBParter against other possible partial read operations.

1.2.2 Aggregate Query Answering Component

With Digest, we also propose a sample-based approach to answer approximate continuous aggregate queries in peer-to-peer databases [BKSa]. At the top tier of the aggregate query answering component, we develop a query processing engine that uses the samples collected from the peer-to-peer database to continually estimate the running result of the approximate continuous aggregate query with guaranteed precision. For efficient query evaluation, we propose an extrapolation algorithm that predicts the evolution of the running result and adapts the frequency of the continual sampling occasions accordingly, to avoid redundant samples. We also introduce a repeated sampling algorithm that draws on the correlation between the samples at successive sampling occasions and exploits linear regression to minimize the number of samples derived at each occasion. At the bottom tier, we introduce a distributed sampling algorithm for random sampling (uniform and nonuniform) from peer-to-peer databases with arbitrary network topology and tuple distribution. Our random sampling operator, DBSampler, is developed based on the Metropolis Markov Chain Monte Carlo method; it guarantees randomness of the sample with arbitrarily small variation distance from the desired distribution, while remaining comparable to optimal sampling in sampling cost/time.
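To convey the core idea behind Metropolis-style sampling on a graph, consider uniform node sampling. A plain random walk oversamples high-degree nodes; a Metropolis correction repairs this by accepting a proposed move from node $i$ to neighbor $j$ only with probability $\min(1, d_i/d_j)$, where $d$ denotes node degree. The sketch below is a generic illustration of that correction, not the DBSampler algorithm itself; names are hypothetical.

```python
import random

def metropolis_walk(graph, start, steps, rng=random.Random(0)):
    """Random walk whose stationary distribution is uniform over nodes.
    A move from i to a uniformly proposed neighbor j is accepted with
    probability min(1, deg(i)/deg(j)); otherwise the walk stays at i."""
    node = start
    for _ in range(steps):
        proposal = rng.choice(graph[node])
        if rng.random() < min(1.0, len(graph[node]) / len(graph[proposal])):
            node = proposal
    return node  # after enough steps, approximately a uniform node sample
```

Targeting a nonuniform distribution $\pi$ only changes the acceptance ratio to $\min(1, (\pi_j d_i)/(\pi_i d_j))$; the walk itself is unchanged.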
While, under ideal conditions, Digest can use the conventional sampling design at each sampling occasion (instance) to evaluate the snapshot aggregate result of the query at that instance, the typical conditions in many real-world peer-to-peer databases differ from the ideal conditions and require new sampling designs for query evaluation. To overcome the limitations of the conventional sampling design, we also propose a collection of novel sampling designs [BKSb]. We address two main limitations of conventional sampling: unknown parameters and skewed data distribution. To overcome the first limitation, we introduce a double sampling design, with which a pilot sample is obtained from the database prior to the main sample in order to precalculate the unknown parameters required for estimation of the query result. Alternatively, we propose a sequential sampling design that calculates the unknown parameters incrementally while deriving the main sample from the database. To address the second limitation, a sampling design must guarantee unbiased estimation of the query result despite the skewness of the data distribution (due to the existence of rare data values) in the peer-to-peer database. We present a cluster sampling design that leverages the clustering property of the data in peer-to-peer databases and derives extra samples from the clusters of rare data values to correct for the biased estimation of the query result under conventional sampling. We also propose an inverse sampling design with a stop condition that guarantees the sampling terminates if and only if a sufficient number of rare samples has been captured from the database for unbiased estimation of the query result. Finally, we integrate our proposed sampling designs into a universal sampling design for query answering in peer-to-peer databases where both the estimation parameters are unknown and the data distribution is skewed. We evaluate the performance of Digest and the proposed sampling designs rigorously as well as empirically, via simulation using real data.

1.3 Road Map

The remainder of this dissertation is organized as follows. In Chapter 2, we review the related work. Chapters 3 and 4 describe the approximate set-valued query answering component and the approximate aggregate query answering component of Digest, respectively.

Chapter 2: Related Work

2.1 Set-Valued Queries

In the database literature, there are two widely different approaches to processing approximate set-valued queries: histogram-based query processing [IP99] and partial query processing [BDW88]. With the first approach, the result of the set-valued query is of the same size as the exact result, but each tuple is an approximation of the actual tuple it represents. This approach relies on a histogram-based reduced model of the database, which is difficult to maintain in dynamic peer-to-peer databases. Moreover, most peer-to-peer database applications can tolerate a partial result but expect exact tuples as the result.
For example, in a peer-to-peer database representing a grid computing system, the query "List the system specifications of up to 10 computing units with at least 1GB available memory space." is a partial query that requires the exact specifications of the matching units. In this dissertation, we take the second approach, partial query processing, for processing approximate set-valued queries in peer-to-peer databases. With partial query processing, the approximate result is defined as a set which is sandwiched between a subset and a superset of the exact result [BDW88, Mot84]; the tighter the sandwich, the more complete an approximation of the exact result the result-set is. The completeness of the approximate result is quantified based on the cardinality of the symmetric difference between the subset and the superset. CASE-DB [ODGH92, Hou93] and APPROXIMATE [VL93] are two real-world database systems that implement query approximation by partial query processing. However, with peer-to-peer databases, where maintaining a database catalog is infeasible, deriving the superset of the result from the database itself defeats our main purpose of reducing the cost of the query by approximation. Instead, we define the approximate result (the partial result) as a subset of the exact result. In addition, unlike previous work, our approximate query model allows the user to specify the expected completeness of the partial result in order to bound the result-set (see Section 1.1.2.1). Partial query processing is particularly challenging when the tuples of the partial input relations arrive progressively. Various techniques have been proposed to avoid the redundancy of comprehensive re-evaluation of the query while ensuring monotonic improvement of the completeness of the partial result as more input tuples arrive, with special emphasis on incremental/differential evaluation techniques for queries with blocking operators [CDTW00, STD+00].

On the other hand, as mentioned in Section 1.1.2.1, partial read does not pose a challenge in the context of traditional database systems. Also, in the context of unstructured peer-to-peer networks, although data retrieval from the network has been studied as a search problem, none of the proposed search mechanisms can satisfy the correctness and efficiency requirements of the partial read operation. There are two main proposals for search in peer-to-peer networks: flooding [Lim05, YGM03] and random walk [LCC+02, ALPH01, LRS02, CGM02]. With both of these search mechanisms, the query is disseminated throughout the network by recursive forwarding from node to node. With flooding, each node that receives the query forwards it to all of its neighbors, whereas with random walk the query is forwarded to only one (uniformly or nonuniformly) selected random neighbor. Neither of these approaches can strike a balance between the two metrics of search efficiency, i.e., the communication time and the communication cost. Flooding is most efficient in communication time but incurs too much redundant communication to be practical, whereas a random walker is potentially more efficient in communication cost but is intolerably slow in scanning the network. In [YGM03], a two-tier hierarchy is proposed in which flooding is restricted to the supernodes at the top tier. This solution only alleviates the communication cost of flooding, and the problem resurfaces as the top tier scales.
In [LCC+02], using $k$ random walkers in parallel is proposed as a way to balance the communication cost and the communication time of the query. However, this proposal does not provide any theoretical basis for selecting the value of $k$ for optimal performance.

Previous search mechanisms are not only inefficient, but also specifically inappropriate for partial read. As mentioned above, the main benefit of partial query answering is that it allows trading off completeness of the result for better efficiency, by limiting the retrieved data to a just sufficiently large fraction of the database that satisfies the query. To enable such a trade-off, a search mechanism used to implement partial read should allow adjusting the coverage of the database (i.e., the fraction of the nodes, and hence of the data objects, visited) according to the completeness ratio $\alpha$. With both flooding and random walk, a TTL (Time-To-Live) is used to limit the coverage of the network and eliminate runaway queries. However, the TTL is an absolute value, and it is not clear how one could adjust it such that, in a network of unknown size, a sufficiently large fraction of the database is covered. The TTL is often set to a fixed value, selected in an ad hoc fashion based on the average performance of typical search queries, and it must be re-adjusted as the peer-to-peer database evolves. Alternatively, the TTL is gradually increased to expand the coverage, each time repeating the query dissemination from the beginning, until a sufficient fraction of the database is covered to answer the query. Although this last scheme may result in a correct partial read, due to the redundancy of repeating the query dissemination its communication cost can even exceed that of regular flooding. Also, with flooding the granularity of coverage is too coarse (the number of covered nodes grows exponentially with the TTL), rendering fine adjustment of the coverage impossible.

DBParter is inspired by epidemic disease dissemination. The process of epidemic disease dissemination has previously been used as a model for designing other information dissemination techniques [HKMP96]. In particular, in the networking community, epidemic dissemination is termed probabilistic flooding and is applied for search and routing in ad hoc networks and sensor networks [LHH02, GKW+02]. We distinguish DBParter from this body of work in two ways. First, although epidemic algorithms are simple to implement, due to their randomized and distributed nature they are often difficult to analyze theoretically. For the same reason, most of the previous work restricts itself to empirical studies of performance, with results that are subject to inaccuracy and lack of generality. We employ percolation theory to rigorously tune DBParter to its best operating point. Second, the few theoretical studies of epidemic algorithms adopt simplistic mathematical models [Het00] that assume a homogeneous (fully connected) topology for the underlying network to simplify the analysis. However, it has recently been shown that considering the actual topology of the network in the analysis extensively affects its results [GMT05]. We perform our analysis of DBParter assuming an arbitrary random graph as the underlying topology of the peer-to-peer network, and we specifically derive final results for a power-law random graph, which is the observed topology for some peer-to-peer databases [SGG02].
2.2 Aggregate Queries

There are two architecturally different classes of aggregate query processing approaches: in-network processing approaches, mostly pursued by the networking research community, and centralized processing approaches, which are native to the database community. With in-network processing, the aggregation is performed within the network, whereas with centralized processing data is collected from the network and processed outside the network, as a unified database, to derive the aggregates. Aside from the DHT-based in-network processing approaches [GD, HLL+03], which are developed for query answering in structured networks, the in-network query processing approaches can be categorized into two groups: tree-based processing approaches and graph-based processing approaches. With tree-based processing, which is best represented by TAG [MFHH02], the aggregates are calculated by constructing, on the fly, a tree rooted at the originator of the query, followed by incasting the partial aggregate values and fusing them as they flow back to the originator. Tree-based processing approaches are prone to severe miscalculations due to frequent fragmentation of the weak tree structure in dynamic networks. With graph-based processing, aggregates are calculated by collaborative diffusion of partial aggregate values according to a gossip-based algorithm [KDG03, BGPS05] or other randomized, distributed, localized algorithms [BGMGM, ZGE03, CK04, BGS03, CLKB04, CPX05, NGSA04]. Once the diffusion process converges, all nodes of the network have an approximate estimate of the global aggregates. These approaches are more robust to network dynamics; however, to justify the substantial cost of distributed diffusion they assume all nodes of the network are interested in the same aggregate, which is not necessarily the case. Our two-tier query answering framework (see Figure 1.1) assumes the second class of aggregate query processing approaches, i.e., centralized query processing.

Assuming centralized query processing, previous approaches for answering continuous queries are not applicable to peer-to-peer databases. The approaches proposed for centralized databases, whether for continuous queries over regular databases [TGNO92], continuous queries over data streams [BW01], or materialized view maintenance [GM95], naturally assume all changes of the data are available locally, whereas with peer-to-peer databases the data is distributed and changes occur autonomously at the nodes, without notification. Even with continuous query systems for Internet-scale distributed databases, such as OpenCQ [LPT99] and NiagaraCQ [CDTW00], an active database model is assumed, with which data changes are detected as events and all events are collected locally at the querying node to update the result of the continuous query. Considering the rate of changes in peer-to-peer databases, pushing all events to the querying nodes fails to scale. With Digest, event detection is pull-based (rather than push-based), and only a sufficiently large sample of the events is collected from the database to satisfy the precision requirements of the query.

Similarly, with the exception of the (on-the-fly) sample-based approaches, previous approaches developed for approximate aggregate query answering in traditional databases are not applicable to peer-to-peer databases.
The model-based approaches [CR94, DGM+05] rely on models that accurately predict the characteristics of the data in the database, whereas with peer-to-peer databases the data characteristics are typically unpredictable. The histogram-based [DIR00] and precomputed-sample-based [GM98, GLR00] data reduction approaches are not appropriate either: although dynamically updated, given the high rate of change in peer-to-peer databases, maintaining histograms and precomputed samples is intolerably costly. The on-the-fly (not precomputed) sample-based approaches, which were originally developed for query size estimation in regular databases [HO91, LNS90, HS92, AGP00], have recently been adopted for approximate query answering in networked databases such as peer-to-peer databases and sensor databases [BBC04, ADGK06]. However, none of these new proposals considers addressing the limitations of the conventional sampling design. To the best of our knowledge, Digest is the first proposal that introduces new sampling designs to address the limitations of conventional sampling in the context of approximate aggregate query answering in peer-to-peer databases.

The double sampling and sequential sampling designs were originally proposed by Cox [Cox52] and Chow et al. [CR65], respectively, and were subsequently adopted by Hou et al. [HOD91] and Haas et al. [HS92] for query size estimation. Here, in order to enable answering fixed-precision approximate queries based on these sampling designs, we extend the designs by determining the required sample size based on the confidence interval of the estimate rather than its variance. Also, the main idea of our cluster sampling design is similar to that of an adaptive sampling mechanism proposed by Thompson et al. [TS96] for sampling nonuniformly distributed animal species. However, cluster sampling differs significantly from adaptive sampling in the details of the sampling procedure and, consequently, in the estimate variance and the sample size. Finally, inverse sampling was first introduced by Haldane [Hal45] for sampling rare subpopulations. Here, we rigorously calculate the variance of the estimate under inverse sampling and determine the stop condition required to satisfy the precision requirements of fixed-precision approximate aggregate queries.

On the other hand, random sample operations for data retrieval have also been studied, both in the context of databases and in the context of networks. Olken [Olk93] proposes various techniques for sampling from database files (e.g., hash files [ORX90]) and secondary index structures (e.g., B+-trees [OR89]). However, these approaches are designed for local databases, and most of them do not apply to distributed peer-to-peer databases. Recently, a number of techniques have been proposed for sampling from the data distributed among the nodes of a network. Some of these techniques are restricted to networks with a particular structure (e.g., trees [KRA+03] or DHTs [Man03, MBR03, KS04]). Others assume networks with arbitrary topology but only allow uniform random sampling [GMS04, She04a, EGH+03, GKM01]. The most universal network sampling approach appropriate for peer-to-peer databases is based on the Markov Chain Monte Carlo (MCMC) methods [Tie94].

Chapter 3: Approximate Set-Valued Query Answering

In this chapter, we describe the set-valued query answering component of Digest (refer to Figure 1.2).
In Section 3.1, we discuss the query processing engine of this component [BKS06], and in Section 3.2 we explain DBParter, the partial read operation of the component [BKS03a, BKS03b].

3.1 Partial Query Processing Engine

We first describe the basic function of the query engine. Subsequently, we discuss how we extend the basic functionality to support some variants of the basic query model and service model.

3.1.1 Main Design

3.1.1.1 Query and Service Model

Assuming a relational data model for peer-to-peer databases, a set-valued query is defined as a query that takes one or more relations as input and returns a single relation as the result. In peer-to-peer databases set-valued queries are often one-time queries, i.e., they are issued and executed once without anticipation of further repetition. We define the basic query model for an approximate set-valued query as follows: an approximate query is an arbitrary set-valued query with a user-defined completeness ratio $\epsilon \in [0,1]$; any fraction $\epsilon$ of the complete result-set $X$, denoted $X^{(\epsilon)}$, is sufficient to satisfy the approximate query. An approximate set-valued query is also called a partial query, with $X^{(\epsilon)}$ as the partial result of the query. An exact query is a specific case of a partial query with $\epsilon = 1$. Also, the basic service model of Digest assumes batch query answering, i.e., the user receives the answer to the query all at once (not progressively), after computation of the entire result-set $X^{(\epsilon)}$ is completed. Figure 3.1 shows a snapshot of the Digest user interface.

[Figure 3.1: Digest User Interface]

3.1.1.2 Partial Query Processing

Figure 3.2 depicts an overview of the process of answering partial queries with Digest. In this figure, we use capital characters, e.g., $R$, to refer to relations, and small characters, such as $r$, to refer to a partial relation of the corresponding relation $R$. Similar to a regular database, we consider two schemas for a peer-to-peer database: a conceptual schema at the top tier, and an internal schema at the bottom tier. The conceptual schema represents the structure of the database as observed by the users, whereas the internal schema captures the physical storage structure of the database. The internal schema of a peer-to-peer database is simply a universal relation $M$, which is the product of the set of relations $R_i$ (for $i = 1..k$) from the conceptual schema. The relation $M$ is horizontally fragmented, and its tuples are distributed among the nodes of the network $N$.

To answer a partial query, first a sufficiently large fraction $\alpha_i$ of each input relation $R_i$ must be retrieved from the database by partial read. Subsequently, considering these partial relations $r_i$ ($= R_i^{(\alpha_i)}$) as the base relations for the query, Digest evaluates the partial query simply as a regular query in order to compute the final partial result $X^{(\epsilon)}$. The details of this partial query processing procedure are enumerated in six steps as follows (see Figure 3.2, and the sketch after the list). At Steps 1-3, the required size for the partial read is calculated, and subsequently at Steps 4-6 the data retrieved by partial read is processed to compute the final partial result:

1. Query-Conceptual Mapping ($\epsilon \to \alpha_i$): As a new query arrives, Digest first exploits query sampling [CMN99] to determine the required fractional size $\alpha_i$ of each partial input relation $r_i$, based on the expected completeness ratio $\epsilon$ of the query.
2. Conceptual-Internal Mapping ($\alpha_i \to \beta$): Next, the required fractional size $\beta$ of the partial relation $m$ is calculated as $\beta = \max_i\{\alpha_i\}$. The relation $m$ is a sub-relation of the universal relation $M$, and it must be sufficiently large that it contains the relations $r_i$.

3. Internal-Network Mapping ($\beta \to \gamma$): Thereafter, the partial read operation is called with the input parameter $\beta$. The partial read operation must cover a subset $n$ of the network $N$ sufficiently large to retrieve enough tuples from the network to form the relation $m$. The required fractional size $\gamma$ of the subset $n$ is determined based on $\beta$ and some characteristics of the network. The details of the mapping $\beta \to \gamma$ are discussed in Section 3.2, where we explain the partial read operation of Digest.

4. Deriving $m$ from $n$: The tuples retrieved from the nodes in $n$ are collected (at the query originator) and merged to form $m$.

5. Deriving $r_i$ from $m$: The tuples of each relation $r_i$ are selected and projected out of the product relation $m$ to form the relations $r_i$.

6. Deriving $X^{(\epsilon)}$ from $r_i$: Finally, considering the relations $r_i$ as the base relations, the query is evaluated as a regular query to derive the final partial result $X^{(\epsilon)}$.

[Figure 3.2: Answering Partial Queries with Digest — the conceptual schema (relations $R_i$ and partial relations $r_i$) at the top tier, and the internal schema (universal relation $M$, partial relation $m$) and the network ($N$, covered subset $n$) at the bottom tier, annotated with the mapping steps 1-6]
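Read as a pipeline, the six steps compose as in the following control-flow sketch. This is our paraphrase under simplifying assumptions, not Digest's implementation; the four callables stand in for machinery described elsewhere in this chapter, and all names are hypothetical.

```python
def answer_partial_query(query, epsilon, estimate_alphas, partial_read,
                         select_project, evaluate):
    """Steps 1-6 of Section 3.1.1.2 as a control-flow sketch.
    estimate_alphas : query sampling, epsilon -> {relation name: alpha_i}
    partial_read    : covers the network and returns the merged partial
                      universal relation m (steps 3-4, beta -> gamma -> m)
    select_project  : carves a partial base relation r_i out of m
    evaluate        : runs the query over the partial base relations"""
    alphas = estimate_alphas(query, epsilon)        # step 1
    beta = max(alphas.values())                     # step 2
    m = partial_read(beta)                          # steps 3-4
    partials = {name: select_project(m, name) for name in alphas}  # step 5
    return evaluate(query, partials)                # step 6
```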
3.1.2 Extensions

3.1.2.1 Absolute Approximate Query Model

With an absolute approximate query, instead of the fractional (relative) size ε of the result, the user indicates the absolute size |x| of the expected partial result. To answer an absolute approximate query, Digest maps the query to an equivalent regular approximate query with a completeness ratio ε corresponding to the absolute size |x|. We summarize this mapping process as follows. Steps 1-3 are slow, but they are performed off-line; Steps 4-6 are executed on-line when the query arrives:

1. Estimating |N|: Use a network size estimation technique such as [BGMGM] to derive an estimate of the size of the network, |N|.

2. Estimating |M|: With the known network size |N|, estimating the size of the database, |M|, is a classic urn problem [LN89] that requires random sampling from the network [ZSS05].

3. Estimating |R_i|: With the known database size |M|, use selectivity estimation [LNS90] to estimate the size of each relation R_i embedded in the database M.

4. Estimating |X|: Use query size estimation [HO91] to estimate the size of the complete result-set X based on the sizes of the base relations R_i.

5. Mapping the query (|x| → ε): Calculate the completeness ratio ε of the equivalent regular approximate query as ε = |x|/|X|.

6. Answer the query as a regular approximate query with completeness ratio ε, as explained in Section 3.1.1.2.

3.1.2.2 Progressive Query Answering Service Model

With exploratory querying, users often issue several back-to-back queries, each time revising and enhancing the query based on a quick and incomplete observation of the results of the previous query. To support exploratory querying, we extend the service model of Digest for progressive query answering [HHW97]. With a progressive query answering service model, unlike batch query answering, the result tuples are progressively reported to the user as they become available, and the user is allowed to interrupt and resume/stop the query while the result is being reported.

For progressive query answering, we enable the partial read operation of Digest to cover the network and deliver the retrieved data iteratively (instead of in batch). When the user interrupts the query, data retrieval, and consequently query processing, is suspended until the query is either stopped (i.e., aborted) or resumed. In the latter case, to avoid re-evaluating the query from scratch, Digest uses incremental/differential partial query processing [LPBZ96, Hou93], with which only the newly retrieved data are used to derive new results.

3.2 Partial Read Operation: DBParter

Given β ∈ [0,1], a partial read operation must retrieve a fraction β of the relation M, which is horizontally fragmented and distributed among the nodes of the network N (Step 3 in Figure 3.2). We term such a partial read request a β-query. To answer a β-query, a partial read operation samples the peer-to-peer database by disseminating sampler agents (we alternatively call them queries, because they carry a copy of the β-query) throughout the network. Beginning from the originator of the β-query (i.e., the originator of the main set-valued query), agents are disseminated through the network to visit a fraction γ of the nodes, which collectively store m (= M^(β)). The set of tuples stored at the visited nodes is retrieved and collected at the query originator for further processing (at Step 4 in Figure 3.2). The main challenge is that, regardless of the choice of the dissemination mechanism used to implement the partial read operation, the mechanism must be "tuned" per β-query such that the query is satisfied both correctly and efficiently (refer to Section 1.1.2.1 for the definitions of correctness and efficiency).

Figure 3.3 depicts the parameter mapping process for tuning a generic dissemination mechanism.

[Figure 3.3: Parameter Mapping Process for Tuning a Generic Dissemination Mechanism, with Step I mapping β to γ based on the data distribution, and Step II tuning the dissemination parameter p based on γ]

At Step I, considering the distribution of the data objects among the nodes, γ (i.e., the fractional size of the subset n of the nodes that should be visited to retrieve m) is calculated based on β. Subsequently, at Step II, some parameter(s) of the dissemination mechanism (say, parameter p) is tuned based on γ such that the dissemination actually covers n nodes of the network, correctly and efficiently. The specifics of Step II of the mapping process depend on the particular dissemination mechanism applied to implement the partial read.

For our partial read operation, called DBParter, we employ an epidemic dissemination mechanism. For the basic case of DBParter, we assume uniform object distribution among the nodes of the network (hence, the mapping at Step I is trivially γ = β) and focus on the specifics of Step II for epidemic dissemination. Later, we extend the basic case of DBParter by introducing other variants of DBParter that consider non-uniform data distribution, dynamic networks, and query-blocking nodes.

In the remainder of this section, after providing some definitions, we first elaborate on epidemic dissemination with DBParter; in particular, we use a percolation model to formalize the parameter tuning procedure (Step II) for epidemic dissemination. Subsequently, we introduce a specialized real-world example of DBParter that applies to peer-to-peer databases with power-law topology. Finally, we introduce some variants of DBParter that extend the basic case.
3.2.1 Definitions

3.2.1.1 Communication Graph

Communication graphs are used to represent and visualize query dissemination over networks. A network with node-set N and link/edge-set E can be modelled as an undirected graph G(N,E). For a query initiated at time t = t_0, the communication graph of the query at time t ≥ t_0 is a subgraph G_t(N_t, E_t) of G, where E_t ⊆ E is the set of links traversed by at least one query replica during the time interval [t_0, t], and N_t ⊆ N is the set of nodes visited by at least one query replica during the same time interval. Associated with any link e ∈ E_t is a weight w_e, which is the number of times the link e is traversed during the time interval. We assume a discrete time model. Thus, the dynamic process of disseminating a query is completely represented by the set of communication graphs {G_{t_0}, G_{t_0+1}, G_{t_0+2}, ..., G_{t_0+T}}, where query dissemination terminates at time t = t_0 + T (hence, for all t ≥ t_0 + T, G_t = G_{t_0+T}). The communication graph is a generic model for visualizing the query dissemination process of any dissemination mechanism. For example, Figure 3.4 depicts the first 6 communication graphs of a query that is initiated at node A and disseminated based on the random walk dissemination mechanism.

[Figure 3.4: Communication graphs of a random walk initiated at node A, at times t_0 through t_0+5]

3.2.1.2 Efficiency Measures for Sampling

We define two metrics to measure the efficiency of query dissemination mechanisms used for sampling from networks:

1. Query cost (or sampling cost, or communication cost) C: Assuming uniform link cost and uniform query size, we model the communication cost of disseminating a query as the total number of query replicas communicated between the nodes during the query dissemination process. In communication-graph terminology:

C = Σ_{e ∈ E_{t_0+T}} w_e

2. Query time (or sampling time, or communication time) T: Assuming uniform link latency, the sampling time is the total time it takes to disseminate the query. In communication-graph terminology, it equals T, the number of time steps until the dissemination terminates.
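As a small illustration of these two measures, the sketch below computes C and T from a recorded dissemination, assuming (for illustration only) that the dissemination is logged as a list of per-step link traversals.

    # Illustrative only: compute C and T from a per-step traversal log.
    from collections import Counter

    def efficiency(traversals_per_step):
        # w_e: number of times each (undirected) link is traversed
        weights = Counter()
        for step_links in traversals_per_step:
            weights.update(tuple(sorted(link)) for link in step_links)
        cost = sum(weights.values())       # C: total number of forwards
        time = len(traversals_per_step)    # T: steps until termination
        return cost, time

    # A toy random walk in the spirit of Figure 3.4: one forward per step,
    # with the link A-B traversed twice, giving C = 5 and T = 5.
    walk = [[("A", "B")], [("B", "C")], [("C", "D")], [("D", "B")], [("B", "A")]]
    print(efficiency(walk))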
3.2.2 DBParter: Partial Read by Epidemic Dissemination

Epidemic dissemination is inspired by the epidemic spreading of contagious diseases in a social network (i.e., a network that models a society, with nodes as people and links as social ties between people who are in regular contact). A contagious disease first appears at one node (the originator of the disease) and then, through the social links, disseminates throughout the network in a flood-like fashion: from the infected person to its direct neighbors, from the infected neighbors to their neighbors, and so on. However, the transmission of the disease through a social link is probabilistic; i.e., the disease is transmitted from an infected node to its susceptible neighbor with probability p (0 ≤ p ≤ 1), and the transmission fails with probability 1 − p. The value of p is determined by the infectiousness of the particular disease as well as some other environmental parameters, but the value is generic to all links of the network. When the spreading terminates, the disease has covered/reached a sample h of the total node population H (h ⊆ H), where the relative size of h increases with increasing p (although not necessarily linearly). With epidemic dissemination by DBParter, we model the query dissemination mechanism on the disease spreading process. By analogy, we take the dissemination of a query in a peer-to-peer database as the spreading of a disease in a social network. With this analogy, the infection probability p translates to the agent/query forwarding probability, and the infected sample h maps to the sampled node-set n.

Among various disease spreading models, we model epidemic dissemination on the SIR (Susceptible-Infected-Removed) disease spreading model. Below, we first describe the basic SIR disease spreading model, which readily maps to our epidemic query dissemination mechanism with DBParter. Thereafter, for our SIR-based query dissemination mechanism, we develop a percolation model to derive γ as a function of the query forwarding probability p for peer-to-peer databases with arbitrary random graph topology. The function γ(p) is one-to-one; therefore, in turn, it defines p as a function of γ, which enables Step II of the tuning process of DBParter.

3.2.2.1 Epidemic Dissemination based on the SIR Disease Spreading Model

With the SIR disease spreading model, at any particular time during the dissemination of a disease, each node of the social network is in one of three states: susceptible (S), infected (I), or removed (R). A "susceptible" node is capable of being infected but is not infected yet; an "infected" node is currently infected; and a "removed" node has recovered from the infection and is both immune to further infection and unable to infect other nodes.

[Figure 3.5: State Diagram for the SIR Spreading Model]

The discrete-time, dynamic process of SIR epidemic dissemination can be explained as follows. Initially, at time t_0, all nodes of the network are susceptible except the originator of the disease, which is infected. As the disease propagates throughout the network, if at time t ≥ t_0 a susceptible node A has an infected neighbor B, then at time t+1, with probability p, node A contracts the disease from B and becomes infected (see Figure 3.5 for the state transition diagram of a node). An infected node remains in the infectious state for a period of time τ (during which it is able to infect its neighbors), and then finally becomes removed. We assume τ = 1 without loss of generality. The disease dissemination terminates (dies out) at time t_0 + T (where T ≥ 1) when all nodes of the network are either in the removed state (affected by the disease) or the susceptible state (survived the disease), and no node is in the infected state. By analogy, with the SIR-based epidemic query dissemination of DBParter, when the query dissemination terminates, the set of removed nodes is the set n of nodes visited by the query, and the set of susceptible nodes is the set N \ n of nodes not covered by the query dissemination (see Figure 3.2).
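The following minimal sketch simulates this SIR-based dissemination with τ = 1, under the assumption (for illustration) that the overlay is given as an adjacency dictionary; it returns the visited node-set n and the dissemination time T.

    # Minimal SIR dissemination sketch (tau = 1); illustrative, not the
    # Digest implementation. 'graph' maps each node to its neighbor list.
    import random

    def sir_disseminate(graph, origin, p, rng=random.Random(42)):
        removed = set()              # nodes already visited (state R)
        infected = {origin}          # nodes forwarding the query this step (state I)
        steps = 0
        while infected:
            next_infected = set()
            for node in infected:
                for neighbor in graph[node]:
                    # a susceptible neighbor contracts the query with probability p
                    if (neighbor not in removed and neighbor not in infected
                            and neighbor not in next_infected and rng.random() < p):
                        next_infected.add(neighbor)
            removed |= infected      # infectious for exactly one time step
            infected = next_infected
            steps += 1
        return removed, steps        # the visited node-set n and the query time T

    # Usage on a toy 10-node ring overlay:
    ring = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
    n, T = sir_disseminate(ring, 0, 0.8)
    print(len(n), T)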
3.2.2.2 Percolation Model for Epidemic Dissemination

To tune the epidemic dissemination with DBParter, we need to answer two questions:

1. How large should p be for the query dissemination to prevail over a large network? For a query to prevail over a network N, we should have:

lim_{|N|→∞} |n|/|N| = γ    (3.1)

with γ > 0. In other words, the size of the covered node-set n must be comparable to the size of the entire network N; otherwise, the partial read operation cannot satisfy any β-query other than the trivial β-query with β = 0. With too small a value of p, the dissemination dies out quickly, and the query fails to prevail, covering an infinitesimally small number of nodes as compared to the total number of nodes |N| in large networks; i.e., we have:

lim_{|N|→∞} |n|/|N| = 0

2. How can we derive γ as a function of p? γ is an increasing function of p. Having γ(p), we can fulfill Step II of the tuning process (see Figure 3.3) by deriving p(γ) as the inverse function γ^{−1}. In other words, given a β-query and assuming γ = β, we can tune the forwarding probability p of DBParter on the fly to satisfy the β-query.

To answer these questions, we model the epidemic dissemination process as a percolation problem. First, we illustrate this percolation problem by describing two toy percolation problems (see Figure 3.6). Consider the grid in Figure 3.6-a. Each cell of the grid is called a site. Suppose we color each site of the grid independently with probability p. Figure 3.6-b depicts several instances of the colored grid for various probabilities. As p increases, larger clusters of colored sites appear. In particular, at p = 0.6 there is a giant cluster (marked in black) that spans from the top side of the grid to the bottom side. When such a giant cluster appears in a colored grid, we say the grid percolates. It is proved [SA92, MR95, NSW01] that there exists a critical probability p_c (in this case, p_c ≈ 0.59) such that, with high probability (whp), the grid percolates only if p ≥ p_c. The size of the giant cluster depends on the value of p, and as p increases the giant cluster grows such that at p = 1 it covers the entire grid. The toy problem described above is called a site percolation problem on a two-dimensional grid.

[Figure 3.6: Site Percolation Problem. (a) a site grid; (b) site percolation instances for p = 0.1 through p = 0.6]

Equivalently, one can define the dual bond percolation problem, where a set of nodes is arranged into a grid (see Figure 3.7-a) with a bond (not shown) between every pair of adjacent nodes. Suppose each bond is colored independently with probability p. In this case, a giant cluster is a cluster of nodes connected by colored bonds that percolates from one side of the grid to the other (in Figure 3.7-b the giant cluster is marked in black).

[Figure 3.7: Bond Percolation Problem. (a) a bond grid; (b) a bond percolation instance at p = 0.6]

We model the epidemic dissemination process as a bond percolation problem on an arbitrary random graph. Our problem differs from the toy bond percolation problem described above in two ways. First, instead of a grid, we assume an arbitrary random graph as the underlying bonding structure among the nodes. With an arbitrary random graph, each node is not restricted to a fixed number of neighbors (e.g., four neighbors, as in the two-dimensional grid) but can have a random number of neighbors according to some arbitrary degree distribution. Second, unlike a grid, a random graph is not delimited by sides. For a bond percolation problem on a random graph, a giant cluster is defined as a cluster of nodes (i.e., a set of nodes connected by colored bonds) whose size is comparable to the size of the entire graph. This bond percolation problem models the epidemic dissemination as follows: the underlying random graph represents the physical topology of the peer-to-peer network (i.e., the logical overlay), and a cluster generated with the coloring probability p is an instance of the communication graph of a query, if the query was initiated by a node within the cluster and disseminated with the forwarding probability p.
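As a toy illustration of the site percolation problem (not part of DBParter), the following sketch colors an s-by-s grid with probability p and checks, by breadth-first search, whether a colored cluster spans from the top row to the bottom row; around p ≈ 0.59 spanning becomes likely, consistent with the critical probability cited above.

    # Toy site-percolation demo: does a colored cluster span top to bottom?
    import random
    from collections import deque

    def percolates(s, p, rng):
        colored = [[rng.random() < p for _ in range(s)] for _ in range(s)]
        # breadth-first search seeded from every colored site in the top row
        queue = deque((0, c) for c in range(s) if colored[0][c])
        seen = set(queue)
        while queue:
            r, c = queue.popleft()
            if r == s - 1:
                return True      # reached the bottom row: a spanning cluster
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < s and 0 <= nc < s and colored[nr][nc] \
                        and (nr, nc) not in seen:
                    seen.add((nr, nc))
                    queue.append((nr, nc))
        return False

    for p in (0.3, 0.5, 0.59, 0.7):
        hits = sum(percolates(60, p, random.Random(i)) for i in range(20))
        print(p, hits / 20.0)    # fraction of trials that percolate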
With this model, we can answer the two questions raised before as follows:

1. How large should p be for the query dissemination to prevail over a large network? The query prevails over the network if and only if a giant cluster exists. Thus, whp the query prevails over the network if and only if p ≥ p_c, where p_c is the critical percolation probability of the bond percolation problem.

2. How can we derive γ as a function of p? To derive γ(p), we should derive the relative size of the giant cluster as a function of the coloring probability p, for all p ≥ p_c.

Next, we solve the bond percolation problem for the critical probability p_c and for the size of the giant cluster as a function of p.

3.2.2.3 Tuning Epidemic Dissemination

3.2.2.3.1 Definitions

We use the generating-function formalism [Wil94] to represent probability distribution functions. In particular, the generating function G_0(x) for the distribution of the node degree k in an arbitrary random graph is defined as:

G_0(x) = Σ_{k=0}^{∞} p_k x^k    (3.2)

where p_k is the probability that a randomly chosen node of the graph has degree k. From Equation (3.2), one can derive the n-th moment of the node degree distribution as follows:

⟨k^n⟩ = [(x d/dx)^n G_0(x)]_{x=1}    (3.3)

Also, for a random graph represented by the generating function G_0(x), we define G_1(x), the generating function for the distribution of the degree of the node we arrive at by following a randomly chosen link of the graph. G_1(x) depends on G_0(x) and is derived as follows:

G_1(x) = G_0'(x) / ⟨k⟩    (3.4)

3.2.2.3.2 Analysis

First, consider the bond percolation model described in Section 3.2.2.2, and suppose the coloring probability is p. A cluster of nodes connected by colored bonds is itself a random graph embedded within the underlying random graph G_0(x). It is easy to derive the generating function G_0(x;p) for the degree distribution of the graphs representing the clusters, based on G_0(x):

G_0(x;p) = Σ_{m=0}^{∞} Σ_{k=m}^{∞} p_k C(k,m) p^m (1−p)^{k−m} x^m    (3.5)
         = Σ_{k=0}^{∞} p_k Σ_{m=0}^{k} C(k,m) (xp)^m (1−p)^{k−m}    (3.6)
         = Σ_{k=0}^{∞} p_k (1−p+xp)^k    (3.7)
         = G_0(1+(x−1)p)    (3.8)

where C(k,m) denotes the binomial coefficient. Similarly, one can derive G_1(x;p) as follows:

G_1(x;p) = G_1(1+(x−1)p)    (3.9)

Next, we derive the distribution of the sizes (i.e., the number of nodes) of the clusters. Assume H_0(x;p) is the generating function for the distribution of the size of the clusters. Observing that each cluster consists of a node connected to k other sub-clusters (where k is distributed according to G_0(x;p)), we derive the distribution of the cluster size by a recursive argument as follows:

H_0(x;p) = x G_0(H_1(x;p); p)    (3.10)

and similarly:

H_1(x;p) = x G_1(H_1(x;p); p)    (3.11)

From H_0(x;p), we can also compute the average size ⟨s⟩ of the clusters using Equation (3.3):

⟨s⟩ = H_0'(1;p) = 1 + G_0'(1;p) H_1'(1;p)    (3.12)

However, according to Equation (3.11), we have:

H_1'(1;p) = 1 + G_1'(1;p) H_1'(1;p) = 1 / (1 − G_1'(1;p))    (3.13)

Thus:

⟨s⟩ = 1 + G_0'(1;p) / (1 − G_1'(1;p)) = 1 + p G_0'(1) / (1 − p G_1'(1))    (3.14)

Now, since the giant cluster appears at the critical probability p_c, the average cluster size ⟨s⟩ goes to infinity at p = p_c; i.e.:

⟨s⟩ = 1 + p_c G_0'(1) / (1 − p_c G_1'(1)) → ∞    (3.15)

Therefore, the critical probability p_c can be computed as:

p_c = 1 / G_1'(1) = 1 / (⟨k²⟩/⟨k⟩ − 1)    (3.16)

This concludes the solution for the first question raised in Section 3.2.2.2. To answer the second question, i.e., to calculate the relative size of the giant cluster as a function of p, we observe that for p ≥ p_c, H_0(x;p) remains the distribution of the sizes of the finite clusters, i.e., all clusters except the giant cluster.
Thus, we have:

H_0(1;p) = 1 − γ(p)    (3.17)

Using Equation (3.10), we can derive γ(p) as follows:

γ(p) = 1 − G_0(y;p)    (3.18)

where, according to Equation (3.11), y = H_1(1;p) is the solution of:

y = G_1(y;p)    (3.19)

We solve these equations for γ(p) numerically by iteration.

3.2.3 A Real-World Example of DBParter

In Section 3.2.2, we described the generic case of DBParter for peer-to-peer databases with arbitrary random graph topology. As captured by empirical studies, some real-world peer-to-peer databases, such as Gnutella [Lim05] and Kazaa [Sha05], have power-law random graph topologies [SGG02, Rip01, Jov01, GDS+03]. Here, we specialize the generic case of DBParter for peer-to-peer databases with power-law topology.

3.2.3.1 Network Topology

In this section, we assume the topology of the peer-to-peer database is a power-law (or scale-free) random graph, i.e., a random graph [Bol85] with a power-law probability distribution for node degrees. Intuitively, in a power-law random graph, most of the nodes are of low degree, while there are still a few nodes with very high connectivity. We define the power-law probability distribution function for the node degree k as follows:

p_k = C k^{−η} e^{−k/ν}    (3.20)

where η, ν, and C are constants. η is the skew factor of the power-law distribution, often in the range 2 < η < 3.75 for real networks. For example, a case study reports η = 2.3 for Gnutella [SGG02]. The smaller the skew factor, the heavier the tail of the power-law distribution, which translates to a larger number of highly connected nodes. A pure power-law distribution does not include the exponential cut-off factor (e^{−k/ν}), allowing for nodes with infinite degree, which is unrealistic for real peer-to-peer databases. The cut-off factor with index ν shortens the heavy tail of the power-law distribution such that the maximum node degree of the graph is of the same order of magnitude as ν. Finally, C is the normalization factor, computed as C = [Li_η(e^{−1/ν})]^{−1}, where Li_η(x) = Σ_{k=1}^{∞} x^k / k^η is the η-th polylogarithm function of x.

3.2.3.2 Analysis

The generating function of the power-law degree distribution (see Equation 3.20) can be represented in terms of the polylogarithm function as follows:

G_0(x) = Li_η(x e^{−1/ν}) / Li_η(e^{−1/ν})    (3.21)

From Equation (3.3), and noting that (d/dx) Li_η(x) = (1/x) Li_{η−1}(x), we can compute the first and second moments of the power-law degree distribution:

⟨k⟩ = (x d/dx) G_0(x) |_{x=1} = Li_{η−1}(e^{−1/ν}) / Li_η(e^{−1/ν})

⟨k²⟩ = (x d/dx)² G_0(x) |_{x=1} = Li_{η−2}(e^{−1/ν}) / Li_η(e^{−1/ν})

Consequently, from Equation (3.16), we can derive the critical probability p_c for a power-law graph as follows:

p_c = Li_{η−1}(e^{−1/ν}) / (Li_{η−2}(e^{−1/ν}) − Li_{η−1}(e^{−1/ν}))    (3.22)

In Figure 3.8, we illustrate p_c as a function of ν for various η values in a real-world power-law peer-to-peer database, i.e., Gnutella. For Gnutella, the skew factor η is estimated as low as η = 1.4 and as high as η = 2.3 in different scenarios. Also, ν is in the range of 100 to 1000. As illustrated by this example, the critical probability p_c in power-law networks can be as low as 0.01.

[Figure 3.8: Critical Probability in Power-Law Peer-to-Peer Databases, plotting p_c versus the cut-off index ν for skew factors η = 2.3 and η = 1.4]

We also solved Equation (3.18) to derive γ(p) for power-law networks by numerical iteration. In Figure 3.9-a, we show our result for a power-law random graph with skew factor η = 2.3.
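For illustration, the following numerical sketch evaluates p_c from Equation (3.22) and γ(p) by fixed-point iteration of Equations (3.18)-(3.19), for the power-law distribution of Equation (3.20). The polylogarithm is approximated by direct series summation (adequate here, since e^{−1/ν} < 1); parameter values and iteration counts are illustrative choices.

    # Numerical sketch of the tuning math for a power-law topology.
    import math

    def Li(s, z, terms=20000):
        # polylogarithm Li_s(z) = sum_{k>=1} z^k / k^s, by direct summation
        return sum(z ** k / k ** s for k in range(1, terms + 1))

    def p_critical(eta, nu):
        # Equation (3.22)
        a = math.exp(-1.0 / nu)
        return Li(eta - 1, a) / (Li(eta - 2, a) - Li(eta - 1, a))

    def gamma_of_p(p, eta, nu, iters=100):
        a = math.exp(-1.0 / nu)
        def G0(x):
            return Li(eta, x * a) / Li(eta, a)                 # Equation (3.21)
        def G1(x):
            return Li(eta - 1, x * a) / (x * Li(eta - 1, a))   # G_0'(x) / <k>
        y = 0.5
        for _ in range(iters):
            y = G1(1 + (y - 1) * p)     # iterate y = G_1(y; p), Eqs. (3.9), (3.19)
        return 1 - G0(1 + (y - 1) * p)  # gamma(p) = 1 - G_0(y; p), Eq. (3.18)

    print(p_critical(2.3, 100))      # critical probability for eta = 2.3, nu = 100
    print(gamma_of_p(0.3, 2.3, 100)) # expected relative coverage at p = 0.3

Inverting the resulting γ(p) curve numerically (e.g., by bisection over p) yields the forwarding probability needed for a given β-query, which is exactly Step II of the tuning process.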
3.2.3.3 Algorithm

• Phase 1: Seeding the Query

To make sure the query dissemination is initiated by a node belonging to the giant cluster, the actual query originator initiates a selective walker to locate a highly connected node in the network. With the selective walk, at each hop the query is forwarded to the neighbor with the maximum connectivity degree, until the walker reaches a node with a connectivity degree higher than the degrees of all its neighbors. In [ALPH01], it is shown that in a power-law network, whp, such a selective walker finds a highly connected node belonging to the giant cluster in O(log N) hops, where N is the size of the network. Our experiments also verify this result.

• Phase 2: Disseminating the Query

Next, the SIR-based epidemic query dissemination is initiated at the highly connected node located by the selective walker. The query is disseminated with a forwarding probability p > p_c selected according to Equation (3.18), such that γ(p) satisfies the given β-query.
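The two phases can be sketched as follows, under the stated assumptions: 'graph' is an adjacency dictionary, 'invert_gamma' numerically inverts the γ(p) curve derived above, and 'disseminate' is an SIR dissemination routine such as the one sketched in Section 3.2.2.1. All names here are illustrative.

    # Sketch of the two-phase DBParter algorithm (illustrative names).

    def selective_walk(graph, start):
        # Phase 1: hop to the highest-degree neighbor until reaching a node
        # whose degree exceeds the degrees of all of its neighbors
        node = start
        while True:
            best = max(graph[node], key=lambda v: len(graph[v]))
            if len(graph[best]) <= len(graph[node]):
                return node
            node = best

    def dbparter_partial_read(graph, originator, beta, invert_gamma, disseminate):
        # under uniform data distribution, Step I of tuning is trivial: gamma = beta
        p = invert_gamma(beta)                    # Step II: p = gamma^{-1}(beta)
        seed = selective_walk(graph, originator)  # Phase 1: seed the query
        return disseminate(graph, seed, p)        # Phase 2: SIR dissemination

Since the walk strictly increases the degree at every hop, it always terminates at a local degree maximum, which whp lies in the giant cluster.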
3.2.4 Variants of DBParter

Since in real-world peer-to-peer databases some nodes may refrain from participating in query dissemination, here we introduce a family of variants of the basic case of DBParter to model this behavior. With this family, unlike the original SIR model, where initially all nodes are in the "susceptible" state, some nodes begin in the "removed" state. In the context of the disease epidemic, these nodes represent the people who are vaccinated. We term such nodes the blocked nodes.

We consider three different variants with blocking for the basic DBParter, each representing a particular real-world scenario:

1. DBParter with uniform blocking: In this case, nodes are blocked with uniform probability. This case models networks where nodes autonomously decide whether or not to participate in query dissemination.

2. DBParter with negative degree-correlated blocking: In this case, nodes with lower connectivity degrees are blocked with higher probability. For example, this case models peer-to-peer file-sharing networks where low-degree nodes are usually also low-bandwidth and, therefore, to avoid congestion and possible isolation, may refrain from participating in query dissemination with higher probability.

3. DBParter with positive degree-correlated blocking: This case is the opposite of the previous case: nodes with higher connectivity degrees are blocked with higher probability. This case models the scenario where, for example, the high-degree nodes of a power-law network are attacked and disabled.

Formal analysis of these DBParter variants is similar to that of the basic case. For brevity, we omit this analysis, as well as the discussion of other DBParter variants.

3.3 Empirical Study

We conducted two sets of experiments via simulation to 1) study the behavior of DBParter empirically, and 2) evaluate the efficiency of DBParter. For this purpose, we implemented a discrete-time event-driven simulator in C++. We used an Enterprise E220 SUN server to perform the experiments.

3.3.1 Methodology

With the first set of experiments, we studied the relation between the forwarding probability p and the size of the sample node-set covered by the DBParter query dissemination; hence, with these experiments, the data content of the nodes is irrelevant. With the second set of experiments, we evaluated the efficiency of various partial read operations in resolving β-queries. Our Monte Carlo simulation was organized as a set of "runs". For the first set of experiments, each run consists of 1) selecting a network topology, 2) selecting a query originator, and finally 3) initiating 50 query disseminations per forwarding probability p (for each of the partial read operations) while varying p from 0 to 1, and recording the average size of the covered node-set as well as C and T. For the second set of experiments, a run comprises 1) selecting a network topology, 2) selecting an object-set (a multiset of tuples), 3) distributing the object-set among the network nodes, 4) selecting a query originator, and finally 5) initiating the query 50 times per β (for each of the partial read operations) while varying β from 0 to 1, and recording the average values of their efficiency metrics C and T. Each result data-point reported in Section 3.3.2 is the average result of 50 runs. The coefficient of variation across the runs was below 2.5%, which shows the stability of the results.

We generated a set of 100 undirected power-law graphs G(N,E), each with |N| = 50000 and |E| ≈ 200000. The skew factors of the graphs are all about η = 2.3, as measured in [SGG02]. The minimum number of edges per node is 4, and the cut-off index of the graphs is ν = 100. The graphs are generated using the preferential attachment model proposed by Barabasi et al. [BA99]. We considered a 5-dimensional content space and generated 100 object-sets. For each object-set, we first generated |U| = 100000 objects u = ⟨a_1, a_2, ..., a_5⟩, where a_i is an integer value uniformly selected from the interval [1,10]. Thereafter, we replicated the objects according to the object replication scheme defined in Section 3.2.3.1, with the total number of objects |M| = 500000. M is uniformly distributed among the set of network nodes N.

3.3.2 Results

Figure 3.9 illustrates the results of our first set of experiments. Figure 3.9-a depicts the relation between the query forwarding probability p and the relative size γ of the covered node-set n. First, notice how closely the results of our theoretical analysis conform to the performance of the basic case of DBParter in practice, especially for the p values of interest close to the critical forwarding probability p_c. Also, we observe that while the performance of DBParter with negative degree-correlated blocking is almost identical to that of the basic DBParter (they overlap in the figure), with positive degree-correlated blocking, the coverage for the same forwarding probability decreases significantly. This shows 1) the importance of the highly connected nodes to the performance of DBParter, and 2) the independence of its performance from the nodes with lower connectivity degrees, which are often low-bandwidth and volatile. Figure 3.9-b confirms our previous conjecture that the sampling cost of DBParter is linearly proportional to the query forwarding probability. Also, notice that with p = 0.3, almost 80% of the network nodes can be covered with only about 25% of the sampling cost of regular flooding (with p = 1).
Besides, the size of the covered node-set becomes sublinearly 47 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Probability (p ) γ Basic DBParter Uniform Blocking (10%) Uniform Blocking (20%) Uniform Blocking (30%) +ve Degree-Cor Blocking -ve Degree-Cor Blocking (33%) -ve Degree-Cor Blocking (52%) Theoretical Analysis a. Nodes sample size vs. forwarding probability 0 50000 100000 150000 200000 250000 300000 350000 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Probability (p) Number of Forwards (C) b. Sampling cost vs. forwarding probability 0 5 10 15 20 25 30 35 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Probability (p) Time (T) c. Sampling time vs. forwarding probability Figure 3.9: Verification of the Analytical Results 48 proportional to the network size starting atp c ¼0:05, where the sampling cost is almost two orders of magnitude less than that of the flooding. Finally, Figure 3.9-c illustrates how the sampling time reduces as the forwarding probability goes from p c towards 1, because the giant cluster becomes more strongly connected. Also, we observe that in the worst case, the sampling time with DBParter only increases by a factor of 4 over the minimum possible sampling time with flooding. Figures 3.10 and 3.11 illustrate the results of our second set of experiments. With these experiments, we compared the efficiency of DBParter in answering¯-queries with that of partial read operations based on random walk and scope-limited-flooding dis- semination mechanisms. Unlike DBParter, these two partial read operations are unable to determine whether they have covered a sufficiently large fraction of the network to satisfy a particular ¯-query. Nevertheless, to be able to compare DBParter with these partial read operations, for each particular¯-query with given¯ we calculated the abso- lute (not relative) number of nodes that must be covered to satisfy the query, and ter- minated the random walk and flooding as soon as their coverage exceed the required number of nodes. DBParter can decide on the required coverage on its own. Figure 3.10-b shows that, as one can expect, the sampling time of DBParter is always incomparably shorter than that of the random walk, even with 32 parallel walkers (in the figure, the DBParter plot lies on the x axis). However, to our surprise, DBParter also outperforms random walk in sampling cost (see Figure 3.10-a). Thus, to cover the same number of nodes the SIR-based dissemination uses a “lighter” communication graph as compared to that of the random walk. As illustrated in Figure 3.10-a, the random walk algorithms with different number of walkers incur the same sampling cost. This should not be surprising; more walkers enhance the sampling time of the random walk by scanning the network in parallel, but a single random walker walks as much as 32 random walkers walk in aggregate to cover the same number of nodes. Also, notice 49 DBParter vs. Random Walk 0 100000 200000 300000 400000 500000 600000 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 β β β β Sampling Cost (# forwards) DBParter 1-RW 1-RW Self Avoiding 16-RW 16-RW Self Avoiding 32-RW 32-RW Self Avoiding a. Sampling cost DBParter vs. Random Walk 0 50000 100000 150000 200000 250000 300000 350000 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 β β β β Sampling Time (# simulation steps) DBParter 1-RW 1-RW Self Avoiding 16-RW 16-RW Self Avoiding 32-RW 32-RW Self Avoiding b. Sampling time Figure 3.10: DBParter vs. 
Also, notice that among the random walk algorithms, the self-avoiding random walk (which avoids repeated paths) always outperforms the regular random walk in sampling time.

Figure 3.11-a shows that DBParter always outperforms scope-limited flooding in sampling cost. Notice the step-like diagram for scope-limited flooding: since at each hop flooding scans an exponentially larger number of new nodes, unlike DBParter, it cannot be tuned properly to cover a certain fraction of the network nodes with fine granularity. Finally, Figure 3.11-b shows that although the sampling time of DBParter always exceeds that of flooding (which is optimal), even in the worst case it remains tolerable.

[Figure 3.11: DBParter vs. Scope-Limited Flooding. (a) sampling cost (number of forwards) vs. β; (b) sampling time (number of simulation steps) vs. β]

3.4 Future Work

We plan to extend our proposed family of epidemic-based partial read operations to include partial read operations that answer continuous partial queries. For this purpose, we will adopt the SIS (Susceptible-Infected-Susceptible) disease spreading model. Unlike SIR, which models epidemic diseases that disseminate once throughout the social network and quickly disappear, SIS models endemic diseases that become resident in the social network and disseminate continuously.

Chapter 4

Approximate Aggregate Query Answering

In this chapter, we focus on answering approximate continuous aggregate queries in peer-to-peer databases. Continuous queries [TGNO92, LPT99, CDTW00] allow users to obtain new results from the database without having to issue the same query repeatedly. Continuous queries are especially useful with peer-to-peer databases, which inherently comprise large amounts of frequently changing data. For example, in a weather forecast system with thousands of interconnected stations, the system administrator can issue a continuous aggregate query of the form:

"Over the next 24 hours, notify me whenever the average temperature of the area changes by more than 2°F."

Or, in a peer-to-peer computing system with distributed resources, users can issue the following query to determine when there is enough memory space available to schedule their tasks:

"Notify me whenever the total amount of available memory is more than 4GB."

Digest evaluates approximate continuous aggregate queries by continual execution of approximate snapshot aggregate queries, where each snapshot query is evaluated by sampling the database. The snapshot queries probe the database, and accordingly, the running result of the continuous query is updated. (Here, we use the term "running" to refer to a quantity that is a function of time, in a sense different from a progressive/online (aggregate) quantity as used in [HHW97].) As we elaborate below, with continuous queries the main issue transcends how to execute each snapshot query; rather, it is how to execute snapshot queries continually such that, while the fixed precision requirements of the continuous query are guaranteed, the query is answered efficiently by drawing the minimum number of samples.
With fixed-precision approximate continuous aggregate queries, the required (or fixed) precision of the approximate result is defined by the user in terms of 1) the resolution of the result in capturing the changes of the actual running aggregate value (e.g., the result reflects the changes of the average temperature iff a change exceeds 2°F), and 2) the confidence (or accuracy) of the result at each time, as compared to the exact aggregate value at that time. Using continual snapshot queries to answer such queries, the resolution of the result is determined by the frequency of the snapshots, and the confidence of the result depends on the number of samples drawn to approximate the result of each snapshot query. Therefore, for efficient evaluation of the continuous queries while guaranteeing the fixed precision, both the frequency of the snapshot queries and the number of samples drawn at each snapshot query must be minimized, while the resolution requirement and the confidence requirement of the query are still satisfied, respectively.

With Digest, the query engine uses the samples collected from the database (using the random sampling operator) to evaluate the snapshot queries. To minimize the frequency (or equivalently, the number) of the snapshot queries, the query engine exploits an extrapolation algorithm that predicts the evolution of the running aggregate value based on its previous behavior and adapts the frequency of the continual snapshot queries accordingly. With this approach, the more the aggregate value varies, the more the frequency of the snapshot queries increases to maintain the resolution of the result; and when the aggregate value is steady, the frequency of the snapshot queries decreases accordingly to avoid redundant sampling.

On the other hand, to minimize the number of samples drawn at each snapshot query, the query engine employs a repeated sampling algorithm. Repeated sampling draws on the observation that across successive snapshot queries the values of the database tuples are expected to be autocorrelated; therefore, exploiting the regression of the value of a sampled tuple at the current query on its value at the previous query can improve the accuracy of the current estimate. Repeated sampling uses regression estimation to achieve the required confidence using fewer samples as compared with straightforward independent sampling, which ignores the correlation between the snapshots.

The remainder of this chapter is organized as follows. In Section 4.1, we first define the semantics of fixed-precision approximate continuous aggregate queries and, after providing an overview of the aggregate query processing engine of Digest, describe the functionality of the query engine in detail. In Section 4.2, we present the distributed random sampling operator developed for Digest. Section 4.3 covers our solutions for the limitations of the conventional sampling design for sample-based aggregate query answering in peer-to-peer databases. In Section 4.4, we present the results of our empirical study on the aggregate query answering component of Digest, and finally, Section 4.5 concludes the chapter and discusses future directions of this research.
4.1 Sample-based Query Processing Engine

4.1.1 Approximate Continuous Query

In this chapter, we model a peer-to-peer network as an undirected graph G(V,E) with arbitrary topology. The set of vertices V = {v_1, v_2, ..., v_r} represents the set of network nodes, and the set of edges E = {e_1, e_2, ..., e_q} represents the set of network links, where e_i = (v_j, v_k) is a link between v_j and v_k. As nodes autonomously join and leave the network, the member-set of V, and accordingly that of E, vary in time. Consequently, the set sizes r and q are also variable and unknown a priori. We assume the rate of change of G is relatively low compared to the sampling time (i.e., the time required to draw a sample from the peer-to-peer database), such that the network can be assumed almost static during each sampling occasion (although it may change significantly between successive sampling occasions).

For a peer-to-peer database stored in such a peer-to-peer network, without loss of generality, we assume a relational model. Suppose the database consists of a single relation R = {u_1, u_2, ..., u_N}. R (a multiset) is horizontally partitioned, and each disjoint subset of its tuples is stored at a separate node. The number of tuples stored at node v_i is denoted by m_{v_i}. The member-set of R also varies in time; the changes are due either to the changes of V, as nodes with new content join the network (as if inserting tuples) and existing nodes leave and remove their content (as if deleting tuples), or to existing nodes autonomously modifying their local content by insertion, update, and deletion.

With such a model for peer-to-peer databases, we define our basic query model for continuous aggregate queries as follows. Consider queries of the form:

SELECT op(expression) FROM R

where op is one of the aggregate operations AVG, COUNT, or SUM, and expression is an arithmetic expression involving the attributes of R. Suppose Q is an instance of such a query. Assuming a discrete-time model (i.e., time is modelled as a discrete quantity with some fixed unit), the snapshot aggregate query Q_t is the query Q evaluated at time t. Correspondingly, the continuous aggregate query Q_c is the query Q evaluated continuously (i.e., repeated successively without intermission) for all t ≥ t_0, where t_0 is the arrival time of Q_c. The result of Q_c is X[t], a discrete-time function where, for all t ≥ t_0, X[t] is the aggregate-value result of the snapshot query Q_t.

For instance, in a peer-to-peer computing system, each node (a node consists of one or more computing units) keeps a current record of its available resources by maintaining tuples of the form u_i = ⟨cpu, memory, storage, bandwidth⟩, one tuple for each local computing unit.
Considering R(cpu, memory, storage, bandwidth) as a single-relation peer-to-peer database representing the resources available in such a peer-to-peer computing system, the following continuous query returns X, the total amount of space currently available throughout the system, as a function of time:

SELECT SUM(memory + storage) FROM R

With a fixed-precision approximate version of the exact continuous query Q_c defined above, the exact result X[t] is approximated by an estimate X̂[t] with guaranteed precision (see Figure 4.1). Our model for approximate queries includes three extra user-defined parameters, δ, ε, and p, to specify the desired precision of the estimation.

[Figure 4.1: Fixed-Precision Approximate Continuous Aggregate Query, showing the exact result X[t], the approximate result X̂[t], the update times t_{u_0} through t_{u_12}, and the precision parameters δ and ε]

With δ, the user specifies the resolution of X̂[t] in capturing the incremental changes in X[t] as it evolves in time. To answer an exact query, X[t] is updated (i.e., re-evaluated by a snapshot query) at every time instant t, regardless of the amount of change in X since the last update at t−1. However, with approximate queries, smaller changes below some threshold may be insignificant to the user/application and, therefore, are not required to be reflected in the estimated result X̂[t]. The parameter δ defines this application-specific threshold. Suppose t_{u_i} is the most recent time at which the result X̂[t] was updated (initially, t_{u_0} = t_0). For t > t_{u_i}, the approximate query is not required to update the result again until t = t_{u_{i+1}}, where t_{u_{i+1}} is the earliest time at which ΔX ≥ δ (by definition, ΔX = |X[t_{u_{i+1}}] − X[t_{u_i}]|). For all times t in the interval (t_{u_i}, t_{u_{i+1}}), X̂[t] can be estimated without update/re-evaluation, e.g., by "holding" (i.e., X̂[t] = X̂[t_{u_i}]) or by interpolation. With this semantics for approximation, the smaller changes of X during the intervals (t_{u_i}, t_{u_{i+1}}), for i = 0, 1, 2, 3, ..., are filtered out of the estimated result. Returning to our running example, changes on the order of several megabytes in the total space may not be noteworthy to a distributed task scheduling application and/or may be too costly to monitor. In such a case, e.g., δ = 1GB might be an effective choice to formulate an approximate query that is both useful and practical.
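For illustration, the following idealized sketch applies the δ-resolution semantics to a stream of (here, exact) aggregate values, holding the previous result until a change of at least δ is observed; in Digest, the values would instead be estimates produced by snapshot queries.

    # Idealized delta-resolution filter over a value stream (illustrative).

    def fixed_resolution_updates(x_values, delta):
        held = x_values[0]                 # X_hat[t_0] = X[t_0]
        approx, update_times = [], [0]
        for t, x in enumerate(x_values):
            if t > 0 and abs(x - held) >= delta:
                held = x                   # change of at least delta: update
                update_times.append(t)
            approx.append(held)            # otherwise hold the previous value
        return approx, update_times

    # Example: with delta = 3, only the jumps at t = 3 and t = 5 are reported.
    approx, updates = fixed_resolution_updates([10, 11, 12, 14, 15, 18, 19], 3)
    print(approx, updates)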
Next, the parameter ε indicates the maximum tolerable absolute error in the estimate X̂[t] at each time t_{u_i}. The approximate query should guarantee |X̂[t_{u_i}] − X[t_{u_i}]| ≤ ε for all i. The interval [X[t_{u_i}] − ε, X[t_{u_i}] + ε] is termed the confidence interval of the estimation at time t_{u_i}, with X[t_{u_i}] − ε and X[t_{u_i}] + ε as the lower and upper confidence limits, respectively. The provided guarantee is probabilistic, and with the parameter p, the user specifies the desired confidence level of the guarantee, i.e., the probability that the estimate X̂[t_{u_i}] is actually confined within the confidence interval. The user-defined parameters ε and p together determine the required confidence of the estimate X̂[t_{u_i}]. Note that the approximate query generalizes the exact query; an exact query is an approximate query with δ = 0, ε = 0, and p = 1.

To answer an exact continuous aggregate query, snapshot queries must be executed continuously, each evaluated for an exact result; hence, they are termed continuous-exact snapshot queries. Alternatively, an approximate continuous aggregate query can be answered by executing the more flexible and general continual-approximate snapshot queries. With continual-approximate queries, the less frequent the snapshot queries and the less accurate the approximation by each snapshot query, the lower the cost of evaluating the continuous aggregate query becomes, although the precision (i.e., the resolution and the confidence, respectively) of the result decreases as well. This allows a trade-off between the precision and the cost of obtaining the result, such that while the fixed-precision approximate query is correctly satisfied, the cost of evaluating the query can be optimized for efficiency. An extreme case of the trade-off is using continuous-exact snapshot queries to answer exact continuous aggregate queries; in that case, both the frequency of the snapshot queries and the accuracy of the approximation by each snapshot query are maximal, such that the estimated result of the continuous query is exact (i.e., X̂[t] = X[t]) while it costs the most to evaluate. Digest executes continual-approximate snapshot queries by sampling the database, and optimizes the frequency and accuracy of the snapshot queries to answer approximate continuous aggregate queries both correctly (with guaranteed precision) and efficiently.

4.1.2 Overview

Figure 4.2 depicts the two-tier architecture of the aggregate query answering component of Digest. Each node of the peer-to-peer database operates its own individual instance of Digest to answer the continuous queries received from the local user. As mentioned before, the query evaluation engine at the top tier exploits an extrapolation algorithm (see Section 4.1.3.1 on continual querying) and a repeated sampling algorithm (see Section 4.1.3.2 on approximate querying) to optimize the number of samples drawn to answer continuous aggregate queries. In addition to the query evaluation engine, Digest benefits from a sampling operator at the bottom tier that (in collaboration with other instances of Digest distributed throughout the peer-to-peer network) efficiently draws random samples from the peer-to-peer database. Here, we describe the interface of the sampling operator; the distributed sampling algorithm that implements the sampling operator itself is presented in Section 4.2.

[Figure 4.2: Two-Tier Architecture of Digest, with the sample-based query evaluation engine at the top tier and the random sampling operator, interfacing with the peer-to-peer database, at the bottom tier]

The sampling operator of Digest implements a distributed sampling algorithm to draw random samples (sample nodes and, correspondingly, sample tuples) from peer-to-peer databases with arbitrary network topology and tuple distribution. The interface of the sampling operator is defined as follows. First, consider a generic weight function w that assigns a weight w_v to each node v of the database. For instance, w_1 = {∀v ∈ V | w_v = 1} is a uniform weight function, and w_2 = {∀v ∈ V | w_v = m_v} is a (possibly) nonuniform weight function with which each node is weighted according to the number of tuples m_v stored at the node. We assume that with w, the weight of each node is a function of the local properties of the node (such as the content-size m_v or the connectivity degree d_v of the node), and the assigned weight is not necessarily normalized. Given such a weight function w as input, once invoked, the sampling operator S randomly draws a sample node v from V such that p_v = w_v / Σ_{v∈V} w_v, where p_v is the probability of sampling node v.
In other words, the distribution of the sampling probability among the nodes is proportional to the distribution of the weight according to the desired (uniform or nonuniform) weight function w. As we show in Section 4.2, with our distributed sampling algorithm, each node only needs to know the weights of its local neighbors; hence, there is no need to acquire global information.

The sampling operator S is used to draw a sample node. Additionally, to draw a sample tuple, one can first use S to draw a sample node, and then randomly sample the tuples stored locally at the sampled node (this scheme is termed two-stage sampling [Coc77]). The combination of the two samplings, i.e., the distributed node sampling via S and the local tuple sampling from the sampled node, uniquely specifies the random distribution of the sampled tuple in the entire R. For example, to obtain a uniform random sample from R, S is invoked with the weight function w = {∀v ∈ V | w_v = m_v} to draw a sample node with a sampling probability proportional to its content-size. Thereafter, the content of the sampled node is uniformly sampled to derive a sample tuple; the derived tuple is a uniform sample from R. With the above two-stage sampling scheme, sampling the tuples stored at the sampled node is performed locally; hence, it is standard and inexpensive. The sampling operator S, which implements the more complicated and costly distributed node sampling, ensures near-optimal performance (comparable to that of the optimal sampling) in terms of communication cost and sampling time, while guaranteeing randomness of the drawn sample node, with arbitrarily small error (i.e., variation distance) from the desired uniform or nonuniform sampling probability distribution (see Section 4.2 for details).
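The two-stage scheme can be sketched as follows. In this illustration, the distributed node stage of Section 4.2 is simulated centrally (an assumption made purely for clarity); 'tuples' is a hypothetical mapping from each node to its locally stored tuples.

    # Two-stage sampling sketch: node stage (weighted) then local tuple stage.
    import random

    def sample_node(weights, rng):
        # stage 1: node v drawn with probability w_v / sum of all weights
        return rng.choices(list(weights), weights=list(weights.values()))[0]

    def sample_tuple(tuples, rng):
        # stage 2: weight nodes by content-size m_v, sample a node, then
        # sample uniformly within it; the result is a uniform tuple from R
        weights = {v: len(ts) for v, ts in tuples.items()}
        v = sample_node(weights, rng)
        return rng.choice(tuples[v])

    rng = random.Random(7)
    tuples = {"v1": ["a", "b"], "v2": ["c"], "v3": ["d", "e", "f"]}
    print([sample_tuple(tuples, rng) for _ in range(5)])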
4.1.3 Sample-based Query Evaluation

4.1.3.1 Continual Querying

Suppose the most recent snapshot query is executed at time t_{u_i} = k (see Figure 4.3). For continual querying, we should predict the next update time t_{u_{i+1}} such that |X[t_{u_{i+1}}] − X[t_{u_i}]| ≥ δ. To predict t_{u_{i+1}}, in brief, our approach is to fit a curve to the previously observed values of X in order to predict its values in the near future with guaranteed error bounds. If the rate of change/variation of X[t] in time is unbounded, the future values of X are unpredictable and, therefore, continual querying inevitably reduces to continuous querying to ensure the required resolution. However, the aggregate value X[t] is expected to be a smooth function of time, with considerable autocorrelation over short time intervals that bounds its variation (e.g., consider the variation of the total amount of available space in a peer-to-peer computing system). Hereafter, by convention, we model such a smooth function X[t] with bounded variation as an analytic function of time; i.e., we assume X[t] possesses derivatives of all orders and agrees with its Taylor series in the neighborhood of every point.

[Figure 4.3: Computing t_{u_{i+1}} by Polynomial Extrapolation: at t = t_{u_{i+1}}, we have |ΔP_n[t_{u_{i+1}}]| + |R_n[t_{u_{i+1}}]| > δ]

To find t_{u_{i+1}}, our continual querying algorithm predicts the evolution of the analytic function X[t] by polynomial extrapolation using the Taylor series expansion. By Taylor's theorem, in the neighborhood of t_{u_i}, X[t] can be approximated by a degree-n Taylor polynomial P_n[t]:

P_n[t] = X[t_{u_i}] + (t − t_{u_i}) X'[t_{u_i}] + ((t − t_{u_i})²/2!) X''[t_{u_i}] + ... + ((t − t_{u_i})^n/n!) X^{(n)}[t_{u_i}]    (4.1)

The upper-bound error for the polynomial approximation is the Lagrange remainder R_n[t], such that |X[t] − P_n[t]| < |R_n[t]|, where:

R_n[t] = ((t − t_{u_i})^{n+1}/(n+1)!) X^{(n+1)}[c_t]    (4.2)

with c_t ∈ [t_{u_i}, t] a point maximizing |X^{(n+1)}[t]| over this interval.

First, P_n[t] is computed by fitting a degree-n polynomial to the n+1 previous values of X[t] at t = k−n, t = k−n+1, ..., and t = k. However, the exact values of X[t] are unknown unless the snapshot queries are exact. Instead of the exact values, assuming sufficiently accurate approximate snapshot results, we compute P_n[t] using the n+1 previous values of X̂[t] at t = t_{u_{i−n}}, t = t_{u_{i−n+1}}, ..., and t = t_{u_i} (n = 3 in Figure 4.3). Next, having P_n[t] as an approximation of X[t] with bounded error, t_{u_{i+1}} is derived by extrapolation as the minimum t satisfying:

|P_n[t] − P_n[t_{u_i}]| + |R_n[t]| > δ    (4.3)

Note that the upper-bound error |R_n[t]| of the polynomial approximation is a decreasing function of n. Therefore, the higher the degree of the polynomial approximation, the tighter the error bound and, thus, the less conservative the predicted update time t_{u_{i+1}}, which makes continual querying more efficient. Also, while t < t_{u_n}, a degree-n polynomial approximation is not applicable. During this bootstrapping period, i.e., the interval [t_{u_0}, t_{u_n}), our continual querying algorithm implements continuous querying instead of continual querying.
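A minimal sketch of this prediction step follows, using NumPy's polynomial fitting and two simplifying assumptions made only for illustration: the recent estimates were taken at consecutive integer times, and the remainder bound |R_n[t]| of Equation (4.2) is supplied by the caller as a function.

    # Sketch of next-update-time prediction via Equation (4.3), assuming
    # consecutive integer sample times and a caller-supplied remainder bound.
    import numpy as np

    def predict_next_update(times, x_hats, delta, remainder_bound, horizon=1000):
        # fit a degree-n polynomial through the n+1 most recent estimates
        n = len(times) - 1
        poly = np.polyfit(times, x_hats, n)
        last_t = times[-1]
        last_val = np.polyval(poly, last_t)
        for t in range(last_t + 1, last_t + horizon):
            drift = abs(np.polyval(poly, t) - last_val)   # |P_n[t] - P_n[t_u_i]|
            if drift + abs(remainder_bound(t)) > delta:
                return t          # earliest t at which delta may be crossed
        return last_t + horizon   # no predicted crossing within the horizon

    # Example: linear growth of 2 units/step with a 0.5 remainder bound
    # crosses delta = 10 five steps after the last update (t = 8).
    t_next = predict_next_update([0, 1, 2, 3], [100, 102, 104, 106],
                                 delta=10, remainder_bound=lambda t: 0.5)
    print(t_next)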
4.1.3.2 Approximate Querying

4.1.3.2.1 Independent Sampling

Among the three types of aggregate queries, AVG, COUNT, and SUM, any two types can be reduced to the third type with minor modifications. Here, we begin by describing the query evaluation process for an approximate AVG query (which is easier to explain using our notation) based on the straightforward independent sampling algorithm using the conventional sampling design. We follow with a brief discussion of the required modifications for the COUNT and SUM queries.

(A) AVG Query: With the independent sampling algorithm, to answer an AVG snapshot query Q_t

SELECT AVG(expression) FROM R

n sample tuples u_1, u_2, ..., u_n are drawn from R, uniformly at random and with replacement. The time interval (beginning at t) during which the database is probed to draw samples for evaluating Q_t is called the sampling occasion for Q_t. During the sampling occasion, each sample is drawn by first calling the sampling operator S with the weight function w = {∀v ∈ V | w_v = m_v} to draw a sample node with a sampling probability proportional to its content-size; next, the content of the sampled node is uniformly sampled to derive a sample tuple. Suppose the value of the expression when applied to the sample tuple u_i is denoted by y_i. Based on the drawn samples, the result (X =) Y = (1/N) Σ_{i=1}^{N} y_i of the AVG query is estimated by the unbiased and consistent estimator:

Ŷ = (1/n) Σ_{i=1}^{n} y_i    (4.4)

The number of samples n is computed such that, with probability p, the estimate Ŷ is within the confidence interval [Y − ε, Y + ε]. For relatively large ε (i.e., ε comparable to Ŷ in order of magnitude), we use Hoeffding's tail bound to calculate n as follows. The Hoeffding inequality asserts that:

Pr{|Ŷ − Y| ≤ ε} ≥ 1 − 2e^{−2nε²/(b−a)²}    (4.5)

where a and b are two constants such that a < y_i < b for all i. We set the right side of the above inequality equal to p and solve the equation for n:

n = ((b−a)²/(2ε²)) ln(2/(1−p))    (4.6)

However, the Hoeffding bound is not a tight bound, and for small ε, according to the Hoeffding inequality, n should be very large; hence, inefficient sampling. Instead, when ε is relatively small, we use tighter bounds based upon the standard central limit theorem to compute n. Let σ² = (1/N) Σ_{i=1}^{N} (y_i − Y)² be the true variance of y_i in R, and σ̂² = (1/n) Σ_{i=1}^{n} (y_i − Ŷ)² the estimated variance. For sufficiently large n, it follows from the central limit theorem that the random variable Ŷ has a normal distribution with mean Y and variance σ²/n, or equivalently, the standardized random variable √n(Ŷ − Y)/σ has a normal distribution with mean 0 and variance 1. Therefore:

Pr{|Ŷ − Y| ≤ ε} = Pr{|√n(Ŷ − Y)/σ| ≤ ε√n/σ} ≈ 2(Φ(ε√n/σ̂) − 1/2)    (4.7)

where Φ is the standard cumulative normal distribution function. Let l_p be the (p+1)/2 quantile of this distribution (i.e., Φ(l_p) = (p+1)/2). To derive n, we set the rightmost term in Equation (4.7) equal to p and solve the equation for n:

n = (σ̂ l_p / ε)²    (4.8)

(B) COUNT Query: To answer a COUNT snapshot query

SELECT COUNT(expression) FROM R

we need an estimate of the size of the network, r = |V|. We use an algorithm proposed by Horowitz et al. [HM03] to estimate the size of the network. This algorithm is relatively expensive to execute (in terms of communication cost); however, it is only executed occasionally, as a background process. Note that although the peer-to-peer network is dynamically evolving, the total size of the network typically does not change rapidly. Besides, since with independent sampling the size of the sample set is anticipated to be much smaller than the size of the network, an approximate estimate of the network size is sufficient. Therefore, occasional updates of the network size estimate are adequate.

To estimate the result of the COUNT query, n sample nodes v_1, v_2, ..., v_n are drawn from V, uniformly at random and with replacement. Each sample is drawn by calling S with the uniform weight function w = {∀v ∈ V | w_v = 1}. Let c_i denote the count of the tuples locally stored at the sample node v_i. Then, the result N = Σ_{i=1}^{r} c_i of the query is estimated as N̂ = (r/n) Σ_{i=1}^{n} c_i. The computation of n is similar to that of the previous section.

(C) SUM Query: A SUM snapshot query

SELECT SUM(expression) FROM R

can be answered by an AVG query and a COUNT query. The result Y = Σ_{i=1}^{N} y_i of the SUM query is estimated by Ŷ_SUM = (N/n) Σ_{i=1}^{n} y_i ≈ N̂ × Ŷ_AVG.
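For illustration, the following sketch implements the two sample-size rules above: Equation (4.6) from the Hoeffding bound for large ε, and Equation (4.8) from the normal approximation for small ε, with σ̂ assumed to come from a pilot sample.

    # Sample-size computation per Equations (4.6) and (4.8) (illustrative).
    import math
    from statistics import NormalDist

    def n_hoeffding(a, b, eps, p):
        # n = (b - a)^2 / (2 eps^2) * ln(2 / (1 - p))
        return math.ceil((b - a) ** 2 / (2 * eps ** 2) * math.log(2 / (1 - p)))

    def n_normal(sigma_hat, eps, p):
        # l_p: the (p+1)/2 quantile of the standard normal distribution
        l_p = NormalDist().inv_cdf((p + 1) / 2)
        return math.ceil((sigma_hat * l_p / eps) ** 2)

    # Example: values bounded in (0, 10), eps = 0.5, confidence p = 0.95
    print(n_hoeffding(0, 10, 0.5, 0.95))   # conservative bound: 738 samples
    print(n_normal(2.9, 0.5, 0.95))        # CLT-based bound: 130 samples

As the example shows, the CLT-based rule requires far fewer samples when a pilot variance estimate is available, which is why Digest prefers it for small ε.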
(B) COUNT Query: To answer a COUNT snapshot query

    SELECT COUNT(expression) FROM R

we need an estimate of the size of the network, $r = |V|$. We use an algorithm proposed by Horowitz et al. [HM03] to estimate the network size. This algorithm is relatively expensive to execute (in terms of communication cost); however, it is only executed occasionally, as a background process. Note that although the peer-to-peer network evolves dynamically, the total size of the network typically does not change rapidly. Besides, since with independent sampling the size of the sample set is anticipated to be much smaller than the size of the network, an approximate estimate of the network size is sufficient. Therefore, occasionally updating the network size estimate is adequate.

To estimate the result of the COUNT query, $n$ sample nodes $v_1, v_2, \dots, v_n$ are derived from $V$, uniformly at random and with replacement. Each sample is drawn by calling $S$ with the uniform weight function $w = \{\forall v \in V \mid w_v = 1\}$. Let $c_i$ denote the count of the tuples locally stored at the sample node $v_i$. Then, the result $N = \sum_{i=1}^{r} c_i$ of the query is estimated as $\hat N = r/n \sum_{i=1}^{n} c_i$. The computation of $n$ is similar to that in the previous section.

(C) SUM Query: A SUM snapshot query

    SELECT SUM(expression) FROM R

can be answered by an AVG query and a COUNT query. The result $Y = \sum_{i=1}^{N} y_i$ of the SUM query is estimated by $\hat Y = N/n \sum_{i=1}^{n} y_i \approx \hat N \times \hat Y_{\mathrm{AVG}}$, i.e., the product of the COUNT and AVG estimates.
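As a sketch of our own (not Digest's code), the COUNT and SUM estimators reduce to a few lines; here $r$ is the separately estimated network size, $c_i$ the tuple count at the $i$-th sampled node, and avgHat the AVG estimate of Equation 4.4:

    #include <numeric>
    #include <vector>

    // N_hat = (r / n) * sum_i c_i
    double estimateCount(double r, const std::vector<double>& c) {
        return r / c.size() * std::accumulate(c.begin(), c.end(), 0.0);
    }

    // Y_hat_SUM is approximated as N_hat * Y_hat_AVG
    double estimateSum(double nHat, double avgHat) {
        return nHat * avgHat;
    }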
4.1.3.2.2 Repeated Sampling

With independent sampling, each snapshot query is answered independently, disregarding the results and the samples derived for previous queries. However, across successive queries the values of the database tuples are expected to be autocorrelated; therefore, exploiting the regression of the value of a sampled tuple at the current sampling occasion on its value at the previous occasion can improve the accuracy of the current estimate. Alternatively, by regression estimation one can achieve the same accuracy/confidence using fewer samples at each sampling occasion and, hence, evaluate the query more efficiently. Repeated sampling relies on this observation to improve the efficiency of independent sampling while still satisfying the confidence requirement of the query. Below, we explain the repeated sampling algorithm in detail for the evaluation of AVG queries; other types of aggregate queries are evaluated similarly. We begin with regression estimation for the evaluation of the 2nd snapshot query (i.e., the special bootstrapping snapshot query) of a continuous aggregate query, and later extend the analysis to the evaluation of the general $k$th snapshot query.

(A) Evaluating the 2nd Snapshot Query:

Suppose the sample set is of size $n$ in both the first and the second sampling occasions of a continuous AVG query. Let $y_{ik}$ denote the value of a sample tuple $u_i$ at occasion $k$ when the expression in the AVG query is evaluated. At the first occasion, there is no prior information to utilize. Therefore, all $n$ samples are new samples derived from the database, and the result of the first snapshot query is estimated simply by the independent sampling algorithm discussed in Section 4.1.3.2.1. At the second occasion, each sample $u_i$ is either replaced by a new sample from the database, or retained (and re-evaluated). A new sample is derived using the sampling operator $S$ and incurs communication overhead to locate, whereas a retained sample is already located and is only retrieved to be re-evaluated (to refresh $y_{i1}$ to $y_{i2}$) after possible tuple updates between the two sampling occasions. If a sample tuple is deleted, or the node storing the tuple leaves the network, the tuple is always replaced. Repeated sampling uses the new samples for regular estimation as in the first occasion, while utilizing the retained samples for regression estimation (where $y_{i2}$ regresses on $y_{i1}$ as the auxiliary regression variate). The final estimate of the repeated sampling algorithm for the result of the second snapshot query is a combined estimate: a weighted sum of the regular estimate and the regression estimate, from the new portion and the retained portion of the sample set, respectively.

With the above estimation scheme, an optimal sample replacement policy is required to determine the proportions of the new and retained portions of the sample set such that the combined estimate of the result is optimal (i.e., the most accurate estimate, with minimum variance). In the two extreme cases of the replacement policy, the samples are either all replaced or all retained. As we show below, neither of these policies is optimal. With repeated sampling we establish the optimal replacement policy as follows.

Suppose that among the $n$ samples at the $k$th sampling occasion (here $k = 2$), $g$ samples are retained from the previous occasion and the remaining $f$ samples ('f' for fresh) are new samples. Let $\bar y_{kf}$, $\bar y_{kg}$, and $\bar y_k$ denote the average value of the samples in the new portion, the retained portion, and the entire sample set (both portions together) at the $k$th sampling occasion, respectively. Correspondingly, the regular estimate, the regression estimate, and the combined estimate of the result $Y_k$ of the $k$th snapshot query are denoted by $\hat Y_{kf}$, $\hat Y_{kg}$, and $\hat Y_k$, respectively.

    Estimator                                                  Variance
    $\hat Y_{2f} = \bar y_{2f}$                                $\hat\sigma^2/f = 1/W_{2f}$
    $\hat Y_{2g} = \bar y_{2g} + b(\bar y_1 - \bar y_{1g})$    $\hat\sigma^2(1-\hat\rho^2)/g + \hat\rho^2\hat\sigma^2/n = 1/W_{2g}$

    Table 4.1: Regular and Regression Estimators at the 2nd Occasion

The estimators $\hat Y_{2f}$ and $\hat Y_{2g}$ and their corresponding variances are defined in Table 4.1. Here $\hat\sigma_2^2 = 1/n \sum_{i=1}^{n}(y_{i2} - \bar y_2)^2$ is an estimate of the true variance $\sigma_2^2 = 1/N \sum_{i=1}^{N}(y_{i2} - Y_2)^2$ of $y_i$ in $R$ at the second occasion, and we have $\hat\sigma_2 \approx \hat\sigma\ (= \hat\sigma_1)$. The estimator $\hat Y_{2f}$ is simply the average value $\bar y_{2f}$ of the samples in the new portion of the sample set. For the regression estimate $\hat Y_{2g}$, we consider a linear regression with regression coefficient $b = \hat\sigma_{1,2}/\hat\sigma_1^2$. The parameter $b$ is an estimate of the true regression coefficient $B = \sigma_{1,2}/\sigma_1^2$, where $\sigma_{1,2}$ is the covariance of $y_{i1}$ and $y_{i2}$ over the entire $R$, and $\hat\sigma_{1,2}$ is its estimate from the sample set. Similarly, $\hat\rho = \hat\sigma_{1,2}/(\hat\sigma_1\hat\sigma_2)$ is an estimate of the true correlation coefficient $\rho = \sigma_{1,2}/(\sigma_1\sigma_2)$.

The combined estimate $\hat Y_2$ is derived as the sum of the two independent estimates $\hat Y_{2f}$ and $\hat Y_{2g}$, weighted inversely by their variances:

$\hat Y_2 = \alpha \hat Y_{2f} + (1-\alpha)\hat Y_{2g}$ (4.9)

where $\alpha = W_{2f}/(W_{2f} + W_{2g})$. By least squares theory, the variance of $\hat Y_2$ is:

$\mathrm{var}(\hat Y_2) = \frac{1}{W_{2f} + W_{2g}}$

which from Table 4.1 works out to:

$\mathrm{var}(\hat Y_2) = \frac{\hat\sigma^2(n - g\hat\rho^2)}{n^2 - g^2\hat\rho^2}$ (4.10)

The minimum variance $\mathrm{var}_{\min}(\hat Y_2)$ is obtained from Equation 4.10 by differentiation with respect to $g$. This gives the optimal partitioning of the sample set as:

$g_{opt} = \frac{n}{1 + \sqrt{1-\hat\rho^2}} \qquad f_{opt} = \frac{n\sqrt{1-\hat\rho^2}}{1 + \sqrt{1-\hat\rho^2}}$ (4.11)

and with optimal partitioning, the minimum variance is:

$\mathrm{var}_{\min}(\hat Y_2) = \frac{\hat\sigma^2}{2n}\left(1 + \sqrt{1-\hat\rho^2}\right)$ (4.12)

Note that if $g = 0$ (all samples replaced, as with independent sampling) or $f = 0$ (all samples retained), the estimate variance (see Equation 4.10) is equal to that of independent sampling, i.e., $\hat\sigma^2/n\ (\approx \sigma^2/n)$. However, with optimal partitioning ($g = g_{opt}$), repeated sampling improves the variance by the ratio:

$\beta = \frac{(\hat\sigma^2/n) - \mathrm{var}_{\min}(\hat Y_2)}{\mathrm{var}_{\min}(\hat Y_2)} = \frac{1 - \sqrt{1-\hat\rho^2}}{1 + \sqrt{1-\hat\rho^2}}$ (4.13)

Based on Equation 4.13, depending on the correlation $|\hat\rho|\ (\le 1)$ between the values of the tuples at successive occasions, repeated sampling can improve the accuracy of the estimation over that of independent sampling by up to 100% (with maximum correlation $|\hat\rho| = 1$). Also, as the correlation $|\hat\rho|$ increases, optimal partitioning retains a larger portion of the samples, because regression estimation is more effective. However, unless the correlation is maximal, repeated sampling replaces a considerable portion of the samples to account for tuple insertions, deletions, and pathological updates.
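For concreteness, the optimal partitioning can be sketched in a few lines (our own illustration, not Digest's code):

    #include <cmath>

    struct Partition { double g; double f; double beta; };

    Partition optimalPartition(long n, double rho) {
        const double s = std::sqrt(1.0 - rho * rho);
        return {
            n / (1.0 + s),          // g_opt: retained samples (Equation 4.11)
            n * s / (1.0 + s),      // f_opt: fresh samples    (Equation 4.11)
            (1.0 - s) / (1.0 + s)   // beta: improvement ratio (Equation 4.13)
        };
    }

For example, with $n = 400$ and $\hat\rho = 0.89$ this gives $g_{opt} \approx 275$, $f_{opt} \approx 125$, and a variance improvement ratio $\beta \approx 0.37$.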
(B) Evaluating the kth Snapshot Query:

Query evaluation at the $k$th occasion is a generalization of that at the second occasion. At the $k$th occasion, in the most general case, one can consider the regression between the current tuple values and those of not only the previous occasion but some or all of the previous occasions. Besides, one can consider the regression of the previous tuple values on those at the current occasion, to adjust previous results. Here, considering that the farther apart two occasions are, the lower the correlation and, in turn, the less effective the regression estimation, we only consider regression on the $(k-1)$th occasion. Also, we leave result adjustment with forward regression for future work.

    Estimator                                                             Variance
    $\hat Y_{kf} = \bar y_{kf}$                                           $\hat\sigma_k^2/f = 1/W_{kf}$
    $\hat Y_{kg} = \bar y_{kg} + b_k(\hat Y_{(k-1)} - \bar y_{(k-1)g})$   $\hat\sigma_k^2(1-\hat\rho_k^2)/g + \hat\rho_k^2\,\mathrm{var}(\hat Y_{(k-1)}) = 1/W_{kg}$

    Table 4.2: Regular and Regression Estimators at the kth Occasion

Table 4.2 gives the estimators $\hat Y_{kf}$ and $\hat Y_{kg}$ and their corresponding variances for the $k$th occasion. Note that unlike the bootstrapping case, in which the first occasion does not have a combined estimate, at the $k$th occasion we can use the combined estimate $\hat Y_{(k-1)}$ of the $(k-1)$th occasion (instead of the regular estimate $\bar y_1$) to calculate the regression estimate $\hat Y_{kg}$ for the $k$th occasion (compare Tables 4.1 and 4.2). The parameters $\hat\sigma_k$, $b_k$, and $\hat\rho_k$ correspond to the parameters $\hat\sigma$, $b$, and $\hat\rho$ for the second occasion, respectively, and we have $\hat\sigma_k \approx \hat\sigma$, $b_k \approx b$, and $\hat\rho_k \approx \hat\rho$. Here, we calculate the optimal partitioning and the optimal estimate variance at the $k$th occasion and show that they both rapidly approach limiting values as $k$ increases. For the combined estimate $\hat Y_k$ we have:

$\mathrm{var}(\hat Y_k) = \frac{1}{W_{kf} + W_{kg}} = c_k\,\frac{\hat\sigma^2}{n}$

where $c_k$ denotes the ratio of the variance at the $k$th occasion to that of the first occasion ($\hat\sigma^2/n$). Substituting for $W_{kf}$ and $W_{kg}$ from Table 4.2 we get:

$\frac{\hat\sigma^2}{\mathrm{var}(\hat Y_k)} = \frac{n}{c_k} = f_k + \frac{1}{\dfrac{1-\hat\rho^2}{g_k} + \dfrac{\hat\rho^2 c_{(k-1)}}{n}}$ (4.14)

where $f_k$ and $g_k\ (= n - f_k)$ are the sizes of the new and retained portions of the sample set at the $k$th occasion, respectively. To find the minimum variance $\mathrm{var}_{\min}(\hat Y_k)$ at occasion $k$, we maximize the right-hand side of Equation 4.14 by optimal partitioning $f_k = f_{k_{opt}}$ and $g_k = g_{k_{opt}}$. We differentiate the right-hand side with respect to $f_k$ and solve for $f_k$ to obtain the optimal partitioning:

$f_{k_{opt}} = \frac{n\sqrt{1-\hat\rho^2}}{c_{(k-1)}\left(1 + \sqrt{1-\hat\rho^2}\right)}$ (4.15)

With optimal partitioning $f_k = f_{k_{opt}}$ (and $g_k = g_{k_{opt}}$), Equation 4.14 becomes:

$\frac{1}{c_k} = 1 + \frac{1 - \sqrt{1-\hat\rho^2}}{c_{(k-1)}\left(1 + \sqrt{1-\hat\rho^2}\right)}$ (4.16)

which can be rewritten as:

$h_k = 1 + D\,h_{(k-1)}$ (4.17)

where $h_k = 1/c_k$ with $h_1 = 1/c_1 = 1$, and $D = \big(1 - \sqrt{1-\hat\rho^2}\big)/\big(1 + \sqrt{1-\hat\rho^2}\big)$. Repeated use of this recurrence relation gives the variance factor $c_k$ as:

$\frac{1}{c_k} = h_k = 1 + D + D^2 + \cdots + D^{(k-1)} = \frac{1 - D^k}{1 - D}$ (4.18)

Consequently, from Equation 4.14 we obtain the minimum variance $\mathrm{var}_{\min}(\hat Y_k)$:

$\mathrm{var}_{\min}(\hat Y_k) = \frac{\hat\sigma^2}{n}\left(\frac{1-D}{1-D^k}\right)$ (4.19)

Considering the asymptotic case, since $0 < D < 1$, from Equation 4.18 the limiting variance factor $c_\infty$ is:

$c_\infty = 1 - D = \frac{2\sqrt{1-\hat\rho^2}}{1 + \sqrt{1-\hat\rho^2}}$ (4.20)

Correspondingly, from Equation 4.19 the minimum variance tends to:

$\mathrm{var}_{\min}(\hat Y_\infty) = \frac{\hat\sigma^2}{n}\left(\frac{2\sqrt{1-\hat\rho^2}}{1 + \sqrt{1-\hat\rho^2}}\right)$ (4.21)

Finally, the limiting value of $f_{k_{opt}}$ is obtained from Equation 4.15 as:

$f_{\infty_{opt}} = \frac{n\sqrt{1-\hat\rho^2}}{c_\infty\left(1 + \sqrt{1-\hat\rho^2}\right)} = \frac{n}{2}$ (4.22)

irrespective of the value of $\hat\rho$.

Thus far, we have discussed and analyzed how repeated sampling can improve the accuracy of the estimate over that of independent sampling. To answer approximate continuous aggregate queries, we alternatively employ repeated sampling to improve the efficiency of query evaluation. Compared with independent sampling, repeated sampling uses fewer samples per occasion to satisfy the confidence requirement of the query. To compute the required size $n$ of the sample set with repeated sampling, we follow the same routine as described in Section 4.1.3.2.1. Here, instead of the regular estimate variance $\hat\sigma^2/n$, we use the asymptotic combined estimate variance $\mathrm{var}_{\min}(\hat Y_\infty) = \frac{1}{I}(\hat\sigma^2/n)$ of repeated sampling from Equation 4.21, where $I = \big(1 + \sqrt{1-\hat\rho^2}\big)/\big(2\sqrt{1-\hat\rho^2}\big)$ is the improvement factor. Following the same routine, we derive the counterpart of Equation 4.8 as:

$n = \frac{1}{I}\left(\frac{\hat\sigma\, l_p}{\epsilon}\right)^2$ (4.23)

However, according to Equation 4.22, with repeated sampling the actual number of new samples $f_{\infty_{opt}}$ drawn from the database is $n/2$. Thus, repeated sampling uses only $1/(2I)$ of the number of samples required by independent sampling to satisfy the same confidence requirement.
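As a numerical illustration (our own arithmetic; $\hat\rho = 0.89$ is the correlation later reported for the TEMPERATURE dataset in Table 4.3):

$\sqrt{1-\hat\rho^2} = \sqrt{1-0.89^2} \approx 0.456, \qquad D = \frac{1-0.456}{1+0.456} \approx 0.374,$

$c_\infty = 1 - D \approx 0.626, \qquad I = \frac{1}{c_\infty} \approx 1.60, \qquad \frac{1}{2I} \approx 0.31,$

so asymptotically, repeated sampling draws roughly 31% of the fresh samples per occasion that independent sampling would need for the same confidence.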
4.2 Random Sampling Operator: DBSampler

4.2.1 Algorithm

We discussed the interface of the random sampling operator in Section 4.1.2. Here, we explain the algorithm and implementation of the operator. Given a weight function $w = \{\forall v \in V \mid w_v\}$ as input, the sampling operator $S$ should randomly derive a sample node $v$ from $V$ with the sampling probability distribution $p_v = w_v / \sum_{v \in V} w_v$. We implement our sampling operator based on a distributed sampling algorithm inspired by the Markov Chain Monte Carlo (MCMC) methods for sampling from a desired probability distribution. To sample from a distribution, an MCMC method first constructs a Markov chain that has the desired distribution as its stationary distribution. With such a Markov chain, starting a traversal of the chain from any initial state, under certain conditions the distribution over the visited states converges to the stationary distribution of the chain after a sufficiently large number of steps. Once converged, the current state is returned as a sample from the desired distribution.

With our distributed sampling algorithm, we consider a peer-to-peer network as a Markov chain, with nodes as the states of the chain and links as the transitions between states. Our algorithm uses random-walking sampling agents that are forwarded from node to node to emulate the state transition process. To sample the peer-to-peer database, the sampling operator at a node initiates a random walk (a sampling agent). If the forwarding probabilities of the random walk (corresponding to the transition probabilities of the constructed Markov chain) are assigned such that the stationary distribution of the walk is equivalent to the desired sampling distribution $p_v$, then after a sufficiently large number of steps the distribution of the nodes covered by the random walk converges to $p_v$, and the current node is returned to the originating node as the sampled node. In the following, we first describe how our distributed sampling algorithm employs the Metropolis Markov chain construction algorithm to assign the forwarding probabilities of the random walk for the desired stationary distribution $p_v$. Second, we present our result on the number of steps required to converge to within an arbitrary difference of the desired distribution.

4.2.2 Forwarding Probabilities

Let the undirected connected graph $G(V, E)$ model a peer-to-peer network with arbitrary topology. A random walk starts at a node $v_0$ (the originating node), arrives at a node $v_t$ at time $t$, and with a certain forwarding probability moves to a neighbor node $v_{t+1}$ at time $t+1$. Let $\pi_t$ denote the distribution of the node $v_t$, such that $\pi_t(i) = \Pr(v_t = i)$ for all $i \in V$. Let $P = (P_{ij})$, $i, j \in V$, denote the forwarding matrix of the random walk, where $P_{ij}$ is the probability that the random walk moves from node $i$ to node $j$; $P_{ij} = 0$ if $i$ and $j$ are not adjacent. By definition, we have $\pi_{t+1} = \pi_t P = \pi_0 P^{t+1}$, where $\pi_0$ is the distribution of the originating node $v_0$. The following existence result is classic:

Theorem 1. If $P$ is irreducible (i.e., any two nodes are mutually reachable by the random walk) and $P$ is aperiodic (which it will be if $G$ is non-bipartite), then $\pi_t$ converges to the unique stationary distribution $\pi$ such that $\pi P = \pi$, independent of the initial distribution $\pi_0$.

The Metropolis algorithm [MRR+53] is designed to assign the forwarding probabilities $P_{ij}$ such that the stationary distribution $\pi$ corresponds to a desired distribution (uniform or nonuniform) such as $p_v$:

Theorem 2. Consider the graph $G(V, E)$ and let $d_i$ denote the degree of node $i$ in $G$.
For each neighbor $j$ of $i$, the forwarding probability $P_{ij}$, $i \ne j$, is defined as follows:

$P_{ij} = \begin{cases} \frac{1}{2}\left(\frac{1}{d_i}\right) & \text{if } \frac{p_i}{d_i} \le \frac{p_j}{d_j} \\ \frac{1}{2}\left(\frac{1}{d_j}\right)\left(\frac{p_j}{p_i}\right) & \text{if } \frac{p_i}{d_i} > \frac{p_j}{d_j} \end{cases}$ (4.24)

and $P_{ii} = 1 - \sum_{j \in \mathrm{neighbors}(i)} P_{ij}$. Then, with the forwarding matrix $P$, $p_v$ is the unique stationary distribution of the random walk on $G$.

The proof of Theorem 2 is involved [MRR+53]. Intuitively, to achieve a random walk with the desired stationary distribution, the Metropolis method modifies a random walk with uniform stationary distribution by biasing its forwarding probabilities. With the modified random walk, the forwarding probabilities are not only inversely proportional to the degrees of the neighbor nodes (as with the uniform random walk) but also proportional to the sampling probabilities of the neighbor nodes.

The Metropolis forwarding matrix $P$ is irreducible [DSC95]. Also, the laziness factor $1/2$ adds a virtual self-loop to each node of $G$, which makes $G$ non-bipartite and $P$ aperiodic. Thus, convergence of the Metropolis walk follows from Theorem 1. Note that using the Metropolis algorithm, $S$ implements a fully distributed sampling process that does not need to know or compute the global normalization factor $\sum_{v \in V} w_v$ (to calculate $p_v = w_v / \sum_{v \in V} w_v$) in order to assign the forwarding probabilities $P_{ij}$. Each node $i$ determines its local forwarding probabilities $P_{ij}$ ($j$ a neighbor of $i$) individually and only from local information: according to Equation 4.24, to determine $P_{ij}$, node $i$ only needs the ratio $w_j/w_i\ (= p_j/p_i)$, which it computes by obtaining the weight $w_j$ from its neighbor $j$.
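The local computation at node $i$ can be sketched as follows (our own illustration, not Digest's implementation; the Neighbor structure and function names are hypothetical). Each neighbor entry carries the neighbor's degree $d_j$ and the ratio $w_j/w_i\ (= p_j/p_i)$, the only nonlocal quantities node $i$ needs:

    #include <cstddef>
    #include <vector>

    struct Neighbor { double degree; double weightRatio; };  // d_j, w_j / w_i

    std::vector<double> metropolisProbabilities(double dI,
                                                const std::vector<Neighbor>& nb) {
        std::vector<double> P(nb.size() + 1);  // P[k] per neighbor; last = P_ii
        double out = 0.0;
        for (std::size_t k = 0; k < nb.size(); ++k) {
            // p_i/d_i <= p_j/d_j  holds iff  w_j/w_i >= d_j/d_i
            if (nb[k].weightRatio >= nb[k].degree / dI)
                P[k] = 0.5 / dI;                                // (1/2)(1/d_i)
            else
                P[k] = 0.5 / nb[k].degree * nb[k].weightRatio;  // (1/2)(1/d_j)(p_j/p_i)
            out += P[k];
        }
        P[nb.size()] = 1.0 - out;   // lazy self-loop P_ii
        return P;
    }

Since each $P_{ij} \le 1/(2d_i)$ and node $i$ has $d_i$ neighbors, the outgoing mass never exceeds $1/2$, so the self-loop probability $P_{ii}$ is always at least $1/2$.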
4.2.3 Convergence Time

To determine how rapidly $\pi_t$ converges to $p_v$, first consider the following definitions and the subsequent classic result (Theorem 3):

Definition 1. The variance difference between two distributions $\pi_t$ and $p_v$ is defined as $\|\pi_t, p_v\| = \frac{1}{2}\max_{v_0}\sum_i |\pi_t(i) - p_i|$.

The variance difference is a measure quantifying the total difference between two probability distributions, and we have $0 \le \|\pi_t, p_v\| \le 1$.

Definition 2. For $\gamma > 0$, the mixing time is defined as $\tau(\gamma) = \min\{t \mid \forall t' \ge t,\ \|\pi_{t'}, p_v\| \le \gamma\}$.

The mixing time $\tau(\gamma)$ is the time (i.e., the number of time steps) it takes for $\pi_t$ to converge to $p_v$ to within the difference $\gamma$, such that $\|\pi_t, p_v\| \le \gamma$. The following theorem bounds the mixing time when the random walk is on a graph $G$ with arbitrary topology [DS91]:

Theorem 3. Let $p_{v_{\min}} = \min_i p_i$. Then $\tau(\gamma) \le \theta_P^{-1}\log\big((p_{v_{\min}}\gamma)^{-1}\big)$, where $\theta_P$ is the eigengap of the forwarding matrix $P$.

The eigengap of $P$ is defined as $\theta_P = 1 - |\lambda_2|$, where $\lambda_2$ is the second eigenvalue of the matrix. Thus, the larger the eigengap, the more rapidly the random walk converges to the desired distribution. However, computing the exact eigengap of $P$ for peer-to-peer networks with large size and dynamic topology is difficult, if not infeasible. Instead, one can utilize the geometric bounding approach [DS91] to derive a bound on the eigengap. Considering the power-law graph as a generic and realistic model for the topology of peer-to-peer networks [SGG02], we use the geometric bounding approach to derive the mixing time (or convergence time) of the random walk on a graph $G$ with power-law topology:

Theorem 4. Suppose $G$ is a random graph with node degree distribution $p_k \propto k^{-\alpha}$, where $2 < \alpha < 3$. Then $\tau(\gamma)$ is of order $O(N^{-\alpha}\log\gamma^{-1}\log^4 N)$, where $N$ is the size of the network.

Considering this result, the mixing time, i.e., the sampling cost/time of our sampling operator $S$, is near-optimal, comparable to the optimal mixing time achieved by a centralized algorithm [BDX04].

4.3 Overcoming the Limitations of Conventional Sampling Design

In Section 4.1.3.2, with both the independent sampling and repeated sampling approaches we assumed the conventional sampling design for evaluating each snapshot aggregate query. With conventional sampling, we inevitably presume ideal conditions for the peer-to-peer database. In particular, for conventional sampling to apply, one must assume the following:

• First, it is assumed that the parameters required for estimating the query result (see Section 4.3.2) are known in advance of the query evaluation, whereas in real peer-to-peer databases the dataset is considerably dynamic and, therefore, any a priori estimate of such parameters (if available) soon becomes inaccurate and inapplicable. Moreover, even if one could continuously maintain a valid estimate of the parameters, since these parameters depend on the constraints of each specific query, maintaining parameter estimates for all queries fails to scale.

• Second, the data distribution in peer-to-peer databases is assumed to be normal (not skewed). However, in real-world peer-to-peer databases, due to the existence of rare but atypically large (in absolute size) data values, the distribution of the data is often strongly non-normal, with a Zipfian heavy-tailed skew [SGG02, GDS+03].

In practice, these assumptions effectively limit the application of the proposed query answering approaches to a restricted subset of peer-to-peer databases with essentially static datasets and normal data distributions. In this section, we introduce a set of novel sampling designs that relax the strong assumptions of conventional sampling and allow approximate aggregate query answering in generic real-world peer-to-peer databases. Digest can use any of the proposed sampling designs (and, optionally, the conventional sampling design) to evaluate aggregate queries based on the samples derived from the peer-to-peer database. We propose two alternative sampling designs to address each of the two limitations of the conventional sampling design. For unknown parameters, we introduce double sampling and sequential sampling. Both of these sampling designs estimate the unknown parameters on the fly, during the query evaluation, rather than relying on previous estimates of the parameters; hence, both are self-sufficient in terms of the estimation parameters. With the double sampling design, parameter estimation is performed with a pilot sample before the main sample used to estimate the query result is drawn, whereas with sequential sampling the parameters are estimated during the main sampling procedure, in parallel with the estimation of the query result.

On the other hand, for peer-to-peer databases with skewed data distributions, we introduce cluster sampling and inverse sampling to evaluate the queries. With these sampling designs, unlike conventional sampling with its fixed-size sample set, the sample set is adaptively extended to include a minimum number of rare data values sufficient for unbiased estimation of the query result.
Cluster sampling leverages the clustering property of the data in peer-to-peer databases [SGG02, GDS+03], whereby rare data values are observed to be clustered in certain neighborhoods within the peer-to-peer network. With cluster sampling, when a rare data value is sampled from a neighborhood, the neighborhood is further sampled to derive more rare data values for unbiased estimation. With inverse sampling, sampling of the database is extended until the percentage of rare data in the derived sample set exceeds a certain threshold. The threshold is rigorously calculated via analysis such that unbiased estimation of the query result is guaranteed.

Each of our proposed sampling designs can be used individually, as appropriate for query answering in peer-to-peer databases with the corresponding characteristics. For query answering in peer-to-peer databases with both a dynamic dataset and a skewed data distribution, we integrate these sampling designs into a universal sampling design that addresses both limitations of the conventional sampling design simultaneously.

The remainder of this section is organized as follows. For the reader's convenience, in Section 4.3.1 we restate our model for a snapshot aggregate query, and in Section 4.3.2 we review the basics of the conventional sampling design for answering a snapshot aggregate query. In Section 4.3.3, we cover the double sampling and sequential sampling designs that address the problem of unknown parameters, and we follow in Section 4.3.4 by explaining the cluster sampling and inverse sampling designs that address the problem of skewed data distribution. Finally, in Section 4.3.5 we discuss the universal sampling design.

4.3.1 Snapshot Approximate Aggregate Query

We model a peer-to-peer network as an undirected graph $G(V, L)$ with arbitrary topology. As nodes autonomously join and leave the network, the node set $V$ and the link set $L$ can vary in time. We assume the rate of churn in $G$ is relatively low compared to the sampling time (i.e., the time required to draw a sample from the peer-to-peer database), such that the network can be assumed almost static during each sampling occasion, although it may change significantly between successive sampling occasions.

[Figure 4.4: A number of snapshot approximate aggregate queries $\hat Q_{t_i}$ and their corresponding exact queries $Q_{t_i}$ over a sequence of times $t_1, \dots, t_7$, plotted as aggregate value $X$ versus time $t$. The value of $p$ for the approximate queries is not shown. Note that the result of the approximate query $\hat Q_{t_7}$ lies outside the desired confidence interval, illustrating that if $p < 1$, there is a positive probability that the estimated result exceeds the confidence limits.]

For the peer-to-peer database corresponding to such a peer-to-peer network, without loss of generality we assume a relational model. Suppose the database consists of a single relation $R = \{u_1, u_2, \dots, u_M\}$. The relation $R$ (which is a multiset, possibly containing duplicate tuples) is horizontally partitioned, with each disjoint subset of its tuples stored at a separate node. The number of tuples stored at node $v_i$ is denoted by $m_{v_i}$.
The member set of $R$ may also vary in time; the changes are due either to changes of $V$, as nodes with new content join the network (as if inserting tuples) and existing nodes leave and remove their content (as if deleting tuples), or to existing nodes autonomously modifying their local content by insertion, update, or deletion.

Assuming such peer-to-peer network and database models, we consider the following template for the aggregate queries:

    SELECT op(E) FROM R WHERE P

where op is one of the aggregate operations AVG, COUNT, or SUM, E is an arithmetic expression involving the attributes of $R$, and P is an arbitrary selection predicate. Suppose the exact (and one-time) aggregate query $Q$ is an instance of such a query evaluated at time $t$. The result of the query $Q$ is the exact value of the desired aggregate quantity $X$ at time $t$. With the approximate aggregate query $\hat Q$ corresponding to $Q$, the exact result $X$ is approximated by an estimate $\hat X$ with guaranteed precision. The approximate query $\hat Q$ includes two extra user-defined parameters, $\epsilon$ and $p$, that specify the desired (fixed) precision of the estimation. The parameter $\epsilon$ indicates the maximum tolerable relative error in the estimate $\hat X$; i.e., the approximate query should guarantee $|\hat X - X|/|X| \le \epsilon$. The interval $[X - \epsilon X, X + \epsilon X]$ is termed the confidence interval of the estimation, with $X - \epsilon X$ and $X + \epsilon X$ as the lower and upper confidence limits, respectively (see Figure 4.4). The provided guarantee is probabilistic, and with the parameter $p$ the user specifies the desired confidence level of the guarantee, i.e., the probability that the estimate $\hat X$ is actually confined within the confidence interval. The user-defined parameters $\epsilon$ and $p$ together determine the required confidence of the estimate $\hat X$. Note that an approximate query is a generalization of the corresponding exact query; the exact query is an approximate query with $\epsilon = 0$ and $p = 1$. The semantics defined above subsumes all definitions of approximate aggregate queries in the related work.
4.3.2 Conventional Sampling

Among the three types of aggregate queries AVG, COUNT, and SUM, any two types can be reduced to the third with minor modifications. Here, we describe the query evaluation process for an approximate AVG query, which is easier to explain using our notation. Consider the AVG query $\hat Q$:

    SELECT AVG(E) FROM R WHERE P

To answer $\hat Q$ by conventional sampling, initially $n$ sample tuples $s_1, s_2, \dots, s_n$ that satisfy the predicate P are derived from $R$. The samples are uniformly random (with replacement). To derive each sample, first the sampling operator $S$ is called with the weight function $w = \{\forall v \in V \mid w_v = m_v\}$ to draw a sample node with a sampling probability proportional to its content size. Thereafter, the content of the sample node is uniformly sampled to derive a sample tuple. The sample tuple is a uniform random sample from $R$, and it is a valid sample unless it fails to satisfy P, in which case it is rejected and the procedure is repeated to derive a new sample. We call the time interval during which the $n$ required samples are derived from the database the sampling occasion for the query $\hat Q$.

Next, based on the derived samples, the result of the query is approximated as follows. Let $y_{u_j}$ denote the value of the expression E when applied to a tuple $u_j$. The exact result of the AVG query is $Y = 1/N \sum_{k=1}^{N} y_{r_k}$, where $r_k$ is a tuple in $R$ that satisfies the predicate P and $N$ is the total number of such tuples. With conventional sampling, the exact result $Y$ is approximated by the unbiased and consistent estimator $\hat Y$ based on the sample tuples $s_i$:

$\hat Y = 1/n \sum_{i=1}^{n} y_{s_i}$ (4.25)

However, to ensure the fixed precision of the query, the number of samples $n$ must be sufficiently large that with probability $p$ the estimate $\hat Y$ is guaranteed to be confined within the confidence interval $[Y - \epsilon Y, Y + \epsilon Y]$. To determine $n$, on the one hand, assuming a relatively large confidence interval (i.e., when $\epsilon Y$ is comparable to $Y$ in order of magnitude), one can use Hoeffding's tail bound as follows. The Hoeffding inequality asserts that:

$\Pr\{|\hat Y - Y| \le \epsilon Y\} \ge 1 - 2e^{-2n\epsilon^2 Y^2/(b-a)^2}$ (4.26)

where $a$ and $b$ are two constants such that $a < y_{u_j} < b$ for all $u_j$ in $R$. To derive $n$, we set the right side of Inequality 4.26 equal to $p$ and solve for $n$:

$n = \frac{(b-a)^2}{2\epsilon^2 Y^2}\,\ln\big(2/(1-p)\big)$ (4.27)

On the other hand, for relatively small confidence intervals, since the Hoeffding bound is not tight it renders the sample size $n$ redundantly large and, consequently, the query evaluation inefficient. In such a case we instead use a tighter bound based on the standard central limit theorem to calculate $n$. Let $\sigma^2 = 1/N \sum_{k=1}^{N}(y_{r_k} - Y)^2$ be the variance of $y_{r_k}$. Assuming that $y_{r_k}$ has a normal distribution, for sufficiently large $n$ it follows from the central limit theorem [Fel57] that the random variable $\hat Y$, our estimator, also has a normal distribution with mean $Y$ and variance $\sigma^2/n$. Equivalently, under such conditions the standardized random variable $\sqrt n(\hat Y - Y)/\sigma$ has a normal distribution with mean 0 and variance 1. Therefore we have:

$\Pr\{|\hat Y - Y| \le \epsilon Y\} = \Pr\left\{\left|\frac{(\hat Y - Y)\sqrt n}{\sigma}\right| \le \frac{\epsilon Y\sqrt n}{\sigma}\right\} = 2\left(\Phi\Big(\frac{\epsilon Y\sqrt n}{\sigma}\Big) - 1/2\right)$ (4.28)

where $\Phi$ is the standard cumulative normal distribution function. Let $l_p$ be the $(p+1)/2$ quantile of this distribution (i.e., $\Phi(l_p) = (p+1)/2$). Finally, considering the definition of the confidence interval and confidence level, we set the rightmost term in Equation 4.28 equal to $p$ and solve for $n$. The derived value of $n$ determines the number of samples sufficient to satisfy the precision requirements of the query:

$n = \left(\frac{\sigma\, l_p}{\epsilon Y}\right)^2$ (4.29)

The limitations of the conventional sampling design for query evaluation in peer-to-peer databases are twofold. First, according to Equations 4.27 and 4.29, one at least needs a preliminary estimate of the parameters $Y$ and $\sigma$ to determine the sample size $n$ (note that with the AVG query, $Y$ happens to be the desired result of the query as well). While for relatively static populations/datasets a sufficiently accurate previous estimate of these parameters is plausible, in a peer-to-peer database with a presumably dynamic dataset any a priori estimate of the parameters (if available) soon becomes inaccurate and inapplicable. Moreover, even if one could continuously maintain a valid estimate of these parameters, since the parameters depend on the selection condition P of each specific query, maintaining parameter estimates for all queries fails to scale. Second, for the central limit theorem to hold with conventional sampling, the distribution of the data ($y_{r_k}$) is assumed to be approximately normal, whereas in real-world peer-to-peer databases the distribution of the data is often strongly non-normal, with a Zipfian heavy-tailed skew [SGG02, GDS+03].
4.3.3 Sampling with Unknown Parameters

In this section, we propose two alternative sampling designs, double sampling and sequential sampling, which are self-sufficient in terms of the parameters required for determining the sample size $n$. With both of these sampling designs the parameters are assumed to be unknown, and the main idea is to estimate the unknown parameters on the fly, during the query evaluation, rather than relying on previous estimates of the parameters. Note that the number of samples with any sampling design without a priori knowledge of the parameters exceeds the optimal number of samples with conventional sampling (for which the parameters are assumed known). The extra samples are required for parameter estimation, and the fewer the extra samples, the more efficient the sampling design.

Both double sampling and sequential sampling follow the same procedure used with conventional sampling to derive each sample, such that it is uniformly random and satisfies the selection condition P (see Section 4.3.2). Also, both sampling designs assume the same estimator used with conventional sampling (Equation 4.25) to approximate the result of the AVG query $\hat Q$. However, they differ from conventional sampling (and from each other) in the approach they employ to derive the unknown parameters and determine the sample size $n$. Next, we explain the approach adopted by each of these sampling designs.

4.3.3.1 Double Sampling

With double sampling, the total number of required samples $n$ is initially unknown. The samples are taken (all uniformly random and satisfying P) in two batches, at two consecutive steps. At the first step, a small number of samples, $n_1$, is drawn; these comprise the pilot sample. The pilot sample is used to obtain a preliminary estimate of the unknown parameters $Y$ and $\sigma$. Based on this preliminary estimate, the required total sample size $n$ that guarantees the fixed precision of the query is computed. At the second step, additional samples (i.e., $n - n_1$ samples, if $n > n_1$) are drawn, and the final estimate of the query result is produced based on the combined sample (which consists of the samples derived at the second step as well as those of the pilot sample).

Figure 4.5 depicts the details of our query evaluation algorithm with the double sampling design. As shown in the figure, at the first step, after deriving the pilot sample $S_1$, we use the standard mean and variance estimators to produce the preliminary estimates $\hat Y_1$ and $\hat\sigma_1$ for the unknown parameters $Y$ and $\sigma$, respectively. The size $n_1$ of the pilot sample $S_1$ is a fixed number (more discussion on how to choose a value for $n_1$ follows). Having a preliminary estimate of the parameters, we compute the required total sample size $n$ based on the following. Consider Lemma 5, a specific case of the theorem established by Cox [Cox52]:

Lemma 5. Let $\hat Y_1$ and $\hat\sigma_1$ be the preliminary estimates for the unknown parameters $Y$ and $\sigma$ with a pilot sample of size $n_1$.
The double sampling design can achieve a variance $f(Y)$ ($f$ an arbitrary function) for the final estimate $\hat Y$ iff the size $n$ of the combined sample is at least:

$n = n(\hat Y_1, \hat\sigma_1) = \frac{\hat\sigma_1^2}{f(\hat Y_1)}\big(1 + b(\hat Y_1, \hat\sigma_1)\big)$

where:

$b(\hat Y_1, \hat\sigma_1) = 2f''(\hat Y_1) + \frac{[f'(\hat Y_1)]^2}{f(\hat Y_1)} + \frac{\hat\sigma_1^2\, f''(\hat Y_1)}{2n_1 f(\hat Y_1)} + \frac{2}{n_1}$

Using Lemma 5, one can derive the required total sample size to obtain a fixed-precision estimate of the query result when the precision is defined in terms of the variance of the estimated result.

    EvaluateQuery_DS(Q̂, ε, p)
    begin
      // STEP 1: deriving the pilot sample S1
      w = {∀v ∈ V | w_v = m_v};
      S1 = ∅; i = 0;
      while (i < n1) {
        s = S(w);   // S(.) is the sampling operator
        if (sample s satisfies the predicate P) {
          S1 = S1 ∪ {s}; i = i + 1;
        }
      }
      // Estimating the parameters
      Ŷ1 = (1/n1) Σ_{s_i ∈ S1} y_{s_i};
      σ̂1² = (1/n1) Σ_{s_i ∈ S1} (y_{s_i} − Ŷ1)²;
      // Computing the total sample size n (Theorem 6)
      n = (σ̂1 l_p / (ε Ŷ1))² (1 + 8(ε/l_p) + σ̂1²/(n1 Ŷ1²) + 2/n1);
      // STEP 2: deriving additional samples
      if (n <= n1)
        S_c = S1;
      else {
        S2 = ∅; i = 0;
        while (i < (n − n1)) {
          s = S(w);
          if (sample s satisfies the predicate P) {
            S2 = S2 ∪ {s}; i = i + 1;
          }
        }
        S_c = S1 ∪ S2;
      }
      // Producing the final estimate of the query result
      Ŷ_D = (1/n) Σ_{s_i ∈ S_c} y_{s_i};
      return Ŷ = Ŷ_D;
    end

    Figure 4.5: Query Evaluation by Double Sampling

However, with our approximate aggregate query model (see Section 4.3.1), the required precision of the query result $\hat Y$ is defined by specifying a fixed confidence interval (determined by $\epsilon$) and confidence level (determined by $p$) for the estimated result. With Theorem 6, we extend Lemma 5 to derive the required combined sample size according to our fixed-precision approximation model:

Theorem 6. Let $\hat Y_1$ and $\hat\sigma_1$ be the preliminary estimates for the unknown parameters $Y$ and $\sigma$ with a pilot sample of size $n_1$. For the double sampling design to guarantee that with probability $p$ its estimate $\hat Y_D$ is confined within the confidence interval $[Y - \epsilon Y, Y + \epsilon Y]$, the size $n$ of the combined sample must be at least:

$n = n(\hat Y_1, \hat\sigma_1, \epsilon, p) = \left(\frac{\hat\sigma_1 l_p}{\epsilon \hat Y_1}\right)^2 \left(1 + 8\Big(\frac{\epsilon}{l_p}\Big) + \frac{\hat\sigma_1^2}{n_1 \hat Y_1^2} + \frac{2}{n_1}\right)$ (4.30)

Note that according to Equations 4.29 and 4.30, the extra cost of double sampling (compared to conventional sampling) for parameter estimation is roughly of the order of:

$\left(\frac{\hat\sigma_1 l_p}{\epsilon \hat Y_1}\right)^2 \left(8\Big(\frac{\epsilon}{l_p}\Big) + \frac{\hat\sigma_1^2}{n_1 \hat Y_1^2} + \frac{2}{n_1}\right)$

It is shown that this extra cost is near-optimal for any sampling design without a priori knowledge of the parameters [Cox52].

As mentioned above, the samples derived at the first step are not only used for preliminary estimation but are also re-used as part of the combined sample, to save some sampling cost. There is a trade-off in choosing the proper size for the pilot sample. On the one hand, according to Equation 4.30, the larger the pilot sample size $n_1$, the smaller the required total combined sample size $n$. On the other hand, if $n_1$ is too large, such that $n \le n_1$ (in which case the sampling at the second step is skipped and the pilot sample comprises the entire combined sample), the combined sample is already oversized, resulting in redundant sampling. Unfortunately, the optimal value for $n_1$ cannot be determined until after we have a preliminary estimate of the parameters (i.e., after the first step), which is too late. Fortunately, however, according to Equation 4.30 the total sample size $n$ is not very sensitive to the value of $n_1$. Here, we choose a fixed value between 100 and 200, which almost never exceeds the total sample size for most approximate queries (unless the precision requirements of the query are too tight, with $\epsilon \to 0$ and $p \to 1$).
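For concreteness, the size rule of Theorem 6 can be sketched as follows (an illustration of our own, not Digest's code; identifiers are hypothetical):

    #include <cmath>

    // Equation 4.30: total combined-sample size from the pilot estimates.
    long doubleSamplingTotalSize(double y1Hat, double sigma1Hat,
                                 long n1, double eps, double lp) {
        const double base = sigma1Hat * lp / (eps * y1Hat);
        const double correction = 1.0 + 8.0 * (eps / lp)
                                + (sigma1Hat * sigma1Hat) / (n1 * y1Hat * y1Hat)
                                + 2.0 / (double)n1;
        return (long)std::ceil(base * base * correction);
    }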
4.3.3.2 Sequential Sampling

In contrast with double sampling, in which the samples are derived in batches (the first batch for parameter estimation and the second batch to estimate the final query result), with sequential sampling the samples are drawn in an (initially) indefinite sequence of steps, one sample at each step. The unknown parameters are re-estimated at each step using all the samples obtained so far, and the sampling terminates when, according to the most recent estimates of the parameters, a stop condition is satisfied. The stop condition guarantees sufficiency of the collected samples for the fixed precision of the query result. Figure 4.6 illustrates the details of our query evaluation algorithm with the sequential sampling design.

As with double sampling, we use the standard mean and variance estimators to produce the estimates $\hat Y_n$ and $\hat\sigma_n$ for the unknown parameters $Y$ and $\sigma$ at step $n$. However, to avoid redundant computation, we use a one-pass estimation procedure to incrementally compute the estimates at each step, based on the estimates from the previous step and the newly derived sample.

    EvaluateQuery_SS(Q̂, ε, p)
    begin
      // Parameter initialization
      w = {∀v ∈ V | w_v = m_v};
      Ŷn = 0; σ̂n = 0;   // parameter estimates at step n
      n = 0;
      // Starting the sequential sampling
      do {
        do
          s = S(w);   // S(.) is the sampling operator
        while (sample s does not satisfy the predicate P);
        // Estimating the parameters at step n
        σ̂n² = σ̂n² + n(Ŷn − y_s)² / (n + 1);
        Ŷn = (n Ŷn + y_s) / (n + 1);
        n = n + 1;
        // Checking the stop condition
      } while (n < (σ̂n l_p / (ε Ŷn))²);
      // Producing the final estimate of the query result
      return Ŷ = Ŷn;
    end

    Figure 4.6: Query Evaluation by Sequential Sampling

For the stop condition, the idea is to use a condition similar to Equation 4.29 (which specifies the sufficient number of samples with conventional sampling), replacing the exact (but unknown) parameters $Y$ and $\sigma$ with the estimates $\hat Y_n$ and $\hat\sigma_n$, respectively. The following theorem follows from the theorem established by Chow et al. [CR65] and ascertains the asymptotic validity of our stop condition:

Theorem 7. Let $\hat Y_n$ and $\hat\sigma_n$ be the intermediate estimates for the unknown parameters $Y$ and $\sigma$ at step $n$ of the sequential sampling process. The sequential sampling design guarantees that with probability $p$ the estimate $\hat Y$ is confined within the confidence interval $[Y - \epsilon Y, Y + \epsilon Y]$ as $\epsilon \to 0$ iff:

$n \ge \left(\frac{\hat\sigma_n l_p}{\epsilon \hat Y_n}\right)^2$ (4.31)

Sequential sampling has two advantages over double sampling. First, sequential sampling does not require a pilot sample, and hence there is no trade-off in selecting the pilot sample size either. Second, as we also show with empirical results (see Section 4.4.2.2), the cost of sampling (i.e., the number of samples derived) with sequential sampling is less than that of double sampling (although still more than the minimal cost of conventional sampling). We attribute this property to the fact that with sequential sampling the parameter estimates are refined gradually and sampling terminates as soon as the estimates are sufficiently accurate, whereas with double sampling the estimates are produced from a batch of samples that might be oversized and redundant for the required precision. For these reasons, our universal sampling solution adopts the sequential sampling approach to address the unknown parameters problem (see Section 4.3.5).
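A minimal one-pass running estimator of the kind used in Figure 4.6 can be sketched as follows (a Welford-style sketch of our own; variable names are ours, not the dissertation's):

    #include <cmath>

    struct RunningStats {
        long   n    = 0;   // samples seen so far
        double mean = 0;   // running estimate of Y
        double m2   = 0;   // running sum of squared deviations
        void push(double y) {
            ++n;
            const double d = y - mean;
            mean += d / n;
            m2 += d * (y - mean);   // uses the updated mean
        }
        double variance() const { return n > 1 ? m2 / n : 0.0; }
    };

    // Stop condition of Theorem 7: stop once
    // n >= (sigma_hat * l_p / (eps * Y_hat))^2.
    bool shouldStop(const RunningStats& s, double eps, double lp) {
        const double b = std::sqrt(s.variance()) * lp / (eps * s.mean);
        return s.n >= b * b;
    }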
4.3.4 Sampling with Skewed Data Distribution

In this section, we propose two further sampling designs, cluster sampling and inverse sampling, to overcome the second limitation of conventional sampling. As mentioned in Section 4.3.2, with the skewed distribution of the data $y_u$ in real-world peer-to-peer databases (recall that $y_u$ denotes the value of the expression E in the query when applied to the tuple $u$, where $u$ is a tuple in $R$ that satisfies the predicate P of the query), the basic assumption of the central limit theorem does not hold and the theorem fails. Consequently, the fundamental conjecture of conventional sampling, that the estimator $\hat Y$ is normally distributed, is also contradicted, and its results (Equations 4.28 and 4.29) are invalidated. One can interpret this limitation of conventional sampling more intuitively as follows. With the typically heavy-tailed skew of the data distribution in real-world peer-to-peer databases, the database contains mostly typical data values along with a few very large (in absolute size) data values, which dramatically affect the average/mean $Y$ of the data (and other aggregate characteristics of the data, for that matter). On the other hand, with conventional sampling the sample size is limited and fixed, and therefore there is always a reasonable probability that the rare large data values are excluded from the sample. Consequently, with a skewed data distribution the estimator $\hat Y$ of conventional sampling often underestimates the result $Y$ of the AVG query. As a result, the distribution of the estimator is biased, with a positive skew such that most estimates underestimate the average value.

To address this problem, the main idea with both cluster sampling and inverse sampling is to base the estimation on a sample set with adaptive (not fixed) size. With adaptive sampling, the number of derived samples is not fixed in advance but rather depends on the observed values of the samples. While sampling the database, the sample set is extended so far as to include a sufficient number of samples with rare large values to avoid underestimating the query result. Hence, these sampling designs allow unbiased estimation with a normal estimate distribution (more details follow in Sections 4.3.4.1 and 4.3.4.2).

The definition of the condition that constitutes a large data value is query-dependent. For instance, with the query "What is the current average temperature in the Downtown area of the City of Los Angeles?", one can consider $T > 100^{\circ}$F as high temperature. In general, a condition $C$ partitions the set of data values $D = \{y_u\}$ into two subsets $D_c$ and $D_{c'} = D - D_c$, where $D_c = \{y_u \mid y_u \in D \text{ and } y_u \in C\}$ is the set of data values satisfying the condition $C$. A common form for a condition $C$ that identifies large data values is $C = \{y_u \mid y_u > c\}$, where $c$ is a constant in the domain of $y_u$. With cluster sampling and inverse sampling, the user is expected to define such a condition $C$ in addition to the other query parameters.

4.3.4.1 Cluster Sampling

Cluster sampling applies when the rare large data values are clustered within the peer-to-peer database.
The data is said to be clustered if, whenever we find large data value(s) at a node $v$, it is probable that we also find large data values at the nodes in the neighborhood of $v$ (i.e., the nodes within a short network distance of $v$). It is known that in most real-world peer-to-peer databases the data is in fact strongly clustered [SGG02, GDS+03]. With cluster sampling, an initially random sample set is extended to include the large data values that may be clustered in the neighborhoods of the samples with large data values, and the extended sample set is used to estimate the query result.

To initiate cluster sampling, first a primary sample $S_p$ of fixed size $n$ is derived from the database (all samples in $S_p$ are drawn uniformly at random and satisfy P). Thereafter, for each sample $s_i \in S_p$, $i: 1$ to $n$, if $y_{s_i} \in D_c$, the neighborhood of the node $v$ storing $s_i$ is also sampled, as follows. First, all (adjacent) neighbors of node $v$ are sampled. The local data at each neighbor is sampled randomly to draw one sample per node, and the derived samples are added to the sample set $S_{s_i}$ (initially, $S_{s_i} = \emptyset$). Next, if any of the new samples $s'$ in $S_{s_i}$ are such that $y_{s'} \in D_c$, the neighbors of the nodes storing these samples are also sampled similarly and appended to $S_{s_i}$. This procedure continues in a cascading fashion, at each step sampling the neighbors of the nodes storing the samples derived at the previous step, but only for the samples with large data values. The procedure terminates when none of the newly derived samples satisfies the condition $C$. Once the neighborhood sampling procedure is completed for every primary sample $s_i$ in $S_p$, the final query result is estimated based on the combined sample set $S_c = \bigcup_{i=1}^{n} S_{s_i} \cup S_p$.

The combined sample consists of the primary samples as well as the samples derived from the neighborhoods of the primary samples with large data values. Consequently, leveraging the clustering property of the data, by capturing large data values the combined sample enables unbiased estimation of the result. Besides, as we show below, the variance of the estimate based on the combined sample is less than that of an estimate based on a fixed-size sample of the same size with conventional sampling. Hence, with cluster sampling the precision of the result (or, equivalently, the efficiency of achieving the same precision) is also improved.

Figure 4.7 depicts more details of our query evaluation algorithm with the cluster sampling design. Since cluster sampling is an adaptive sampling design that favors samples with large data values, applying the standard average estimator (Equation 4.25) to the entire combined sample $S_c$, as with conventional sampling, would introduce bias into the final estimate. Instead, we use a modified version of the Hansen-Hurwitz estimator [HH43]. As illustrated in the figure, with this estimator we first compute the standard average of the samples derived from the neighborhood of each primary sample, including the primary sample itself (i.e., the average of the samples in the combined neighborhood sample $S_{c_i} = S_{s_i} \cup \{s_i\}$). Thereafter, we calculate the final estimate of the query result as the mean of the average estimates over all neighborhoods. With Theorem 8 we show that our estimator is unbiased, and we also propose an unbiased estimator for the variance of the average estimator:

Theorem 8. Let $\hat Y_c$ be the modified Hansen-Hurwitz estimator with cluster sampling, and $\sigma_c^2$ the variance of the estimate $\hat Y_c$.
The cluster sampling design guarantees that $\hat Y_c$ is an unbiased estimator for $Y$ with a normal estimate distribution, i.e., the average of the estimates $\hat Y_c$ over all possible sample sets $S_c$ is equal to $Y$. Moreover, $\hat\sigma_c^2$ is an unbiased estimator for the variance $\sigma_c^2$:

$\hat\sigma_c^2 = \frac{1}{n}\,\frac{\sum_{i=1}^{n}(y_i^* - \hat Y_c)^2}{n}$ (4.32)

where $y_i^*$ is the standard average of the combined sample $S_{c_i}$ at the neighborhood of the $i$-th primary sample $s_i$.

    EvaluateQuery_CS(Q̂, ε, p, C)
    begin
      // Deriving the primary sample S_p
      w = {∀v ∈ V | w_v = m_v};
      S_p = ∅; i = 1;
      while (i <= n) {
        s_i = S(w);   // S(.) is the sampling operator
        if (sample s_i satisfies the predicate P) {
          S_p = S_p ∪ {s_i}; i = i + 1;
        }
      }
      // Extending the primary sample to the combined sample
      for (i = 1; i <= n; i++) {
        // Deriving the samples from the neighborhood of each primary sample
        if (y_{s_i} ∈ D_c)   // v is the node storing s_i
          S_{s_i} = SampleNeighborhood(v, C);
        else
          S_{s_i} = ∅;
        // Combining the neighborhood sample with the primary sample;
        // the combined neighborhood samples S_{c_i} constitute the
        // combined sample S_c
        S_{c_i} = S_{s_i} ∪ {s_i};
        // Computing the standard average for the combined sample
        // at each neighborhood
        y*_i = (1/|S_{c_i}|) Σ_{s ∈ S_{c_i}} y_s;
      }
      // Producing the final estimate of the query result based on the
      // combined sample estimates at all neighborhoods
      Ŷ_c = (1/n) Σ_{i=1}^{n} y*_i;
      return Ŷ = Ŷ_c;
    end

    Figure 4.7: Query Evaluation by Cluster Sampling

Considering $\sigma^{*2} = \frac{\sum_{i=1}^{n}(y_i^* - \hat Y_c)^2}{n}$ as an estimate of the variance of the data across the data clusters (neighborhoods) in the peer-to-peer database, we have $\hat\sigma_c^2 = \sigma^{*2}/n$. Therefore, following the same argument as with conventional sampling (Equations 4.28 and 4.29), for cluster sampling we derive the required sample size $n$ that guarantees with probability $p$ that the estimate $\hat Y_c$ is confined within the confidence interval $[Y - \epsilon Y, Y + \epsilon Y]$:

$n = \left(\frac{\sigma^* l_p}{\epsilon \hat Y_c}\right)^2$ (4.33)

Like the conventional sampling design, cluster sampling assumes that a preliminary estimate of the parameters $\hat Y_c$ and $\sigma^*$ is available for determining the sample size $n$. In Section 4.3.5, we integrate cluster sampling with the self-sufficient sampling designs proposed in Section 4.3.3 to address both the unknown parameters and the skewed data distribution problems simultaneously.

As mentioned above, estimation with the cluster sampling design is not only unbiased but also more efficient compared to conventional sampling. Note that, on the one hand, with a skewed data distribution $\hat Y$ underestimates the average $Y$ whereas $\hat Y_c$ is unbiased; thus, whp (with high probability, i.e., with probability greater than $1 - 1/n^{O(1)}$) $\hat Y_c \ge \hat Y$. On the other hand, due to the clustering property of the data, we have $\sigma^* \le \sigma$, and the more clustered the data, the larger the gap between $\sigma^*$ and $\sigma$. Therefore, comparing Equations 4.29 and 4.33, it is easy to observe that to achieve the same precision the sample size $n$ is almost always smaller with cluster sampling than with conventional sampling; hence, cluster sampling is more efficient. We verify this observation further in Section 4.4.2 via experiments.
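As a sketch of our own (not Digest's code), the modified Hansen-Hurwitz estimate of Figure 4.7 and the variance estimate of Equation 4.32 can be computed as follows, given, for each primary sample, the data values drawn from its neighborhood (including the primary sample itself):

    #include <cstddef>
    #include <numeric>
    #include <vector>

    struct ClusterEstimate { double yHat; double varHat; };

    ClusterEstimate hansenHurwitz(const std::vector<std::vector<double>>& nbhd) {
        const std::size_t n = nbhd.size();
        std::vector<double> yStar(n);
        for (std::size_t i = 0; i < n; ++i)   // per-neighborhood means y*_i
            yStar[i] = std::accumulate(nbhd[i].begin(), nbhd[i].end(), 0.0)
                       / nbhd[i].size();
        const double yHat =                   // mean of the neighborhood means
            std::accumulate(yStar.begin(), yStar.end(), 0.0) / n;
        double ss = 0.0;
        for (double y : yStar) ss += (y - yHat) * (y - yHat);
        return { yHat, ss / n / n };          // Equation 4.32
    }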
It is important to note that although the actual size of the combined sample $S_c$ is larger than $n$ (in fact, $n$ is only the size of the primary sample $S_p$), the cost of deriving the additional samples from the neighborhoods of the primary samples is negligible, because these samples are drawn from the neighbors of the nodes storing the primary samples, which are accessible within a few hops. It is also interesting to note that the more strongly the data is clustered, the smaller the variance $\sigma^*$ and, therefore, according to Equation 4.33, the smaller the required sample size and the higher the efficiency of cluster sampling. On the other hand, as the data becomes declustered, the sampling procedure and the performance of the cluster sampling design converge to those of conventional sampling.

4.3.4.2 Inverse Sampling

To achieve unbiased estimation despite the skewness of the data distribution, the main idea of inverse sampling is to define the "stop condition" of the sampling such that it guarantees that an adequate number of samples with large data values is captured from the database. With inverse sampling, the database is sampled adaptively and sequentially, deriving one sample at each step (uniformly at random and satisfying P). The sampling procedure stops at a step if and only if a certain percentage $\alpha$ of the samples derived up to that step have large data values (i.e., $y_s \in D_c$). The parameter $\alpha$ of the stop condition (determined analytically) is chosen sufficiently large to ensure unbiased estimation of the query result while satisfying the precision requirements of the query (see below).

Figure 4.8 illustrates our query evaluation algorithm with the inverse sampling design. With inverse sampling we use the standard average estimator to calculate the query result. As shown in the figure, the estimate is computed incrementally.

    EvaluateQuery_IS(Q̂, ε, p, C)
    begin
      // Parameter initialization
      w = {∀v ∈ V | w_v = m_v};
      n = 0;   // total number of the derived samples
      k = 0;   // number of the derived samples in D_c
      // Starting the sampling
      do {
        do
          s = S(w);   // S(.) is the sampling operator
        while (sample s does not satisfy the predicate P);
        // Updating the estimate incrementally
        Ŷ_I = (n Ŷ_I + y_s) / (n + 1);
        n = n + 1;
        if (y_s ∈ D_c) k = k + 1;
        // Checking the stop condition
      } while (k/n < α);
      // Returning the final estimate of the query result
      return Ŷ = Ŷ_I;
    end

    Figure 4.8: Query Evaluation by Inverse Sampling

Theorem 9 ascertains that our estimate is unbiased and proposes an unbiased estimator for the variance of the estimate:

Theorem 9. Let $\hat Y_I$ be the inverse sampling estimator, and $\sigma_I^2$ the variance of the estimate $\hat Y_I$. The inverse sampling design guarantees that $\hat Y_I$ is an unbiased estimator for $Y$ with a normal estimate distribution. Furthermore, $\hat\sigma_I^2$ is an unbiased estimator for the variance $\sigma_I^2$:

$\hat\sigma_I^2 = \alpha\sigma_c^2 + (y_c - y_{c'})^2 + (1-\alpha)\sigma_{c'}^2$ (4.34)

where $y_c$ and $y_{c'}$ are the means, and $\sigma_c^2$ and $\sigma_{c'}^2$ the variances, of the data in $D_c$ and $D_{c'}$, respectively.

Since the estimate $\hat Y_I$ has a normal distribution, we can follow the same argument as with conventional sampling (Equations 4.28 and 4.29) to derive the required parameter $\alpha$ such that it guarantees with probability $p$ that the estimate $\hat Y_I$ is confined within the confidence interval $[Y - \epsilon Y, Y + \epsilon Y]$:

$\alpha = \frac{\left(\frac{\epsilon Y_I}{l_p}\right)^2 - (y_c - y_{c'})^2 - \sigma_{c'}^2}{\sigma_c^2 - \sigma_{c'}^2}$ (4.35)

To calculate $\alpha$ prior to beginning the evaluation of the query by inverse sampling, it is assumed that preliminary estimates of the parameters $y_c$, $y_{c'}$, $\sigma_c^2$, $\sigma_{c'}^2$, and $Y_I$ are available.
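As a sketch of our own (not Digest's implementation), Equation 4.35 reduces to a few lines; the clamping to $[0, 1]$ is our addition, since a value outside that range signals that the precision target cannot be met by adjusting $\alpha$ alone:

    #include <algorithm>   // std::clamp (C++17)

    double inverseSamplingAlpha(double yC, double yCp, double varC, double varCp,
                                double yI, double eps, double lp) {
        const double target = (eps * yI / lp) * (eps * yI / lp);
        const double alpha =
            (target - (yC - yCp) * (yC - yCp) - varCp) / (varC - varCp);
        return std::clamp(alpha, 0.0, 1.0);   // our safeguard, not in Eq. 4.35
    }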
Next, we integrate inverse sampling with sequential sampling (from Section 4.3.3.2) to become self-sufficient in terms of these parameters (see Section 4.3.5). We study the efficiency of inverse sampling empirically in Section 4.4.2.

4.3.5 Universal Sampling

Universal sampling is self-sufficient in terms of the unknown parameters required for estimating the query result, while allowing unbiased estimation despite the skewness of the data distribution. With universal sampling, the basic sampling procedure is similar to that of inverse sampling, with a stop condition that ensures a certain percentage $\alpha$ of the large data values are sampled to guarantee unbiased estimation. However, with inverse sampling $\alpha$ is fixed and calculated in advance from a priori known estimates of the required parameters (see Equation 4.35), whereas with universal sampling the parameters required to calculate $\alpha$ are initially assumed unknown and are progressively estimated by an approach adopted from sequential sampling. Correspondingly, $\alpha$ is calculated progressively, based on the estimated parameters, and evolves throughout the sampling procedure. Moreover, to leverage the clustering property of the data in peer-to-peer databases (whenever applicable) for more efficient sampling, we also integrate cluster sampling into universal sampling, by including the neighborhood of each sample in the sample set derived from the database.

Figure 4.9 illustrates the details of our query evaluation algorithm with the universal sampling design. With universal sampling, at each step a new sample $s$ is derived from the database and is used to produce the incremental estimates $\hat y_c^*$, $\hat\sigma_c^{*2}$, $\hat y_{c'}^*$, and $\hat\sigma_{c'}^{*2}$ for the unknown parameters $y_c$, $\sigma_c^2$, $y_{c'}$, and $\sigma_{c'}^2$, respectively. The parameter estimates produced at each step are used to calculate the stop condition for the same step (see Theorem 10 below). To produce these estimates, we employ the same one-pass estimators introduced with sequential sampling (see Section 4.3.3.2). However, here we also adopt cluster sampling: instead of the data value $y_s$ of each derived sample $s$, we use the more accurate value $y_s^*$, i.e., the mean of the data values at the neighborhood of the sample $s$, to compute the estimates. Theorem 10 proves the asymptotic validity of the stop condition derived from the incremental parameter estimates:

Theorem 10. Let $\hat Y_U$ be the estimate of the query result with universal sampling. The universal sampling design guarantees that with probability $p$ the estimate $\hat Y_U$ is confined within the confidence interval $[Y - \epsilon Y, Y + \epsilon Y]$ as $\epsilon \to 0$ iff:

$\alpha = \frac{\left(\frac{\epsilon \hat Y_U}{l_p}\right)^2 - (\hat y_c^* - \hat y_{c'}^*)^2 - \hat\sigma_{c'}^{*2}}{\hat\sigma_c^{*2} - \hat\sigma_{c'}^{*2}}$ (4.36)

where $\alpha$, the stop condition at each step of universal sampling, is calculated from the parameter estimates at the same step.
EvaluateQuery_US(Q, ε, p, C)
begin
    // Parameter initialization
    w = {∀v ∈ V | w_v = m_v};
    y*_c = 0; y*_c' = 0;          // means of the samples in D_c and D_c'
    σ*_c = 0; σ*_c' = 0;          // SDs of the samples in D_c and D_c'
    n = 0;  // total number of derived samples
    k = 0;  // number of derived samples in D_c
    do {
        do
            s = S(w);  // S(.) is the sampling operator
        while (sample s does not satisfy the predicate P);
        // Sampling the neighborhood by cluster sampling
        if (y_s ∈ D_c)  // v is the node storing s
            S_s = SampleNeighborhood(v, C);
        else
            S_s = ∅;
        S_s = S_s ∪ {s};
        y*_s = (1/|S_s|) Σ_{u ∈ S_s} y_u;
        // Estimating unknown parameters by sequential sampling
        if (y*_s ∈ D_c) {
            σ*²_c = σ*²_c + k (y*_c − y*_s)² / (k+1);
            y*_c = (k y*_c + y*_s) / (k+1);
            k = k + 1;
        } else {
            σ*²_c' = σ*²_c' + (n−k) (y*_c' − y*_s)² / (n−k+1);
            y*_c' = ((n−k) y*_c' + y*_s) / (n−k+1);
        }
        n = n + 1;
        Y_U = (k/n) y*_c + ((n−k)/n) y*_c';
    } while ( k/n < ((ε Y_U / l_p)² − (y*_c − y*_c')² − σ*²_c') / (σ*²_c − σ*²_c') );
    // Returning the final estimate of the query result
    return Y = Y_U;
end

Figure 4.9: Query Evaluation by Universal Sampling

4.4 Empirical Study

We conducted a set of experiments via simulation using real data to study the efficiency of query answering with Digest (Section 4.4.1) and the performance of our proposed sampling designs (Section 4.4.2). We implemented a multi-threaded simulator in C++ and used two Enterprise 250 Sun servers to perform these experiments.

4.4.1 Query Answering

4.4.1.1 Experimental Methodology

To study Digest empirically, we used two sets of real data: the TEMPERATURE dataset and the MEMORY dataset. The TEMPERATURE data are collected from a set of interconnected weather forecast stations from JPL/NASA, and the MEMORY data are collected from the nodes of the SETI@HOME peer-to-peer computing system. Each dataset consists of timestamped tuples with a single attribute (temperature and available memory space, respectively). Each tuple records the current value of the attribute at a particular time at a particular node. Whenever the value of the attribute is modified at a node (i.e., autonomously updated by the node, or inserted/deleted due to the node join/leave), a new tuple is appended to the dataset to record the modification. Tuples are collected from a large set of nodes over a specific duration. The nodes of the weather forecast network are almost stable, whereas the nodes of SETI@HOME join and leave the network more frequently. Table 4.3 lists the dataset parameters.

                              TEMPERATURE     MEMORY
    Number of Tuples          8640000         95445
    Number of Nodes           8000            1000
    Duration of Recording     18 months       1 hour
    Frequency of Updates      Twice per day   Continuous
    $\hat{\rho}$              0.89            0.68
    $\hat{\sigma}$            8               10

    Table 4.3: Parameters of the Datasets

We simulated the weather forecast network and the peer-to-peer computing network with two networks of the same respective sizes, with mesh and power-law topologies, respectively. Each node of the network represents a node of the real network and emulates the updates of the local attribute according to the recorded values in the corresponding dataset. The nodes of the two networks are Digest-enabled. To perform the experiments, we considered a continuous AVG query of the form:

    SELECT AVG(a) FROM R

where a is the recorded attribute. The duration of the continuous query is equal to the duration of recording for each dataset (see Table 4.3).
We picked random nodes from the networks to issue the queries, and combined the results of the queries to derive a statistically reliable estimation of the result, wherever applicable. As an aside, we should mention that, to expedite the experiments, we used the sampling operator $S$ in batch mode; i.e., to derive $n$ samples we invoke $S$ $n$ times simultaneously, which initiates $n$ random walks with overlapping convergence times. Also, once converged for the first time, to derive successive samples we continue the random walk from where it stops. In this case, the time to re-converge is reduced from the mixing time to the reset time, which is much shorter than the mixing time of the random walk.

4.4.1.2 Experimental Results

We studied the efficiency of Digest by considering the improvement due to the extrapolation algorithm and the repeated sampling algorithm, individually and combined.

4.4.1.2.1 Effect of the Extrapolation Algorithm

[Figure: number of snapshot queries versus normalized resolution $\delta/\hat{\sigma}$ for PRED-3, PRED-5, PRED-10, and ALL]
Figure 4.10: Effect of the Extrapolation Algorithm

Figure 4.10 illustrates the results of our experiment with the extrapolation algorithm. For this experiment, we used the TEMPERATURE dataset. In Figure 4.10, PRED-k denotes the extrapolation algorithm when $k$ previous values are used for prediction. The PRED-k algorithms are compared with the naive continuous querying algorithm (ALL), which executes snapshot queries at all time steps. The time step (i.e., the discrete-time unit) for executing snapshot queries is 12 hours, equal to the data update period of the TEMPERATURE dataset. With a fixed confidence (for the reported result, $\epsilon = 2$ and $p = 0.95$), we vary the required resolution $\delta$ of the query (in the figure, it is normalized to $\hat{\sigma}$), and observe the number of snapshot queries executed to maintain the resolution using the different algorithms.

As depicted in Figure 4.10, all of the extrapolation algorithms behave similarly. With small $\delta$ (relative to $\hat{\sigma}$), there are not many sampling occasions that an extrapolation algorithm can skip and, therefore, the performance is similar to ALL. However, with larger resolution thresholds, the extrapolation algorithms significantly outperform the naive continuous querying algorithm by gracefully eliminating the redundant snapshot queries according to the required resolution. For example, with $\delta = 8$ (i.e., $\delta/\hat{\sigma} = 1$), the number of snapshot queries executed to answer the query is reduced by up to 75%.

4.4.1.2.2 Effect of the Repeated Sampling Algorithm

[Figure: number of samples per snapshot query versus normalized confidence interval $\epsilon/\hat{\sigma}$ for TEMPERATURE-RPT, TEMPERATURE-INDEP, MEMORY-RPT, and MEMORY-INDEP]
Figure 4.11: Effect of the Repeated Sampling Algorithm

Figure 4.11 shows the results of our experiment with the repeated sampling algorithm (RPT) as compared with the independent sampling algorithm (INDEP). For this experiment, we used both of the datasets. Assuming a fixed resolution ($\delta/\hat{\sigma} = 1$, where $\hat{\sigma}$ is known for each dataset) and a fixed confidence level ($p = 0.95$), we vary the required confidence interval $\epsilon$ of the query, and observe the (average) number of samples required per snapshot query to satisfy the confidence requirement of the query using each algorithm.
Note that here, for RPT we report the total number of samples required per snapshot query, including both the retained and the fresh samples (for INDEP, all samples are fresh). This is to isolate and show the effect of considering the correlation in reducing the total number of required samples with RPT. In Section 4.4.1.2.3, we investigate another advantage of RPT due to the retained samples. Although the retained samples must be re-evaluated, they incur negligible communication cost to derive; therefore, with RPT only the fresh samples are actually costly to derive from the database (refer to Section 4.1.3.2.2).

As depicted in Figure 4.11, the behavior of INDEP and RPT follows our analytical results (Equation 4.8 and Equation 4.23, respectively), and RPT consistently outperforms INDEP with both datasets. From the experiments, we measure the average improvement factor $I = n_{indep}/n_{rpt}$ (where $n_{indep}$ and $n_{rpt}$ are the total numbers of samples per snapshot query for INDEP and RPT) as 1.63 and 1.21 for the TEMPERATURE and the MEMORY datasets, respectively, which translate to 39% and 18% fewer samples with RPT. As suggested by the results, the benefit of the repeated sampling algorithm is greater when applied to the TEMPERATURE dataset; this is expected because of the higher correlation (as indicated by the correlation coefficient $\hat{\rho}$) as well as the lower churn in the weather forecast network.

4.4.1.2.3 Overall Efficiency of Digest

To evaluate the overall efficiency of Digest due to the combined effect of the extrapolation algorithm and the repeated sampling algorithm, we measured the total number of samples required to answer a continuous query (for the reported result, $\delta/\hat{\sigma} = 1$, $\epsilon/\hat{\sigma} = 0.25$, and $p = 0.95$) using four different combinations of the algorithms: (ALL + INDEP), (ALL + RPT), (PRED3 + INDEP), and (PRED3 + RPT). We performed this experiment with both of the datasets.

[Figure: total number of samples (×10³) for ALL+INDEP, ALL+RPT, PRED3+INDEP, and PRED3+RPT, on the TEMPERATURE and MEMORY datasets]
Figure 4.12: Efficiency of Digest in Number of Samples

As shown in Figure 4.12, with the TEMPERATURE dataset, Digest (i.e., PRED3 + RPT) outperforms a naive solution (i.e., ALL + INDEP) by up to 320%. Similar results are obtained for other continuous queries with a full spectrum of different precision parameters; these results reveal a cost-precision trade-off that conforms with the results in Figures 4.10 and 4.11.

Thus far, we have used the total number of samples derived to answer the query as the measure of efficiency for the algorithms. Assuming a fixed (on average) communication cost for deriving each sample, this can be translated to the total communication cost (i.e., the total number of messages sent from node to node) for answering the query. To factor in and evaluate the communication cost of deriving each sample using our random sampling algorithm, we considered the performance of Digest (PRED3 + RPT) for the same query ($\delta/\hat{\sigma} = 1$, $\epsilon/\hat{\sigma} = 0.25$, $p = 0.95$), this time measuring the total communication cost as the measure of efficiency. Here, we compared Digest with the naive sample-based solution (ALL + INDEP) as well as the baseline non-sample-based solution (ALL + ALL), which at every snapshot query collects all tuples from the entire network to evaluate the query (only supporting exact queries).
As reported in Figure 4.13 (note that the vertical axis is in logarithmic scale), the sample-based solutions incur up to almost two orders of magnitude less communication cost as compared with the non-sample-based solution to evaluate this typical approximate continuous query. Also, comparing our results in Figure 4.13 versus those in Figure 4.12, we note that, as mentioned in Section 4.4.1.2.2, the improvement of Digest over the naive sample-based solution almost doubles in terms of the communication cost, which reflects the fact that with repeated sampling the cost of deriving the retained samples is negligible, and asymptotically only half of the required samples are fresh samples that are costly to derive from the database.

[Figure: total number of messages (×10⁶, logarithmic scale) for ALL+ALL, ALL+INDEP, and PRED3+RPT, on the TEMPERATURE and MEMORY datasets]
Figure 4.13: Efficiency of Digest in Communication Cost

Finally, based on our results, the average costs of deriving each sample are 65 and 43 messages for the simulated weather forecast network and the SETI@HOME network, respectively, loosely consistent with Theorem 4, which predicts a poly-logarithmic complexity for our random sampling operator.

4.4.2 Sampling Designs

4.4.2.1 Experimental Methodology

To study the sampling designs empirically, we used the same two sets of real data: the TEMPERATURE dataset and the MEMORY dataset. Table 4.4 lists more of the relevant parameters and characteristics of the datasets.

                                   TEMPERATURE   MEMORY
    Number of Tuples               8640000       95445
    Number of Nodes                8000          1000
    Range of $Y$ over Time         53°F–97°F     237MB–626MB
    Range of $\sigma$ over Time    8 to 10       12 to 18
    Data Distribution              Normal        Skewed

    Table 4.4: Parameters of the Datasets

To perform the experiments, we considered an AVG query of the form:

    SELECT AVG(a) FROM R WHERE P

where a is the recorded attribute, and P is (a < 75°F) for TEMPERATURE and (a > 400MB) for MEMORY. To derive each data point in the reported results, we issued 100 queries with the same constraints from a random node at a random time, and computed the average of the measurements among the queries to report statistically reliable values, wherever applicable.

4.4.2.2 Experimental Results

4.4.2.2.1 Query Answering with Unknown Parameters

Figure 4.14-a illustrates the results of our experiment with the double sampling (DS) and the sequential sampling (SS) designs for query answering. With this experiment, we compare the efficiency of these sampling designs with that of conventional sampling (CV). We measure the efficiency of the sampling designs in terms of the number of samples derived to answer a query. For this experiment, we use the TEMPERATURE dataset. While we allow DS and SS to independently derive the required estimation parameters from the TEMPERATURE dataset, we calculate and provide these parameters as input for CV. With a fixed confidence level (for the reported result, $p = 0.95$), we vary the required confidence interval $\epsilon$ of the query (in the figure, $\epsilon$ is normalized to the variance $\sigma^2$ of the data at the time the query is issued), and observe the number of samples derived to answer the query using the different sampling designs.

In Section 4.3.3 we mentioned that with any sampling design without a priori knowledge of the estimation parameters, the number of samples required to answer a query is inevitably more than the optimal number of samples with CV. As depicted in Figure 4.14-a, the efficiency of both DS and SS is comparable to that of CV.
[Figure: (a) number of samples per query versus $\epsilon/\sigma^2$ for DS, SS, and CV; (b) number of messages per query (×10³) versus $\sigma^{*2}/\sigma^2$ for USW, CV, CS, IS, and USWO]
Figure 4.14: Efficiency of the Digest Sampling Designs

Also note that SS consistently outperforms DS. We attribute this advantage to the fact that SS can avoid redundant pilot sampling.

4.4.2.2.2 Query Answering with Skewed Data Distribution

Figure 4.14-b shows the results of our experiment with the cluster sampling (CS), the inverse sampling (IS), and the universal sampling (US) designs. With this experiment, we compare the efficiency of these sampling designs for query answering with skewed data distribution. Although with a skewed distribution the result of the query with conventional sampling (CV) is biased, we can still compare the efficiency of our sampling designs with CV, assuming that CV guarantees the same confidence interval for the query result. In the previous section, we used the number of samples derived to answer the query as the measure to evaluate the efficiency of the sampling designs. In this section, to factor in and evaluate the performance of our random sampling operator $S$ as well, we measure the efficiency of the sampling designs based on the total communication cost (i.e., the total number of messages sent from node to node) for answering a query.

For this experiment we use the MEMORY dataset, which has a considerably skewed distribution (a power-law distribution $p_k \propto |k|^{-\eta}$ with skew factor $\eta = 2.4$, where $p_k$ is the frequency of the data value $k$ in the dataset). Also, the typical form of the condition $C$ that identifies the rare data values in the MEMORY dataset is $C = \{y \mid y > 2000MB\}$. We calculate and provide the relevant estimation parameters as input for CV, CS, and IS. For US, we consider two cases: universal sampling with parameter estimation (USW), and universal sampling without parameter estimation (USWO); USW independently derives the required estimation parameters from the dataset, while USWO receives them as input. In Figure 4.14-b, assuming a fixed confidence level and confidence interval ($p = 0.9$ and $\epsilon/\sigma^2 = 0.25$) for the queries, we depict the communication cost of answering a query versus the value $\sigma^{*2}$ of the "cluster data variance" at the time the query is issued (in the figure, $\sigma^{*2}$ is normalized to the data variance $\sigma^2$ at the time the query is issued). Recall from Section 4.3.4.1 that $\sigma^{*2}$ is an indicator of the clustering strength.

As illustrated in Figure 4.14-b, with skewed data distribution CS consistently outperforms CV, and its performance converges to that of CV as the data in the database is declustered (i.e., $\sigma^* \to \sigma$). Also, note that USWO, which benefits from both CS and IS, is more efficient than each of them individually. Finally, since USW is expected to derive the estimation parameters in addition to handling the skewness of the data distribution, it incurs more communication cost as compared to USWO. However, it is interesting to note that the efficiency of USW is still comparable to the optimal efficiency with CV.

4.5 Future Work

In this section, we presented the aggregate query answering component of Digest to answer fixed-precision approximate continuous aggregate queries in peer-to-peer databases.
With Digest, we introduced three algorithms, namely the extrapolation algorithm, the repeated sampling algorithm, and the distributed sampling algorithm. The first two algorithms enable Digest to answer continuous queries efficiently with guaranteed precision. The distributed sampling algorithm allows applying Digest to answer queries in distributed peer-to-peer databases. We demonstrated the efficiency of query answering with Digest both analytically and empirically. Particularly, we showed via simulation that Digest can improve the efficiency of query answering by up to 320% over that of a naive query answering scheme.

In addition, with Digest we proposed a collection of new sampling designs that allow answering fixed-precision approximate aggregate queries in real-world peer-to-peer databases without a priori knowledge about the characteristics of the data in the database. Particularly, we discussed double sampling and sequential sampling, which are not only self-sufficient in terms of the estimation parameters required for query answering, but also benefit from efficiency comparable to the optimal efficiency of conventional sampling. We also described cluster sampling and inverse sampling, which guarantee unbiased estimation of the query result even when the distribution of the data in the database is significantly skewed. We showed that our universal sampling design can overcome both of these problems simultaneously.

We intend to extend this study in three directions. First, with repeated sampling we plan to complement our reverse regression algorithm with forward regression, which allows adjusting the previous result. Second, we intend to expand on our contributions to cover more complex aggregate queries with multiple relations:

    SELECT op(expression) FROM R_1, R_2, ..., R_z WHERE predicate

Finally, with peer-to-peer databases where the time-scale of the data changes is comparable with the sampling time, our snapshot sampling assumption no longer holds. With such peer-to-peer databases, either the sampling techniques should be improved or new semantics should be defined for continuous queries.

References

[ADGK06] B. Arai, G. Das, D. Gunopulos, and V. Kalogeraki. Approximating aggregation queries in peer-to-peer networks. In Proceedings of the 22nd International Conference on Data Engineering (ICDE), April 2006.
[AGP00] S. Acharya, P. Gibbons, and V. Poosala. Congressional samples for approximate answering of group-by queries. In Proceedings of the International Conference on Management of Data (SIGMOD), May 2000.
[ALPH01] L. Adamic, R.M. Lukose, A.R. Puniyani, and B.A. Huberman. Search in power-law networks. Physical Review E, 64(046135), 2001.
[BA99] A.L. Barabasi and R. Albert. Emergence of scaling in random networks. Science, 286:509–512, 1999.
[BBC04] B. Bash, J. Byers, and J. Considine. Approximately uniform random sampling in sensor networks. In Proceedings of the 1st International Workshop on Data Management for Sensor Networks (DMSN), in conjunction with VLDB, August 2004.
[BDW88] P. Buneman, S.B. Davidson, and A. Watters. A semantics for complex objects and approximate queries. In Proceedings of the 7th Symposium on Principles of Database Systems (PODS'88), April 1988.
[BDX04] S. Boyd, P. Diaconis, and L. Xiao. Fastest mixing Markov chain on a graph. SIAM Review, 46(4):667–689, 2004.
[BGMGM] M. Bawa, H. Garcia-Molina, A. Gionis, and R. Motwani. Estimating aggregates on a peer-to-peer network. Submitted for publication.
[BGPS05] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Gossip algorithms: Design, analysis and applications. In Proceedings of IEEE INFOCOM, March 2005.
[BGS03] A. Boulis, S. Ganeriwal, and M. Srivastava. Aggregation in sensor networks: An energy-accuracy trade-off. In The First IEEE International Workshop on Sensor Network Protocols and Applications (SNPA'03), May 2003.
[BKSa] F. Banaei-Kashani and C. Shahabi. Fixed-precision approximate continuous aggregate queries in peer-to-peer databases. Submitted for publication.
[BKSb] F. Banaei-Kashani and C. Shahabi. Overcoming limitations of sample-based aggregation in peer-to-peer databases. Submitted for publication.
[BKS03a] F. Banaei-Kashani and C. Shahabi. Criticality-based analysis and design of unstructured peer-to-peer networks as complex systems. In Proceedings of the Third International Workshop on Global and Peer-to-Peer Computing (GP2PC), in conjunction with CCGrid'03, May 2003.
[BKS03b] F. Banaei-Kashani and C. Shahabi. Efficient flooding in power-law networks. In Proceedings of the Twenty-Second ACM Symposium on Principles of Distributed Computing (PODC'03), July 2003.
[BKS03c] F. Banaei-Kashani and C. Shahabi. Searchable querical data networks. In Proceedings of the International Workshop on Databases, Information Systems and Peer-to-Peer Computing, in conjunction with VLDB'03, September 2003.
[BKS04] F. Banaei-Kashani and C. Shahabi. SWAM: A family of access methods for similarity-search in peer-to-peer data networks. In Proceedings of the Thirteenth Conference on Information and Knowledge Management (CIKM'04), November 2004.
[BKS06] F. Banaei-Kashani and C. Shahabi. Partial selection query in peer-to-peer databases. In Proceedings of the 22nd International Conference on Data Engineering (ICDE'06), April 2006.
[Bol85] B. Bollobas. Random Graphs. Academic Press, New York, 1985.
[Bri95] S. Brin. Near neighbor search in large metric spaces. In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB'95), September 1995.
[BW01] S. Babu and J. Widom. Continuous queries over data streams. SIGMOD Record, September 2001.
[CDTW00] J. Chen, D.J. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: A scalable continuous query system for Internet databases. In ACM International Conference on Management of Data (SIGMOD'00), May 2000.
[CGM02] A. Crespo and H. Garcia-Molina. Routing indices for peer-to-peer systems. In Proceedings of the 22nd International Conference on Distributed Computing Systems (ICDCS), July 2002.
[CK04] E. Cohen and H. Kaplan. Spatially-decaying aggregation over a network: model and algorithms. In Proceedings of the International Conference on Management of Data (SIGMOD), June 2004.
[CLKB04] J. Considine, F. Li, G. Kollios, and J. Byers. Approximate aggregation techniques for sensor databases. In Proceedings of the 20th International Conference on Data Engineering (ICDE), March 2004.
[CMN99] S. Chaudhuri, R. Motwani, and V. Narasayya. On random sampling over joins. In ACM International Conference on Management of Data (SIGMOD'99), June 1999.
[CNBYM01] E. Chavez, G. Navarro, R. Baeza-Yates, and J. Marroquin. Searching in metric spaces. ACM Computing Surveys, 33(3), 2001.
[Coc77] W.G. Cochran. Sampling Techniques. John Wiley and Sons, 3rd edition, 1977.
[Cox52] D. Cox. Estimation by double sampling. Biometrika, 39(3–4):217–227, 1952.
[CPX05] J. Chen, G. Pandurangan, and D. Xu. Robust computation of aggregates in wireless sensor networks: distributed randomized algorithms and analysis. In Proceedings of the Fourth International Symposium on Information Processing in Sensor Networks (IPSN'05), April 2005.
[CR65] Y. Chow and H. Robbins. On the asymptotic theory of fixed-width sequential confidence intervals for the mean. The Annals of Mathematical Statistics, 36(2):457–462, 1965.
[CR94] C. Chen and N. Roussopoulos. Adaptive selectivity estimation using query feedback. In Proceedings of the International Conference on Management of Data (SIGMOD), May 1994.
[DDH01] A. Doan, P. Domingos, and A. Halevy. Reconciling schemas of disparate data sources: A machine-learning approach. In Proceedings of the ACM International Conference on Management of Data (SIGMOD'01), November 2001.
[DGH+87] A. Demers, D. Greene, C. Hauser, W. Irish, and J. Larson. Epidemic algorithms for replicated database maintenance. In Proceedings of the Sixth Annual ACM Symposium on Principles of Distributed Computing (PODC'87), August 1987.
[DGM+05] A. Deshpande, C. Guestrin, S. Madden, J. Hellerstein, and W. Hong. Model-based approximate querying in sensor networks. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB), September 2005.
[DIR00] D. Donjerkovic, Y. Ioannidis, and R. Ramakrishnan. Dynamic histograms: Capturing evolving data sets. In Proceedings of the 16th International Conference on Data Engineering (ICDE), February 2000.
[DS91] P. Diaconis and D. Stroock. Geometric bounds for eigenvalues of Markov chains. Annals of Applied Probability, 1(1):36–61, 1991.
[DSC95] P. Diaconis and L. Saloff-Coste. What do we know about the Metropolis algorithm. In Proceedings of the 27th Symposium on Theory of Computing (STOC), May 1995.
[EGH+03] P. Eugster, R. Guerraoui, S. Handurukande, P. Kouznetsov, and A. Kermarrec. Lightweight probabilistic broadcast. ACM Transactions on Computer Systems (TOCS), 21(4):341–374, November 2003.
[Fel57] W. Feller. An Introduction to Probability Theory and Its Applications. John Wiley and Sons, second edition, 1957.
[GD] L. Galanis and D.J. DeWitt. Scalable distributed aggregate computations through collaboration in peer-to-peer systems. Submitted for publication.
[GDS+03] K.P. Gummadi, R.J. Dunn, S. Saroiu, S.D. Gribble, H.M. Levy, and J. Zahorjan. Measurement, modeling, and analysis of a peer-to-peer file-sharing workload. In Proceedings of the 19th Symposium on Operating Systems Principles (SOSP), October 2003.
[GKM01] A. Ganesh, A. Kermarrec, and L. Massoulie. SCAMP: Peer-to-peer lightweight membership service for large-scale group communication. In Proceedings of the 3rd International Workshop on Networked Group Communication (NGC'01), November 2001.
[GKW+02] D. Ganesan, B. Krishnamachari, A. Woo, D. Culler, D. Estrin, and S.B. Wicker. An empirical study of epidemic algorithms in large scale multihop wireless networks. Technical Report CSD-TR 02-0013, UCLA, 2002.
[GLR00] V. Ganti, M. Lee, and R. Ramakrishnan. Icicles: Self-tuning samples for approximate query answering. In Proceedings of the International Conference on Management of Data (SIGMOD), May 2000.
[GM95] A. Gupta and I.S. Mumick. Maintenance of materialized views: Problems, techniques, and applications. IEEE Data Engineering Bulletin, 18(2):3–18, 1995.
[GM98] P. Gibbons and Y. Matias. New sampling-based summary statistics for improving approximate query answers. In Proceedings of the International Conference on Management of Data (SIGMOD), June 1998.
[GMS04] C. Gkantsidis, M. Mihail, and A. Saberi. Random walks in peer-to-peer networks. In Proceedings of the IEEE Conference on Computer Communications (INFOCOM), March 2004.
[GMT05] A. Ganesh, L. Massoulie, and D. Towsley. The effect of network topology on the spread of epidemics. In Proceedings of IEEE INFOCOM, March 2005.
[Hal45] J. Haldane. On a method of estimating frequencies. Biometrika, 33(3):222–225, 1945.
[Het00] H. Hethcote. The mathematics of infectious diseases. SIAM Review, 42(4):599–653, October 2000.
[HH43] M. Hansen and W. Hurwitz. On the theory of sampling from finite populations. The Annals of Mathematical Statistics, 14(4):333–362, 1943.
[HHW97] J. Hellerstein, P. Haas, and H. Wang. Online aggregation. In ACM International Conference on Management of Data (SIGMOD'97), May 1997.
[HKMP96] J. Hromkovic, R. Klasing, B. Monien, and R. Peine. Dissemination of information in interconnection networks (broadcasting and gossiping). Combinatorial Network Theory, pages 125–212, 1996.
[HLL+03] R. Huebsch, N. Lanham, B.T. Loo, J.M. Hellerstein, S. Shenker, and I. Stoica. Querying the Internet with PIER. In Proceedings of the 29th International Conference on Very Large Data Bases (VLDB), September 2003.
[HM03] K. Horowitz and D. Malkhi. Estimating network size from local information. Information Processing Letters, 88(5):237–243, December 2003.
[HO91] W. Hou and G. Ozsoyoglu. Statistical estimators for aggregate relational algebra queries. ACM Transactions on Database Systems (TODS), 16(4):600–654, December 1991.
[HOD91] W. Hou, G. Ozsoyoglu, and E. Dogdu. Error-constrained count query evaluation in relational databases. In Proceedings of the International Conference on Management of Data (SIGMOD), May 1991.
[Hou93] W. Hou. Processing time-constrained aggregate queries in CASE-DB. ACM Transactions on Database Systems (TODS), 18(2):224–261, June 1993.
[HS92] P. Haas and A. Swami. Sequential sampling procedures for query size estimation. In Proceedings of the International Conference on Management of Data (SIGMOD), June 1992.
[IP99] Y.E. Ioannidis and V. Poosala. Histogram-based approximation of set-valued query answers. In Proceedings of the 25th International Conference on Very Large Data Bases (VLDB'99), September 1999.
[Jov01] M. Jovanovic. Modeling large-scale peer-to-peer networks and a case study of Gnutella. Master's thesis, University of Cincinnati, 2001.
[KDG03] D. Kempe, A. Dobra, and J. Gehrke. Gossip-based computation of aggregate information. In Proceedings of the Annual Symposium on Foundations of Computer Science (FOCS), 2003.
[Kle00] J. Kleinberg. The small-world phenomenon: an algorithmic perspective. In Proceedings of the 32nd ACM Symposium on Theory of Computing (STOC'00), pages 163–170, May 2000.
[KRA+03] D. Kostic, A. Rodriguez, J. Albrecht, A. Bhirud, and A. Vahdat. Using random subsets to build scalable network services. In Proceedings of the 4th USENIX Symposium on Internet Technologies and Systems (USITS), March 2003.
[KS04] V. King and J. Saia. Choosing a random peer. In Proceedings of the 23rd Annual Symposium on Principles of Distributed Computing (PODC), July 2004.
[KSSV00] R. Karp, C. Schindelhauer, S. Shenker, and B. Vöcking. Randomized rumor spreading. In Proceedings of the Annual Symposium on Foundations of Computer Science (FOCS), 2000.
[LCC+02] Q. Lv, P. Cao, E. Cohen, K. Li, and S. Shenker. Search and replication in unstructured peer-to-peer networks. In Proceedings of the 16th International Conference on Supercomputing (ICS'02), June 2002.
[LHH02] L. Li, J. Halpern, and Z. Haas. Gossip-based ad hoc routing. In Proceedings of IEEE INFOCOM, 2002.
[LHH+04] B. Loo, J.M. Hellerstein, R. Huebsch, S. Shenker, and I. Stoica. Enhancing P2P file-sharing with an Internet-scale query processor. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB'04), September 2004.
[Lim05] Limewire.com. Gnutella, 2005. http://www.limewire.com/.
[LN89] R. Lipton and J. Naughton. Estimating the size of generalized transitive closures. In Proceedings of the 15th International Conference on Very Large Data Bases (VLDB'89), August 1989.
[LNS90] R. Lipton, J. Naughton, and D. Schneider. Practical selectivity estimation through adaptive sampling. In Proceedings of the International Conference on Management of Data (SIGMOD), May 1990.
[LPBZ96] L. Liu, C. Pu, R. Barga, and T. Zhou. Differential evaluation of continual queries. In Proceedings of the 16th International Conference on Distributed Computing Systems (ICDCS), May 1996.
[LPT99] L. Liu, C. Pu, and W. Tang. Continual queries for Internet scale event-driven information delivery. IEEE Transactions on Knowledge and Data Engineering, 11(4):610–628, 1999.
[LRS02] Q. Lv, S. Ratnasamy, and S. Shenker. Can heterogeneity make Gnutella scalable? In Proceedings of the 1st International Workshop on Peer-to-Peer Systems (IPTPS'02), 2002.
[Man03] G. Manku. Routing networks for distributed hash tables. In Proceedings of the 22nd Annual Symposium on Principles of Distributed Computing (PODC'03), July 2003.
[MBR03] G. Manku, M. Bawa, and P. Raghavan. Symphony: Distributed hashing in a small world. In Proceedings of the 4th USENIX Symposium on Internet Technologies and Systems (USITS'03), March 2003.
[MFHH02] S.R. Madden, M.J. Franklin, J.M. Hellerstein, and W. Hong. TAG: a Tiny AGgregation service for ad-hoc sensor networks. In Proceedings of the 5th Annual Symposium on Operating Systems Design and Implementation (OSDI), December 2002.
[Mot84] A. Motro. Query generalization: a method for interpreting null answers. In Proceedings of the First Workshop on Expert Database Systems, October 1984.
[MR95] M. Molloy and B. Reed. A critical point for random graphs with a given degree sequence. Random Structures and Algorithms, 6:161–180, 1995.
[MRR+53] N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller. Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21:1087–1091, 1953.
[NGSA04] S. Nath, P. Gibbons, S. Seshan, and Z. Anderson. Synopsis diffusion for robust aggregation in sensor networks. In Proceedings of the 2nd International Conference on Embedded Networked Sensor Systems (SenSys'04), November 2004.
[NSW01] M.E.J. Newman, S.H. Strogatz, and D.J. Watts. Random graphs with arbitrary degree distribution and their applications. Physical Review E, 64(026118), 2001.
[OBSC00] A. Okabe, B. Boots, K. Sugihara, and S. Chiu. Spatial Tessellations: Concepts and Applications of Voronoi Diagrams. John Wiley, 2nd edition, 2000.
[ODGH92] G. Ozsoyoglu, K. Du, S. Guruswamy, and W. Hou. Processing real-time, non-aggregate queries with time-constraints in CASE-DB. In Proceedings of the 8th International Conference on Data Engineering (ICDE'92), February 1992.
[Olk93] F. Olken. Random Sampling from Databases. PhD thesis, University of California, Berkeley, 1993.
[OR89] F. Olken and D. Rotem. Random sampling from B+ trees. In Proceedings of the 15th International Conference on Very Large Data Bases (VLDB'89), August 1989.
[ORX90] F. Olken, D. Rotem, and P. Xu. Random sampling from hash files. In ACM International Conference on Management of Data (SIGMOD'90), May 1990.
[RD01] A. Rowstron and P. Druschel. Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In Proceedings of the ACM International Conference on Distributed Systems Platforms (Middleware'01), pages 329–350, November 2001.
[RFH+01] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. A scalable content-addressable network. In Proceedings of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM'01), August 2001.
[Rip01] M. Ripeanu. Peer-to-peer architecture case study: Gnutella network. In Proceedings of the First International Conference on Peer-to-Peer Computing (P2P'01), August 2001.
[SA92] D. Stauffer and A. Aharony. Introduction to Percolation Theory. Taylor and Francis, second edition, 1992.
[SBK] C. Shahabi and F. Banaei-Kashani. Modelling peer-to-peer data networks under complex system theory. International Journal of Computational Science and Engineering, accepted for publication.
[SGG02] S. Saroiu, P.K. Gummadi, and S.D. Gribble. A measurement study of peer-to-peer file sharing systems. In Proceedings of Multimedia Computing and Networking (MMCN), January 2002.
[Sha05] Sharman Networks. Kazaa, 2005. http://www.kazaa.com/.
[She04a] K. Shen. Structure management for scalable overlay service construction. In Symposium on Networked Systems Design and Implementation (NSDI'04), March 2004.
[She04b] S. Shenker. Structured versus unstructured peer-to-peer networks. USC Distinguished Lecture Series, February 2004.
[SMK+01] I. Stoica, R. Morris, D. Karger, M.F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for Internet applications. In Proceedings of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM'01), August 2001.
[STD+00] J. Shanmugasundaram, K. Tufte, D.J. DeWitt, J.F. Naughton, and D. Maier. Architecting a network query engine for producing partial results. In Third International Workshop on the Web and Databases, in conjunction with SIGMOD'00, May 2000.
[TGNO92] D. Terry, D. Goldberg, D. Nichols, and B. Oki. Continuous queries over append-only databases. In ACM International Conference on Management of Data (SIGMOD'92), June 1992.
[Tie94] L. Tierney. Markov chains for exploring posterior distributions. The Annals of Statistics, 22(4):1701–1762, 1994.
[TS96] S. Thompson and G. Seber. Adaptive Sampling. John Wiley and Sons, 1996.
[VL93] S.V. Vrbsky and J.W.S. Liu. APPROXIMATE: a query processor that produces monotonically improving approximate answers. IEEE Transactions on Knowledge and Data Engineering (TKDE), 5(6):1056–1068, December 1993.
[Wil94] H.S. Wilf. generatingfunctionology. Academic Press, second edition, 1994.
[WS98] D.J. Watts and S.H. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393(6684):440–442, 1998.
[YGM03] B. Yang and H. Garcia-Molina. Designing a super-peer network. In Proceedings of the 19th International Conference on Data Engineering (ICDE'03), March 2003.
[ZGE03] J. Zhao, R. Govindan, and D. Estrin. Computing aggregates for monitoring wireless sensor networks. In The First IEEE International Workshop on Sensor Network Protocols and Applications (SNPA'03), May 2003.
[ZSS05] M. Zhong, K. Shen, and J. Seiferas. Non-uniform random membership management in peer-to-peer networks. In Proceedings of IEEE INFOCOM, 2005.
Chapter 5

Similarity Search in Structured Peer-to-Peer Databases

We formalize the problem of similarity search in structured/indexable peer-to-peer databases, and propose a family of distributed access methods, termed Small-World Access Methods (SWAM), for efficient execution of various similarity search queries, namely exact-match, range, and k-nearest-neighbor queries. Unlike LH* and DHTs, SWAM does not control the assignment of data objects to the network nodes; each node autonomously stores its own data. Besides, SWAM supports all similarity search queries on multiple attributes. SWAM guarantees that the query object will be found (if it exists in the network) in average time logarithmically proportional to the network size. Moreover, once the query object is found, all the similar objects are in its proximate network neighborhood, hence enabling efficient range and k-nearest-neighbor queries. As a specific instance of SWAM, we propose SWAM-V, a Voronoi-based SWAM that indexes peer-to-peer databases with multi-attribute data objects. For a peer-to-peer database with $N$ nodes, SWAM-V has query time, communication cost, and computation cost of $O(\log N)$ for exact-match queries, and $O(\log N + sN)$ and $O(\log N + k)$ for range queries (with selectivity $s$) and kNN queries, respectively. Our experiments show that SWAM-V consistently outperforms a similarity-search-enabled version of CAN in query time and communication cost by a factor of 2 to 3. Here, we omit the details of our analytical and experimental results; for more details, refer to [BKS03c, BKS04, SBK].

A Formal Definition of the Problem

A.1 Data and Query Model

We assume a relational data model for the content of the indexable peer-to-peer databases. A set of (possibly duplicate) tuples with the same schema are distributed among the nodes of the peer-to-peer database (for multi-schema peer-to-peer databases, we rely on schema reconciliation techniques such as that of Doan et al. [DDH01]). Tuples are uniquely identified by a set of $d$ attributes, the key of the schema. Hereafter, we use the terms tuple and key interchangeably wherever the meaning is clear. A similarity query is originated at a peer-to-peer database node and is answered by locating at least one replica of all the tuple(s) with key similar to the query key. A peer-to-peer database access method is a mechanism that defines 1) how to organize the peer-to-peer database topology (interconnection) into an index-like structure, and 2) how to use the index structure to process the similarity queries. We are interested in access methods for the efficient processing of similarity queries in indexable peer-to-peer databases.

We model the peer-to-peer database key space as a Hilbert space $(V, L_p)$. $V = V_1 \times V_2 \times \dots \times V_d$ is a $d$-dimensional vector space, where $V_i$, the domain of the attribute $a_i$ for the key $\vec{k} = \langle a_1, a_2, \dots, a_d \rangle$ in $V$, is a contiguous and finite interval of $\mathbb{R}$. The $L_p$ norm with $p \in \mathbb{Z}^+$ is the distance function to measure the dissimilarity (or, equivalently, similarity) between two keys $\vec{k}_1$ and $\vec{k}_2$ as $L_p(\vec{k}_1 - \vec{k}_2)$, where

$$L_p(\vec{x}) = \Big( \sum_{i=1}^{d} |x_i|^p \Big)^{1/p}$$

We are interested in content-based access methods, i.e., access methods that organize the peer-to-peer database topology based on the content of the peer-to-peer database nodes. In general, each peer-to-peer database node may include more than one tuple.
For better explanation of our content-based access methods, without loss of generality, we find it convenient to assume a peer-to-peer database model where each node stores one and only one tuple. To justify this assumption, here we show how to reduce the general peer-to-peer database model to our assumed peer-to-peer database model.

[Figure: the three-step reduction of the general peer-to-peer database model (Steps I, II, and III)]
Figure 5.1: Reducing the General Peer-to-Peer Database Model

Consider $K$ as the set of keys (tuples) available in the peer-to-peer database and $N$ as the set of peer-to-peer database nodes. Assuming a general peer-to-peer database model, we define a one-to-many mapping $M : N \to K$ that maps each peer-to-peer database node to the set of keys stored at the node¹ (Figure 5.1, Step I). Each key is considered as a virtual node embedded in $V$. Note that since tuples are replicated, there might be several virtual nodes with the same key. A content-based access method defines how to organize the set of virtual nodes corresponding to all nodes in $N$ into a virtual peer-to-peer database with a particular topology, and how to process the queries in the virtual peer-to-peer database (Figure 5.1, Step II). Finally, the topology of the actual peer-to-peer database is deduced by inverse mapping from the topology of the virtual peer-to-peer database: a peer-to-peer database node $n$ is connected to a node $m$ if and only if in the virtual peer-to-peer database some virtual node in $M(n)$ is connected to some other virtual node in $M(m)$ (Figure 5.1, Step III). Also, the semantics of query processing at the actual peer-to-peer database nodes is defined by the query processing semantics at the corresponding virtual nodes, such that the flow of the query in the actual peer-to-peer database is logically identical to that of the virtual peer-to-peer database. With this approach, the mapping and inverse mapping steps (Steps I and III) are independent of the access method used in Step II, and each access method for virtual peer-to-peer databases (which are peer-to-peer databases with only one tuple per node) defines an access method with similar characteristics for general peer-to-peer databases. Hereafter, we assume the reduced model for peer-to-peer databases and characterize the primitives of an access method to construct the topology/index and process the queries in such a peer-to-peer database.

¹ Depending on the peer-to-peer database application, if some of the data objects within a node are closely similar, then alternatively $M$ can map a node to the centroid of the similar objects. Without loss of generality, we focus on the general case where the objects within a node are not closely similar.

The topology of a peer-to-peer database can be modelled as a directed graph $G(N, E)$, where the edge $e(n, m) \in E$ represents an asymmetric neighborhood relationship in which node $m$ is a neighbor of node $n$. Schematically, we depict this relationship by drawing an arrow from node $n$ to node $m$. $A(n)$ is the set of neighbors of the node $n$. To achieve scalability, a node only maintains a limited amount of information about its neighbors, which includes the keys of the tuples maintained at the neighbors and the physical addresses of the neighbors. A node can directly communicate with its neighbors.
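To make the limited per-node state concrete, the following is a minimal C++ sketch of the information a node maintains under the reduced model. All type and field names here are our own illustrative choices, not identifiers from an actual SWAM implementation.

// Minimal sketch of the per-node state described above: each node stores its
// own key (one tuple in the reduced model) and, for scalability, only the
// keys and physical addresses of its immediate neighbors A(n).
#include <string>
#include <vector>

using Key = std::vector<double>; // a point in the d-dimensional key space V

struct NeighborInfo {
    Key key;             // key of the tuple stored at the neighbor
    std::string address; // physical (e.g., network) address of the neighbor
};

struct Node {
    Key key;                             // key of the locally stored tuple
    std::vector<NeighborInfo> neighbors; // A(n): the neighborhood of the node
};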
To construct the peer-to-peer database index, an access method defines the join primitive² (similar to the insert operation of traditional database access methods), which is used by a new node $n$ to delineate $A(n)$ as it joins the existing peer-to-peer database. We assume that at least the physical address of one node in the existing peer-to-peer database (if any) is available to $n$ as it joins the peer-to-peer database. As new nodes join the peer-to-peer database, its topology incrementally converges to the intended index structure. Similarly, an access method defines the leave operation (equivalent to the delete primitive of traditional access methods).

² This join is different from the join operation in the relational algebra.

We are interested in the following types of similarity queries:

- Exact-Match Query: Given the query key $\vec{q}$, return the tuple $t$ with key $\vec{k}$ such that $\vec{k} = \vec{q}$.
- Range Query: Given the query key $\vec{q}$ and the range $r$, return all tuples $t$ with key $\vec{k}$ such that $L_p(\vec{k} - \vec{q}) \le r$.
- k-Nearest-Neighbor (kNN) Query: Given the query key $\vec{q}$ and the number $k$, return the $k$-ary $(t_1, t_2, \dots, t_k)$ such that $\vec{k}_i$, the key of $t_i$, is the $i$-th nearest neighbor of the key $\vec{q}$.

A similarity query can originate from any peer-to-peer database node at the $T_0$-th time slot ($\forall T_0 \in \mathbb{Z}$), assuming a discrete wall-clock time with a fixed time unit. A node that originates a query, or receives the query from other nodes, at the $(T_0+i)$-th time slot ($\forall i \in \mathbb{Z}^+ \cup \{0\}$) can process the query locally and/or forward zero or more processed replicas of the query to its immediate neighbors at the $(T_0+i+1)$-th time slot. The collective processing of the query by the peer-to-peer database nodes is completed when all expected tuples in the relevant result set of the query are visited by at least one of the replicas of the query. Besides the join and leave primitives, an access method defines the forward primitive for query processing based on the constructed peer-to-peer database index. The forward primitive can only use the information at the local node to process the query and to make forwarding decisions. During query processing, the $L_p$ distance between the query key $\vec{q}$ and the local key is computed to verify whether the local tuple satisfies the query condition. Also, with content-based access methods the forward primitive may measure the $L_p$ distances between the query key $\vec{q}$ and the neighbor keys to guide the query.

A.2 Efficiency Measures for Peer-to-Peer Database Access Methods

An access method can be evaluated based on its construction cost, and/or based on its query processing cost and performance. Unless the set of nodes participating in the peer-to-peer database is extremely dynamic, the computation (CPU time) and communication costs of constructing and maintaining the index structure are negligible as compared to those of query processing. We define three metrics to measure the efficiency of a peer-to-peer database access method for query processing. The first two metrics evaluate the cost of query processing in terms of the required system resources, whereas the last one measures the system performance from the user perspective:

1. Communication cost ($C_1$): Average number of query replicas forwarded to complete the processing of a query.
2. Computation cost ($C_2$): Average number of $L_p$ distance computations to complete the processing of a query.
3. Query time ($T$): Average response-time of a query. If processing of a query starts at time slot $T_0$ and completes at time slot $T_1$, the response-time of the query is equal to $T_1 - T_0$.
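Of these metrics, $C_2$ counts evaluations of the $L_p$ distance of Section A.1. The following minimal, self-contained C++ sketch shows that computation together with the per-tuple exact-match and range checks performed during query processing; the types, names, and example values are our own illustrative choices.

// Minimal sketch of the local computation performed during query processing:
// the L_p distance, used to test whether the local tuple satisfies an
// exact-match or range query.
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

using Key = std::vector<double>; // a point in the d-dimensional key space V

// L_p(x) = (sum_{i=1}^{d} |x_i|^p)^(1/p), applied to the key difference.
double lp_distance(const Key& a, const Key& b, int p) {
    double sum = 0.0;
    for (size_t i = 0; i < a.size(); ++i)
        sum += std::pow(std::fabs(a[i] - b[i]), p);
    return std::pow(sum, 1.0 / p);
}

bool matches_exact(const Key& local, const Key& q, int p) {
    return lp_distance(local, q, p) == 0.0; // k = q (exact compare; a sketch)
}

bool matches_range(const Key& local, const Key& q, double r, int p) {
    return lp_distance(local, q, p) <= r;   // L_p(k - q) <= r
}

int main() {
    Key local = {3.0, 4.0}, q = {0.0, 0.0};
    std::cout << "L_2 distance: " << lp_distance(local, q, 2) << "\n"; // 5
    std::cout << "in range r = 6: " << matches_range(local, q, 6.0, 2) << "\n";
}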
B SWAM: Small-World Access Methods

We define a family of efficient access methods for peer-to-peer databases, termed Small-World Access Methods (SWAM), designed based on principles borrowed from small-world models. Here, after a general overview of the useful properties of the small-world model, we define the SWAM family and characterize its properties. Also, as an example we introduce SWAM-V, a Voronoi-based instance of SWAM, which satisfies the SWAM properties and achieves query time, communication cost, and computation cost logarithmic in the size of the network for all types of similarity queries.

[Figure: (a) a hybrid small-world graph; (b) a small-world graph as a peer-to-peer database index over a 5×5 integer key space]
Figure 5.2: Small-World Model

B.1 Small-World as an Index Structure

The small-world model is a network topology proposed to explain the small-world phenomenon, the fact that two individuals in a social network can efficiently locate each other through a short chain of acquaintances logarithmic in the size of the network [WS98, Kle00]. The small-world graph is a hybrid graph, a superimposition of a regular grid and a dilute random graph ($p \ll 1$), inheriting both of their properties (see Figure 5.2-a). It inherits the average node-to-node path length $O(\log|N|)$ from the random graph component, and the high clustering property from the grid. A graph is clustered if the neighbors of a node are more probably neighbors of each other than of the other nodes in the network. For a node $n$, clustering is measured by the clustering coefficient $C(n)$, which is the realized fraction of all possible edges among the neighbors of $n$:

$$C(n) = l \Big/ \binom{|A(n)|}{2} \quad (5.1)$$

where $l$ is the number of existing edges among the neighbors of $n$. The clustering coefficient of a graph is the average of the clustering coefficients of its nodes. For a complete graph, a grid, and a dilute random graph $G_{N,p}$, the clustering coefficients are 1, $\approx 3/4$, and $p \ll 1$, respectively.

To demonstrate a direct application of the small-world graph as an index structure for a peer-to-peer database, we consider the following simple peer-to-peer database. Assume the key space $V$ is a subspace of $\mathbb{Z}^d$ rather than $\mathbb{R}^d$, and also assume all possible keys in $V$ are available within the peer-to-peer database, one key per peer-to-peer database node. We can organize the topology of this peer-to-peer database based on a small-world graph with a $d$-dimensional underlying grid as follows:

1. Grid component: The node storing the key $\vec{k} = \langle a_1, a_2, \dots, a_d \rangle$ is a neighbor of all nodes with keys $\vec{k}'$ where $L_p(\vec{k} - \vec{k}') \le b$ ($b \in \mathbb{Z}^+$); and
2. Random graph component: The node $n_k$ storing the key $\vec{k} = \langle a_1, a_2, \dots, a_d \rangle$ is a neighbor of one other node $n_{k'}$ with key $\vec{k}'$ selected probabilistically such that if $L_p(\vec{k} - \vec{k}') = x$, the probability of selecting $n_{k'}$ as the neighbor of $n_k$ is proportional to $x^{-d}$ (i.e., a power-law distribution).

See Figure 5.2-b for an example with a 2-dimensional key space, $L_1$ as the distance measure, and neighborhood boundary parameter $b = 1$. Kleinberg [Kle00] showed that with a greedy forwarding primitive, on average an exact-match query is resolved with $T$, $C_1$, and $C_2$ all in $O(\log|N|)$. With greedy forwarding, node $n$ forwards a query $\vec{q}$ only to the one of its neighbors with key $\vec{k}$ such that $L_p(\vec{k} - \vec{q})$ is minimum among all neighbors in $A(n)$; i.e., the neighbor with the key most similar to the query key $\vec{q}$ is selected to receive the query. It is easy to see that the underlying grid topology ensures that when a node with key $\vec{k}$ receives a query $\vec{q}$, either $\vec{k} = \vec{q}$ or the node has at least one neighbor with a key $\vec{k}'$ such that $L_p(\vec{k}' - \vec{q}) < L_p(\vec{k} - \vec{q})$. Therefore, along the forwarding path of the query, the distance between the key at the current node and the target key $\vec{q}$ monotonically decreases as the query is forwarded. Besides, the probabilistically selected neighbors act as long jumps that ensure an exponential decrease of this distance on average. Thus, the average forwarding path length is logarithmic in the size of the network.
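This construction is easy to exercise in simulation. The following minimal C++ sketch builds a one-dimensional instance ($d = 1$, $b = 1$, so long-link distances $x$ are drawn with probability proportional to $1/x$) and routes a query with the greedy forwarding rule. The ring size, random seed, and the exact normalization of the long-link distribution are illustrative assumptions, not values from our analysis or experiments.

// Minimal one-dimensional small-world simulation: a ring (grid component,
// b = 1) plus one long link per node drawn with Pr ~ 1/x (the x^(-d)
// distribution for d = 1), routed by greedy forwarding toward the target.
#include <cmath>
#include <cstdlib>
#include <iostream>
#include <random>
#include <vector>

int main() {
    const int N = 1024;
    std::mt19937 gen(7);

    // Long link of node i: choose a jump length x in [1, N/2] with Pr ~ 1/x.
    std::vector<double> w;
    for (int x = 1; x <= N / 2; ++x) w.push_back(1.0 / x);
    std::discrete_distribution<int> jump(w.begin(), w.end());
    std::uniform_int_distribution<int> coin(0, 1);
    std::vector<int> long_link(N);
    for (int i = 0; i < N; ++i) {
        int x = jump(gen) + 1;            // jump length in [1, N/2]
        int dir = coin(gen) ? 1 : -1;     // jump direction
        long_link[i] = ((i + dir * x) % N + N) % N;
    }

    auto ring_dist = [&](int a, int b) { // L_1 distance on the ring
        int d = std::abs(a - b);
        return std::min(d, N - d);
    };

    // Greedy forwarding: among the grid neighbors (i-1, i+1) and the long
    // link, step to the candidate whose key is closest to the target key.
    int cur = 0, target = N / 2, hops = 0;
    while (cur != target) {
        int cands[3] = {(cur + 1) % N, (cur + N - 1) % N, long_link[cur]};
        int best = cands[0];
        for (int c : cands)
            if (ring_dist(c, target) < ring_dist(best, target)) best = c;
        cur = best; // a grid neighbor always strictly reduces the distance
        ++hops;
    }
    std::cout << "delivered in " << hops << " hops (N = " << N << ")\n";
}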
The way we defined the neighborhood relationship between the peer-to-peer database nodes based on the distance between their keys, together with the clustering property of the resulting small-world topology, allows for the effective execution of other types of similarity queries as well. On one hand, we defined the neighborhood relationship such that the neighbors of a node have keys closely similar to the key of the node and, consequently, similar to each other. On the other hand, due to the clustering property of the generated small-world graph, the neighbors of a node are closely connected in terms of the hop-count in the network (i.e., the number of edges on the path between each pair of nodes). Therefore, a locality of tightly connected nodes with closely similar keys is created at the neighborhood of each node in the network. With a topology constructed out of such localities, range and kNN queries can be executed efficiently in two phases: first, by an exact-match query to locate the locality of the query key $\vec{q}$, and second, by flooding the query throughout the locality of $\vec{q}$. With a localized topology, flooding at the locality of the query key is efficient. We can locate all the keys relevant to the range and kNN queries within a limited number of hops $h$ away from $\vec{q}$, where $h$ is independent of the size of the network $|N|$. With our simple peer-to-peer database example, for range and kNN queries all the relevant keys (and almost only the relevant keys) are visited within $h = O(r)$ and $h = O(\lceil k^{1/d} \rceil)$ hops from $\vec{q}$, respectively. Therefore, for both types of queries, $T$ is $O(\log|N| + h)$, $C_1$ is $O(\log|N| + h^d)$, and $C_2$ is $O(d\log|N| + h^d)$.

With an inclusive key space $V \subset \mathbb{Z}^d$, the simple peer-to-peer database example considered here is only of illustrative significance. We, however, use the same properties to develop SWAM, which applies to more general peer-to-peer database models.

B.2 SWAM Family

Almost all traditional access methods for database systems are based on one core idea to reduce the search space for efficient access (see the unified model by Chavez et al. [CNBYM01]). They recursively partition the key space into a set of disjoint similarity classes³. An index is then constructed as a hierarchy of the class representatives at successive levels (see Figure 5.3-a). The hierarchical index allows filtering out (i.e., dismissing without inspection) the irrelevant/dissimilar classes while the query is directed from the root of the hierarchy toward the similarity class of the query key. The average query time is logarithmic in the size of the database.
By mapping each node of the hierarchy to a peer-to-peer database node, the same idea can be directly applied to index peer-to-peer databases, although, as we show later, the resulting distributed hierarchical index structure is not appropriate for peer-to-peer databases. Consider $K$ as the set of keys available in a peer-to-peer database. Any similarity-based relation can be used to partition the key space. For example, in Figure 5.3-b, $V$ is recursively partitioned based on the GNA approach [Bri95].

³ The generic mathematical term for similarity class is equivalence class. Here, the equivalence relations that partition the space are based on the distance (or similarity) between the keys.

[Figure: (a) recursive partitioning; (b) recursive partitioning example: GNA; (c) flat partitioning]
Figure 5.3: Partitioning of Key Space

Starting from $V$ as the global similarity class, at each level the parent similarity class $c$ with class representative $\vec{k} \in K$ is partitioned into a set of $h$ disjoint subclasses $c_i$ with representative keys $\vec{k}_i \in K$ ($i \in I_h = [1..h]$) such that $c_i = \{\vec{k}' \in V \mid L_p(\vec{k}' - \vec{k}_i) < L_p(\vec{k}' - \vec{k}_j), \forall j \ne i\}$. Considering that in a peer-to-peer database each key $\vec{k}$ resides at a peer-to-peer database node $n_k$, the GNA-tree corresponding to such a space partitioning is a distributed GNA-tree in which $A(n_k) = \{n \in N \mid n = n_{k_i}, i \in I_h\}$. Query processing with such a distributed index tree is similar to that of its corresponding centralized counterpart, with the query actually traversing a physically constructed tree rather than a tree structure in memory. Although this indexing approach may seem appealing, due to the lack of a balanced load among its nodes it is inappropriate for peer-to-peer databases. The unbalanced load is evident by observing that nodes that represent larger similarity classes (i.e., nodes at the higher levels of the hierarchy) receive more queries to process. In the extreme case, the root of the hierarchy processes all queries. Besides, hierarchical structures are loop-free and intolerant to failures and/or autonomous behaviors of the peer-to-peer database nodes.

SWAM also employs the space partitioning idea; however, to avoid the problems with hierarchies, instead of recursive partitioning it assumes a flat partitioning (see Figure 5.3-c). Each key $\vec{k} \in K$ (or $n_k \in N$) represents its own similarity class $c_k \subseteq V$, and the set of $|K|$ similarity classes are collectively exhaustive, $V = \bigcup_{k \in K} c_k$, and mutually exclusive, $c_k \cap c_{k'} = \emptyset$ for $\vec{k} \ne \vec{k}'$. An uncharacteristic case is where two or more nodes store replicas of the same key $\vec{k}$; we assume all such nodes represent the same class $c_k$ redundantly. Such a partitioning scheme can potentially balance the query processing load among the peer-to-peer database nodes. With hierarchies, the neighborhood relationship between a pair of nodes is directly derived from the parent-child relationship between their corresponding similarity classes, to reflect the similarity between their classes. Similarly, with flat partitioning we define the neighborhood relationship based on the adjacency relationship between the similarity classes:
The challenge is to define the similarity-based partitioning relation such that the resulting graph-based index structure bears indexing characteristics similar to those of the hierarchical index structures. In particular, it should allow effective filtering of (i.e., avoid visiting) the irrelevant classes as the query is directed from a query originator toward the similarity class of the query key. Moreover, to support range and $k$NN similarity queries effectively, similar classes should, as in hierarchical index structures, be in proximity of each other in terms of hop-count in the index topology. Finally, the $O(\log N)$ expected query time achieved by the hierarchies is also desirable with the graph-based index structure. As outlined in Section B.1, these requirements are addressed by the properties of a basic small-world graph. A SWAM index structure is a general graph-based index structure that satisfies a generalization of the same properties, as follows:

Property 1: Monotonic approach toward query key. When a node with key $\vec{k}$ receives a query $\vec{q}$, either $\vec{q} \in c_k$, or the node has at least one neighbor with a key $\vec{k}'$ such that $L_p(\vec{k}' - \vec{q}) < L_p(\vec{k} - \vec{q})$. Consequently, if the node $n_k$ receives the query $\vec{q}$, it is guaranteed that for all $\vec{k}'' \in \{\vec{j} \in K \mid L_p(\vec{j} - \vec{q}) \geq L_p(\vec{k} - \vec{q})\}$ the node $n_{k''}$ will never be visited during the subsequent greedy forwarding, and the similarity class $c_{k''}$ is filtered.

Property 2: Localized index topology. With a localized index, for each node $n_k$ the set of nodes in its neighborhood $A(n_k)$ are tightly connected and store keys closely similar to $\vec{k}$. We measure these two characteristics with the metrics Clustering Coefficient (CC) and Neighbor Distance Distribution (NDD), respectively. For a node $n$, $CC_n = C(n)$ is defined by Equation 5.1; for a graph $G$, $CC_G = \frac{1}{|N|} \sum_{n \in N} CC_n$. Also, NDD is the probability distribution function of the random variable $X = L_p(\vec{k}' - \vec{k})$, $\forall n_k \in N,\ \forall n_{k'} \in A(n_k)$. As we discussed in Section B.1, a localized topology allows efficient processing of the range and $k$NN similarity queries.

Property 3: Logarithmic forwarding-path length. For an exact-match query (processed by greedy forwarding), on average $T = O(\log N)$.

Any graph-based index structure that maintains these SWAM properties is a member of the SWAM family. In Section B.3, we introduce an example SWAM index structure.

B.3 SWAM-V: A Voronoi-based SWAM

SWAM-V partitions the key space $V$ into a Voronoi diagram [OBSC00] (see Figure 5.4-a). For each key $\vec{k}_i \in K$ ($i \in I_{|K|}$), $n_{k_i} \in N$ represents the similarity class $c_{k_i} = \{\vec{k} \in V \mid L_p(\vec{k} - \vec{k}_i) < L_p(\vec{k} - \vec{k}_j),\ \forall j \neq i\}$, which is the Voronoi cell of $n_{k_i}$. Accordingly, the neighborhood of the node $n_{k_i}$ is defined as $A(n_{k_i}) = \{n_{k_j} \in N \mid c_{k_i} \text{ and } c_{k_j} \text{ are adjacent},\ j \in I_{|K|}\}$. Nodes that store replicas of the same key share the same neighborhood; i.e., if $\vec{k}_i = \vec{k}_j$, then $A(n_{k_i}) = A(n_{k_j})$. The resulting graph is the dual Delaunay graph of the Voronoi diagram and is unique for each diagram (see Figure 5.4-a). Since the neighborhood relationship is symmetric, the Delaunay graph is depicted as an undirected graph. The SWAM-V topology consists of a random graph component (identical to that of the small-world graph) superimposed over the Delaunay graph (see Figure 5.4-b).

[Figure 5.4: SWAM-V Index Structure. (a) Voronoi diagram and dual Delaunay graph; (b) SWAM-V topology: Delaunay component plus random graph component.]
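Assuming 2-D keys and the $L_2$ distance, the Delaunay component of such a topology can be sketched with SciPy's scipy.spatial.Delaunay, since two Voronoi cells are adjacent exactly when their sites share a Delaunay edge; the random-graph component is then added on top. This is our illustrative sketch, not the dissertation's construction, and swamv_topology is a hypothetical name.

```python
import itertools
import random
import numpy as np
from scipy.spatial import Delaunay

def swamv_topology(keys, seed=0):
    """Return adjacency sets A(n_k) by node index: Delaunay neighbors
    (i.e., Voronoi-cell adjacency) plus at most one random long link
    per node as the superimposed random-graph component."""
    rng = random.Random(seed)
    pts = np.asarray(keys, dtype=float)
    tri = Delaunay(pts)
    neighbors = {i: set() for i in range(len(keys))}
    for simplex in tri.simplices:            # each triangle's vertex indices
        for a, b in itertools.combinations(simplex, 2):
            a, b = int(a), int(b)
            neighbors[a].add(b)              # sharing a Delaunay edge means
            neighbors[b].add(a)              # the Voronoi cells are adjacent
    for i in range(len(keys)):               # random-graph component
        j = rng.randrange(len(keys))
        if j != i:
            neighbors[i].add(j)
            neighbors[j].add(i)
    return neighbors
```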
Theorem 11. The SWAM-V index structure satisfies the SWAM Properties 1, 2, and 3.
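As a hedged illustration of how Property 2 might be checked empirically on such a topology (our sketch, not the dissertation's evaluation code; we assume Equation 5.1 is the usual clustering-coefficient definition, which may differ from the dissertation's exact form):

```python
import itertools

def lp_distance(k1, k2, p=2):  # as in the earlier sketch
    return sum(abs(a - b) ** p for a, b in zip(k1, k2)) ** (1.0 / p)

def node_cc(neighbors, n):
    """CC_n: fraction of pairs of n's neighbors that are themselves linked
    (assumed form of Equation 5.1)."""
    nbrs = list(neighbors[n])
    if len(nbrs) < 2:
        return 0.0
    linked = sum(1 for a, b in itertools.combinations(nbrs, 2)
                 if b in neighbors[a])
    return 2.0 * linked / (len(nbrs) * (len(nbrs) - 1))

def graph_cc(neighbors):
    """CC_G = (1/|N|) * sum over all nodes n of CC_n."""
    return sum(node_cc(neighbors, n) for n in neighbors) / len(neighbors)

def ndd_samples(neighbors, keys, p=2):
    """Samples of X = L_p(k' - k) over all (node, neighbor) pairs; their
    empirical distribution approximates the NDD."""
    return [lp_distance(keys[n], keys[m], p)
            for n in neighbors for m in neighbors[n]]
```

For example, graph_cc(swamv_topology(keys)) estimates $CC_G$ for the SWAM-V topology sketched above, and a histogram of ndd_samples approximates its NDD.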
Abstract
Peer-to-peer networks are considered the new generation of distributed databases, termed peer-to-peer databases. Given their very large size, open architecture, and extreme dynamism and autonomy, approximate query answering is arguably the most promising approach to query answering in peer-to-peer databases. To enable approximate query answering, in this dissertation we propose a set of universal sampling operations specifically designed for probing the data in peer-to-peer databases. We complement these sampling operations, which form the bottom tier, with a set of approximate query processing techniques at the top tier, to develop a two-tier system for answering both set-valued and aggregate queries in peer-to-peer databases.