WOLAP: Wavelet-Based On-Line Analytical Processing

by

Mehrdad Jahangiri

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

August 2008

Copyright 2008 Mehrdad Jahangiri

Dedication

To my father, my mother, my brother, my sister, and my dear wife.

Acknowledgments

I would like to thank my academic advisor, Professor Cyrus Shahabi, for his great support throughout my study at the University of Southern California. Cyrus gave me intellectual freedom, provided me with ample resources, and demanded high-quality work at all times. His friendly mentorship and great vision have always been a source of inspiration to me.

I would also like to thank Professor Antonio Ortega, Professor Shri Narayanan, Professor Ramakant Nevatia, and Professor Aiichiro Nakano for serving on my PhD qualification and dissertation committees. I am grateful for their support, guidance, and insight.

I am grateful to all my Infolab colleagues for their advice and friendship. Among them, Rolfe Schmidt, Dimitris Sacharidis, and Dr. Farnoush Banaei-Kashani deserve special recognition for their contributions toward the completion of this thesis.

Last but not least, I sincerely appreciate my family and friends for their encouragement and support throughout all these years. I am quite grateful to them all.

Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Motivation and Problem Statement
  1.2 WOLAP: Wavelet-based OLAP
    1.2.1 Range Aggregate Query Processing with Wavelets
    1.2.2 Maintenance of Wavelet-transformed Data
    1.2.3 Range Group-by Query Processing with Wavelets
  1.3 Road Map
Chapter 2: Preliminaries: Wavelets
  2.1 Introduction
  2.2 Discrete Wavelet Transform
  2.3 Wavelet Tree
  2.4 Multidimensional DWT
    2.4.1 Standard Form
    2.4.2 Non-Standard Form
  2.5 Wavelet Matrix
Chapter 3: Range Aggregate Query Processing with Wavelets
  3.1 Introduction
  3.2 Polynomial Aggregate Queries
    3.2.1 Defining Polynomial Queries as Dot Products
    3.2.2 Processing Polynomial Queries in Wavelets
  3.3 Query Transformation
    3.3.1 One-dimensional Query
    3.3.2 Multi-dimensional Query
    3.3.3 Update Query using Wavelets
  3.4 Collection of Fixed Measures
  3.5 Approximation and Progressiveness
    3.5.1 Data Approximation
    3.5.2 Query Approximation and Progressiveness
  3.6 Disk Block Allocation
  3.7 Toward Realization of WOLAP
    3.7.1 Filter Selection
    3.7.2 Auxiliary Coefficients
  3.8 Summary
Chapter 4: Maintenance of Wavelet Transformed Data
  4.1 Introduction
    4.1.1 Data Maintenance Scenarios
    4.1.2 Outline
  4.2 Disk Block Allocation of Wavelet Coefficients
    4.2.1 Multidimensional Wavelet Trees
    4.2.2 Disk Block Allocation of Multidimensional Wavelets
  4.3 Shift-Split
    4.3.1 Multidimensional Shift-Split
    4.3.2 Shift-Split of Tiles
  4.4 Shift-Split Applications
    4.4.1 Transformation of Massive Multidimensional Datasets
    4.4.2 Appending to Wavelet Decomposed Transforms
    4.4.3 Data Stream Approximation
    4.4.4 Partial Reconstruction from Wavelet Transforms
  4.5 Summary
Chapter 5: Range Group-by Query Processing with Wavelets
  5.1 Introduction
  5.2 Range Group-by Query
  5.3 Range Group-by Query Processing with Wavelets
    5.3.1 Aggregation Phase
    5.3.2 Reconstruction Phase
  5.4 Approximation and Progressiveness
  5.5 Summary
Chapter 6: ProDA: An End-to-End WOLAP System
  6.1 Introduction
  6.2 Case Studies
    6.2.1 Earth Science
    6.2.2 Oil Industry
  6.3 Bottom Tier: Storage Engine
  6.4 Middle Tier: Query Engine
    6.4.1 Browse Queries
    6.4.2 Essential Querying Services
    6.4.3 Advanced Querying Services
    6.4.4 Data Mining Services
  6.5 Top Tier: Visualization Engine
    6.5.1 Data Visualization
    6.5.2 Query Visualization
    6.5.3 High-fidelity UI
    6.5.4 Connectivity Management
    6.5.5 Advanced Visualization
  6.6 Summary
Chapter 7: Experiments
  7.1 Range Aggregate Query Processing with Wavelets
    7.1.1 Experimental Datasets
    7.1.2 WOLAP Datacubes
    7.1.3 Storage, Query, and Update Performance
    7.1.4 Approximation and Progressiveness
    7.1.5 WOLAP vs. General OLAP Cubings
    7.1.6 WOLAP vs. Other Wavelet-based Techniques
    7.1.7 Disk Block Allocation
  7.2 Maintenance of Wavelet Transformed Data
    7.2.1 Transformation of Massive Multidimensional Datasets
    7.2.2 Appending to Wavelet-Transformed Data
    7.2.3 Data Stream Approximation
  7.3 Range Group-by Query Processing with Wavelets
    7.3.1 Experimental Setup
    7.3.2 Performance Analysis
Chapter 8: Related Work
  8.1 Range Aggregate Queries
  8.2 Maintenance of Wavelet Transformed Data
  8.3 Range Group-by Queries
Chapter 9: Conclusions
References

List of Tables

3.1 Complexity table for WOLAP range aggregate processing
4.1 Complexity of multidimensional Shift-Split
4.2 Complexity of Shift-Split for multidimensional tiles
4.3 I/O complexities for transformation of multidimensional datasets
8.1 Query/update tradeoff for exact range aggregate algorithms

List of Figures

2.1 Discrete Wavelet Transform
2.2 Support intervals of Haar wavelets
2.3 Haar wavelet tree
2.4 2-level decomposition
3.1 WOLAP framework for processing range aggregate queries
3.2 Multidimensional view of data and queries
3.3 Wavelet-transformed data vector and query vector
3.4 Lazy Wavelet Transform using Haar filter
3.5 Multidimensional query using tensor product
3.6 Wavelet update using Haar filter
3.7 Multidimensional view of data
3.8 Disk block allocation strategies
3.9 Auxiliary coefficients
4.1 Disk block allocation strategy
4.2 Standard form wavelet trees
4.3 Standard form data point reconstruction
4.4 Non-standard form wavelet tree
4.5 Shift-Split operations
4.6 Transformation by chunks
4.7 Wavelet tree expanding
5.1 Example of range group-by query
5.2 Aggregation in the wavelet domain
5.3 Range group-by query with wavelets
6.1 ProDA's architecture
6.2 ProDA's query engine
6.3 Post-ordered polynomial aggregate query
6.4 A sample ProDA client for progressive querying (in C#)
6.5 Data visualization
6.6 Query visualization
6.7 ProDA's export
7.1 Query performance vs. wavelet filter
7.2 Storage, query, and update performance of WOLAP's cube models
7.3 WOLAP query progressiveness
7.4 WOLAP with data approximation
7.5 WOLAP vs. general OLAP cubings
7.6 WOLAP vs. other wavelet-based techniques
7.7 Disk block allocation methods
7.8 Effect of large disk blocks
7.9 Effect of larger memory
7.10 Effect of larger tiles
7.11 Shift-Split in appending
7.12 Shift-Split in multidimensional streaming
7.13 Performance analysis of range group-by queries
7.14 Effect of range size on range group-by queries
7.15 Effect of number of grouping dimensions on range group-by queries
7.16 Range group-by query at different resolutions
7.17 Progressiveness in range group-by query processing

Abstract

Wavelet Transform has emerged as an elegant tool for online analytical queries. Most of the methods using wavelets, however, share the disadvantage of providing only data-dependent approximate answers by compressing the data. On the contrary, we propose a wavelet-based query processing technique, WOLAP, which does not require compressing the data. Instead, we employ wavelet transform to compact incoming queries rather than the underlying data. The intuition here is that queries are well formed, with repetitive patterns that wavelets can exploit for more effective compression, leading to efficient query performance. WOLAP extends the set of ad-hoc analytical queries to include the entire family of range polynomial aggregate queries as well as the complex class of range group-by queries. In addition, leveraging the multi-resolution property of wavelets, WOLAP supports progressive and approximate query processing in case of time or space limitations. Toward realizing the practical use of WOLAP, we provide a framework to efficiently maintain large multidimensional wavelet-transformed data.
In particular, by introducing two novel operations that work directly in the wavelet domain, we allow WOLAP to transform, untransform, store, update, and append data in an I/O-efficient manner. By developing a real system and conducting extensive sets of experiments with several real-world datasets, we have verified the effectiveness of WOLAP in practice.

Chapter 1: Introduction

1.1 Motivation and Problem Statement

Recent advancements in sensing and data acquisition technologies have enabled the collection of massive datasets that represent complex real-world events and entities in fine detail. With access to such datasets, scientists and system analysts are no longer restricted to modeling and simulation when analyzing real-world events. Instead, the preferred approach derives observations and verifies hypotheses by analytical exploration of representative real datasets that capture the corresponding event. This approach demands intelligent data storage, access, and analytical querying solutions and tools that facilitate convenient, efficient, and effective exploration of these massive datasets. It poses both an opportunity and a grand challenge for the database community.

By design, developers optimize traditional databases for transactional rather than analytical query processing. These databases support only a few basic analytical queries with suboptimal performance and therefore provide inappropriate tools for analyzing massive datasets.

Instead, current practice exploits the extensive analytical query processing capabilities of spreadsheet applications such as Microsoft Excel and Lotus 1-2-3 to explore the data. However, in this case the limitation lies in the spreadsheet applications' capacity to handle large datasets. With this approach, while the original datasets still reside in a database server, smaller subsets of the data are selected (by sampling, aggregation, or categorization) and retrieved as new data products for further local processing with the spreadsheet application at the client side. This inconvenient and time-consuming process of generating a secondhand dataset may unavoidably result in the loss of relevant or detailed information. It can bias the analysis and encourage analysts to justify their own assumptions rather than discover surprising latent facts in the data. Further, given that most of the analysis occurs locally at the client with this approach, the data transfer overhead increases, resource sharing at the server side does not apply, and considerable processing power is required at the client side.

Online Analytical Processing (OLAP) tools have emerged to address the limitations of traditional databases and spreadsheet applications. Unlike traditional databases, OLAP tools support a range of complex analytical queries; unlike spreadsheet applications, they can also handle massive datasets. Moreover, OLAP tools can process user queries online. Online query processing is arguably a requirement for effectively supporting exploratory data querying. However, current OLAP tools rely heavily on precalculating query results to enable online query processing. Consequently, they can support only a limited set of predefined (rather than ad hoc) queries online. In this dissertation, we go beyond the elementary OLAP queries by addressing more sophisticated analytical queries such as range polynomial aggregate queries and range group-by queries.
By pushing these frequently used analytical queries closer to the data, we enable end-users to interact efficiently with the data stored at the server. We provide such an end-to-end data analysis system by utilizing the wavelet transform and benefiting from its multi-resolution properties. The wavelet transform is essentially a signal processing tool that provides a multi-scale decomposition of data by creating "rough" and "smooth" views of the data at different resolutions. We have adopted the wavelet transform in databases to enable efficient analysis of massive multidimensional datasets.

1.2 WOLAP: Wavelet-based OLAP

We classify our studies into three parts: range aggregate query processing, data maintenance, and range group-by processing. First, we investigate how to process polynomial range aggregate queries efficiently and progressively using the wavelet transform. Next, we show how to effectively prepare and maintain the transformed data. Finally, we address an important class of queries, range group-by queries, and efficiently process them using wavelets. We would like to point out that we have implemented all these techniques in an end-to-end scientific data analysis system and verified their benefits in practice (see Chapter 6).

1.2.1 Range Aggregate Query Processing with Wavelets

Scientific data analysis systems need to perform complex statistical queries on very large multidimensional datasets; thus, a number of multivariate statistical methods (e.g., calculation of correlation or kurtosis) must be supported. On top of that, the desired accuracy varies per application, user, and/or dataset, and it can well be traded off for faster response time. We believe that the methodologies described in this dissertation can contribute toward this end by providing progressive processing of polynomial aggregate queries. The space of polynomial queries is large enough to contain many complex functions; for example, in the context of statistical queries, the class includes second-order functions such as covariance, third-order skew, and fourth-order kurtosis, besides the typical average and variance functions.

We propose a general framework that utilizes the wavelet transformation of multidimensional data and provides progressiveness by retrieving data coefficients in the order given by a significance function. The use of the wavelet transformation is justified by the fact that the query cost is reduced from a function of the query size to the logarithm of the data size, which is a major benefit especially for large range queries. Our general query formulation supports any high-order polynomial range-aggregate query (e.g., variance, covariance, kurtosis) and processes it efficiently in a progressive fashion.

Let us note that our work is not a simple application of wavelets to scientific datasets. Traditionally, data transformation techniques such as wavelets have been used to compress data. The idea is to transform the raw dataset into an alternative form in which, by exploiting the inherent correlation in the raw dataset, many data points (termed coefficients) become zero or small enough to be negligible. Consequently, the negligible coefficients can be dropped and the rest suffice to later reconstruct the data with minimal error, hence the compression of the data. However, there is a major difference between the main objective of compression applications using wavelets and that of database applications.
In compression applications, the main objective is to compress data in such a way that one can reconstruct the dataset in its entirety with as little error as possible. Consequently, at data generation time, one can decide which wavelet coefficients to keep and which to drop. With database queries, in contrast, each range-sum query is interested in some bounded area of the data. Reconstruction of the entire dataset is neither feasible nor efficient. Hence, for database applications, one cannot optimally sort the coefficients at data generation or population time, or even specify which coefficients to keep or drop. Therefore, we advocate the transformation of queries and study alternative ways of ordering the query coefficients to achieve better progressive (or approximate) answers to polynomial queries.

1.2.2 Maintenance of Wavelet-transformed Data

The valuable benefit of wavelet-based query processing raises the demand for maintaining wavelet-transformed data as efficiently as possible. In particular, we must be able to wavelet-transform massive multidimensional datasets in an I/O-efficient manner when available memory is limited. In addition, we must support efficient methods for updating the transformed data and for appending new data to the existing transformed data. Appending is fundamentally different from updating in that it results in the growth of the domain of one or more dimensions. As a result, the wavelet-decomposed dimensions also grow, new levels of transformation are introduced, and therefore the transform itself changes. Furthermore, we need to maintain the approximated data as new data arrive. The requirement here is to construct a space- and time-efficient algorithm for maintaining the representative synopsis. Finally, we must be able to efficiently reconstruct a set of values specified by a range on a multidimensional dataset from its wavelet transform. This problem is equivalent to translating the selection operation of relational algebra into the wavelet domain. We would like to perform all these tasks directly in the wavelet domain, preserving as much of the transformed data as possible and avoiding reconstruction of the original data.

In this dissertation, we introduce two novel operations for wavelet-decomposed data, named SHIFT and SPLIT, that stem from the multiresolution properties of wavelets and provide general-purpose functionality. They are designed to work directly in the wavelet domain and can be utilized in a wide range of data-intensive applications, such as those mentioned above, resulting in significant improvements in every case.

Subsequently, we demonstrate the usefulness of these operations in four widely exercised data maintenance scenarios. The scenarios are diverse enough to cover most of the areas where wavelets are used, but not exhaustive, as we conjecture that the applications that can benefit from the SHIFT and SPLIT operations are plenty. The scenarios examined here share the fact that the problem they deal with has a straightforward solution when dealing with untransformed data; one is therefore compelled to first reconstruct the original data from the transformed data. However, we are interested in working entirely in the wavelet domain, and as we will see, this becomes a complicating, but fruitful, factor.

Furthermore, we have observed that queries on wavelet-transformed data exhibit a particular access pattern.
This strong dependency among wavelet coefficients, enforced by the multi-scale nesting property, leads us to construct multidimensional tiles containing wavelet coefficients that are related to each other under a particular access pattern. These tiles are then stored directly in secondary storage, with their size adjusted to fit a disk block. Using this tiling approach we can minimize the number of disk I/Os needed to perform any operation in the wavelet domain, including the important reconstruction operation, which results in significant query cost reductions. We present the multidimensional tiling of the wavelet coefficients first, and then we show that the SHIFT and SPLIT operations are tiling-friendly, as these operations benefit significantly from the existence of tiling.

1.2.3 Range Group-by Query Processing with Wavelets

Range group-by queries are among the most important analytical queries and are widely used in decision support systems and scientific applications. A group-by divides the data into groups and, per group, summarizes the data over one or more attributes for the given arbitrary range. However, current practice for evaluating a range group-by query is to compute an aggregation for each group individually, which for large queries (i.e., those containing many groups) results in performing a large number of aggregate queries. In contrast, we process a range group-by query as a single query by proposing a novel wavelet-based technique that exploits I/O sharing across groups to evaluate them efficiently. The intuition behind our approach comes from the fact that we can decompose a range group-by query into two sets of queries: (1) aggregate queries and (2) reconstruction queries. Subsequently, we can effectively compute both in the wavelet domain by extending our earlier studies.

In most cases, the valuable insight provided by this class of queries comes from viewing the relationship among the group values, either by generating a corresponding graph or by using a pivot table on the query output. Thus it is essential to preserve this relationship in approximate or progressive answering, rather than conserving the accuracy of each individual query. Hence, we must treat the range group-by query as a single query instead of multiple unrelated individual aggregate queries. This becomes especially important in the WOLAP system, since the wavelet transform distributes the group values across the transformed coefficients. Consequently, it is important to exploit the dependency across different queries by sharing the common coefficients across the groups. In this study, we show that our technique is not only efficient as an exact algorithm but also very effective as an approximation method for the entire group in case of limited query time or storage space.

Our method for processing range group-by queries effectively exploits and extends our two earlier studies, the progressive range aggregation and the efficient wavelet reconstruction techniques. We believe this study can proactively lead us toward building an end-to-end scientific data analysis system, enabling efficient interaction with massive multidimensional datasets.

1.3 Road Map

The remainder of this dissertation is organized as follows. We review the wavelet preliminaries in Chapter 2. In Chapter 3, we discuss our wavelet-based method for range aggregate query processing. We describe our method of wavelet data maintenance in Chapter 4.
Next, we address an important class of queries, range group-by queries, and propose an efficient processing method for them in Chapter 5. In Chapter 6, we overview the real system designed and developed based on the techniques in this dissertation. We extensively examine our methods with real-world datasets in Chapter 7. Finally, we review the related work in Chapter 8.

Chapter 2: Preliminaries: Wavelets

2.1 Introduction

In this chapter, we overview the preliminary concepts of the Wavelet Transform that we use throughout this dissertation.

2.2 Discrete Wavelet Transform

The Discrete Wavelet Transform (DWT) is a series of decompositions of the data that creates "rough" and "smooth" views of the data at different resolutions. The "smooth" view consists of averages, or summary coefficients, whereas the "rough" view consists of differences, or detail coefficients. At each resolution, termed a level of decomposition, the summaries and details are constructed by pairwise averaging and differencing of the summaries of the previous level.

More formally, the summary of the data is produced by a low-pass filter $H$, which filters out the rough elements, while the detail view of the data is produced by a high-pass filter $G$, which filters out the smooth elements. A filter is simply a set of coefficients that perform a convolution-decimation on the input to produce the output. Let $H = \{h_0, \ldots, h_{l-1}\}$ and $G = \{g_0, \ldots, g_{l-1}\}$ be the wavelet filters of length $l$, and let $u_{k,j}$ and $w_{k,j}$ be the $j$-th summary and the $j$-th detail coefficient of the $k$-th level of decomposition. We have:

$$u_{k+1,i} = \sum_{j=0}^{N/2^k - 1} h_{j-2i}\, u_{k,j} \quad\equiv\quad U_{k+1} = H U_k$$

$$w_{k+1,i} = \sum_{j=0}^{N/2^k - 1} g_{j-2i}\, u_{k,j} \quad\equiv\quad W_{k+1} = G U_k$$

The DWT is performed by chaining these filters on the output of the low-pass filter; doing so recursively leads to the multiresolution view of the data. Figure 2.1 illustrates the discrete wavelet transform of a vector $a$: the 3-level wavelet decomposition of $a$ consists of the 3rd-level summary coefficients and the detail coefficients across all levels of decomposition.

[Figure 2.1: Discrete Wavelet Transform. The vector $a = (u_{0,0}, \ldots, u_{0,7})$ decomposes through levels $U_1, W_1$ through $U_3, W_3$, yielding $\hat{a} = (u_{3,0}, w_{3,0}, w_{2,0}, w_{2,1}, w_{1,0}, w_{1,1}, w_{1,2}, w_{1,3})$.]

We denote that the vector $\hat{a}$ is the Discrete Wavelet Transform of $a$ by $\hat{a} = \mathrm{DWT}(a)$. Note that the untransformed vector $a$ contains the summary coefficients at the 0-th level of decomposition: $a[j] = u_{0,j}$ for $0 \le j < N$.

If we denote the set of summary coefficients at the $k$-th level by $U_k$ and the set of wavelet coefficients at the $k$-th level by $W_k$, we can formally write the previous statement as $U_k = U_{k+1} \oplus W_{k+1}$, where the direct-sum notation $\oplus$ refers to the decomposition process. The original data are the summary coefficients of the 0-th level. For example, the 3-level decomposition shown in Figure 2.1 is formally written as:

$$U_0 = U_1 \oplus W_1 = U_2 \oplus W_2 \oplus W_1 = U_3 \oplus W_3 \oplus W_2 \oplus W_1$$

In the case of the Haar filter, the simplest and first-discovered wavelet filter, the low-pass filter comprises the coefficients $\{\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}\}$ and the high-pass filter comprises the coefficients $\{\frac{1}{\sqrt{2}}, -\frac{1}{\sqrt{2}}\}$. Let us show with an example how to use this filter to wavelet-transform an array.

Example. Consider a vector of 4 values, $\{2, 6, 7, 1\}$. Let us apply the DWT on this vector using the Haar filter. We first take the pairwise summaries $\{\frac{2+6}{\sqrt{2}}, \frac{7+1}{\sqrt{2}}\}$ and the pairwise details $\{\frac{2-6}{\sqrt{2}}, \frac{7-1}{\sqrt{2}}\}$. The result is two vectors, each half the size of the original, containing a smoother version of the data (the summaries) and a rougher version (the details); these coefficients form the first level of decomposition, $\{4\sqrt{2}, 4\sqrt{2}, -2\sqrt{2}, 3\sqrt{2}\}$. We continue by constructing the summary and detail coefficients from the smooth version of the data, $\{4\sqrt{2}, 4\sqrt{2}\}$: the new summary is $\{8\}$ and the new detail is $\{0\}$, forming the second and last level of decomposition. Notice that 8 is the weighted average of the entire vector, as it is produced by recursively averaging the summaries. Similarly, 0 represents the weighted difference between the summary of the first half of the vector and the summary of the second half. The final summary and the details produced at all levels of decomposition form the Haar transform: $\{8, 0, -2\sqrt{2}, 3\sqrt{2}\}$. Notice that at each level of decomposition the summaries and details can be used to reconstruct the averages of the previous level.
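To make the arithmetic of this example concrete, here is a minimal Python sketch of the orthonormal Haar decomposition (our own illustration, not the dissertation's implementation; the function names haar_step and haar_dwt are ours):

```python
import math

def haar_step(u):
    """One level of Haar convolution-decimation: pairwise summaries and details."""
    s = 1 / math.sqrt(2)
    summaries = [s * (u[2*i] + u[2*i + 1]) for i in range(len(u) // 2)]
    details   = [s * (u[2*i] - u[2*i + 1]) for i in range(len(u) // 2)]
    return summaries, details

def haar_dwt(a):
    """Full Haar DWT of a vector whose length is a power of two."""
    u, coeffs = list(a), []
    while len(u) > 1:
        u, w = haar_step(u)
        coeffs = w + coeffs   # details of coarser levels precede finer ones
    return u + coeffs         # final summary first, then all details

print(haar_dwt([2, 6, 7, 1]))  # [8.0, 0.0, -2.828..., 4.242...] = {8, 0, -2*sqrt(2), 3*sqrt(2)}
```

Running it on $\{2, 6, 7, 1\}$ reproduces the transform computed step by step above.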
In order to ensure that $H$ and $G$ act as "summary" and "detail" filters, the $h$ and $g$ coefficients are required to have the following properties:

$$g_i = (-1)^i h_i \qquad \text{(mirror relationship)}$$

$$\sum_{i=0}^{l-1} h_i = \sqrt{2} \qquad \text{(normality)}$$

$$\sum_{i=0}^{l-1} h_i h_{i+2j} = \begin{cases} 1 & \text{if } j = 0 \\ 0 & \text{if } j \neq 0 \end{cases} \qquad \text{(orthogonality)}$$

$$\sum_{i=0}^{l-1} i^{\nu} g_i = 0 \qquad \text{($\nu$ vanishing moments)}$$

These conditions imply that the filters form orthonormal bases for the transformation and consequently preserve the dot product, as the following lemma states.

Lemma 1. If $\hat{a}$ is the DWT of a vector $a$ and $\hat{b}$ is the DWT of a vector $b$, then $\langle a, b \rangle = \langle \hat{a}, \hat{b} \rangle$, that is,

$$\sum_i a[i] \cdot b[i] = \sum_j \hat{a}[j] \cdot \hat{b}[j]$$

We now define the term "support interval," which we will use frequently in this dissertation.

Definition 1. The support interval of a (wavelet or scaling) coefficient is the part of the original data that this coefficient depends on. For a vector of size $2^n$, using zero-based indexing, the support interval of a coefficient $u_{j,k}$ or $w_{j,k}$ is $[k 2^j, (k+1) 2^j - 1]$, for $0 \le j \le n$ and $0 \le k \le 2^{n-j} - 1$.

Figure 2.2 shows the support intervals of Haar wavelets for a vector of size 8.

[Figure 2.2: Support Intervals of Haar Wavelets, for a vector of size 8 across levels $j = 1, 2, 3$.]

Definition 2. A (wavelet or scaling) coefficient covers another (wavelet or scaling) coefficient if the support interval of the latter is (completely) contained in the support interval of the former.

For example, the first coefficient of the second level of decomposition, $w_{2,0}$, covers the first and second coefficients of the first level of decomposition, $w_{1,0}$ and $w_{1,1}$; see Figure 2.2.

Definition 3. An interval $I$ is a dyadic interval if $I = [k 2^j, (k+1) 2^j - 1]$, for $0 \le j \le n$ and $0 \le k \le 2^{n-j} - 1$.

Haar wavelet coefficients $w_{j,k}$ and Haar scaling coefficients $u_{j,k}$ have the property that their support intervals are dyadic intervals.
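Definitions 1 and 2 translate directly into a few lines of code; the following sketch (ours, for illustration) reproduces the covering example above for a vector of size 8:

```python
def support_interval(j, k):
    """Support interval [k*2^j, (k+1)*2^j - 1] of coefficient u_{j,k} or w_{j,k}."""
    return (k * 2**j, (k + 1) * 2**j - 1)

def covers(j1, k1, j2, k2):
    """True if the coefficient at (j1, k1) covers the coefficient at (j2, k2)."""
    lo1, hi1 = support_interval(j1, k1)
    lo2, hi2 = support_interval(j2, k2)
    return lo1 <= lo2 and hi2 <= hi1

print(support_interval(2, 0))   # (0, 3): w_{2,0} depends on a[0..3]
print(covers(2, 0, 1, 0))       # True:  w_{2,0} covers w_{1,0}
print(covers(2, 0, 1, 1))       # True:  w_{2,0} covers w_{1,1}
print(covers(2, 0, 1, 2))       # False: w_{1,2}'s support a[4..5] lies outside
```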
2.3 Wavelet Tree

In this section, we review the notion of the wavelet tree. The wavelet tree exploits the relationships between wavelet coefficients and thus simplifies our presentation throughout this dissertation. Figure 2.3 shows a Haar wavelet tree for a vector of size 8; summary coefficients are shown as squares, whereas wavelet coefficients are shown as circles. The original data are drawn with dotted lines as children of the leaf nodes of the tree. The Haar wavelet tree is a binary tree in which each node $w_{k,j}$ has exactly two children, $w_{k-1,2j}$ and $w_{k-1,2j+1}$. The summary coefficient $u_{n,0}$ is the root of the tree, having only one child, $w_{n,0}$. This tree structure has been given several names in the wavelet bibliography, such as error tree [VWI98, VW99], dependency graph [SS04], etc.

[Figure 2.3: Haar Wavelet Tree, drawn over the time-frequency plane; the root $u_{3,0}$ has the single child $w_{3,0}$, whose subtree spans $w_{2,0}, w_{2,1}$ and $w_{1,0}, \ldots, w_{1,3}$ down to the data $u_{0,0}, \ldots, u_{0,7}$.]

The beautiful property of this tree is that it portrays the way Haar wavelets partition the time-frequency plane; see Figure 2.3. As we navigate toward the leaves of the wavelet tree we gain accuracy in the time domain but simultaneously lose accuracy in the frequency domain, and vice versa.

2.4 Multidimensional DWT

To perform the wavelet decomposition of multidimensional datasets we need multidimensional wavelets and scaling functions. For illustration purposes, we focus our discussion on 2-dimensional transformations; the extension to higher dimensionality is straightforward. In general, the multidimensional wavelet and scaling spaces are constructed from tensor products of the one-dimensional wavelet vectors $W_j$ and scaling vectors $U_k$, where $j$ and $k$ index scale, or equivalently, levels of decomposition.

The tensor product of the vectors $U_j$ and $U_k$ results in a 2-dimensional subspace $U_{j,k} = U_j \otimes U_k$. Similarly, for $W_j$ and $W_k$ we shape the subspace $W^d_{j,k} = W_j \otimes W_k$. Tensor products between scaling and wavelet vectors result in two more sets of subspaces: $W^h_{j,k} = U_j \otimes W_k$ and $W^v_{j,k} = W_j \otimes U_k$. These four sets of 2-d subspaces, $U_{j,k}$, $W^d_{j,k}$, $W^h_{j,k}$, and $W^v_{j,k}$, can decompose the 2-d space. However, there are two ways to perform the multidimensional wavelet decomposition: the standard and the non-standard form.

2.4.1 Standard Form

To decompose a 2-d array of size $N^2$ using the standard form, we first completely decompose one dimension and then the other, the order being unimportant. This means that we first transform each of the $N$ rows of the array to construct a new array, and then take each of the $N$ columns of the new array and again perform a 1-d DWT on them. The final array is the 2-d standard transform of the original array.

In terms of subspaces, each 1-d untransformed vector is initially expressed using the 0-th level vector $U_0$. Therefore, each cell of the untransformed array is expressed by the 2-d space $U_{0,0} = U_0 \otimes U_0$. Decomposing both vectors $U_0$ to the first level of decomposition, $U_0 = W_1 \oplus U_1$, and distributing the tensor product, we get:

$$U_{0,0} = U_0 \otimes U_0 = (W_1 \oplus U_1) \otimes (W_1 \oplus U_1) = (W_1 \otimes W_1) \oplus (W_1 \otimes U_1) \oplus (U_1 \otimes W_1) \oplus (U_1 \otimes U_1) = W^d_{1,1} \oplus W^v_{1,1} \oplus W^h_{1,1} \oplus U_{1,1}$$

Decomposing both vectors $U_0$ to the second level of decomposition, $U_0 = W_1 \oplus W_2 \oplus U_2$, and distributing the tensor product, we get:

$$U_{0,0} = (W_1 \oplus W_2 \oplus U_2) \otimes (W_1 \oplus W_2 \oplus U_2) = W^d_{1,1} \oplus W^d_{1,2} \oplus W^v_{1,2} \oplus W^d_{2,1} \oplus W^d_{2,2} \oplus W^v_{2,2} \oplus W^h_{2,1} \oplus W^h_{2,2} \oplus U_{2,2}$$

Figure 2.4a shows how the 2-d array is partitioned into 9 subspaces for a 2-level decomposition.

[Figure 2.4: 2-level decomposition. (a) Standard form, 9 subspaces; (b) non-standard form, 7 subspaces.]

2.4.2 Non-Standard Form

The non-standard form differs in that the decomposition does not proceed on each dimension separately. Rather, after each level of decomposition, only the coefficients corresponding to the $U_{j,j}$ subspace are further decomposed. The first level of decomposition results from decomposing each dimension to the first level, exactly as in the standard form:

$$U_{0,0} = U_0 \otimes U_0 = (W_1 \oplus U_1) \otimes (W_1 \oplus U_1) = W^d_{1,1} \oplus W^v_{1,1} \oplus W^h_{1,1} \oplus U_{1,1}$$

By decomposing the scaling vector $U_1$ into the next level of decomposition, $U_1 = W_2 \oplus U_2$, we decompose only the $U_{1,1}$ subspace:

$$U_{0,0} = W^d_{1,1} \oplus W^v_{1,1} \oplus W^h_{1,1} \oplus (U_1 \otimes U_1) = W^d_{1,1} \oplus W^v_{1,1} \oplus W^h_{1,1} \oplus W^d_{2,2} \oplus W^v_{2,2} \oplus W^h_{2,2} \oplus U_{2,2}$$

Figure 2.4b shows how the 2-d array is partitioned into 7 subspaces for a 2-level decomposition.

2.5 Wavelet Matrix

Since the Wavelet Transform is a linear transformation, we can represent it by an $N \times N$ matrix $W$ that transforms the array $D$ of length $N$ as follows, where $\lambda$ represents the level of decomposition:

$$\hat{D}(i) = \sum_j W^{\lambda}(i,j)\, D(j)$$

For example, the matrices for one level of transformation, $W^1$, and two levels of transformation, $W^2$, of an array of size 4 are:

$$W^1 = \begin{pmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} & 0 & 0 \\ \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 0 & 0 \\ 0 & 0 & \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ 0 & 0 & \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} \end{pmatrix} \qquad W^2 = \begin{pmatrix} \frac{1}{2} & \frac{1}{2} & \frac{1}{2} & \frac{1}{2} \\ \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 0 & 0 \\ \frac{1}{2} & \frac{1}{2} & -\frac{1}{2} & -\frac{1}{2} \\ 0 & 0 & \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} \end{pmatrix}$$

Example. Consider a vector of 4 values, $D = \{6, 2, 7, 1\}$. We perform the first level of decomposition on this vector using the transformation matrix $W^1$: $\hat{D} = W^1 \cdot D = \{4\sqrt{2}, 2\sqrt{2}, 4\sqrt{2}, 3\sqrt{2}\}$. The second level of transformation is performed similarly using $W^2$: $\hat{D} = W^2 \cdot D = \{8, 2\sqrt{2}, 0, 3\sqrt{2}\}$.

We omit the superscript $\lambda$ when we refer to the maximum level of decomposition, that is, $\lambda = \log N$. We refer the reader to [Nie99] for more details about constructing these matrices.

Using the matrix notation, we represent the multidimensional transformation of a given multidimensional array $D$ as follows, where $W_x$ and $W_y$ are the transformations along the dimensions $x$ and $y$, respectively:

$$\hat{D}(x,y) = \sum_{s,t} W_x(x,s)\, W_y(y,t)\, D(s,t) \quad\Longleftrightarrow\quad \hat{D} = W_x W_y D$$

As a rule of thumb throughout the dissertation, we use $\hat{D}$ when we refer to the full transformation of $D$ along all its dimensions, whereas we denote the transformation of $D$ only along the $y$ dimension by $W_y D$.
Chapter 3: Range Aggregate Query Processing with Wavelets

3.1 Introduction

Range aggregate queries are the most frequent data analysis queries among scientific queries, and performing them on the massive multidimensional datasets regularly collected and stored in various scientific fields is highly resource-consuming. In this chapter, we present how to process this class of queries efficiently and progressively to expedite the extraction of useful information.

The Discrete Wavelet Transform has emerged as a favorable tool for range aggregate query processing on multidimensional datasets. However, most of the methods using the DWT share the disadvantage of providing only approximate answers by compressing the data. The efficacy of these approaches is highly data-dependent: they work only when the data have a concise wavelet approximation. In addition, these methods suffer from query-blind compression, as they mostly decompress the data first in order to process the query. The main pitfall here is that saving space by itself does not necessarily reduce the query complexity.

We propose a query processing technique in which the data do not have to be compressed. Instead, we employ the wavelet transform to compact incoming queries rather than the underlying data. This reduces the query cost from a function of the range size to a function of the logarithm of the data size, without sacrificing the update cost. In addition, we provide data-independent query approximations after a small number of I/Os by using only the most significant query wavelet coefficients. This approach naturally results in a progressive algorithm.

Unlike former range aggregate techniques, which support only typical aggregate queries, we enable efficient processing of a variety of polynomial queries. Toward this end, we initially utilized the wavelet transform of the data frequency distribution, known as DFD, to form and process such queries (see [SS02b]). However, over the last five years of deploying this methodology in various real-world scientific applications, we have found that preparing data frequency cubes is neither feasible nor efficient with most scientific datasets, due to their sparsity. To address the impracticality of DFD, we propose a new cube model, Collection of Fixed Measures (CFM), for polynomial query processing on scientific datasets. While our prior work assumed storing the data as one large data frequency distribution cube, the new model organizes the data as a collection of smaller fixed-measure cubes to enhance both the space and the query efficiency of our past study [SS02b] in scientific applications. We analytically and empirically compare the two models in this dissertation and show that CFM significantly outperforms DFD on real-world scientific datasets.

We integrate both models into a single framework, named WOLAP (Wavelet-based On-Line Analytical Processing), for efficient support of polynomial range-aggregate query processing. Figure 3.1 illustrates the components of this framework and provides a reference for the terminology we use throughout the chapter. Let us briefly describe the framework here. Given a polynomial range-aggregate query, WOLAP forms a query vector and efficiently wavelet-transforms it on the fly. Using either of the cube models, WOLAP selects the appropriate cube as the data vector. Note that these cubes are wavelet-transformed beforehand, at data population time. Subsequently, WOLAP provides the exact query answer by computing the dot product of the two vectors. WOLAP can also provide an approximate query result by simply using a subset of the data and/or query coefficients. By extending and generalizing the query subset selection, we enable progressive query processing at no additional cost. In addition to presenting all these components in this chapter, we fully describe how to efficiently transform a multidimensional query on the fly. For large datasets (or limited storage space), we also show how to utilize compressed data and still benefit from the progressive capability of WOLAP.

[Figure 3.1: WOLAP framework for processing range aggregate queries. A range and a polynomial function form the query vector, which the LWT transforms and orders; the data vector is selected from the DFD or CFM cube model (optionally as a synopsis); their dot product yields exact, approximate, or progressive answers.]

Furthermore, we identify a unique access pattern that arises when evaluating range aggregate queries on wavelet data, with WOLAP in particular but, even more importantly, with any other wavelet-based data management technique in general. Hence, we design an allocation strategy that results in significant improvements in the performance of WOLAP. In addition, by designing and exploiting a disk-block-aware progressive answering strategy, we extend WOLAP to deliver excellent approximate results after a small number of disk accesses.

Toward the realization of WOLAP as a practical framework, we relax two common assumptions of most wavelet techniques. First, we relax the unrealistic assumption that the domain of each attribute must be a power of two. We propose a method that allows WOLAP to work with arbitrary domain sizes by padding both data and query vectors with enough auxiliary coefficients. Second, we relax the assumption of using a single filter for the multidimensional transformation by proposing multiple 1-dimensional transformations, transforming each dimension with a different filter. The intuition behind this comes from the fact that we can separate a multidimensional aggregation into multiple 1-dimensional aggregations and associate each aggregation with the corresponding transformation. This allows us to treat each dimension differently and is very beneficial if attributes are subjected to unequal polynomial degrees.

We have designed and developed a real system [CSBK08, JS05] using the WOLAP framework and verified the concepts reviewed in this chapter. We briefly describe this system, called ProDA (for progressive data analysis), in Chapter 6. Using ProDA, we have conducted several experiments with four real-world and three synthetic datasets. The results of our experiments demonstrate the effectiveness of WOLAP in practice.

We begin our discussion by defining the polynomial range aggregate query and showing how to utilize wavelets to perform this type of query (Section 3.2). In Section 3.3, we propose an efficient algorithm for query transformation and analyze the complexity of wavelet query processing. Next, in Section 3.4, we address the inefficiency of the former data model, propose a practical model, and analytically compare the two. We extend our query framework with approximate and progressive query processing in Section 3.5. In Section 3.6, we propose an optimal strategy for storing the wavelet coefficients on disk drives. In Section 3.7, we address and relax two unrealistic assumptions about wavelets to further enhance WOLAP in real-world systems. Finally, we conclude our discussion in Section 3.8. Later, in Chapter 7, we extensively examine our framework with real-world datasets.

3.2 Polynomial Aggregate Queries

In this section we introduce the class of polynomial range-sum queries and show how we recast these queries as vector queries. Then we show how we use the same query model in the wavelet domain.

3.2.1 Defining Polynomial Queries as Dot Products

A polynomial query is a mathematical expression consisting of a sum over one or more attributes, each raised to a power. Most statistical queries can be formed from this class of queries. Therefore, we work with the most general form of polynomial queries to support common and ad-hoc statistical aggregate queries. To start the discussion, let us mathematically define polynomial queries.
Definition 4. Given a $t$-tuple dataset $T$, a range $R$, and a polynomial function $f : x \mapsto \prod_{i=0}^{t-1} \dot{x}_i^{\delta_i}$ for any tuple $x = (\dot{x}_1, \dot{x}_2, \ldots, \dot{x}_t)$ in $T$, the polynomial aggregate query is written as

$$Q(T, R, f) = \sum_{x \in T \cap R} f(x)$$

We call this query a polynomial range-sum query of degree $\delta_i$ on attribute $i$.

Example. Consider a 2-tuple dataset with entries $T = \{(\mathit{age}, \mathit{height}) : (15,140), (15,160), (15,180), (20,140), (20,160), (20,180), (25,160), (25,200), (30,140), (30,200)\}$. Let the range $R$ be the set of all ages between 15 and 25.

COUNT query: the number of people with ages between 15 and 25 is computed by choosing $f(x) \equiv \mathbf{1}(x) = 1$:

$$Q(T, R, \mathbf{1}) = \sum_{x \in T \cap R} \mathbf{1}(x) = 8$$

SUM query: by choosing $f(x) \equiv \mathit{height}(x)$, the query returns the sum of height inside the range $R$:

$$Q(T, R, \mathit{height}) = \sum_{x \in T \cap R} \mathit{height}(x) = 1320$$

AVERAGE query: an AVERAGE query can be composed of the two queries mentioned above:

$$\mathrm{Avg}(\mathit{height}) = \frac{Q(T, R, \mathit{height})}{Q(T, R, \mathbf{1})} = \frac{1320}{8} = 165$$

We can generalize this and form other common aggregate queries, e.g., variance or covariance, as polynomial queries.

Let us define two useful functions, the characteristic function and the data frequency distribution, which we use to simplify the query definition.

Definition 5. Given a range $R$, we define the corresponding characteristic function $\chi$ as follows:

$$\chi(x) = \begin{cases} 1 & \text{if } x \in R, \\ 0 & \text{otherwise.} \end{cases}$$

Definition 6. The data frequency distribution is the function $\Delta_T$ that maps a point $x$ to the number of times it occurs in $T$. Omitting the subscript, we write:

$$\Delta(x) = \sum_{x \in T} \mathbf{1}(x)$$

In databases, $\Delta$ is a $t$-dimensional datacube that describes the dataset $T$ in a multidimensional view. Every point of this cube holds the frequency of occurrence of the represented tuple in the relational data. That is why we name this cube a Data Frequency Distribution cube, or DFD for short.

Now we rewrite the basic definition of the range-sum as

$$Q(\Delta, \chi, f) = \sum_x f(x)\, \chi(x)\, \Delta(x)$$

This equation can be viewed as the dot product of two multidimensional vectors, $f(x)\chi(x)$ and $\Delta(x)$. We name these vectors the query vector $Q$ and the data vector $D$, respectively:

$$Q(\Delta, \chi, f) = \langle f\chi, \Delta \rangle = \langle Q, D \rangle$$

Example. Figure 3.2 shows the multidimensional view of the earlier example; the $4 \times 4$ cubes represent the input data and the queries. The count query is answered by $\langle Q_1, \Delta \rangle = 8$, in which $Q_1$ is defined as $Q_1 = \mathbf{1} \cdot \chi$ (see Figure 3.2b). The sum query is recast as $Q_2 = \mathit{height} \cdot \chi$; that is, every item $x$ inside the range $R$ takes the value $\mathit{height}(x)$ (see Figure 3.2c). We calculate the sum query by the dot product $\langle Q_2, \Delta \rangle = 1320$.

[Figure 3.2: Multidimensional View of Data and Queries. (a) $\Delta$: data, (b) $Q_1$: count, (c) $Q_2$: sum.]
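The recast of the example's COUNT and SUM queries as dot products is easy to check numerically. The sketch below (ours, for illustration) builds the $4 \times 4$ DFD cube $\Delta$ over the domains age $\in \{15, 20, 25, 30\}$ and height $\in \{140, 160, 180, 200\}$ and evaluates both queries:

```python
import numpy as np

ages    = [15, 20, 25, 30]
heights = [140, 160, 180, 200]
tuples  = [(15,140), (15,160), (15,180), (20,140), (20,160), (20,180),
           (25,160), (25,200), (30,140), (30,200)]

# Data vector D: the 4x4 data frequency distribution cube (Definition 6).
D = np.zeros((4, 4))
for a, h in tuples:
    D[ages.index(a), heights.index(h)] += 1

chi = np.array([[1.0 if a <= 25 else 0.0] for a in ages])  # range R: 15 <= age <= 25
Q1 = chi * np.ones((1, 4))                  # count query: f = 1 inside R
Q2 = chi * np.array([heights], float)       # sum query:   f = height inside R

print(np.sum(Q1 * D))                   # 8.0    (COUNT)
print(np.sum(Q2 * D))                   # 1320.0 (SUM)
print(np.sum(Q2 * D) / np.sum(Q1 * D))  # 165.0  (AVERAGE)
```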
3.2.2 Processing Polynomial Queries in Wavelets

As shown in Chapter 2 (Lemma 1), the Discrete Wavelet Transform preserves the Euclidean norm. Thus, the generalized Parseval equality applies to the DWT; that is, the dot product of two vectors equals the dot product of their wavelet transforms:

$$\langle Q, D \rangle = \langle \hat{Q}, \hat{D} \rangle$$

This property results in the following useful corollary.

Corollary 2. The answer to a range-sum query $Q$ over a datacube $D$ is given by the dot product of their wavelet transforms, $\langle \hat{Q}, \hat{D} \rangle$, computed as:

$$Q = \sum_{\xi :\, \hat{Q}(\xi) \neq 0} \hat{Q}(\xi) \cdot \hat{D}(\xi) \qquad (3.1)$$

Example. Let us answer the count query of Section 3.2.1 in the wavelet domain. We form $Q_1$ as before and wavelet-transform it, as shown in Figure 3.3b. The dot product of $\hat{\Delta}$ and $\hat{Q}_1$ produces the same query result: $\langle \hat{\Delta}, \hat{Q}_1 \rangle = 2.5 \times 3 + 0.5 \times 1 + 0 \times 1.414 = 8$. Using wavelets, we have reduced the number of operations from 12 to 3.

[Figure 3.3: Wavelet-Transformed Data Vector and Query Vector. (a) Wavelet datacube $\hat{\Delta}$, (b) wavelet count query $\hat{Q}_1$.]

Equation 3.1 hints at the main benefit of wavelets for us: if $\hat{Q}$ contains many zeros, then the query is answered very quickly. We prove in the next section why $\hat{Q}$ contains many zeros; intuitively, $Q$ is a smooth vector, so its wavelet transform $\hat{Q}$ contains many zeros. In Section 3.3, we describe how to efficiently transform the query vectors (i.e., $Q \to \hat{Q}$) on the fly and analyze the complexity of our method. We do not cover the data transformation (i.e., $D \to \hat{D}$) in this chapter, as the datacube is transformed offline at the time the database is built; for this offline task, we refer the reader to [JSS05], in which we introduce an efficient algorithm for wavelet-transforming large data. Later, in Section 3.5, we compute Equation 3.1 in a progressive fashion.

3.3 Query Transformation

Query transformation differs from data transformation in that it is performed on the fly using minimal resources, rather than offline with ample resources. The complexity of naively wavelet-transforming a datacube is $O(N^t)$, where $t$ is the number of cube dimensions and $N$ is the domain size of each dimension; naive transformation of a query would therefore require memory of the same size as the datacube. Clearly, this is neither efficient nor feasible, as the datacube is potentially very large and multiple queries are submitted to the database simultaneously. In this section we propose a fast algorithm for query transformation and compute the complexity of our query processing. We start with 1-dimensional data and then extend to the multidimensional case.

3.3.1 One-dimensional Query

In this section we present the advantage of the wavelet transformation for a 1-dimensional range query $q$. We show that for a range of size $R$ on data of size $N$, we need only $O(l \log N)$ non-zero coefficients of $\hat{q}$, independent of the range size, which is a major benefit especially when $R \gg l \log N$. Let us start the discussion with an example.

[Figure 3.4: Lazy Wavelet Transform using Haar filter. The query vector $q = \chi_{[1,12]}$ of size 16 and its wavelet tree; only the coefficients on the paths from the range boundaries to the root are non-zero.]

Example. Consider the transformation of the range $R = [1, 12]$ on the 0-degree polynomial query $q$ of size 16 using the Haar filter. The query $q$ and its wavelet tree are depicted in Figure 3.4. The only non-zero values of the transformed query $\hat{q}$ lie on the paths from the boundaries of the range to the root of the wavelet tree. Having at most two non-zero values per level of the wavelet tree, we have $2 \log N$ non-zero values in total.

It turns out that we always observe a similar trend using an arbitrary wavelet filter of length $l$: we operate only on the boundaries and produce at most $l$ non-zero detail coefficients per step on each boundary side. Every step can be carried out in constant time, allowing us to perform the entire transform in time and space $O(l \log N)$ instead of $O(N)$. We call this algorithm the Lazy Wavelet Transform (LWT), since it only evaluates wavelet coefficients when they are needed at the next recursive step.
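The sparsity claim is easy to observe numerically. The sketch below (ours; it uses a full transform rather than the lazy one, so it demonstrates the sparsity, not the $O(l \log N)$ running time) Haar-transforms the characteristic vector of the range, counts the non-zeros, and answers a range-sum both directly and via Equation 3.1:

```python
import math, random

def haar_dwt(a):
    """Orthonormal Haar DWT, as in Chapter 2."""
    s, u, coeffs = 1 / math.sqrt(2), list(a), []
    while len(u) > 1:
        u, w = ([s * (u[2*i] + u[2*i+1]) for i in range(len(u)//2)],
                [s * (u[2*i] - u[2*i+1]) for i in range(len(u)//2)])
        coeffs = w + coeffs
    return u + coeffs

N = 16
q = [1.0 if 1 <= i <= 12 else 0.0 for i in range(N)]   # range query chi_[1,12]
q_hat = haar_dwt(q)
print(sum(1 for c in q_hat if abs(c) > 1e-12))         # 8 = 2*log2(16) non-zeros

d = [random.random() for _ in range(N)]                # some data vector
d_hat = haar_dwt(d)
direct  = sum(qi * di for qi, di in zip(q, d))         # <Q, D>
wavelet = sum(qi * di for qi, di in zip(q_hat, d_hat)) # <Q^, D^>, Parseval
print(abs(direct - wavelet) < 1e-9)                    # True
```

Widening or narrowing the range leaves the non-zero count bounded by $2 \log N$, which is exactly the data-size-dependent (rather than range-size-dependent) cost the text claims.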
Now we show that, with appropriate choices of wavelet filters, we can obtain the same result for general polynomial range queries. We formally define the required condition on the wavelet filters.

Definition 7. A wavelet filter $\{g_0, \ldots, g_{l-1}\}$ is said to satisfy the moment condition of $\delta$ if $\sum_i g_i\, i^{\nu} = 0$ for $0 \le \nu \le \delta$. This condition is met when the wavelet filter has vanishing moment equal to or greater than $\delta$ (see Section 2.2).

When a wavelet filter satisfies the moment condition of $\delta$, it guarantees that operating the filter on a polynomial query of degree $\nu \le \delta$ produces zero coefficients everywhere except at the boundaries. The proof is straightforward: applying the detail filter to the internal query points of power $\nu$ produces only zero coefficients, because the vanishing property of the filter, $\sum_i g_i\, i^{\nu} = 0$, yields zero whenever $\nu \le \delta$.

Ingrid Daubechies [Dau92] introduced a family of wavelet filters with the maximal number of vanishing moments for a given filter length. The Haar filter, named db1 in this family, has the least vanishing moment, zero. The next filter in this family, db2, has vanishing moment one. In general, longer filters provide higher vanishing moments $\delta$; in particular, it is proven that vanishing moment and filter length are directly proportional for the Daubechies filter family, with $\delta = l/2 - 1$ (see [Dau92] for further information). We utilize this property in the following theorem and provide a descriptive example afterwards.

Theorem 3. Using Daubechies filters, a wavelet filter of length $l = 2\delta + 2$ is required to allow the use of the LWT on a polynomial query of degree $\delta$.

Example. Let us form three queries with polynomial degrees 0, 1, and 2 over the domain of integers from 1 to $N$. The wavelet filters satisfying the moment condition of the corresponding polynomial degree are:

$q = \{1, 1, 1, \ldots, 1\}$: $\delta = 0 \rightarrow l \ge 2$ (i.e., db1, db2, ...)
$q = \{1, 2, 3, \ldots, N\}$: $\delta = 1 \rightarrow l \ge 4$ (i.e., db2, db3, ...)
$q = \{1, 4, 9, \ldots, N^2\}$: $\delta = 2 \rightarrow l \ge 6$ (i.e., db3, db4, ...)

Clearly, the data domain is not a domain of integers for most datasets. However, we show here that if the data domain satisfies the following condition, we can still benefit from the LWT for any dataset. We apply this condition as a restriction on the data domain when the datacube is generated.

Definition 8. A domain $X = \{X_0, X_1, \ldots, X_{N-1}\}$ is said to satisfy the domain condition if $X_i = X_0 + i(X_1 - X_0)$. This condition ensures that the domain $X$ is uniformly partitioned, a.k.a. equi-width partitioned.

Under this condition, the data domain is simply a linear function of an integer domain. The following useful lemma allows us to apply the LWT on such domains.

Lemma 4. A domain $X$ satisfying the domain condition is applicable for the LWT; that is, operating on the query with the detail filter produces zeros everywhere except at the boundary points.

Proof. This domain is an arithmetic series with initial point $X_0$ and increment $X_1 - X_0$. Let us rewrite $X_i$ as $X_i = \alpha + i\beta$ for simplicity and apply the detail filter to $X$. We need to prove that $\sum_i g_i X_i^{\delta} = 0$. By the binomial expansion, we have:

$$\sum_i g_i X_i^{\delta} = \sum_i g_i (\alpha + i\beta)^{\delta} = \sum_i g_i \sum_{\nu} \binom{\delta}{\nu} \alpha^{\delta-\nu} \beta^{\nu} i^{\nu}$$

Distributing $g_i$, swapping the order of summation, and factoring out the constants, we get:

$$\sum_i g_i X_i^{\delta} = \sum_{\nu} \binom{\delta}{\nu} \alpha^{\delta-\nu} \beta^{\nu} \sum_i g_i\, i^{\nu}$$

Because of the $\delta$ vanishing moments of the filter, and knowing that $\nu \le \delta$, we have $\sum_i g_i\, i^{\nu} = 0$. This proves the lemma: $\sum_i g_i X_i^{\delta} = 0$.

Let us summarize the discussion so far: we can transform any polynomial query in time and space complexity $O(l \log N)$ provided we meet these two conditions (the moment condition and the domain condition).
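The moment condition can be checked numerically. The sketch below is ours; the low-pass coefficients are the standard Daubechies-4 (db2) values, and the high-pass filter is obtained via the usual alternating-flip construction $g_i = (-1)^i h_{l-1-i}$ (an assumption on our part, not a formula from the text). It verifies that db2 annihilates constant and linear sequences, so one convolution-decimation step on a degree-1 query yields zero details everywhere except where the periodic boundary wraps:

```python
import numpy as np

sq3 = np.sqrt(3)
h = np.array([1 + sq3, 3 + sq3, 3 - sq3, 1 - sq3]) / (4 * np.sqrt(2))  # db2 low-pass
g = np.array([h[3], -h[2], h[1], -h[0]])                               # db2 high-pass

# Moment condition (Definition 7): db2 has delta = l/2 - 1 = 1.
print(np.isclose(np.sum(g), 0))                  # True: nu = 0
print(np.isclose(np.sum(np.arange(4) * g), 0))   # True: nu = 1

# One level of convolution-decimation on the degree-1 query q = {1, 2, ..., N}.
N = 16
q = np.arange(1, N + 1, dtype=float)
w = np.array([sum(g[m] * q[(2*i + m) % N] for m in range(4))   # periodic wrap
              for i in range(N // 2)])
print(np.round(w, 10))   # zero except at the position where the filter wraps
```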
It is clear that the data domain is not a domain of integers for most datasets. However, we show that if the data domain satisfies the following condition, we can still benefit from LWT for any dataset. We apply this condition as a restriction on the data domain when the datacube is generated.

Definition 8. A domain $X = \{X_0, X_1, \ldots, X_{N-1}\}$ is said to satisfy the domain condition if $X_i = X_0 + i(X_1 - X_0)$.

This condition ensures that the domain $X$ is uniformly (a.k.a. equi-width) partitioned. Under this condition, the data domain is simply a linear combination of an integer domain plus a constant. The following lemma allows us to apply LWT on such domains.

Lemma 4. A domain $X$ satisfying the domain condition is amenable to LWT; that is, operating the query with the detail filter produces zeros everywhere except at the boundary points.

Proof. Such a domain is an arithmetic series with initial point $X_0$ and increment $X_1 - X_0$. For simplicity, rewrite $X_i = \alpha + i\beta$ and apply the detail filter to $X$. We need to prove that $\sum_i g_i X_i^{\delta} = 0$. Expanding with the binomial theorem:

$$\sum_i g_i X_i^{\delta} \;=\; \sum_i g_i (\alpha + i\beta)^{\delta} \;=\; \sum_i g_i \sum_{\nu=0}^{\delta} \binom{\delta}{\nu} \alpha^{\delta-\nu} \beta^{\nu} i^{\nu}$$

Distributing $g_i$, swapping the order of the sums, and factoring out the constants gives:

$$\sum_i g_i X_i^{\delta} \;=\; \sum_{\nu=0}^{\delta} \binom{\delta}{\nu} \alpha^{\delta-\nu} \beta^{\nu} \sum_i g_i i^{\nu}$$

Because the filter has $\delta$ vanishing moments and $\nu \le \delta$, we have $\sum_i g_i i^{\nu} = 0$, which proves the lemma: $\sum_i g_i X_i^{\delta} = 0$.

Let us summarize the discussion so far: we can transform any polynomial query in time and space complexity $O(l\log N)$, provided the moment condition and the domain condition are met.

3.3.2 Multi-dimensional Query

A $t$-dimensional query $Q$ is separable into $t$ 1-dimensional queries $q_i$ along each dimension, $q_i(\dot x_i) = f(\dot x_i)\chi(\dot x_i)$. The tensor product (a.k.a. outer product) of these vectors reproduces the original query:

$$Q = q_1 \otimes q_2 \otimes q_3 \otimes \cdots \otimes q_t$$

This equation also holds in the wavelet domain if we use the standard multidimensional wavelet transform, since the standard transformation is itself a tensor product of wavelet subspaces (see Section 2.4 for more details):

$$\hat Q = \hat q_1 \otimes \hat q_2 \otimes \hat q_3 \otimes \cdots \otimes \hat q_t$$

This leads us to the following multidimensional extension.

Lemma 5. A multidimensional aggregate query $Q$ is wavelet transformed by taking the tensor product of the transforms of its 1-dimensional aggregate queries $q_i$ along each dimension.

[Figure 3.5: Multidimensional Query using Tensor Product — (a) the original domain: $Q$ as the tensor product of $q_1$ and $q_2$; (b) the wavelet domain: $\hat Q = \hat q_1 \otimes \hat q_2$.]

Example Figure 3.5 illustrates the use of the tensor product for query transformation. The query vector $Q$ is the tensor product of $q_1$ and $q_2$. Thus, we transform $q_1$ and $q_2$ into $\hat q_1$ and $\hat q_2$ and produce $\hat Q$ as the tensor product of the two wavelet vectors, i.e., $\hat Q = \hat q_1 \otimes \hat q_2$.

We now sketch the algorithm for wavelet-based range aggregate query processing:

Algorithm 1 WOLAP Query Processing.
Input: wavelet datacube $\hat D$, range $R$, measure function $f$, and wavelet filters $L$
1: $result \leftarrow 0$
2: $\chi \leftarrow R$
3: for $i = 1$ to $d$
4:   $q_i \leftarrow f_i \chi_i$
5:   $\hat q_i \leftarrow \mathrm{LWT}(q_i, l_i)$
6: end for
7: foreach $k$ in $|\bigotimes_i \hat q_i|$
8:   $\hat Q_{coef} \leftarrow \prod_i \hat q_i(k)$
9:   if $\hat Q_{coef} \neq 0$
10:    $result \leftarrow result + \hat Q_{coef}\cdot\hat D(k)$
11:  end if
12: end for
13: return $result$

The iteration over the tensor product and the query evaluation are combined in lines 7–12, so no extra memory is needed: the full construction of $\hat Q$ is never performed. Instead, only $t$ 1-dimensional arrays $\hat q_i$ are transformed and stored in memory. Hence the memory complexity of query transformation is $O(t\,l\log N)$, a major improvement over the naive algorithm with memory complexity $O(N^t)$. Moreover, the tensor product of the non-zero values of each vector yields $O(l^t\log^t N)$ non-zero coefficients, so $O(l^t\log^t N)$ data retrievals are required. To the best of our knowledge, this is the best query complexity attainable without sacrificing update cost. The following theorem summarizes this section.

Theorem 6. Answering a polynomial range-sum query on a $t$-dimensional cube with domain size $N$ in every dimension has memory complexity $O(t\,l\log N)$ and time and I/O complexity $O(l^t\log^t N)$.
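The loop of lines 7–12 is easy to state concretely. Below is a minimal Python sketch of Algorithm 1 for count queries, reusing haar_count_query from the earlier sketch; here D_hat is assumed to be a sparse dictionary keyed by tuples of per-dimension coefficient indices, our illustrative stand-in for the stored wavelet datacube.

from itertools import product

def wolap_range_sum(D_hat, q_hats):
    # q_hats: one sparse 1-d query transform per dimension, e.g. from
    # haar_count_query; Q-hat itself is never materialized
    result = 0.0
    for combo in product(*(qh.items() for qh in q_hats)):
        keys = tuple(k for k, _ in combo)
        q_coef = 1.0
        for _, v in combo:                        # line 8: product of 1-d values
            q_coef *= v
        result += q_coef * D_hat.get(keys, 0.0)   # line 10; absent means zero
    return result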
3.3.3 Update Query using Wavelets

In this section, we briefly discuss update queries on wavelet-transformed data. We show that update queries can be recast as vector queries and performed efficiently in the wavelet domain.

Let us start the discussion with one-dimensional data $D$. Let $\lambda$ be the update to the $i$-th value of $D$, i.e., $D[i] \leftarrow D[i] + \lambda$, and let $Q$ be the point query on the $i$-th element of $D$. Then $D \leftarrow D + \lambda\cdot Q$. Now we wavelet transform both sides. The wavelet transform is a linear transformation, i.e., it preserves vector addition and scalar multiplication, so $\widehat{D + \lambda Q} = \hat D + \lambda\cdot\hat Q$. Therefore, the following assignment is performed when an update query is issued:

$$\hat D \leftarrow \hat D + \lambda\cdot\hat Q$$

Theorem 7. Let $\hat D$ be the wavelet transform of a vector $D$ of size $N$ using a filter of length $l$. Any update to a value of $D$ results in $O(l\log N)$ updates to the wavelet coefficients of $\hat D$.

Proof. The proof is straightforward: $\hat Q$ has only $O(l\log N)$ non-zero values, so the update needs only $O(l\log N)$ operations.

[Figure 3.6: Wavelet Update using Haar filter — a point update and the $O(\log N)$ wavelet coefficients it touches along the path to the root.]

For the multidimensional case, we perform the update similarly to the one-dimensional case: we transform a multidimensional point query, multiply it by the update value $\lambda$, and add it to the wavelet data $\hat D$:

$$\hat D \leftarrow \hat D + \lambda\cdot\hat Q$$

[Figure 3.7: Multidimensional View of Data — (a) the relational table DB with dimension attributes $a$, $b$ and measure attribute $c$; (b) $\Delta$ in DFD; (c) $\Theta_c$ in CFM; (d) $\Theta_{c^2}$ in CFM; (e) $\Theta_1$ in CFM.]

Theorem 8. Let $\hat D$ be the wavelet transform of a $t$-dimensional cube with domain size $N$ in every dimension, using a filter of length $l$. Any update to a value of $D$ results in $O(l^t\log^t N)$ updates to the wavelet coefficients of $\hat D$.

Proof. A $t$-dimensional point query is the tensor product of $t$ 1-dimensional point queries, $\hat Q = \bigotimes_i \hat q_i$. Therefore, we transform $Q$ with complexity $O(l^t\log^t N)$, which is also the update complexity.
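An update follows the same tensor-product pattern, now writing into the cube instead of reading from it. The sketch below (names ours) applies $\hat D \leftarrow \hat D + \lambda\cdot\hat Q$ for a point update, where each per-dimension transform is a point query such as haar_count_query(i, i, N).

from itertools import product

def wolap_point_update(D_hat, point_q_hats, lam):
    # touches the O(l^t log^t N) coefficients covering the updated point
    for combo in product(*(qh.items() for qh in point_q_hats)):
        keys = tuple(k for k, _ in combo)
        q_coef = 1.0
        for _, v in combo:
            q_coef *= v
        D_hat[keys] = D_hat.get(keys, 0.0) + lam * q_coef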
3.4 Collection of Fixed Measures

In this section, we address the inefficiency of the former cube model, DFD, for scientific applications and propose an alternative model for effective storage and query processing.

In DFD, dimension attributes and measure attributes are treated symmetrically. This model, first presented in [AGS97], supports a large class of queries by allowing polynomial queries on any attribute. However, DFD suffers from high sparsity for most real-world datasets, in which the dimension attributes uniquely identify each tuple. In scientific datasets in particular, every event is uniquely identified by the dimensions, and the corresponding facts are recorded in measure attributes. This results in a very sparse cube, as we formally describe below.

Let $d$ be the number of dimension attributes, $m = t - d$ the number of measure attributes, and $N$ the domain size of all $t$ attributes. If the $d$ dimensions uniquely identify each $t$-tuple, there exists only one $m$-tuple of measures per $d$-tuple of dimensions. In this case, $\Delta$ holds at most one non-zero value of 1 per $N^m$ cells; that is, the highest possible density ratio is $\frac{1}{N^m}$. Worse, when the datacube is wavelet transformed, its density grows significantly, since many non-zero coefficients are produced. The proof is straightforward: consider a cube of $N^m$ items, all zeros; when one of the original values becomes 1, we must update $l^m\log^m N$ values to non-zero values (see Theorem 8). This means $\hat\Delta$ has up to $l^m\log^m N$ times more non-zero values than $\Delta$ in the worst case.

To store this datacube, we either store only the non-zero values, $O(N^d l^m\log^m N)$ of them, with a hash index on top, or we use array-based storage for the entire cube, $O(N^t)$. Later, in Chapter 7, we show the inefficiency of DFD in practice.

To address this problem, we separate the attribute function $f(x)$ into a dimension function $f_d(x_d) = \prod_{i=0}^{d-1}\dot x_i^{\delta_i}$ and a measure function $f_m(x_m) = \prod_{i=d}^{t-1}\dot x_i^{\delta_i}$, so that $f(x) = f_d(x_d)\cdot f_m(x_m)$. We similarly separate the characteristic function, $\chi(x) = \chi_d(x_d)\chi_m(x_m)$. Thus we rewrite the general range-sum query $Q(\Delta, \chi, f)$ as follows:

$$Q = \sum_{x_d, x_m} f_d(x_d)\chi_d(x_d)\, f_m(x_m)\chi_m(x_m)\,\Delta(x_d, x_m) = \sum_{x_d} f_d(x_d)\chi_d(x_d)\sum_{x_m} f_m(x_m)\chi_m(x_m)\Delta(x_d, x_m)$$

Let us define a new function to simplify the equation above.

Definition 9. The fixed measure is the function $\Theta_{f_m\chi_m}$ that maps a point $x_d$ to a value calculated based on $f_m$.

Omitting the subscript $\chi_m$, since we assume $\chi_m = 1$ for most scientific queries, we rewrite the inner sum as:

$$\Theta_{f_m}(x_d) = \sum_{x_m} f_m(x_m)\Delta(x_d, x_m)$$

In database terms, $\Theta_{f_m}$ is a $d$-dimensional datacube that describes the dataset under $f_m$ in a multidimensional view. Every point $x_d$ of this cube stores the value of $f_m$ at that point, representing a fixed measure of the relational data. In this model, data items are characterized by the dimension attributes and carry fixed fact measure values; hence we name this cube a Fixed Measure cube, or FM for short. Now we have:

$$Q(\Delta, \chi, f) = \sum_{x_d} f_d(x_d)\chi_d(x_d)\,\Theta_{f_m}(x_d) = \langle f_d\chi_d,\, \Theta_{f_m}\rangle = Q(\Theta_{f_m}, \chi_d, f_d)$$

This implies that we can answer the same query using the $\Theta_{f_m}$ datacube, provided $\Theta_{f_m}$ is known in advance. To answer arbitrary polynomial queries, all required polynomial functions must be pre-computed in advance. In practice, pre-computing a realistic subset of all possible combinations of the measures, powered up to degree $\delta$, suffices. We formally define this set as follows:

Definition 10. A function set $F_\delta$ is the set of fixed measure functions required to answer polynomial queries up to degree $\delta$. For each polynomial degree $\delta' \le \delta$, this set includes all pairs of measure attributes powered to $\delta'$ and all individual measure attributes powered to $\delta'$.

Example Consider a dataset with one measure attribute $c$, on which we wish to answer polynomial queries up to degree 2. The measure functions $f(c) = 1$, $f(c) = c$, and $f(c) = c^2$ must be pre-computed in advance; that is, the function set for this example is $F_2 = \{1, c, c^2\}$.

Example Consider a dataset with three measure attributes $u$, $v$, and $w$. The function set for $\delta = 1$ is $F_1 = \{1, u, v, w, uv, uw, vw\}$, containing 7 measure functions. The function set for $\delta = 2$ is $F_2 = \{1, u, v, w, uv, uw, vw, u^2, v^2, w^2, u^2v^2, u^2w^2, v^2w^2\}$, containing 13 measure functions.

Lemma 9. The function set of $m$ measure attributes supporting polynomial queries up to degree $\delta$ has cardinality bounded by $|F| = 1 + \delta m(m+1)/2$.

Proof. The function set contains at least the constant function 1 for the count query. For each degree $\delta' \le \delta$, the set includes $\binom{m}{2}$ pairs of measures raised to the power $\delta'$, as well as all $m$ individual measures raised to $\delta'$. Therefore:

$$|F| = 1 + \delta m + \delta\,\frac{m(m-1)}{2} = 1 + \delta m(m+1)/2$$
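The function set of Definition 10 is mechanical to enumerate. The sketch below (names ours) builds $F_\delta$ as a list of exponent maps and reproduces the counts of Lemma 9.

from itertools import combinations

def function_set(measures, delta):
    # Definition 10: the constant 1, every single measure, and every pair
    # of measures, each raised to every degree 1..delta
    F = [{}]                                   # the constant function 1
    for d in range(1, delta + 1):
        for m1 in measures:
            F.append({m1: d})                  # m1^d
        for m1, m2 in combinations(measures, 2):
            F.append({m1: d, m2: d})           # (m1 * m2)^d
    return F

# |F| = 1 + delta * m * (m + 1) / 2; e.g. m = 3, delta = 2 gives 13 cubes
print(len(function_set(["u", "v", "w"], 2)))   # 13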
We store each function of the function set in a separate FM cube and call the entire set the Collection of Fixed Measures, or CFM for short. As the lemma above shows, the cardinality of CFM is directly proportional to $m$ and $\delta$: the number of required datacubes grows with the number of measures and with the polynomial degree. However, the number of FM cubes grows far more gently than the cube size of DFD, because $F \ll l^m\log^m N$. We also study this fact empirically in the experimental section.

All these cubes can be hidden from high-level users by defining a general cube framework $\Theta$ for CFM. As shown in the WOLAP framework (see Figure 3.1), the relevant datacube is selected upon query submission:

$$Q(\Theta, \chi, f) \rightarrow Q(\Theta_{f_m}, \chi, f_d)$$

Example Consider a dataset with two dimension attributes $a$ and $b$ and one measure attribute $c$, as shown in Figure 3.7. While all three attributes become cube dimensions in the DFD model (Figure 3.7b), only the two dimension attributes become cube dimensions in the CFM model (Figures 3.7c, 3.7d, and 3.7e). Figure 3.7c shows an FM cube with the measure function $f(c) = c$; for example, the tuple $(4, 2, 3)$ is represented by the data point $(4, 2)$ in this cube with the value 3. Similarly, the FM datacubes with measure functions $f(c) = c^2$ and $f(c) = 1$ are depicted in Figures 3.7d and 3.7e, respectively. Using these three datacubes, we can answer polynomial queries up to degree 2, such as count, sum, average, and variance.

We analytically compare the two models, DFD and CFM, here; later, in Chapter 7, we compare them empirically in more detail. Table 3.1 shows the complexities of the two models in terms of storage, update, and query costs: CFM outperforms DFD in all three. Note that the DFD model, although less efficient, remains the method of choice when ad-hoc ranges must be formulated on the measure attributes in addition to the dimension attributes. Recall the relevant equations: $l = 2(\delta + 1)$, $F = 1 + \delta m(m+1)/2$, and $t = d + m$.

Table 3.1: Complexity Table for WOLAP range aggregate processing

          | DFD ($\hat\Delta$)                                      | CFM ($\hat\Theta$)
Storage   | $N^t$ (array-based) or $N^d l^m\log^m N$ (hash-based)   | $F N^d$
Update    | $l^t\log^t N$                                           | $F l^d\log^d N$
Query     | $l^t\log^t N$                                           | $l^d\log^d N$

3.5 Approximation and Progressiveness

In this section, we show how to incorporate approximate and progressive query processing into our framework. WOLAP leverages the multi-resolution property of wavelets to achieve this at no extra cost. Approximation is needed when either storage space is limited or there is a time constraint on query execution: when storage is limited, the data must be approximated; when time is limited, the query must be approximated. We study data approximation first and then query approximation, which we further extend to progressive query processing. As the reader may notice, query approximation and progressiveness are essentially query compression, whereas data approximation is data compression.
3.5.1 Data Approximation

In many real-world situations, either the data is extremely large or the storage capacity is limited; in both cases, compression of datacubes is essential in practice. When storage is limited to $B$ datapoints, we store only the $B$ most significant wavelet coefficients and use these $B$-term coefficients to provide approximate query answers. Two widely used methods for selecting the $B$-term wavelet coefficients are:

FIRST-B: The most significant coefficients are those with the lowest frequencies. The intuition behind this approach is that high-frequency coefficients can be regarded as noise and thus dropped.

HIGHEST-B: The most significant coefficients are those with the most energy, where the energy of a coefficient is the square of its value. The intuition comes from signal processing: the highest-energy coefficients matter most for reconstructing the transformed signal.

First-B has the advantages of fast preprocessing and a smaller storage requirement in practice: the coefficients below a certain frequency are selected in $O(B)$ without any sorting, and if array-based storage is employed, the indices of the kept coefficients need not be stored. On the other hand, Highest-B introduces the smallest error for total reconstruction of the data, as is widely accepted in the signal processing literature.

We now show how compressed data is incorporated into our framework. With WOLAP, we iterate through the non-zero coefficients of the transformed query cube and multiply each by the corresponding wavelet data coefficient. If a data coefficient is not stored, i.e., it was previously dropped during compression, we take it to be zero and continue. This assumption is precisely hard thresholding; in other words, $\tilde{\hat D}(k)$ returns zero if no coefficient is stored for index $k$:

$$\tilde{\hat D}(k) = \begin{cases}\hat D(k) & \text{if } k\in B,\\ 0 & \text{if } k\notin B.\end{cases}$$

3.5.2 Query Approximation and Progressiveness

In many applications, the accuracy of the query result can be traded for better response time; a fast, less accurate result may be preferred to an exact but late one. Since the dominant cost of query processing is database retrieval, we limit the retrievals to a number $B$; that is, we retrieve only the $B$ most significant wavelet coefficients contributing to the query.

Definition 11. The approximate answer to a range-sum query is computed by performing the dot product over only a subset of the query vector, indexed by $\sigma$:

$$\tilde Q_\sigma = \sum_{i=0}^{B-1}\hat Q(\sigma_i)\,\hat D(\sigma_i)$$

The most contributing coefficients are those with the highest values of the product of query and data items, $\hat Q(\sigma_i)\hat D(\sigma_i)$. However, at database population time one cannot optimally order the coefficients for all possible queries. We therefore advocate selecting the query coefficients at query time to achieve good approximate results; toward this end, the $B$ most significant query coefficients are selected using the Highest-B or First-B method. The major observation from our experiments is that this selection performs very well for most queries, even though we make no assumption on the distribution of the data coefficients. The selection can be further enhanced with a hybrid approach in which a synopsis of the data is stored to guide query coefficient selection. We refer the reader to [SJS05] for more information on the various techniques used in wavelet query approximation.

By progressively increasing the term $B$, we order the coefficients by their significance. We exploit this ordering to answer the query progressively, so that each step produces a more precise evaluation of the actual answer, eventually converging to the exact answer. We use the same significance functions as in query approximation to make the approximate answer converge quickly to the exact one (see [SJS05] for more details).
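Both the selection rule and the truncated dot product of Definition 11 fit in a few lines. The sketch below (names ours) implements Highest-B selection and the approximate evaluation with hard thresholding on missing data coefficients.

import heapq

def highest_b(coeffs, B):
    # keep the B coefficients with the largest energy (value squared)
    return dict(heapq.nlargest(B, coeffs.items(), key=lambda kv: kv[1] ** 2))

def approximate_range_sum(D_hat, Q_hat, B):
    # Definition 11: dot product over only the B most significant query
    # coefficients; unstored data coefficients read as zero
    return sum(v * D_hat.get(k, 0.0) for k, v in highest_b(Q_hat, B).items())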
3.6 Disk Block Allocation

WOLAP targets applications that deal with amounts of data too large to fit in main memory. It is therefore crucial to store the wavelet coefficients on secondary storage (i.e., disk drives) in such a way that the number of disk blocks required to answer queries is minimized. In effect, we must lay the data out on disk so that the principle of locality of reference holds: when a datum on a disk block is accessed, other data on the same block are likely to be accessed as well.

It turns out that for WOLAP range queries, whenever a wavelet coefficient is retrieved, all of its dependent coefficients are guaranteed to be retrieved as well. We demonstrated this dependency among wavelet coefficients earlier, when we studied the wavelet tree. We now exploit this unique access pattern to store the coefficients on disk optimally. Intuitively, a disk block should contain dependent coefficients, so that the utilization of the in-block coefficients is high. However, the allocation strategy must not introduce redundancy: a wavelet coefficient should belong to one and only one block. Under this restriction, and in order to be fair across all coefficients, we partition the wavelet tree into binary subtree tiles and store each tile on a disk block (see Figure 3.8b). Assuming the disk block size $C$ is a power of 2, $C = 2^c$, we achieve logarithmic utilization of the blocks: at least $c$ coefficients inside the block, lying on a path, are utilized any time the block is needed. We proved in [SS04] that this allocation strategy is optimal in terms of query response time. Figure 3.8 contrasts the naive disk block allocation with our optimal allocation.

[Figure 3.8: Disk Block Allocation Strategies — (a) naive allocation; (b) optimal allocation into binary subtree tiles.]

For the multidimensional case, each dimension is treated independently: for each dimension we construct tiles of size $C$ containing the $C$ coefficients of a subtree, as in the single-dimensional case. The cross product of these $d$ sets of single-dimensional bases constructs $C^d$ multidimensional bases; the coefficients corresponding to these $C^d$ bases are stored in the same block and form a multidimensional tile.

It is straightforward to see that the logarithmic utilization of disk blocks reduces the complexity of WOLAP queries to $l^d\log_C^d N$. In addition, we modify WOLAP's progressive evaluation to order the query coefficients at block-level granularity rather than per coefficient; that is, we compute the importance of each block of query coefficients and fetch the most important blocks first (see [SS04] for more details). Later, in Chapter 7, we quantitatively demonstrate the effect of optimal disk allocation on both query time and query progressiveness.
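The tiling itself reduces to integer arithmetic on coefficient indices. The following sketch is ours, assuming the (level, pos) layout of the earlier sketches (levels run from 1 at the leaves to $n$ at the root) and $n$ divisible by $c$ so that every tile is a full subtree.

def tile_id(level, pos, n, c):
    # tier 0 is the tile containing the root; each tile is a binary subtree
    # of height c, holding C = 2^c coefficients (details plus one scaling)
    tier = (n - level) // c
    top_level = n - tier * c               # highest level inside this tile
    return (tier, pos >> (top_level - level))

# all coefficients on one root-to-leaf path fall into ceil(n / c) tiles,
# which is exactly the logarithmic utilization argued above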
3.7 Toward Realization of WOLAP

In this section, we address two practical problems to further prepare WOLAP for use in real-world systems.

3.7.1 Filter Selection

Data attributes do not participate symmetrically in polynomial queries; some attributes are subject to higher polynomial degrees than others. In particular, dimension attributes are queried with low-degree functions, typically 0 or 1, while measure attributes are queried with higher-degree functions, typically 2 or 3. For example, a dimension attribute in nominal format, such as Gender or Country, is only ever queried with 0-degree functions.

Unequal polynomial degrees across cube dimensions motivate using different filters along different dimensions. Recall that the standard multidimensional wavelet transform is a series of 1-dimensional transforms along each dimension. Examining this transformation closely, we observe that each 1-dimensional transform is associated with an aggregation along its dimension. Therefore, we separate the multidimensional transform into multiple 1-dimensional transforms with a possibly different filter per dimension.

Let us describe how our framework accommodates this. For a $d$-dimensional datacube, we define the set of required polynomial degrees $\{\delta_1, \delta_2, \ldots, \delta_d\}$, stating the maximum polynomial degree along each of the $d$ cube dimensions. We obtain the corresponding set of filters $\{l_1, l_2, \ldots, l_d\}$, in which $l_i$ satisfies the moment condition of $\delta_i$ for attribute $i$ (see Definition 7), and transform both the data and the query vectors with filter $l_i$ along dimension $i$. Note that the same filter chosen for data transformation must be used for query transformation in order for the Parseval theorem to hold.

Example Consider the two-dimensional datacube $\{Country, Population\}$, which we wish to transform so that polynomial queries up to degree 2 can be performed on the $Population$ attribute. We define the set of polynomial degrees as $\{0, 2\}$ and consequently transform this data with $\{Haar, db3\}$: first with the Haar filter along the $Country$ dimension, and then with db3 along the $Population$ dimension.

In practice, we use the Haar (a.k.a. db1) wavelet for dimension attributes and longer filters for measure attributes. This lowers the query complexity to $O(2^d l^m\log^t N)$ for DFD cubes. We denote this customized-transformed DFD cube by DFD-DM, to emphasize that its dimension attributes and measure attributes are transformed differently. For CFM cubes, we use the Haar filter for all dimensions, giving a query complexity of $O(2^d\log^d N)$. Later, in Chapter 7, we show how this optimization dramatically decreases the number of I/Os required for range-sum queries.
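Filter selection then reduces to a table lookup per dimension. A minimal sketch (ours), assuming Daubechies filters throughout and the standard naming convention in which dbk denotes the filter of length $2k$:

def pick_filters(max_degrees):
    # Theorem 3: degree delta needs length l = 2 * delta + 2, i.e. db(delta+1);
    # Haar (db1) covers the 0-degree dimension attributes
    return ["db%d" % (d + 1) for d in max_degrees]

print(pick_filters([0, 2]))   # the {Country, Population} example: ['db1', 'db3']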
3.7.2 Auxiliary Coefficients

It is widely assumed that the data size must be a power of two to allow a wavelet transform. Unfortunately, real-world data sizes are usually not powers of two. Here we show how to wavelet transform data of irregular sizes using auxiliary coefficients.

[Figure 3.9: Auxiliary Coefficients — the four padding methods applied to the same vector: (a) zero padding; (b) symmetric padding; (c) periodic padding; (d) average padding.]

Definition 12. To produce the summaries and details of the $k$-th level, we add an arbitrary coefficient wherever a summary of the lower level $k-1$ does not exist. We call these padded values auxiliary coefficients.

The auxiliary coefficients can be estimated by various methods, of which we suggest four; the names are borrowed from signal extension on the boundaries in wavelets [SN96].

Zero Padding: The auxiliary values are zeros. Padding with zeros effectively rounds the data size up to the next power of two without the need to store the extra zeros. This simple method has the disadvantage of introducing large detail coefficients, as it creates large discontinuities (see Figure 3.9a).

Symmetric Padding: This method (a.k.a. mirroring) replicates the last existing coefficient at the same resolution. Mirroring has the advantage of introducing zero detail coefficients (see Figure 3.9b).

Periodic Padding: This method (a.k.a. repeating) extends the existing summaries in the same order. Repeating is an intuitive extension method but may introduce significant noise in some cases (see Figure 3.9c).

Average Padding: The auxiliary coefficients are calculated by averaging the existing summaries of the same level. This method smooths the signal and introduces the least noise into the transformed data (see Figure 3.9d).

All but the first method assume that there are some non-zero values in the original array; therefore, at least one of the query and data vectors must be transformed using zero padding alone. As a rule of thumb, we use average padding or symmetric padding whenever the least amount of noise should be added to a vector. Thus, if progressive querying or query approximation is the main purpose of using WOLAP, we suggest zero padding for data transformation and either average padding or symmetric padding for query transformation. If data approximation is being used, we recommend zero padding for query transformation and either average padding or symmetric padding for data transformation.
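Three of the four rules can be stated in a few lines. The sketch below (names ours) extends an odd-length list of level-$k$ summaries by one auxiliary coefficient so that the next level can be computed; we interpret periodic padding as wrapping around to the first summary, which is an assumption on our part.

def pad_summaries(s, mode="average"):
    # Definition 12: add one auxiliary coefficient when a level has an odd
    # number of summaries (a sketch of the padding rules described above)
    if len(s) % 2 == 0:
        return s
    aux = {"zero": 0.0,
           "symmetric": s[-1],               # mirror the last summary
           "periodic": s[0],                 # wrap around to the first (assumed)
           "average": sum(s) / len(s)}[mode]
    return s + [aux]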
3.8 Summary

We have introduced a new framework, named WOLAP, that supports exact, approximate, and progressive polynomial aggregate query processing. We introduced two cube models, DFD and CFM, and deployed both in scientific applications; we extensively studied the realization of both models and showed that CFM is the more practical method for real-world datasets. We also introduced the filter-collection concept to reduce query and storage costs. Last but not least, we relaxed the common power-of-two assumption on dimension domains to arbitrary sizes. Together, these contributions make WOLAP deployable for real-world scientific data analysis applications.

Chapter 4

Maintenance of Wavelet Transformed Data

4.1 Introduction

The Discrete Wavelet Transform is a well-established tool that has been used extensively in signal processing applications since its introduction. Despite its broad acceptance, however, the wavelet transformation has not been explored to its full potential for data-intensive applications. In particular, the compact support and multi-scale properties of wavelets, as illustrated by the wavelet tree of the decomposition, lead to some overlooked but interesting properties. With the exception of [CGRS00], where traditional relational algebra operations are redefined to work directly in the wavelet domain, most applications resort to reconstructing many data values to support even the simplest operations in the original domain.

We introduce two novel operations for wavelet-decomposed data, named SHIFT and SPLIT, that stem from the multiresolution properties of wavelets and provide general-purpose functionality. They are designed to work directly in the wavelet domain and can be utilized in a wide range of data-intensive applications, yielding significant improvements in every case.

Furthermore, we proved in [SS04] that there is a strong dependency among wavelet coefficients, enforced by the multi-scale nesting property, so we always know which coefficients must be retrieved alongside any given coefficient to reconstruct a data point or a range. This observation leads to constructing multidimensional tiles containing wavelet coefficients that are related to one another under a particular access pattern. These tiles are stored directly on secondary storage, their size adjusted to fit a disk block. This tiling approach minimizes the number of disk I/Os needed to perform any operation in the wavelet domain, including the important reconstruction operation, which results in significant query cost reductions. We have designed the SHIFT and SPLIT operations to work with multidimensional tiles, as these operations benefit significantly from them.

4.1.1 Data Maintenance Scenarios

To demonstrate the usefulness of the SHIFT and SPLIT operations, we examine them in the context of four common data maintenance scenarios. The scenarios share the trait that the problem each addresses has a straightforward solution on untransformed data; one is therefore tempted to first reconstruct the original data from the transformed data. We are interested, however, in working entirely in the wavelet domain, and as we will see, this becomes a complicating but fruitful factor. Besides the SHIFT and SPLIT operations themselves, a major contribution of this work is six analytically proven improvements across these four applications:

• Transformation of Massive Multidimensional Datasets: In the simplest scenario, we face transforming a multidimensional dataset into the wavelet domain in an I/O-efficient manner, where the available memory is limited. Our approach is transformation by chunks small enough to fit in memory. Each transformed chunk is then shifted, to relocate its coefficients, and split, to update some of the already-calculated coefficients.

• Appending to Wavelet Decomposed Transforms: As an example, suppose we have accumulated and transformed 10 years of measurements to expedite query processing. What should we do when data for one more year arrives? Should we throw away the old transformed data and transform from scratch? We certainly cannot perform mere updates; there is nothing to update, since the new data occupies a region we have not transformed yet. Seen in the untransformed domain, this scenario is a series of inserts; in the transformed domain, however, these inserts require some dimensions to grow, so not only must coefficients be updated, but new coefficients must be created as well. In general, expanding the transformed data to larger dimension sizes has a high asymptotic cost even with our SHIFT and SPLIT operations; however, since these operations are much faster than computing coefficients, our approach yields faster execution times.

• Data Stream Approximation: When storage space is limited or the data is unbounded and arrives at a high rate, data approximation is usually employed.
This means we keep a synopsis of the data on the server side and update it as new data arrives. Here, we provide an algorithm using the SHIFT-SPLIT operations that outperforms the best known wavelet-based data stream algorithm: we reduce the per-item cost at the expense of a small amount of additional storage to buffer incoming coefficients. Furthermore, we investigate the case of multidimensional data streams, decomposed under the two forms of wavelet transformation, and conclude that we can maintain a $K$-term approximation under certain restrictions. To the best of our knowledge, this is the first work dealing with wavelet approximation of multidimensional data streams, as previous works focused on the single-dimensional case.

• Partial Reconstruction from Wavelet Transforms: Consider the scenario in which we wish to extract a region of the original data from its wavelet transform. We face the following dilemma: either decompose the entire data and then extract the desired region, which is reasonable if the region extends over a large part of the data, or reconstruct the desired region point by point, which is preferable for small regions.

4.1.2 Outline

We begin our discussion in Section 4.2 by introducing the disk block allocation strategy, which leads to an efficient tiling of wavelet coefficients, and extend this strategy to the multidimensional case. The SHIFT and SPLIT operations are presented in full detail in Section 4.3, with their most important applications appearing in Section 4.4. We conclude our discussion in Section 4.5; our experimental studies appear later, in Chapter 7.

4.2 Disk Block Allocation of Wavelet Coefficients

The purpose of this section is to assign wavelet coefficients to disk blocks in such a way that the number of blocks required for answering queries is minimized. We have already seen that the wavelet tree captures the dependency among coefficients: if a coefficient must be retrieved, then all coefficients on its path to the root must be retrieved as well. This property creates an access pattern that the disk block allocation strategy should exploit.

Intuitively, a disk block should contain coefficients with overlapping support intervals, so that the utilization of the in-block coefficients is high. However, the allocation strategy must not introduce redundancy: a wavelet coefficient should belong to one block only. Under this restriction, and in order to be fair across all coefficients, we partition the wavelet tree into binary subtree tiles and store each tile on a disk block. Assuming the disk block size $B$ is a power of 2, $B = 2^b$, we achieve logarithmic utilization of the blocks: at least $b$ coefficients inside the block, lying on a path, are utilized any time the block is needed. Logarithmic utilization may seem low at first, but it is the best one can hope for under our restrictions, as proven in [SS04].

One final issue is that a binary subtree tile contains $2^b - 1$ coefficients, whereas the block size is $2^b$, wasting the space of one coefficient per block. We therefore also store the scaling coefficient corresponding to the root of the subtree alongside the wavelet coefficients of the tile. These extra scaling coefficients are useful for query answering, as they can dramatically reduce query costs.
An example of the disk block allocation strategy for a wavelet tree of 32 coefficients is shown in Figure 4.1.

[Figure 4.1: Disk Block Allocation Strategy — a wavelet tree of 32 coefficients partitioned into binary subtree tiles, one tile per disk block.]

We continue by generalizing the disk block allocation scheme for both the standard and non-standard multidimensional wavelet transformations in Section 4.2.2; but first, we extend the wavelet tree notion to the multidimensional case.

4.2.1 Multidimensional Wavelet Trees

There are two forms of multidimensional wavelet decomposition, the standard and the non-standard. The non-standard form involves fewer operations and is thus faster to compute, but it does not compress as efficiently as the standard form; in particular, range aggregate queries can be highly compressed using the standard form, as shown in [SS02b]. Both forms have been used in the database literature: the standard form by [VWI98, VW99, SS02b, Lem02] and the non-standard form by [WAA00, CGRS00]. However, we are not aware of any study extending the wavelet tree concept to either form of multidimensional transformation. More details on the two forms appear in Chapter 2.

In the standard multidimensional transformation, each dimension is decomposed independently; therefore, no single tree can capture the levels of decomposition. In the 2-d case, considering a 1-d wavelet tree for each decomposed dimension, two 1-d wavelet trees are required. Every coefficient in a transformed 2-d array has two indices, one per dimension, and each index identifies a position in the corresponding 1-d tree, which in turn corresponds to a decomposition level and a translation within that level. Figure 4.2 shows a coefficient in an $8\times 8$ 2-d array and the corresponding indices on the two wavelet trees.

[Figure 4.2: Standard Form Wavelet Trees — a coefficient of the transformed 2-d array and its two indices, one on the wavelet tree for X and one on the wavelet tree for Y.]

The two 1-d trees can be used to determine which coefficients must be retrieved to reconstruct data values of the 2-d array; in other words, they describe the access pattern of 2-d wavelets. A single data value of the untransformed (original) 2-d array corresponds to a path in each of the 1-d wavelet trees, or rather to the set of 1-d indices along each path. The cross product of the indices across these sets constructs the 2-d indices of the coefficients that must be retrieved. For an $N\times N$ array, where $N = 2^n$, each path contains $n + 1$ 1-d indices, so there are $(n+1)^2$ 2-d indices. Figure 4.3 shows the two paths on the 1-d wavelet trees, as well as the required coefficients resulting from the cross product of the 1-d indices.

[Figure 4.3: Standard Form Data Point Reconstruction — the two root-to-leaf paths and the cross product of their indices.]
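In code, the access pattern of the standard form is just the cross product of two 1-d root-to-leaf paths. A small sketch (ours, in the (level, pos) layout used in the earlier sketches):

def path_to_root(i, n):
    # the n + 1 1-d indices covering leaf i: one detail per level plus the root summary
    return [("s", n, 0)] + [("d", j, i >> j) for j in range(n, 0, -1)]

def point_keys_2d(x, y, n):
    # standard form: the (n + 1)^2 coefficients needed to reconstruct cell (x, y)
    return [(kx, ky) for kx in path_to_root(x, n) for ky in path_to_root(y, n)]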
In contrast to the standard multidimensional transformation, a single wavelet tree can capture the levels of decomposition and the dependencies among coefficients of the non-standard transformation. The support intervals of the wavelet coefficients form a quad tree, as each support interval is further decomposed into quadrants at the next level of decomposition. At the $j$-th level of a $d$-dimensional decomposition there are $(2^d)^j$ nodes, each containing $2^d - 1$ coefficients whose support intervals are hypercubes with edge length $2^j$.

In the 2-d case, the support intervals of the coefficients are squares with side lengths that are powers of 2. There are 3 coefficients for each support interval, one for each of the wavelet subspaces $W^d$, $W^v$, and $W^h$; thus, each quad tree node contains its 3 corresponding coefficients. Figure 4.4 shows the wavelet tree for an $8\times 8$ array and zooms in on a multidimensional tile, described in Section 4.2.2. The support intervals of the children nodes, which are the four quadrants of the support interval of the parent node, are shown in dark grey. To reconstruct a point of the original 2-d array, one traverses the quad tree bottom-up and uses all 3 coefficients in each node.

[Figure 4.4: Non-Standard Form Wavelet Tree — the quad tree of an $8\times 8$ array with subspaces $W^d$, $W^v$, $W^h$, zooming in on one tile.]

4.2.2 Disk Block Allocation of Multidimensional Wavelets

As in the single-dimensional case, our main concern is to pack coefficients into disk blocks so as to achieve the highest possible block utilization at query time, thus decreasing retrieval cost. The solution is to assign as many coefficients with common support to the same disk block as possible. This results in different allocation strategies for the two multidimensional forms of decomposition. We assume a $d$-dimensional dataset where each dimension has size $N = 2^n$, and a disk block size of $B^d$, where $B = 2^b$.

In the standard multidimensional decomposition, each dimension can be treated independently. For each dimension we construct tiles of size $B$ containing the $B$ coefficients of a subtree, as in the single-dimensional case. The cross product of these $d$ sets of single-dimensional bases constructs $B^d$ multidimensional bases; the coefficients corresponding to these $B^d$ bases are stored in the same block and form a multidimensional tile.

In the non-standard multidimensional decomposition, tiles are subtrees of the quad tree. The branching factor of a $d$-dimensional quad tree is $D = 2^d$, and each node contains $D - 1$ coefficients. Therefore, a tile of height $b$ contains $\frac{D^b - 1}{D - 1}$ nodes, or equivalently $D^b - 1$ coefficients. By also storing the scaling coefficient corresponding to the root node, we create tiles of $D^b = (2^d)^b = (2^b)^d = B^d$ coefficients, which fit exactly in a disk block of size $B^d$. Figure 4.4 shows the tiling of an $8\times 8$ array for disk blocks of size 16.

4.3 Shift-Split

In this section we describe our general-purpose operations, SHIFT and SPLIT, on wavelet-transformed vectors; in Section 4.4 we discuss the applications that benefit from them. There is a relationship between the coefficients in the transform of a vector $a$ and those in the transform of a dyadic region $b$ of that vector. This relationship is captured by shifting (re-indexing) the wavelet coefficients (details) of $b$ and by splitting (calculating contributions from) the scaling coefficient (average) of $b$. The SHIFT-SPLIT operations are best understood in the context of wavelet trees.

Let $a$ be a vector of size $N = 2^n$ and let $b$ be the $(k+1)$-th dyadic range of $a$, of size $M = 2^m$. The wavelet coefficients of $\hat a$ are denoted $w^a_{j,l}$ and those of $\hat b$ are denoted $w^b_{j,l}$; similarly, the scaling coefficients are $u^a_{j,l}$ and $u^b_{j,l}$. Let $T_a$ and $T_b$ denote the wavelet trees of $\hat a$ and $\hat b$, respectively. Figure 4.5 illustrates these notions.

[Figure 4.5: Shift-Split Operations — the trees $T_a$ and $T_b$, the shifted subtree, and the path $T_c$ receiving the split contributions.]

The support of the wavelet coefficient $w^a_{m,k}$ is exactly the dyadic range that $b$ represents. Therefore $w^a_{m,k}$ covers $w^b_{m,0}$ and vice versa, since their support is the same range of $a$; see $T_b$ in Figure 4.5. Furthermore, all children of $w^a_{m,k}$ in $T_a$ have common support with the corresponding children of $w^b_{m,0}$ in $T_b$.
Specifically, at the $j$-th level of decomposition, the $i$-th coefficient $w^b_{j,i}$ of $T_b$ has the same support as the $(k2^{m-j}+i)$-th coefficient $w^a_{j,k2^{m-j}+i}$ of $T_a$.

Definition of SHIFT. Let $a$ be a vector of size $N = 2^n$ and let $b$ be the $(k+1)$-th dyadic range of $a$, of size $M = 2^m$. Let $f:\mathbb{Z}\to\mathbb{Z}$, $f(i) = k2^{m-j} + i$, be the function that translates the indices $i$ of $\hat b$ at level $j$ to indices $f(i)$ of $\hat a$. The SHIFT operation on the transformed vector $\hat b$ is defined as the re-indexing of its wavelet coefficients by $f$.

The wavelet coefficients of $\hat a$ that cover the interval represented by $b$ contain a portion of the energy of the average of $b$. To be exact, the values of the wavelet coefficients $w^a_{j,\lfloor k/2^{j-m}\rfloor}$ for $j\in[m+1,n]$, as well as the average $u^a_{n,0}$, depend on the value of the average $u^b_{m,0}$; these coefficients lie on the path from $w^a_{m,k}$ to the root and are contained in $T_c$ of Figure 4.5. Essentially, the value of the average $u^b_{m,0}$ is split across these $n - m + 1$ coefficients, contributing either positively or negatively.

Definition of SPLIT. Let $a$ be a vector of size $N = 2^n$ and let $b$ be the $(k+1)$-th dyadic range of $a$, of size $M = 2^m$. Let $g:[m+1,n]\to\mathbb{R}$,

$$g(j) = \begin{cases} +\dfrac{1}{\sqrt{2^{j-m}}}\,u^b_{m,k} & \text{if } k \bmod 2^{j-m} = 0,\\[6pt] -\dfrac{1}{\sqrt{2^{j-m}}}\,u^b_{m,k} & \text{if } k \bmod 2^{j-m} = 1,\end{cases}$$

be the function that calculates the contribution of $u^b_{m,k}$ at level $j$. The SPLIT operation on the transformed vector $\hat b$ calculates the contribution of $u^b_{m,k}$ to the $n - m$ wavelet coefficients, $\delta w^a_{j,\lfloor k/2^{j-m}\rfloor} = g(j)$ for $j\in[m+1,n]$, and to the average, $\delta u^a_{n,0} = \frac{1}{\sqrt{2^{n-m}}}\,u^b_{m,k}$.

To demonstrate the use of the SHIFT-SPLIT operations, let us look at two examples.

Example Assume we are to transform a very large vector $a$ of size $N = 2^n$ into the wavelet domain, where only the subregion $[k2^m, (k+1)2^m - 1]$ contains non-zero values. Let $b$ be that non-zero subregion, of size $M = 2^m$. Because $b$ forms a dyadic interval, we can apply the SHIFT-SPLIT operations to construct $\hat a$ as follows. First, we obtain the wavelet transform $\hat b$ in time $O(M)$. Next, we apply the SHIFT operation to place the wavelet coefficients of $\hat b$ in their corresponding positions in $\hat a$. Finally, we apply the SPLIT operation on the average of $b$ to obtain $n - m + 1$ contributions and construct the remaining $n - m + 1$ coefficients. We have completed the wavelet transformation of $a$ in time $O(M + n - m) = O(M + \log\frac{N}{M})$, instead of $O(N)$.

Example Assume we have already transformed a vector $a$ of size $N = 2^n$ into the wavelet domain, and updates arrive, stored in a vector $b$, for the subregion $[k2^m, (k+1)2^m - 1]$ of $a$. The goal is to update the wavelet transform of $a$ as efficiently as possible. Each of the $|b| = M = 2^m$ updates requires $n + 1$ values to be updated, leading to a total cost of $O(M\log N)$. However, we can use the SHIFT-SPLIT operations to batch the updates and reduce this cost as follows. First, we obtain the wavelet transform $\hat b$ in time $O(M)$. Next, we apply the SHIFT operation to calculate the indices of the wavelet coefficients of $\hat a$ that must be updated by the wavelet coefficients of $\hat b$. Finally, we apply the SPLIT operation on the average of $b$ to obtain $n - m + 1$ contributions and update the corresponding coefficients in $\hat a$. The total update cost using SHIFT-SPLIT is thus reduced to $O(M + \log\frac{N}{M})$.
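To make the two definitions concrete, here is a minimal Haar instantiation in Python (names ours, keys in the (level, pos) layout of the Chapter 3 sketches). The sign of each contribution is derived from the parity of the climbing node index, which realizes the case distinction of $g(j)$; the sign convention matches the earlier haar_count_query sketch.

import math

def shift_haar(b_details, k, m):
    # SHIFT: re-index the details of the (k+1)-th dyadic range,
    # w_{j,i} -> w_{j, k*2^(m-j) + i}
    return {("d", j, (k << (m - j)) + i): v
            for (_, j, i), v in b_details.items()}

def split_haar(b_avg, k, m, n):
    # SPLIT: contributions of the range average u_{m,k} = b_avg to the n - m
    # details on the path to the root and to the overall average
    contribs, pos, val = {}, k, b_avg
    for j in range(m + 1, n + 1):
        val /= math.sqrt(2)                     # val = b_avg / sqrt(2^(j-m))
        sign = 1.0 if pos % 2 == 0 else -1.0    # left or right child of parent
        contribs[("d", j, pos // 2)] = sign * val
        pos //= 2
    contribs[("s", n, 0)] = val                 # delta u_{n,0}
    return contribs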
Table 4.1: Complexity of Multidimensional Shift-Split

               | SHIFT       | SPLIT
Standard       | $(M-1)^d$   | $(M+n-m)^d - (M-1)^d$
Non-Standard   | $M^d - 1$   | $(2^d - 1)(n - m) + 1$

4.3.1 Multidimensional Shift-Split

The SHIFT-SPLIT operations in the multidimensional decomposition exploit the relationship between the wavelet coefficients of the entire dataset and those of a multidimensional dyadic range, i.e., the cross product of single-dimensional dyadic intervals. For the non-standard decomposition we consider only cubic multidimensional dyadic ranges, resulting from dyadic intervals of equal length in all dimensions; an arbitrary multidimensional dyadic range can always be seen as a collection of cubic intervals.

To perform the SHIFT-SPLIT operations in the standard multidimensional decomposition, one performs the operations for each dimension separately. Any coefficient of the $d$-dimensional dyadic interval can be shifted or split in each dimension, and thus can sustain $d$ operations in total. Consider as an example a $d$-dimensional dataset where each dimension has size $N = 2^n$, and a cubic dyadic range of edge $M = 2^m$. The SHIFT operation affects $(M-1)^d$ coefficients, and the SPLIT operation calculates $(M + n - m)^d - (M - 1)^d$ contributions.

With the non-standard multidimensional transformation, all wavelet coefficients in the cubic dyadic range must be shifted, as in the standard transformation. However, only the scaling coefficient has to be split, and contributions must be calculated for the coefficients inside the nodes on the path to the root. Therefore, in the non-standard transformation, the SHIFT operation affects $M^d - 1$ coefficients and the SPLIT operation calculates $(2^d - 1)(n - m) + 1$ contributions. Table 4.1 summarizes these costs.

4.3.2 Shift-Split of Tiles

In this section we assume that the coefficients are stored using the optimal block allocation strategy described in Section 4.2. We calculate the number of tiles affected by the operations for the single-dimensional case and then extend to the two multidimensional wavelet transformations.

We start with the single-dimensional case of a vector of size $N = 2^n$ and its $(k+1)$-th dyadic interval of size $M = 2^m$, with disk block size $B = 2^b$. The coefficients affected by the SHIFT operation belong to a subtree of the wavelet tree, and that subtree spans exactly $\lceil\frac{M}{B}\rceil$ tiles. The SPLIT operation, on the other hand, calculates $\log\frac{N}{M}$ contributions. Because these contributions lie on a single root-ward path, there are $\log B$ affected coefficients per tile, so exactly $\lceil\log_B\frac{N}{M}\rceil$ tiles contain the contributions of the SPLIT operation. To summarize the single-dimensional case: the SHIFT operation affects $B$ times fewer tiles than coefficients, whereas the SPLIT operation affects $\log B$ times fewer tiles than coefficients.

Extending to $d$-dimensional tiles of size $B^d = (2^b)^d$ and applying the single-dimensional observation per dimension, we derive the number of $d$-dimensional tiles affected by the operations in each multidimensional form. The results are summarized in Table 4.2. For the remainder of this chapter we drop the ceiling operators to improve readability.

Table 4.2: Complexity of Shift-Split for Multidimensional Tiles

               | SHIFT                             | SPLIT
Standard       | $O(\lceil\frac{M}{B}\rceil^d)$    | $O\big((\lceil\frac{M}{B}\rceil + \lceil\log_B\frac{N}{M}\rceil)^d - \lceil\frac{M}{B}\rceil^d\big)$
Non-Standard   | $O(\lceil\frac{M}{B}\rceil^d)$    | $O\big((2^d - 1)\lceil\log_B\frac{N}{M}\rceil\big)$
4.4 Shift-Split Applications

In this section, we describe some of the applications where the SHIFT-SPLIT operations prove useful and draw comparisons to the existing alternatives.

4.4.1 Transformation of Massive Multidimensional Datasets

One of the most important applications of the SHIFT-SPLIT operations is the I/O-efficient transformation of massive multidimensional datasets. In what follows, we assume the dataset is $d$-dimensional, with each dimension having a domain of size $N = 2^n$, so the hypercube has $N^d$ cells. The memory available for the transformation is $M^d$, where $M = 2^m$, measured in units of coefficients; at any point in time, only $M^d \ll N^d$ coefficients fit in main memory. Given these restrictions, we need an algorithm that decomposes the dataset efficiently in terms of required I/O operations. We begin by assuming that one I/O operation involves a single data value or coefficient; later we measure I/O in units of disk blocks, under the optimal disk block allocation strategy of Section 4.2.2.

The intuition behind our approach is simple. We assume the data is either organized and stored in multidimensional chunks of equal size and shape, or that this chunk organization has been performed beforehand, as in [CGRS00, VW99]. We transform each chunk in memory, then use the SHIFT operation to relocate its coefficients and the SPLIT operation to update the already-stored coefficients. The chunks are hypercubes of size $M^d$, so each fits in main memory. Figure 4.6 shows a one-dimensional example for $N = 16$ and $M = 4$, where the current chunk is $C$. Transforming $C$ yields the wavelet coefficients inside the box, which must be shifted; the scaling coefficient of $C$ must be split to calculate the contributions to the coefficients shown in grey. Shown in black are the coefficients with finalized values, i.e., those that will not be affected by $C$ or by any chunk after it; shown in white are the coefficients that do not cover any chunk seen so far.

[Figure 4.6: Transformation by Chunks — the current chunk $C$, its shifted subtree, the grey coefficients receiving split contributions, the black finalized coefficients, and the white untouched coefficients.]

Result 1. The I/O complexity of transforming a $d$-dimensional dataset, each dimension having domain size $N = 2^n$, into the standard form of decomposition using memory of $M^d$ coefficients is $O\big((\frac{N}{B} + \frac{N}{M}\log_B\frac{N}{M})^d\big)$ disk blocks of size $B^d$.

Proof. As mentioned in Section 4.3.1, the SHIFT operation for the standard decomposition affects $(M-1)^d$ coefficients, whereas the SPLIT operation affects $(M + n - m)^d - (M-1)^d$ coefficients. Consequently, each chunk requires $O\big((M + n - m)^d\big) = O\big((M + \log\frac{N}{M})^d\big)$ I/O operations. Summing over all $(\frac{N}{M})^d$ chunks, we derive the I/O complexity, measured in coefficients, of the standard multidimensional wavelet transformation: $O\big((N + \frac{N}{M}\log\frac{N}{M})^d\big)$. Now consider disk blocks of size $B^d$, for $B = 2^b$. The I/O cost per chunk in units of disk blocks is $O\big((\frac{M}{B} + \log_B\frac{N}{M})^d\big)$; summing over all chunks yields the stated complexity, $O\big((\frac{N}{B} + \frac{N}{M}\log_B\frac{N}{M})^d\big)$.
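The chunked algorithm of Result 1 is a short loop in one dimension. The sketch below (ours) combines an in-memory Haar transform of each aligned chunk with the shift_haar and split_haar sketches from Section 4.3; chunk boundaries coincide with dyadic boundaries, so all details at levels up to $m$ are purely local to a chunk.

import math

def haar_dwt(v):
    # plain in-memory orthonormal Haar transform of a power-of-two vector;
    # returns (details keyed by ("d", level, pos), overall average)
    details, s, level = {}, list(v), 0
    while len(s) > 1:
        level += 1
        nxt = []
        for p in range(len(s) // 2):
            nxt.append((s[2 * p] + s[2 * p + 1]) / math.sqrt(2))
            details[("d", level, p)] = (s[2 * p] - s[2 * p + 1]) / math.sqrt(2)
        s = nxt
    return details, s[0]

def transform_by_chunks(chunks, M, N):
    # chunks[k] is the k-th length-M piece of a vector of length N
    m, n = int(math.log2(M)), int(math.log2(N))
    out = {}
    for k, chunk in enumerate(chunks):
        details, avg = haar_dwt(chunk)
        out.update(shift_haar(details, k, m))          # relocate local details
        for key, delta in split_haar(avg, k, m, n).items():
            out[key] = out.get(key, 0.0) + delta       # fold the average upward
    return out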
Vitter et al. [VWI98, VW99] use the standard form to decompose multidimensional datasets, without, however, taking advantage of our optimal block allocation strategy. They transform a dense $d$-dimensional dataset in $O(N^d\log_M N)$ disk I/O operations; in the case of sparse data with $N_z$ non-zero values, the I/O complexity is $O(N_z\log_M N)$. We can modify our SHIFT-SPLIT approach to accommodate sparseness similarly to the latter case, where only $N_z$ non-zero values exist; the modified I/O complexity is $O\big((N_z + \frac{N_z}{M}\log\frac{N}{M})^d\big)$. For comparison purposes, however, we omit the effect of sparseness in the original data. The I/O complexities are summarized in Table 4.3.

Result 2. The I/O complexity of transforming a $d$-dimensional dataset, each dimension having domain size $N = 2^n$, into the non-standard form of decomposition using memory of $M^d + (2^d - 1)\log\frac{N}{M}$ coefficients is $O\big((\frac{N}{B})^d\big)$ disk blocks of size $B^d$.

Proof. In the case of the non-standard multidimensional wavelet transformation, the SHIFT operation affects $M^d - 1$ coefficients, whereas the SPLIT operation affects $(D-1)(n-m) + 1$ coefficients, where $D = 2^d$. The per-chunk I/O cost is $O\big(M^d + (D-1)\log\frac{N}{M}\big)$. Summing over all $(\frac{N}{M})^d$ chunks, we derive the I/O complexity, measured in coefficients, of the non-standard multidimensional wavelet transformation: $O\big(N^d + (D-1)(\frac{N}{M})^d\log\frac{N}{M}\big)$. When tiling is used, the I/O cost per chunk in units of disk blocks becomes $O\big((\frac{M}{B})^d + (D-1)\log_B\frac{N}{M}\big)$; summing over all chunks yields $O\big((\frac{N}{B})^d + (D-1)(\frac{N}{M})^d\log_B\frac{N}{M}\big)$. However, if we enforce a particular access pattern on the chunks, namely a z-ordering, and allow an extra amount of memory, $(2^d - 1)\log\frac{N}{M}$, to hold the coefficients affected by splitting the scaling coefficients of the chunks, we can reduce the cost to the optimal $O\big((\frac{N}{B})^d\big)$, as shown in Table 4.3. A similar approach is suggested in [CGRS00], where a recursive procedure ensures that values arrive in this particular access pattern.

Table 4.3: I/O Complexities for Transformation of Multidimensional Datasets

Transformation Method        | I/O cost (in coefficients)                            | I/O cost (in blocks)
Vitter et al. (Standard)     | $O(N^d\log_M N)$                                      | —
Shift-Split (Standard)       | $O\big((N + \frac{N}{M}\log\frac{N}{M})^d\big)$       | $O\big((\frac{N}{B} + \frac{N}{M}\log_B\frac{N}{M})^d\big)$
Shift-Split (Non-Standard)   | $O(N^d)$                                              | $O\big((\frac{N}{B})^d\big)$

4.4.2 Appending to Wavelet Decomposed Transforms

In this section we investigate the problem of appending new data to existing transformed data. Appending differs fundamentally from updating in that it increases the domain of one or more dimensions. As a result, the wavelet-decomposed dimensions also grow, new levels of transformation are introduced, and the transform itself changes. We would like to perform appending directly in the wavelet domain, preserving as much of the transformed data as possible and avoiding reconstruction of the original data; the SHIFT-SPLIT operations help us achieve this goal. To simplify the complexity analysis, we omit the effect of the optimal disk block allocation strategy, or equivalently assume a disk block size of one coefficient, and we use the standard form of decomposition; the analysis for the non-standard form is similar.

As a motivating example, consider a massive multidimensional dataset containing measurements over 10 years, decomposed into the wavelet domain to expedite query processing. A new set of data for the following year becomes available, which results in appending along the time dimension and possibly along other dimensions. Let us assume the 10-year decomposed $d$-dimensional dataset has size $N^d$ and that the available memory is $M^d$, for $N = 2^n$ and $M = 2^m$.
Our SHIFT-SPLIT approach to the problem is as follows, repeated for every $M^d$ data values gathered in memory. We start by performing the $d$-dimensional DWT on the gathered data. Next, if required, we make the necessary space in the original transformed data (expand) to accommodate the new data to be appended. The final step is to shift and split the gathered data to update the expanded data.

The second step is the most important part of the appending application. Suppose we must expand one of the dimensions to accommodate the coefficients held in memory. The expansion means that the wavelet tree of that dimension has to grow in height by 1, and thus double its domain range. This expansion is carried out by shifting and splitting the decomposed data along that dimension. Figure 4.7 shows expansion in one dimension, where $T_{old}$ becomes $T_{new}$ and $|T_{new}| = 2|T_{old}|$. The expansion step creates the necessary space for the current chunk of $M^d$ coefficients in memory, as well as for some of the subsequent chunks; therefore this step, although costly, is rare.

[Figure 4.7: Wavelet Tree Expanding — $T_{old}$ becomes the left subtree of $T_{new}$, doubling the domain.]

The I/O cost of expanding transformed data in one dimension is $O(N^d)$, as all coefficients must be shifted to construct the new data cube of size $2N^d$. Note that although the asymptotic cost is high, the required SHIFT-SPLIT operations are very fast, which leads to fast execution times for expanding the domain of a dimension; this phenomenon is amplified by the use of tiling and is demonstrated in Section 7.2.2. Moreover, this operation, unlike reconstruction, requires no working memory. The I/O cost of applying the SHIFT-SPLIT operations on the memory chunk of size $M^d$ is $O\big((M + \log\frac{N}{M})^d\big)$.
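The expand step has a particularly simple 1-d form: the old transform is exactly the transform of the first dyadic half of the doubled domain ($k = 0$), so SHIFT leaves every detail index unchanged, and SPLIT of the old average creates the one new detail level and the new root. A minimal sketch (ours):

import math

def expand_domain(a_hat, a_avg, n):
    # grow a transformed vector of size 2^n to size 2^(n+1); the new half is
    # initially zero, so the new top detail is (old average - 0) / sqrt(2)
    a_hat = dict(a_hat)
    a_hat[("d", n + 1, 0)] = a_avg / math.sqrt(2)
    return a_hat, a_avg / math.sqrt(2)     # (new details, new overall average)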
4.4.3 Data Stream Approximation

In this section, we revisit the appending problem, this time in the context of stream query processing: we wish to maintain a wavelet approximation of a multidimensional data stream in the time-series model, where dimension sizes are unbounded and new data keep arriving. The focus here is to construct a space- and time-efficient algorithm for maintaining the best K-term synopsis. We show that we cannot, in general, maintain a K-term synopsis for multidimensional datasets decomposed using the standard form under bounded space. However, if certain conditions are met, we can maintain a K-term synopsis effectively.

Let us start with the simple one-dimensional case. As shown in [GKMS01a], we can maintain the best K-term approximation of data of length N = 2^n using space K + log N + 1. We always store the K highest coefficients encountered so far, plus those coefficients whose values can change with subsequent data arrivals. These coefficients, termed the wavelet crest in [PBF03], lie on the path from the current value to the root of the wavelet tree, and therefore there are exactly log N + 1 of them. Equivalently, if we consider a range containing just the data values under consideration, the SPLIT operation results in contributions lying on the wavelet crest. Therefore, at any time we have to keep in memory the coefficients that can be affected by the SPLIT operation.

Result 3. A K-term wavelet synopsis of a data stream of size N in the time-series model can be maintained using memory of O(K + B + log(N/B)) coefficients, with O((1/B) log(N/B)) per-item computational cost.

Proof. If we keep in memory a buffer of size B = 2^b, we can reduce the per-item processing time at the expense of extra space. We collect B values in the buffer, transform them, and apply the SHIFT operation to obtain the B − 1 relocated wavelet coefficients. Next, we compare these coefficients with the K highest, to obtain the new set of K highest coefficients. Finally, we have to update the coefficients that can change, using the contributions derived from the SPLIT operation. The number of contributions for a buffer of size B is log(N/B), and thus the space required for the coefficients on the crest is log(N/B). The total computational cost for the buffer, which includes the cost of the transformation and the cost of updating the coefficients on the crest, is O(B + log(N/B)). As a result, the per-item computational cost is O((1/B) log(N/B)), reduced from O(log N), at the expense of extra space of B.
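For illustration, here is a minimal C# sketch of the one-dimensional algorithm for B = 1 (class and method names are ours; PriorityQueue is the .NET 6 collection type). The pending scaling coefficients, one per level, form the wavelet crest; a detail coefficient is final as soon as it is produced, so a min-heap on coefficient magnitudes maintains the K highest. The synopsis at any time is the top-K set together with the crest.

using System;
using System.Collections.Generic;

// Minimal sketch (names ours): maintain a K-term Haar synopsis of a 1-D
// stream, one item at a time. The "wavelet crest" is the set of pending
// scaling coefficients, at most log N + 1 of them.
class StreamSynopsis
{
    private readonly int k;
    private readonly List<double?> crest = new List<double?>(); // pending scaling coeff per level
    // min-heap on |value|: keeps the K largest finalized detail coefficients
    private readonly PriorityQueue<double, double> topK = new PriorityQueue<double, double>();

    public StreamSynopsis(int k) { this.k = k; }

    public void Insert(double x)
    {
        double s = x;                        // scaling value being promoted up the tree
        for (int level = 0; ; level++)
        {
            if (level == crest.Count) crest.Add(null);
            if (crest[level] == null) { crest[level] = s; return; }
            double a = crest[level].Value;   // earlier (left) sibling at this level
            crest[level] = null;
            Keep((a - s) / Math.Sqrt(2));    // finalized detail coefficient
            s = (a + s) / Math.Sqrt(2);      // promote the scaling coefficient
        }
    }

    private void Keep(double d)
    {
        topK.Enqueue(d, Math.Abs(d));
        if (topK.Count > k) topK.Dequeue();  // drop the smallest-magnitude term
    }
}

With a buffer of B items, one would instead transform the whole buffer at once and amortize the crest updates over B arrivals, exactly as in the proof above.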
Each of these T N hypercubes results in a wavelet tree capturing the non-standard decomposition, where there exists a single average as the root of each of these trees. We apply the single dimensional transformation on the T N data constructed by these averages. The final result consists of T N non-standard multidimensional trees and a single one dimensional tree which has as leaf nodes the averages of the non- standard trees. We assume the z-ordered access pattern, described in Section 4.4.1, and we allow for extra buffering space of M d coefficients. Under these restrictions, the coefficients we have to retain lie in a path to the root in the last tree of the hypercubes, and in the path to the root in the single dimensional wavelet tree. Therefore we need to keep(2 d ¡1)log N M coefficients from the non-standard tree andlog T N coefficients from the 1-d tree, resulting in a total space cost of O ¡ K +M d +(2 d ¡1)log N M +log T N ¢ . 4.4.4 Partial Reconstruction from Wavelet Transforms In this section, we discuss the problem of reconstructing a set of values specified by a range on a multidimensional dataset. The problem is equivalent to translating the selec- tion operation of relational algebra to the wavelet domain. Chakrabarti et al. [CGRS00] have provided a solution for the non-standard form, in which they identify the coeffi- cients who cover the range and calculate their contribution. Here, we present a similar approach, based on the inverse of SHIFT-SPLIT operations, which generalizes to both forms of decomposition. The inverse of SHIFT is essentially the inverse index trans- lation, whereas the inverse of SPLIT is a point query, which shows how to reconstruct a value from contributions on a path to the root. Therefore, the cost of the inverses of these operations is the same. 73 We focus our discussion here to multidimensional ranges that are dyadic ranges; an arbitrary selection range can be seen as a number of such dyadic ranges. Therefore, our problem degenerates to the reconstruction of ad-dimensional dyadic range of sizeM d , given the transformation of the entire data of size N d . The scaling coefficients of the dyadic range are calculated using the inverse SHIFT, whereas the rest of the coefficients are simply calculated from the coefficients in the original dataset by re-indexing, using the inverse SPLIT. Result 6. The time complexity for reconstructing ad-dimensional dyadic range of size M d from a wavelet transformed signal of sizeN d isO ³ ¡ M +log N M ¢ d ´ for the standard form andO ¡ M d +(D¡1)log N M ¢ for the non-standard. Proof. It follows from the complexity of the SHIFT-SPLIT operations. 4.5 Summary We have introduced two general purpose operations, termed SHIFT and SPLIT, that work directly in the wavelet domain and can also be applied in combination with the optimal disk block allocation strategy. We analyze costs for both the single dimensional case and the two forms of multidimensional transformation. There is a significant number of applications that can benefit from these operations. We have revisited some data maintenance scenarios, such as transforming massive mul- tidimensional datasets and reconstructing large ranges from wavelet decomposed data, and utilized the SHIFT-SPLIT operations to draw comparisons with current state of the art techniques. Furthermore, we have provided solutions to some previously un- explored maintenance scenarios, namely, appending data to an existing transformation 74 and approximation of multidimensional data streams. 
4.5 Summary

We have introduced two general-purpose operations, termed SHIFT and SPLIT, that work directly in the wavelet domain and can also be applied in combination with the optimal disk block allocation strategy. We analyzed their costs for both the one-dimensional case and the two forms of multidimensional transformation.

A significant number of applications can benefit from these operations. We have revisited some data maintenance scenarios, such as transforming massive multidimensional datasets and reconstructing large ranges from wavelet-decomposed data, and utilized the SHIFT-SPLIT operations to draw comparisons with the current state-of-the-art techniques. Furthermore, we have provided solutions to some previously unexplored maintenance scenarios, namely appending data to an existing transformation and approximating multidimensional data streams. We demonstrated the effectiveness of the proposed techniques both analytically and experimentally, and we conjecture that the introduced operations can prove useful in a plethora of other applications, as the SHIFT-SPLIT operations stem from the general properties and behavior of wavelets.

Chapter 5

Range Group-by Query Processing with Wavelets

5.1 Introduction

Spreadsheets allow us to easily perform complex data analysis on scientific datasets. However, they cannot operate efficiently on the very large multidimensional datasets generated by current data acquisition methods. Current science practice is to store the original data in databases or on FTP sites and then manually generate a smaller subset of the data (by sampling, aggregating, or categorizing) as a new "data product". Yet this time-consuming process suffers from one major drawback: we lose the detailed information and end up working with a secondhand dataset. Hence, it may result in a biased study of the data, verifying known hypotheses rather than being surprised by unknown facts. To address these shortcomings, we are investigating how to enable spreadsheet-type functionalities on the original large datasets in databases.

In this chapter, we focus on one of the most exercised functionalities of spreadsheets: generating meaningful plots over the data. Here, we redefine a plot as a database query (a range group-by query) and progressively process it using wavelets. A plot summarizes how a fact changes over a set of attributes and is visually represented in various forms of charts. These graphs are considered among the most effective visual aids for statistical analysis and are widely used to provide valuable insights over any dataset. For example, one can extract outliers, trends, clusters, or measurements such as the gradient or the area under a curve by quickly looking at a plot.

Approximate plots obtained with a limited number of I/Os are often acceptable enough to help us intuitively understand the general behavior of the data. The valuable insight provided by these queries comes from the easy-to-visualize relationship among the plot points. Thus, it is essential to preserve this relationship in approximate or progressive range group-by answering, rather than conserving the accuracy of each individual group value. In addition, scientists desire to have the plot output at various resolutions from time to time. In one scenario, the graphic software at the application side may be limited to only a small number of plot points. In another scenario, scientists may be interested only in large-scale changes (e.g., annual climate change). Finally, the fine-resolution data may carry noise that renders its mining useless. In any of these scenarios, it is necessary to compute the range group-by query at a coarser resolution and avoid retrieving unwanted details of the data.

We decompose a range group-by query into two sets of fundamental queries: aggregate queries and reconstruction queries. Given these components, aggregation and reconstruction, and emphasizing the output requirements, multiresolution and approximation/progressiveness, we propose to utilize the wavelet transform to provide efficient range group-by query processing. The intuition behind our proposal comes from our observation that aggregate queries can be efficiently evaluated in the wavelet domain and that the original data can be equitably reconstructed from wavelet-transformed data.
With the proposed method, we provide high-quality approximate query results, independent of the data distribution, with very little I/O and computational overhead, by using the most important query wavelet coefficients. We further extend our algorithm to progressive query processing by ordering the retrieved data. Our experimental results show that the approximate results produced by our progressive framework are very accurate long before the exact query completes (below 10% of retrieval).

We begin our discussion by defining the range group-by query and processing it using the state-of-the-art method in Section 5.2. In Section 5.3 we present our efficient algorithm to process range group-by queries with wavelets. We extend our query framework by providing approximate and progressive query processing in Section 5.4. Finally, we conclude our discussion in Section 5.5.

5.2 Range Group-by Query

We often wish to apply aggregate operations to each of a number of groups of rows in a relation. Toward this end, we select a certain region of the data, called the range, and divide it into different groups based on a subset of its attributes, called the grouping attributes. Subsequently, we compute a value, called a group value, per group and draw the chart or print the table of group values versus grouping attributes.

Consider a dataset with a_1, ..., a_d as its dimension attributes and D as its measure attribute. Let the range for each dimension i (i ≤ d) be [l_i, h_i], and let the first g dimensions (g ≤ d) be the grouping dimensions, without loss of generality. For each combination of grouping dimensions, we compute the group value by aggregating the measure values inside the range. Sum and average are the most widely used aggregations in this regard. In this chapter, we focus on sum due to its simplicity; the extension to other aggregations is straightforward, as discussed in Chapter 3.

We define a range group-by query as a query that prepares the data in the form of a set of tuples (a_1, ..., a_g, G), which we articulate as G versus (a_1, ..., a_g). We refer to (a_1, ..., a_g) and G as the grouping attributes and the group value, respectively. In relational databases, we express a range group-by query with the following SQL statement:

SELECT a_1, ..., a_g, SUM(D)
FROM Data
WHERE l_1 ≤ a_1 ≤ h_1 AND ... AND l_d ≤ a_d ≤ h_d
GROUP BY a_1, ..., a_g;

Now we define the range group-by query mathematically:

Definition 13. Given a d-dimensional datacube D with its first g dimensions as the grouping dimensions, and a range [l_i, h_i] for each dimension i, the range group-by query is defined as:

{(a_1, ..., a_g, G) | ∀i ≤ g: l_i ≤ a_i ≤ h_i,
 G(a_1, ..., a_g) = Σ_{l_{g+1} ≤ a_{g+1} ≤ h_{g+1}} ... Σ_{l_d ≤ a_d ≤ h_d} D(a_1, ..., a_d)}   (5.1)

To simplify our notation, we denote the grouping dimensions by x, x = (a_1, ..., a_g), and the rest of the dimensions by y, y = (a_{g+1}, ..., a_d). Similarly, we use (x, y) instead of (a_1, ..., a_g, a_{g+1}, ..., a_d) throughout the chapter. Therefore, we simplify our query definition as follows:

{(x, G) | l_x ≤ x ≤ h_x, G(x) = Σ_{l_y ≤ y ≤ h_y} D(x, y)}   (5.2)

The equation above states that G(x) is the sum of all values inside the range [l_y, h_y] for each point x inside the grouping-dimension range [l_x, h_x]. Let us continue our discussion with an illustrative example.

Figure 5.1: Example of a Range Group-by Query (a. Data Example; b. Query Output on a Chart)
Example. Figure 5.1a demonstrates a 2-dimensional datacube with product and time as the dimension attributes and sales as the measure attribute. Let time be the grouping dimension and product be the aggregating dimension. We would like to perform a range group-by query of sales versus time for 1 ≤ time ≤ 4 and 1 ≤ product ≤ 3. To answer this query, we compute the daily total sales by performing a range aggregate query for each day. For example, for time = 2, we have sum(sales) = D(2,1) + D(2,2) + D(2,3) = 4 + 5 + 3 = 12 (see Figure 5.1b).

Lemma 10. Given d-dimensional data D with a domain size of N per dimension, a range size of M per dimension, and a set of g grouping dimensions, the I/O complexity of range group-by query processing is O(M^d) if we naively perform a batch of aggregate queries over the original data.

Proof. Our range group-by query consists of M^g groups, with a cost of O(M^{d−g}) for each individual range aggregate query on the original data. Thus, the total complexity becomes O(M^d).

Note that in the worst case, where M becomes as large as N, the total complexity becomes O(N^d), which amounts to reading the entire database for a single query. Yet this method suffers from two other major drawbacks in addition to its high I/O complexity. First, it does not provide any mechanism for approximate and/or progressive evaluation of the query. Second, it does not offer a resolution-aware process that reduces the query complexity at coarser resolutions. It is straightforward to see that the complexity remains O(M^d) at coarser resolutions: as the number of range aggregate queries decreases, the cost of each aggregate query increases proportionally.

Toward addressing these shortcomings, we utilize wavelets to process range group-by queries. The use of wavelets not only dramatically reduces the cost of aggregate queries but also addresses the multiresolution and approximation requirements. In the next section, we first show how the use of the wavelet transform reduces the complexity of each individual aggregate query. Then, we introduce our efficient algorithm, in which we exploit I/O sharing across the aggregations, and show the excellent approximation of the query in its entirety.

5.3 Range Group-by Query Processing with Wavelets

In this section we present our algorithm for efficient processing of range group-by queries. First, we employ the wavelet transform to reduce the cost of computing each group value. Next, we propose our novel method, in which we introduce a framework to share the coefficients across all the group values.

Lemma 11. Given d-dimensional wavelet-transformed data D̂ with a domain size of N per dimension, a range size of M per dimension, and g grouping dimensions, the I/O complexity of range group-by query processing is O(M^g log^d N) if we perform a batch of aggregate queries using wavelets.

Proof. The range group-by query consists of M^g range aggregate queries. Table 3.1 shows that the cost of each aggregate query decreases from M^{d−g} to O(log^d N) using wavelets. Thus, the total I/O complexity becomes O(M^g log^d N).

Despite the significant improvement over the naive method, this utilization of the wavelet transform still suffers from the fact that it treats the range group-by query as a set of individual aggregate queries. Thus, it does not share the common coefficients among the queries, which results in several passes over the data.
In addition, it cannot approximate the query in its entirety; instead, it approximates each aggregated value separately, which does not necessarily lead to the best approximation of the query.

To address these issues, we introduce an efficient algorithm that processes a range group-by query as a single query. We divide this process into two steps, aggregation and reconstruction, and describe each in turn. The aggregation phase prepares the aggregated values for each group point in the wavelet domain, whereas the reconstruction phase converts these values back to the original domain.

5.3.1 Aggregation Phase

In this phase, we show how we recast a range group-by query as vector queries in the wavelet domain for efficient processing. Let us simplify the aggregate equation by defining an aggregate query vector as follows:

Definition 14. The aggregate query vector consists of 1's inside the range and 0's outside the range:

Q(y) = 1 if l_y ≤ y ≤ h_y, and 0 otherwise.

Now we rewrite the basic definition of the range group-by query (Eq. 5.2) as follows:

{(x, G) | l_x ≤ x ≤ h_x, G(x) = Σ_y D(x, y) · Q(y)}

This equation can be considered a dot product of two vectors: the x column of D (denoted D_x), the data vector, and Q, the aggregate query vector. We rewrite the equation as follows:

G(x) = Σ_y D_x(y) · Q(y)   (5.3)

We wavelet transform both vectors, D_x and Q, and utilize the following useful lemma:

Lemma 12. Given a wavelet-transformed data vector D̂_x and a wavelet-transformed query vector Q̂, we compute the group values as follows:

G(x) = Σ_y D̂_x(y) · Q̂(y)   (5.4)

Proof. The Discrete Wavelet Transform is proven to preserve the Euclidean norm. Thus, the generalized Parseval equality applies to DWT; that is, the dot product of two vectors equals the dot product of the wavelet transformations of the vectors (see Chapter 3 for more information).

Figure 5.2: Aggregation in the Wavelet Domain

Example. Consider the same data and query described in the example of Section 5.2. Now we would like to compute the group values using wavelets. Figure 5.2 illustrates the process of computing G(2) both in the original domain and in the wavelet domain. We select D_x for x = 2 as the data vector and wavelet transform it to obtain D̂_x. Then, we form the aggregate query vector Q, with 1's inside the range R_y and 0's outside the range, and wavelet transform Q to obtain Q̂. Subsequently, we perform a dot product between D̂_x and Q̂ to compute G(x).
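The following minimal C# sketch illustrates Lemma 12 (the vectors and numbers are ours, not those of Figure 5.2): the dot product of a data column with the 0/1 aggregate query vector is unchanged when both are transformed with the orthonormal Haar filter.

using System;

// Minimal sketch of the generalized Parseval equality for the orthonormal
// Haar transform (names and values ours).
static class ParsevalDemo
{
    static double[] Haar(double[] x)
    {
        double[] t = (double[])x.Clone();
        for (int len = t.Length; len > 1; len /= 2)
        {
            double[] tmp = new double[len];
            for (int i = 0; i < len / 2; i++)
            {
                tmp[i]         = (t[2*i] + t[2*i + 1]) / Math.Sqrt(2);
                tmp[len/2 + i] = (t[2*i] - t[2*i + 1]) / Math.Sqrt(2);
            }
            Array.Copy(tmp, t, len);
        }
        return t;
    }

    static double Dot(double[] a, double[] b)
    {
        double s = 0;
        for (int i = 0; i < a.Length; i++) s += a[i] * b[i];
        return s;
    }

    static void Main()
    {
        double[] dx = { 4, 5, 3, 6 };  // a data column D_x
        double[] q  = { 1, 1, 1, 0 };  // aggregate query vector for 1 <= y <= 3
        Console.WriteLine(Dot(dx, q));              // 12, in the original domain
        Console.WriteLine(Dot(Haar(dx), Haar(q)));  // 12, in the wavelet domain
    }
}

The equality holds exactly because the orthonormal DWT is a rotation of the coordinate system; what makes it useful is that Q̂ typically has far fewer significant coefficients than Q has non-zeros.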
Toward computing the group values using wavelets, we must be able to efficiently wavelet transform the query vector and the data vector. For the transformed query, we employ our efficient query transformation algorithm (see Section 3.3), which transforms the vector by computing the coefficients only on its boundaries. For the transformed data vector, we select the x column of W_y D, which represents the transformation of D along dimension y, since D̂_x(y) = W_y D(x, y) by the definition of the standard multidimensional wavelet transformation (see Section 2.4). Thus, we have:

G(x) = Σ_y W_y D(x, y) · Q̂(y)   (5.5)

However, the y dimensions are selected on the fly (i.e., at query submission), and we cannot precompute the data transformed along y in advance. The following lemma, however, provides the opportunity of constructing W_y D from the transformed data D̂.

Lemma 13. Given a wavelet-transformed datacube D̂ and the set of dimensions y, the data transformed along y is computed by inverse transforming the data along the other dimensions x:

W_y D(x, y) = Σ_α W_x^{-1}(x, α) · D̂(α, y)   (5.6)

Proof. Let the data D have two sets of dimensions, x and y. Therefore, its wavelet transformation is defined as D̂ = W_x W_y D. By performing an inverse transformation along x on both sides, we have W_y D = W_x^{-1} D̂.

5.3.2 Reconstruction Phase

Let us review the process so far. First, we compute W_y D by inverse transforming D̂ along the x dimensions. Then, after preparing Q̂ on the fly, we perform a dot product between Q̂ and the x column of W_y D for each group value G(x). However, this process is not yet efficient, since we must first perform the costly operation W_x^{-1} D̂ and store a large temporary datacube W_y D. To address this inefficiency, we propose to perform the aggregation before the inverse transformation to reduce the overall cost. In effect, we push the aggregation down to the wavelet domain and reconstruct the result from the wavelet-transformed temporary datacube. For this purpose, let us substitute Equation 5.6 into Equation 5.5 and interchange the linear operations as follows:

G(x) = Σ_y ( Σ_α W_x^{-1}(x, α) D̂(α, y) ) Q̂(y)
     = Σ_α W_x^{-1}(x, α) ( Σ_y D̂(α, y) Q̂(y) )

The equation above shows that we aggregate the data along y first, and then reconstruct the values by inverse transforming along x. We denote the second summation by Ĝ, as it carries the G values of our range group-by query; in fact, Ĝ is the transformation of the result set along x. The following lemma summarizes the process of range group-by query processing. In short, it states that G(x) is computed by performing an inverse transform on Ĝ for all the group values.

Lemma 14. Given a wavelet-transformed datacube D̂ and a wavelet-transformed query vector Q̂ as the aggregation along the y dimensions, we compute the group values with the following steps:

Step 1 (Aggregation): Ĝ(x) = Σ_y D̂(x, y) Q̂(y)   (5.7)

Step 2 (Reconstruction): G(x) = Σ_y W_x^{-1}(x, y) Ĝ(y)   (5.8)

Figure 5.3: Range Group-by Query with Wavelets

Example. Figure 5.3 demonstrates the process of constructing Ĝ from D̂. Every x element of Ĝ is computed by a dot product between Q̂ and D̂_x. Finally, we perform an inverse wavelet transform on Ĝ to compute G. Note that reconstructing the highlighted subset of G suffices to complete the query computation.

To conclude this section, we analyze the complexity of our algorithm with the following theorem.

Theorem 15. Given d-dimensional wavelet-transformed data D̂ with a domain size of N per dimension, a range size of M per dimension, and g grouping dimensions, the I/O complexity of range group-by query processing is:

O((M + log(N/M))^g · log^{d−g} N)

Proof. The aggregation complexity is O(log^{d−g} N), based on Table 3.1, for a (d−g)-dimensional range, and the reconstruction complexity is O((M + log(N/M))^g), based on Table 4.1. Multiplying the two gives the overall complexity of range group-by query processing.
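To see Lemma 14 end to end, the following minimal two-dimensional C# sketch (names ours; the orthonormal Haar filter stands in for the general case) transforms a small datacube into the standard form, aggregates along y in the wavelet domain (Step 1), inverse transforms along x (Step 2), and checks the group values against direct summation. In WOLAP the transformed cube resides on disk and only the needed coefficients are retrieved; here everything is in memory for clarity.

using System;

static class RangeGroupByDemo
{
    static double[] Haar(double[] x)
    {
        double[] t = (double[])x.Clone();
        for (int len = t.Length; len > 1; len /= 2)
        {
            double[] tmp = new double[len];
            for (int i = 0; i < len / 2; i++)
            {
                tmp[i]         = (t[2*i] + t[2*i + 1]) / Math.Sqrt(2);
                tmp[len/2 + i] = (t[2*i] - t[2*i + 1]) / Math.Sqrt(2);
            }
            Array.Copy(tmp, t, len);
        }
        return t;
    }

    static double[] InvHaar(double[] t)
    {
        double[] x = (double[])t.Clone();
        for (int len = 2; len <= x.Length; len *= 2)
        {
            double[] tmp = new double[len];
            for (int i = 0; i < len / 2; i++)
            {
                tmp[2*i]     = (x[i] + x[len/2 + i]) / Math.Sqrt(2);
                tmp[2*i + 1] = (x[i] - x[len/2 + i]) / Math.Sqrt(2);
            }
            Array.Copy(tmp, x, len);
        }
        return x;
    }

    static void Main()
    {
        var rnd = new Random(1);
        int n = 8;
        double[,] D = new double[n, n];
        for (int x = 0; x < n; x++)
            for (int y = 0; y < n; y++) D[x, y] = rnd.Next(10);

        // Standard form: transform along y (rows), then along x (columns).
        double[,] Dhat = new double[n, n];
        for (int x = 0; x < n; x++)
        {
            double[] row = new double[n];
            for (int y = 0; y < n; y++) row[y] = D[x, y];
            row = Haar(row);
            for (int y = 0; y < n; y++) Dhat[x, y] = row[y];
        }
        for (int y = 0; y < n; y++)
        {
            double[] col = new double[n];
            for (int x = 0; x < n; x++) col[x] = Dhat[x, y];
            col = Haar(col);
            for (int x = 0; x < n; x++) Dhat[x, y] = col[x];
        }

        // Query: G(x) = sum of D(x, y) over 2 <= y <= 5, for every x.
        double[] q = new double[n];
        for (int y = 2; y <= 5; y++) q[y] = 1;
        double[] qhat = Haar(q);

        double[] ghat = new double[n];              // Step 1: aggregation in the wavelet domain
        for (int x = 0; x < n; x++)
            for (int y = 0; y < n; y++) ghat[x] += Dhat[x, y] * qhat[y];

        double[] g = InvHaar(ghat);                 // Step 2: reconstruction along x

        for (int x = 0; x < n; x++)
        {
            double direct = 0;
            for (int y = 2; y <= 5; y++) direct += D[x, y];
            Console.WriteLine($"{g[x]:F4}  vs  {direct}");
        }
    }
}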
5.4 Approximation and Progressiveness

When execution time is limited, the accuracy of the query result can be traded for a better response time; that is, a fast, less accurate result becomes preferable to an exact, late one. Since the dominant factor in query processing is database retrieval, we limit the retrievals to a certain number B; that is, we only retrieve the B most significant wavelet coefficients contributing to the query. Here, we adopt the two widely used methods, First-B and Highest-B, for selecting the B most significant wavelet coefficients. Under First-B, the most significant coefficients are those with the lowest frequencies. Under Highest-B, the most significant coefficients are those with the highest absolute values.

Recall that range group-by query processing has two phases: 1) the aggregation phase, D̂ → Ĝ, and 2) the reconstruction phase, Ĝ → G. Therefore, we can approximate either or both of these steps to approximate the query output. First, we consider approximating the reconstruction process; that is, we need to select the best coefficients of Ĝ for the reconstruction of G, given a limited number of retrievals. The following example clarifies our purpose.

Example. Consider the wavelet data Ĝ illustrated in Figure 5.3. When the query is limited to retrieving only 2 coefficients, it is recommended to retrieve either of the following sets: {39.2, 0.35} if we consider the First-B ordering (the lowest frequencies), or {39.2, −5.0} if we consider the Highest-B ordering (the highest absolute values).

Unfortunately, we cannot utilize the Highest-B method for this step in practice, because it requires knowing all the values of Ĝ in advance to determine the highest ones. Therefore, we can only utilize the First-B method in this step; that is, we select the coefficients with the lowest frequencies.

In addition to the reconstruction phase, we can approximate the aggregated intermediate result Ĝ by selecting the most contributing coefficients, that is, the pairs of query Q̂ and data D̂ items with the highest values (see Equation 5.7). However, the values of D̂ are not known in advance and cannot be utilized for this process. Therefore, we advocate selecting the query coefficients at query time to achieve good approximate results. Toward this end, the B most significant query coefficients are selected using the Highest-B or First-B method. We refer the reader to [SJS05] for more information on the various techniques used in wavelet query approximation.

Having the ability to approximate both phases, aggregation and reconstruction, we face a dilemma: either compute the aggregated result exactly and then perform an approximate reconstruction phase, or approximate both phases together. Our empirical study shows that the latter outperforms the former; we discuss this in the experimental chapter.

By progressively increasing the term B, we can order the coefficients based on their significance. We exploit this ordering to answer the query in a progressive manner, so that each step produces a more precise evaluation of the actual answer. In fact, the real-world users of our technique have found its progressiveness the most appealing feature for processing large range group-by queries.
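As a small illustration of this progressive evaluation (the coefficient values below are ours, purely illustrative), the following C# sketch refines a partial dot product one retrieved data coefficient at a time, in the Highest-B order of the query coefficients.

using System;
using System.Linq;

// Minimal sketch (names ours): progressive refinement of one group value,
// retrieving data coefficients in decreasing order of |Qhat|.
static class ProgressiveDemo
{
    static void Main()
    {
        double[] dhat = { 9.0, -5.0, 0.3, 2.1 };   // illustrative data coefficients
        double[] qhat = { 1.5, 0.5, 0.7, 0.0 };    // illustrative query coefficients

        int[] order = Enumerable.Range(0, qhat.Length)
                                .OrderByDescending(i => Math.Abs(qhat[i]))
                                .ToArray();
        double partial = 0;
        foreach (int i in order)
        {
            partial += dhat[i] * qhat[i];          // one more coefficient retrieved
            Console.WriteLine($"after coefficient {i}: {partial:F2}");
        }
    }
}

Because the ordering depends only on the query, it can be computed before any data retrieval, which is what allows the retrieval itself to be cut off at any point.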
Let us end this section by emphasizing that we have studied query approximation here. Adopting data approximation (the use of compressed data) is straightforward, as discussed in Section 3.5.1. More specifically, we can compress the data with either of the two orderings, Highest-B or First-B. At query time, if a data coefficient is not stored, i.e., it was dropped due to the data compression, we assume the coefficient is zero and continue the process. This assumption is essentially an implementation of hard thresholding (see Section 3.5.1 for more information).

5.5 Summary

We have proposed a wavelet-based technique to efficiently process range group-by queries. Furthermore, we have extended our algorithm to progressive query processing by ordering the retrieval procedure. Our experimental results show that the approximate results produced by our progressive framework are very accurate long before the exact query completes.

Chapter 6

ProDA: An End-to-End WOLAP System

6.1 Introduction

Over the past half decade, we have designed, developed, and matured an end-to-end system, dubbed ProDA (for progressive data analysis), that efficiently and effectively analyzes massive datasets. ProDA functions as a client-server system with the three-tier architecture shown in Figure 6.1: the storage tier maintains the data at the bottom, while the query tier executes the queries at the midlevel; together these elements comprise the ProDA server. The ProDA client, on the other hand, implements the visualization tier on top, where user queries are formulated and query results are presented.

Figure 6.1: ProDA's Architecture

As an OLAP tool, ProDA supports a wide range of analytical queries while also being able to handle massive datasets. However, compared to current OLAP tools, ProDA offers extended and enhanced online query processing capabilities, made possible by leveraging our in-house wavelet-based technology described in the previous chapters. Specifically, ProDA supports more complex analytical queries, including the entire family of polynomial aggregate queries as well as the very data-intensive range group-by queries, previously unsupported by OLAP tools. Moreover, unlike current OLAP tools, ProDA supports online ad hoc queries.

To enable online execution of these queries, we take two measures to improve the efficiency of query execution. First, we treat analytical queries as database queries and push them down, close to the data, to be executed at the server side rather than in client-side applications. Second, we leverage the wavelet transform's excellent energy-compaction properties, which allow for accurate approximation of the query result with minimal data access. Here, we innovate by transforming the query as well as the data. Since queries are often more patterned than data, they are also more compactable when transformed into the wavelet domain. With a highly compact yet accurate query representation, in addition to a compact data representation, we can effectively select and retrieve the high-energy data coefficients relevant to the query with exponentially less data access compared to previous approaches that only transform the data. Therefore, we can approximate the query result accurately with an exponentially improved response time. In combination, these two measures let ProDA carry out the online execution of ad hoc queries.
Further, by leveraging the multiresolution properties of the wavelet transform, ProDA can answer approximate queries progressively, either with fixed accuracy or, alternatively, with fixed performance, such as a limited time frame. This feature is particularly useful for exploratory data analysis, which can be quite time- and resource-intensive with massive datasets. In exploratory analysis, users often issue several back-to-back queries, each time revising and enhancing a query based on a quick observation of the previous query's partial results. With progressive queries, users can save time and system resources according to the required query accuracy, or according to the time and system resources available to execute each query.

On the other hand, with ProDA, in line with the typical use of the wavelet transform in databases, we can optionally drop the transformed data's low-energy coefficients to save storage space. Since storage space is no longer the main resource constraint, we prefer lossless data storage, to allow for exact query answering if needed.

Finally, ProDA also introduces novel operators that let developers manipulate the stored data by inserting, deleting, and updating data records directly in the wavelet domain rather than in the original domain. These operators are extremely useful for maintaining massive datasets, particularly when the data is frequently updated by, for example, incoming data streams; otherwise, the entire dataset would have to be transformed back to the original domain for any minor data manipulation.

6.2 Case Studies

We have successfully used ProDA in two real-world data applications. These case studies serve as proofs of concept and as testbeds for identifying and addressing ProDA's practical limitations.

6.2.1 Earth Science

Earth scientists need to perform complex statistical queries, as well as mining queries such as outlier/pattern detection, on very large multidimensional datasets produced by advanced instruments. In particular, the Atmospheric Infrared Sounder (AIRS) instrument collects the Earth's atmospheric temperature, water vapor, and ozone profiles with a very fine spatial resolution (13.5 km horizontal and 1 km vertical) at a high rate. Due to its large volume, scientists generate smaller data products (named level-1, level-2, and level-3) at coarser resolutions to work with at ease. The data products are used to validate climate models and to test their representations of critical climate feedbacks.

In particular, scientists extensively perform a certain type of computation-intensive query, termed the range-aggregate query, which requires attention to avoid very slow response times. An example of such a query is "Find the average temperature of each grid cell for a given granularity of latitude, longitude, altitude, and time." Although a user can form such a query very easily by utilizing our graphical user interface and drawing a 4-D grid, the database server needs a long time to compute the aggregation function (in this example, average) on the measure attribute (in this example, temperature) for all the grid cells. Moreover, the system cannot rely on pre-computation of such queries, because in the general case many parameters are unknown. That is, the resolution of each dimension (e.g., hourly, daily, or monthly on the time dimension), the aggregate function (e.g., sum, average, variance, or covariance), and the measure attribute (e.g., temperature, pressure, or altitude) can vary from one query to the other.
Using wavelets, not only do we access the finest resolution of the data with no extra cost, but we are also able to summarize the data at various levels of abstraction (e.g., daily, weekly, or monthly) on the fly. Consequently, this framework diminishes the need for the off-line level summarization currently used in Earth science data management. Moreover, ProDA allows scientists to run thousands of hypotheses over massive datasets and receive the results almost immediately. We have illustrated the benefit of ProDA by building data mining examples on top of it.

6.2.2 Oil Industry

With recent advancements in sensor technology, we can now economically equip oil production wells, and an oilfield's water and steam injection wells, with multimodal sensor devices to monitor gas, water, steam, oil pressure, and related factors. A smart oilfield management system must analyze such a data feed in real time to provide decision support for the oilfield operators. The system can also be extended to control the oilfield automatically, creating a closed loop. Analyzing oilfield sensor data is complicated not only by the data's size and many dimensions, but also by the data's high update rate, which renders any slow analysis process useless. In this application, domain experts must execute complex ad hoc queries on the fly to understand the oilfield's dynamic behavior in real time and react accordingly.

To illustrate, imagine that the oilfield management system is continually receiving gas, water, steam, and oil pressure readings from a field of 4,000 wells, where each well has 20 sensor devices, deployed at various depths, with a sampling period of 15 seconds. Typically, a reservoir engineer must continuously monitor the covariance between the water and steam injection and the oil production across all wells. The covariance matrix determines the injection rates required for optimal total production. As far as we know, only ProDA can compute such a complex query on the fly.

6.3 Bottom Tier: Storage Engine

The bottom tier of ProDA, the storage engine, stores and manages the data. This engine consists of several data sources: user accounts, a history of user activities, a datacube directory, and the datacubes themselves. Each datacube consists of the following: general information about the cube (Header DB), a list of user-defined polynomial queries (Query DB), the dimension values of the datacube (Dimension DB), and the wavelet-transformed measure values (Wavelet DB). With the exception of the Wavelet DB, we have implemented all the data sources in a relational database server to facilitate data management. For the Wavelet DB, we have developed our own custom binary file structures that store multidimensional wavelet-transformed data using our optimal disk block allocation strategy, in order to achieve high efficiency.

6.4 Middle Tier: Query Engine

The query engine of ProDA provides a rich set of Web services organized into four groups: browsing services, essential querying services, advanced querying services, and data mining services. Figure 6.2 illustrates these categories in four layers, as each is built atop another.

Figure 6.2: ProDA's Query Engine

6.4.1 Browse Queries

We have designed this group of Web services to allow users to manage, explore, and modify the available datacubes.
Cube metadata browsing. These services let users explore the metadata of the available datacubes, such as the description, the schema, and the wavelet filter used for datacube transformation. Users can add, modify, or drop user-defined queries, and can browse through and reuse previously issued queries.

Cube content browsing. This family of services allows users to directly access the content of a selected datacube. Using this set, we can add a datacube, drop a datacube, or modify the wavelet coefficients of a datacube by issuing update queries.

User profile browsing. This family of services implements user access control. In addition, it provides facilities for users to add, drop, and modify user profile information.

6.4.2 Essential Querying Services

This class of services includes polynomial aggregate queries, slice-and-dice queries, and cursor functions.

Polynomial aggregate queries. The standard statistical range-aggregate queries are implemented as predefined queries in ProDA, e.g., count, sum, average, variance, covariance, and correlation. Furthermore, we have broadened the supported queries to encompass ad hoc statistical functions by providing a formal method of expressing analytical queries as polynomial functions of the data attributes. Users can define any polynomial expression on the attributes and share the definition with others. For example, a high-order polynomial function such as kurtosis (i.e., the fourth moment of the data divided by the square of the variance) is not predefined in ProDA; however, our utility environment allows users to define and share such complex queries on the fly.

We implement an arbitrary polynomial query by using two basic Web services, PushTerm and PushOperator. Given the importance of the order of the function calls, ProDA parses the expressions in post order. For instance, consider the implementation of the variance function using PushTerm and PushOperator, as the following example describes.

Example. Suppose we need to implement the variance function using PushTerm and PushOperator. Variance is defined as:

Var(x) = (Σ x_i²)/n − ((Σ x_i)/n)²

The post-order representation of this function is: Σx_i², n, /, Σx_i, n, /, ()², −. Therefore, we perform this polynomial query with the following 9 calls:

PushTerm(x, 2);
PushTerm(x, 0);
PushOperator('/');
PushTerm(x, 1);
PushTerm(x, 0);
PushOperator('/');
PushOperator('sqr');
PushOperator('-');
Submit();

Figure 6.3: Post-ordered Polynomial Aggregate Query
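The following minimal C# sketch (ours, not ProDA's implementation) shows how such a post-ordered call sequence can be evaluated with a stack; PushTerm stands for the range aggregate Σ x_i^k that the server would compute in the wavelet domain, and here receives precomputed values for illustration.

using System;
using System.Collections.Generic;

// Minimal sketch of a stack-based evaluator for post-ordered polynomial
// queries (names ours).
class PolynomialQuery
{
    private readonly Stack<double> stack = new Stack<double>();

    public void PushTerm(double aggregateValue) => stack.Push(aggregateValue);

    public void PushOperator(string op)
    {
        if (op == "sqr") { double a = stack.Pop(); stack.Push(a * a); return; }
        double r = stack.Pop(), l = stack.Pop();
        stack.Push(op switch
        {
            "+" => l + r, "-" => l - r,
            "*" => l * r, "/" => l / r,
            _   => throw new ArgumentException(op)
        });
    }

    public double Submit() => stack.Pop();
}

class Demo
{
    static void Main()
    {
        // Variance of {1, 2, 3, 4}: Σx² = 30, Σx⁰ = n = 4, Σx = 10.
        var q = new PolynomialQuery();
        q.PushTerm(30); q.PushTerm(4); q.PushOperator("/");  // Σx²/n  = 7.5
        q.PushTerm(10); q.PushTerm(4); q.PushOperator("/");  // Σx/n   = 2.5
        q.PushOperator("sqr");                               // (Σx/n)² = 6.25
        q.PushOperator("-");                                 // 7.5 − 6.25
        Console.WriteLine(q.Submit());                       // 1.25
    }
}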
Slice-and-dice queries. Consider the scenario in which we wish to extract a region of the original data from its wavelet transform. This poses the following dilemma: we can either reconstruct the entire dataset and extract the desired region (which is infeasible), or reconstruct the desired region point by point (which is inefficient, particularly for large regions). Instead, we translate the selection operation of relational algebra to the wavelet domain and choose the coefficients required for the reconstruction of the desired range. By employing this technique, ProDA clients can access small subsets of the wavelet data instantly, because the server never needs to reconstruct the entire dataset.

This class of queries is used when ProDA users intend to download a small subset of the data onto their own machines for convenient interaction. For example, an oil production engineer usually analyzes a few production wells at a time, without accessing the data of the entire reservoir. This relevant data is usually small enough to be cached on the client machine. The cached data is readily updatable and enables efficient query processing. Meanwhile, ProDA allows the user to receive the exact result when the remote connection is available.

Cursor functions. ProDA provides progressive query answering by ordering the wavelet coefficients based on the significance of the query coefficients. Hence, ProDA incrementally retrieves the data coefficients related to each query from the storage engine. Cursor functions track the progress of the query operations and the data retrieval operations. They also let users stop the query processing any time they are satisfied with the intermediate approximate result.

6.4.3 Advanced Querying Services

Utilizing the essential querying services, we efficiently implement two widely used advanced queries: batch queries and plot queries (a.k.a. range group-by queries). Furthermore, we provide additional cursor functionality for these two query classes by prioritizing certain query regions.

Batch queries. Scientists typically submit queries in batches rather than issuing individual, unrelated queries. We have proposed a wavelet-based technique that exploits I/O sharing across a query batch for efficient and progressive evaluation of batch queries. The challenge is that controlling the structure of the errors across the query results now becomes more critical than minimizing the error of each individual query. We have defined a class of structural error-penalty functions in our framework to achieve optimal progressiveness for a batch of queries. Users can invoke the batch query services by specifying a grid over the data dimensions. Thereafter, they can progressively receive the results for the entire batch.

Plot queries. Plots are among the most important and widely used tools in scientific data analysis and visualization applications. In general, each plot point is an aggregate value over one or more measure attributes for a given dimension value. The current practice for generating a plot over a multidimensional dataset involves computing the plot point by point, where each point is the result of an aggregate query. Therefore, for large plots, a large number of aggregate queries are submitted to the database. This method is neither memory-efficient (on either the client or the server side) nor communication-efficient.

We instead process a plot as a single database query, the range group-by query, and employ a wavelet-based technique that exploits I/O sharing across the aggregate queries for all plot points to evaluate the plot efficiently. The main idea is to decompose a plot query into two sets: a set of aggregate queries and a set of slice-and-dice queries. Subsequently, we use our earlier results to compute both sets of queries efficiently in the wavelet domain. Users invoke the plot query services by specifying a range over the data dimensions and selecting the independent variable for the plot. A developer can employ the following advanced cursor functionality to generate the plot output progressively.

Advanced cursor functionality. Scientists consider batch queries and plot queries among their most favored statistical analysis tools. These queries are widely used to provide valuable insights about any dataset; for example, we can extract outliers, trends, clusters, or local maxima by quickly looking at their output.
Furthermore, the entire query result is often not used at once. For instance, the result may not fit on the screen, the user may point to a specific region of the result, or the user may prioritize subsets of the result (e.g., local maxima, or regions with high values or high gradients) to be computed first. Accordingly, the advanced cursor functionality allows users to modify the structural error-penalty functions to control the progressiveness of the query.

6.4.4 Data Mining Services

We are currently designing and implementing additional analytical query processing components in ProDA to support complex data mining functionalities. So far, we have enhanced ProDA by incorporating clustering and outlier detection techniques, and an effective visualization tool for exploratory data mining. For clustering, we use various methods, such as K-means with different distance functions, to cluster batch or plot query results. Using a similar approach with a customized penalty function, ProDA enables progressive answering of outlier detection queries. The visualization tool generates timestamped KML files to be imported into Google Earth for effective spatial and temporal visualization.

6.5 Top Tier: Visualization Engine

By pushing the extensive and complex data processing tasks to the server side, the ProDA client can be implemented as a light yet effective interface. Figure 6.4 demonstrates a sample client that invokes ProDA Web services for query processing. First, we create an instance of the ProDA services, then select a datacube with the appropriate login information. Next, we specify a range and submit a variance query. Finally, we use the cursor services to obtain the query result progressively.

// Creating an instance and storing session state
ProDAWebServices pws = new ProDAWebServices();
// Selecting a cube with the login information
pws.SelectDB(dbName, userName, password);
// Defining a range and submitting a query
pws.SetRange(lowerLeft, upperRight);
pws.Variance(1);
// Asking for the result progressively
while (pws.HasMore())
    Console.Write("Result=" + pws.Advance(0.05)); // advance in 5% increments

Figure 6.4: A sample ProDA client for progressive querying (in C#)

In addition, we have developed a stand-alone graphical C# client, the ProDA client, for efficient interaction with arbitrary scientific datasets. We emphasize the graphical interface as a more intuitive query interface. We have incorporated smart-client functionalities (e.g., smart data management, online/offline capability, a high-fidelity UI) into the ProDA client; thus, ProDA provides an adaptive, responsive, and richly interactive experience by leveraging local resources and intelligently connecting to distributed data sources.

The ProDA client consists of data and query visualization, a high-fidelity UI, connectivity management, and advanced visualization. In short, ProDA lets the user select a datacube and visualize the data. It also accepts queries from the user and displays the results progressively in offline or online mode, provides resource sharing on the client machine, and utilizes advanced commercial visualization tools.

6.5.1 Data Visualization

This module lets the user log in to the ProDA storage engine, browse the data sources, and select one to interact with. It then provides a visualization of the selected dataset for the user to browse, scroll, and rotate. It also facilitates the definition of the desired ranges and queries.
Defining a bounding box for a single range query, and a grid for batch queries over all dimensions, is one of many necessary functionalities the data visualization module must provide. Figure 6.5 demonstrates a sample oilfield sensor data visualization; the grid is specified by the user for batch query processing.

Figure 6.5: Data Visualization

The data visualization module handles various attribute types, including spatial, temporal, numeric, and categorical. In addition to presenting the hierarchy of the dimension values, ProDA displays dependent dimensions and allows the user to work with all dimensions simultaneously. For example, while exploring the oilfield sensor data, we can select a set of oil wells by identifying a certain window of interest or, alternatively, by choosing the wells based on their corresponding labels.

Once a user selects a datacube, ProDA enables the list of available queries. This list includes common analytical queries, user-defined polynomial range-aggregate queries, plot queries, and slice-and-dice queries. In addition, the user can define a new polynomial query, add it to the list, and even share it with other users of the dataset.

6.5.2 Query Visualization

The ProDA client displays the output of queries as soon as it becomes available, and then updates it frequently as new refinements arrive. It also visualizes the output of batch queries and plot queries using various advanced built-in chart types. As a query progresses, the user can start interacting with the result through the operations of the ProDA visualization module. Zooming in and out, pivoting, and exporting are among the many possible actions.

Figure 6.6: Query Visualization

When real-world data contains noise, the output of a plot query displays undesired small variations. These variations not only carry no valuable information but also confuse users analyzing the data. Leveraging wavelet hard thresholding, ProDA supports advanced denoising of the query output and generates smoother outputs, especially when the data carries white Gaussian noise.

6.5.3 High-fidelity UI

ProDA, as a well-designed smart client, guarantees that the utilities already installed on the client machine can access and process the data that ProDA generates. This is essential in practice, because typical users are accustomed to using Microsoft Office components, especially Microsoft Excel, in addition to ProDA's built-in visualization packages. Toward this end, the ProDA client provides a similar interface to improve users' productivity and decrease training costs. In particular, we use the Microsoft Excel pivot table components designed for ad hoc analysis of large quantities of data. A pivot table is a powerful reporting tool that features basic to complicated calculation modules independent of the spreadsheet's original data layout. With the table's drag-and-drop function, users can pivot the data and perform local computation tasks. Consequently, users get an interactive table that automatically extracts, organizes, and summarizes the data. They can use this report to analyze the data, make comparisons, detect patterns and relationships, and discover trends.

With its extensive set of export functionalities, ProDA can be connected to almost any application. At any time, users can export the data to XML, Excel, text files, and many other formats. Figure 6.7 shows ProDA's export functionality.
Figure 6.7: ProDA's Export

6.5.4 Connectivity Management

ProDA's client lets a user cache a subset of a dataset while the system is online. Later, when the system is offline, the user can utilize the cached data to access all of ProDA's functionalities as if the system were still connected to the server. Offline query processing is an especially attractive feature for mobile users who need access to the data during disconnected operation. Hence, we empower ProDA's clients to support offline wavelet-based query processing: the user can query the cached wavelet-transformed sketch and receive an excellent approximate answer.

6.5.5 Advanced Visualization

We employ widely accepted commercial products for universal spatial data representation. In particular, ProDA exports spatial query results to Google Earth for advanced visualization. This tool allows a group of users to exchange their query results with each other while viewing other spatial data related to the problem.

6.6 Summary

ProDA enables exploratory analysis of massive multidimensional datasets. Standard OLAP systems that rely on query precalculation are expensive to update, whereas traditional, easily updated databases often have poor response times for analytical queries. With ProDA, we employ WOLAP technology to support exact, approximate, and progressive OLAP queries on large multidimensional datasets, while keeping update costs relatively low. ProDA extends the set of supported analytical queries to include the entire family of polynomial aggregate queries as well as the important class of plot queries (a.k.a. range group-by queries).

Chapter 7

Experiments

In this chapter, we evaluate the work proposed in the previous chapters. In particular, we extensively study the efficiency of our range aggregate query technique, our data maintenance operations, and our range group-by query method. We emphasize that our experiments are performed on real datasets using our real system (see Chapter 6).

7.1 Range Aggregate Query Processing with Wavelets

In this section, we empirically examine WOLAP using four real-world and three synthetic datasets. First, we describe the datasets used in our study. Next, after preparing the WOLAP cubes, we compare the WOLAP cube models and show that the CFM model is the method of choice for scientific data analysis. We then demonstrate the advantage of WOLAP in progressive and approximate query processing. Later, we show that the WOLAP model outperforms general OLAP cubings and other wavelet-based aggregation methods. Finally, we show the advantage of the optimal disk block allocation in reducing the number of disk I/Os required for aggregate query processing.

7.1.1 Experimental Datasets

We evaluate our framework with four real-world scientific datasets, namely LH, RAIN, AIRS, and GPS. In addition, we generate three synthetic datasets with higher dimensionality for further comparisons.

LH contains monthly production and injection history data for a waterflood oil reservoir, provided by Chevron. This dataset includes well ID and time as dimension attributes, and oil production, gas production, water production, water injection, steam injection, and CO2 injection as measure attributes. We obtained this data for 57 years; its size is 1 GB.

RAIN [WC94] measures the daily precipitation for the Pacific Northwest over 45 years. It consists of three dimension attributes, latitude, longitude, and time, and one measure attribute, precipitation.
The size of this dataset is 9 MB.

AIRS, standing for Atmospheric Infrared Sounder, collects the Earth's atmospheric water vapor and temperature profiles at a very high rate. This dataset, provided by NASA/JPL, includes latitude, longitude, pressure level, and time as dimension attributes, and temperature, water vapor mass mixing ratio, and ozone volume mixing ratio as measure attributes. This data is gathered over a year and has a size of 320 GB.

GPS, the GPS Occultation Observations Level 2 dataset, contains profiles of atmospheric temperature, pressure, refractivity, and water vapor pressure with a resolution of about a kilometer, derived from radio occultation data. This dataset is provided by NASA/JPL and includes latitude, longitude, pressure level, and time as dimension attributes, and temperature, refractivity, altitude, and water vapor pressure as measure attributes. We obtained this data for a 9-month period; its size is 2 GB.

S5, S6, and S7 are three synthetic datasets with five, six, and seven dimensions, respectively. Each cell contains a random number between 0 and 100. The sizes of S5, S6, and S7 are 8 MB, 128 MB, and 2 GB, respectively.

7.1.2 WOLAP Datacubes

We generated the WOLAP datacubes corresponding to each dataset using our data models: DFD, CFM, and DFD-DM. First, it is essential to determine the maximum degree of the polynomial queries submitted to our database, since the polynomial degree directly affects both models. Trivially, the size of the CFM model grows linearly as the polynomial degree increases, based on Lemma 9. Here, we investigate how the polynomial degree affects the DFD model. Since all the attributes become cube dimensions in this model, we must quantize them uniformly (see Definition 8 for quantization details). By limiting the quantization error to 1%, we determined the size of the DFD dimensions for the measure attributes.

Next, we transformed our DFD cubes with 5 different filters: db1 (Haar), db2, db3, db4, and db5. Subsequently, we generated 100 random range sum queries (a random range with polynomial degree 1 for each query) and counted the number of coefficients required to answer each query. Figure 7.1 shows the results of this experiment. As expected, using longer filters dramatically increases the query cost; note that the y-axis is on a logarithmic scale. This observation suggests that we must limit the filter length to the minimum required length l = 2δ + 2 (see Definition 7).

Figure 7.1: Query Performance vs. Wavelet Filter

For the rest of our experiments we assume δ = 2 on the measure attributes, to efficiently process the common statistical queries (e.g., count, sum, average, covariance, variance, and correlation). Therefore, we wavelet transformed our DFD cubes with db3. For the CFM cubes, we generated one cube for each polynomial term up to degree 2 and
For the CFM cubes, we generated one cube for each polynomial term up to degree 2 and transformed it with db1. Therefore, we generated 43, 3, 13, 21, 3, 3, and 3 cubes for LH, RAIN, AIRS, GPS, S5, S6, and S7, respectively. Finally, we generated one DFD-DM cube for each dataset by separating the wavelet filters for dimension attributes and measure attributes. That is, after generating another set of DFD cubes, we transformed them with db1 along dimension attributes and db3 along measure attributes. Let us note that we performed the data transformation off-line, using our efficient data transformation method [JSS05].

7.1.3 Storage, Query, and Update Performance

In this section we compare the WOLAP cube models in terms of storage, query, and update performance. We generated 100 random range-sum queries (a random range with polynomial degree of 1 for each query) and 100 random update queries (an update of a random data point per query). We report the average performance over all these queries.

[Figure 7.2: Storage, Query, and Update performance of WOLAP's cube models. Number of coefficients (a) stored, (b) retrieved, and (c) updated for DFD, DFD-DM, and CFM on all seven datasets.]

Figure 7.2 shows that the CFM model significantly outperforms DFD and DFD-DM in storage, query, and update performance. For the DFD and DFD-DM cubes, we observe that the number of coefficients stored or queried increases as the total number of attributes grows, whereas the number of dimension attributes is the only effective parameter in the CFM model. Following the same trend, DFD-DM cubes outperform DFD cubes in all three figures because they use shorter wavelet filters for dimension attributes.

In particular, Figure 7.2a shows that the storage requirement of the DFD models grows exponentially with the number of attributes t = d + m, whereas the CFM models are affected mainly by the number of dimension attributes d alone. While it is practically impossible to generate and store a DFD cube for very high-dimensional datasets, CFM cubes are constructed using reasonable storage space even though we generate more than one datacube. For example, observe the significant difference between the CFM model and the DFD model of the LH dataset in Figure 7.2a. Although we store 43 FM cubes in the CFM model of LH, it requires storing only 121 million coefficients, which is significantly smaller than the 378 million million coefficients of its single DFD cube. This figure basically shows that F ≪ l^m log^m N (see Table 3.1 for more details). Also, notice the significant difference between the storage requirements of the DFD-DM model and the DFD model: Figure 7.2a shows that the DFD-DM model outperforms the DFD model by 90% in the worst case. Note that the y-axis is in logarithmic scale.

Figure 7.2b depicts the range aggregate query performance. CFM significantly outperforms DFD for all datasets. In particular, the significant difference between the DFD and the CFM query performance on the LH data shows the inefficiency of DFD in querying datasets with a large number of attributes. Also notice that the query performance on AIRS and GPS is identical in the CFM model. This is due to the fact that both have the same cube sizes in the CFM model and that query performance is independent of the cardinality of the CFM model. Moreover, Figure 7.2b shows the significant difference between the query cost of the DFD-DM model and the DFD model: here, the DFD-DM model outperforms the DFD model by 85% in the worst case.

Figure 7.2c illustrates the update performance for our datasets. Similarly, CFM outperforms DFD, although multiple cubes need to be updated for a single update query in the CFM model. With the CFM model, the update cost is generally higher than the query cost because multiple cubes (F FM cubes) need to be updated (see Table 3.1). Similar to the query comparison results, the update cost of the DFD-DM model is significantly better than that of the DFD model (85% less in the worst case).
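The logarithmic update costs reported here come from the locality of the wavelet transform: a single-cell update touches one coefficient per resolution level. A minimal one-dimensional Haar sketch of this idea follows (plain NumPy; the haar() and point_update() names are ours for illustration and are not part of the system):

    import numpy as np

    def haar(x):
        # Orthonormal 1-D Haar transform (length must be a power of two);
        # returns [details level 1, ..., top detail, overall average].
        x = np.asarray(x, dtype=float)
        out = []
        while len(x) > 1:
            out.append((x[0::2] - x[1::2]) / np.sqrt(2.0))
            x = (x[0::2] + x[1::2]) / np.sqrt(2.0)
        out.append(x)
        return out

    def point_update(coeffs, i, delta):
        # Fold the update x[i] += delta into the transform in place:
        # exactly one coefficient changes per level, O(log N) in total.
        for d in coeffs[:-1]:
            d[i // 2] += (delta if i % 2 == 0 else -delta) / np.sqrt(2.0)
            delta /= np.sqrt(2.0)
            i //= 2
        coeffs[-1][i] += delta

With longer filters such as db3, a constant number of coefficients per level changes instead of one, and applying the same argument along each dimension yields the polylogarithmic update costs discussed in this section.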
For the rest of the experiments, we use only the CFM model, since we have shown that it is the superior cube model in WOLAP; the general trends remain the same for the DFD cubes as well.

7.1.4 Approximation and Progressiveness

In this section we investigate the WOLAP performance in progressive and approximate query processing. First, we monitor the average progression of 100 random range-sum queries (a random range with polynomial degree of 1 for each query). In this set of experiments, we order the query coefficients based on their absolute values (Highest-B ordering of the query), as this is the best progressive evaluation without extra storage. We refer the reader to [SJS05] for more empirical studies on progressive query processing.

Figure 7.3 shows that for all the datasets we achieve 99% query accuracy by retrieving only 30% of the data coefficients required for each query. It is important to emphasize that the total number of data coefficients required for each query is already small, O(l^d log^d N), not the entire datacube. Note also that we have reported the mean query time (qt) for each dataset in Figure 7.3 to demonstrate the practicality of WOLAP in the real world: the average query time in our experiments is only a few seconds using our working system.

[Figure 7.3: WOLAP Query Progressiveness. Query mean relative error (log scale) vs. query progress, with mean query times qt = 0.7 s (RAIN), 5.0 s (AIRS), 4.9 s (GPS), 0.3 s (LH), 0.4 s (S5), 3.2 s (S6), and 24 s (S7).]

Now, let us approximate the data by keeping a subset of significant data coefficients, called a synopsis, using Highest-B hard thresholding. For each synopsis size, we performed 100 random sum queries and computed the mean relative error with respect to the exact answer computed from the complete data. Figure 7.4 shows that for all the datasets we achieve excellent query results, less than 1% relative error, while keeping less than 1% of the data coefficients. This set of experiments empirically illustrates that the wavelet transform is a favorable tool for data compression and query answering at the same time. This is more noticeable when a dataset is well-compressible, like the AIRS dataset in this study: for AIRS, storing only 0.001% of the data still results in 99% query accuracy. Combining this impressive compression rate with the extra storage needed for the extra cubes, WOLAP provides good approximations for polynomial queries with a small amount of storage space.

[Figure 7.4: WOLAP with Data Approximation. Query mean relative error (log scale) vs. synopsis size as a fraction of data size (log scale) for all seven datasets.]
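Highest-B hard thresholding itself is straightforward. A minimal sketch with PyWavelets (illustrative only; the function name and parameters are ours) keeps the B largest-magnitude coefficients and zeroes the rest; for an orthonormal transform this choice minimizes the L2 reconstruction error for a given B.

    import numpy as np
    import pywt

    def highest_b_synopsis(data, wavelet, B):
        # Transform, keep the B largest-magnitude coefficients, reconstruct.
        coeffs = pywt.wavedec(data, wavelet, mode='periodization')
        flat, slices = pywt.coeffs_to_array(coeffs)
        kept = np.zeros_like(flat)
        top = np.argsort(np.abs(flat))[-B:]
        kept[top] = flat[top]
        coeffs_b = pywt.array_to_coeffs(kept, slices, output_format='wavedec')
        return pywt.waverec(coeffs_b, wavelet, mode='periodization')

In our setting the thresholded coefficients are not reconstructed but stored as the synopsis and queried directly in the wavelet domain.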
7.1.5 WOLAP vs. General OLAP Cubings

We compared WOLAP's performance, as an exact algorithm, with the best-known range aggregate techniques in Table 8.1. Thus, our focus in this section is to compare WOLAP with two general OLAP cubing techniques: Dwarf [SRDK02] and Cure [MI06]. Dwarf, the most prominent OLAP cubing technique, prunes various redundancies from datacubes and stores them in tree-like data structures. Cure, the most recent ROLAP technique, removes all forms of storage redundancy and efficiently stores the complete datacubes in a relational format.

Using the Dwarf, Cure, and WOLAP techniques, we constructed the corresponding cubes for our real-world datasets. Here, we assume that the desired aggregate query is sum; therefore, we only pre-compute this aggregate value for the cubing methods. Next, we perform 100 random range-sum queries and report the mean query performance. Figure 7.5 shows the amount of storage used and the amount of data retrieved for all three methods. Note that the y-axis is in logarithmic scale.

In terms of query performance, WOLAP significantly outperforms the two cubing techniques, Dwarf and Cure. This is not surprising, because the cubing algorithms pre-compute all combinations of the group-by queries, not all arbitrary range queries. Therefore, all tuples/items inside the range must be retrieved for a random range query. On the contrary, WOLAP retrieves only a few coefficients, on the order of the logarithm of the data size, independent of the range size.

In terms of storage, all three methods require almost the same amount of storage space, except on GPS (see Figure 7.5c). This is due to the fact that the GPS data is very sparse; WOLAP, similar to other MOLAP techniques, requires more disk space for storing sparse datasets.

[Figure 7.5: WOLAP vs. general OLAP cubings. Storage and query sizes (Kilobytes, log scale) for Cure, Dwarf, and WOLAP on (a) LH, (b) RAIN, (c) GPS, and (d) AIRS.]

7.1.6 WOLAP vs. Other Wavelet-Based Techniques

In this section, we compare the accuracy of WOLAP's progressive estimates with two other wavelet-based aggregation techniques: the progressive version of the compact data cube (CDC) [VW99] and the progressive version of the approximate cube with the nonstandard transformation (NON) [CGRS00]. We cannot directly compare WOLAP with the original techniques because they do not provide a progressive method. We perform 100 random range-sum queries and report the mean relative error of the queries as they progress. Here, we limit our presentation to the AIRS and RAIN datasets due to space limitations; we observed similar trends for the other datasets.

Figures 7.6a and 7.6b show the results of our experiments. CDC performs slightly better than the other two at the beginning, but it can never answer a query with 100% accuracy using the same number of retrievals, because this method requires a complete scan of the data. NON, however, retrieves only the coefficients relevant to each query by reconstructing the query region in a multiresolution fashion. Yet WOLAP dramatically outperforms NON on both datasets. The significant difference between WOLAP and NON comes from the fact that NON employs the nonstandard wavelet transformation, which is not suitable for aggregation: NON retrieves the coefficients relevant to the reconstruction of the query range, whereas WOLAP retrieves only the coefficients related to the aggregate value. Figures 7.6a and 7.6b also demonstrate that WOLAP provides an accurate result long before the entire retrieval is complete, while CDC and NON must retrieve most of the related coefficients to provide accurate-enough results. Note that both axes are in logarithmic scale.

[Figure 7.6: WOLAP vs. other wavelet-based techniques. Query mean relative error (log scale) vs. number of coefficients retrieved (log scale) on (a) RAIN and (b) AIRS.]
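The progressive estimates evaluated in this comparison follow the Highest-B query ordering of Section 7.1.4. A minimal one-dimensional sketch (PyWavelets; the function name and coefficient layout are illustrative assumptions) shows the mechanism: since a periodized orthonormal DWT preserves dot products, accumulating the products of query and data coefficients in decreasing |query coefficient| order yields a sequence of estimates that converges to the exact range sum.

    import numpy as np
    import pywt

    def progressive_range_sum(data_hat, query, wavelet='db3'):
        # data_hat: flattened DWT of the data, laid out as below.
        q_hat, _ = pywt.coeffs_to_array(
            pywt.wavedec(query, wavelet, mode='periodization'))
        order = np.argsort(-np.abs(q_hat))   # Highest-B ordering
        running, estimates = 0.0, []
        for k in order:
            if q_hat[k] == 0.0:
                break                        # remaining terms contribute nothing
            running += q_hat[k] * data_hat[k]
            estimates.append(running)
        return estimates                     # the last entry is exact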
7.1.7 Disk Block Allocation

In this section, we study how optimal allocation of the wavelet coefficients on disk can improve the performance of WOLAP. First, we compare our optimal allocation strategy with other allocation methods. Next, we demonstrate how the use of larger disk blocks dramatically reduces the number of I/Os. Here, we perform 100 random range-sum queries and report their average performance. Note that we report results on the AIRS dataset only, due to space limitations; a similar trend is observed for the other datasets.

The problem addressed here is novel, and thus we lack a competitor method for direct comparison. However, there are two "common sense" block allocations which provide enlightening benchmarks. The first technique is the most naive: storing the multidimensional wavelet coefficients in row-major order on disk. This is the allocation we would obtain by default when we store the wavelet data as a one-dimensional array on disk and let the file system determine the block allocation (in fact, this is how we compute the coefficients in the first place [SS02b]). Our second technique has more respect for the multidimensional structure: we slice the data domain into cubes by slicing each dimension into virtual blocks, and store each cell on a disk block. This gives us a natural multi-index for our disk blocks and for the wavelet coefficients in each block.

[Figure 7.7: Disk Block Allocation Methods. Query mean relative error vs. number of disk blocks retrieved for row-major, multidimensional, and optimal allocation.]

[Figure 7.8: Effect of Large Disk Blocks. Query mean relative error vs. number of disk blocks retrieved for block sizes of 1K, 2K, 4K, 8K, and 16K.]

Figure 7.7 shows that optimal allocation significantly outperforms the two naive allocation methods. In particular, optimal allocation requires retrieving 50% fewer disk blocks than the naive methods. In addition, optimal allocation leads to better progression, as its excellent convergence in Figure 7.7 shows. Here, we have assumed that the size of each disk block is 1K.

Now, we choose larger disk blocks to study the effect of disk block size on WOLAP performance. In particular, we have performed our experiments for disk blocks of size 1K, 2K, 4K, 8K, and 16K. Figure 7.8 illustrates that both the number of retrievals and the convergence improve as we use larger disk blocks.
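The two benchmark allocations can be compared with a few lines of code. The sketch below (plain NumPy; the function names and arguments are ours, not the system's) counts how many distinct disk blocks a given set of coefficient coordinates touches under the row-major layout and under a multidimensional tiling; the optimal allocation studied in this section can be viewed as a tiling whose tiles match the access pattern of wavelet queries.

    import numpy as np

    def blocks_row_major(coords, shape, block_size):
        # Default layout: flatten in row-major order, cut into fixed blocks.
        flat = np.ravel_multi_index(np.asarray(coords).T, shape)
        return len(set(flat // block_size))

    def blocks_tiled(coords, tile_shape):
        # Tiled layout: slice each dimension; one tile per disk block.
        return len({tuple(c // t for c, t in zip(pt, tile_shape))
                    for pt in coords})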
7.2 Maintenance of Wavelet-Transformed Data

In this section, we study the performance of the SHIFT-SPLIT operations in three real-world scenarios. First, we use these operations to transform a large dataset into the wavelet domain. Next, we show how the SHIFT-SPLIT operations are employed for the maintenance of transformed data in an appending scenario. Finally, we show the significant improvement in the update cost of maintaining a wavelet synopsis in a data stream application by employing additional memory as a buffer. We would like to emphasize that the experiments are accurate implementations of the operations on real disks with real disk blocks.

7.2.1 Transformation of Massive Multidimensional Datasets

In this set of experiments, we transform a large dataset, TEMPERATURE, into the wavelet domain using limited available memory. The TEMPERATURE dataset is a real-world dataset provided to us by JPL that measures the temperatures at points all over the globe at different altitudes for 18 months, sampled twice every day. We construct a 4-dimensional cube with latitude, longitude, altitude, and time as dimension attributes, and temperature as the measure attribute; the total size of the cube is 16 GB.

[Figure 7.9: Effect of Larger Memory. I/O cost (number of coefficients) vs. memory size for Vitter et al., SHIFT-SPLIT (standard), and SHIFT-SPLIT (non-standard); d = 4, dataset = 16 GB.]

[Figure 7.10: Effect of Larger Tiles. I/O cost (number of blocks) vs. dataset size for the standard and non-standard forms with 1 KB and 4 KB tiles; d = 2, memory = 64 MB.]

Figure 7.9 shows that larger memory considerably reduces the transformation cost of SHIFT-SPLIT in the standard form, but does not noticeably affect SHIFT-SPLIT in the non-standard form. The reason is that the cost of the SPLIT operation differs considerably between the two forms of multidimensional wavelet transformation. Increasing the memory size causes a significant decrease in SPLIT cost, and consequently a major decrease in the cost of the standard-form transformation, as many coefficients are affected by the contributions of the SPLIT operation. However, the SPLIT cost is almost negligible in the non-standard form (see Table 4.2). Finally, this figure also shows that our SHIFT-SPLIT approach outperforms the Vitter et al. [VW99] algorithm for any memory size.

As we have shown in Section 4.2, not only is tiling the optimal wavelet coefficient blocking for query processing, it is also a SHIFT-SPLIT-friendly schema which introduces significant cost improvements in the transformation process. Figure 7.10 demonstrates this fact using different tile sizes, and thus illustrates the scalability of the SHIFT-SPLIT algorithm.

7.2.2 Appending to Wavelet-Transformed Data

We examine our proposed appending technique on the RAIN [WC94] dataset, where we incrementally receive a new set of data every month. RAIN is a real-life dataset that measures the daily precipitation for the Pacific Northwest for 45 years. We built a 3-dimensional cube with latitude, longitude, and time as dimension attributes, and precipitation as the measure attribute for every day. The sizes of these dimensions are 8, 8, and 32, respectively, for each month.

[Figure 7.11: SHIFT-SPLIT in Appending. I/O cost (number of blocks) vs. time (days) for tile sizes of 2K, 4K, and 8K, with an appending rate of one month.]

Figure 7.11 demonstrates the SHIFT-SPLIT I/O cost as new sets of data are appended. The sudden jumps in the figure correspond to the expansion process, where all coefficients must be shifted to accommodate the new data values. One can observe that this expansion process is not as dominating a factor as described in Section 4.4.2, especially for larger disk block sizes.
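To convey the intuition behind appending in the wavelet domain, the following simplified one-dimensional Haar sketch doubles a signal by appending a block of equal length: every existing detail coefficient is kept untouched, the new block is transformed independently, and only the two block averages are combined. It reuses the haar() helper from the earlier sketch and is an illustration of the idea only, not the SHIFT-SPLIT algorithm itself, which generalizes to arbitrary filters, partial appends, and both multidimensional forms.

    import numpy as np

    def haar_append(old_coeffs, new_block):
        # Append a block equal in length to the existing signal.
        new_coeffs = haar(np.asarray(new_block, dtype=float))
        # Details of both halves survive unchanged, level by level.
        details = [np.concatenate([o, n])
                   for o, n in zip(old_coeffs[:-1], new_coeffs[:-1])]
        a_old, a_new = old_coeffs[-1][0], new_coeffs[-1][0]
        # Only the top of the pyramid is recomputed: one new detail
        # coefficient and one new overall average.
        details.append(np.array([(a_old - a_new) / np.sqrt(2.0)]))
        details.append(np.array([(a_old + a_new) / np.sqrt(2.0)]))
        return details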
7.2.3 Data Stream Approximation

In this scenario we only need to preserve a synopsis of the RAIN dataset, limited to a memory footprint of 40 KB. Figure 7.12 demonstrates the computational-cost versus extra-storage trade-off described in Section 4.4.3. As the figure suggests, the update cost can be improved by 88% by employing an additional buffer memory of only 6% of the total synopsis size.

[Figure 7.12: SHIFT-SPLIT in Multidimensional Streaming. Number of updates per item vs. extra memory (%), showing an 88% cost reduction at 6% extra memory.]

7.3 Range Group-by Query Processing with Wavelets

In this section, we empirically examine our proposed method with three multidimensional datasets. We would like to emphasize that our experiments are performed on real datasets using our fully functional system (see Chapter 6 for further information). We start the experiments by describing the datasets employed in our study. Next, we compare the query performance of our range group-by technique with individual queries in both the original domain (Naive Batch Queries) and the wavelet domain (Batch Queries with Wavelets). Finally, we study the progressiveness and compare the different forms of approximation.

7.3.1 Experimental Setup

We evaluate our framework with three real-world scientific datasets, namely LH, GPS, and AIRS, described in Section 7.1.1. We wavelet-transformed the datacubes using our efficient transformation technique [JSS05] and stored them on disk using our efficient multidimensional tiling [SS04]. Each tile contains the wavelet coefficients that are related to each other under the particular access pattern of wavelets, to minimize the number of disk I/Os needed to perform any operation in the wavelet domain. By reporting the number of retrieved "coefficients" in our experiments, we do not include the advantage of using this technique.

7.3.2 Performance Analysis

We generate 100 random range group-by queries (a random range for each query) and count the number of disk I/Os required to answer each query. We perform this experiment on our three datasets using the three algorithms discussed in this dissertation: Naive Batch Queries, Batch Queries with Wavelets, and our proposed technique. The average number of I/Os across the queries is depicted in Figure 7.13.

[Figure 7.13: Performance Analysis of Range Group-by Queries. Number of coefficients (log scale) for Naive Batch Queries, Batch Queries with Wavelets, and our approach on LH, GPS, and AIRS.]

Generally, Batch Queries with Wavelets outperforms Naive Batch Queries because each aggregate query is evaluated by a less costly method (O(N^d) is reduced to O(log^d N)). More importantly, this figure shows that our proposed approach dramatically outperforms both. Note that the y-axis is in logarithmic scale. The reason is that our proposed technique is a one-pass algorithm which shares the coefficients among group values, whereas Batch Queries with Wavelets requires submitting a large set of individual aggregate queries.
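For concreteness, the answer a range group-by query must produce can be written directly over the raw cube; this is exactly the Naive Batch Queries baseline. A minimal sketch (plain NumPy; the function name and arguments are ours) scans every cell in the selected range, which is why its cost is O(M^d) in the range size M:

    import numpy as np

    def naive_range_group_by(cube, ranges, group_axis=0):
        # ranges: a tuple of slices, one per dimension of the cube.
        sel = cube[ranges]
        # Sum out every dimension except the grouping one.
        other = tuple(a for a in range(cube.ndim) if a != group_axis)
        return sel.sum(axis=other)

Our wavelet-based approach produces the same vector of group values but computes it in the transformed domain, sharing the retrieved coefficients across the groups.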
Now, we study the effect of range size on the performance of range group-by queries. We generate 10 queries for each of several range sizes, from 1% to 35% of the entire dataset. The results are shown in Figure 7.14; the median over the 10 queries is reported for each point. Note that for this set of experiments and the following ones, we report results only on the AIRS dataset, because the trends and observations on the other datasets were similar.

[Figure 7.14: Effect of Range Size on Range Group-by Queries. Number of coefficients (log scale) vs. range size (% of data) for the three methods.]

Here, as the range size grows, the number of coefficients increases for Naive Batch Queries, as expected, since its complexity is O(M^d). However, the number of coefficients for the other two techniques, Batch Queries with Wavelets and our proposed technique, remains almost constant as the range grows. This was discussed earlier when we justified the use of wavelets for aggregate queries: the complexity of range aggregate queries with wavelets is O(log^d N), which is independent of the range size. In addition, this figure shows the significant difference between the two wavelet methods, which is due to the I/O sharing in the new approach. Note that the y-axis is in logarithmic scale in this figure.

Now, we study the effect of the number of grouping dimensions on the performance of range group-by queries. We generate 100 random range group-by queries for different numbers of grouping dimensions: g = 1, g = 2, and g = 3. The results are shown in Figure 7.15; the median over the 100 queries is reported for each point.

[Figure 7.15: Effect of the Number of Grouping Dimensions on Range Group-by Queries. Number of coefficients (log scale) for g = 1, 2, 3.]

Here, as the number of grouping dimensions grows, the number of retrieved coefficients increases for both wavelet methods, whereas the number of retrievals remains the same for Naive Batch Queries (O(M^d)). For Batch Queries with Wavelets, with cost O(M^g log^d N), as g grows the number of required aggregate queries dramatically increases without any change in their individual aggregation costs. However, for our proposed approach, with cost O((M + log(N/M))^g log^(d-g) N), the increase is more gentle, because as the number of aggregations increases, the cost of each aggregation decreases accordingly.

Next, we compare the three methods with respect to the multi-resolution parameter. We generate 100 random queries for each resolution level; a higher resolution refers to a coarser view of the data. We show the average number of I/Os required to answer these queries versus resolution in Figure 7.16.

[Figure 7.16: Range Group-by Query at Different Resolutions. Number of coefficients (log scale) vs. resolution level for the three methods.]

With Naive Batch Queries, processing the query for a coarser view costs exactly the same as for finer resolutions, because for coarser views we submit fewer aggregate queries while the complexity of answering each query is higher. As the resolution grows (coarser views), Batch Queries with Wavelets performs better; this is due to the fact that the number of range aggregate queries is reduced for coarser views. However, the slope of its cost reduction is not as steep as that of our proposed approach. The reason is that, no matter what the resolution is, the cost of the aggregate queries for Batch Queries with Wavelets is O(log^d N). On the contrary, our proposed method is resolution-aware and retrieves the data only up to the required level. The significant difference between the query processing methods is also illustrated in this figure. Note that the y-axis is in logarithmic scale.
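Resolution-awareness can be illustrated in one dimension with PyWavelets (an illustrative sketch with names of our choosing; here the fine details are zeroed after a full decomposition, which emulates a query plan that never reads them at all):

    import numpy as np
    import pywt

    def coarse_view(data, wavelet='db1', keep_levels=2):
        # Keep the approximation plus the `keep_levels` coarsest detail
        # levels; drop every finer level before reconstructing.
        coeffs = pywt.wavedec(data, wavelet, mode='periodization')
        for i in range(1 + keep_levels, len(coeffs)):
            coeffs[i] = np.zeros_like(coeffs[i])
        return pywt.waverec(coeffs, wavelet, mode='periodization')

A coarser view (smaller keep_levels) needs geometrically fewer coefficients, which is the source of the steeper cost reduction observed for our method in Figure 7.16.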
The reason is that no matter what the resolution is, the cost of aggregate queries for Batch Query with Wavelets is O(log d N). On the contrary, our proposed method is a resolution-aware method which only retrieves the data up to the required level. The significant difference between the query processing methods is also illustrated in this figure. Note that Y-axis is in logarithmic scale. We conclude our experiments with studying the progressiveness of our algorithm. We generate100 random range group-by queries on AIRS and report the mean relative error on the group values in its entirety. Here, we use six ordering schemas: 1. Favoring on Reconstruction phase with first-B on Aggregation 2. Favoring on Reconstruction phase with highest-B on Aggregation 3. Favoring on Aggregation phase with first-B 4. Favoring on Aggregation phase with highest-B 5. Hybrid ordering with first-B on Aggregation 6. Hybrid ordering with highest-B on Aggregation 0% 1% 10% 100% 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% Query Progress Mean Relative Error (log) Favor: Reconstruction, FB Agg Favor: Reconstruction, HB Agg Favor: Aggregation, FB Agg Favor: Aggregation, HB Agg Hybrid, FB Agg Hybrid, HB Agg Rec, FB Agg Rec, HB Agg Hybrid, FB Agg Hybrid, HB Agg Figure 7.17: Progressiveness in Range Group-by Query processing 127 Figure 7.17 shows that the first two algorithms are the inferior approaches whereas the last two orderings are the best algorithms to be used. The intuition behind our observation is that the first two algorithms compute each coefficient of the intermediate datacubeG exactly and then they move to computing the second coefficient. However, the hybrid methods estimate all coefficients of G at the same time. In fact, the best methods, the hybrid methods, estimate the coefficients of the result set (G) whereas the other approaches estimate the transformed result set ( ^ G) to be used for the reconstruction phase. 128 Chapter 8 Related Work 8.1 Range Aggregate Queries Gray et al. [GBLP96] proposed a new relational aggregation operator, the Data Cube, that accommodates aggregation of multidimensional data by precomputing group-by aggregations over all possible combinations of data dimensions. The inherent challenge for the cube operator was its huge size, both for computing and storing. Thus, a major body of OLAP studies focused on the efficient computation and implementation of dat- acubes. In fact, these techniques [SDN98, HRU96, SRDK02, MI06, LPZ03, ZDN97] provide a materialized view on data by precomputing the whole or a subset of group-by queries. However, these methods, in addition to their high update cost, are unable to efficiently process aggregate queries on arbitrary ranges since not all the ranges can be determined a priori at the time of computing the cubes. Extensive research, on the other hand, has been done to efficiently evaluate range aggregate queries over arbitrary ranges “on-the-fly” by indexing, transforming, or com- pressing the data. The prefix-sum method [HAMS97] publicized the fact that careful pre-aggregation can be used to evaluate range aggregate queries in constant time, inde- pendent of the range size. The update cost, however, could be as large as the size of the cube. This led to a number of new techniques [GAAS99, CI99, GAA00, RAA00b, CCL04, RAA01] that provided similar benefits with different query/update cost trade- offs (see Table 8.1). 
WOLAP not only competes with these techniques when an exact answer is needed, but can also provide excellent approximate/progressive answers when a fast approximate result is required. In the following table, d denotes the number of dimensions and N the domain size of each dimension.

Table 8.1: Query/Update trade-off for exact range aggregate algorithms

  Algorithm             Range-sum Query    Update Query
  Prefix-Sum            2^d                N^d
  Relative Prefix-Sum   4^d                √N^(d-1)
  SDDC                  2^d log^d N        log^d N
  WOLAP                 2^d log^d N        log^d N

For applications where quick approximate answers are needed, a number of different approaches have been taken. Histograms [GKMS01b, GKTD00, PG99] have been widely used to approximate the joint data distribution and therefore provide approximate answers to aggregate queries. Random sampling [GG02, GM98, HS95] has also been used to calculate synopses of the data cube. Vitter et al. have used the wavelet transformation to compress the prefix-sum data cube [VWI98] or the original data cube [VW99], constructing Compact Data Cubes. Such approximations share the disadvantage of being highly data-dependent and, as a result, can perform poorly in some cases. On the contrary, we have introduced a data-independent approximation approach by compressing queries rather than data. In fact, data compression is optional, since WOLAP can work with either compressed or uncompressed data.

Hellerstein et al. [HHW97] introduced the notion of progressiveness in query answering with feedback, using running confidence intervals. References [LM01, RAA00a] are also based on this notion. Wavelets and their inherent multiresolution property have been exploited to provide answers that progressively get better. Lemire [Lem02] transforms the relative prefix-sum cube to support progressive answering, whereas Wu et al. [WAA00] transform the data cube directly. These methods share a common strategy: answer queries quickly using a low-resolution view of the data, and progressively refine the answer while building a sharper view. Our approach, WOLAP, is fundamentally different: it makes excellent progressive estimates of the query, not the data. We have extensively studied progressive query processing in [SJS05] by exploring the ordering of not only query or data coefficients but also hybrids of the two.

It was not until our group proposed a new wavelet technique, ProPolyne [SS02b], for fast exact, approximate, or progressive polynomial aggregate query processing that data no longer had to be compressed, unlike in most of the prior studies in this area. Toward this end, ProPolyne utilizes the wavelet transform of the data frequency distribution, known as DFD, to form and process such queries. However, during the last five years of deploying this methodology in various real-world scientific applications, we have found that preparing data frequency cubes is neither feasible nor efficient for most scientific datasets, due to their sparsity. Subsequently, we proposed a new cube model, CFM, to enhance ProPolyne's space and query efficiency. While ProPolyne assumed storing the data as large data frequency distribution cubes, CFM organizes the data as a collection of smaller fixed-measure cubes to reduce the overall query and storage costs. We combine both cube models in an integrated framework, called WOLAP, for efficient polynomial aggregate query processing. We further enhance WOLAP by proposing practical solutions for real-world deployment in scientific applications.
In particular, we show how to incorporate data approximation, how to improve wavelet filter selection, and how to work with datacubes of arbitrary domain sizes. In a nutshell, WOLAP effectively subsumes and extends ProPolyne and addresses several practical issues to provide a practical framework for wavelet-based range aggregate processing.

8.2 Maintenance of Wavelet-Transformed Data

The Discrete Wavelet Transform has been extensively employed in various fields of databases. In the domain of OLAP range aggregate query processing, as fully explored in the previous section, there has been a tremendous amount of work [CGRS00, Lem02, SS02a, SS02b, VW99, VWI98, WAA00]. In the domain of time-series analysis and mining, wavelets are used to automate feature extraction and to expedite pattern discovery and outlier detection [BS03, PBF03]. The wavelet transformation is also used to provide compact synopses of data streams [GG02, GKMS01a] in support of approximate query processing.

With the exception of [CGRS00], where traditional relational algebra operations are redefined to work directly in the wavelet domain, most applications resort to reconstructing many data values to support even the simplest operations in the original domain. Moreover, the work of [CGRS00] is limited to the non-standard multidimensional wavelet transform and is not general enough to support all data maintenance scenarios. Here, we generalize these operations for both forms of multidimensional wavelet transformation by introducing two novel operations on wavelet-transformed data, termed SHIFT and SPLIT, which work directly in the wavelet domain. We employ these operations in four common data maintenance scenarios to significantly improve their efficiency. Let us now review the related work for each data maintenance scenario.

In Transformation of Massive Multidimensional Datasets, we show that our new transformation technique significantly outperforms the state-of-the-art methods [VW99, VWI98] for transforming large multidimensional datasets, as our technique is essentially a one-pass method whereas the former methods require several passes over the data.

In Data Stream Approximation, Gilbert et al. [GKMS01a] demonstrated that a best K-term wavelet approximation of a one-dimensional data stream of domain size N in the time-series model is possible using O(K + log N) space, by keeping the O(log N) coefficients that can change, with a per-item cost of O(log N). We show that the SHIFT-SPLIT operations can further reduce the per-item cost to O((1/B) log(N/B)) at the expense of additional storage for B coefficients. Furthermore, we investigate the case of multidimensional data streams, decomposed under the two different forms of wavelet transformation, and conclude that we can maintain a K-term approximation under certain restrictions. To the best of our knowledge, this is the first work dealing with wavelet approximation of multidimensional data streams, as previous works [BS03, GG02, GKMS01a, PBF03] focused on the one-dimensional case.

In Partial Reconstruction from Wavelet Transforms, Chakrabarti et al. [CGRS00] propose a solution for relational algebra selection operations in the wavelet domain. Their approach examines the wavelet coefficients to calculate their contribution to the selected range. Our SHIFT-SPLIT approach generalizes this notion and can therefore be applied to other forms of wavelet decomposition.
8.3 Range Group-by Queries

In the previous section, we explored the related work on range aggregate queries in exact [CI99, GAAS99, HAMS97], approximate [CGRS00, GG02, GM98, GKMS01b, GKTD00, HS95, PG99], and progressive [HHW97, JS07, LM01, RAA00a, SS02a, WAA00] fashion. However, none of these techniques addressed I/O sharing among a set of queries, as they are essentially designed for individual query processing. More importantly, in the case of approximation, these techniques minimize the approximation error of individual queries rather than minimizing the total approximation error of the entire set.

Simultaneous evaluation of multiple queries has been addressed in a few studies [CN05, DSRS03, ZDNS98]. However, their primary focus is on resource sharing among the queries by either creating materialized views or computing partial datacubes. In addition, they are not designed for the case when relations are stored using pre-aggregation or transformation techniques. As above, these techniques do not provide a plan for approximating the entire batch of queries.

Toward addressing these shortcomings, our group introduced a framework for progressive answering of multiple range-sum queries in [SS02b]. The focus of this study was to minimize the structural error across the batch of queries and to share I/O among them. However, deploying this technique often requires submitting a large number of queries, which is neither memory-efficient (on both the client side and the server side) nor communication-efficient, compared to the method we propose for range group-by query processing. In fact, this technique was efficient when we had no extra information about the relationships among the queries inside the batch. For range group-by queries, however, we have this extra information in advance, since we form the batch of queries ourselves. Intuitively, our proposed range group-by processing method exploits this extra knowledge for more efficient processing of the query.

In this dissertation, we propose a fundamentally different approach. We process a range group-by query in its entirety with a single reconstruction operation instead of submitting multiple point-by-point reconstruction operations. Here, we decompose a range group-by query into two sets of queries: (1) aggregate queries and (2) reconstruction queries. Subsequently, we effectively compute both in the wavelet domain by extending our earlier studies [SS02b, JS07, JSS05]. In particular, we employ the techniques in [JSS05] for the reconstruction phase (i.e., reconstructing a subset of data from its wavelet coefficients) and adopt the aggregation method discussed in [SS02b, JS07]. In addition, we extend our earlier progressive query evaluation to the new class of range group-by queries. The progressive method for range group-by queries differs from that for aggregate queries in that range group-by queries combine two sets of queries: aggregate queries and reconstruction queries. In Chapter 5, we show that reconstruction queries require a different ordering function, and we study the near-optimal ordering for range group-by query processing in general.

Chapter 9
Conclusions

We have introduced a novel framework, named WOLAP, which supports exact, approximate, and progressive processing of range polynomial aggregate queries and range group-by queries. We have proposed several applied solutions for the real-world deployment of WOLAP.
In particular, we have shown how to utilize data approximation, achieve an optimal disk block allocation, improve wavelet filter selection, and transform datacubes with arbitrary domain sizes. Toward realizing the practical use of WOLAP, we have provided a framework to efficiently maintain large multidimensional wavelet-transformed data. In particular, by introducing two novel operations that work directly in the wavelet domain, we have enabled WOLAP to transform, reverse-transform, store, update, and append data in an I/O-efficient manner.

We have designed, developed, and matured an end-to-end WOLAP system, dubbed ProDA, that efficiently and effectively analyzes massive multidimensional datasets. By employing ProDA and conducting extensive sets of experiments with several real-world datasets, we have verified the effectiveness of WOLAP in practice.

Although we have extended the set of supported analytical queries to include the entire family of polynomial aggregate queries as well as the important class of range group-by queries, there are still other types of queries to be addressed and optimized, such as Min/Max aggregations. Hence, an important direction for extending WOLAP is to provide wavelet-based solutions for the unexplored types of OLAP queries. In addition, we believe the efficiency of WOLAP in dealing with very sparse datasets must be further studied. Since the wavelet transform magnifies the storage cost for sparse datasets, optimizing the storage of wavelet-transformed data for very sparse datasets is essential for the further realization of WOLAP in a wide range of applications.

References

[AGS97] Rakesh Agrawal, A. Gupta, and Sunita Sarawagi. Modeling multidimensional databases. In ICDE, 1997.

[BS03] A. Bulut and A. K. Singh. SWAT: Hierarchical stream summarization in large networks. In ICDE, pages 303-314, 2003.

[CCL04] Seok-Ju Chun, Chin-Wan Chung, and Seok-Lyong Lee. Space-efficient cubes for OLAP range-sum queries. Decis. Support Syst., 37(1):83-102, 2004.

[CGRS00] Kaushik Chakrabarti, Minos N. Garofalakis, Rajeev Rastogi, and Kyuseok Shim. Approximate query processing using wavelets. In VLDB, 2000.

[CI99] C.-Y. Chan and Y. E. Ioannidis. Hierarchical cubes for range-sum queries. In VLDB, 1999.

[CN05] Zhimin Chen and Vivek Narasayya. Efficient computation of multiple group by queries. In SIGMOD '05, pages 263-274, New York, NY, USA, 2005. ACM Press.

[CSBK08] Cyrus Shahabi, Mehrdad Jahangiri, and Farnoush Banaei-Kashani. ProDA: An end-to-end wavelet-based OLAP system for massive datasets. IEEE Computer, 41:69-77, 2008.

[Dau92] Ingrid Daubechies. Ten Lectures on Wavelets. Society for Industrial and Applied Mathematics, 1992.

[DSRS03] Nilesh N. Dalvi, Sumit K. Sanghai, Prasan Roy, and S. Sudarshan. Pipelining in multi-query optimization. J. Comput. Syst. Sci., 66(4):728-762, 2003.

[GAA00] S. Geffner, D. Agrawal, and A. El Abbadi. The dynamic data cube. In EDBT, 2000.

[GAAS99] S. Geffner, D. Agrawal, A. El Abbadi, and T. Smith. Relative prefix sums: An efficient approach for querying dynamic OLAP data cubes. In ICDE, 1999.

[GBLP96] J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-total. In ICDE, 1996.

[GG02] Minos Garofalakis and Phillip B. Gibbons. Wavelet synopses with error guarantees. In ACM SIGMOD, 2002.

[GKMS01a] Anna C. Gilbert, Yannis Kotidis, S. Muthukrishnan, and Martin Strauss.
Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, 2001.

[GKMS01b] Anna C. Gilbert, Yannis Kotidis, S. Muthukrishnan, and Martin J. Strauss. Optimal and approximate computation of summary statistics for range aggregates. In PODS, 2001.

[GKTD00] Dimitrios Gunopulos, George Kollios, Vassilis J. Tsotras, and Carlotta Domeniconi. Approximating multi-dimensional aggregate range queries over real attributes. In ACM SIGMOD, 2000.

[GM98] Phillip B. Gibbons and Yossi Matias. New sampling-based summary statistics for improving approximate query answers. In SIGMOD, 1998.

[HAMS97] C. Ho, R. Agrawal, N. Megiddo, and R. Srikant. Range queries in OLAP data cubes. In ACM SIGMOD, 1997.

[HHW97] Joseph M. Hellerstein, Peter J. Haas, and Helen Wang. Online aggregation. In ACM SIGMOD, 1997.

[HRU96] Venky Harinarayan, Anand Rajaraman, and Jeffrey D. Ullman. Implementing data cubes efficiently. In ACM SIGMOD, 1996.

[HS95] Peter J. Haas and Arun N. Swami. Sampling-based selectivity estimation for joins using augmented frequent value statistics. In ICDE, 1995.

[JS05] Mehrdad Jahangiri and Cyrus Shahabi. ProDA: A suite of web services for progressive data analysis. In ACM SIGMOD (demonstration), 2005.

[JS07] Mehrdad Jahangiri and Cyrus Shahabi. WOLAP: Wavelet-based range aggregate query processing. Department of Computer Science Technical Reports, USC, 2007.

[JSS05] Mehrdad Jahangiri, Dimitris Sacharidis, and Cyrus Shahabi. SHIFT-SPLIT: I/O efficient maintenance of wavelet-transformed multidimensional data. In ACM SIGMOD, 2005.

[Lem02] Daniel Lemire. Wavelet-based relative prefix sum methods for range sum queries in data cubes. In CASCON. IBM, October 2002.

[LM01] Iosif Lazaridis and Sharad Mehrotra. Progressive approximate aggregate queries with a multi-resolution tree structure. In ACM SIGMOD, 2001.

[LPZ03] Laks V. S. Lakshmanan, Jian Pei, and Yan Zhao. QC-trees: An efficient summary structure for semantic OLAP. In ACM SIGMOD, 2003.

[MI06] Konstantinos Morfonios and Yannis Ioannidis. Cure for cubes: Cubing using a ROLAP engine. In VLDB, 2006.

[Nie99] Yves Nievergelt. Wavelets Made Easy. Springer, 1999.

[PBF03] Spiros Papadimitriou, Anthony Brockwell, and Christos Faloutsos. AWSOM: Adaptive, hands-off stream mining. In VLDB, pages 560-571, 2003.

[PG99] Viswanath Poosala and Venkatesh Ganti. Fast approximate answers to aggregate queries on a data cube. In SSDBM, 1999.

[RAA00a] Mirek Riedewald, Divyakant Agrawal, and Amr El Abbadi. pCube: Update-efficient online aggregation with progressive feedback. In SSDBM, 2000.

[RAA00b] Mirek Riedewald, Divyakant Agrawal, and Amr El Abbadi. Space-efficient datacubes for dynamic environments. In Data Warehousing and Knowledge Discovery (DaWaK), 2000.

[RAA01] Mirek Riedewald, Divyakant Agrawal, and Amr El Abbadi. Flexible data cubes for online aggregation. In ICDT, 2001.

[SDN98] Amit Shukla, Prasad Deshpande, and Jeffrey F. Naughton. Materialized view selection for multidimensional datasets. In VLDB, 1998.

[SJS05] Cyrus Shahabi, Mehrdad Jahangiri, and Dimitris Sacharidis. Hybrid query and data ordering for fast and progressive range-aggregate query answering. International Journal of Data Warehousing and Mining, 2005.

[SN96] G. Strang and T. Nguyen. Wavelets and Filter Banks. Wellesley-Cambridge Press, 1996.

[SRDK02] Yannis Sismanis, Nick Roussopoulos, Antonios Deligianannakis, and Yannis Kotidis. Dwarf: Shrinking the petacube. In SIGMOD, 2002.

[SS02a] R. R. Schmidt and C. Shahabi.
How to evaluate multiple range-sum queries progressively. In ACM PODS, pages 3-5, 2002.

[SS02b] R. R. Schmidt and C. Shahabi. ProPolyne: A fast wavelet-based technique for progressive evaluation of polynomial range-sum queries. In EDBT, 2002.

[SS04] C. Shahabi and R. R. Schmidt. Wavelet disk placement for efficient querying of large multidimensional data sets. Computer Science Technical Reports, University of Southern California, 2004.

[VW99] J. S. Vitter and M. Wang. Approximate computation of multidimensional aggregates of sparse data using wavelets. In ACM SIGMOD, 1999.

[VWI98] J. S. Vitter, M. Wang, and B. R. Iyer. Data cube approximation and histograms via wavelets. In CIKM, 1998.

[WAA00] Yi-Leh Wu, Divyakant Agrawal, and Amr El Abbadi. Using wavelet decomposition to support progressive and approximate range-sum queries over data cubes. In CIKM, 2000.

[WC94] M. Widmann and C. Bretherton. 50 km resolution daily precipitation for the Pacific Northwest, 1949-94.

[ZDN97] Yihong Zhao, Prasad M. Deshpande, and Jeffrey F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. In ACM SIGMOD, pages 159-170, 1997.

[ZDNS98] Yihong Zhao, Prasad M. Deshpande, Jeffrey F. Naughton, and Amit Shukla. Simultaneous optimization and evaluation of multiple dimensional queries. In Proc. of SIGMOD, pages 271-282, New York, NY, USA, 1998. ACM Press.
Abstract
The Wavelet Transform has emerged as an elegant tool for online analytical queries. Most of the methods using wavelets, however, share the disadvantage of providing only data-dependent approximate answers by compressing the data. On the contrary, we propose a wavelet-based query processing technique, WOLAP, which does not require compressing the data. Instead, we employ the wavelet transform to compact incoming queries rather than the underlying data. The intuition here is that queries are well-formed, with repetitive patterns that can be exploited by wavelets for a more effective compression, leading to efficient query performance. WOLAP extends the set of ad-hoc analytical queries to include the entire family of range polynomial aggregate queries as well as the complex class of range group-by queries. In addition, leveraging the multi-resolution property of wavelets, WOLAP supports progressive and approximate query processing under time or space limitations.