Efficient and accurate in-network processing for monitoring applications in wireless sensor networks
(USC Thesis Other)
EFFICIENT AND ACCURATE IN-NETWORK PROCESSING FOR MONITORING APPLICATIONS IN WIRELESS SENSOR NETWORKS

by SunHee Yoon

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

August 2008

Copyright 2008 SunHee Yoon

Dedication

To GOD who I love now and forever, and to my parents who show me what infinite support is.

Acknowledgements

First and foremost, I thank God who always guides my way, takes good care of me, and teaches me what love, wisdom, tears, and life are.

I am greatly thankful to Prof. Ramesh Govindan and Prof. Bhaskar Krishnamachari, who donated their valuable time to serve on both my qualifying and dissertation committees. I am deeply thankful to Prof. Cyrus Shahabi for being my advisor. I greatly appreciate Prof. John Heidemann and Prof. Wei Ye for advising my qualifying exam and serving on my qualifying committee. I also appreciate Prof. David Kempe for being on my qualifying committee and giving me brilliant constructive criticism. I appreciate Prof. Joe Touch, who taught me what a PhD is, what research is, and what a researcher is. I also appreciate Prof. Milind Tambe, who still inspires me to see my work from AI's perspective. I have been fortunate to have had all these great professors who contributed to my growth as a scientist. It is impossible to fully express my gratitude.

I am greatly thankful to my friend, Omprakash Gnawali, who showed me what a true friend is, and has assisted me in various ways. I would like to thank Hyesook and Eunyoung, who are supportive of me; all the Friday happy hour members I will never forget: David, Philipp, Alex, Jan, Deepak, and Jay, who shared wonderful experiences with me; and friends Sundeep, Marco, Youngjin, Marcos, and Venkata, who gave me valuable feedback on my work. I sincerely thank them all for their time and effort.
Finally, I thank my grandmother, parents, brother and sister. They gave me the prayer and confidence to pursue this dream.

Table of Contents

Dedication
Acknowledgements
List Of Tables
List Of Figures
Abstract

Chapter 1: Introduction
  1.1 Overview
  1.2 Motivations and Challenges
  1.3 Thesis Statement
  1.4 Summary of Work
  1.5 Contributions
  1.6 Organization

Chapter 2: Background and Related Work
  2.1 CAG
    2.1.1 Accurate Results
    2.1.2 Approximate Results
    2.1.3 Modeling Data Correlation
    2.1.4 Utilizing Data Correlation
    2.1.5 Non-Clustered Hierarchy
    2.1.6 Clustered Hierarchy
  2.2 SWATS
    2.2.1 SCADA Systems
    2.2.2 Collaborative Fusion
    2.2.3 Decision Tree
  2.3 DSS
    2.3.1 Voronoi Diagram
    2.3.2 Convex Hull
    2.3.3 GPSR
    2.3.4 GHT
    2.3.5 Skyline
    2.3.6 Other Related Work

Chapter 3: CAG: Clustered AGgregation
  3.1 Overview
  3.2 The CAG Algorithm
    3.2.1 Two Modes of CAG Operation
    3.2.2 Interactive Mode
    3.2.3 Streaming Mode
      3.2.3.1 Cluster Adjustment
      3.2.3.2 Cluster Size Estimation
  3.3 Analysis of CAG: Efficiency and Accuracy
  3.4 Measurement and Correlation Model
    3.4.1 Variogram models
    3.4.2 Data Sets
      3.4.2.1 Sensor Data Measurement in a Regular Grid
      3.4.2.2 Data with Irregular Mote Placement on Great Duck Island
      3.4.2.3 Synthetic Data from the Statistical Model
      3.4.2.4 Synthetic Data from the Ecological Model
    3.4.3 The Spatial Data Model
    3.4.4 The Impact of Different Correlation Models on CAG
  3.5 Experimental Study
    3.5.1 Evaluation Metrics and Experimental Setup
    3.5.2 Results
      3.5.2.1 Interactive Mode of CAG
      3.5.2.2 Streaming Mode of CAG
      3.5.2.3 Sub-optimal Cluster Size and Routing Path of CAG

Chapter 4: SWATS: Steam and WAter Tracking System
  4.1 Overview
  4.2 Problem Description
  4.3 System Design
    4.3.1 Overview of Our Approach
    4.3.2 The Steamflood Monitoring Algorithm
      4.3.2.1 Single-node Processing
      4.3.2.2 Multi-node Collaboration
    4.3.3 Principal Rules in Decision Tree Algorithm
  4.4 Evaluation
    4.4.1 Simulation Scenarios
    4.4.2 Anomaly Scenarios
    4.4.3 Preliminary Results

Chapter 5: DSS: Distributed Spatial Skyline
  5.1 Overview
  5.2 Formal Problem Definition
    5.2.1 General Skyline Query
    5.2.2 Spatial Skyline Query
  5.3 Foundation
    5.3.1 Preliminaries
    5.3.2 Properties of Spatial Skyline Queries
    5.3.3 Centralized Spatial Skyline Algorithm
    5.3.4 Voronoi-based Geographic Flooding
    5.3.5 Determination of Geometric Association
  5.4 DSS Algorithms
    5.4.1 TDSS: Triangulation-based DSS Algorithm
      5.4.1.1 Handling Special Cases
    5.4.2 RDSS: Rendezvous-based DSS Algorithm
    5.4.3 TRDSS: Triangulation and Rendezvous-based DSS Algorithm
  5.5 Optimizations
  5.6 Correctness and Completeness of TDSS
    5.6.1 Correctness of TDSS Algorithm
    5.6.2 Completeness of TDSS Algorithm
    5.6.3 Message Cost
  5.7 Performance Evaluation
    5.7.1 Metrics and Experimental Setup
    5.7.2 Results

Chapter 6: Conclusions and Future Work

Bibliography

List Of Tables

3.1 Properties of aggregate operators supported by CAG. Taxonomy is based on [Madden et al. 2002]. Key for abbreviations: Exemplary (E), Summary (S), Distributive (D), and Algebraic (A). The error for only the MAX and MIN operators is bounded in the interactive mode of CAG.
3.2 Comparison of two modes of CAG's operation: interactive and streaming.
3.3 The metrics used in the evaluation of CAG.
4.1 Classification of anomalies in steamflood monitoring in oilfield. All the events need multiple sensors for detection.
4.2 Principal Rules to identify Problems and False Alarms. Key for abbreviation: Problem (P), False alarm (F). Last long (L), Ephemeral (E). * indicates the downstream regardless of equipment in the pipeline. For example, if P or F increases in one downstream of a pipeline branch, then P or F in at least one other downstream decreases.
4.3 Parameters used for blockage scenario using Dynsim
5.1 Comparison of three DSS algorithms: TDSS, RDSS, and TRDSS. All algorithms use primary partitioning to classify the nodes as skylines or potentially skylines and support arbitrary synchronization points. The algorithms are different in the way they perform the secondary partitioning to search for skylines and what nodes can be control points.

List Of Figures

1.1 High Level Summary of Thesis
2.1 Classification of major query systems in WSN.
2.2 Voronoi diagram (a), Convex hull (b), and GHT (c)
2.3 GPSR with greedy and perimeter traversals.
2.4 Classification of skyline algorithms.
3.1 Two modes of CAG's operation: interactive and streaming modes. The solid lines indicate the query propagation and the dotted lines indicate the response. Black nodes are clusterheads, gray nodes are bridges, and white nodes are non-participating nodes.
3.2 Examples of the query routing tree for the analysis of CAG.
3.3 Pictures of outdoor and indoor environments where the data is measured and the map of the Great Duck Island.
3.4 Measured light data from the Exposition Park. Y-axis is ADC value of sensor reading. All the sensorboards are facing the sky except (e) and (f) in which case the sensorboards are facing down.
3.5 Synthetic data using statistical model (7h and 9h) and ecological model (spatial pattern). The magnitude of value (synthetic sensor reading) decreases from red to yellow to green to blue.
3.6 Variograms of measured and synthetic data.
3.7 Mathematical model of P_c. CAGn means CAG with τ = n%. Both (a) and (b) use 10 meter internode distance in a regular grid.
3.8 Accuracy model as a function of threshold (τ) and internode distance (h).
3.9 A snapshot of CAG tree with 375 nodes randomly placed in 250m×250m space with 9h synthetic data and τ = 20%. The big black square near the bottom left corner is the root node and other small black circles are nodes in the root cluster. Clusterhead nodes (except the root node) are the small black squares and the non-clusterhead nodes (except the root cluster) are small empty circles. The number beside each node indicates clusterhead node id, and the arrow points to the parent node in the query routing tree.
3.10 Performance and precision tradeoff in the interactive mode with measured temperature data from the Exposition Park.
3.11 Impacts of different spatial and temporal correlations on the performance of CAG. All experiments use the radio profile for default transmission power.
3.12 Breakdown of transmission overhead with measured temperature data.
3.13 Accuracy results with CAG for different datasets with default transmission power.
3.14 Sub-optimal cluster size of CAG with 25 meters disc radio model and temperature in 1 PM and 7 PM.
3.15 Sub-optimal transmission overhead of CAG with 25 meters disc radio model and temperature in 1 PM and 7 PM.
4.1 Equipments currently being used in steamflood monitoring in oilfield
4.2 An example of decision tree on pressure for blockage, leakage, and downhole pressure change
4.3 A blockage scenario using Dynsim in simple topology with a single pipeline, generator, and injector.
4.4 A leakage scenario using Dynsim in simple topology with a single pipeline, generator, and injector.
4.5 Correctness of identification and localization: node 4 correctly identified the blockage at 200 second (first detection at 195 second) with 50% blockage severity and 5%/second build-up rate. Red dot indicates that corresponding predicate (abbreviation P is pressure and F is flowrate) in the x-axis is detected at the node given on y-axis and a correct identification requires all the predicates for a given anomaly to be true. Only node 4 satisfies all the predicates for the blockage. Among 6 nodes (as shown in Fig. 4.4), only nodes 2, 3, 4, and 5 (shown on the y-axis) detected the event.
4.6 Correctness of localization and timeliness: only node 4 identified the blockage correctly from 195 second to 202.5 second (for 4 detection sliding windows) among 6 nodes (as shown in Fig. 4.4).
4.7 Impact of detection threshold on the level of severity, latency, and correctness.
5.1 Examples of general skyline and spatial skyline
5.2 Voronoi-based geographic flooding
5.3 Techniques to determine the geometric association of a point
5.4 Partitioning of three DSS algorithms
5.5 Three phases of TDSS algorithm. Synchronization point is shown as a star and skyline as a sunflower.
5.6 Some special cases. See section 5.4.1.1 for details.
5.7 Some special cases. See section 5.4.1.1 for details.
5.8 Example figures for the analysis of TDSS.
5.9 Example figures for the completeness of TDSS.
5.10 Software architecture of DSS over GPSR
5.11 Efficiency of DSS algorithms by network size (a-c) and density (d), with fixed |Q| = 10 and MBR(Q) = 25%
5.12 Breakdown of the number of transmissions with the same number of nodes (292 nodes) in the network: dense 300×300 and sparse 500×500 networks with fixed |Q| = 10, MBR(Q) = 25%
5.13 Efficiency of three DSS algorithms by the number of query points (|Q|), with fixed MBR(Q) = 25%
5.14 Efficiency of three DSS algorithms by MBR(Q), with fixed |Q| = 10, and network size = 100×100, 300×300
5.15 Accuracy of results in a moderate topology (a), and load balance of transmissions in a moderate 300×300 network (b), both with fixed |Q| = 10, MBR(Q) = 25%
5.16 Delay (a) and progressiveness (b) of three DSS algorithms, with fixed |Q| = 10, MBR(Q) = 25%, and moderate density

Abstract

In-network processing has long been considered a fundamental mechanism to improve energy efficiency in various monitoring applications in wireless sensor networks. This thesis proposes in-network processing algorithms which leverage the spatial and temporal correlations in sensor data and the geometric properties in the network topologies to enable monitoring applications to efficiently produce accurate results. We demonstrate this by designing and implementing three application-specific in-network processing techniques.

First, we investigate the efficient and approximate in-network aggregation algorithm called Clustered AGgregation (CAG) for environmental monitoring.
CAG reduces the number of transmissions by leveraging both spatial and temporal correlations of sensor data to perform in-network aggregation. The CAG algorithm forms clusters of sensor nodes sensing similar values by exploiting spatial correlation of sensor readings, and efficiently adjusts those clusters over time under data or network dynamics. CAG is a novel algorithm in that it provides approximate results with bounded error while performing in-network aggregation.

Second, we design autonomous detection, identification, and localization algorithms which reduce false alarms in applications that monitor steam and water pipelines in oilfields. We formulate detection and identification problems in oilfields as a decision-making problem based on information with uncertain errors, and build a decision tree to capture the salient pressure and flow characteristics of each problem and distinguish them from false alarms. The proposed Steam and WAter Tracking System (SWATS) utilizes multi-modal sensing and multi-sensor collaboration and exploits spatial and temporal patterns of the sensed phenomena.

Third, we introduce the spatial skyline operation to wireless sensor networks and suggest potential applications which can benefit from this operation. Subsequently, we design and implement three flavors of Distributed Spatial Skyline (DSS) algorithms in wireless sensor networks based on different partitioning strategies: 1) TDSS: Triangulation-based DSS, 2) RDSS: Rendezvous-based DSS, and 3) TRDSS: Triangulation and Rendezvous-based DSS. Our algorithms utilize geometric properties in two layers of the protocol stack: 1) geographic routing techniques such as GPSR, GHT-based routing, and Voronoi-based geographic flooding in the routing layer, and 2) geometric notions such as convex hull and Voronoi diagram to compute spatial skyline in the application layer.
The proposed DSS algorithms are novel in that they provide accurate and progressive spatial skyline results with modest delay while performing in-network processing efficiently.

Chapter 1

Introduction

1.1 Overview

Monitoring is the most popular class of applications in Wireless Sensor Networks (WSN). WSN provides quick, inexpensive, and close observation of phenomena safely without human intervention. Users benefit from using WSN in different monitoring applications such as monitoring environments [11, 21, 29, 61, 62, 85, 91], detecting events [1, 30, 50, 81, 82], toxic waste monitoring [75], tracking targets [12, 23, 30, 32, 57, 59, 105], predicting events [29, 44], seismic monitoring [58], structure monitoring [99], and pipeline monitoring [83, 84]. In-network processing and data aggregation are widely used to save energy, increase scalability, and reduce computation in many monitoring applications of WSN.

When developing monitoring applications, users often pose contradictory requirements such as cheap overall cost, reliable and accurate results, long network lifetime, and timely and progressive results. Users want to reduce the deployment cost by using cheap equipment and sensors, which often provide less reliable and lower-precision results. On the other hand, they want to get reliable and accurate results. Cheap equipment often does not last long, but users want to monitor for a long time over a large area. Progressive results often sacrifice efficiency because more messages are used to report early results, and timely results often sacrifice accuracy because results must be provided quickly, before complete data has been gathered in the network.

These contradictory user requirements bring up several interesting research challenges for system designers. First, even with low-quality and inexpensive equipment and sensors, we need to develop algorithms that provide reliable and accurate results. Second, our algorithms should be energy-efficient and help to prolong network lifetime.
For example, in-network processing can be used to reduce the number of packets transmitted by a node, thereby saving energy. Third, our algorithms need to consider the constraints of sensor nodes, such as limited computing power and storage, so that they can be practically deployed. Many algorithms (e.g., signal processing, classification, reasoning, probability computation, etc.) can easily become too complex for resource-constrained sensor nodes. Finally, unreliable radio links are challenging for in-network processing because a valuable result is not guaranteed to be delivered even after complicated processing in the network. Loss of an aggregated result can impact the accuracy of the final result more than the loss of a single sensor reading.

It is possible to address these challenges due to improvements in hardware and software systems. Advances in hardware technology such as VLSI for microprocessors, sensors, and radio transceivers make sensor node platforms available at a low cost. Various sensor fusion techniques improve reliability by performing link quality estimation, filtering, and collaboration. A lot of research has been done to improve accuracy from three orthogonal approaches: 1) if the data is often predictable, getting approximate results using as little data as possible, 2) if the data is often unpredictable, getting exact results using as much data as possible, and 3) if there are known data properties, getting exact results using as little data as possible. Energy efficiency, the most important requirement of WSN, is achieved by reducing the radio activity in different layers, such as reducing the number and scope of transmissions in the application and reducing the radio operation time in the MAC.

In-network processing has made significant progress in improving energy efficiency and sensor network lifetime [21, 42, 61, 66, 100, 102, 103, 107] for applications that allow data aggregation.
(It has limited impact on applications that only require raw data from each sensor.) Since sensor nodes have local computation abilities, in-network processing makes it possible to reduce expensive, energy-consuming radio operations by moving out part of the computation from the sensor networks, aggregating records, or eliminating irrelevant records. Our work focuses on creating in-network processing algorithms by exploiting data correlation, sensor collaboration, and geometric properties to increase accuracy, efficiency, scalability, timeliness, and progressiveness of monitoring applications.

1.2 Motivations and Challenges

Correlation in sensor data provides an opportunity to design efficient wireless sensor networks. Data collected using wireless sensor networks reflect the spatial and temporal correlations of physical attributes intrinsic in the environment. Tobler's first law of geography states that "Everything is related to everything else, but near things are more related than distant things" [90]. This statistical observation implies that data correlation increases with decreasing spatial separation. There have been a number of research studies pursuing efficient in-network aggregation in the literature [42, 61, 100], including representative systems such as TAG and Directed Diffusion. Neither of these systems, however, exploits the spatial and temporal correlations of data to achieve even more efficient in-network aggregation, although well-known observations on spatial statistics indicate that data correlation increases with decreasing spatial separation [90].

The major motivations behind Clustered AGgregation (CAG) are as follows. Users do not need results with arbitrary accuracy; required accuracy depends on the application. A system can give flexibility to the users to specify a satisfactory level of accuracy with an error bound.
Once the arbitrary accuracy requirement is relaxed, the system can filter data, or perform lossy compression or aggregation to increase efficiency while achieving the required accuracy. Even without transmitting all the data to the user application, we can achieve significant accuracy by combining the sensor readings from multiple cheap sensors and exploiting the underlying data correlations. We can save energy by suppressing transmissions of redundant data. The basic assumption behind these approaches is that spatial and temporal correlations are ubiquitous in the sensor readings from environments and events. We also assume inherent limitations of dynamic sensor range and sensor resolution, and inherent approximation of phenomena by sensors with different time granularities.

CAG is an improvement over existing in-network aggregation mechanisms. It leverages the spatial and temporal correlations of sensor data to perform a much higher level of aggregation than systems such as TAG. However, providing the error bound in the aggregation result along with efficient use of energy is a challenge. Moreover, clusters need to be adjusted under the data and network dynamics, and these operations must be performed energy-efficiently over the duration of the operation.

Anomaly detection systems deployed to monitor pipeline (oilfield, water, sewer) networks have major shortcomings. Supervisory Control and Data Acquisition (SCADA) systems for pipeline networks in oilfields, for example, are expensive (equipment and maintenance), less scalable (low density in time and space), inflexible (protocol change and software update), less interoperable (hardware and software), and provide the data or result with long delay. Engineers need to control and maintain the equipment manually.

Steam and WAterflood Tracking System (SWATS) is designed to detect, identify, and localize major problems that arise in steamflood pipeline networks in oilfields.
Our system aims to allow continuous monitoring of the steamflood system with low cost, short delay, and fine-granularity coverage while providing high accuracy and reliability.

Detecting problems in the steam pipeline network is challenging because sensors inherently have inaccuracies, especially because we plan to use inexpensive sensors to make the monitoring system economically viable. These imprecise sensor readings, coupled with transient changes in flow rate, temperature, or pressure, might trigger false alarms, which makes it challenging to confidently detect a problem in a steam pipeline network.

Challenges in identification and localization arise from the complexity of the pipeline topology (split, merge, etc.). Single-node processing with single-modal sensing cannot capture the topological effects on the transient characteristics of steam fluid to disambiguate similar problems and false alarms. Low energy, processing, and storage availability in sensor nodes create further constraints in the design of our system. Designing intelligent collaboration algorithms is challenging given conflicting requirements such as low-end hardware, long lifetime, and reliable and accurate results.

Spatial skyline is useful to applications that need to find good candidate locations of interest, not the single best one, in scenarios such as collaborative positioning for multiple moving targets in the road network or positioning for defense against multiple enemies in the battlefield. We illustrate the importance of the distributed spatial skyline query with this example: Suppose the central police station wants to find a list of policemen with better locations to take care of multiple accidents on the road. Multiple policemen on the road, police stations, and fire stations can collaboratively decide who is closer or equal in distance (non-dominated) relative to the others to respond to multiple events: a suspect on the road, a speeding car, and a few car accidents.
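The notion of a non-dominated responder in the example above can be illustrated with a small centralized sketch. It is not any of the DSS algorithms of this thesis; it assumes the usual spatial-skyline dominance rule (a point dominates another if it is at least as close to every query point and strictly closer to at least one), and the names `dominates`, `spatial_skyline`, `officers`, and `events` are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def dominates(p, other, query_points):
    """Return True if p is no farther than `other` from every query point
    and strictly closer to at least one (assumed dominance rule)."""
    d_p = [dist(p, q) for q in query_points]
    d_o = [dist(other, q) for q in query_points]
    no_worse = all(a <= b for a, b in zip(d_p, d_o))
    strictly_better = any(a < b for a, b in zip(d_p, d_o))
    return no_worse and strictly_better


def spatial_skyline(points, query_points):
    """Brute-force O(n^2) skyline: keep every point that no other point dominates."""
    return [
        p for p in points
        if not any(dominates(o, p, query_points) for o in points if o != p)
    ]


# Hypothetical data: two events on a road and four candidate responders.
events = [(0.0, 0.0), (10.0, 0.0)]
officers = [(1.0, 0.0), (9.0, 0.0), (5.0, 5.0), (5.0, 20.0)]
skyline = spatial_skyline(officers, events)
```

In this toy setting the responder at (5, 20) is farther from both events than the one at (5, 5), so it is dominated and dropped, while the other three remain because each is closer to some event than any competitor. A centralized check like this needs every location at one place, which is the transmission cost a distributed algorithm tries to avoid.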
In general, wired networks are not available in these environments; wireless networks such as Mobile Ad hoc Networks (MANETs), wireless sensor networks, and mesh networks are often required and used in these application domains.

Over the past several years, skyline computation has been one of the most popular research topics in the database community in various applications such as multi-criteria decision making and data mining [2, 5, 6, 17, 20, 39, 40, 47, 55, 56, 65, 68, 70, 77, 80, 86, 87, 88, 95, 94, 93]. Most existing skyline algorithms, however, are centralized algorithms that work on static [5, 6, 20, 47, 55, 56, 65, 68, 70] or stream data [6, 56, 65, 70, 80, 86]. Although more than a few distributed skyline algorithms [3, 17, 39, 95, 98] have been proposed, they are studied in the context of either vertically [3] or horizontally [95, 98] partitioned datasets over connected networks such as peer-to-peer [95] networks. Huang et al. [39] proposed a distributed skyline algorithm for MANETs. However, there is no known distributed algorithm to compute spatial skyline.

The existing centralized spatial skyline algorithm is not applicable to those tracking or battlefield application scenarios due to its poor efficiency and scalability. Because the centralized algorithm assumes a complete set of data in space and time, it requires too much transmission overhead for resource-constrained wireless networks. The primary way to increase the lifetime of these wireless networks while still running monitoring applications is by reducing the number of transmissions. The number of transmissions scales exponentially with the number of nodes in the network.

The key challenge in computing spatial skyline for WSN is finding a way to efficiently combine these multi-dimensional data, e.g., distances to multiple events, to compute the skyline. Our design goal is to create a distributed algorithm to provide spatial skyline points efficiently, correctly, progressively, scalably, with low latency and fair use of energy.
To achieve these goals, we need a technique that delivers only the useful data, and not the entire dataset, to the user. Designing a distributed spatial skyline algorithm in resource-constrained wireless environments introduces several contradictory research challenges:
• Accuracy vs. Efficiency: Perfect accuracy is achievable with a complete dataset, but accumulating a complete dataset in a distributed setting incurs a huge communication overhead, making the system inefficient.
• Progressiveness vs. Efficiency: Progressiveness often trades off against efficiency because more messages are used to report early results.
• Scalability: A scalable spatial skyline algorithm in a distributed setting needs to prune the regions of the search space that contain no skyline points; otherwise the number of dominance checks, and thereby the number of transmissions, will increase rapidly as the size of the network increases.
• Agility vs. Stability: A distributed algorithm uses multiple phases of communication and state transition to compute the spatial skyline. How can it ensure a consistent result in the face of various event and network dynamics?
1.3 Thesis Statement
In-network processing algorithms can leverage the spatial and temporal correlations in sensor data and the geometric properties of the network topologies to enable monitoring applications to efficiently produce accurate results.
We demonstrate this by designing systems for (1) adaptive approximate in-network aggregation with bounded error in environment monitoring, (2) autonomous problem detection, classification, and prediction with false alarm reduction in oilfield monitoring, and (3) fast, efficient, and progressive computation of the spatial skyline with guaranteed accuracy in positioning for collaborative event tracking.
1.4 Summary of Work
We design and implement three application-specific in-network processing techniques. Fig. 1.1 presents the high-level summary of this thesis.
Figure 1.1: High Level Summary of Thesis
Application: Environment monitoring | Oilfield monitoring | Collaborative positioning
Problem: Efficient and accurate in-network aggregation | Accurate in-network detection, classification | Fast, efficient, progressive in-network spatial skyline
Challenge: Provide error bound | Eliminate false alarms | Provide guaranteed accuracy
Leverage: Spatial and temporal correlations | Spatial and temporal correlations | Geometric property
Approach: Efficient adaptive clustering with bounded error | Decision tree with multi-modal, multi-node collaboration | Partitioning search space systematically and recursively
Who Benefits: Biologists, ecologists, environmentalists | Oilfield engineers | Decision makers who use spatial information
Expected Result: Provide a new service in environmental monitoring application | Explore real world problem in novel application domain | Provide a new service in collaborative positioning applications
First, we investigated an efficient approximate in-network aggregation algorithm called Clustered AGgregation (CAG) [103] for environmental monitoring. CAG reduces the number of transmissions by leveraging both spatial and temporal correlations in sensor data. By exploiting the spatial correlation of sensor readings, the CAG algorithm forms clusters of sensor nodes that sense similar values within the user-given error threshold. CAG operates in two modes, interactive and streaming, depending on the dynamics of the environment. In the interactive mode, users issue a one-shot query and the network generates a single response. This is appropriate for scenarios in which the environment (network topology and data) changes so drastically that the overhead of cluster adjustment exceeds the overhead of constructing new clusters, or when users want to change the approximation granularity or query attributes interactively. On the other hand, in the streaming mode, the clusterheads transmit a stream of responses for a query that is issued just once.
This mode of operation is well-suited to static environments that call for low cluster maintenance overhead, where sensor readings change infrequently so the clusters change only incrementally, or where the query remains valid for a certain period of time. The interactive mode of CAG exploits only the spatial correlation of the sensor data to form clusters, whereas the streaming mode of CAG leverages both temporal and spatial correlations. The latter adjusts clusters locally as the data and topology change over time. Cluster adjustments are infrequent when the data is correlated both spatially and temporally. In the interactive mode, the same weight is assigned to all clusterhead readings regardless of cluster size, which can result in large errors when computing an aggregate. In the streaming mode, we count the number of nodes per cluster and use the cluster size as the weight of the clusterhead reading when computing the aggregate function. Therefore, the results of the streaming mode are more accurate than those of the interactive mode. In addition, we use fixed-range clustering, so the accuracy of the result does not depend on the magnitude of the clusterhead sensor value. The advantage of CAG is the high accuracy of its approximate results even with packet losses. The CAG interactive mode guarantees that the error in the result is bounded by the user-provided threshold for exemplary and duplicate-insensitive aggregation operators such as MAX and MIN. Although the interactive mode of CAG is oblivious to the number of nodes within a cluster, it still provides a good approximation for summary and duplicate-sensitive operators such as SUM, AVG, STD, and VAR when the data is normally distributed and highly correlated. However, the interactive mode cannot guarantee that the error in the result is within the error threshold for summary and duplicate-sensitive operators, while the streaming mode can.
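The effect of the two weighting schemes on AVG can be illustrated with a small Python sketch. It assumes each cluster is summarized by a (clusterhead reading, cluster size) pair; the function names and the numbers are hypothetical and only illustrate the weighting idea described above, not the CAG implementation.

```python
def avg_equal_weight(clusters):
    """AVG as in the interactive mode: every clusterhead reading counts once."""
    readings = [reading for reading, _size in clusters]
    return sum(readings) / len(readings)


def avg_size_weighted(clusters):
    """AVG as in the streaming mode: each reading is weighted by its cluster size."""
    total_nodes = sum(size for _reading, size in clusters)
    weighted_sum = sum(reading * size for reading, size in clusters)
    return weighted_sum / total_nodes


# Hypothetical network: 8 nodes sense about 20.0, 2 nodes sense about 30.0.
clusters = [(20.0, 8), (30.0, 2)]
equal = avg_equal_weight(clusters)      # treats both clusters alike
weighted = avg_size_weighted(clusters)  # accounts for the 8:2 split
print(equal, weighted)
```

With unequal cluster sizes the equal-weight estimate (25.0 here) drifts toward the small cluster, while the size-weighted estimate (22.0) tracks the per-node average; the two coincide only when all clusters have the same size.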
The errors are thus bounded while CAG still saves a significant number of message transmissions, and hence energy. This benefit grows as the number of sensor nodes, the density of node deployment, and the level of data correlation (both spatial and temporal) increase. CAG is a mechanism that can be implemented in, or added to, an existing sensor network query system without OS or infrastructure modification. To the best of our knowledge, CAG is the first system for efficient approximation of in-network aggregation in that it supports semantic broadcast [97] by leveraging both the spatial and temporal correlations prevalent in real-world data.
Second, we design a novel application using wireless sensor networks to detect, identify, and localize major problems that arise in steamflood pipeline networks in oilfields. Our system, SWATS (Steamflood and WAterflood Tracking System), detects and identifies major anomalies in steamflood pipeline networks: blockage, leakage, outside force damage, generator breakdown, and Splitigator malfunction. The problem is challenging because of the inherent inaccuracy and unreliability of sensors and the transient characteristics of two-phase steam flows, which result in various potential false alarms. Moreover, observation by a single node cannot capture the topological effects on the transient characteristics of steam fluid to disambiguate similar problems and false alarms.
We address these challenges by designing a multi-modal sensing and multi-sensor collaboration algorithm that uses a rule-based decision tree for anomaly identification and localization. By exploiting temporal and spatial patterns of the sensed phenomena, we build a rule-based decision tree to capture the salient pressure and flow characteristics of each problem and distinguish them from false alarms. Even though we use low-fidelity sensors to keep the cost down, we increase accuracy by combining the readings from multiple sensors and exploiting the underlying data correlations.
The basic assumption behind these approaches is that spatial and temporal correlations are ubiquitous in the sensor readings produced by anomalies in the pipeline network.
Third, we design and implement three flavors of Distributed Spatial Skyline (DSS) algorithms in WSN: 1) TDSS: Triangulation-based DSS, 2) RDSS: Rendezvous-based DSS, and 3) TRDSS: Triangulation and Rendezvous-based DSS. The intuition behind our algorithms is that we can still obtain correct and complete spatial skylines through parallel search, by recursively partitioning the search space based on geometric properties of the topology and pruning non-candidate nodes. This approach enables efficient, progressive, scalable search while providing correct results with short delay. Our algorithms perform a proactive search for DSS: on sensing query points in the network, that is, when an event of interest occurs, local sensor nodes push the computed skylines out to the user or base station. Because the location information of nodes and the events of interest drives the spatial skyline search, our algorithms use geometric properties in two layers of the protocol stack: in the routing layer we use geographic routing techniques such as Greedy Perimeter Stateless Routing (GPSR) [46], GHT-based routing, and Voronoi-based geographic flooding, and in the application layer we use geometric notions such as the convex hull and the Voronoi diagram to compute spatial skylines.
1.5 Contributions
We summarize the main contributions of CAG as follows:
• CAG algorithm: We designed a clustered in-network aggregation algorithm, CAG, which saves significant energy while providing an approximate result with errors bounded by a user-provided error threshold. CAG exploits temporal as well as spatial correlations. We investigated the efficiency of CAG using both our measured and modeled data.
• Customization: We designed two modes of CAG (interactive and streaming) that are appropriate for different applications and environmental characteristics.
• Measurement: We performed a large-scale measurement of environmental data (temperature and light, both indoor and outdoor) using mica2 motes.
• Model: We derived data models from the sensor data we obtained from the real-world measurements. Our data model captures two kinds of spatial property: linear and spherical.
Our experimental results for the AVG operator with measured data and link reliability indicate that the CAG interactive mode, with an error threshold of 20%, sends up to 68.25% fewer packets than TAG while introducing an error of only 2.46%. The streaming mode of CAG can save up to 70.24% over TAG when the data shows high spatial and temporal correlations in just over 100 minutes. CAG uses 19.0% fewer transmissions than TAG with the real-world dataset from Great Duck Island while introducing less than 6.26% error.
The contribution of SWATS is enlarging the application domain of WSN by exploring the novel problem of steamflood monitoring and designing new algorithms for it. We designed a distributed online rule-based decision tree algorithm to identify and localize anomalies by understanding their spatial and temporal patterns in the sensor readings. Our algorithm can successfully detect, identify, and localize anomalies in the presence of sensor failure and various false alarms. Our system represents a promising new approach for oilfield monitoring that has the benefits of low cost, short delay, flexible deployment, continuous monitoring, and accurate problem detection and identification. Our system detects and identifies multiple anomalies and complicated false alarms in multiphase steam pipelines, while prior computer-based monitoring systems in oilfields are customized to detect a single problem.
Our experiments show that our algorithm can successfully detect, identify, and localize anomalies in the presence of sensor failure and various false alarms.
DSS, the third system we designed and built, makes the following contributions:
• We introduce the spatial skyline operation to wireless sensor networks and suggest potential applications which can benefit from this operation.
• We propose the first set of distributed algorithms for computing the spatial skyline in wireless sensor networks. Our algorithms are efficient, scalable, accurate, and progressive, and quickly compute the final skyline result.
• In our algorithms, we propose two novel techniques based on computational geometry: Voronoi-based geographic flooding and determining whether a point is in the interior of an open triangle.
Our experiments show that, in a network of 554 nodes with moderate density, the Triangulation-based DSS (TDSS) algorithm outperforms the centralized algorithm by 91.49% for 10 query points whose Minimum Bounding Rectangle (MBR) covers 25% of the entire network, while providing 100% accurate results. TDSS also provides skylines progressively with modest delay for the complete skyline, while providing fairness in communication overhead across the nodes.
1.6 Organization
The remainder of this dissertation is organized as follows. Chapter 2 provides background related to our study and related work. In Chapter 3, we present the Clustered AGgregation (CAG) technique, which leverages spatial and temporal correlations in wireless sensor networks. In Chapter 4, we present the timely, accurate, and reliable Steamflood and Waterflood Tracking System (SWATS) using wireless sensor networks. In Chapter 5, we present the efficient, accurate, and progressive Distributed Spatial Skyline (DSS) in wireless sensor networks. Finally, in Chapter 6, we conclude this thesis and discuss remaining challenges and future work.
Chapter 2
Background and Related Work
2.1 CAG
Fig.
2.1 presents a classification of major query systems in WSN and places our work in the context of related work.
Figure 2.1: Classification of major query systems in WSN (by preciseness, data correlation, optimization techniques, and network topology).
2.1.1 Accurate Results
Several in-network aggregation techniques have been proposed for energy-efficient communication in WSN. TinyDB [61], Directed Diffusion [42], and Cougar [100] are the first generation of in-network aggregation systems. These approaches use a tree or Directed Acyclic Graph (DAG) topology as an underlying routing framework. However, they do not consider further energy optimization using spatial or temporal correlation and approximate aggregation. Two recent approaches, Digest Diffusion [107] and Synopsis Diffusion [66], support robust communication for duplicate-insensitive and duplicate-sensitive aggregates, respectively. However, these two systems are not without energy overhead. Digest Diffusion requires each node to maintain link quality statistics to construct the routing tree. Synopsis Diffusion with an adaptive ring topology uses redundant transmissions and receptions, which makes it less energy efficient than TAG. Our work shows that CAG is more energy efficient than TAG. Thus, by transitivity, CAG is also more energy efficient than Synopsis Diffusion.
2.1.2 Approximate Results
Allowing for an approximate result instead of requiring an exact answer enables the design of energy-efficient in-network aggregation mechanisms.
Approximate results can be used in an interactive setting in which users may first ask for a rough picture of regional data before they decide to drill down further [26]. In this scenario, not every sensed datum is required to compute the synopsis. [13] proposed an approximate aggregation technique by generalizing the Flajolet and Martin duplicate-insensitive sketches for duplicate-sensitive aggregates. The AQUA project [27] proposed a technique to compute a synopsis using a subset of nodes in the network. This technique provides an accurate approximate answer to the query. However, it does not leverage any data correlation. It is known that both energy efficiency and accuracy are important in time-critical monitoring. In many systems, however, higher accuracy comes at a prohibitive energy cost. CAG addresses this problem by providing a bounded approximate result with significant energy reduction.
[67] designed an adaptive bounded-width filter in which filter widths are adjusted continuously to match the current data dynamics. In this scheme, distributed data streams transmit a bounded approximate answer to the centralized site with reduced overhead. [44] tried to minimize resource usage while satisfying the precision requirement by designing a prediction system using a Dual Kalman Filter (DKF). As such, a sophisticated filter or prediction scheme can be incorporated in WSN to prevent unnecessary data transmission. In CAG, the user-provided error threshold τ functions as a fixed-width filter that determines the allowable sensor readings in a CAG cluster.
2.1.3 Modeling Data Correlation
There have been many studies modeling the spatial correlation property in the context of WSN. [21] proposed a model-based data acquisition prototype called BBQ, which uses a time-varying multivariate Gaussian model. The authors proposed a framework that can use any data model to predict the sensor readings. Unlike CAG, BBQ does not form clusters; neither does it model the environmental data or take into account the data dynamics.
Moreover, BBQ cannot detect outliers whereas CAG can. [31] proposed distributed regression, an efficient and general framework for in-network modeling of sensor data. In this work, rather than communicating the data, nodes communicate constraints on the model parameters, thereby significantly reducing the communication cost. [45] proposed a method to generate spatially correlated data based on a mathematical model. [53] describes a technique to generate a synthetic data set that contains realistic spatial patterns with known spatial properties, bearing the fractal pattern found in the environment. This synthetic pattern is generated using stochastic noise; a fractal (strictly, quasi-fractal) is brown noise in terms of the color of its spectrum. [75] modeled diffusion phenomena (such as the propagation of a gas in the air or of a chemical agent in water) using partial differential equations in order to estimate parameters for diffusion monitoring. [24] modeled spatially correlated data in a Bayesian framework using a statistical approach. In this study, we statistically model the spatial correlation property of the measured environmental data using the variogram and the PDF (Probability Density Function) as a function of internode distance. We observed two correlation data models, linear and spherical, which describe the data from our measurement study.
2.1.4 Utilizing Data Correlation
TiNA [79] exploits temporal correlation in sensor data, while CAG takes advantage of both spatial and temporal correlations in sensor data. [33] proposed an efficient data gathering algorithm exploiting the spatial correlation. However, their algorithm is not based on clustering, and the overhead of selecting the connected correlation-dominating set compromises the efficiency of the proposed algorithm. In addition, their work is not validated empirically using measured sensor data and does not address the data dynamics.
2.1.5 Non-Clustered Hierarchy
There has been various research on optimizing tree-based routing. [16] proposed a distributed approximation algorithm for correlated data gathering in a tree topology. The authors suggested a coding strategy based on the Slepian-Wolf model and a joint entropy coding model to jointly optimize the transmission structure of the tree and the data allocation at a node. [28] proposed a randomized tree structure algorithm which simultaneously optimizes all concave non-decreasing aggregate functions. CAG's approach is different: it is not about optimizing the routing tree by forming better trees; it is about minimizing both energy usage (forwarding) and the resulting error using an existing query routing tree.
2.1.6 Clustered Hierarchy
Techniques such as Low-Energy Adaptive Clustering Hierarchy (LEACH) [35], Threshold sensitive Energy Efficient sensor Network protocol (TEEN) [63], and Adaptive TEEN (APTEEN) [64] use hierarchical clusters and routing to save energy. LEACH [35] forms clusters based on the received signal strength and uses local clusterheads as routers. Transmissions are made only by clusterheads. LEACH utilizes randomized rotation of local clusterheads to evenly distribute the energy overhead among the sensors in the network. The main difference between LEACH and CAG is that LEACH does not provide a mechanism to compute an aggregate using clusterhead values while CAG does. TEEN [63] is another hierarchical protocol, designed to be responsive to sudden changes in the sensor readings. TEEN, which is based on hierarchical clustering, forms clusters using nearby nodes. The nodes transmit sensor readings only when the readings rise above a specified threshold (the hard threshold) and change by a given amount (the soft threshold). While this saves energy, it does not support periodic reports. APTEEN [64] addresses both periodic data collection and prompt reporting of time-critical events. None of these protocols, however, leverages spatial and temporal correlations to improve efficiency.
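The hard and soft threshold rule described above for TEEN can be sketched in a few lines of Python. This is only an illustration of the rule as summarized in this section: the class name, the choice to compare against the last transmitted value, and the sample readings are assumptions and are not taken from the TEEN specification.

```python
class ThresholdNode:
    """A sensor node that reports only significant readings (TEEN-style sketch)."""

    def __init__(self, hard_threshold, soft_threshold):
        self.hard = hard_threshold   # reading must rise above this value
        self.soft = soft_threshold   # and differ from the last report by at least this
        self.last_sent = None        # last reading that was actually transmitted

    def should_transmit(self, reading):
        if reading <= self.hard:
            return False  # below the region of interest: stay silent
        if self.last_sent is not None and abs(reading - self.last_sent) < self.soft:
            return False  # change since the last report is too small
        self.last_sent = reading
        return True


node = ThresholdNode(hard_threshold=50.0, soft_threshold=2.0)
readings = [48.0, 52.0, 52.5, 55.0, 49.0, 56.0]
decisions = [node.should_transmit(r) for r in readings]
print(decisions)
```

Only two of the six hypothetical readings are transmitted. The sketch also shows the limitation noted above: a node whose readings never exceed the hard threshold stays silent indefinitely, so there is no periodic report.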
[69] analyzed the total cost of jointly optimizing routing performance and data compression using the joint entropy of sources, leveraging the spatial correlation. The authors claimed that there exists a static, near-optimal cluster size for ranges of spatial correlation. In contrast, CAG is an adaptive clustering scheme (clusters adjust over time) with lossy aggregation, which provides an approximate result where the error is bounded by the user-provided error threshold in the streaming mode.
PREMON [29] provides energy-efficient monitoring based on a clustered architecture. Clusterhead nodes in PREMON use a technique similar to the MPEG compression algorithm and generate prediction models to predict the spatio-temporal data within a cluster. PREMON saves energy by avoiding the transmission of all the redundant data that can be successfully predicted by the clusterhead node. PREMON assumes that the clusters are already formed using any existing mechanism, while CAG forms clusters using real-time sensor values. PREMON uses lossy compression with approximation where the error is not bounded (the authors did not mention or show anything about the boundedness of the error), whereas CAG uses lossy approximation and guarantees that the error in the result is bounded by the given error threshold. PREMON uses the block-matching algorithm of MPEG to compute the prediction model, whereas we classify the spatial correlation models present in the measured sensor readings to investigate the impact of different correlation models on the CAG algorithm.
CAG exploits semantic broadcast [97] in order to reduce the communication overhead by leveraging spatial and temporal correlations. CAG achieves efficient in-network processing by allowing a unified mechanism between query routing (networking) and query processing (application). Instead of gathering and compressing all the data (lossless algorithm), CAG generates a synopsis by filtering out insignificant elements in data streams (lossy algorithm) to minimize response time, storage, computation, and communication costs.
2.2 SWATS
Monitoring in WSN spans broad applications such as monitoring environments [11, 21, 29, 61, 62, 85, 91, 103], detecting events [1, 30, 50, 81, 82], tracking targets [12, 32, 57, 59, 105], and predicting events [29, 44]. There are two kinds of monitoring applications closely related to SWATS: pipeline monitoring and target tracking. Our system can integrate with or replace current SCADA systems (pipeline monitoring) and utilizes multi-modal multi-sensor collaboration like target tracking, with the common goal of detection, identification, and localization of anomalies or targets.
Pipeline monitoring is widely used in industry applications to monitor pipelines conveying water [84] [81], oils [15], multiphase gas [22], and two-phase steam [10] [89]. Most of these pipeline monitoring works, however, use or are integrated with SCADA systems, which are expensive and manually maintained, have limited scalability, flexibility, and interoperability, and provide data or results with long delay. SWATS, like [84], is an emerging application in WSN, which performs steamflood as well as waterflood pipeline monitoring.
Oilfield monitoring is fundamentally different in many ways from the existing, yet most similar, target tracking applications [30] [57] in sensor networks. Oilfield monitoring attempts to localize the static location of a problem, while target tracking localizes the position of a moving object. The problems in an oilfield are confined within a pipeline, while an object being tracked in a tracking application might move along an undetermined and open path.
Many explicit rules are used to identify and classify problems and many types of false alarms in oilfield monitoring, while very few, if any, rules are used in target tracking. Pressure, differential pressure, and temperature sensors are commonly used to measure the properties of a transient fluid in oilfield monitoring, while seismic, acoustic, and magnetic sensors are used to measure the proximity and bearing of a solid object in target tracking. In oilfield monitoring, a cross-check is required only across the neighboring sensors along the trajectory of the fluid to validate a reading, while in target tracking such comparison and correlation is done across all the neighboring nodes.
There are three main techniques related to SWATS: SCADA, collaborative fusion, and decision trees.
2.2.1 SCADA Systems
SCADA systems are designed for specific monitoring applications. Liu [15] proposed a software-based crude oil and refined petroleum pipeline leakage detection system that uses SCADA to obtain the field data. It quantified the impact of various parameters on leak detectability given the pipeline, specified instrumentation, and SCADA capabilities. Those parameters are pipeline variables (friction factor, length, diameter, velocity, and wave speed), steady-state vs. transient flow (flow increasing or decreasing), leak location, and noise. This study enables oilfield engineers to understand the achievable level of leakage detection and the sensitivity of detectability with respect to the parameters involved. This work addressed only the pipeline leakage problem and did not consider applying WSN. Erickson and Twaite [22] developed a pipeline integrity monitoring system (PIMS) which helps detect pipeline leakage and track the gas composition of wet gas pipelines. PIMS
PIMS 20 predictsthepipelineleakageformultiphasesystembycomparingthemeasurementsfromSCADA andthemodelsbasedonthetransientpipelinesimulator, OLGA.PIMSpreventsfalsealarmfrom the inaccuracy of flow meter by setting the error tolerance threshold while still providing a timely warning when a real leakage occurred. PIMS only considers a single problem of pipeline leakage. TherearenewapproachestoreplaceexpensiveSCADAsystems. Stoianovet al.[84]presented a WSN-based prototype monitoring system deployed at Boston Water and Sewer Commission (BWSC), which replaces existing expensive SCADA system. Three on-line monitoring applica- tions (hydraulic and water quality monitoring, remote acoustic leak detection, and monitoring combined sewer outflows) feature high sampling rate (1/ms), fine-time synchronization (up to 1ms), and in-network processing. Although this work provides proof-of-concept in real world, because this system is designed as an architectural framework to collect raw data using WSN, it did not provide any intelligent in-network algorithm such as SWATS running on top of this architecture. Shinozuka et al. [81] provides a practical solution for the real-time detection and rapid response to prevent severe damages to water pipelines. By monitoring the water pressure with accelerometers and acoustic sensors, this system can identify and locate the severity of dam- age in a water pipeline system. Although this work provides a proof-of-concept in real pipeline implementation, it simply detects and localizes a single pipeline leakage problem with very small false alarms without classification issue. 2.2.2 Collaborative Fusion Compared to the traditional signal processing techniques which mainly focused on getting the optimal estimation for the given set of data, WSN-based pipeline monitoring and object tracking systems select a subset of sensor nodes to participate in collaboration and the balance of the information from its decentralized and resource constraint nature. 
Gu et al. [30] built a distributed surveillance application satisfying the conflicting requirements of low-end hardware, long lifetime, and sophisticated functions such as signal processing and classification. By performing hierarchical classification, each node tries to perform detection and classification locally as much as possible to reduce data size, radio operation, and bandwidth usage. Liu et al. [57] proposed a distributed and dynamic group management method for multiple target tracking. Each node estimates the target location independently and declares a detection if the presence likelihood is larger than the absence likelihood. To reduce the communication cost, group collaboration takes place only after target detection. Both approaches utilize collaborative sensor fusion and conserve energy by trying to maximize local computation and minimize communication. They have simple problem sets (a few different objects or multiple identical objects) to classify. Unlike SWATS, these approaches did not use a decision tree to detect, classify, and localize the tracked objects.
2.2.3 Decision Tree
Implementation of a decision tree is simple and computationally efficient, which makes it appropriate for complicated online diagnosis applications in WSN. Ramanathan et al. [72] designed a debugging tool called Sympathy, which detects, classifies, and localizes sensor network failures. Sympathy uses an empirical decision tree to determine the most likely cause of packet loss in the network, while SWATS uses a theoretical decision tree based on fluid dynamics to determine the anomaly in the steam and water pipelines. Sympathy uses a simple binary decision tree, while SWATS uses a complicated multi-dimensional decision tree. Zhao et al. [106] proposed a prototype diagnostic system which integrates the two approaches of model-driven signature analysis and utility-driven sensor queries.
In this system, a Petri net model is used to detect faults, and these results provide the prior probability to the Bayesian mode estimation algorithm. A decision tree is used to capture the utility of the sensor tests. However, this system detects and classifies problems occurring in a single node and does not diagnose ubiquitous problems over WSN. Thus, they do not explore the localization problem, because in their case studies all problems occur on the designated node.
Figure 2.2: Voronoi diagram (a), Convex hull (b), and GHT (c). In (c), the home node is a, and the replicas d and e are on the home perimeter.
2.3 DSS
2.3.1 Voronoi Diagram
The Voronoi diagram of a set of points $P \subset \mathbb{R}^d$ is a partition of space into disjoint cells, each of which consists of the points closer to one particular object $p \in P$ than to any other $p' \in P$ according to a distance metric $D(\cdot,\cdot)$. The Voronoi diagram of $P$ has the property that a point $q$ lies in the cell corresponding to $p$ if and only if
$$\forall p' \in P,\ p' \neq p:\quad D(q,p) \le D(q,p') \tag{2.1}$$
The equality holds for the points on the border between the cells of $p$ and $p'$. Fig. 2.2(a) shows additional terms on Voronoi diagrams: the pentagonal cell surrounding node 5 is the Voronoi cell of 5, each of the 5 vertices of the pentagon is a Voronoi vertex (point) of 5, and nodes 3, 10, 6, 9, 7 are the Voronoi neighbors of node 5. We denote the set of Voronoi neighbors of $p$ as $VN_v(p)$, and the Voronoi cell of $p$ as $VC_v(p)$. The Voronoi diagram can be computed using either Fortune's algorithm [19], which is the optimal centralized algorithm and computes the Voronoi diagram in $O(n \log n)$ time, or the algorithm by Bash et al. [4], which is a distributed algorithm leveraging the GPSR [46] and CLDP [48] protocols.
2.3.2 Convex Hull
The convex hull of a set of points $P \subset \mathbb{R}^d$ is the smallest convex set containing all the points in $P$. Informally, it is a rubber band wrapped around the “outside” points.
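As a concrete illustration of the rubber-band picture in two dimensions, the following Python sketch computes the hull vertices with Andrew's monotone chain method. This is a standard textbook algorithm chosen here for brevity, not necessarily the one used in this thesis, and the seven coordinates are hypothetical stand-ins rather than the points of Fig. 2.2(b).

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return the hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def half_hull(sequence):
        chain = []
        for p in sequence:
            # Drop the last vertex while it would make a clockwise or straight turn.
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain

    lower = half_hull(pts)
    upper = half_hull(reversed(pts))
    # The last point of each half is the first point of the other half.
    return lower[:-1] + upper[:-1]


# Seven hypothetical points: five on the boundary, two strictly inside.
points = [(0, 0), (4, 0), (5, 3), (2, 5), (-1, 3), (2, 2), (1, 1)]
hull = convex_hull(points)
print(hull)
```

The two interior points are discarded and the five remaining vertices form the hull, mirroring the situation in which a pentagon is obtained from seven input points.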
Convex points are the subset of points in P that form the vertices of the convex hull. Fig. 2.2(b) shows an example of a convex hull (pentagon ABCDG) and its convex points (the 5 vertices of that polygon), where the convex hull is computed from the 7 points P = {A, B, C, D, E, F, G}.

2.3.3 GPSR

Greedy Perimeter Stateless Routing (GPSR) [46] is a widely adopted technique for scalable point-to-point routing in wireless networks, including WSN, in which routing addresses are based on the geographic location of a node rather than on its identifier. GPSR consists of two routing phases: greedy and perimeter traversals (Fig. 2.3). In greedy traversal, with which GPSR starts, a node forwards packets to its neighbor closest to the destination (Fig. 2.3(a)). When greedy traversal reaches a node that is a local optimum, as shown in Fig. 2.3(b), where all neighbors are farther from the destination, GPSR switches to perimeter traversal. Perimeter traversal progresses toward the destination using the right-hand rule (Fig. 2.3(c)), and GPSR is guaranteed to work correctly under the assumption that the underlying network graph is planarized. Fig. 2.3(d) depicts an example of perimeter traversal on a planar graph. A planarized graph is often constructed by using techniques such as Gabriel Graph (GG) [25] and Relative Neighborhood Graph (RNG) [92] (both for unit-disk graphs), coupled with the Cross-Link Detection Protocol (CLDP) [48] (for arbitrary graphs). Lazy Cross-Link Removal (LCR) [49] reduces the high message overhead of CLDP by eliminating most crossing edges during graph planarization after an approximate planarization using mutual witness. However, LCR does not enable GPSR to efficiently route messages to arbitrary locations.

[Figure 2.3: GPSR with greedy and perimeter traversals. (a) Greedy traversal: a node x forwards packets to its neighbor y closest to D. (b) Greedy forwarding failure: x is a local optimum to D; w and y are farther from D. (c) Perimeter traversal: the right-hand rule. (d) Perimeter traversal on a planar graph.]

2.3.4 GHT

Geographic Hash Table (GHT) [74] is a data-centric storage system in which sensor data is stored at a node determined by the name associated with the sensed data. GHT hashes the data key into geographic coordinates and consistently stores the key-value pair at nodes in the vicinity of the hashed location. The perimeter refresh protocol (PRP) stores a copy of a key-value pair at each node on the home perimeter (Fig. 2.2(c)). PRP supports the robustness and consistency of GHT under network dynamics such as topology changes or node failures. It ensures that the node closest to a key's hash location becomes the home node for that key and stores that key's data even after topological changes or node failures.

2.3.5 Skyline

Skyline queries search for the non-dominated tuples among multiple dimensions with respect to user preferences. The preference used in skyline queries is usually formulated through a monotone function over multiple dimensions. In contrast to these general skyline queries, spatial skyline queries retrieve the non-dominated data objects with respect to spatially derived attributes such as distance [80]. While the general skyline performs static queries, where the information for the skyline operation is stored statically in a database for attributes such as restaurant ratings, the spatial skyline performs relative queries, where the information for the skyline operation is computed dynamically for attributes such as the distance between querier and user location. Fig. 2.4 presents a classification of skyline algorithms and places our work in the context of related work.

[Figure 2.4: Classification of skyline algorithms.
Centralized approach, general skyline, static data: [Borzsonyi et al.] ICDE 2001, [Papadias et al.] TODS 2005, [Chan et al.] SIGMOD 2006, [Lin et al.] ICDE 2007, [Deng et al.] ICDE 2007, [Morse et al.] VLDB 2007, [Pei et al.] VLDB 2007, [Khalefa et al.] ICDE 2008, [Lian et al.] SIGMOD 2008.
Centralized approach, general skyline, streaming data: [Tao et al.] TKDE 2006, [Huang et al.] TKDE 2006, [Sarkas et al.] SIGMOD 2008.
Centralized approach, spatial skyline: [Sharifzadeh et al.] VLDB 2006.
Distributed approach, distributed skyline, mobile ad hoc: [Huang et al.] ICDE 2006, [Chen et al.] EWSN 2007, [Antony et al.] ICDE 2008.
Distributed approach, distributed skyline, distributed databases: [Vlachou et al.] ICDE 2007, [Cui et al.] ICDE 2008, [Zhu et al.] TKDE 2008.
Distributed approach, distributed spatial skyline: DSS (our work).]

Over the past several years, skyline computation has been one of the most popular research topics in the database community, with applications such as multi-criteria decision making and data mining [2, 5, 6, 17, 20, 39, 40, 47, 55, 56, 65, 68, 70, 77, 80, 86, 87, 88, 95, 94, 93]. Most existing skyline algorithms, however, are centralized algorithms that work on static [5, 6, 20, 47, 55, 56, 65, 68, 70] or streaming datasets [6, 56, 65, 70, 77, 80, 86]. Although several distributed skyline algorithms [3, 7, 17, 39, 95, 98, 108] have been proposed, they are studied in the context of either vertically [3] or horizontally [95, 108] partitioned datasets over connected networks such as peer-to-peer networks [95]. Wu et al. [98] proposed the Distributed SkyLine query (DSL) algorithm, which uses the Content Addressable Network (CAN) [73] and pipelines the skyline query execution over a large number of machines by leveraging content-based data partitioning. Cui et al. [17] proposed a Parallel distributed Skyline query processing (PaDSkyline) algorithm for distributed systems, which uses partitioning and filtering. Zhu et al. [108] proposed a Feedback-based Distributed Skyline (FDS) algorithm in which the server iteratively filters the dominated data distributed over systems. A few distributed skyline algorithms have been proposed for wireless networks. Huang et al. [39] proposed a distributed skyline algorithm for MANETs. Chen et al. [7] proposed a threshold-based skyline monitoring algorithm for sensor networks.
However, there is no known distributed algorithm to compute the spatial skyline.

2.3.6 Other Related Work

Yu et al. [104] proposed the Geographical and Energy Aware Routing (GEAR) algorithm for wireless sensor networks, which disseminates packets inside the destination region without flooding the entire network. GEAR uses energy-aware neighbor selection to route a packet towards the target region, recursive geographic forwarding, and restricted flooding.

To simplify application development in sensor networks, abstract regions [96] by Welsh et al. provide a set of spatial operators that hide the details of low-level communication, data sharing, and aggregation, based on region-based collective communication. The use of abstract regions is demonstrated over various applications such as tracking, finding spatial contours, event detection using directed diffusion, and geographic routing using GPSR. Although several abstract regions are implemented, a spatial skyline operator is missing. Implementing an efficient distributed spatial skyline using the operators proposed in abstract regions is challenging because spatial skylines cannot be defined within a certain number of hops or distance from a node. Computing correct spatial skylines may require location information beyond the neighborhood relationship on which abstract regions are built.

Chapter 3

CAG: Clustered AGgregation

3.1 Overview

Sensed data in Wireless Sensor Networks (WSN) reflect the spatial and temporal correlations of physical attributes existing intrinsically in the environment. In this paper, we present the Clustered AGgregation (CAG) algorithm, which forms clusters of nodes sensing similar values within a given threshold (spatial correlation); these clusters remain unchanged as long as the sensor values stay within a threshold over time (temporal correlation). With CAG, only one sensor reading per cluster is transmitted, whereas with Tiny AGgregation (TAG) all the nodes in the network transmit their sensor readings.
Thus, CAG provides energy-efficient and approximate aggregation results with small, often negligible, and bounded error. In this paper we extend our initial work on CAG in five directions. First, we investigate the effectiveness of CAG in exploiting temporal as well as spatial correlations, using both measured and modeled data. Second, we design CAG for two modes of operation (interactive and streaming) so that CAG can be used in different environments and for different purposes. The interactive mode provides mechanisms for one-shot queries, whereas the streaming mode provides mechanisms for continuous queries. Third, we propose a fixed range clustering method, which makes the performance of our system independent of the magnitude of sensor readings and the network topology. Fourth, using mica2 motes, we perform a large-scale measurement of real environmental data (temperature and light, both indoor and outdoor) and of wireless radio reliability, which were used for both analytical modeling and simulation experiments. Fifth, we model the spatially correlated data using the properties of our real-world measurements.

Our experimental results show that when we compute the average of sensor readings in the network using the CAG interactive mode with a user-provided error threshold of 20%, we can save 68.25% of transmissions over TAG with only 2.46% inaccuracy in the result. The streaming mode of CAG can save even more transmissions (up to 70.24% in our experiments) over TAG when the data shows high spatial and temporal correlations. We expect these results to hold in reality because we used the mica2 radio profile and empirical datasets for our simulation study. CAG is the first system that leverages spatial and temporal correlations to improve the energy efficiency of in-network aggregation. This study analytically and empirically validates CAG's effectiveness.
3.2 The CAG Algorithm

CAG, originally introduced in [102], is an algorithm that computes approximate answers to queries by using representative values in the network and leveraging the spatial and temporal properties of data. The prevalence of spatial and temporal correlations in environmental phenomena makes it possible for CAG to ignore redundant data and quickly generate a synopsis of the data distribution with significant energy savings. We use AVG as the main aggregate operator to describe, analyze, and evaluate CAG, and to compare CAG with other algorithms. Thus, the AVG operator is implied in all discussions unless we explicitly mention other operators.

[Figure 3.1: Two modes of CAG's operation: interactive and streaming modes. (a) Interactive mode of CAG: a single response for a query. (b) Streaming mode of CAG: multiple responses (at times t1 < t2 < t3, etc.) for a query. The solid lines indicate the query propagation and the dotted lines indicate the response. Black nodes are clusterheads, gray nodes are bridges, and white nodes are non-participating nodes.]

3.2.1 Two Modes of CAG Operation

CAG has two operating modes: interactive and streaming. In the interactive mode, CAG generates a single set of responses for a query. In the streaming mode, periodic responses are generated in response to a query. The interactive mode of CAG exploits only the spatial correlation of sensed data by forming clusters just once and computing and forwarding results based on those clusters. The streaming mode of CAG takes advantage of both spatial and temporal correlations of data by forming clusters and adjusting them over time to accommodate temporal changes in sensor readings in response to a streaming query. Fig. 3.1(a) shows the clusterheads responding with a single value in the interactive mode. Fig. 3.1(b) shows periodic response messages from clusterheads in the streaming mode.

The CAG algorithm operates in two phases: query and response. During the query phase, CAG forms clusters while a TAG-like forwarding tree is built, using a user-specified error threshold τ. In the response phase, CAG transmits a single value per cluster. CAG is a lossy clustering method in that only the clusterheads contribute to the aggregation.

3.2.2 Interactive Mode

Algorithm 1: Pseudocode of the interactive mode of the CAG algorithm

    Function Query.Received:
        if MR ∈ [CR − Range×τ, CR + Range×τ] then
            clusterhead = FALSE;
            broadcast query Q;
        else
            CR = MR;
            clusterhead = TRUE;
            broadcast query Q;
        end if

    Function Response.Received:
        enqueue response to the buff;

    Function ResponseTimer.Fired:
        if clusterhead then
            forward aggregate(buff, MR);
        else if size(buff) > 0 then
            forward aggregate(buff);
        end if

A user runs the CAG algorithm by specifying a query using a syntax similar to that of TAG, except for an additional "threshold τ" clause. The user query can be described as $UQ = \langle QueryID, O_i, \tau \rangle$, where $O_i$ is the monitoring attribute and $\tau$ is the user-provided error threshold. CAG supports disseminating multiple queries with different $QueryID$ and $O_i$. Subsequently, the base station broadcasts the query packet $Q = \langle UQ, ParentID, MyID, level, CR \rangle$, where ParentID is the node ID of the parent in the forwarding tree, MyID is the node ID of the transmitting node, and level is the depth of the current node in the forwarding tree. Note that CR is included in the query so that it can be compared with MR when the query is received by a node.

Clusters are formed when the forwarding tree is built. Upon receiving the query, each node decides whether to join a cluster based on the Clusterhead sensor Reading (CR) and My local sensor Reading (MR): if $|MR - CR| \le Range \times \tau$, where $Range = MaxValue - MinValue$ of the entire dataset, then the sensor node joins the cluster.
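The join rule above can be sketched in a few lines. This is a minimal illustration of the Query.Received step under the stated rule, not the deployed mote code; the function names and the example readings are hypothetical.

```python
def clustering_range(cr, value_range, tau):
    """Interval [CR - Range*tau, CR + Range*tau] around the clusterhead reading."""
    half_width = value_range * tau
    return cr - half_width, cr + half_width


def on_query_received(mr, cr, value_range, tau):
    """Decide cluster membership for a node with local reading `mr`.

    Returns (is_clusterhead, cr_to_forward): a node inside the clustering
    range joins the sender's cluster and forwards the same CR; otherwise it
    becomes a clusterhead and forwards its own reading as the new CR.
    """
    low, high = clustering_range(cr, value_range, tau)
    if low <= mr <= high:
        return False, cr
    return True, mr


# Hypothetical example: Range = 50 units and tau = 10%, so the half-width is 5.
print(on_query_received(mr=21.0, cr=20.0, value_range=50.0, tau=0.1))
print(on_query_received(mr=30.0, cr=20.0, value_range=50.0, tau=0.1))
```

In the first call the node joins the existing cluster; in the second it starts a new cluster with itself as the clusterhead.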
The Range value can be determined in advance by surveying the dataset using the MAX and MIN operations of the CAG algorithm. It is based on either the range that is semantically meaningful for users or the range of values that the sensor's ADC provides. The only requirement is that a consistent scale be used for filtering, thresholding, clustering, and displaying. We define the interval $[CR - Range \times \tau,\; CR + Range \times \tau]$ as the clustering range. Forming clusters using $\tau$ is termed τ-approximation, because $\tau$ also functions as the error bound of the result such that $|EstimatedResult - CorrectResult| < \tau$. This is why τ, a user-provided error threshold, is interchangeable with a user-provided error-tolerance threshold. Algorithm 1 shows the pseudocode of the interactive mode of the CAG algorithm.

Once all the nodes receive the query packet, the response phase starts. The ResponseTimer works the same way as the epoch timer in TAG [61]. On nodes higher up in the tree, the ResponseTimer fires later than on nodes lower in the tree. Thus, by the time a parent is ready to aggregate and transmit its result, it has already received the results from its children. Whenever the ResponseTimer fires, only clusterheads transmit packets with the following tuple: $R = \langle ParentID, CR \rangle$. If the clusterheads cannot communicate with each other, intermediate nodes, termed bridge nodes, are required to bridge the segments of the forwarding tree. Bridge nodes do not contribute their sensor readings to the aggregate by default, but they can optionally participate in the aggregation because they transmit the packets anyway. A more detailed example of the CAG algorithm execution is described in [102].

In the interactive mode, CAG builds a forwarding tree when a query is sent out. Just like TAG [61], we assume an underlying mechanism to blacklist asymmetric links. Thus, the forwarding path is set along the reverse direction of the query propagation. This newly formed clustered tree can address the dynamics of network and data on the fly.
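As a rough illustration of the response phase, the sketch below mirrors the ResponseTimer.Fired step of Algorithm 1 for the AVG operator. It assumes that each buffered response is a (sum, count) pair; the function names are hypothetical, and radio transmission is replaced by a return value.

```python
def aggregate(buff, own_reading=None):
    """Merge buffered (sum, count) pairs, optionally adding one local reading."""
    total = sum(s for s, _ in buff)
    count = sum(c for _, c in buff)
    if own_reading is not None:
        total += own_reading
        count += 1
    return total, count


def on_response_timer(is_clusterhead, buff, mr):
    """Return the partial state to forward to the parent, or None to stay silent.

    Clusterheads contribute their own reading MR; bridge nodes only relay
    what they received; nodes with nothing buffered transmit nothing.
    """
    if is_clusterhead:
        return aggregate(buff, mr)
    if len(buff) > 0:
        return aggregate(buff)
    return None


# A clusterhead with reading 25.0 that heard two child clusterheads.
total, count = on_response_timer(True, [(20.0, 1), (30.0, 1)], 25.0)
print(total / count)  # AVG over the three clusterhead readings
```

Because each clusterhead is counted once regardless of how many nodes it represents, this sketch reflects the equal weighting used by the interactive mode.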
However, the interactive mode incurs the overhead of broadcasting a query each time a user wants new data from the network. Frequently rebuilding the tree can be wasteful if the sensed data is almost the same over time. If the data is unchanged, the clusterhead nodes and the forwarding tree are likely to be the same. Moreover, in the interactive mode, CAG does not count the number of nodes within a cluster; this may trade the accuracy of the result for energy savings by reducing the number of packet transmissions.

Table 3.1: Properties of aggregate operators supported by CAG. Taxonomy is based on [Madden et al. 2002]. Key for abbreviations: Exemplary (E), Summary (S), Distributive (D), and Algebraic (A). The error for only the MAX and MIN operators is bounded in the interactive mode of CAG.

| Aggregation Operator | MAX | MIN | CNT | SUM | AVG | VAR | STD |
| Duplicate Sensitive | No | No | Yes | Yes | Yes | Yes | Yes |
| Exemplary, Summary | E | E | S | S | S | S | S |
| Monotonic | Yes | Yes | Yes | Yes | No | No | No |
| Partial State | D | D | D | D | A | A | A |
| Error Bound | τ | τ | Correct | Nτ | τ | 4τ² | 2τ |

In our earlier experiments using the measured data from the Great Duck Island [102], we observed that CAG may result in an out-of-bound error with the AVG operation, regardless of the error threshold and data correlation, when the data values do not follow the normal distribution. If the data is normally distributed, it is more likely that the clusterhead values will be close to the mean than far from it. For a given clusterhead value (which is likely to be close to the mean), more nodes are likely to have their sensor readings close to the clusterhead value compared to readings derived from a uniform distribution. Thus, with a normal distribution, the clusterhead value is more representative of all the sensor readings in the cluster than with a uniform distribution. This property results in higher accuracy with a normal distribution compared to a uniform distribution.

We implemented the aggregation operators shown in Table 3.1 in CAG, which are a subset of the operators mentioned in [61].
Duplicate-sensitive operators such as SUM and AVG cannot tolerate duplicate packets, while duplicate-insensitive operators can. An exemplary aggregate returns a representative sensor reading such as MAX or MIN, while summary aggregates such as AVG and CNT compute a summary of all the sensor readings. A monotonic aggregate monotonically increases or decreases after each intermediate aggregation. The partial state of an algebraic operator consists of multiple variables. For example, the partial state of AVG consists of SUM and CNT. The partial state of a distributive operator such as CNT consists of only a single variable.

For the MIN operator, each participating node (clusterhead) returns the MIN of all the values received in an epoch. The maximum error bound of the MIN returned by CAG is equal to or smaller than τ. The MAX operation works in a similar way to MIN, and CAG returns an error equal to or smaller than τ. To compute VAR, CAG adds the squared value alongside the AVG partial state (SUM and CNT) at each node by using the following equation: $VAR = \sum (v_i^j)^2 / n - AVG^2$, and STD is computed as the square root of VAR.

Although the CAG interactive mode does not provide results with bounded error for duplicate-sensitive and summary aggregation operators such as AVG and VAR, it provides results with bounded error for the exemplary and duplicate-insensitive operators such as MIN and MAX. Note that the interactive mode only leverages the spatial correlation of data and cannot take advantage of the temporal correlation.

3.2.3 Streaming Mode

The motivation for the streaming mode is the potential to exploit the temporal correlation of data existing in nature, in addition to the spatial correlation already exploited by the interactive mode of the CAG algorithm. In the streaming mode of CAG, a single query generates periodic responses from the network. A query for the streaming mode uses the clause "epoch duration i" to define the sampling frequency.
In response to a query with this clause, the network is expected to generate a query reply every i seconds, while the query is injected into the network only once.

The query phase of the CAG algorithm in the streaming mode is identical to that of the clustering algorithm in the interactive mode. The response phase algorithm is the major difference between the streaming mode and the interactive mode. First, the streaming mode must task the clusterheads to generate response messages once per epoch, as opposed to the one-shot response in the interactive mode. Second, the clusters need to be updated and repaired as sensor readings change over time and become inconsistent with the current cluster membership. Third, the streaming mode allows amortizing the cost of cluster size estimation over the long duration that a query is expected to last. This makes it practical to obtain the cluster size along with the clusterhead readings, which enables the streaming mode of CAG to compute results with high accuracy and to guarantee that the resulting error is always bounded by the user-given threshold even when neither the population nor the sampled data is normally distributed.

3.2.3.1 Cluster Adjustment

The purpose of cluster adjustment is to keep cluster membership consistent as sensor readings change over time. Note that a sensor reading must satisfy the condition $|MR - CR| \le Range \times \tau$ if it is to remain within a cluster. As soon as the reading is outside this range, the node must be evicted from its current cluster. Consequently, the node must either be added to a different cluster, where its sensor reading is within the clustering range of the clusterhead value, or start a new cluster with itself as the clusterhead.

The cluster adjustment algorithm works by having the nodes in the network check, every Cluster Adjustment Interval, whether their sensor readings are within the allowed clustering range of the clusterhead. If the sensor reading is still within the allowed range, the cluster membership is still valid and no further action is necessary.
If the sensor reading has veered off the clustering range, the node first attempts to migrate to a neighboring cluster where its sensor reading might be within the range for that cluster. To aid in the discovery of such neighboring clusters and their clustering ranges, the nodes snoop the broadcast medium while responses are transmitted and clusters are adjusting, and keep track of all the neighboring clusters within their radio range. If a suitable neighboring cluster for its sensor reading is not found, the node must create a new cluster with itself as the clusterhead. In either case, the node must inform its children in the routing tree that it is no longer in the old cluster. The node must also inform its children (who in turn will inform their children) of the new clusterhead value. If the children find this new clusterhead value compatible with their sensor reading, they join the cluster and propagate the message down the tree; otherwise they proceed to start their own cluster and repeat the algorithm.
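The per-node decision described above (stay, migrate, or start a new cluster) can be sketched as follows. The function and parameter names are hypothetical, and picking the closest compatible neighboring cluster is an assumption made for the sketch; the text only requires that the reading fall within the neighbor's clustering range. Notification of children and clusterheads is omitted.

```python
def adjust_cluster(mr, current_cr, neighbor_crs, value_range, tau):
    """Return (action, clusterhead_value) for one Cluster Adjustment Interval.

    action is "stay", "join" (migrate to a snooped neighboring cluster),
    or "new" (become a clusterhead with CR = MR).
    """
    half_width = value_range * tau
    if abs(mr - current_cr) <= half_width:
        return "stay", current_cr
    # Neighboring clusterhead values learned by snooping the broadcast medium.
    compatible = [cr for cr in neighbor_crs if abs(mr - cr) <= half_width]
    if compatible:
        # Assumption: prefer the neighbor whose CR is closest to our reading.
        return "join", min(compatible, key=lambda cr: abs(mr - cr))
    return "new", mr


# Hypothetical readings with Range = 50 and tau = 10% (half-width 5).
print(adjust_cluster(22.0, 20.0, [40.0], 50.0, 0.1))
print(adjust_cluster(38.0, 20.0, [40.0, 35.0], 50.0, 0.1))
print(adjust_cluster(60.0, 20.0, [40.0], 50.0, 0.1))
```

The three calls exercise the three outcomes in order: the reading is still in range, a compatible neighboring cluster exists, and no compatible cluster is known.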
Algorithm 2: Pseudocode of the streaming mode of the CAG algorithm

    Function Query.Received:
        if MR ∈ [CR − Range×τ, CR + Range×τ] then
            clusterhead = FALSE;
            broadcast query Q;
        else
            CR = MR;
            clusterhead = TRUE;
            broadcast query Q;
        end if

    Function Response.Received:
        update neighbor table;
        enqueue response to the buff;

    Function ClusterAdjustmentMsg.Received:
        update neighbor table;
        if (clusterhead_id == ClusterAdjustmentMsg.previous_clusterhead_id) then
            ClusterAdjustmentMsgReceived = TRUE;
            process cluster adjustment;
        else
            ClusterAdjustmentMsgReceived = FALSE;
        end if

    Function ClusterAdjustmentTimer.Fired:
        if MR ∉ [CR − Range×τ, CR + Range×τ] then
            if (my reading is within neighbor's clustering range) then
                join that neighbor's cluster;
                notify cluster membership change to old and new clusterhead;
            else
                create a new cluster with myself as the clusterhead;
                notify cluster membership change to old clusterhead;
            end if
            if (cluster join or cluster create or ClusterAdjustmentMsgReceived) then
                propagate ClusterAdjustmentMsg;
            end if
        end if

    Function Epoch.Fired:
        if clusterhead then
            forward aggregate(buff, MR);
        else if size(buff) > 0 then
            forward aggregate(buff);
        end if

There are two properties of the adjustment algorithm that help control the cluster adjustment overhead. First, the cluster adjustment messages need only be propagated to the nodes within the cluster of which the node with the out-of-range reading is a member. Because a change of the clustering range in one cluster does not affect the range in a different cluster, there is no need to propagate these adjustment messages to other clusters. Second, the cluster adjustment timers are orchestrated in such a way that a parent node in the query routing tree always performs cluster adjustment before its children.
By the time a node runs the adjustment algorithm, it can be sure that cluster adjustment has already taken place in all the upstream nodes. This ensures that a node needs to perform at most one adjustment every Cluster Adjustment Interval. That is, the maximum amount of time that data can be out of range of their current cluster depends on the Cluster Adjustment Interval. We can select the cluster adjustment interval as a function of the temporal correlation of data and of the accuracy and agility requirements of the application. In other words, a smaller interval makes the system more responsive to the data dynamics. With these two techniques, the adjustment cost can be controlled to make cluster adjustment practical and efficient.

3.2.3.2 Cluster Size Estimation

Errors in the result obtained from the interactive mode can be large because equal weights are assigned to the results coming from clusters of different sizes. For example, while computing an average, the error can be quite large if we assign equal weight to the result from a large and a small cluster. One way to address this is by estimating the size of the cluster so that an appropriate weight can be assigned to the clusterhead values while computing an aggregate. Unfortunately, cluster size estimation is too costly in the interactive mode, as it would require a large number of transmissions to compute a size that is used a single time in a one-shot query.

In streaming mode, the queries last a long period of time. Even though the cost of counting the number of nodes in a cluster for the first time is high, there are two reasons why cluster size estimation is practical in the streaming mode. First, the amortized cost of cluster size estimation over the duration of the query becomes quite small. Second, it is possible to update the node count incrementally when the cluster adjustment algorithm changes the number of nodes in a cluster. With high temporal and spatial correlations, cluster dynamics are rare, which makes changes in cluster size infrequent and the incremental cluster size estimation overhead small.

Table 3.2: Comparison of two modes of CAG's operation: interactive and streaming.

| | Interactive Mode | Streaming Mode |
| Description | Single response for a query | Multiple responses for a query (periodic or event driven) |
| Exploiting property | Spatial correlation | Spatial and temporal correlations |
| Clustering | Fixed cluster and clusterhead for a query | Same cluster and clusterhead until reclustering |
| Advantages | 1) Good for reactive, interactive, and one-shot query 2) Good estimation when data is normally distributed with summary and duplicate sensitive operators | 1) Good for proactive/reactive (periodic response or event driven) and long-lived query 2) Appropriate for model-based clustering 3) Accurate: bounded error for any data distribution |
| Disadvantages | 1) Not accurate when data is not normally distributed 2) If multiple responses are desired over time, users have to inject queries each time a response is needed | 1) Extra overhead and complexity from cluster adjustment and counting the number of nodes in a cluster 2) Unfair energy usage in static environment: energy bottleneck at clusterhead because clusterhead rarely changes |
| Operators with bounded error | MAX, MIN | MAX, MIN, CNT, SUM, AVG, VAR, STD |

When clusters are formed for the first time, each node sends a count increment message to its clusterhead. The number of nodes in a cluster is computed by counting the number of count increment messages received at the clusterhead. Due to data dynamics, some nodes might leave the cluster and some other nodes might join the cluster.
Every time a node joins a different cluster, the node sends a count decrement message to its old clusterhead and a count increment message to the new clusterhead. If a node forms a new cluster with itself as the clusterhead, it sends a count decrement message to its old clusterhead. The children in the query routing tree must then use the same algorithm to ensure that the node count remains consistent in the old as well as the new cluster.

Cluster adjustment ensures that the clusterhead value is representative of the readings in a cluster even with changing sensor readings. Cluster size estimation enables CAG to assign appropriate weights to the clusterhead values while computing aggregates. Together, these two techniques enable CAG to compute results efficiently with high accuracy and to guarantee that the results are within the user-provided thresholds regardless of data distribution. Table 3.2 compares these two modes of CAG operation.

3.3 Analysis of CAG: Efficiency and Accuracy

In this section, we formally analyze the efficiency and accuracy of the CAG algorithm in terms of the number of transmissions and the absolute error, respectively. We analyze the efficiency of CAG for the interactive mode or a single response of the streaming mode. (Footnote 1: The efficiency of the CAG algorithm for both interactive and streaming modes is investigated in section 3.5.2.1 and section 3.5.2.2, respectively, using simulations.) We also prove that the absolute error is always bounded by the given threshold value in the streaming mode even when the data is not normally distributed.

[Figure 3.2: Examples of the query routing tree for the analysis of CAG. (a) Tree with an average branching factor k and a depth d for the analysis of the number of clusters. (b) An example of a set of clusters formed in a balanced binary tree.]
In this section, we define $v$ as the sensor reading at the root node and $v_i^j$ as the sensor reading of the $i$th child of $v$ in the $j$th level in the tree. Even though the real world is not uniform, in this analysis we assume that the sensor reading $v_i^j$ is an i.i.d. random variable uniformly distributed over the range [0, 1]. The CAG algorithm is designed to take advantage of correlated data. Because there is no spatial correlation in uniformly distributed i.i.d. values, the probability that the sensor readings from two nodes are in the same cluster is minimal. Thus, the analysis with uniformly distributed i.i.d. data gives us insight into the worst-case performance and accuracy of the CAG algorithm.

We assume that each node in the tree has an average branching factor of $k$. Fig. 3.2(a) depicts a $k$-ary balanced tree with depth $d$, and Fig. 3.2(b) shows the same tree annotated with the variables used in this analysis. Let $T$ be the entire tree, and $N_T$ be the number of clusters in $T$. $N_{v_i^j}$ is the number of clusters in the subtree rooted at node $v_i^j$, and $n\{v\}$ is the number of nodes that are in the same cluster as $v$. Even though our analysis does not address the bridge nodes, our simulation does (Fig. 3.10(b)).

To calculate the expected number of transmissions, we need to compute the expected number of clusters. We begin to build clusters from the root node with its single-hop children. As we assumed that $v$ has $k$ children on average, $N_T$ is given by:

$$N_T = 1 + N_{v_1^1} + \dots + N_{v_k^1} - n\{v_i^1 \mid v_i^1 \text{ is in the same cluster with } v\} \tag{3.1}$$

When we compute the expected value of Equation (1), we obtain the following equation.
$$E[N_T] = 1 + \sum_{i=1}^{k} E[N_{v_i^1}] - \sum_{i=1}^{k} P_r[v_i^1 \text{ is in the same cluster with } v] \tag{3.2}$$

Here, $P_c = P_r[v_i^j \text{ is in the same cluster with } v]$, the probability that $v_i^j$ is in the same cluster as $v$, is derived as follows:

$$P_r[-\tau \le v_i^j - v \le \tau] = \begin{cases} v + \tau, & \text{if } 0 < v < \tau \\ 2\tau, & \text{if } \tau \le v \le 1 - \tau \\ 1 - v + \tau, & \text{if } 1 - \tau < v \le 1 \end{cases}$$

$$P_c = \int_0^{\tau} (v + \tau)\,dv + (1 - 2\tau)\,2\tau + \int_{1-\tau}^{1} (1 - v + \tau)\,dv = \tau(2 - \tau) \tag{3.3}$$

Now we analyze the number of transmissions for CAG. As a special case, we first compute the number of clusters with only single-hop nodes by extending Equation (2) and combining it with Equation (3) as follows. In this scenario, all the nodes, including the root node, are within single-hop radio range of any node in the network. Thus, the number of children of the root node is $k = |T| - 1$.

$$\begin{aligned} E[N_T] &= 1 + \sum_{i=1}^{k} E[N_{v_i^1}] - \sum_{i=1}^{k} P_c \\ &= 1 + k - k P_c \\ &= 1 + (1 - P_c)k \\ &= 1 + (1 - 2\tau + \tau^2)(|T| - 1) \\ &= |T|(1 - 2\tau + \tau^2) + \tau(2 - \tau) \\ &= N(1 - 2\tau + \tau^2) + \tau(2 - \tau) \\ &= (N - 1)\tau(\tau - 2) + N \end{aligned} \tag{3.4}$$

We can generalize the single-hop scenario to the multiple-hop scenario by iteratively using Equation (2) for each node in the network up to the depth $d - 1$.

$$\begin{aligned} E[N_T] &= 1 + \sum_{i=1}^{k} E[N_{v_i^1}] - \sum_{i=1}^{k} P_c \\ &= 1 + k - kP_c && \text{for } d = 1 \\ &= 1 + (k - kP_c) + k(k - kP_c) && \text{for } d = 2 \\ &= 1 + (k - kP_c) + k(k - kP_c) + k^2(k - kP_c) && \text{for } d = 3 \end{aligned}$$

and so on for $d = 4, 5, \dots$ Thus, $E[N_T]$ for a $k$-ary balanced tree with depth $d$ is given below:

$$E[N_T] = 1 + \frac{(k - kP_c)(k^d - 1)}{k - 1} = 1 + \frac{k(1 - \tau(2 - \tau))(k^d - 1)}{k - 1} \tag{3.5}$$

Equation (5) is consistent with Equation (4), which covers the single-hop scenario as the special case of (5) with $k = |T| - 1$ and $d = 1$. Thus, Equation (5) estimates the number of clusters in both single-hop and multi-hop topologies. As Equation (5) shows, the expected number of clusters depends on the correlation level $P_c$ and the branching factor $k$, where $P_c = \tau(2 - \tau)$ and $k < |T| = N$.
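A quick numerical check of Equations (3) to (5) can be sketched as follows. The function names, the trial count, and the seed are arbitrary choices made for the sketch; the Monte Carlo step simply draws two independent uniform readings on [0, 1] and counts how often they differ by at most τ.

```python
import random


def p_same_cluster(tau):
    """Closed form of P_c for i.i.d. uniform readings on [0, 1]."""
    return tau * (2 - tau)


def expected_clusters(k, d, tau):
    """E[N_T] for a k-ary balanced tree of depth d (requires k >= 2)."""
    pc = p_same_cluster(tau)
    return 1 + (k - k * pc) * (k ** d - 1) / (k - 1)


def estimate_pc(tau, trials=200_000, seed=0):
    """Monte Carlo estimate of Pr[|v_child - v_root| <= tau]."""
    rng = random.Random(seed)
    hits = sum(abs(rng.random() - rng.random()) <= tau for _ in range(trials))
    return hits / trials


tau = 0.2
n = 10  # single-hop network: root plus k = n - 1 children, depth d = 1
single_hop = n * (1 - 2 * tau + tau ** 2) + tau * (2 - tau)
print(p_same_cluster(tau), estimate_pc(tau))
print(expected_clusters(k=n - 1, d=1, tau=tau), single_hop)
```

For τ = 0.2 the closed form gives $P_c = 0.36$, and the general expression with $k = N - 1$ and $d = 1$ agrees with the single-hop expression $N(1 - 2\tau + \tau^2) + \tau(2 - \tau)$.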
The size of a cluster, $S_T$, can be computed as follows:

\[ S_T = \frac{\text{Total number of nodes}}{\text{Expected number of clusters}} = \frac{N}{E[N_T]} = \frac{N}{1 + \frac{(k - k P_c)(k^d - 1)}{k - 1}} = \frac{N(k - 1)}{(k - 1) + k\,(1 - \tau(2 - \tau))(k^d - 1)} \tag{3.6} \]

As shown in Equation (3.6), the size of a cluster, $S_T$, is also a function of $P_c$ and k. With uniformly distributed sensor readings, $P_c = \tau(2 - \tau)$ from Equation (3.3). For empirical data, we can derive $P_c$ as shown in Equations (3.13) and (3.14) in Section 3.4.4 and use those values of $P_c$ to compute $S_T$.

Because we build clusters based on the absolute range (defined in Table 3.3), the absolute error is always bounded by τ, such that $|v_i^j - v| \le \tau$, where $v_i^j$ is normalized to [0, 1]. Thus, the sensor reading on each node, $v_i^j$, can be off by up to τ from the clusterhead reading. If the total number of nodes is N, the maximum error for the entire network becomes (N − 1)τ. For the AVG operator, the maximum error is $\frac{(N-1)\tau}{N}$. For the SUM operator, the maximum error is (N − 1)τ. For MIN and MAX, the maximum error is τ:

\[ \text{Real MIN} \le \text{CAG MIN} \le \text{Real MIN} + \tau \]
\[ \text{Real MAX} \ge \text{CAG MAX} \ge \text{Real MAX} - \tau \]

VAR is given by $\mathrm{VAR} = \sum_{1}^{n} (v_i^j - \mathrm{AVG})^2 / n$. Here $v_i^j$ can be off from the clusterhead value by up to τ, and the AVG returned by CAG has a maximum error of $\frac{(N-1)\tau}{N} < \tau$. Thus, the maximum error for VAR is given by $\sum_{1}^{n} (\tau + \tau)^2 / n = 4\tau^2$. The maximum error for STD is 2τ. All the operators implemented in CAG and their maximum error bounds are summarized in Table 3.1.

Now we can formally prove that the absolute error $E_r$ is always bounded by τ in the streaming mode (see footnote 2). Assume that the values in each cluster are sorted. Let $v_{ij}$ be the $j$th unique value in cluster $i$, $n(v_{ij})$ be the number of nodes with the $j$th unique value in cluster $i$, and $d(i)$ be the number of unique values in cluster $i$. Let $C_i$ be the clusterhead value for cluster $i$, and let $c(i)$ be the number of nodes in cluster $i$, such that $c(i) = \sum_{j=1}^{d(i)} n(v_{ij})$.
Let $s$ be the number of clusters in the network and N be the total number of nodes in the network, such that $N = \sum_{i=1}^{s} c(i)$. We can compute the AVG operation exactly using TAG and approximately using CAG, as in Equations (3.7) and (3.8), respectively.

\[ \mathrm{CorrectResult} = \frac{\sum_{i=1}^{s} \sum_{j=1}^{d(i)} n(v_{ij}) \times v_{ij}}{N} \tag{3.7} \]

\[ \mathrm{EstimatedResult} = \frac{\sum_{i=1}^{s} c(i) \times C_i}{N} \tag{3.8} \]

The absolute error, $E_r$, using CAG can be computed as follows, where 0 < τ ≤ 0.5.

\[
\begin{aligned}
E_r &= |\mathrm{EstimatedResult} - \mathrm{CorrectResult}| \\
&= \frac{\left| \sum_{i=1}^{s} c(i)\,C_i - \sum_{i=1}^{s} \sum_{j=1}^{d(i)} n(v_{ij})\,v_{ij} \right|}{N} \\
&= \frac{\left| \sum_{i=1}^{s} \sum_{j=1}^{d(i)} n(v_{ij})\,C_i - \sum_{i=1}^{s} \sum_{j=1}^{d(i)} n(v_{ij})\,v_{ij} \right|}{N} \\
&\le \frac{\sum_{i=1}^{s} \sum_{j=1}^{d(i)} n(v_{ij})\,|C_i - v_{ij}|}{N} \\
&\le \frac{(N - 1)\tau}{N} \le \tau
\end{aligned}
\tag{3.9}
\]

The inequality steps use the triangle inequality and the fact that $|C_i - v_{ij}| \le \tau$ for every node in cluster $i$.

Footnote 2: We mentioned in Section 3.2.2 that the interactive mode may not provide the bounded error when the data is not normally distributed.

As shown in Equation (3.9), $E_r$ is always bounded by τ when CAG computes AVG in the streaming mode, regardless of the data distribution.

Note that the actual magnitude of the absolute error with the measured sensor data and without packet loss in the streaming mode is a constant for the linear model, as in Equation (3.15), and a logarithmic function for the spherical model, as in Equation (3.16). This result corresponds to the shape of the variograms of the linear and spherical data models (Fig. 3.6(a)), respectively. The impact of different spatial patterns on $P_c$ and $E_r$ is described in detail in Section 3.4.4.

3.4 Measurement and Correlation Model

The patterns and levels of spatial correlation observed in the measured environmental data can give us insight into the potential benefit of the CAG algorithm deployed in the real world. Eventually, the spatial data models can be used for predicting near-future or missing data. In Section 3.4.1, we present the common variogram models used to classify spatially correlated data. In Section 3.4.2, we describe our setup for the environmental data measurement.
Our fine-granularity dataset (10 meter internode distance) is a significant improvement over the coarse-granularity datasets (tens of kilometers internode distance) used by existing studies on spatial correlation in wireless sensor networks [21] [45] [69]. In Section 3.4.3, we present the results from analyzing our collected data using spatial statistical techniques. In Section 3.4.4, we study the performance and accuracy of CAG with the data model derived from our empirical dataset.

3.4.1 Variogram models

The semivariogram (see footnote 3) is the most common way to characterize the correlation between pairs of points separated by a spatial distance [18]. In probabilistic notation, the variogram is defined as follows:

\[ \gamma(h) = \frac{1}{2} E\left[ (X(p) - X(p + h))^2 \right] \]

for all possible locations $p$, where $X(p)$ and $X(p + h)$ are the values at the head and tail of each pair of points at a distance h. Variograms with zero slope indicate no correlation, as with i.i.d. random values. Under positive autocorrelation, points that are close in the (x, y)-plane tend to have similar values of z, where z is the sensor value at location (x, y). Now we describe three common variogram models.

Footnote 3: We simply refer to the semivariogram as a variogram from now on. The difference between the two is the multiplicative factor of 2, which only affects the magnitude and not the trend of the statistics.

Spherical model: The variogram of the spherical model increases linearly in the beginning and then reaches a sill, which is a plateau. That is, the expected difference of the sensed values between two points stops increasing at a certain point even though the distance (h) between the nodes increases. In the spherical pattern, data is correlated over a shorter distance than in the linear or fractal patterns.

Linear model: The positive correlation in the linear model stretches over a longer distance than in the other correlation data models. In this model, data becomes less correlated as distance increases, and this relation continues without stopping.
We would say that the linear model is a stronger correlation model than the other models in terms of distance.

Fractal model: Fractal structures are ubiquitous in nature, with the key property of self-similarity across a range of spatial scales [34]. Fractal objects or behavior often emerge in ecological models even if the models are not explicitly designed to produce them. Natural landscapes are not ideal fractals, but such models provide the simplest available means of simulating spatially complex landscapes and serve as neutral habitat models. For a fractal pattern on a two-dimensional landscape, the relationship should be linear in a graph of log(γ(h)) against log(h), with the measurement points stretching over at least 2 orders of magnitude on each axis [34].

Figure 3.3: Pictures of the outdoor and indoor environments where the data is measured and the map of the Great Duck Island. (a) Picture of the Exposition Park in Los Angeles. (b) 4th floor of Tutor Hall at USC. (c) Map of the Great Duck Island with sensor deployment.

The equations for the spherical and linear models used for variograms are given below, where c is the nugget effect (γ(0) = c), S is the sill, and a is the range of influence [78]. As the distance h approaches zero, the measurement error and microscale variation induce the nugget effect.

\[ \gamma(h) = \begin{cases} c + (S - c)\left\{ \frac{3}{2}\frac{h}{a} - \frac{1}{2}\left(\frac{h}{a}\right)^3 \right\}, & \text{for } 0 < h \le a \\ c + (S - c), & \text{for } h \ge a \\ 0, & \text{otherwise} \end{cases} \quad \text{: for spherical} \]

\[ \gamma(h) = c + (S - c)\,\frac{h}{a} \quad \text{: for linear} \]

These models are commonly used for variograms in practice in spatial statistics [78].

3.4.2 Data Sets

For modeling, analysis, and simulation, we used the four data sets described below.

3.4.2.1 Sensor Data Measurement in a Regular Grid

We measured two modalities (light and temperature) of real data using mica2 motes and MTS 300 sensor boards in two different environments: outdoors at Exposition Park in Los Angeles (Fig. 3.3(a)) and indoors on the 4th floor of Tutor Hall at USC (Fig. 3.3(b)).
Figure 3.4: Measured light data from the Exposition Park, plotted as raw data over the X and Y coordinates. The Y-axis is the ADC value of the sensor reading. All the sensorboards are facing the sky except in (e) and (f), in which case the sensorboards are facing down. (a) Light data at 1 PM. (b) Light data at 4 PM. (c) Light data at 6 PM. (d) Light data at 7 PM. (e) Light data at 1 PM (sensorboard down). (f) Light data at 7 PM (sensorboard down).

At each position, the mote and the sensor board were placed on the ground, and data was measured twice to understand the sensitivity of the data to the sensor board's orientation: (1) with the sensor board facing the sky and (2) with the sensor board facing the ground. Samples were taken at 200 millisecond intervals, and the arithmetic mean of twenty values makes one reading. This averaging avoids an inaccurate reading caused by a single instance of a malfunctioning sensor, because it was not our goal to model or incorporate sensor errors in our protocol.
For our measurements, a mote (directly connected to a laptop) was quickly repositioned to the next position on the grid after each set of measurements. The delay is not significant compared to the time scales at which changes in the measured phenomena occur and hence can be ignored. All our experiments are based on the raw sensor values. A small subset of the light measurements from the Exposition Park is presented in Fig. 3.4 (see footnote 4).

• Outdoor environment (Exposition Park in Los Angeles): We measured the environmental data at 100 grid positions (10m×10m grid spacing) in the Exposition Park located in Los Angeles. Light and temperature readings were taken at four different times of the day (1, 4, 6, and 7 PM). We chose 10 meters as the internode distance because a spacing of less than 10 meters is too redundant for monitoring the outdoor environment. Also, the mica2 mote radio (433 MHz Chipcon CC1000) is reliable up to about 35 meters even with the default transmission power, according to our measurement study [101].

• Indoor environment (4th floor of Tutor Hall at USC): We took measurements at 40 locations with a 5 meter internode distance in the rooms and hallway on the 4th floor of Tutor Hall at USC at 7 PM. We deployed the sensors more densely than in the outdoor environment because the data is affected by various indoor equipment (desks, chairs) and building features (walls, windows). These factors can weaken the radio transmission, too.

Footnote 4: Here we do not show all the data we measured due to space constraints.

Figure 3.5: Synthetic data using the statistical model (7h and 9h) and the ecological model (spatial pattern). The magnitude of the value (synthetic sensor reading) decreases from red to yellow to green to blue. (a) 7h data. (b) 9h data. (c) Spatial pattern data.

3.4.2.2 Data with Irregular Mote Placement on Great Duck Island

The four modalities (humidity, temperature, light, and pressure) measured on Great Duck Island [62] constitute this data set (Fig. 3.3(c)). Different modalities are in different units, but we used raw values in all cases.
As these sensor nodes are not deployed in a grid (irregular internode distance), distances are subdivided into a number of intervals called lags to simplify the variogram computation [18].

3.4.2.3 Synthetic Data from the Statistical Model

Sensor data is generated using the method suggested in [45] for a 250m×250m grid. Five data sets with different degrees of correlation are generated with parameters $\alpha = 1/2^i$, β, and h = 1, 3, 5, 7, and 9. The correlation coefficient h determines the level of correlation; an h of 1 generates data with almost no spatial correlation (similar to i.i.d. random values), and a larger h results in a higher spatial correlation. Our previous research [102] used this data to compare the effect of different spatial correlation levels on efficiency and accuracy. In this study, we compare $P_c$ given in Equations (3.13) and (3.14) using empirical data with the 7h (Fig. 3.5(a)) and 9h (Fig. 3.5(b)) synthetic data (see footnote 5).

Footnote 5: In [102], we reported the details on the different levels of spatially correlated data and the corresponding results.

3.4.2.4 Synthetic Data from the Ecological Model

We used the spatially correlated data (Fig. 3.5(c)) generated based on the ecological (environmental) patterns model provided by [53] in a 250m×250m grid. Even though this data is synthetic, it contains realistic spatial patterns with known spatial properties bearing the fractal pattern in the environment.

This synthetic pattern is generated using stochastic noise. We generated the power spectrum with a $1/f^2$ frequency dependence using the normal error distribution in two dimensions and then applied the inverse Fourier transform to obtain the spatial pattern; this technique is used to create many natural-looking fractal forms [53]. We generated the scale-invariant brown noise (which is strictly quasi-fractal) with the fractal dimension D = 2. Fig. 3.6(c) includes the variogram of this pattern. This spatial pattern presents the fractal characteristic of the environment with a high correlation level between 7h (Fig. 3.5(a)) and 9h (Fig. 3.5(b)) [102].

3.4.3 The Spatial Data Model

In this section, we mathematically model the property of spatial correlation using the measured sensor data from different environments. We observed that the linear correlation model (which can potentially be a fractal model) and the spherical data model are prevalent in the environmental data we measured.

Fig. 3.6(a) shows the variograms of the temperature data measured at the Exposition Park. The temperature data from all times of the day show a linear property (6 PM is almost random), except for the 1 PM data, which shows spherical characteristics.

Fig. 3.6(b) shows the variograms of the light values measured at the Exposition Park. The variogram from each time of the day shows a different linear function; 1 PM shows the least correlation (almost random) and 6 PM shows the strongest correlation.

Figure 3.6: Variograms of measured and synthetic data; each panel plots γ(h) against distance h. (a) Variograms of measured temperature with sensorboard up at 1, 4, 6, and 7 PM from the Exposition Park. (b) Variograms of measured light with sensorboard up at 1, 4, and 6 PM from the Exposition Park. (c) Variograms of spatial pattern and the measured data (light, temperature, humidity, pressure) from the Great Duck Island. (d) Variograms of temperature data from the Exposition Park and their models in the spatial statistics: spherical and quasi-spherical (1 PM) and linear (7 PM) models.
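Variograms such as those in Fig. 3.6 follow from the definition of γ(h) in Section 3.4.1 applied to pairs of measured points, with distances grouped into lags. The following is a minimal sketch of that computation, assuming samples are given as (x, y, value) tuples; the function name `empirical_variogram` and the nearest-lag binning rule are illustrative choices and not the tooling used for the thesis figures.

```python
from collections import defaultdict
from math import hypot


def empirical_variogram(points, lag_width):
    """Return {lag_distance: gamma} for samples given as (x, y, value) tuples.

    gamma(h) is half the mean squared difference of the values over all
    pairs whose separation is closest to the lag distance h.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for a in range(len(points)):
        xa, ya, za = points[a]
        for b in range(a + 1, len(points)):
            xb, yb, zb = points[b]
            # Assign the pair to the nearest multiple of lag_width.
            lag = int(hypot(xa - xb, ya - yb) / lag_width + 0.5)
            sums[lag] += (za - zb) ** 2
            counts[lag] += 1
    return {lag * lag_width: 0.5 * sums[lag] / counts[lag] for lag in sorted(sums)}


if __name__ == "__main__":
    # Hypothetical readings on a 3 x 3 grid with 10 m spacing: the value
    # grows with x only, so gamma(h) should increase with h.
    grid = [(10.0 * i, 10.0 * j, 5.0 * i) for i in range(3) for j in range(3)]
    for h, gamma in empirical_variogram(grid, 10.0).items():
        print(f"h = {h:5.1f} m  gamma = {gamma:.2f}")
```

A flat result (zero slope) would indicate uncorrelated readings, while a curve that keeps rising with h corresponds to the linear pattern described above.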
Because of the strong sunlight at 1 PM, the temperature values in the outdoor environment change by a large amount (480 to 660) even over a short distance due to the shade under several trees [101]. On the other hand, the range of light readings at 1 PM is small (1010 to 1018) under the same shadows (almost random, as shown in Fig. 3.4(a)). The magnitude of the variogram for light at 7 PM is much larger than those at 1, 4, and 6 PM because of its huge data range (0 to 900) caused by street lights at night (some places were bright and some were completely dark) (Fig. 3.4(d)), so it is not included in Fig. 3.6(b).

The spatial correlation property holds for both orientations (face up and face down) of the sensorboard. For the light sensor, the value sensed with the sensorboard facing up captures the physical phenomenon better and shows the spatial correlation more clearly than with the sensorboard facing down. On the other hand, the temperature value was not sensitive to the orientation of the sensorboard.

When we apply the temperature data to the correlation data model equations, we get Equation (3.10) for the spherical model and Equation (3.12) for the linear model. For a finer match with the data at 1 PM, we build a quasi-spherical model using a polynomial equation obtained by regression, as in Equation (3.11) (see footnote 6).

\[ \gamma(h) = \begin{cases} 1254 \times \left( \frac{3}{2}\frac{h}{36} - \frac{1}{2}\left(\frac{h}{36}\right)^3 \right), & \text{for } 0 < h \le 36 \\ 1254, & \text{for } h \ge 36 \\ 0, & \text{otherwise} \end{cases} \quad \text{: for spherical} \tag{3.10} \]

\[ \gamma(h) = 0.0089h^3 - 1.5906h^2 + 80.76h - 20.594 \quad \text{: for quasi-spherical} \tag{3.11} \]

\[ \gamma(h) = \frac{425}{81} \times h \quad \text{: for linear} \tag{3.12} \]

Footnote 6: We model the data mathematically for the purpose of our study. More sophisticated modeling is out of the scope of this study. We will address this problem in our future work.

As shown in Fig. 3.6(c), the variograms of the four modalities measured on the Great Duck Island and the variogram of the spatial pattern from the ecological model show a linear function. If we ignore the nugget effect [78], which makes the variogram start at a non-zero value at h = 0, all variograms in Fig. 3.6(c) become similar in magnitude and pattern. Note that the important factor in the variogram is not the magnitude but the shape of the graph. The different magnitudes of each modality in the variogram are due to the difference in magnitudes of the raw sensor values across modalities.

Figure 3.7: Mathematical model of $P_c$, plotted against distance h. CAGn means CAG with τ = n%. Both (a) and (b) use 10 meter internode distance in a regular grid. (a) Simulation and mathematical model of $P_c$ with CAG4 and CAG10 with temperature data from the Exposition Park. The linear model (model 7PM) has a heavier tail than the spherical model (model 1PM). (b) Simulation model of $P_c$ with CAG4 and CAG10 using 7h and 9h synthetic data.

Most variograms using the real sensor data from the Exposition Park (Fig. 3.6(a), 3.6(b)), Tutor Hall (excluded for brevity), and the Great Duck Island (Fig. 3.6(c)) show the linear pattern (except for the 1 PM temperature from the Exposition Park, which shows the spherical pattern), similar to that of the ecological spatial pattern (Fig. 3.6(c)) but at a different magnitude. Thus, we conclude that linear and spherical correlation models are prevalent in our sensor readings. We use these two models to analyze and evaluate CAG's performance and accuracy in the next section.

3.4.4 The Impact of Different Correlation Models on CAG

While we proved the analytic lower bound of efficiency and accuracy of CAG with a uniform distribution of sensor data in Section 3.3, in this section we investigate the impact of different correlation patterns observed in the measured environmental sensor data on the efficiency and accuracy of the CAG algorithm.
This analysis can give us insight into how much energy saving and accuracy CAG can achieve in the real world.

We now analyze the performance of CAG based on the real measured data. We model $P_c$, the probability of two nodes being in the same cluster, empirically as the level of correlation, using a PDF as a function of the internode distance h and the threshold τ for each spatial data model. We observed that CAG can achieve better efficiency and accuracy with the linear pattern than with the spherical pattern.

We selected two data sets as the representative environmental data models: the temperature data at 1 PM and at 7 PM measured in the Exposition Park. We model $P_c$ by curve fitting the simulation results of CAG using τ = 4% and 10%. Fig. 3.7(a) shows the results of the simulations and their mathematical models (see footnote 7).

For small τ (e.g., CAG with τ = 4% or less), we observed that both spherical and linear patterns present similar $P_c$: both patterns follow the same polynomial, which is a function of h and τ, as in Equation (3.14). For large τ (e.g., CAG with τ = 10%), we observed that $P_c$ of the linear model is more heavy-tailed than that of the spherical model; $P_c$ depends on the data model, h, and τ. Thus, for relatively large τ, we model $P_c$ of the linear pattern with a logarithmic function, as presented in Equation (3.13), and $P_c$ of the spherical pattern with a polynomial function, as in Equation (3.14).

\[ P_c = \frac{\tau}{2(\log(h) + h)} \quad \text{: for linear} \tag{3.13} \]

\[ P_c = 2\tau h^{-2} \quad \text{: for spherical} \tag{3.14} \]

Footnote 7: We did not measure the sensor data where $D_{ij}$ < 10m. Thus, we just connected two measured points where $D_{ij}$ = 0m and 10m.

In our earlier work [102], we observed that $P_c$ is larger than those in Fig. 3.7(a) with the synthetic 7h data, which corresponds to a correlation level similar to that of the temperature from the Great Duck Island. This difference can be attributed to the changed measurement setting: in this study we use a 10 meter internode distance in a regular grid, while in our earlier work [102] we used random positioning with an average internode distance of 10 meters. To compare these two sets of results under the same condition, we repeated the experiments from our previous work with nodes placed in a regular grid with 10 meter internode distance using the 7h and 9h data. We observed that our measured temperature from the Exposition Park (Fig. 3.7(a)) is more spatially correlated than the 7h synthetic data (Fig. 3.7(b)) in terms of $P_c$.

The absolute error of the streaming mode of CAG, $E_r$, is obtained empirically by curve fitting the graphs in Fig. 3.8(a). We model $E_r$ as a constant for the linear model in Equation (3.15) and as a logarithmic function for the spherical model in Equation (3.16).

\[ E_r = 0.41 = c, \text{ where } c \text{ is a constant} \quad \text{: for linear} \tag{3.15} \]

\[ E_r = \frac{3}{2}\log(\tau + 1) \quad \text{: for spherical} \tag{3.16} \]

Figure 3.8: Accuracy model as a function of threshold (τ) and internode distance (h); each panel plots absolute error (%) against threshold (τ). (a) Simulation (streaming mode with perfect reliability) and analytical models of the absolute error for two temperature data patterns (linear for 7 PM and spherical for 1 PM). (b) Absolute error of two data patterns (linear for 7 PM and spherical for 1 PM) of temperature in terms of threshold, both with perfect reliability and with the default transmission power.

We demonstrate that $E_r$ is bounded by the given threshold in the streaming mode even under packet loss using the mica2 radio profile. Fig. 3.8(b) shows that even with packet loss, the absolute error is always bounded by τ.
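The streaming-mode bound of Equation (3.9) can be checked on small examples. The following is a minimal sketch under simplifying assumptions: readings are normalized to [0, 1], there is no packet loss, and clusters are formed greedily on values alone, ignoring the routing tree. The helper names are hypothetical and this is not the CAG implementation.

```python
import random


def cluster_readings(readings, tau):
    """Greedy value-only clustering: a reading joins the first clusterhead within tau.

    Returns a list of (clusterhead_value, member_values) pairs.
    """
    clusters = []
    for value in readings:
        for head, members in clusters:
            if abs(value - head) <= tau:
                members.append(value)
                break
        else:
            # No existing clusterhead is close enough: start a new cluster.
            clusters.append((value, [value]))
    return clusters


def correct_avg(readings):
    """Equation (3.7): the exact average over all node readings."""
    return sum(readings) / len(readings)


def estimated_avg(clusters):
    """Equation (3.8): clusterhead values weighted by cluster size c(i)."""
    n = sum(len(members) for _, members in clusters)
    return sum(len(members) * head for head, members in clusters) / n


if __name__ == "__main__":
    rng = random.Random(0)
    tau = 0.1
    readings = [rng.random() for _ in range(100)]
    clusters = cluster_readings(readings, tau)
    error = abs(estimated_avg(clusters) - correct_avg(readings))
    print(f"{len(clusters)} clusters, absolute error {error:.4f}, bound {tau}")
```

Because every member is within τ of its clusterhead and the clusterhead values are weighted by cluster size, the error of the estimated average cannot exceed τ in this setting, whatever the distribution of the readings.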
We observed that $E_r$ of the linear pattern (7 PM) is smaller than that of the spherical pattern (1 PM) at a given threshold without packet loss. However, with packet loss, we observed that $E_r$ for 7 PM is not always smaller than that for 1 PM, as shown in Fig. 3.8(b).

3.5 Experimental Study

In this section, we describe 1) the evaluation metrics and experimental setup, 2) results from the evaluation of the interactive mode, and 3) evaluation results of the streaming mode, which supports cluster adjustment.

3.5.1 Evaluation Metrics and Experimental Setup

The primary metric used for evaluation, the reduced number of transmissions, uses the number of transmissions to estimate the energy cost in a WSN. This approach is reasonable because radio transmissions consume far more energy than any other operation in a sensor node [36]. In the interactive mode, the reduced number of transmissions is calculated as

\[ \frac{\mathrm{nTX(TAG)} - \mathrm{nTX(CAG)}}{\mathrm{nTX(TAG)}} \times 100, \]

where nTX(TAG) and nTX(CAG) are the numbers of transmissions for TAG and CAG, respectively.

In the streaming mode, we compute the number of transmissions to compare overheads between CAG and TAG. This metric is also used to compare the breakdown of each transmission overhead.

In our calculations, nTX excludes the query packets because this number is the same in both TAG (one-shot or streaming) and CAG (interactive or streaming) regardless of τ. In either mode of both protocols, there is a single instance of query dissemination.
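The interactive-mode metric is a simple relative saving. The following minimal sketch computes it; the function name and the transmission counts in the example are hypothetical and are not measurements from this study.

```python
def reduced_transmissions_pct(ntx_tag: int, ntx_cag: int) -> float:
    """Reduced number of transmissions (%) of CAG relative to TAG.

    Both counts exclude query packets, which are the same for both protocols.
    """
    if ntx_tag <= 0:
        raise ValueError("nTX(TAG) must be positive")
    return (ntx_tag - ntx_cag) / ntx_tag * 100.0


if __name__ == "__main__":
    # Hypothetical counts: TAG sends 200 packets, CAG sends 150.
    print(reduced_transmissions_pct(200, 150))
```

A value of 0 means CAG sent as many packets as TAG, and a negative value would mean that CAG sent more.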
Table 3.3: The metrics used in the evaluation of CAG.

Metrics | Description
Cluster formation range | Absolute range: CR ± Range × τ, where Range = MaxValue − MinValue
Reduced number of transmissions | (nTX(TAG) − nTX(CAG)) / nTX(TAG) × 100 in the interactive mode
Number of transmissions | nTX(CAG) in the streaming mode
Accuracy of result | Absolute error: $E_r$ = |EstimatedResult − CorrectResult| × 100
Number of bridge nodes | Number of participating nodes − Number of clusterheads
Reduced number of transmissions per density | Three densities: moderate (average 17 neighbors per node), dense (average 26 neighbors per node), and sparse (average 9 neighbors per node)

Another metric is the accuracy of the result, measured as the absolute error for a given τ and calculated as |EstimatedResult − CorrectResult| × 100. Note that the absolute error is computed using EstimatedResult and CorrectResult from the same query cycle.

We compute the average number of bridge nodes in the data forwarding paths to understand their contribution to the total communication overhead. Bridge nodes are used to keep the topology connected, and they can optionally contribute to the aggregate. Note that the bridge nodes did not contribute to the aggregate in our simulations.

We explored the impact of 3 different densities (moderate, dense, and sparse) on the reduced number of transmissions. For moderate density, we randomly placed 375 nodes in a 250m×250m grid, which results in at least a 5-hop topology with each node having an average of 17 neighbors. We randomly placed 550 nodes (average 26 neighbors per node) in the dense deployment and 200 nodes (average 9 neighbors per node) in the sparse deployment, with other conditions held constant.

To understand the effect of packet loss on accuracy, we ran simulations on two types of topologies: 1) lossless topologies and 2) topologies constructed using empirical loss rates.
Comparing results from these two topologies gives us insight into the difference between theoretical results and what might be observed in a real-world implementation of CAG. We used the loss profile from our own measurements using mica2 motes to assign reliabilities to links between nodes. The reception rate was measured using the 433 MHz Chipcon CC1000 radio by counting the successfully received packets among 500 packet transmissions at the default radio transmission power. We also configured CAG and TAG not to use retransmissions in our experiments. Table 3.3 describes the metrics used to evaluate CAG.

We used the TOSSIM simulator of TinyOS 1.1.8 for our simulation study [54]. We used the temperature data collected with the sensorboard up at 1, 4, 6, and 7 PM from the Exposition Park. One hundred nodes were deployed in a regular 100m×100m grid. We also used temperature readings collected on the Great Duck Island to study the efficiency of CAG with a long temporal history (from 35 nodes every hour for four days). Each node at position (x, y) uses the value from the corresponding position in the empirical data sets. We generated four different topologies (the root node is placed at each corner of the 100m×100m grid) with the above configuration, and results are averaged over 20 runs for each topology. We observed that the inclusion of the bridge nodes in the aggregation contributes only marginally to the improvement of the precision of the result. We chose τ = 0, 2, 4, 10, and 20% to cover typical values of τ that users might be interested in.

3.5.2 Results

In this section, we report the results from our experiments in both the interactive and streaming modes. Our key finding is that CAG in the interactive mode produces results with very small and often bounded errors with dramatically reduced message overhead compared to TAG. In the streaming mode, efficiency compared to TAG is even higher, while at the same time guaranteeing that the errors in the results are always bounded by the user-provided threshold.
In this section, we only report the results for the AVG operator. We expect qualitatively similar results for other aggregation operations. We omit those results due to lack of space.

Figure 3.9: A snapshot of the CAG tree with 375 nodes randomly placed in a 250m×250m space with 9h synthetic data and τ = 20%. The big black square near the bottom left corner is the root node, and the other small black circles are nodes in the root cluster. Clusterhead nodes (except the root node) are the small black squares, and the non-clusterhead nodes (except the root cluster) are small empty circles. The number beside each node indicates the clusterhead node id, and the arrow points to the parent node in the query routing tree.

Figure 3.10: Performance and precision tradeoff in the interactive mode with measured temperature data from the Exposition Park; each panel plots curves for 1 PM, 4 PM, 6 PM, and 7 PM against threshold (τ). (a) Performance (reduced number of transmissions, %) with the default transmission power. (b) Average number of bridge nodes with the default transmission power. (c) Precision (absolute error, %) with the default transmission power. (d) Precision (absolute error, %) with perfect link reliability (bounded by τ).

3.5.2.1 Interactive Mode of CAG

In this section, we present the performance and precision tradeoff in the interactive mode of CAG. In this mode, the clusterhead values are not weighted by the respective cluster sizes. Thus, if the data from all the nodes do not follow the normal distribution, we cannot guarantee that the error of the result is bounded by the given τ.

Fig. 3.9 shows a snapshot of the CAG tree with 375 nodes randomly placed in a 250m×250m space with 9h synthetic data and τ = 20%.
The root node (the big black square) is located in the bottom-left area, and all other nodes shown as black circles are in the same cluster as the root node. As expected, most new clusters are built along the diagonal band spanning from the top-left to the bottom-right (the clusterhead nodes, shown as small black squares, are spread there), where there is a large difference in the magnitude of the data. This figure confirms the validity of our implementation of the CAG algorithm.

Fig. 3.10(a) shows the performance of CAG over TAG in terms of the reduced number of transmissions in the interactive mode using the default transmission power. As we mentioned earlier, we compare the CAG interactive mode against TAG with a one-shot query because the CAG interactive mode also uses a one-shot query. CAG with τ = 4% has a transmission saving of up to 37.5% at 6 PM, and CAG with τ = 10% has a transmission saving of up to 51.25% at 7 PM. An interesting result is that CAG can save 8.75 to 20% even with τ = 0, which guarantees that the results have no error. This is because nodes with identical sensor readings form clusters, which introduces no error.

Fig. 3.10(b) shows the number of bridge nodes. As the error threshold increases, more nodes choose not to respond. Thus, we need more bridge nodes to keep the tree connected. Because a modest increase in the number of bridge nodes is accompanied by a dramatic reduction in the number of clusterheads, the total number of transmissions still decreases as the threshold value increases.

Fig. 3.10(c) shows the accuracy of the result obtained using the interactive mode of CAG with the empirical radio profile. Due to the unreliable links, the resulting error is out of bound when τ = 2% at 7 PM. The error caused by packet loss is 9.375% when τ = 10% at 4 PM.

Fig. 3.10(d) shows the accuracy of the result obtained using the interactive mode of CAG with perfect link reliability.
Even though the resulting error is not guaranteed to be within the given threshold τ in the interactive mode, we observed that the resulting error was always bounded by τ. This may indicate that our measured temperature data from the physical world follow the normal distribution.

We investigated the impact of three different densities (moderate, dense, and sparse) on the reduced number of transmissions. With a denser node deployment, CAG saves more transmissions by exploiting the increased correlation in readings from close-by sensors. Sparse deployment results in weaker data correlation, which increases the transmission overhead for CAG. We omit the figure, which is presented in our earlier paper [102], due to space constraints.

3.5.2.2 Streaming Mode of CAG

In this section, we present the impact of data dynamics (both spatial and temporal) on the performance and accuracy of the streaming mode of CAG. In our evaluations, we systematically compare 1) the total transmission overhead and 2) the cluster adjustment overhead in the streaming mode with alternative approaches, while presenting 3) the detailed breakdown of the total message overhead. Our main result from studying the streaming mode of CAG is that a small and bounded error is achievable while significantly reducing the message overhead compared to TAG.

We used three different datasets for our measurement study in this section: the Great Duck Island dataset, the stair-wise dataset, and the linear dataset.

Great Duck Island dataset: We obtained coarse-granularity (recorded once per hour) time series temperature data from the measurements conducted on the Great Duck Island from 35 nodes for four consecutive days (Fig. 3.11(f)). This dataset gives us insight into the long-term performance of CAG with high data dynamics.
In order to investigate the impact of different temporal patterns of data on different aspects of CAG performance, we first analyzed the data from our measurements at the Exposition Park and from the Great Duck Island. We observed two common data patterns in our analysis of data from the Great Duck Island: (1) a stair-wise pattern in which data do not change for a long period of time and change drastically whenever there is a change, and (2) a linear pattern in which data change monotonically, increasing or decreasing. The stair-wise pattern is due to the shadows that keep certain places cool for a period of time even in the middle of the day, resulting in sharp changes of temperature.

Figure 3.11: Impacts of different spatial and temporal correlations on the performance of CAG. All experiments use the radio profile for default transmission power. (a) Comparison of the total transmission overhead (accumulated) between TAG and CAG using the linear and stair-wise datasets with τ = 20% (number of transmissions vs. minute). (b) Comparison of the total transmission overhead (accumulated) between TAG and CAG using the Great Duck Island dataset with τ = 20% (number of transmissions vs. hour). (c) Comparison of cluster adjustment overhead (accumulated) between CAG and query flooding using the linear dataset with τ = 20% (number of transmissions vs. minute). (d) Comparison of cluster adjustment overhead (accumulated) between CAG and query flooding using the Great Duck Island dataset with τ = 20% (number of transmissions vs. hour). (e) Comparison of cluster adjustment overhead (accumulated) between the linear and stair-wise datasets with τ = 20% (number of transmissions vs. minute). (f) Maximum, minimum, and average temperature of the Great Duck Island dataset (temperature in F vs. hour).

Stair-wise dataset: The stair-wise dataset for our evaluation is taken from the measurements at the Exposition Park. The temperature readings between 4 PM and 6 PM are interpolated as stair-wise data: no change from 4 to 5 PM, a sudden change at 5 PM to the 6 PM level, and no change from 5 to 6 PM. This dataset has one-minute resolution on the time axis.

Linear dataset: We do not have access to linear time series data at a high resolution from direct sensing in a large deployment. Instead, we generate a linear dataset by linearly interpolating (at one-minute intervals) the temperature snapshots taken at the Exposition Park at 4 PM and 6 PM, described in section 3.4.2.1.

Simulation with these finer-granularity datasets (linear and stair-wise) can help us understand CAG's performance with high data correlation at a smaller time scale.

Now we proceed to present the main results from our simulation study. The first observation is that CAG is much more efficient than TAG in terms of the total number of messages (control and data forwarding) incurred by both algorithms. Fig. 3.11(a) shows that CAG with the linear dataset achieves a 63.07% reduction in message overhead. With the stair-wise data pattern, CAG used 70.24% fewer messages to compute the aggregate result. We also compared the message overhead of CAG and TAG with the Great Duck Island dataset and found that CAG consistently incurs about 19% less overhead than TAG (Fig. 3.11(b)). We further note that these reductions in message overhead were achieved while incurring a small error in the result (Fig. 3.13). We attribute this reduction in the number of transmissions to the reduced number of nodes that send their responses to the root. In TAG, all the nodes send their results to the root, while in CAG only the clusterheads transmit and aggregate the results.
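The construction of the linear and stair-wise datasets described above can be sketched as follows. This is a minimal illustration only: the function names and the three per-node readings are hypothetical (they are not the Exposition Park measurements), and the sketch assumes one sample per minute over the two hours between the 4 PM and 6 PM snapshots.

```python
def linear_dataset(start, end, minutes=120):
    """Linearly interpolate each node's reading from `start` to `end`,
    producing one snapshot per minute (minutes + 1 snapshots in total)."""
    return [
        [s + (e - s) * t / minutes for s, e in zip(start, end)]
        for t in range(minutes + 1)
    ]


def stairwise_dataset(start, end, minutes=120):
    """Hold the `start` readings for the first half of the interval, then
    jump to the `end` readings at the midpoint and hold them."""
    jump = minutes // 2
    return [list(start) if t < jump else list(end) for t in range(minutes + 1)]


# Hypothetical per-node temperatures (illustrative values, not measured data).
at_4pm = [80.0, 76.0, 70.0]
at_6pm = [74.0, 73.0, 70.0]

linear = linear_dataset(at_4pm, at_6pm)
stairs = stairwise_dataset(at_4pm, at_6pm)

print(linear[60])              # readings at 5 PM: [77.0, 74.5, 70.0]
print(stairs[59], stairs[60])  # 4 PM level just before 5 PM, 6 PM level at 5 PM
```

Both datasets start and end at the same snapshots, so any difference in overhead between them comes only from how the change is distributed over time.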
We now justify the need for the cluster adjustment algorithm in the streaming mode using experimental evaluation. Without reclustering, the clusters can quickly become inconsistent as new sensor readings no longer fall within the clustering range. One approach that addresses this problem is reclustering using query flooding. This technique periodically floods the query from the sink, which recreates the query routing tree and all the clusters from scratch using the latest sensor readings. The second approach is to use a separate reclustering algorithm like the one we developed for CAG. With our algorithm, we repair only the clusters that need to change and avoid adjusting the clusters whose nodes still have sensor readings within the clustering range.

Fig. 3.11(c) shows that the cluster adjustment overhead with the linear dataset remains fairly constant and reaches a maximum of 52, while the reclustering overhead using query flooding reaches 20000 messages in 100 minutes, even though both algorithms recluster at the same frequency. Fig. 3.11(d) shows the same result using the Great Duck Island dataset. In this figure, the reclustering overhead using query flooding reaches 6650 messages while that of CAG reaches 249 messages in 96 hours. Thus, we conclude that although reclustering using query flooding works, it has an unacceptable overhead if one desires to use this technique at a frequency high enough to ensure the consistency of cluster membership. On the other hand, the CAG reclustering algorithm is able to run at the same frequency with much smaller overhead. The CAG adjustment algorithm achieves this efficiency by using local repair and avoiding a global adjustment of all the clusters.

We also investigated how the cluster adjustment overhead evolves over time for different data patterns. Our experiments suggest that the message overhead for linear and stair-wise data is similar, though the two evolve differently.
We described earlier that both the linear and stair-wise datasets are constructed using the temperature snapshots at 4 PM and 6 PM. Fig. 3.11(e) shows that with the stair-wise dataset, the adjustment overhead remains 0 until the data jump from one value to the next. During that jump, many nodes change clusters, which results in a large number of adjustment messages transmitted in the network. Once reclustering is done in response to this jump, there is no more need for reclustering and there is no further adjustment overhead. For the linear dataset, the reclustering process is continuous, as there is a small change in data during the entire duration of the experiment. Even though the instantaneous adjustment overheads are significantly different for the two datasets, we note that the accumulated message overhead for both datasets is similar at the end of the experiment at 100 minutes.

Figure 3.12: Breakdown of transmission overhead with measured temperature data. Both panels plot the number of transmissions against the threshold (%), split into counting, reclustering, and response messages. (a) Breakdown of transmission overhead using the linear dataset with default transmission power. Response messages (bottom) overwhelm reclustering messages (middle) and counting messages (top). (b) Breakdown of transmission overhead using the Great Duck Island dataset with default transmission power. Response messages (bottom) overwhelm reclustering messages (middle) and counting messages (top).

Our next observation is that the cluster adjustment overhead becomes smaller with a larger user-provided threshold τ. In Fig. 3.12(a), which shows results with the linear dataset, the adjustment overhead with τ = 2% is 987 messages. The overhead gets smaller as a larger threshold is chosen, eventually yielding a total of 76 messages with a τ of 20%.
With the Great Duck Island dataset, in Fig. 3.12(b), the overhead starts at 1531 messages with τ = 2% and becomes 229 messages with τ = 20%. This result can be explained by the fact that a smaller threshold corresponds to a smaller range of values allowed within the cluster. Even a small change in sensor reading can make the node fall outside the allowed range, which forces the node to find a new cluster. Hence a larger number of nodes participate in cluster adjustment, which results in a larger adjustment overhead with smaller thresholds.

We observe that the adjustment overhead is a much smaller fraction of the total overhead for the linear dataset (Fig. 3.12(a)) compared to the Great Duck Island dataset (Fig. 3.12(b)). Our linear dataset shows more correlation than the Great Duck Island dataset. Lower correlation results in frequent and large changes in sensor readings, which necessitate frequent reclustering. This frequent reclustering explains the higher adjustment overhead with the Great Duck Island dataset compared to the linear dataset.

While not one of our main results, one interesting observation that we made during the course of our experiments is that as the threshold τ increases, the forwarding message overhead for the Great Duck Island dataset remains relatively flat (Fig. 3.12(b)) compared to the gradual decrease in forwarding message overhead for the linear dataset (Fig. 3.12(a)). We attribute this difference to different levels of correlation in the two datasets. The Great Duck Island dataset is less correlated than the linear dataset because it spans a larger period of time (4 days vs. 2 hours). With both datasets, a small threshold gives a small clustering range, so a large number of clusters are formed. With the linear dataset and a larger threshold, the data is more correlated, leading to large clusters which can be maintained over time. This results in lower data overhead, which is why we see a decreasing trend in forwarding overhead.
With the Great Duck Island dataset, which has lower correlation, and a large threshold, even though large clusters are formed in the beginning, the clusters are broken into smaller clusters because the data changes drastically over time. Thus, we end up with a large number of small clusters, which explains why the message overhead does not go down significantly even with a larger threshold.

Next we focus on the error in the results obtained using CAG. What relation is seen between the error in the result and the user-provided threshold? How does the accuracy of the result using CAG compare with the accurate result one could obtain using TAG? What improvement in accuracy, if any, is contributed by the adjustment algorithm?

The results indicate that the absolute error in the result obtained using CAG is always bounded by the user-provided threshold τ. With linear data and a threshold of 20%, the error is always less than 3.09% compared to the accurate result (Fig. 3.13(a)). With the Great Duck Island dataset, the error is always less than 6.26%, also with a threshold of 20% (Fig. 3.13(b)).

To understand the impact of CAG adjustment on accuracy, we ran the next set of experiments without the cluster adjustment algorithm. We expected the errors to be generally higher without the adjustment algorithm because sensor readings slowly drift away from the clustering range without periodic reclustering. The results in Fig. 3.13(a) and 3.13(b) corroborate our expectation. The errors are not only generally higher without reclustering but also sometimes out of bound (Fig. 3.13(b)).

Figure 3.13: Accuracy results with CAG for different datasets with default transmission power. Both panels compare the absolute error (%) with reclustering and without reclustering. (a) Accuracy with the linear dataset and τ = 20% (absolute error vs. minute). (b) Accuracy with the Great Duck Island dataset and τ = 20% (absolute error vs. hour).
We observed a downward trend in the absolute error in Fig. 3.13(a). The sensor readings at 6 PM have smaller variance than the data at 4 PM. Both with and without reclustering, as time progresses towards 6 PM, the similarity between the sensor reading at the clusterhead and those at other nodes in the cluster increases. This continuous increase in similarity results in a downward trend in error for both cases (Fig. 3.13(a)).

We omit the results on performance and precision using the indoor environmental data due to space limitations and the fact that they do not show a significant difference from the results using the outdoor data. Our earlier work [102] described the impact of different levels of correlation and density on efficiency and accuracy.

CAG allows a user to seek approximate answers to a query, thus enabling the network to conserve energy by reducing the number of transmissions. CAG, in general, is not expected to be used with very small thresholds because such thresholds prevent CAG from fully exploiting data correlation to significantly increase energy efficiency over TAG. We further observed that CAG provides bounded error all the time over all thresholds despite using the mica2 radio profile (which is inherently unreliable) in our simulations. Based on our experimental evaluation, for data profiles expected to be similar to the one used in our simulations, we suggest that τ ≥ 4% might be a reasonable trade-off point to achieve both small (bounded) errors and energy savings in practice.

3.5.2.3 Sub-optimal Cluster Size and Routing Path of CAG

In this section, we show the sub-optimality of CAG in terms of cluster size and routing path length by providing the lower bound of the number of clusters and the lower bound of the number of transmissions.

We prove that finding the minimum number of clusters exploiting the spatial correlation in data is NP-complete by reducing the set-cover problem, which is known to be NP-complete, to this problem.
The optimization version of the minimum-number-of-clusters problem is finding the set cover with the smallest number of sets (clusters).

Theorem 1. Finding the minimum number of clusters exploiting spatial correlation is NP-complete.

Proof.
• Minimum-number-of-clusters ∈ NP. Finding the Minimum-number-of-clusters is in NP because there exists a polynomial-time verifier that can verify a given solution in polynomial time.
• Set-cover ≤p Minimum-number-of-clusters. Clusters in the solution of Minimum-number-of-clusters can be mapped into the sets in the Set-cover problem. That is, Set-cover is a special case of Minimum-number-of-clusters. Thus, finding Minimum-number-of-clusters is NP-hard.

We compute the optimal cluster size by branch and bound [76] optimization. We compute the approximation of the optimal cluster size using greedy set cover, which is known to compute a log n-approximation of the optimal cluster size [14]. We denote this approximation algorithm as greedy clustering.

Figure 3.14: Sub-optimal cluster size of CAG with a 25-meter disc radio model and temperature at 1 PM and 7 PM. Both panels plot the number of clusters against the threshold (%), with values 20, 10, 4, 2, 1. (a) Cluster size: optimal, greedy, and CAG clustering with 25 (5×5) nodes. (b) Cluster size: greedy and CAG clustering with 100 (10×10) nodes.

As we can see in Fig. 3.14, the cluster size computed by CAG is sub-optimal; one can improve the clustering algorithm in CAG to form larger clusters and approach the optimal cluster size. In both the linear (7 PM) and spherical (1 PM) patterns, the number of clusters increases in the order of optimal clustering, greedy clustering, and CAG (from the smallest to the largest).
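The greedy clustering baseline mentioned above can be illustrated with a standard greedy set-cover routine. This is a minimal sketch under stated assumptions, not the implementation used in the evaluation: the helper `candidate_cluster` is hypothetical and assumes that a clusterhead can cover itself plus any one-hop neighbor whose reading is within a relative threshold τ of its own, and the readings and chain topology are made up for illustration.

```python
def candidate_cluster(head, readings, neighbors, tau):
    """Assumed coverage rule (hypothetical): `head` covers itself and each
    one-hop neighbor whose reading differs from its own by at most tau."""
    limit = tau * abs(readings[head])
    return {head} | {
        n for n in neighbors[head] if abs(readings[n] - readings[head]) <= limit
    }


def greedy_set_cover(universe, candidates):
    """Standard greedy set cover: repeatedly pick the candidate cluster that
    covers the most still-uncovered nodes. Returns the chosen clusterheads."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        head = max(candidates, key=lambda h: len(candidates[h] & uncovered))
        gained = candidates[head] & uncovered
        if not gained:
            raise ValueError("candidate clusters do not cover every node")
        chosen.append(head)
        uncovered -= gained
    return chosen


# Hypothetical five-node chain with two temperature regions.
readings = [20.0, 20.5, 21.0, 30.0, 30.5]
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

for tau in (0.10, 0.0):
    candidates = {
        h: candidate_cluster(h, readings, neighbors, tau) for h in range(5)
    }
    heads = greedy_set_cover(range(5), candidates)
    print(tau, heads, len(heads))
```

With τ = 10% the sketch needs two clusters for this toy input, while τ = 0 forces one cluster per node, which mirrors the trend of more clusters at smaller thresholds.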
CAG with the linear pattern (7 PM), which has stronger data correlation than the spherical pattern, results in a far smaller number of clusters than CAG with the spherical pattern (1 PM). Because CAG benefits more from stronger correlation in data and forms larger clusters, CAG with the linear pattern (7 PM) is closer to optimal clustering than CAG with the spherical pattern (1 PM).

We compare the number of transmissions from clustering and forwarding the packets between 1) CAG (clustering with forwarding) and 2) greedy clustering with the Greedy Incremental Tree (GIT). GIT [41] constructs the shortest path only for the first source to the sink, whereas each of the other sources is incrementally connected at the closest point on the existing tree. GIT increases path sharing, thereby reducing energy consumption. We used GIT as an approximation of the optimal forwarding path because GIT is known to be within a factor of 2 of optimal, where the optimal path is the Steiner tree, whose computation is NP-complete.

Figure 3.15: Sub-optimal transmission overhead of CAG with a 25-meter disc radio model and temperature at 1 PM and 7 PM. Both panels plot the number of transmissions against the threshold (%), with values 20, 10, 4, 2, 1. (a) Number of transmissions: greedy clustering with GIT vs. CAG with 25 (5×5) nodes. (b) Number of transmissions: greedy clustering with GIT vs. CAG with 100 (10×10) nodes.

Overall, we observed that CAG is less efficient than greedy clustering with GIT (Fig. 3.15). As the number of nodes in the network increases to 100, as in Fig. 3.15(b), CAG with the spherical pattern (1 PM) lags behind the optimal clustering in performance. CAG with the linear pattern (7 PM) lags farther behind the optimal clustering.
This also supports our claim that the stronger the correlation in data, the better the performance measured by the number of transmissions incurred by CAG.

Chapter 4

SWATS: Steam and WAter Tracking System

4.1 Overview

State-of-the-art anomaly detection systems deployed in the oilfields are expensive, not scalable to a large number of sensors, require manual operation, and provide data with a long delay. A sensor network consists of small, inexpensive nodes equipped with embedded processors and wireless communication, which enables flexible deployment and close observation of phenomena without human intervention. We design a novel application using wireless sensor networks to detect, identify, and localize major problems that arise in steamflood pipeline networks in oilfields. Our system aims to allow continuous monitoring of the steamflood system with low cost, short delay, and fine-granularity coverage while providing high accuracy and reliability.

Our system, SWATS (Steamflood and WAterflood Tracking System), detects and identifies major anomalies in steamflood pipeline networks: blockage, leakage, outside force damage, generator breakdown, and Splitigator malfunction. The problem is challenging because of the inherent inaccuracy and unreliability of sensors and the transient characteristics of two-phase steam flows, resulting in various potential false alarms. Moreover, observation by a single node cannot capture the topological effects on the transient characteristics of steam fluid to disambiguate similar problems and false alarms. We address these hurdles by utilizing multi-modal sensing and multi-sensor collaboration with the SWATS algorithm, which exploits temporal and spatial patterns of the sensed phenomena. We build a rule-based decision tree to capture the salient pressure and flow characteristics of each problem and distinguish them from false alarms. Our algorithm can successfully detect, identify, and localize anomalies in the presence of sensor failure and various false alarms. Our system represents a promising new approach for oilfield monitoring that has the benefits of low cost, flexible deployment, continuous monitoring, and accurate problem detection and identification.

Figure 4.1: Equipment currently used in steamflood monitoring in the oilfield. (a) Injection well (top part). (b) Orifice flowmeter. (c) Splitigator, which keeps steam quality constant between upstream and downstream. (d) Steam generator. (e) Co-Gen(erator) of steam and electricity.

4.2 Problem Description

Heat delivery to the oilwell is a major cost in the operation of thermally heated oilwells. This cost can be significantly reduced by finer and faster control of heat delivery to the malfunctioning equipment or pipelines.
Table 4.1: Classification of anomalies in steamflood monitoring in the oilfield. All the events need multiple sensors for detection.

| Category | Events to detect | Physical phenomena | Where to deploy sensors | Reasons |
|---|---|---|---|---|
| Problems | Blockage | Blockage | Splitigator (water leg orifice, valve, steam leg orifice), choke | Scale, something left over after construction |
| Problems | Leakage | Leakage | Flange, joint | Pipeline corrosion, loose pipeline junction |
| Problems | Generator breakdown | Leakage or blockage | Generator | Generator breakdown |
| Problems | Splitigator malfunction | Phase splitting | Before and after Splitigator | Splitigator malfunction |
| Problems | Outside-force or third-party damage | Leakage | Flange, junction, near obstacles | Earthquake, stone |
| False alarms | Change in steam supply | Leakage or blockage | Generator | Generator outage, shortage in steam supply |
| False alarms | Downhole pressure change | Downhole pressure change | Injector | Vertical permeability, geological formation, heterogeneity |
| False alarms | Change in steam quality | Change in steam quality | Inlet of pipeline, before and after Splitigator, obstacles, and pipeline elevation | Elevation or pressure change, or the 2-phase transient property of steam |
| False alarms | Phase splitting at piping tees | Phase splitting | All the branches including Splitigators | Difference in steam quality between upstream and downstream of branches |
| False alarms | Sensor noise and sensor fault | Inaccuracy in sensor readings | A pair of close-by sensors in the middle of the pipeline for sanity check | Inaccurate and unreliable pressure meter, thermometer, and flow meter |
| False alarms | Environmental effects | Inaccuracy in sensor readings | A pair of close-by sensors in the middle of the pipeline for sanity check | Environmental noise unique to this system such as pipeline friction, ambient temperature, etc. |

Steamflooding is one of the Thermally Enhanced Oil Recovery (TEOR) techniques, which utilizes the heat contained in the steam to make heavy oil (< 20° API) more fluid for easier oil recovery [37, 52].
This is an economically driven problem because steam generation and distribution use about half of the total budget for the entire oilfield operation. The goal of steamflooding is to optimize the quantity of steam injected into each injection well so that the amount of heat delivered by the steam pipeline networks is fair and constant; oilfield engineers want to keep the steam flow rate under the critical flow rate constantly for maximum efficiency. Critical flow rate refers to a situation where a reduction in the downstream pressure of the flow (in the reservoir) does not change the mass flow rate [8, 9, 10]. Maintaining the critical flow rate is important in steamflood systems because it ensures the delivery of a constant amount of heat to each well. The critical flow rate can be determined by measuring the steam injection pressure (both upstream and downstream) and the differential pressure at the wellhead [8]. Oilfield engineers want to detect the situation where the actual flow rate is out of the critical flow rate, identify its causes, localize its origin, trigger an alarm immediately, and provide feedback to the machines that control steam injection (generator or co-gen) to halt steam injection until further diagnosis. Fig. 4.1 shows the equipment for steamflooding in the current oilfield monitoring system.

The problems resulting in an out-of-critical flow rate, which oilfield engineers care about, can be due to blockage, leakage, equipment breakdown in both the generator and the Splitigator, and outside force or third-party damage (Table 4.1). These problems are often observed at the choke and Splitigator but not in the pipelines. Blockage and leakage are the major concerns that can result in an out-of-critical flow rate of steam. Blockage, which is often observed at the Splitigator and choke, is often caused by scale deposition from the saturated steam or by left-over debris and foreign objects after construction.
Leakage, often observed at the Splitigator, near pipeline junctions, and near obstacles, is caused by pipeline corrosion and loose junctions. Incipient detection of these problems is challenging because, at the early stage of a problem, the pressure and flow rate changes are difficult to distinguish from those of normal or transient fluctuations and false alarms. Moreover, in a real environment, multiple problems can happen simultaneously in addition to various false alarms, which makes anomaly detection and identification even more challenging. Generator breakdown and Splitigator malfunction are equipment-specific problems for which we strategically deploy multiple sensors near that equipment. Outside force or third-party damage happens less frequently than blockage and leakage and is easy to detect because it shows a sudden change in pressure and flow rate.

Algorithm 3 SWATS (F, W, D, T, N, C)
Require: F is the sensing frequency. W is the sliding window size over which the linear regression is performed in a node. D is the threshold for event detection for pressure and flow rate, which classifies the trend between constant and small increase/decrease. T is the threshold for temporal trending for pressure and flow rate, which classifies between small and big increase/decrease. N is the number of upstream and downstream neighbors that collaborate by cross-checking the local temporal trending in order to identify and localize problems and false alarms. C is the threshold in voting for pressure and flow rate above which the local temporal trending is validated.
Single node processing:
  Stage 1: In-node sensor readings validation
    Cross-check the validity of sensor readings with multi-modal sensing (pressure and temperature) at F.
  Stage 2: Noise reduction
    To clean the data, compute the moving average of sensor readings over an appropriately configured sliding window size, W.
  Stage 3: Event detection
    Perform linear regression for temporal trending and compute the numeric slope for pressure and flow rate over W.
    if ((Numeric slope for pressure ≥ D) OR (Numeric slope for flow rate ≥ D)) then
      Determine the classes of trend (big increase, small increase, constant, small decrease, big decrease) by comparing the numeric slope and T.
      go to stage 4.
    else
      return.
    end if
Multi-node collaboration:
  Stage 4: In-network event detection validation
    Cross-check with the local classes of trend over N at F.
    Vote over N.
    if (The result from voting > C) then
      The local event detection is validated.
      go to stage 5.
    else
      return.
    end if
  Stage 5: Problem identification
    if (In-network event detection is validated) then
      DECISION TREE ALGORITHM over N.
      go to stage 6.
    else
      return.
    end if
  Stage 6: Problem localization
    Localize the problem by finding the node that best matches the rule for the identified problem in the decision tree.

The anomaly detection system in use in the steam pipelines has prohibitive cost, long delays in measurement, and coarse measurements, and requires periodic manual inspection of the system. Field engineers are interested in automating this manual and slow detection and correction process with a system that can detect problems fast, make decisions rapidly, and take action to fix the problems quickly. Economic considerations dictate that such a system has to cost less than the current manual system and eventually save cost in oilfield operation by detecting and fixing problems in the steam pipeline networks in a timely manner. Our goal is to design a system that detects, identifies, and localizes the problems reliably, quickly, and accurately while reducing cost.

4.3 System Design

4.3.1 Overview of Our Approach

The key technique in our algorithm is the rule-based identification of problems and false alarms by collaboratively exploiting spatial and temporal correlations in the sensor readings.
We define the principal rules for decision making by capturing the salient characteristics of the pressure (or temperature) and flow rate in space and time as a consequence of each problem and false alarm. The intuition behind our approach is that the neighboring sensor nodes in a pipeline observe the coherent impact of each anomaly on pressure and flow rate in the steam flow. We assume inexpensive thermocouple-based temperature sensors on all the nodes, which are strategically placed in the pipeline network. Temperature can then be converted into pressure. Flow rate is computed using pressure at multiple points in a pipeline. In our system, all the nodes conduct peer-to-peer communications because the problem is ubiquitous and all the nodes are expected to sense and process data with an approximately equal level of intelligence.

Because of possible inaccuracy in readings produced by inexpensive sensors, we use multi-modal multi-node collaboration to improve the correctness of problem diagnosis. Although we may detect the problems and false alarms correctly at a single node, single-node processing is not enough for correct identification of problems and false alarms. Most of the problems and false alarms present the same phenomena in pressure and flow rate at a node, such as a gradual drop, a sudden drop, or an ephemeral change. Several problems and false alarms can be distinguished only by analyzing the physical signature upstream and downstream at a certain distance from the origin along with the local node, and by comparing multiple modalities such as pressure and flow rate simultaneously. We create spatial and temporal pattern rules in our decision tree algorithm by understanding those unique indications of each problem.
4.3.2 The Steamflood Monitoring Algorithm

Our steamflood monitoring algorithm tries to determine the potential causes of an out-of-critical flow rate at the critical flow choke, which can be blockage, leakage, equipment breakdown, or outside force damage (Algorithm 3). Several thresholds used in this steamflood monitoring algorithm are tuned with the parameters of the pipelines, the equipment, and the out-of-critical-flow-rate condition, such that (downstream pressure) / (upstream pressure) ≠ 0.55 [10] at the choke, or p2 < p3, where p2 is the choke throat pressure when the flow is critical and p3 is the discharge pressure downstream of the throat [89].

Our proposed algorithm consists of two stages: single-node processing and multi-node collaboration.

4.3.2.1 Single-node Processing

At each node, our algorithm performs in-node sensor readings validation (using multi-modal sensing) and noise reduction. Then it analyzes the temporal trend locally to detect events.

• In-node sensor readings validation: In order to check the validity of sensor readings, we cross-check data in a node from multi-modal sensing (pressure and temperature) at 1 Hz.
• Noise reduction: In order to clean the raw data samples, we compute the average pressure and the average flow rate for each sliding window, W. We need to tune W to 1) correctly detect and identify problems and false alarms given the pipeline parameters and transient flow, 2) satisfy timeliness, and 3) reduce storage and communication overhead. To reduce noise due to transient flows, we adaptively use a larger W during the times when there is transient flow in the pipeline.
• Event detection: For the temporal trending at a local node, we capture the temporal pattern of pressure and flow rate by performing linear regression over W, which results in the numeric slope and intercept for both pressure and flow rate.
We determine the 5 classes of trend (big increase, small increase, constant, small decrease, big decrease) using this computed numeric slope and two thresholds: Detection and Temporal. If the scales of the computedslopeforbothpressureandflowratearelargerthanDetectionthresholdforeach, we determine that an event of interest (at least small event) occurred at the local node over the window W. The Temporal threshold determine the numeric slope for pressure and flow rate between small and big event. 4.3.2.2 Multi-node Collaboration Our proposed rule-based decision tree algorithm utilizes collaboration of neighboring nodes to reach a consensus in their detection and identification results for the same phenomena. 80 • In-network event detection validation: In order to verify the local detection result, we cross- check the classified local trend with both upstream and downstream neighbors overW. The Numcross parameter is the number of neighboring nodes to cross-check. Voting is used to eliminate the impact of unreliable or noisy sensing on the correct detection. • Problem identification: For the nodes with validated event, we identify the causes of prob- lems and false alarms by providing the classes of trends for pressure and flow rate over the neighbors as inputs to our decision tree algorithm. To identify the anomalies in the pipeline and disambiguate problems from false alarms, we use the decision rules that describes spatial and temporal characteristics of problems and false alarms. We identify the cause of anomaly by comparing each local classes of trends over the Numdecision, the size of neighbors both in upstream and downstream to be used whenweperformspatialcomparison. WeassumeNumdecisionparameterisgivenasapart of decision tree. In addition, each node utilizes information about pipeline elevation, logical location, equipment maintenance schedule, and physical proximity to equipments such as generator, branch, Splitigator, and choke to identify problems correctly. 
• Problem localization: To localize the problem, we find the node that best matches the rule for the identified problem in the decision tree. The node satisfying the specific condition in the decision tree is the origin of the problem.

4.3.3 Principal Rules in Decision Tree Algorithm

Our system classifies the anomalies into 5 types of problems: blockage, leakage, generator breakdown, Splitigator malfunction, and outside force or third party damage, and 6 false alarms: downhole pressure change, change in steam supply or generator outage, phase splitting on pipeline branch, change in steam quality, sensor noise or sensor fault, and environmental effects.

[Figure 4.2: An example of decision tree on pressure for blockage, leakage, and downhole pressure change.]

The decision tree checks from critical to trivial causes, that is, from problems to false alarms. The algorithm first compares the problem set using the rules in the decision tree. Then it tries to distinguish the candidate problems from the related false alarms using 1) in-depth comparison of phenomena, 2) prior information (scheduled outage or pipeline elevation), 3) information reported by other nodes, and 4) information about proximity to equipment.

To demonstrate how our system works, we now present an example of a decision rule used to identify blockage in a pipeline. To detect such a blockage, the algorithm needs to diagnose the gradual drop in pressure and flow at the local node, and thus exploit the temporal patterns of measured values. Since small fluctuations could be caused by measurement errors, each sensor needs to clean its data first.
In addition, the sensor will also need to correlate its diagnosis with upstream and downstream nodes. Otherwise, it would not be able to distinguish a blockage from either a downhole pressure change, which is considered a false alarm, or a leakage, a different type of problem.

Problem or False alarm | Physical phenomena | Event point (local node) | Modality | Temporal change (local node) | Temporal duration (local node) | Upstream node | Local node | Downstream node
Blockage (P) | Blockage | Splitigator or choke | Pressure | Gradual drop | L | Increases | Drops | Drops
Blockage (P) | Blockage | Splitigator or choke | Flow rate | Gradual drop | L | Drops | Drops | Drops
Downhole pressure change (F) | Downhole pressure change | Between choke and injector | Pressure | Gradual OR sudden change | E | Does not change | Changes (fluctuates OR increases OR decreases) | N/A (cannot measure)
Downhole pressure change (F) | Downhole pressure change | Between choke and injector | Flow rate | Gradual OR sudden change | E | Does not change | Changes (fluctuates OR increases OR decreases) | N/A (cannot measure)
Leakage (P) | Leakage | Flange or joint | Pressure | Gradual drop | L | Drops | Drops | Drops
Leakage (P) | Leakage | Flange or joint | Flow rate | Gradual drop | L | Increases | Drops | Drops
Outside force or third party damage (P) | Leakage | Pipeline | Pressure | Sudden drop | L | Drops | Drops | Drops
Outside force or third party damage (P) | Leakage | Pipeline | Flow rate | Sudden drop | L | Increases | Drops | Drops
Generator breakdown (P) | Leakage or blockage | Generator | Pressure | Gradual drop | L | N/A (most upstream point) | Decreases other than scheduled maintenance | Decreases other than scheduled maintenance
Generator breakdown (P) | Leakage or blockage | Generator | Flow rate | Gradual drop | L | N/A (most upstream point) | Decreases other than scheduled maintenance | Decreases other than scheduled maintenance
Change in steam supply or generator outage (F) | Leakage or blockage | Generator | Pressure | Gradual drop | L | N/A (most upstream point) | Decreases only when scheduled maintenance | Decreases only when scheduled maintenance
Change in steam supply or generator outage (F) | Leakage or blockage | Generator | Flow rate | Gradual drop | L | N/A (most upstream point) | Decreases only when scheduled maintenance | Decreases only when scheduled maintenance
Splitigator breakdown (P) | Phase splitting | Splitigator | Pressure | Sudden change | L | Constant | Increases OR decreases OR fluctuates | Increases OR decreases OR fluctuates *
Splitigator breakdown (P) | Phase splitting | Splitigator | Flow rate | Sudden change | L | Constant | Increases OR decreases OR fluctuates | Increases OR decreases OR fluctuates *
Phase splitting (F) | Phase splitting | Branch without Splitigator | Pressure | Sudden change | L | Constant | Increases OR decreases OR fluctuates | Increases OR decreases OR fluctuates *
Phase splitting (F) | Phase splitting | Branch without Splitigator | Flow rate | Sudden change | L | Constant | Increases OR decreases OR fluctuates | Increases OR decreases OR fluctuates *
Change in steam quality (F) | Change in steam quality | Pipeline with OR without elevation | Pressure | Sudden change | E in a flat pipeline, L for inclination | Constant | Increases OR decreases OR fluctuates | Increases OR decreases OR fluctuates
Change in steam quality (F) | Change in steam quality | Pipeline with OR without elevation | Flow rate | Sudden change | E in a flat pipeline, L for inclination | Constant | Increases OR decreases OR fluctuates | Increases OR decreases OR fluctuates
Sensor noise or sensor fault (F) | Inaccuracy in sensor readings | Everywhere | Pressure | Sudden change | E for sensor noise, L for fault | Does not change | Sudden change | Does not change
Sensor noise or sensor fault (F) | Inaccuracy in sensor readings | Everywhere | Flow rate | Sudden change | E for sensor noise, L for fault | Does not change | Sudden change | Does not change

Table 4.2: Principal Rules to identify Problems and False Alarms. Key for abbreviations: Problem (P), False alarm (F), Long-lasting (L), Ephemeral (E). * indicates the downstream regardless of equipment in the pipeline. For example, if pressure or flow rate increases in one downstream branch of the pipeline, then it decreases in at least one other downstream branch.

Blockage causes a gradual drop over a long time (small decrease) in both pressure and flow rate at the local and downstream nodes, while at upstream nodes the pressure increases, due to the constant injection with a valve, and the flow rate drops.
Alternatively, if both the pressure and flow rate for the upstream node do not change while all other conditions are the same as for blockage, the algorithm considers the problem a leakage. On the other hand, if both the pressure and flow rate for the upstream node do not change and the readings for the local node change (fluctuate, increase, or decrease), then the algorithm identifies that event as a downhole pressure change, a false alarm. Fig. 4.2 depicts the part of the decision tree for this example.

Table 4.2 describes the rules for the decision tree used to identify and differentiate the problems from false alarms by exploiting the spatial and temporal characteristics in pressure and flow rate. These are the minimum set of rules necessary for even the simplest identification algorithm. In general we can identify most of the problems and false alarms using the rules in Table 4.2, but we may not be able to distinguish them in these cases:

• Leakage by outside force or third party damage (problem) vs. change in steam quality along a flat pipeline (false alarm)
• Change in steam quality (false alarm) vs. sensor noise along a flat pipeline (false alarm)

4.4 Evaluation

4.4.1 Simulation Scenarios

[Figure 4.3: A blockage scenario using Dynsim in simple topology with a single pipeline, generator, and injector.]

To validate the SWATS algorithm, we first create simulation scenarios with a simple pipeline topology. In the scenario, there is a single generator and an injector at one end of a 500-foot-long pipeline, as shown in Fig. 4.4. From the generator, 6 nodes with pressure and temperature sensors are placed every 100 feet along the pipeline. We also assume that these sensors are deployable on the pipelines. By communicating the measured pressure with a neighboring node, each node can compute its flow rate to provide as an input to the SWATS algorithm.
Each node samples sensor readings every 0.5 second and averages them over 2.5 seconds into a single reading to obtain steady, distinctive, and meaningful data.

4.4.2 Anomaly Scenarios

We design the blockage and leakage scenarios using Dynsim [43], which enables dynamic simulation of transient steam pipeline networks with the equipment and process control at fine time granularity. In these scenarios, at time 0, Dynsim starts to flood steam from the generator to the sink without any anomalies. We issue an anomaly such as a blockage at time t.

We use the Dynsim predefined valve blockage as a pipeline blockage. We place two valves in the blockage scenario: the first valve immediately after the generator and the second 300 feet away from the generator. The first valve is 100% open whenever the generator is on, and the second valve is 100% open at the beginning but is configured with a blockage of specified severity and rate after 3 minutes. We verified that the valve blockage has the same effect as valve closure at a corresponding percentage.

[Figure 4.4: A leakage scenario using Dynsim in simple topology with a single pipeline, generator, and injector.]

We build a leakage scenario in a single pipe using a short pipe (1 foot) connected to another sink with pressure 14.496 psia (1.0 standard atmosphere absolute pressure). The header diverts a fraction of the flow to the new sink, simulating a leakage in the main pipe. This valve is closed at time 0, but is configured with a valve leakage of specified severity and rate after 3 minutes. We also verified that the valve leakage has the same effect as opening the valve at a corresponding percentage.

We experimented with blockages and leakages with severities from 0% to 100% in 10% increments, and with blockage build-up rates of 0.001%/sec, 0.01%/sec, 0.1%/sec, 1%/sec, and 5%/sec. (In this paper, we show the results for 5%/sec due to space constraints.) In this scenario, we do not create any false alarm events, but the sensors have natural noise and the pipeline has friction.
Table 4.3 presents all the parameters that we used for the blockage scenario.

Component | Parameter | Value
Generator | Pressure | 600 psia
Generator | Temperature | 486.3 F
Generator | Components | Wet steam (H2O)
Generator | Amount | 1984.5 kg
Generator | Standard temperature | 60.0 F
Generator | Standard pressure | 1.0 atmosphere
Generator | Relative elevation | 0 m
Generator | Internal phases | Vapor, Liquid
Generator | External phases | Two Phases
Injector | Pressure | 558 psia
Valve | Reverse flow factor | 1.0
Valve | Internal phases | Vapor, Liquid
Pipe | Reverse flow factor | 1.0
Pipe | Inside diameter | 4 inch
Pipe | Length (each segment) | 100 ft
Pipe | Wall thickness | 40 schedule
Pipe | Internal phases | Vapor, Liquid

Table 4.3: Parameters used for blockage scenario using Dynsim

4.4.3 Preliminary Results

Our proof-of-concept design and simulation demonstrate the accuracy with which we can identify each problem in an ideal and simple topology consisting of a single pipeline. The metrics used to evaluate SWATS are:

• Correctness of identification: Ratio of correctly identified events to all identifications (correct and incorrect).
• Correctness of localization: Ratio of correctly localized events to all localizations (correct and incorrect).
• Timeliness of identification: Delay between the occurrence of an anomaly in the pipeline and the first correct identification result.
• Smallest blockage identifiable: The smallest level of blockage that SWATS can correctly identify.

For the required parameters of SWATS, we used W = 10 samples (25 seconds), detection threshold = 0.2, temporal threshold for P = 30.0 and F = 2.0, crosscheck threshold for P = 0.5 and F = 0.5, and numcross = numdecision = 3 in the entire experiment.

Fig. 4.5 shows that SWATS correctly identifies and localizes pipeline anomalies. In this scenario, we introduced a 50% blockage upstream from node 4.
Among other problems and false alarms which can arise in a simple topology, SWATS correctly identified the blockage and localized this problem at node 4, which is the immediate downstream node after the blockage in the pipe in Fig. 4.4. To provide a stable detection result, SWATS reports an identification after three consecutive identical classifications. In the figure, three consecutive identical classifications were achieved at time 200 seconds. In the figure, we only show nodes that satisfied at least one predicate for anomalies.

[Figure 4.5: Correctness of identification and localization: node 4 correctly identified the blockage at 200 seconds (first detection at 195 seconds) with 50% blockage severity and 5%/second build-up rate. A red dot indicates that the corresponding predicate on the x-axis (P is pressure and F is flow rate) is detected at the node given on the y-axis; a correct identification requires all the predicates for a given anomaly to be true. Only node 4 satisfies all the predicates for the blockage. Among 6 nodes (as shown in Fig. 4.4), only nodes 2, 3, 4, and 5 (shown on the y-axis) detected the event. Predicates on the x-axis are grouped by anomaly: blockage, leakage, outside force, generator breakdown, generator outage, and downhole pressure change.]

Fig. 4.6 shows the timeliness of anomaly identification. In the simulation scenario, we introduced a blockage on the valve upstream from node 4 at time 180 seconds, increasing the blockage from 0% to 50% by 5% every second. SWATS identified this blockage at time 200 seconds.

[Figure 4.6: Correctness of localization and timeliness: only node 4 identified the blockage correctly from 195 seconds to 202.5 seconds (for 4 detection sliding windows) among 6 nodes (as shown in Fig. 4.4). Axes: time (seconds) vs. identified node id.]

We measured the impact of noise on the correctness of identification and localization. We introduced up to 10% flow rate noise in our simulation. We observed a negligible reduction in correctness across blockage severities of 0% to 100%.

Fig. 4.7 shows the impact of the detection threshold on the anomaly severity, latency, and correctness of identification.

Fig. 4.7(a) shows that a smaller detection threshold does not help detect and identify incipient or less severe anomalies. Even with a detection threshold of 0, although we expected SWATS to detect mild blockages correctly, we found that SWATS was able to correctly identify and localize only blockages larger than 50%. SWATS could not detect smaller blockages.
This counter-intuitive result is due to the low threshold triggering false alarms in response to small changes in the flow rate, reporting a blockage even before the real blockage.

[Figure 4.7: Impact of detection threshold on the level of severity, latency, and correctness. (a) Level of blockage identifiable (%) vs. detection threshold. (b) Identification latency (first identification, in seconds) vs. detection threshold, with the fixed level of blockage (50%). (c) Correctness of identification (%) vs. detection threshold, with the fixed level of blockage (50%).]

Even though the detection threshold does not help identify small-scale anomalies, it impacts the detection latency. Fig. 4.7(b) shows that the identification latency decreases with the detection threshold.

Fig. 4.7(c) shows the impact of the detection threshold on the correctness of identification. A low detection threshold of 0 results in several false alarms, so that only 77.8% of the identification reports are correct. With a threshold greater than 0.01, all the identification reports were correct.

Although we show the correctness, timeliness, and robustness of the SWATS algorithm in this article, it may not detect or identify an anomaly with an infinitesimal impact on flow over a long time. Because SWATS uses the changing pattern of flows over time and space, it works better in a scenario in which anomalies introduce a non-negligible change in flow rate. We used static thresholds in this study, which might need to be customized for other pipeline networks. We empirically showed the correctness and potential of the SWATS algorithm in this study.
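The static thresholds mentioned above enter SWATS through the trend classification of Section 4.3.2.1. The following minimal sketch shows one way the regression slope over a window could be mapped to the five trend classes. The function names, the use of seconds as the time unit of the slope, and the handling of values exactly at a threshold are assumptions; the text does not specify them.

```python
from typing import Sequence


def slope(values: Sequence[float], dt: float = 2.5) -> float:
    """Least-squares slope of evenly spaced readings taken dt seconds apart."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two readings to fit a slope")
    t_mean = dt * (n - 1) / 2.0
    v_mean = sum(values) / n
    num = sum((i * dt - t_mean) * (v - v_mean) for i, v in enumerate(values))
    den = sum((i * dt - t_mean) ** 2 for i in range(n))
    return num / den


def classify_trend(s: float, detection: float, temporal: float) -> str:
    """Map a slope to one of the five trend classes.

    Assumed convention: |s| <= detection is 'constant'; |s| > temporal is
    'big'; anything in between is 'small'.
    """
    magnitude = abs(s)
    if magnitude <= detection:
        return "constant"
    size = "big" if magnitude > temporal else "small"
    direction = "increase" if s > 0 else "decrease"
    return f"{size} {direction}"
```

With the pressure settings reported in Section 4.4.3 (detection threshold 0.2, temporal threshold 30.0), a slope of -0.4 would fall into the small decrease class under these assumed conventions. Customizing SWATS for another pipeline network would amount to choosing different values for the two threshold arguments.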
SWATS is the first WSN-driven system with inexpensive sensors that enables quasi in-situ sensing for pipeline monitoring by exploiting spatial and temporal patterns of flow phenomena.

Chapter 5
DSS: Distributed Spatial Skyline

5.1 Overview

Spatial skylines, among all the nodes in the network, are the nodes that are closer to at least one query point and equidistant to the rest of the query points. Spatial skyline queries can be used in wireless sensor networks for collaborative positioning for multiple moving targets, or positioning for defense against multiple enemies in the battlefield. However, designing a distributed spatial skyline algorithm in resource-constrained wireless environments introduces several research challenges: how to combine these multi-dimensional data, e.g. distances to multiple events, to compute the skyline efficiently but accurately and progressively, concurrently with multiple queries, while dealing with the network and event dynamics.

We propose three flavors of distributed spatial skyline (DSS) algorithms based on different partitioning strategies: 1) TDSS: Triangulation-based DSS, 2) RDSS: Rendezvous-based DSS, and 3) TRDSS: Triangulation and Rendezvous-based DSS. Our algorithms use geometric properties in two layers of the protocol stack: geographic routing techniques such as GPSR, GHT-based routing, and Voronoi-based geographic flooding in the routing layer, and geometric notions such as the convex hull and the Voronoi diagram to compute the spatial skyline in the application layer. To the best of our knowledge, these are the first distributed algorithms to compute spatial skylines. Our algorithms use a novel routing technique, Voronoi-based geographic flooding, together with a well-known data-centric routing algorithm called GHT and a geographic routing algorithm such as GPSR. Together with these techniques, our algorithms compute the skyline result efficiently, accurately, progressively, quickly, and concurrently.
In a network of 554 nodes, TDSS, for example, reduces the communication overhead by 91.49% over a centralized algorithm while providing 100% accurate skylines with a modest delay.

5.2 Formal Problem Definition

Each data object $p$ with $d$ real-valued attributes can be denoted as a $d$-dimensional point $(p_1, \dots, p_d) \in \mathbb{R}^d$, where $p_i$ is the $i$-th attribute of $p$. Let $P$ be a set of points in the $d$-dimensional space $\mathbb{R}^d$ and $D(\cdot,\cdot)$ be a distance metric defined in $\mathbb{R}^d$, where $D(\cdot,\cdot)$ obeys the triangle inequality. Object, point, and node are used interchangeably to refer to each data object unless we define them specifically, and distance denotes Euclidean distance throughout the paper.

5.2.1 General Skyline Query

Given two points $p = (p_1, \dots, p_d)$ and $p' = (p'_1, \dots, p'_d)$ in $\mathbb{R}^d$, $p$ dominates $p'$ iff we have $p_i \le p'_i$ for $1 \le i \le d$ and $p_j < p'_j$ for some $1 \le j \le d$, where a point $p$ is a data object with $d$ real-valued attributes defined as a $d$-dimensional point $(p_1, \dots, p_d) \in \mathbb{R}^d$.

Fig. 5.1(a) depicts an example of a general skyline query with six objects $P = \{a, b, c, d, e, f\}$. Each point represents a hotel with two-dimensional attributes: price and distance to beach. The skyline in Fig. 5.1(a) is $S = \{a, c, e\}$, which are the non-dominated points in $P$.

5.2.2 Spatial Skyline Query

Given $d$-dimensional query points $Q = \{q_1, \dots, q_d\}$ and two points $p$ and $p'$ in $\mathbb{R}^d$, the point $p \in P$ is in the spatial skyline of $P$ with respect to $Q$ iff for any point $p' \in P$ there is a query point(s) $q_i$ and $q_j$ for which we have $D(p, q_i) \le D(p', q_i)$ for all $q_i \in Q$ and $D(p, q_j) < D(p', q_j)$ for some $q_j \in Q$.

[Figure 5.1: Examples of general skyline and spatial skyline. (a) A general skyline {a,c,e} with a two-dimensional database of 6 points (axes: price vs. distance to beach). (b) A spatial skyline {a,b,c,d,e,f} with 4 query points and 11 nodes.]
That is, $p$ is a spatial skyline iff we have:

$$\forall p' \in P,\ p' \neq p,\ \exists q_i \in Q \ \text{s.t.}\ D(q_i, p) \le D(q_i, p') \qquad (5.1)$$

Fig. 5.1(b) depicts an example of a spatial skyline query with 4 query points $Q = \{q_1, q_2, q_3, q_4\}$ and 11 objects $P = \{a, b, c, d, e, f, g, h, i, j, k\}$. Each query point $q_i$ represents a separate accident, and each point $p$ represents the position of a policeman with four-dimensional attributes: the distances from the position of the policeman to accident 1, accident 2, accident 3, and accident 4, respectively. The skyline in Fig. 5.1(b) is $S = \{a, b, c, d, e, f\}$, which are the non-dominated points in $P$. That is to say, $p$ spatially dominates $p'$ iff every $q_i$ is closer to $p$ than to $p'$, or at the same distance from $p$ and $p'$. From this point on, skyline and spatial skyline are used interchangeably throughout the paper.

5.3 Foundation

5.3.1 Preliminaries

Let there be $N$ nodes on a plane. Let $CH_v(Q)$ be the query convex hull, which is the convex hull of the query points $Q$, and let the query convex points be the points that constitute $CH_v(Q)$. Let a closest convex node, $CN_v(Q)$, be the closest node to a query convex point. In Fig. 5.4(a), quadrilateral ABDC is the query convex hull, and nodes 1, 11, 9, and 7 are closest convex nodes.

Let the synchronization point, SP, be an arbitrary position in the network. The synchronization point is used to gather data from all the query points at a single point. The synchronization point may or may not be a node. If it is not a node, the closest node to that point acts as the delegate for synchronization. Let the GHT home node be the closest node to the synchronization point. In Fig. 5.4(a), the star is a synchronization point, and node 2 is a GHT home node.

Let the rendezvous point, R, be the centroid of the query convex hull, and the rendezvous node be the corresponding closest node. Let a triangle, T, be the triangular region bounded by two consecutive query convex points and the rendezvous point.
Let an extended triangle, ET, be the union of a triangle and the area outside the query convex hull confined by the extended lines connecting the rendezvous point and two consecutive query convex points (closed by the network boundary). For each triangle, let the triangle home be the counter-clockwise closest convex node, and the clockwise home be the clockwise closest convex node. In Fig. 5.4(a), node 1 is the triangle home of triangle ABR, and node 11 is the clockwise home as well as the rendezvous node.

Let control points be the nodes where the skyline search is initiated and to which the computed results are returned, such as the triangle home, clockwise home, or rendezvous point.

5.3.2 Properties of Spatial Skyline Queries

There are 3 rules to determine spatial skylines used in the centralized spatial skyline algorithm which can be applied to DSS algorithms. Here is the relation between the query points $Q$ and a point in $P$ for it to be a spatial skyline (proved in [80]):

Theorem 2. A point $p \in P$ is a spatial skyline if $p$ is inside the query convex hull.

Theorem 3. A point $p \in P$ is a spatial skyline if the Voronoi cell of $p$ intersects with the boundaries of the query convex hull.

The third property follows from the definition of spatial skyline in Equation (5.1).

Definition 1. A point $p \in P$ is a spatial skyline if $p$ is not spatially dominated by any other node $p' \in P$.

5.3.3 Centralized Spatial Skyline Algorithm

The Voronoi-based Spatial Skyline (VS²) algorithm identified and proved the two theorems in Section 5.3.2 to determine the spatial skylines in a centralized setting [80].

VS² traverses the Delaunay graph, i.e. a graph connecting Voronoi neighbors, of the data points starting from the nearest neighbor of the query convex hull (which is a spatial skyline point), given that all the nodes are sorted in ascending order by mindist, which is defined as the sum of distances between a point and the query convex hull.
VS² maintains the rectangle corresponding to the dominator region of skylines (the MBR of the union of all circles connecting skylines and the query convex hull), which includes all the candidate skyline points. That is, the nodes inside this rectangle are a superset of the skylines. VS² also maintains two data structures in a heap to keep track of the traversal status of all the candidate points: one for a set of skylines and the other for their Voronoi neighbors. VS² inserts new candidates into the heap, and extracts them when the traversal of the candidate and its Voronoi neighbors is over. VS² stops traversing when the following condition is met:

Theorem 4. The spatial skyline search stops when none of the Voronoi neighbors of the current skylines has a new skyline.

During traversal, each point checks if it satisfies the two theorems of spatial skylines defined in Section 5.3.2. Nodes that are in the heap but do not satisfy these two properties use Definition 1 to verify whether they are spatial skylines. When all of the data points in the heap are checked and the heap is emptied, the algorithm is over.

5.3.4 Voronoi-based Geographic Flooding

[Figure 5.2: Voronoi-based geographic flooding]

Voronoi-based geographic flooding is a technique to multicast a packet to all the Voronoi neighbors. This service uses a combination of 1-hop radio broadcast and GPSR (Fig. 5.2). A node first does 1-hop radio transmissions reaching all the Voronoi neighbors within the radio transmission range. Thus we are able to send a packet to many Voronoi neighbors using a single transmission. Those Voronoi neighbors that are farther than the 1-hop transmission range are reached by multihop paths using GPSR. Voronoi-based geographic flooding can be used by DSS algorithms to communicate with all the nodes in a given geographical region.

5.3.5 Determination of Geometric Association

DSS uses the following techniques to compute the association of points and polygons:

[Figure 5.3: Techniques to determine the geometric association of a point. (a) Odd-parity rule: technique to determine if a point is inside or outside of a polygon. (b) Technique to determine if a point is inside or outside of an extended triangle.]

To determine if a point p lies in the interior of a polygon, we use the odd-parity rule, which draws a horizontal line from p to the right (Fig. 5.3(a)). To make sure that the endpoints are only counted once, we exclude two special cases from counting: 1) an edge lying along the ray, and 2) an edge ending on the ray. Barring these two special cases, the point is inside the polygon if this line intersects the line segments of the polygon an odd number of times [38].

A Voronoi cell intersects a query convex hull (or triangle) if any line segment of the Voronoi cell intersects any line segment of the query convex hull (or triangle).

A node p is within an extended triangle if p is enclosed between these two lines in the counter-clockwise and clockwise direction respectively: the line passing through the rendezvous point and the clockwise home node, and the line passing through the rendezvous point and the triangle home (Fig. 5.3(b)). We use the left-hand rule and right-hand rule from GPSR for this computation.

5.4 DSS Algorithms

In this section, we describe the distributed spatial skyline (DSS) algorithms. These algorithms partition the search space into two sets of nodes (primary partitioning): nodes inside or outside of the query convex hull. Nodes inside the query convex hull are guaranteed to be spatial skylines, while the nodes outside the query convex hull might not be skylines. Then the algorithms optionally partition those two sets of nodes by triangulation (secondary partitioning) to aid in the search for skylines. Fig. 5.4 presents the partitions made by the three DSS algorithms.

[Figure 5.4: Partitioning of three DSS algorithms. (a) TDSS performs primary and secondary partitioning where both the internal and external query convex hull are triangulated. (b) RDSS performs primary partitioning only. (c) TRDSS performs primary and secondary partitioning where only the external query convex hull is triangulated.]

We designed three DSS algorithms that use different partitioning strategies and control point placements: 1) Triangulation-based DSS (TDSS), 2) Rendezvous-based DSS (RDSS), and 3) Triangulation and Rendezvous-based DSS (TRDSS). These algorithms work in three phases: 1) initializing the skyline query, 2) searching internal skylines, and 3) searching external skylines. The second and third phases differ among these algorithms, as shown in Table 5.1.

In our algorithms, each node keeps track of information in its neighborhood as opposed to keeping track of global information. This design choice makes DSS algorithms more robust to node failures and more scalable.

The DSS algorithms described in the next three sections assume:

• Each node knows its geographical location by using a GPS receiver or running a well-known localization algorithm.
• Each node has computed a list of its Voronoi neighbors using a well-known distributed Voronoi computation algorithm [4].
• There are more than three query points, without the degenerate case where three points are positioned on a line.
• All the nodes are connected by a single spanning tree so that they are mutually routable.
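The primary partitioning described in this section relies on the point-in-polygon test of Section 5.3.5. The sketch below shows one common way to implement an odd-parity test and to split nodes into those inside and outside the query convex hull. It is an illustration under stated assumptions, not the implementation used in this work: the function names and the node dictionary are hypothetical, and the two special cases (an edge along the ray, an edge ending on the ray) are handled with the widely used half-open comparison rather than by explicit case analysis.

```python
from typing import Dict, List, Set, Tuple

Point = Tuple[float, float]


def inside_polygon(p: Point, polygon: List[Point]) -> bool:
    """Odd-parity rule: cast a horizontal ray from p to the right and
    toggle the result for every polygon edge the ray crosses."""
    x, y = p
    inside = False
    n = len(polygon)
    for i in range(n):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % n]
        # Half-open test: an edge counts only if its endpoints lie on
        # opposite sides of the ray's y. Horizontal edges never count,
        # and a vertex on the ray is counted for one of its edges only.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside
    return inside


def primary_partition(nodes: Dict[int, Point],
                      hull: List[Point]) -> Tuple[Set[int], Set[int]]:
    """Split node ids into (inside hull, outside hull)."""
    inside = {nid for nid, pos in nodes.items() if inside_polygon(pos, hull)}
    return inside, set(nodes) - inside
```

Nodes in the first set are skylines by Theorem 2; nodes in the second set still need the Voronoi-cell test of Theorem 3 or a dominance check. Points lying exactly on the hull boundary are not treated specially in this sketch.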
Algorithm | Primary partitioning | Secondary partitioning | Control points | Supports arbitrary synchronization point
TDSS | Yes | Triangulation of internal and external query convex hull | Triangle home, clockwise home, rendezvous node | Yes
RDSS | Yes | None | Rendezvous node | Yes
TRDSS | Yes | Triangulation of external query convex hull | Rendezvous node, triangle home, clockwise home | Yes

Table 5.1: Comparison of three DSS algorithms: TDSS, RDSS, and TRDSS. All algorithms use primary partitioning to classify the nodes as skylines or potential skylines, and all support arbitrary synchronization points. The algorithms differ in the way they perform the secondary partitioning to search for skylines and in which nodes can be control points.

5.4.1 TDSS: Triangulation-based DSS Algorithm

The TDSS algorithm operates in three phases.

Algorithm 4 Phase 1 of TDSS Algorithm
Require: Input nodes compose a single connected spanning tree, and Voronoi neighbors for each node are given.
1: Phase 1: Initialization
2: Each $CN_v(Q)$ sends a packet containing the corresponding query point to SP using GHT-based routing
3: if SP receives all the packets with the query points then
4:   SP computes $CH_v(Q)$ and R.
5:   SP sends a packet containing <R, clockwise sorted $CH_v(Q)$> to each triangle home using GHT-based routing
6: end if
7: Triangle home triggers phase 2 for internal skylines

Phase 1: Initializing distributed spatial skyline

The aim of phase 1 is to compute the query convex hull and the rendezvous point. To compute them, we need a list of all the query points. The node closest to each query point is responsible for reporting the coordinates of that query point to the synchronization point, where the query point locations from across the network are gathered. For example, nodes 0, 14, 2, 9, and 6 in Fig. 5.5(a) report the query points A, B, C, D, and E respectively to the synchronization point.
GHT-based routing is used to report the query point coordinates because a node might not exist at the synchronization point, in which case the node closest to the synchronization point (node 7 in Fig. 5.5(a)) receives the coordinates and computes the query convex hull and the rendezvous point. The rendezvous point can be an arbitrary location. For efficiency reasons, we compute the centroid of the query convex hull as the rendezvous point (R in Fig. 5.5(a)) so that the rendezvous point is approximately in the middle of the query convex hull, thereby reducing the communication cost in the later stages of the TDSS algorithm. We then send the clockwise-sorted query convex hull vertices (ABDC in Fig. 5.5(a)) and the rendezvous point to the node closest to each vertex of the query convex hull, which is a triangle home.

(a) Phase 1: Query initialization
(b) Phase 2: Search internal skylines
(c) Phase 3: Search external skylines

Figure 5.5: Three phases of TDSS algorithm. Synchronization point is shown as a star and skyline as a sunflower.

Phase 2: Searching internal skylines

The goal of phase 2 is to progressively search for the definite skylines, which are located within or near the query convex hull. These internal skylines ($S_i$) can be determined quickly, correctly, and efficiently because they can be computed locally, without a dominance check that may require communication with other nodes. TDSS further expedites this process by enabling a parallel search per triangle. The three vertices of such a triangle are the rendezvous point R and two clockwise-consecutive query convex points.
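The geometric quantities used in phases 1 and 2 (the clockwise-sorted query convex hull, its centroid used as the rendezvous point R, and the triangles formed by R and two clockwise-consecutive hull vertices) can be sketched as follows. This is a centralized illustration under stated assumptions, not the thesis implementation: the function names are hypothetical, the hull uses the standard monotone-chain method, and "centroid" is taken to mean the area centroid of the hull polygon, which the text does not specify.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    """Z component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull_clockwise(points: List[Point]) -> List[Point]:
    """Monotone-chain convex hull, returned in clockwise order."""
    pts = sorted(set(points))
    if len(pts) < 3:
        raise ValueError("need at least three distinct query points")
    lower: List[Point] = []
    upper: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    counter_clockwise = lower[:-1] + upper[:-1]
    return counter_clockwise[::-1]


def polygon_centroid(poly: List[Point]) -> Point:
    """Area centroid via the shoelace formula (assumed meaning of 'centroid')."""
    twice_area = cx = cy = 0.0
    n = len(poly)
    for i in range(n):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
        w = x1 * y2 - x2 * y1
        twice_area += w
        cx += (x1 + x2) * w
        cy += (y1 + y2) * w
    return (cx / (3.0 * twice_area), cy / (3.0 * twice_area))


def triangles(hull: List[Point], r: Point) -> List[Tuple[Point, Point, Point]]:
    """One triangle per hull edge: R plus two clockwise-consecutive vertices."""
    n = len(hull)
    return [(r, hull[i], hull[(i + 1) % n]) for i in range(n)]


if __name__ == "__main__":
    query_points = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 1)]  # illustrative
    hull = convex_hull_clockwise(query_points)
    r = polygon_centroid(hull)
    for t in triangles(hull, r):
        print(t)
```

In the distributed setting, only the synchronization point would run such a computation; the triangle homes receive R and the sorted hull vertices in a packet and derive their own triangle from them.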
Algorithm 5 Phase 2 of TDSS Algorithm
1: Triangle home adds itself to $S_i$, and identifies its T
2: for all (($p \in VN_v(S_i)$) AND ($p \neq$ visited)) do
3:   repeat
4:     Send a packet containing <T, $S_i$> to p using Voronoi-based geographic flooding
5:   until p does not have any Voronoi neighbor $x \in VN_v(p)$, where (x is inside T) OR ($VC_v(x)$ intersects the edge not adjacent to the R in T)
6:   When a node p receives a packet:
7:   if (($p \notin VN_v(S_i)$) OR (p is visited)) then
8:     Drop the packet
9:   else if ((p is inside T) OR ($VC_v(p)$ intersects the edge not adjacent to the R in T)) then
10:     p is an internal skyline node
11:     p reports this new skyline location of p to the triangle home
12:   end if
13: end for
14: When triangle home receives a skyline: triangle home aggregates received internal skylines from its T
15: When the timer fired, triangle home reports aggregated skylines to R, and triggers phase 3 for external skylines

When each triangle home p receives the query convex hull and the rendezvous point from the synchronization point, p automatically becomes a skyline because it satisfies the definition of a spatial skyline: $D(p, q_i) < D(p', q_i)$ for all $p' \in P$ ($p' \neq p$), given that p is the closest point to $q_i$. That is, a triangle home is the node closest to a query point $q_i$, so it is a skyline. Nodes 0, 14, 2, and 9 in Fig. 5.5(b) are the triangle homes and hence the first set of skylines. Each triangle home then sends the vertices of its triangle to its Voronoi neighbors using Voronoi-based geographic flooding (section 5.3.4) to search for other undiscovered internal skyline nodes, which are nodes that either 1) lie within the triangle or 2) have a Voronoi cell that intersects the edge not adjacent to the rendezvous point in the triangle (e.g. edge AB in the triangle ABR in Fig. 5.5(a)). This is because the internal skylines of TDSS are determined by the two rules in Theorem 2 and Theorem 3, in which the query convex hull is replaced by the triangle.
This change does not affect correctness (Theorem 7 in section 5.6.1), while it allows a more efficient search because each triangle home has a reduced search space (a triangle rather than the whole query convex hull). For example, in Fig. 5.5(b) triangle home 0 sends vertices A, B, and R to its Voronoi neighbors 7, 8, 6, and 16, asking them whether they are internal skylines. In this figure, node 9 is the triangle home of the triangle CAR but is located in a different extended triangle. We describe how TDSS handles this and other special cases in detail in section 5.4.1.1.

When a node determines that it is a skyline, it uses GPSR to send a message to the triangle home, informing the triangle home that the node is a new internal skyline. The triangle home collects all these notifications to construct a list of internal skylines for its triangle and sends this list to the rendezvous point. In Fig. 5.5(b), the triangle home 0 reports skylines {0,8}, 14 reports {14}, 2 reports {2,3}, and 9 reports {9,6} to the rendezvous point R. The rendezvous point now has all the internal skylines $S_i$ = {0,8,14,2,3,9,6}, which can be sent to the user to provide progressive results even though the external skylines are not identified until phase 3.

Phase 3: Searching external skylines

The aim of phase 3 is to search for the external skylines, $S_e$, which are the skylines located outside of the query convex hull. These external skylines can be determined by a dominance check against the internal skylines found in phase 2. Triangulation into extended triangles in TDSS expedites this process by enabling a parallel search.

Algorithm 6 Phase 3 of TDSS Algorithm
1: Each triangle home traverses skyline nodes serially by sending a pkt = <$CH_v(Q)$, T, clockwise home, skylines>
2: repeat
3:   for all ($q \in VN_v(p)$, where $p \in$ skylines) do
4:     if ($q \notin$ existing skylines) AND (q is located within ET) AND (q is not dominated by any skyline) then
5:       if ($q \in VN_v$(triangle home)) then
6:         p adds q into skylines, and does not send a pkt.
7:       else
8:         p adds q into skylines, and sends a pkt to q.
9:       end if
10:     else
11:       q reports the pkt to clockwise home.
12:     end if
13:     When q receives a packet:
14:     q performs dominance check over $VN_v(q)$.
15:     q reports this new skyline location of q to the clockwise home.
16:   end for
17: until $VN_v$(skylines) do not have any new skyline
18: When clockwise home receives a pkt: clockwise home aggregates received skylines from its ET
19: When the timer fired, clockwise home reports the aggregated skylines to R.
20: When R receives skylines: R aggregates all the skylines, and prunes the dominated skylines.

The triangle home starts the external skyline search by sending a packet containing the query convex hull, the triangle points, the clockwise home, and the internal skylines to the internal skyline nodes of the corresponding extended triangle. When this packet traverses a skyline node, that node checks whether any of its Voronoi neighbors is not spatially dominated by the skylines listed in the packet. If any of the neighbors is non-dominated, we have discovered a new external skyline. We put that new skyline in the packet and send it to the next skyline node in the list. This traversal continues until we cannot find any non-dominated neighbor of the skylines. At that point, the search for the external skylines is over and we send the final list of skylines to the clockwise home. Note that regardless of the location of the triangle home, even when the triangle home is located in another triangle, we perform the dominance check for all its Voronoi neighbors to guarantee the complete set of skylines (section 5.6.2).

Fig. 5.5(c) shows an example of the external skyline search. In this figure, the triangle home 9 checks its neighbors in its extended triangle for any node not spatially dominated by the internal skylines 9 and 6. Because node 5 is not spatially dominated by the nodes {9,6}, 5 is a new skyline. Node 9 now sends the skyline list {9,6,5} to node 6.
Node 6 discovers node 16 as a new skyline among its neighbors and sends {9,6,5,16} to 5. None of node 5's neighbors is non-dominated by the list of skylines in the packet, so this branch of the external skyline search has ended. Node 5 sends this final list {9,6,5,16} to the clockwise home node 0. The external skyline search is not over yet, however. TDSS traverses skyline 16 and checks dominance over its Voronoi neighbors. None of 16's neighbors is a new skyline, and the external skyline search is finally over because TDSS has traversed all the skyline nodes and performed the dominance check over all of their Voronoi neighbors. Node 16 also sends its final list of skylines {9,6,5,16} to the clockwise home node 0.

When the clockwise home receives all the skylines, it checks whether the clockwise home node itself spatially dominates any of the received external skylines. If any node is dominated by the clockwise home node, we delete that node from the list of skylines. This pruning is necessary because there is a region in each extended triangle that can be dominated by the clockwise home node. Each triangle does not have a complete list of skylines; it only knows the skylines discovered within its extended triangle. A dominance check with this incomplete list of skylines can incorrectly conclude that some non-skyline nodes, especially the ones close to the clockwise home node, are skylines. The clockwise home aggregates the skylines while eliminating any duplicate skylines received.

(a) Case 2
(b) Case 4
(c) Case 6

Figure 5.6: Some special cases. See section 5.4.1.1 for details.

In Fig. 5.5(c), the clockwise homes 0, 14, 2, and 9 report the aggregated skylines {0,8}, {14}, {2,3}, and {9,6,5,16} respectively to the rendezvous point.
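The dominance check that drives the external skyline search, and the pruning applied afterwards, can be sketched as follows. This is a centralized simulation under assumptions, not the distributed protocol: it uses the usual definition of spatial dominance (p is no farther than q from every query point and strictly closer to at least one), it ignores the extended-triangle restriction and all message passing, and the names `dominates`, `external_search`, and `prune` as well as the toy topology are hypothetical.

```python
from math import dist
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def dominates(p: Point, q: Point, query_points: List[Point]) -> bool:
    """True if p spatially dominates q with respect to the query points."""
    dp = [dist(p, c) for c in query_points]
    dq = [dist(q, c) for c in query_points]
    no_farther = all(a <= b for a, b in zip(dp, dq))
    strictly_closer = any(a < b for a, b in zip(dp, dq))
    return no_farther and strictly_closer


def external_search(
    internal: List[str],
    voronoi_neighbors: Dict[str, List[str]],
    positions: Dict[str, Point],
    query_points: List[Point],
) -> List[str]:
    """Expand outward from the internal skylines: a Voronoi neighbor joins
    the skyline list if no skyline found so far dominates it. The search
    stops once no visited skyline has a non-dominated neighbor."""
    skylines = list(internal)
    frontier = list(internal)
    while frontier:
        s = frontier.pop(0)
        for q in voronoi_neighbors[s]:
            if q in skylines:
                continue
            if not any(
                dominates(positions[x], positions[q], query_points)
                for x in skylines
            ):
                skylines.append(q)
                frontier.append(q)
    return skylines


def prune(candidates: List[Point], query_points: List[Point]) -> List[Point]:
    """Final verification: drop every candidate dominated by another one."""
    return [
        p
        for p in candidates
        if not any(dominates(o, p, query_points) for o in candidates if o != p)
    ]


if __name__ == "__main__":
    query_points = [(0, 0), (4, 0), (2, 3)]            # illustrative
    positions = {"a": (2, 1), "b": (2, -2), "c": (-1, 0), "d": (-3, 0)}
    neighbors = {"a": ["b", "c"], "b": ["a", "d"],
                 "c": ["a", "d"], "d": ["b", "c"]}
    print(external_search(["a"], neighbors, positions, query_points))
```

In the toy topology, node a plays the role of an internal skyline; c is added because a does not dominate it, while b and d are rejected, so the search ends after visiting c.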
If the list of skylines gets too big to fit in a single packet, we use the fragmentation service to split the packet before transmission and reassemble the fragments upon reception. When the rendezvous point receives the packets containing external skylines, it aggregates them into a list {0,8,14,2,3,9,6,5,16} and performs the final verification, which prunes any spatially dominated points (a rare occurrence at this stage) from the skyline list. Thus, the final skyline points in this example are {0,8,14,2,3,9,6,5,16}.

5.4.1.1 Handling Special Cases

Beyond the algorithm description in subsection 5.4.1, the TDSS algorithm has to deal with more complicated scenarios in arbitrary network topologies. Many exceptional geometric cases generated by triangulation affect the execution of TDSS and TRDSS (section 5.4.3).

Case 1: Triangle home located in another triangle. We define the triangle home to be the node closest to the query convex point. This node can lie inside or outside of its own triangle. Similarly, the clockwise home can be inside or outside of its own triangle. In Fig. 5.5(b), node 9 is the triangle home for the triangle CAR but is located in the extended triangle DCR. Node 14 is the triangle home for the triangle ABR but is located in the triangle BDR. To handle this case, in phase 2 we remove node 9 from the search space for internal skylines in the triangle DCR.

Case 2: A skyline node is a Voronoi neighbor only of the triangle home in another triangle. Sometimes a part of the topology satisfies these two conditions: 1) a skyline node is not a Voronoi neighbor of any skyline in its own triangle, and 2) that skyline node is a Voronoi neighbor of skylines in another triangle. In that case, neither TDSS nor TRDSS can find that skyline, because in phase 3 those algorithms only search for a new skyline among the Voronoi neighbors of skylines in the same triangle.
To handle this case, we include both the triangle home and its Voronoi neighbors in the triangle to which the triangle home belongs, so that the algorithms can find a skyline that is located in the adjacent triangle. To get rid of duplicate skylines in the adjacent triangle, we exclude the clockwise home and the Voronoi neighbors of the clockwise home from the new external skylines. In Fig. 5.6(a), the skyline node 8 is not a Voronoi neighbor of skyline 0, the only skyline node in triangle BCR, but it is a Voronoi neighbor of skyline 4 of triangle CDR. Thus, neither TDSS nor TRDSS can find the skyline node 8, because they do not search for a skyline among the Voronoi neighbors of skylines in a different triangle. In this case we include Voronoi neighbor 8 and triangle home 4 in the same triangle, so that we can reach node 8 through the skyline node 4 while searching for external skylines in phase 3.

Case 3: Empty triangle and extended triangle. In a topology, especially in a sparse network, there can be triangles or extended triangles without any nodes. Because the triangle home waits for the other internal skylines to report before initiating phase 3, TDSS would wait forever in such a sparse topology. We handle this case by having the triangle home initiate phase 3, and notify the rendezvous point, after a certain timeout interval. The triangle home 14 in Fig. 5.5(b) is an example that has neither other internal skylines nor external skylines. Having heard from no other internal skyline node by the end of the timeout interval, node 14 determines itself to be the only internal skyline, notifies the rendezvous point, and initiates phase 3.

Case 4: No radio connectivity within a triangle. Within a triangle, there can sometimes be two nodes that are neither radio neighbors nor Voronoi neighbors, and the communication path between them traverses an adjacent triangle. In this case, the Voronoi-based geographic flooding cannot span all the nodes in the triangle.
To handle this case, each node that does not receive the Voronoi-based geographic flooding from its own triangle within a timeout interval transmits its location as an internal skyline to any reachable triangle home that it has learned of from packets in its neighborhood. Even though the skyline is from a different triangle, that triangle home forwards it to the rendezvous point at the end of phase 2. In Fig. 5.6(b), nodes 1 and 2 cannot communicate directly, nor are they Voronoi neighbors. Thus, node 1 cannot reach node 2 using the Voronoi-based geographic flooding. But node 2 can overhear the packets transmitted by node 4 and learn about their triangle home. After a timeout interval, node 2, not having received any message from its own triangle, reports itself to the triangle home 12 (node 4's triangle home). Even when multiple nodes are unreachable from the triangle home, at least one of them receives a packet either 1) from a reachable node in another triangle or 2) from other nodes in the same triangle that start Voronoi-based geographic flooding after learning from a packet received from another triangle. Before initiating the external skyline search, the rendezvous node forwards the skylines received from the other triangle home to the triangle home to which they belong.

Case 5: Multiple roles in a single node. A single node can be the node closest to at least two of the synchronization point, the rendezvous point, and the triangle vertices. In a network with such a topology, the GHT-based routing routes packets to the node closest to each of those destinations, which in this case is the same node. We handle these cases by enabling nodes to play the appropriate multiple roles depending on the phase and the messages received.

Case 6: A single node part of multiple triangles. There can be a case where a single node is located in multiple adjacent triangles.
For example, the triangle home for one triangle might be the triangle home for an adjacent triangle as well, because the same node can be closest to the counter-clockwise query convex point of two different triangles. Furthermore, the triangle home for one triangle can be the triangle home for another, non-adjacent triangle as well. In this case, the same node receives messages corresponding to two different triangles. To handle this case, we have the node participate in the skyline computation in both triangles sequentially. In Fig. 5.6(c), node 1 is the triangle home for both triangles ABR and CDR, which are non-adjacent triangles.

(a) Case 7
(b) Case 8 and 9
(c) Case 10

Figure 5.7: Some special cases. See section 5.4.1.1 for details.

Case 7: Common triangle home and clockwise home for adjacent triangles. There can be a case where the triangle home and the clockwise home of a triangle are the same node. Furthermore, the same node might also be the triangle home and clockwise home in adjacent triangles. This causes a problem because the triangle home must include itself in the list of skylines while a clockwise home must be excluded from the list of skylines. To handle this case, we iteratively merge two triangles in the clockwise direction until we are adjacent to a triangle that has a new triangle home. We stop this merge sooner if the series of merges sweeps 180 degrees in the clockwise direction. In Fig. 5.7(a), node 4 is the triangle home for the triangles BCR, CDR, and DER, and the clockwise home for the triangles BCR and CDR. We first merge BCR and CDR, making a new triangle BDR. We then merge triangles BDR and DER, making a new triangle BER. The clockwise adjacent triangle EFR has a different triangle home, 11, so the merging algorithm terminates.

Case 8: Voronoi cell intersects the other triangle. In Fig. 5.7(b), the Voronoi cell of node 7, which belongs to the extended triangle DER, intersects its clockwise triangle EAR.
In this case TDSS cannot find the internal skyline 7 within its extended triangle. To handle this case, if the Voronoi cell of a node does not intersect its own triangle, we also check whether it intersects the query convex hull. Thus, node 7 can be found within its own extended triangle DER.

Case 9: Voronoi cell encompasses its triangle. In Fig. 5.7(b), the Voronoi cell of node 2 encompasses its triangle ABR. We fix this problem in the same way as case 8: if the Voronoi cell of a node does not intersect its own triangle, we check whether it intersects the query convex hull. Thus, node 2 can be found within its own extended triangle.

Case 10: Rendezvous node is outside of the query convex hull and does not intersect it. Although the rendezvous node, the node closest to the rendezvous point, is an internal skyline, TDSS sometimes cannot find this skyline in phase 2 because this node is located outside of its triangle (or the query convex hull), or its Voronoi cell does not intersect its triangle (or the query convex hull). To handle this case, we explicitly add the rendezvous node to the internal skylines (node 4 in Fig. 5.7(c)).

5.4.2 RDSS: Rendezvous-based DSS Algorithm

RDSS performs only primary partitioning, and the skyline query is initiated and its results gathered by the rendezvous point for both internal and external skylines.

Phase 1: The aim and process of phase 1 are the same as in TDSS, except that the node closest to the synchronization point (node 7 in Fig. 5.5(a)) sends the query convex hull to the node closest to the rendezvous point (node 14 in Fig. 5.5(b)) in order to trigger phase 2.

Phase 2: The purpose of phase 2 is to progressively search for the definite skylines. Unlike TDSS, RDSS does not triangulate the inside of the query convex hull. When the rendezvous node receives the query convex hull, it becomes a skyline because it is located at the centroid of the query convex hull.
The rendezvous node then starts to search for the other internal skylines locally using the two properties in section 5.3.2: a node is a skyline 1) if it is inside the query convex hull or 2) if its Voronoi cell intersects the query convex hull. Each discovered skyline node sends a message to the rendezvous point announcing that it is a new skyline. When the rendezvous point has received all these notifications, it triggers phase 3. In Fig. 5.5(b), the rendezvous node 14 collects the internal skylines, $S_i$ = {14,3,6,2,9,0,8}.

Phase 3: The aim of phase 3 is to search for the external skylines located outside of the query convex hull. Determining external skylines requires a dominance check against the internal skylines found in phase 2. Unlike in TDSS, the rendezvous node initiates the external search and progressively receives all the newly found external skylines. Because RDSS does not triangulate the outside of the query convex hull, it does not require a dominance check over radio neighbors in order to provide correct skylines (recall that TDSS requires a dominance check for both Voronoi and radio neighbors), but a dominance check over radio neighbors expedites the external skyline search. When there are no more new skylines among the Voronoi neighbors of the skyline nodes, the search process ends and the last node sends the list of skylines to the rendezvous point. The rendezvous point then performs the final verification to prune out any dominated points. In Fig. 5.5(c), the rendezvous node 14 collects the external skylines, $S_e$ = {14,3,6,2,9,0,8,5,16}.

5.4.3 TRDSS: Triangulation and Rendezvous-based DSS Algorithm

TRDSS is a hybrid of TDSS and RDSS. The internal skylines are gathered by the rendezvous point without triangulation, but the external skyline search is initiated by the triangle homes and its results are received by the clockwise homes in parallel, using triangulation.

Phase 1: The purpose and process of phase 1 are the same as in RDSS.

Phase 2: TRDSS searches for internal skylines in phase 2 in the same way as RDSS.
The only difference from RDSS is the transition from the RDSS-style phase 2 to the TDSS-style phase 3. To make this transition, TRDSS splits the internal skylines into subsets by corresponding triangle after the rendezvous point has received all the internal skylines. The rendezvous point then sends each triangle home a packet containing the skylines corresponding to its triangle in order to trigger phase 3.

Phase 3: The purpose and process of phase 3 are the same as in TDSS.

5.5 Optimizations

1. Aggregation at triangle home and clockwise home: One of the design goals and challenges of the DSS algorithms is to make them progressive as well as efficient. To be progressive, our DSS algorithms report the internal skylines to the rendezvous point first. However, to be efficient at the same time, the triangle home does not report the internal skylines every time it discovers a new skyline. Instead, the triangle home waits to receive messages from all the internal skylines, constructs a single list containing all the internal skylines, and sends this list to the rendezvous point at the end of phase 2. Similarly, the clockwise home node waits until the end of phase 3, collecting messages from external skyline nodes, constructs a single list of external skylines, and sends the list to the rendezvous point. These techniques make the algorithm less progressive but more efficient.

2. Implicit synchronization of algorithm stages: Instead of sending explicit messages instructing the nodes to start or end a particular stage or phase of the algorithm, we use timeouts to move the algorithm from one phase to the next. These timeouts are chosen carefully to ensure that all the nodes that need to make a transition to the next phase are allowed enough time to complete the previous stage of the algorithm. This approach avoids the communication inherent in an approach that uses explicit synchronization messages.

3. Dominance check over all the radio neighbors in phase 3: All DSS algorithms normally perform the dominance check over Voronoi neighbors in phase 3. Note that each node knows its radio neighbors in addition to its Voronoi neighbors. Taking advantage of this information, we can perform the dominance check over all radio neighbors in addition to the Voronoi neighbors. Because each node now searches through a larger number of nodes, we can search through the entire triangle faster and with fewer transmissions. Note that these extra checks do not require extra message transmissions because we are simply using the neighborhood information from the underlying routing protocol.

4. Reducing the redundant skylines: Skylines among the Voronoi neighbors of a triangle home are discovered during the search of the triangle home's extended triangle in phase 3. When these neighboring skyline nodes are located in the adjacent triangle, we can exclude them from the search space in the triangle in which they are located. Thus, we avoid redundant skylines in the intermediate results.

5. Early termination of search when no new skylines: All DSS algorithms terminate as soon as they are unable to find a new skyline when each node looks for a new skyline among its Voronoi neighbors. Thus we can stop the search even before we have swept the entire triangle or extended triangle, thereby avoiding unnecessary search while maintaining the correctness and completeness of the skylines the algorithm finds.

5.6 Correctness and Completeness of TDSS

5.6.1 Correctness of TDSS Algorithm

Lemma 5. Any two nodes inside of the query convex hull do not spatially dominate each other.

Proof. Proof by contradiction. Let p and q be two nodes inside of the query convex hull. Assume that p spatially dominates q. Let AB be the perpendicular bisector line of pq. Because p dominates q, all the query convex points $CH_v(Q)$ should be located on the same half plane of the line AB where p is located.
This is impossible because both p and q are located inside of the query convex hull.

Corollary 6. (Mutual Exclusiveness of TDSS) A node inside of a triangle does not spatially dominate another node inside of another triangle.

Proof. Let p and q be two nodes in two different triangles (constructed by TDSS) inside the query convex hull. By Lemma 5, p and q, being two nodes inside the query convex hull, cannot dominate each other.

(a) Correctness of TDSS
(b) Early Termination

Figure 5.8: Example figures for the analysis of TDSS.

Theorem 7. (Correctness of TDSS) TDSS is guaranteed to provide the correct skylines.

Proof. Let q in Fig. 5.8(a) be the skyline determined by TDSS. q can be either in the triangle ABC or in $B'BCC'$.

• If q is in the triangle ABC, TDSS is correct because q is a skyline by definition (it is inside the query convex hull).

• If q is in $B'BCC'$: Assume that TDSS determined this incorrectly, that is, q is not a skyline. Let $q'$ be the skyline that dominates q.

– If triangle ABC is not empty:
Case 1) $q'$ is in ABC: There cannot exist such a $q'$ which dominates q, because when we run TDSS, it performs the dominance check between q and $q'$ and concludes that q is a spatial skyline. That is, q dominates $q'$. Thus, it is impossible for a $q'$ that dominates q to exist in triangle ABC (inside the query convex hull).
Case 2) $q'$ is in $B'BCC'$: Similar to the above, it is impossible for a $q'$ that dominates q to be located in $B'BCC'$ if we perform the dominance check by running TDSS.

– If triangle ABC is empty:
Case 1) $q'$ is a skyline in a different triangle: $q'$ dominates q, but checking within the same triangle is insufficient to detect this. However, TDSS performs a dominance check across the skylines reported from the different extended triangles at the rendezvous point and filters out the false positives. The result is thus correct after this check at the rendezvous point.
Case 2) $q'$ is in $B'BCC'$: It is impossible for a $q'$ that dominates q to be located in $B'BCC'$ if we perform the dominance check by running TDSS.

Thus, the conclusion of TDSS is correct.

5.6.2 Completeness of TDSS Algorithm

Lemma 8, Lemma 9, Theorem 10, and Lemma 11 prove the completeness of the TDSS algorithm. The intuition behind the proof of completeness of TDSS is that the marginal area of skylines dominance-checked by TDSS encompasses the marginal area of skylines dominance-checked by the centralized algorithm.

Lemma 8. (Early Termination of Skyline Search) TDSS is able to terminate the skyline search as soon as none of the Voronoi neighbors of the skylines are skylines, while ensuring the completeness of the results.

Proof. Proof by induction.

Base case: The base case follows from the fact that p, the triangle home, is the only skyline in triangle ABC in Fig. 5.8(b). After the internal skyline search in phase 2, the triangle home p starts to search for the external skylines within its extended triangle $AB'C'$. The skyline p finds a new skyline q among its Voronoi neighbors within $AB'C'$, where q is not spatially dominated by p. Then skylines {p,q} find a new skyline r among their Voronoi neighbors, where r is not spatially dominated by {p,q}, and so on.

Inductive Hypothesis: Because the external skyline search by TDSS expands away from the rendezvous point in the extended triangle $AB'C'$, this search must encounter a skyline s such that all the Voronoi neighbors of s located within $AB'C'$ are spatially dominated by the skylines already found within the extended triangle $AB'C'$, {p,q,r,s}, which terminates the search. Suppose one of $VN_v(s)$, u, is a new skyline. However, if we draw the perpendicular bisector line of the segment us, u is located on the opposite half plane from $CH_v(Q)$, while s is located on the same half plane as $CH_v(Q)$. This is a contradiction, because u is spatially dominated by s, so u cannot be a new skyline.
If none of the non-visited $VN_v(s)$ is a new skyline, none of the non-visited Voronoi neighbors of the non-visited $VN_v(s)$ can be a new skyline either, because they are also located on the opposite half plane from $CH_v(Q)$ with respect to the perpendicular bisector lines of the segments between $VN_v(s)$ and s. Hence, the external skyline search by TDSS terminates as soon as the search encounters skyline nodes such that all the non-visited Voronoi neighbors of the skylines within a particular extended triangle are spatially dominated by the accumulated skylines from that extended triangle.

Lemma 8 also holds for TRDSS, because TRDSS also partitions the external query convex hull and searches within the extended triangle in the same way as TDSS. The RDSS algorithm is also able to terminate early because its dominance check for the skyline search expands away from the rendezvous point in the same way as in TDSS, only without partitioning.

To prove that TDSS provides a complete list of skylines, we first prove that there is no blind spot where TDSS fails to perform a dominance check while a centralized algorithm would successfully perform one. We prove this in Lemma 9 by dividing nodes into two sets: border nodes and inside triangle nodes. To simplify the presentation of this proof, we introduce the term second Voronoi hop nodes: the set of nodes consisting of the neighbors of the Voronoi neighbors of a given node p, excluding all the Voronoi neighbors of p.

Lemma 9. (Marginal Area for the Superset of Skylines) The dominance check by TDSS is complete; the union of the marginal areas of skylines dominance-checked by TDSS per extended triangle is the marginal area for the superset of the complete skylines.

Proof.

• Border area (nodes whose Voronoi cell intersects an extended triangle line): In phase 3, TDSS expands the marginal area in which it performs the dominance check by traversing the already found skyline nodes, visiting each newly found skyline node (if any), and performing the dominance check for its Voronoi neighbors.
Starting from the triangle home node t in Fig. 5.9(a), if t finds a new skyline p among its Voronoi neighbors p and r, TDSS always visits p and performs dominance check for p’s Voronoi neighbors q and r, where q is a second Voronoi hop node from the triangle home. That is, TDSS performs dominance check up to two Voronoi hops (at the Voronoi neighbors by utilizing the node’s local information) from the triangle home. In the border area, however, there can be no new skyline at or farther than the second Voronoi hop nodes from the triangle home. To be a spatial skyline in the border area, the second Voronoi hop node q should dominate other nodes over the single dimensional distance. That is, q should be closer to the closest query convex point A than any other node in the border area. But this is not possible in the border area because there is always a skyline, the triangle home t, which dominates other nodes in the border area and therefore dominates q. In border areas, skylines can exist up to the Voronoi neighbors of the triangle home because those Voronoi neighbors perform the dominance check over 2-dimensional distance: they check whether a node is not dominated in the distance either to A and B (clockwise query convex point) or to A and C (counterclockwise query convex point). Because TDSS performs the dominance check up to the second Voronoi hop nodes of the triangle home, the marginal area performing dominance check by TDSS in the border area is complete.

• Inside triangle nodes (nodes not in a border area):
Case 1) If there is no empty triangle: TDSS gathers only the exact set of skylines per extended triangle because the internal and external skylines from each extended triangle are mutually exclusive (Corollary 6), and the search terminates while providing the complete skylines within its own extended triangle (Lemma 8).
Thus, for inside triangle nodes, the marginal area performing dominance check is the Voronoi neighbors of the exact set of skylines.

Figure 5.9: Example figures for the completeness of TDSS. (a) Marginal area performing dominance check is Voronoi neighbors of the superset of skylines. (b) An example that handles special case 2.

Case 2) If there is an empty triangle: TDSS can gather the superset of skylines from the extended triangle where its triangle is empty; TDSS delivers false positives temporarily in phase 3, and the search terminates while ensuring the completeness of the results (Lemma 8). Thus, for inside triangle nodes, the marginal area performing dominance check is the Voronoi neighbors of the superset of skylines, so there is no false negative such as k in Fig. 5.9(a) due to the marginal area for the subset of skylines in any extended triangle.

Thus, the union of the marginal areas of skylines dominance-checked by TDSS from all the extended triangles is the marginal area for the superset of complete skylines.

Theorem 10 proves that TDSS results in no false negatives and thus returns the complete spatial skylines.

Theorem 10. (Completeness of TDSS) TDSS is guaranteed to provide all the spatial skylines and does not miss any skyline.

Proof. There are many special cases that arise due to wireless topologies and the partitioning algorithm used by TDSS. TDSS handles all those special cases to compute the complete set of spatial skylines.

Handles special case 5. If there are not enough nodes in a triangle to assign the roles of triangle home, clockwise home, and rendezvous node, a single node performs multiple roles depending on the phase of the algorithm.

• Border area (nodes whose Voronoi cell intersects the extended triangle line):
Handles special case 1.
If a triangle home is located in another triangle or extended triangle, TDSS explicitly includes that triangle home in its clockwise triangle and excludes that node from the current triangle. This reduces the redundant skylines and the size of the packet while providing the complete skylines.

Handles special case 2. If there is a skyline node which is the only Voronoi neighbor of the triangle home in another triangle, TDSS includes both the triangle home and its Voronoi neighbors in the triangle to which the triangle home belongs. Thus, TDSS can search the skyline for special case 2, where the skyline is located in an adjacent triangle. TDSS determines the triangle home a in the triangle ACD in Fig. 5.9(b) as a skyline because it is the nearest neighbor of the query convex point C. Then TDSS searches all the internal skylines within its triangle ACD using the rules in Theorem 2 and Theorem 3. Note that all the Voronoi neighbors of the triangle home belong to the same triangle as the triangle home (case 2 in section 5.4.1.1). If TDSS finds a new skyline p among the Voronoi neighbors of the triangle home a, then TDSS checks the non-visited Voronoi neighbors of p, VN_v(p), for new skylines. However, none of q ∈ VN_v(p) is a new spatial skyline because 1) if we draw the perpendicular bisectors of the segments pq, none of the non-visited Voronoi neighbors q is located on the same half plane as CH_v(Q), or 2) q is dominated by skylines from the adjacent triangle ABC where q is located. Thus, TDSS completes the skyline search without missing any skyline due to special case 2.

Handles special case 6. If a single node is a part of multiple non-adjacent triangles, TDSS makes the node participate in skyline computation in those non-adjacent triangles sequentially, and TDSS provides the complete skylines.

Handles special case 7.
If there is a common triangle home and clockwise home for adjacent triangles, TDSS iteratively merges two triangles in the clockwise direction until there is a triangle that has a new triangle home. TDSS still provides the complete skylines.

• Inside triangle nodes (nodes not in a border area):
  – With empty triangle:
  Handles special case 3. If there is an empty triangle, TDSS lets the triangle home initiate phase 3 and also notify the rendezvous point of itself after a certain timeout interval. This prevents TDSS from waiting indefinitely for the other internal skylines, and enables the collection of the complete set of skylines regardless of existing empty triangles.
  – Without empty triangle:
    ∗ With radio connectivity within a triangle: TDSS collects the complete set of skylines without false negatives by handling the following special cases.
    Handles special case 8. If the Voronoi cell of a node intersects any other triangle, TDSS cannot find that internal skyline node within its extended triangle. TDSS handles this case and provides the complete skylines by checking whether the Voronoi cell of a node intersects the query convex hull if it does not intersect its own triangle.
    Handles special case 9. If the Voronoi cell of a node completely surrounds its triangle, TDSS cannot find that internal skyline node within its extended triangle. TDSS handles this case and provides complete skylines by checking whether the Voronoi cell of a node intersects the query convex hull if it does not intersect its own triangle.
    Handles special case 10. If a rendezvous node, the node closest to the rendezvous point, is outside of the query convex hull and does not intersect the query convex hull, TDSS cannot find that rendezvous node as an internal skyline within its extended triangle. TDSS handles this case and provides complete skylines by explicitly adding the rendezvous node to the internal skylines.
    ∗ Without radio connectivity within a triangle:
    Handles special case 4.
If nodes within a triangle are connected only through the nodes in the other triangles, the Voronoi-based geographic flooding cannot span all the nodes in the triangle. TDSS can still provide complete skylines by enabling each node that does not receive the Voronoi-based geographic flooding from its own triangle within a timeout interval to transmit its location as an internal skyline to any reachable triangle home learned from the packets in the neighborhood. Although there may be multiple nodes unreachable from the triangle home, at least one of them receives a packet either from a reachable node in another triangle or from other nodes in the same triangle that transmit the Voronoi-based geographic flooding, which is triggered after learning from the packet received from the other triangles. The rendezvous node forwards the skylines received from the other triangle home to the correct triangle home, to which they belong, before initiating the external skyline search.

• With empty extended triangle:
Handles special case 3. If there is an empty extended triangle, TDSS lets the triangle home notify the rendezvous node of itself after a certain timeout interval. This prevents TDSS from waiting indefinitely for other external skylines. Thus, TDSS gathers the complete set of skylines from the empty extended triangle.

• Without empty extended triangle:
  – Without empty triangle:
    ∗ With radio connectivity within an extended triangle: TDSS provides the exact set of skylines (Lemma 8 and Lemma 9) from this extended triangle.
    ∗ Without radio connectivity within an extended triangle:
    Handles special case 4. If nodes within an extended triangle are connected only through the nodes in the other extended triangles, TDSS still provides the complete skylines.
Note that all the triangle homes have a complete list of internal skylines; in case there is no radio connectivity within the triangle, the rendezvous node forwards the skylines discovered from the other triangle to the correct triangle home, to which they belong, before starting the external search. Thus, TDSS is able to visit all the internal skylines to perform dominance check for their Voronoi neighbors by using GPSR, which delivers a packet to the destination node in the same extended triangle by multi-hop transmissions through the nodes in the other extended triangle. Therefore, TDSS gathers the exact set of skylines from the extended triangle as in Lemma 9.
  – With empty triangle:
    ∗ With radio connectivity within an extended triangle: By utilizing GPSR, which guarantees multi-hop delivery to the destination node (in the same extended triangle), TDSS can provide the superset of skylines by delivering false positives temporarily in phase 3 from this extended triangle as in Lemma 9.
    ∗ Without radio connectivity within an extended triangle:
    Handles special case 4. If nodes within an extended triangle are connected only through the nodes in the other extended triangles, TDSS still provides the complete skylines. Because only the triangle home is an internal skyline due to the empty triangle, TDSS starts the external search from the triangle home by sweeping outside of the triangle. For the external search, TDSS uses GPSR, which guarantees packet delivery to the destination node (in the same extended triangle) by multi-hop transmissions. Thus, TDSS can gather the superset of skylines from this extended triangle by delivering false positives temporarily in phase 3 as in Lemma 9.

Thus, we can conclude with the following lemma, which states that TDSS gathers the superset of skylines.

Lemma 11. (Superset of Skylines) The union of skylines gathered from all the extended triangles is the superset of complete skylines.

Proof.
Case 1) If there is no empty triangle: TDSS gathers only the exact set of skylines per extended triangle because the internal and external skylines from each extended triangle are mutually exclusive (Lemma 6), and the dominance check within its own extended triangle is sufficient to provide complete skylines.
Case 2) If there is an empty triangle: TDSS can gather the superset of skylines temporarily in phase 3 from the extended triangles whose triangle is empty, and TDSS gathers the correct set of skylines from the other extended triangles in phase 3.
Thus, the union of skylines gathered by TDSS from all the extended triangles is the superset of complete skylines.

5.6.3 Message Cost

Let n be the number of nodes in the network, Q be the set of query points, m be MBR(Q) where m ≤ 1, and s be the number of skylines. The message cost of the centralized algorithm, which uses a TAG-like data gathering tree, is O(n^2) when all the nodes are in a single line. The best case message cost of the centralized algorithm is O(n) when all the nodes are within a single hop. The message cost of all the DSS algorithms is O(n^3):

O(n^3) = (mn + s) × O(G)
       = O(message cost of GHT) + mn × O(G) + s × O(G)
       = O(phase 1 message cost) + O(phase 2 message cost) + O(phase 3 message cost)

where G is the message cost of geographic routing. This is because O(message cost of GHT) = O(G). The message cost of GPSR, the geographic routing technique that we used in this work, is O(n^2), in which greedy routing takes n and face routing takes O(n^2). In the DSS algorithms, this O(n^3) is dominated by the message cost of GPSR, the geographic routing protocol used, which is O(n^2). This can be improved if we use a more efficient geographic routing protocol such as GOAFR [51], which is the best known geographic routing algorithm. The message cost of GOAFR is l, where l is the optimal path length. The remaining O(n) depends on which dominates: the number of nodes inside the query convex hull or the number of skylines.
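As a rough illustration of how the three terms of this bound combine, the following sketch evaluates the worst-case expression numerically. It is not part of the DSS implementation: the function name and its parameters are hypothetical, constant factors are dropped, and it assumes that phase 1 costs one geographic-routing traversal G, that phase 2 costs mn traversals, that phase 3 costs s traversals, and that G defaults to n^2 (the GPSR worst case stated above).

```python
def worst_case_message_cost(n, mbr_fraction, num_skylines, routing_cost=None):
    """Evaluate the worst-case bound G + m*n*G + s*G (constants dropped).

    n             -- number of nodes in the network
    mbr_fraction  -- m, the fraction of the network covered by MBR(Q), m <= 1
    num_skylines  -- s, the number of skylines
    routing_cost  -- G, cost of one geographic-routing delivery;
                     assumed to be n**2 (GPSR worst case) when omitted
    """
    if routing_cost is None:
        routing_cost = n ** 2  # assumption: face routing worst case

    phase1 = routing_cost                      # GHT lookup, one routed message
    phase2 = mbr_fraction * n * routing_cost   # nodes inside the query region
    phase3 = num_skylines * routing_cost       # one routed visit per skyline
    return {
        "phase1": phase1,
        "phase2": phase2,
        "phase3": phase3,
        "total": phase1 + phase2 + phase3,
    }


if __name__ == "__main__":
    # Hypothetical inputs: 100 nodes, MBR(Q) = 25%, 10 skylines.
    print(worst_case_message_cost(100, 0.25, 10))
```

These values are upper bounds in abstract message units, not predictions of the measured number of transmissions; passing a smaller `routing_cost` shows how a cheaper routing protocol would shrink every term.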
Note that this is the analysis for the worst case scenario. Most of the time, all DSS algorithms are much more efficient than the worst case, because they send GPSR packets to Voronoi neighbors, the triangle home, or the rendezvous node, which are close in distance; this results in short routing paths and frequent use of efficient greedy routing. In reality, both mn and s are much smaller than n, because we often search the spatial skyline for only a part of the entire network, where m = MBR(Q) << 1 and s << n, so the size of the skyline is much less than the size of the entire network. We show the average case message cost using simulation in section 5.7.

5.7 Performance Evaluation

In this section, we describe the evaluation metrics, experimental setup, and experimental results.

Figure 5.10: Software architecture of DSS over GPSR

5.7.1 Metrics and Experimental Setup

We evaluate the three proposed DSS algorithms against the centralized algorithm running over a TAG [60]-like data gathering system using these metrics:
• Number of transmissions (ntx) estimates the energy cost of these algorithms in a WSN. This approach is reasonable because radio transmissions consume far more energy than any other operation in a sensor node. We break down the number of transmissions into packets for 1) phase 1, 2) phase 2, 3) phase 3, and 4) fragmentation to understand how different parts of the algorithm contribute to the total number of transmissions.
• Accuracy of result is defined as the percentage of actual skylines (gathered by the DSS algorithm) over the correct skylines.
• Progressiveness is defined as the ratio of the received skylines over the complete skylines at the time of receiving the first and 50% of the skylines.
• Delay to compute skyline is defined as the time of receiving the complete skylines.
• Load balance of transmissions is defined as the evenness in the distribution of the number of transmissions among the nodes.

To study the DSS algorithms over a wide range of inputs and environments, we varied five parameters in our simulations:
• Density (25, 17, 9 neighbors for dense, moderate, and sparse topologies)
• Network size (100×100, 200×200, 300×300, 500×500 meters)
• Number of query points (4, 5, 6, 7, 8, 9, and 10)
• Different MBR(Q) (5%, 10%, 15%, 20%, and 25% of the entire network area)
• Use of arbitrary synchronization point

We ran a total of 25,200 simulation runs. All the results are averaged over 10 different topologies with the same parameters. We do not show all the results due to lack of space.

For our simulations, we implemented the three DSS algorithms in TinyOS 1.1.15 and used the TOSSIM packet-level simulator [54]. We also use GPSR [48] for both GPSR routing and GHT-based routing. Fig. 5.10 presents the software architecture of DSS over GPSR.

In each of our experiments, the query points are positioned at i.i.d. random locations with uniform distribution while satisfying the given MBR(Q). We placed the centroid of the MBR(Q) at the centroid of the network. All the nodes are positioned at i.i.d. random locations with uniform distribution across the network area. All the nodes in the network are connected; they form a single spanning tree, verified by Kruskal’s algorithm. We used the qhull library [71] for Voronoi neighbor computation. We assume that there are no degenerate cases such as fewer than three query points (which cannot form a query convex hull) or any three nodes on a line.

We use the disc radio model with a communication radius of 50 meters. Reasonably reliable communication is possible at this range on the TelosB mote platform. We do not simulate packet losses due to interference or buffer overrun in the GPSR [48] implementation that we used. However, our simulations drop packets when face routing fails [48].
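The topology generation and connectivity requirement described above can be sketched as follows. This is an illustrative sketch only, not the TinyOS/TOSSIM setup used in the experiments: the function names are hypothetical, and connectivity is tested with a union-find structure (the same component-merging step that Kruskal's algorithm relies on) over all node pairs within the 50-meter disc radius.

```python
import random
from itertools import combinations


def place_nodes(num_nodes, side, seed=None):
    """Place nodes i.i.d. uniformly at random in a side x side square (meters)."""
    rng = random.Random(seed)
    return [(rng.uniform(0, side), rng.uniform(0, side)) for _ in range(num_nodes)]


def is_connected(nodes, radius=50.0):
    """Return True if the disc-model graph over `nodes` has a single component."""
    parent = list(range(len(nodes)))

    def find(i):
        # Find the component root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    components = len(nodes)
    for i, j in combinations(range(len(nodes)), 2):
        (x1, y1), (x2, y2) = nodes[i], nodes[j]
        # Two nodes share a link when they lie within the radio radius.
        if (x1 - x2) ** 2 + (y1 - y2) ** 2 <= radius ** 2:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[ri] = rj
                components -= 1
    return components <= 1


if __name__ == "__main__":
    # Hypothetical example: redraw a 108-node, 300x300 topology until connected.
    seed = 0
    topology = place_nodes(108, 300, seed)
    while not is_connected(topology):
        seed += 1
        topology = place_nodes(108, 300, seed)
    print("connected topology found with seed", seed)
```

Redrawing until the graph is connected is one simple way to satisfy the connectivity assumption; the pairwise scan is quadratic in the number of nodes, which is acceptable for the network sizes considered here.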
Figure 5.11: Efficiency of DSS algorithms by network size (a-c) and density (d), with fixed |Q| = 10 and MBR(Q) = 25%. Panels plot ntx for TDSS, Central, RDSS, and TRDSS: (a) Sparse network (16, 50, 108 nodes); (b) Moderate network (26, 92, 202 nodes); (c) Dense network (36, 132, 292 nodes); (d) Network size 300×300 (108, 202, 292 nodes).

5.7.2 Results

All DSS algorithms are more efficient and scalable than the centralized algorithm in large networks. Fig. 5.11 shows the number of transmissions incurred by the DSS and centralized skyline algorithms. Among the three DSS algorithms, TDSS is the most efficient and RDSS is the least efficient. TDSS is also the most scalable, and RDSS the least scalable, as we increase the network size or density. In sparse and small networks, the centralized algorithm is the most efficient: it transmits 63.93% fewer packets than the TDSS algorithm when 16 nodes are sparsely deployed in a 100×100 network (Fig. 5.11(a) and 5.11(b)). In a large and dense network with 292 nodes in 300×300 meters, TDSS incurs 67.61% fewer transmissions than the centralized algorithm, and 51.65% fewer transmissions than RDSS, for |Q| = 10 and MBR(Q) = 25%.
Figure 5.12: Breakdown of the number of transmissions (phase 1, phase 2, phase 3, and fragments) for TDSS, RDSS, and TRDSS with the same number of nodes (292 nodes) in the network, with fixed |Q| = 10, MBR(Q) = 25%: (a) dense 300×300 network; (b) sparse 500×500 network.

TDSS is more scalable than the other DSS algorithms because it partitions the potentially large search space outside of the query convex hull. Due to the partitioning, TDSS has 1) less packet fragmentation, because the smaller search space yields a smaller number of potential skylines to put in the packet, and 2) a shorter routing path than RDSS, which traverses a large area outside of the query convex hull without partitioning. Fig. 5.12(a) shows the result of an experiment in a densely deployed 300×300 network with fixed |Q| = 10, MBR(Q) = 25%. TDSS has 90.78% fewer packet fragmentations, and 34.79% (in phase 3) fewer unfragmented packet transmissions than RDSS. RDSS’s high cost can be attributed to its having to traverse a large area outside of the convex hull without partitioning, and hence to the large number of skylines to be exchanged among the nodes for dominance check.

Fig. 5.13 shows the number of transmissions by the number of query points (|Q|) in three different densities and three different network sizes with fixed MBR(Q) = 25%. If we fix the MBR(Q), for a given network size and density, the number of transmissions by the DSS algorithms does not depend on the number of query points but depends on the area of the query convex hull (Fig. 5.14). In a network of 292 nodes, the TDSS and TRDSS algorithms transmit fewer packets in denser and smaller networks (300×300), as in Fig. 5.13(d), than in sparser and larger networks (500×500), as in Fig. 5.13(e).
With 292 nodes and a query size of 10, TDSS in a dense 300×300 network uses 36.34% fewer transmissions than in a sparse 500×500 network. TDSS and TRDSS are more efficient in denser and smaller networks due to relatively short routing paths in dense and small networks compared to the path lengths in sparse and large networks.

Figure 5.13: Efficiency of three DSS algorithms by the number of query points (|Q|), with fixed MBR(Q) = 25%. Panels plot ntx against |Q| for TDSS, Central, RDSS, and TRDSS: (a) Sparse 100×100 (16 nodes); (b) Moderate 100×100 (26 nodes); (c) Dense 100×100 (36 nodes); (d) Dense 300×300 (292 nodes); (e) Sparse 500×500 (292 nodes); (f) Moderate 500×500 (554 nodes).

By comparing the breakdown of the number of transmissions observed in the dense and small network in Fig. 5.12(a) and the sparse and large network in Fig. 5.12(b), we found a significantly different number of transmissions between these two networks in phase 2, regardless of the algorithm. The Voronoi-based geographic flooding used in phase 2 enables all DSS algorithms to traverse the internal query convex hull of dense and smaller networks more efficiently, thereby resulting in a shorter routing path. However, there are fewer packet fragmentations with TDSS and TRDSS than with RDSS, which results in an overall smaller number of transmissions for TDSS and TRDSS. In a sparse and small network, the centralized algorithm uses the fewest transmissions, followed by RDSS and then by TDSS and TRDSS (Fig. 5.13(a)). In a large network with 554 nodes in a 500×500 network (Fig.
5.13(f)) and |Q| = 10, TDSS has 91.49% fewer transmissions than the centralized algorithm, while TRDSS and RDSS are also more efficient than the centralized algorithm, with 89.24% and 81.65% fewer transmissions respectively, but are less efficient than TDSS.

Fig. 5.14 shows that the number of transmissions increases as the MBR(Q) increases in experiments with three different densities, with 10 query points and 100×100 and 300×300 network sizes. Similar to Fig. 5.13, the number of transmissions by the centralized algorithm is almost constant over all settings. We observed that TDSS outperforms RDSS and TRDSS as the network becomes denser and larger, as in Fig. 5.14(d). Although these patterns are similar to those of Fig. 5.13, we observed that the number of transmissions is more dependent on the variation of MBR(Q) than on the number of query points. This result validates that the query points inside of the query convex hull do not have any impact on the spatial skylines. The spatial skylines are determined by the geometric relation between nodes and the query convex hull.

Figure 5.14: Efficiency of three DSS algorithms by MBR(Q), with fixed |Q| = 10, and network size = 100×100, 300×300. Panels plot ntx against MBR(Q) for TDSS, Central, RDSS, and TRDSS: (a) Sparse 100×100 (16 nodes); (b) Sparse 300×300 (108 nodes); (c) Moderate 300×300 (202 nodes); (d) Dense 300×300 (292 nodes).

All three DSS algorithms provide 100% accurate skyline results across different network sizes, densities, and values of other parameters. Fig. 5.15(a) shows that the three DSS algorithms provide accurate skylines with negligible false positives at the time when the rendezvous point receives all the skylines.
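The final dominance check at the rendezvous point can be illustrated with a brute-force filter over the gathered candidates. This is a minimal sketch for exposition, not the TinyOS implementation evaluated here: the function names are hypothetical, and it assumes the standard spatial dominance definition, in which a node p dominates a node p′ if p is at least as close as p′ to every query point and strictly closer to at least one.

```python
from math import dist


def dominates(p, other, query_points):
    """Return True if p spatially dominates `other` with respect to query_points."""
    strictly_closer = False
    for q in query_points:
        d_p, d_other = dist(p, q), dist(other, q)
        if d_p > d_other:
            return False          # p is farther from some query point
        if d_p < d_other:
            strictly_closer = True
    return strictly_closer


def prune_false_positives(candidates, query_points):
    """Keep only the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(o, c, query_points) for o in candidates if o != c)
    ]


if __name__ == "__main__":
    # Hypothetical query points and candidate skylines.
    query = [(0, 0), (10, 0), (5, 10)]
    gathered = [(5, 3), (5, 20), (1, 1)]
    print(prune_false_positives(gathered, query))
```

In this hypothetical input, (5, 20) is farther than (5, 3) from every query point, so it is a false positive and is pruned, while (5, 3) and (1, 1) do not dominate each other and both remain. The pairwise scan is quadratic in the number of candidates, which is small compared with the network size.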
Note that these false positives are removed during the final dominance check as soon as the rendezvous point receives all the skylines. TDSS and TRDSS produce a similar level of false positives, which is higher than that of RDSS. For example, in a 300×300 network, RDSS produces 0.00% false positives whereas TDSS and TRDSS produce 2.09% false positives. Because both TDSS and TRDSS are ignorant of the skylines from the other extended triangles, some special cases such as an empty triangle can result in temporary false positives before the rendezvous point prunes them during the final dominance check.

Figure 5.15: Accuracy of results in a moderate topology (a), and load balance of transmissions in a moderate 300×300 network (b), both with fixed |Q| = 10, MBR(Q) = 25%. (a) Accuracy of results (false positives by network size for TDSS, RDSS, and TRDSS); (b) Distribution of transmissions per node (fraction of nodes vs. ntx).

Load balancing is a desirable property of protocols on low power wireless networks to prevent early network partitioning. Fig. 5.15(b) shows the distribution of transmissions per node to analyze the load balancing property of the DSS algorithms. TDSS presents the best load balancing (steepest curve), followed by TRDSS, and then by RDSS, which presents the worst load balancing (shallowest curve). This is because TDSS distributes communications over three control points (rendezvous point, triangle home, and clockwise home), while RDSS concentrates communications at the rendezvous point only. TRDSS distributes communications over the same three control points as TDSS, but its internal skyline search concentrates on the rendezvous point only. Fig. 5.15(b) also shows that 73.2% of the nodes do not transmit any packet with TDSS and TRDSS, and 74.2% of the nodes with RDSS, in a 202-node network deployed in a 300×300m area.
In another result from a 108-node network in 300×300m, 46.67% of nodes have no transmissions with TDSS and TRDSS, and 45.94% with RDSS (not shown in this graph).

Fig. 5.16(a) shows that all the DSS algorithms have delays comparable to the delay of the centralized algorithm. Among the three DSS algorithms, TDSS shows the least delay, followed by TRDSS and then by RDSS. In a moderate density 300×300 network with fixed |Q| = 10 and MBR(Q) = 25%, TDSS, RDSS, and TRDSS can provide the skyline result in 3.23, 5.16, and 3.32 seconds respectively, while the centralized algorithm results in a 5.26 second delay (Fig. 5.16(a)). Due to the triangulation of both the internal and external query convex hull, TDSS can speed up the skyline search by exploiting parallel search, while RDSS sequentially traverses all of the internal and then the external query convex hull, which results in a longer delay. TRDSS has a slightly larger delay than TDSS because the delay is dominated by the time it takes to search external skylines, which are searched after the internal skyline and the division of internal skylines. Unlike RDSS, both TDSS and TRDSS utilize triangulation for the external query convex hull. The delay of TRDSS does not depend on the latency to search internal skylines gathered by the rendezvous point, since these skylines are close to the rendezvous point; they can be collected quickly by Voronoi-based geographic flooding with optimization.

Figure 5.16: Delay (a) and progressiveness (b) of three DSS algorithms, with fixed |Q| = 10, MBR(Q) = 25%, and moderate density. (a) Delay to compute skyline (delay in seconds by network size); (b) Progressiveness (300×300 network; fraction of skylines received vs. arrival time in seconds).

Fig. 5.16(b) shows the progressiveness of skylines (fraction of skylines received vs. arrival time) with fixed |Q| = 10 and MBR(Q) = 25% in a moderate 300×300 network.
The first skyline is returned to the rendezvous point at 1.36, 1.46, and 2.16 seconds by TRDSS, RDSS, and TDSS respectively. RDSS and TRDSS provide 50% of the skylines to the rendezvous point at 1.56 and 1.66 seconds respectively, while TDSS provides 50% of the skylines at 2.65 seconds. This shows that both RDSS and TRDSS provide better progressiveness than TDSS until nearly 80% of the skylines are returned, which is the fraction of the internal skylines. Algorithms such as RDSS and TRDSS, in which the rendezvous point directly gathers internal skylines, provide better progressive results. Note again that the final delay to compute skylines with TDSS is shorter than that of RDSS, 3.23 and 5.16 seconds respectively, because TDSS benefits from parallel skyline search within the partitioned space.

Chapter 6

Conclusions and Future Work

Sensor readings in a WSN reflect the spatial and temporal correlations of physical attributes existing inherently in the environment. We proposed the CAG algorithm, which forms clusters of nodes sensing similar values. These clusters remain unchanged as long as the sensor readings do not change significantly. If the change in sensor readings is large, the cluster adjustment algorithm changes the cluster membership to ensure that the sensor readings within a cluster are similar. If a small number of nodes need to join a different cluster (which happens frequently in our experiments due to a large change in sensor readings or communication problems), they can do so locally by communicating with the neighboring nodes associated with other clusters. By avoiding global coordination and using only local communications, CAG can efficiently perform cluster adjustment. Each cluster generates a single aggregate result, thereby providing energy efficient and approximate aggregation results with small, often negligible, and bounded error.
We demonstrated the effectiveness of CAG in performing energy efficient in-network aggregation leveraging both spatial and temporal correlations. We mathematically modeled the spatial correlation of the measured data and evaluated the efficiency and accuracy of CAG analytically and empirically. CAG can take maximal advantage of stronger spatial and temporal correlations by increasing energy efficiency and providing results with predictable and bounded errors. These benefits are amplified as the density of nodes becomes higher. CAG is shown to scale gracefully when the number of nodes in the network grows. Moreover, CAG is shown to be resilient to packet loss. CAG is the first system which realizes the semantic broadcast to conserve energy, while ensuring bounded approximation by leveraging spatial and temporal correlations prevalent in nature.

We would like to extend this work by focusing on the design of a hybrid (proactive and reactive) clustering protocol. The current CAG implementation collects the sensor reading per cluster in every epoch proactively. Whenever the cluster structure changes, CAG will resample, recompute, and send back the aggregate value to the base station. This mechanism enables reactive data acquisition. We would like to provide proactive and reactive data acquisition depending on the application requirement or data characteristics.

We developed an in-network processing steamflood monitoring system prototype. We created the rule-based problem and false alarm identification algorithm by collaboratively exploiting spatial and temporal correlations in the sensor readings. As a proof of concept, we performed simulations with a realistic setup and parameters, and showed that our suggested algorithm can detect the anomalies quickly, reliably, and accurately, thereby improving the current SCADA system.
As a long term plan, we will evaluate our algorithm using a more realistic simulation scenario in which a more complex topology makes decisions difficult. We also plan to add a mechanism to quantify the confidence of the detection, identification, and localization results in our algorithm. This systematically quantifies the quality of the computed result on-line.

An alternative design of the detection, identification, and localization system can use a centralized architecture in which all the sensor readings are sent to and processed at a single node. We plan to quantify the limitations and strengths of a centralized approach and compare it to our distributed approach.

We designed three distributed algorithms to compute the spatial skyline in wireless sensor networks. The spatial skyline is a useful primitive to select multiple good positions for tracking and other collaborative applications. We showed that all three DSS algorithms outperform the centralized algorithm while providing 100% accurate skylines progressively with modest delay, even given multiple concurrent queries. Moreover, we showed that TDSS has the best energy efficiency, best scalability, least delay, and best communication load balancing among the three DSS algorithms.

DSS can be extended in three major directions. First, we plan to extend our work to compute distributed skylines with dynamic query points and dynamic sensor nodes over time. Second, we plan to extend our system to support other operations such as top-K or kNN. Third, we plan to design techniques to make DSS robust to node failures and packet drops.

Bibliography

[1] Daniel J. Abadi, Samuel Madden, and Wolfgang Lindner. REED: Robust, Efficient Filtering and Event Detection in Sensor Networks. In International Conference on Very Large Data Bases (VLDB), August 2005.
[2] Shyam Antony, Ping Wu, Divy Agrawal, and Amr El Abbadi. MultiObjective OLAP: Efficient Skyline Computation over Ad-Hoc Aggregations. In ICDE, April 2008.
[3] Wolf-Tilo Balke, Ulrich Güntzer, and Jason Xin Zheng.
Efficient Distributed Skylining for Web Information Systems. In EDBT, March 2004. [4] Boulat A. Bash and Peter J. Desnoyers. Exact distributed Voronoi cell computation in sensor networks. In IPSN, April 2007. [5] Stephan Borzsonyi, Donald Kossmann, and Konrad Stocker. The Skyline Operator. In ICDE, April 2001. [6] Chee-Yong Chan, H. V. Jagadish, Kian-Lee Tan, Anthony K. H. Tung, and Zhenjie Zhang. Finding k-Dominant Skylines in High Dimensional Space. In SIGMOD, June 2006. [7] Hekang Chen, Shuigeng Zhou, and Jihong Guan. Towards Energy-Efficient Skyline Moni- toring in Wireless Sensor Networks. In EWSN, June 2007. [8] Sze-Foo Chien. Critical Flow of Wet Steam Through Chokes. In Society of Petroleum Engineers (SPE), March 1990. [9] Sze-Foo Chien. Critical Flow Properties of Wet Steam. In Society of Petroleum Engineers (SPE), February 1993. [10] Sze-FooChienandJamesL.G.Schrodt. DeterminationofSteamQualityandFlowRateUs- ing Pressure Data From an Orifice Meter and a Critical Flowmeter. In Society of Petroleum Engineers (SPE), May 1995. [11] David Chu, Amol Deshpande, Joseph M. Hellerstein, and Wei Hong. Approximate Data Collection in Sensor Networks using Probabilistic Models. In International Conference on Data Engineering (ICDE), April 2006. [12] MauriceChu,HorstHaussecker,andFengZhao. ScalableInformation-DrivenSensorQuery- ing and Routing for ad hoc Heterogeneous Sensor Networks. In International Journal of High Performance Computing Applications, Vol. 16, No. 3, 2002. [13] Jeffrey Considine, Feifei Li, George Kollios, and John Byers. Approximate aggregation techniques for sensor databases. In International Conference on Data Engineering (ICDE), March 2004. [14] ThomasH.Cormen,CharlesE.Leiserson,RonaldL.Rivest,andCliffordStein. Introduction to Algorithms. In The MIT Press, 2001. 137 [15] Jim C.P.Liou. Pipeline Variable Uncertainties And Their Effects on Leak Detectability. In American Petroleum Institute (API) 1149, November 1993. 
[16] Razvan Cristescu, Baltasar Beferull-Lozano, and Martin Vetterli. On network correlated data gathering. In IEEE INFOCOM, March 2004. [17] Bin Cui, Hua Lu, Quanqing Xu, Lijiang Chen, Yafei Dai, and Yongluan Zhou. Parallel Distributed Processing of Constrained Skyline Queries by Filtering. In ICDE, April 2008. [18] Mark R.T. Dale, Philip Dixon, Marie-Josee Fortin, Pierre Legendre, Donald E. Myers, and Michael S. Rosenbreg. Conceptual and mathematical relationships among methods for spatial analysis. In Ecography Vol.25 No.5, October 2002. [19] Mark de Berg, M. van Krefeld, M. Overmars, and O. Schwarzkopf. Computational Geome- try: Algorithms and Applications. In Springer, 2000. [20] Ke Deng, Xiaofang Zhou, and Heng Tao Shen. Multi-source Skyline Query Processing in Road Networks. In ICDE, April 2007. [21] Amol Deshpande, Carlos Guestrin, Samuel R. Madden, Joseph M. Hellerstein, and Wei Hong. Model-Driven Data Acquisition in Sensor Networks. In VLDB, August 2004. [22] Dale Erickson and David Twaite. Pipeline Integrity Monitoring System for Leak Detec- tion, Control, and Optimization of Wet Gas Pipelines. In SPE ATCE (Annual Technical Conference and Exhibition), October 1996. [23] Qing Fang, Feng Zhao, and Leonidas Guibas. Lightweight Sensing and Communication Protocols for Target Enumeration and Aggregation. In MobiHoc, June 2003. [24] Carmen Fernandez and peter J. Green. Modeling spatially correlated data via mixtures: a Bayesian approach. In University of St Andrews, University of Bristol, January 2002. [25] K.RubenGabrielandRobertR.Sokal. ANewStatisticalApproachtoGeographicVariation Analysis. In Systematic Zoology, Vol. 18, No. 3, September 1969. [26] Deepak Ganesan, Ben Greenstein, Denis Perelyubskiy, Deborah Estrin, and John Heide- mann. An Evaluation of Multi-resolution Storage for Sensor Networks. InACM Conference on Embedded Networked Sensor Systems (SenSys), November 2003. [27] Phillip B. Gibbons. 
Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports. In VLDB, September 2001. [28] Ashish Goel and Deborah Estrin. Simultaneous optimization for concave costs: single sink aggregation or single source buy-at-bulk. In ACM-SIAM symposium on Discrete algorithms (SODA), 2003. [29] Samir Goel and Tomasz Imielinski. Prediction-based Monitoring in Sensor Networks: Tak- ing Lessons from MPEG. In ACM the Computer Communication Review (CCR), 2001. [30] Lin Gu, Dong Jia, Pascal Vicaire, Ting Yan, Liqian Luo, Ajay Tirumala, Qing Cao, Tian He,JohnA.Stankovic,TarekAbdelzaher,,andBruceH.Krogh. LightweightDetectionand Classification for Wireless Sensor Networks in Realistic Environments. In ACM Conference on Embedded Networked Sensor Systems (SenSys), November 2005. [31] Carlos Guestrin, Peter Bodi, Romain Thibau, Mark Paski, and Samuel Madden. Dis- tributedregression: anefficientframeworkformodelingsensornetworkdata. InInformation Processing in Sensor Networks (IPSN), April 2004. 138 [32] Chao Gui and Prasant Mohapatra. Power conservation and Quality of Surveillance in Target Tracking Sensor Networks. In International Conference on Mobile Computing and Networking (Mobicom), May 2004. [33] HimanshuGupta,VishnuNavda,SamirR.Das,andVishalChowdhary. EfficientGathering of Correlated Data in Sensor Networks. In ACM International Symposium on Mobile Ad Hoc Networking and Computing (MobiHoc), May 2005. [34] J. M. Halley, S. Hartley, A. S. Kallimanis, W. E. Kunin, J. J. Lennon, and S. P. Sgardelis. Uses And Abuses Of Fractal Methodology In Ecology. In Ecology Letters, 2004. [35] Wendi Rabiner Heinzelman, Anantha Chandrakasan, and Hari Balakrishnan. Energy- Efficient Communication Protocol for Wireless Microsensor networks. In Hawaii Inter- national Conference on System Sciences (HICSS), January 2000. [36] Jason Hill, Robert Szewczyk, Alec Woo, Seth Hollar, David Culler, and Kristofer Pister. System Architecture Directions for Network Sensors. 
In International Conference on Archi- tectural Support for Programming Languages and Operating Systems (ASPLOS), November 2000. [37] K.C.Hong. Steamfloodreservoirmanagement: thermalenhancedoilrecovery. InPennWell Books, January 1994. [38] Kai Hormann and Alexander Agathos. The Point in Polygon Problem for Arbitrary Poly- gons. In Computational Geometry, Vol. 20, No. 3, 2001. [39] Zhiyong Huang, Christian S. Jensen, Hua Lu, and Beng Chin Ooi. Skyline Queries Against Mobile Lightweight Devices in MANETs. In ICDE, April 2006. [40] Zhiyong Huang, Hua Lu, Beng Chin Ooi, and Anthony K. H. Tung. Continuous Skyline Queries for Moving Objects. In TKDE, December 2006 Vol. 18 Issue 12. [41] ChalermekIntanagonwiwat, DeborahEstrin, RameshGovindan, andJohnHeidemann. Im- pactofNetworkDensityonDataAggregationinWirelessSensorNetworks. InInternational Conference on Distributed Computing Systems, July 2002. [42] Chalermek Intanagonwiwat, Ramesh Govindan, and Deborah Estrin. Directed Diffusion: A Scalable and Robust Communication Paradigm for Sensor Networks. In International Conference on Mobile Computing and Networking (Mobicom), August 2000. [43] Invensys. Dynsim (Dynamic simulation). In http://www.simsci- esscor.com/us/eng/products/productlist/dynsim/default.htm, 2005. [44] Ankur Jain, Edward Y. Chang, and Yuan-Fang Wang. Adaptive Stream Resource Manage- ment Using Kalman Filters. In ACM SIGMOD, June 2004. [45] Apoorva Jindal and Konstantinos Psounis. Modeling spatially-correlated sensor network data. InIEEE Communications Society Conference on Sensor and Ad Hoc Communications and Networks (SECON), October 2004. [46] Brad Karp and H. T. Kung. GPSR: Greedy Perimeter Stateless Routing for Wireless Networks. In MOBICOM, August 2000. [47] Mohamed Khalefa, Mohamed Mokbel, and Justin Levandoski. Skyline Query Processing for Incomplete Data. In ICDE, April 2008. 139 [48] Young-Jin Kim, Ramesh Govindan, Brad Karp, and Scott Shenker. Geographic Routing Made Practical. In NSDI, May 2005. 
[49] Young-Jin Kim, Ramesh Govindan, Brad Karp, and Scott Shenker. Lazy Cross-Link Re- moval for Geographic Routing. In SenSys, November 2006. [50] Bhaskar Krishnamachari and Sitharama Iyengar. Distributed Bayesian Algorithms for Fault-Tolerant Event Region Detection in Wireless Sensor Networks. In IEEE Transac- tions on Computers, March 2004. [51] Fabian Kuhn, Roger Wattenhofer, and Aaron Zollinger. Worst-Case Optimal and Average- Case Efficient Geometric Ad-Hoc Routing. In MobiHoc, June 2003. [52] Larry W. Lake. Enhanced oil recovery. In Prentice Hall, 1989. [53] Jack Lennon. Red-shifts and red herrings in geographical ecology. In Ecography Vol.23 No.1, February 2000. [54] Philip Levis, Nelson Lee, Matt Welsh, and David Culler. TOSSIM: Accurate and Scalable Simulation of Entire TinyOS Applications. In ACM Conference on Embedded Networked Sensor Systems (SenSys), November 2003. [55] Xiang Lian and Lei Chen. Monochromatic and Bichromatic Reverse Skyline Search over Uncertain Databases. In SIGMOD, June 2008. [56] Xuemin Lin, Yidong Yuan, Qing Zhang, and Ying Zhang. Selecting Stars: The k Most Representative Skyline Operator. In ICDE, April 2007. [57] Juan Liu, Jie Liu, James Reich, Patrick Cheung, and Feng Zhao. Distributed Group Man- agement for Track Initiation and Maintenance in Target Localization Applications. In In- formation Processing in Sensor Networks (IPSN), April 2003. [58] Martin Lukac, Lewis Girod, and Deborah Estrin. Applications, Technologies, Architec- tures, and Protocols for Computer Communication. In SIGCOMM workshop on Challenged networks, September 2006. [59] Liqian Luo, Tarek Abdelzaher, Tian He, and John A. Stankovic. Design and Comparison of Lightweight Group Management Strategies in EnviroSuite. In International Conference on Distributed Computing in Sensor Systems (DCOSS), July 2005. [60] Samuel Madden, Michael J. Franklin, Joseph Hellerstein, and Wei Hong. TAG: a Tiny AGgregation Service for Ad-Hoc Sensor Networks. In OSDI, December 2002. 
[61] Samuel R. Madden, Michael J. Franklin, Joseph M. Hellerstein, and Wei Hong. TAG: Tiny AGgregation service for ad-hoc sensor networks. In Symposium on Operating Systems Design and Implementation (OSDI), December 2002. [62] Alan Mainwaring, Robert Szewczyk, Joseph Polastre, and John Anderson. Habitat moni- toring on great duck island. In http://www.greatduckisland.net, 2005. [63] Arati Manjeshwar and Dharma P. Agrawal. TEEN: A Routing Protocol for Enhanced Effi- ciency in Wireless Sensor Networks. In International Workshop on Parallel and Distributed Computing, Issues in Wireless Networks and Mobile Computing (IPDPS), April 2001. [64] Arati Manjeshwar and Dharma P. Agrawal. APTEEN: A Hybrid Protocol for Efficient Routing and Comprehensive Information Retrieval in Wireless Sensor Networks. In Inter- national Workshop on Parallel and Distributed Computing, Issues in Wireless Networks and Mobile Computing (IPDPS), April 2002. 140 [65] Michael Morse, Jignesh M. Patel, and H.V. Jagadish. Efficient Skyline Computation over LowCardinality Domains. In VLDB, September 2007. [66] Suman Nath, Phillip Gibbons, Zachary Anderson, and Srinivasan Seshan. Synopsis Diffu- sion for Robust Aggregation in Sensor Networks. In ACM Conference on Embedded Net- worked Sensor Systems (SenSys), November 2004. [67] Chris Olston, Jing Jiang, and Jennifer Widom. Adaptive filters for continuous queries over distributed data streams. In ACM SIGMOD, June 2003. [68] Dimitris Papadias, Yufei Tao, Greg Fu, and Bernhard Seeger. Progressive skyline compu- tation in database systems. In TODS Vol. 30 No. 1, March 2005. [69] Sundeep Pattem, Bhaskar Krishnamachari, and Ramesh Govindan. The Impact of Spatial Correlation on Routing with Compression in Wireless Sensor Networks. In Information Processing in Sensor Networks (IPSN), April 2004. [70] Jian Pei, Bin Jiang, Xuemin Lin, and Yidong Yuan. Probabilistic Skylines on Uncertain Data. In VLDB, September 2007. [71] qhull@qhull.org. Qhull. 
In http://www.qhull.org/. [72] Nithya Ramanathan, Kevin Chang, Rahul Kapur, Lewis Girod, Eddie Kohler, and Deb- orah Estrin. Sympathy for the Sensor Network Debugger. In ACM ACM Conference on Embedded Networked Sensor Systems (SenSys), November 2005. [73] Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp, and Scott Schenker. A Scalable Content-addressable Network. In SIGCOMM, August 2001. [74] Sylvia Ratnasamy, Brad Karp, Scott Shenker, Deborah Estrin, Ramesh Govindan, Li Yin, , and Fang Yu. Data-Centric Storage in Sensornets with GHT, A Geographic Hash Table. In MONET, Vol.8 No.4, 2003. [75] Lorenzo A. Rossi, Bhaskar Krishnamachari, and C.-C. Jay Kuo. Distributed Parameter Es- timationforMonitoringDiffusionPhenomenaUsingPhysicalModels. InIEEECommunica- tions Society Conference on Sensor and Ad Hoc Communications and Networks (SECON), October 2004. [76] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. In Prentice Hall, 2003. [77] Nikos Sarkas, Gautam Das, Nick Koudas, and Anthony K. H. Tung. Categorical Skylines for Streaming Data. In SIGMOD, June 2008. [78] Oliver Schabenberger and Carol A. Gotway. Statistical Methods for Spatial Data Analysis. In Chapman & Hall/CRC, 2005. [79] Mohamed A. Sharaf, Jonathan Beaver, Alexandros Labrinidis, and Panos K. Chrysanthis. TiNA: A Scheme for Temporal Coherency-Aware in-Network Aggregation. In ACM Work- shop on Data Engineering for Wireless and Mobile Access (MobiDe), September 2003. [80] Mehdi Sharifzadeh and Cyrus Shahabi. The spatial skyline queries. In VLDB, September 2006. [81] Masanobu Shinozuka and Xuejiang Dong. Damage Detection and Localization for Water Delivery Systems. In University of California, Irvine, September 2005. 141 [82] Fred Stann and John Heidemann. BARD: Bayesian-Assisted Resource Discovery. In IEEE INFOCOM, March 2005. [83] Ivan Stoianov, Lama Nachman, and Sam Madden. PIPENET: A Wireless Sensor Network for Pipeline Monitoring. 
InInformation Processing in Sensor Networks (IPSN), April 2007. [84] Ivan Stoianov, Lama Nachman, Andrew Whittle, Sam Madden, and Ralph Kling. Sensor NetworksforMonitoringWaterSupplyandSewerSystems: LessonsFromBoston. InWater Distribution System Analysis Symposium, August 2006. [85] Robert Szewczyk, Alan Mainwaring, Joseph Polastre, and David Culler. An Analysis of a Large Scale Habitat Monitoring Application. In ACM Conference on Embedded Networked Sensor Systems (SenSys), November 2004. [86] Yufei Tao and Dimitris Papadias. Maintaining Sliding Window Skylines on Data Streams. In TKDE, Vol. 18, No. 3, March 2006. [87] Yufei Tao, Xiaokui Xiao, and Jian Pei. SUBSKY: Efficient Computation of Skylines in Subspaces. In ICDE, April 2006. [88] Yufei Tao, Xiaokui Xiao, and Jian Pei. Efficient Skylines and Top-k Retrieval in Subspaces. In TKDE, Vol. 19, No. 8, August 2007. [89] T.K.Perkins. Critical and Subcritical Flow of Multiphase Mixtures Through Chokes. In SPE Drilling and Completion, December 1993. [90] Waldo R. Tobler. A computer movie simulating urban growth in the Detroit region. In Economic Geography Vol.46 No.2, 1970. [91] Gilman Tolle, Joseph Polastre, Robert Szewczyk, David Culler, Neil Turner, Kevin Tu, Stephen Burgess, Todd Dawson, Phil Buonadonna, David Gay, and Wei Hong. A Macro- scope in the Redwoods. In ACM Conference on Embedded Networked Sensor Systems (Sen- Sys), November 2005. [92] Godfried T. Toussaint. The relative neighborhood graph of a finite planar set. In Pattern Recognition, Vol. 12, No. 4. [93] Akrivi Vlachou, Christos Doulkeridis, and Yannis Kotidis. Angle-based Space Partitioning for Efficient Parallel Skyline Computation. In SIGMOD, June 2008. [94] Akrivi Vlachou, Christos Doulkeridis, Kjetil Noervaag, and Michalis Vazirgiannis. Skyline- based Peer-to-Peer Top-k Query Processing. In ICDE, April 2008. [95] Akrivi Vlachou, Christos Doulkeridis, Michalis Vazirgiannis, and Yannis Kotidis. 
SKYPEER:EfficientSubspaceSkylineComputationoverDistributedData. InICDE,April 2007. [96] Matt Welsh and Geoff Mainland. Programming Sensor Networks Using Abstract Regions. In NSDI, March 2004. [97] Alec Woo, Samuel R. Madden, and Ramesh Govindan. Networking Support for Query Processing in Sensor Networks. In Communications of the ACM (CACM), June 2004. [98] Ping Wu, Caijie Zhang, Ying Feng, Ben Y. Zhao, Divyakant Agrawal, and Amr El Abbadi. Parallelizing Skyline Queries for Scalable Distribution. In EDBT, March 2006. 142 [99] Ning Xu, Sumit Rangwala, Krishna Kant Chintalapudi, Deepak Ganesan, Alan Broad, Ramesh Govindan, and Deborah Estrin. A Wireless Sensor Network for Structural Moni- toring. In ACM Conference on Embedded Networked Sensor Systems (SenSys), November 2004. [100] Yong Yao and Johannes Gehrke. Query Processing for Sensor Networks. In Biennial Con- ference on Innovative Data Systems Research (CIDR), January 2003. [101] SunHee Yoon and Cyrus Shahabi. An Experimental Study of the Effectiveness of Clustered AGgregation (CAG) Leveraging Spatial and Temporal Correlations in Wireless Sensor Net- works. In USC Computer Science Department Technical Report 05-869, 2005. [102] SunHee Yoon and Cyrus Shahabi. Exploiting Spatial Correlation Towards an Energy Ef- ficient Clustered AGgregation Technique (CAG). In IEEE International Conference on Communications (ICC), May 2005. [103] SunHeeYoonandCyrusShahabi. TheClusteredAGgregation(CAG)TechniqueLeveraging Spatial and Temporal Correlations in Wireless Sensor Networks. In ACM Transactions On Sensor Networks (TOSN) Vol. 3 Issue 1, March 2007. [104] YanYu, RameshGovindan, andDeborahEstrin. GeographicalandEnergyAwareRouting: ARecursiveDataDisseminationProtocolforWirelessSensorNetworks. InUCLAComputer Science Department Technical Report UCLA/CSD-TR-01-0023, May 2001. [105] Wensheng Zhang and Guohong Cao. Optimizing Tree Reconfiguration for Mobile Target Tracking in Sensor Networks. In IEEE INFOCOM, March 2004. 
[106] Feng Zhao, Xenofon Koutsoukos, Horst Haussecker, Jim Reich, and Patrick Cheung. Mon- itoring and Fault Diagnosis of Hybrid Systems. In Transactions on Systems, Man, and Cybernetics, Part B Vol.35 No.6, December 2005. [107] Jerry Zhao, Ramesh Govindan, and Deborah Estrin. Computing aggregates for monitoring wireless sensor networks. In IEEE International Workshop on Sensor Network Protocols and Applications (SNPA), May 2003. [108] Lin Zhu, Yufei Tao, and Shuigeng Zhou. Efficient Distributed Skyline Retrieval. In TKDE, 2008. 143
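For readers unfamiliar with the spatial skyline primitive referred to in the conclusions, the following is a minimal centralized sketch of the standard definition: a point is in the spatial skyline if no other point is at least as close to every query point and strictly closer to at least one. This is a brute-force illustration only, not the DSS or TDSS algorithms of this dissertation; the function names, the 2-D Euclidean setting, and the sample coordinates are assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # assumed 2-D coordinates (hypothetical representation)


def spatially_dominates(p: Point, other: Point, queries: Sequence[Point]) -> bool:
    """Return True if p is at least as close as `other` to every query point
    and strictly closer to at least one of them."""
    strictly_closer = False
    for q in queries:
        dist_p = math.dist(p, q)
        dist_other = math.dist(other, q)
        if dist_p > dist_other:
            return False  # p is farther from this query point, so no dominance
        if dist_p < dist_other:
            strictly_closer = True
    return strictly_closer


def spatial_skyline(points: Sequence[Point], queries: Sequence[Point]) -> List[Point]:
    """Brute-force spatial skyline: keep every point that no other point dominates.
    Cost is O(n^2 * |Q|); a point never dominates itself, so no self-check is needed."""
    return [
        p for p in points
        if not any(spatially_dominates(o, p, queries) for o in points)
    ]


if __name__ == "__main__":
    sensors = [(2.0, 0.0), (2.0, 3.0), (0.0, 0.0), (5.0, 5.0)]
    query_points = [(0.0, 0.0), (4.0, 0.0)]
    # (2, 0) dominates (2, 3) and (5, 5); (0, 0) is closest to the first query point.
    print(spatial_skyline(sensors, query_points))  # [(2.0, 0.0), (0.0, 0.0)]
```

A centralized computation like this requires every sensor position to be gathered at a single node, which is the baseline that the distributed algorithms are compared against.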
Asset Metadata
Creator
Yoon, SunHee (author)
Core Title
Efficient and accurate in-network processing for monitoring applications in wireless sensor networks
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Publication Date
08/01/2008
Defense Date
06/06/2008
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
data correlation, distributed algorithm, in-network aggregation, OAI-PMH Harvest, oilfield monitoring, spatial skyline, steam and waterflood monitoring
Language
English
Advisor
Shahabi, Cyrus (committee chair), Govindan, Ramesh (committee member), Krishnamachari, Bhaskar (committee member)
Creator Email
sunhee.yoon@gmail.com,sunheeyo@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-m1497
Unique identifier
UC165730
Identifier
etd-Yoon-2274 (filename),usctheses-m40 (legacy collection record id),usctheses-c127-92479 (legacy record id),usctheses-m1497 (legacy record id)
Legacy Identifier
etd-Yoon-2274.pdf
Dmrecord
92479
Document Type
Dissertation
Rights
Yoon, SunHee
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Repository Name
Libraries, University of Southern California
Repository Location
Los Angeles, California
Repository Email
cisadmin@lib.usc.edu