USC Computer Science Technical Reports, no. 869 (2005)
An Experimental Study of the Effectiveness of Clustered AGgregation (CAG) Leveraging Spatial and Temporal Correlations in Wireless Sensor Networks

SUNHEE YOON and CYRUS SHAHABI
Department of Computer Science, University of Southern California

Sensed data in Wireless Sensor Networks (WSN) reflect the spatial and temporal correlations of physical attributes existing intrinsically in the environment. In our previous work, we proposed CAG (Clustered AGgregation), which exploits this spatial correlation to trade off accuracy for efficiency during in-network aggregation. In this paper, we present the updated CAG algorithm that forms clusters of nodes sensing similar values within a given threshold (spatial correlation), and these clusters remain unchanged as long as the sensor values stay within a threshold over time (temporal correlation). With CAG, only one sensor reading per cluster is transmitted, whereas with Tiny AGgregation (TAG) all the nodes in the network transmit their sensor readings. Thus, CAG provides energy-efficient and approximate aggregation results where the error is bounded by a user-provided threshold. In this paper we extend our study in five directions: First, we design CAG for two modes of operation (interactive and streaming) to enable CAG to be used in different environments and for different purposes. Interactive mode is appropriate for dynamic and ad-hoc queries, whereas the streaming mode is appropriate for continuous queries. Second, we propose a fixed range clustering method which makes the performance of our system independent of the magnitude of sensor readings and the network topology. Third, using mica2 motes, we perform a large-scale measurement of real environmental data (temperature and light, both indoor and outdoor) and the wireless radio reliability, which were used for both analytical modeling and simulation experiments. Fourth, we model the spatially correlated data using the properties of our real world measurements.
Fifth, we investigate the effectiveness of CAG that exploits the temporal as well as spatial correlations using both the measured and modeled data. Our experimental results show that CAG in the interactive mode, with the user-provided error threshold of 10%, can save 50% of energy over TAG with only 4% inaccuracy in the result. The streaming mode of CAG can save even more energy (up to 73.21%) over the interactive mode when data shows high temporal correlation. CAG is the first system that leverages spatial and temporal correlations to improve energy efficiency of in-network aggregation. This study analytically and empirically validates CAG’s effectiveness.

Categories and Subject Descriptors: C.2.1 [Computer-Communication Networks]: Network Architecture and Design—Distributed networks; C.2.4 [Computer-Communication Networks]: Distributed Systems—Distributed databases; Distributed applications; I.6 [Computing Methodologies]: Simulation and Modeling

General Terms: Algorithm, Design, Measurement, Performance

Additional Key Words and Phrases: In-network Processing and Aggregation, Clustering, Spatial and Temporal Correlation, Energy Efficiency, Accuracy, Approximation, Modeling

This is an extended version of a paper that appeared in the IEEE International Conference on Communications (ICC), May 2005.
Permission to make digital/hard copy of all or part of this material without fee for personal or classroom use provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.
© 2005 ACM 0000-0000/2005/0000-0001 $5.00
ACM Transactions on Sensor Networks, Vol. V, No. N, August 2005, Pages 1–0??.

1. INTRODUCTION

In-network query processing and data aggregation are widely used to save energy, increase scalability, and reduce computation in many monitoring applications of WSN such as wildlife habitat monitoring [Szewczyk et al. 2004], structural health monitoring [Xu et al. 2004], moving target tracking [Fang et al. 2003], toxic waste monitoring, and seismic monitoring [Husker et al. 2003].

There have been a number of research studies pursuing efficient in-network aggregation in the literature [Intanagonwiwat et al. 2000; Madden et al. 2002; Yao and Gehrke 2003]. Tree-based routing is used in TAG [Madden et al. 2002], while data-centric, gradient-based routing is used in Directed Diffusion [Intanagonwiwat et al. 2000]. TAG, the landmark in-network query processing system based on an aggregation tree, supports data aggregation utilizing a number of sensor nodes. TAG has a fixed set of query operators and a query processor that runs on each node. Alternatively, Directed Diffusion allows users to define their own in-network aggregation operators. Neither of these systems exploits the spatial correlation of data to achieve even more efficient in-network aggregation.

Tobler’s first law of geography states that “Everything is related to everything else, but near things are more related than distant things” [Tobler 1970]. This statistical observation implies that data correlation decreases with increasing spatial separation. By leveraging this spatial property of sensor data, we proposed a scheme called Clustered AGgregation (CAG) [Yoon and Shahabi 2005] to improve existing in-network aggregation mechanisms. CAG forms clusters of the sensor nodes sensing similar values and transmits only a single value per cluster, as opposed to a single value per node in TAG-like schemes. Thus, CAG can significantly reduce the number of transmissions, which results in energy savings at the cost of a small error in the query result.
The error threshold τ is a user-provided parameter that describes the accuracy requirement of the result. This error threshold is used while building clusters such that the difference among the sensor readings in the cluster is bounded by the threshold. This ensures that the resulting approximate answer always stays within the error threshold of the correct answer.

In this paper, we extend CAG to operate in two modes, interactive and streaming, depending on the dynamics of the environment. The interactive mode alternates query and response phases. This is appropriate for scenarios where the environment (network topology and data) changes dynamically, or users change the approximation granularity or query attributes over time. On the other hand, in the streaming mode, the clusterheads keep transmitting streams of responses for a query that is issued just once. This mode of operation is appropriate when the environment does not change as frequently and the query remains valid for a certain period of time. Note that the interactive mode of CAG only exploits the spatial correlation of the sensor data to form clusters, whereas the streaming mode of CAG leverages both temporal and spatial correlations. The latter adjusts clusters locally as the data and topology change over time, so the cluster adjustments are infrequent when the data is more temporally correlated. With the interactive mode of CAG, we do not count the number of nodes per cluster for every query. Therefore, we give the same weight to different clusters independent of their size. With the streaming mode, however, we count the number of nodes per cluster for every query and use the cluster size as a weight to compute the aggregate function. Therefore, the results of the streaming mode are more accurate than those of the interactive mode.
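The weighting difference between the two modes can be made concrete for an AVG query. The following is a sketch under stated assumptions, not a derivation taken from the paper: the symbols $k$ (number of clusters), $n_i$ (size of cluster $i$), $N$, $v_{ij}$ (reading of member $j$ in cluster $i$), $A$, $\tilde{A}$, and $\hat{A}$ are introduced here for illustration only, and every member reading is assumed to satisfy $|v_{ij} - CR_i| < Range \times \tau$, where $CR_i$ is the clusterhead reading.

```latex
\begin{align*}
A &= \frac{1}{N}\sum_{i=1}^{k}\sum_{j=1}^{n_i} v_{ij}, \qquad N = \sum_{i=1}^{k} n_i
  && \text{(exact average)} \\
\tilde{A} &= \frac{1}{k}\sum_{i=1}^{k} CR_i
  && \text{(equal weight per cluster)} \\
\hat{A} &= \frac{1}{N}\sum_{i=1}^{k} n_i \, CR_i
  && \text{(weighted by cluster size)} \\
|\hat{A} - A| &= \frac{1}{N}\Bigl|\sum_{i=1}^{k}\sum_{j=1}^{n_i} (CR_i - v_{ij})\Bigr|
  \le \frac{1}{N}\sum_{i=1}^{k}\sum_{j=1}^{n_i} |CR_i - v_{ij}|
  < Range \times \tau
\end{align*}
```

Under these assumptions the size-weighted estimate $\hat{A}$ stays within $Range \times \tau$ of the exact average for any distribution of readings, whereas no comparable bound follows for the equally weighted estimate $\tilde{A}$ when the cluster sizes $n_i$ are unequal.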
The advantage of CAG is that it delivers high-precision approximate results (errors are bounded by a user-provided error threshold) while still saving a significant amount of energy by using only outlier samples. This benefit grows as the number of sensor nodes, the density of node deployment, and the level of data correlation (both spatial and temporal) increase. Although the interactive mode of CAG is oblivious to the number of nodes within a cluster, it still provides a good approximation when the data is normally distributed and highly correlated. However, the interactive mode cannot guarantee that the result is within the error threshold, while the streaming mode can. CAG can be implemented in, or added to, an existing sensor network query system without OS/infrastructure modification; it requires only a single additional clause (in the syntax used by TAG) to specify the user-provided error threshold value. To the best of our knowledge, CAG is a novel system for efficient approximation of in-network aggregation in that it supports semantic broadcast [Woo et al. 2004] by leveraging both the spatial and temporal correlations prevalent in real world data.

In this paper, we propose an upgraded CAG which provides a unified functional framework that performs well independent of the size of the network (number of nodes and area) and the values of the sensed data. We verify the effectiveness of our modified CAG algorithm under real-world scenarios with empirical sensor data and mica2-based wireless link quality measurements. We study the performance and accuracy of CAG using both simulations and analytical models.

We summarize the main contributions of this paper as follows:

—Customization: We designed two modes of CAG (interactive and streaming) depending on the application purposes and the environmental characteristics.
—Data independent clustering: Current CAG groups nodes into the same cluster if their sensor readings fall within a fixed range. Fixed range clustering, as opposed to the variable range clustering from our previous work, ensures that the performance of CAG becomes independent of the magnitude of sensor readings and network topology¹.
—Measurement: We performed a large-scale measurement of the environmental data (temperature and light, both indoor and outdoor) using mica2 motes. We also measured the wireless link reliability using mica2 radios at different transmission powers.
—Model: We modeled the spatially correlated sensor data using our real-world measurements. Our data model captures two kinds of spatial property: linear and spherical.
—Spatio-temporal correlation: We investigated the efficiency of CAG by exploiting the temporal as well as spatial correlations using both our measured and modeled data.

¹ Sections 5.3 and 6.1 describe this property in detail.

Our experimental results with measured data and link reliability indicate that CAG, with error threshold = 10%, can save up to 51.25% of transmissions over TAG in the interactive mode. The streaming mode can save up to 73.21% over the interactive mode in two hours when data shows high temporal correlation.

This paper subsumes our previous work [Yoon and Shahabi 2005]. The previous version of CAG was not evaluated based on systematic large-scale measurement of sensor data. Thus, it provided incomplete results. For example, in our previous work, we did not consider the potential problems from using data that does not follow a normal distribution. Running CAG with synthetic data always resulted in an error bounded by the user-provided error threshold. However, we discovered that the error is not bounded by the user-provided error threshold in experiments using the measured data from the Great Duck Island [Mainwaring et al. 2005], because the data did not follow a normal distribution.
We address this problem by proposing the streaming mode of CAG, which counts the number of nodes within a cluster to assign weights to the clusterhead values. In addition, our previous method of clustering results in variable range clusters, and the accuracy of the result depends on the magnitude of the clusterhead sensor value. This made our earlier analysis of CAG intractable for multi-hop wireless environments. To solve this problem, we propose fixed range clustering.

The remainder of this paper is organized as follows. Section 2 presents the related work and Section 3 describes the modified CAG algorithm. Section 4 discusses the interactive and streaming modes of CAG, including how CAG takes advantage of temporal correlation. Section 5 presents the data measurements, the spatial data models, and our analysis. Section 6 describes the evaluation metrics and our experimental setup, and reports the results from our experiments. Section 7 concludes this paper.

2. RELATED WORK

Several in-network aggregation techniques have been proposed for energy-efficient communication in WSN. TinyDB [Madden et al. 2002], Directed Diffusion [Intanagonwiwat et al. 2000], and Cougar [Yao and Gehrke 2003] are the first generation of in-network aggregation systems. These approaches use a tree or Directed Acyclic Graph (DAG) topology as an underlying routing framework. However, they do not consider further energy optimization using spatial or temporal correlation and approximate aggregation.

More recently, two approaches, Digest Diffusion [Zhao et al. 2003] and Synopsis Diffusion [Nath et al. 2004], were proposed to support robust communication for duplicate-insensitive and duplicate-sensitive aggregates, respectively. However, these two systems are not without energy overhead. Digest Diffusion requires each node to maintain link quality statistics to construct the routing tree. Synopsis Diffusion with adaptive ring topology consumes slightly more energy than TAG due to its redundant transmissions and receptions.
CAG achieves significant energy saving compared to TAG by exploiting spatial and temporal correlations and allowing negligible error in the result.

Various studies have addressed optimizing tree-based routing. [Cristescu et al. 2004] proposed a distributed approximation algorithm for correlated data gathering in a tree topology. The authors suggested a coding strategy based on the Slepian-Wolf model and a joint entropy coding model to jointly optimize the transmission structure of the tree and data allocation at a node. [Goel and Estrin 2003] proposed a randomized tree structure algorithm which simultaneously optimizes all concave non-decreasing aggregate functions. CAG’s approach is different: it is not about optimizing the routing tree by forming better trees; it is about optimizing (minimizing) both energy usage (forwarding) and resulting error (application) using an existing query routing tree.

Allowing for an approximate result instead of requiring an exact answer enables designing energy-efficient in-network aggregation mechanisms. Approximate results can be used in an interactive setting in which users may first ask for a rough picture of regional data before they decide to drill down further [Ganesan et al. 2003]. In this scenario, not every sensed data item is required to compute the synopsis. [Considine et al. 2004] proposed an approximate aggregation technique by generalizing the Flajolet and Martin duplicate-insensitive sketches for duplicate-sensitive aggregates. It is known that both energy efficiency and accuracy are important in time-critical monitoring. In many systems, however, higher accuracy comes at a higher energy cost. CAG addresses this problem by providing a bounded approximate result with significant energy reduction. [Olston et al. 2003] designed an adaptive bounded-width filter in which filter widths are adjusted continually to match current data dynamics. In this scheme, distributed data streams transmit a bounded approximate answer to the centralized site with reduced overhead. [Jain et al. 2004] tried to minimize resource usage while satisfying the precision requirement by designing a prediction system using a Dual Kalman Filter (DKF). As such, a sophisticated filter or prediction scheme can be incorporated in WSN to prevent unnecessary data transmission. The user-provided error threshold τ functions as a precision requirement in CAG with a fixed-width filter.

There have been many studies modeling the spatial correlation property in the context of WSN. [Deshpande et al. 2004] proposed a model-based query prototype called BBQ which uses a model based on time-varying multivariate Gaussians. [Guestrin et al. 2004] proposed distributed regression, which is an efficient and general framework for in-network modeling of sensor data. In this work, rather than communicating the data, nodes communicate constraints on the model parameters, thereby significantly reducing the communication cost. [Jindal and Psounis 2004] proposed a method to generate spatially correlated data based on a mathematical model. [Lennon 2000] demonstrated the consequences of ignoring the spatial autocorrelation, i.e. variogram, using a synthetic but realistic spatial pattern with known spatial property (fractal model). They show that if spatial autocorrelation is ignored, then the analysis of ecological patterns produces very misleading results. [Rossi et al. 2004] modeled the diffusion phenomena using Partial Differential Equations (PDEs) in order to estimate parameters for diffusion monitoring. [Fernandez and Peter J. Green 2002] modeled spatially correlated data in a Bayesian framework using a statistical approach.
In this study, we statistically model the spatial correlation property with the measured environmental data using the variogram and PDF (Probability Density Function) as a function of internode distance. We observed three data models (linear, spherical, and Gaussian) which describe the data from our measurement study.

Techniques such as LEACH [Heinzelman et al. 2000], TEEN [Manjeshwar and Agrawal 2001], and APTEEN [Manjeshwar and Agrawal 2002] use hierarchical clusters and routing to save energy. LEACH [Heinzelman et al. 2000] forms clusters based on the received signal strength and uses local clusterheads as routers. Transmissions are made only by clusterheads. LEACH utilizes randomized rotation of local clusterheads to evenly distribute the energy overhead among the sensors in the network. TEEN [Manjeshwar and Agrawal 2001] is another hierarchical protocol designed to be responsive to sudden changes in the sensor readings. The nodes transmit sensor readings only when they fall below or above the specified thresholds. While this saves energy, it is not suitable for applications that require periodic reports. APTEEN [Manjeshwar and Agrawal 2002] addresses both periodic data collection and prompt reporting of time-critical events. None of these protocols, however, leverages spatial and temporal correlations to improve efficiency.

[Pattem et al. 2004] analyzed the total cost for jointly optimizing the routing performance and data compression using the joint entropy of sources leveraging underlying spatial correlation. The authors claimed that there exists a static, near-optimal cluster size for ranges of spatial correlation. In contrast, CAG is an adaptive routing scheme with lossy aggregation which provides a bounded approximation where the resulting error is smaller than the user-provided error threshold.

PREMON [Goel and Imielinski 2001] and TiNA [Sharaf et al. 2003] are similar to CAG. PREMON forms clusters based on a prediction model while CAG forms clusters using real-time sensor values.
TiNA exploits temporal correlation in sensor data while CAG takes advantage of both spatial and temporal correlations in sensor data. [Deshpande et al. 2004] proposed a data acquisition method based on statistical models. Unlike CAG, they do not provide approximate results. They do not form clusters; neither do they take into account packet losses in the network.

[Gupta et al. 2005] proposed an efficient data gathering algorithm exploiting the spatial correlation. However, their algorithm is not based on the clustering technique, and the overhead from selecting the connected correlation-dominating set compromises the efficiency of the proposed algorithm. In addition, their work is not validated empirically using measured sensor data and does not address the network and data dynamics.

CAG exploits semantic broadcast [Woo et al. 2004] in order to reduce the communication overhead by leveraging spatial and temporal correlations. CAG achieves efficient in-network storage and processing by allowing a unified mechanism between query routing (networking) and query processing (application). Instead of gathering and compressing all the data (lossless algorithm), CAG generates a synopsis by filtering out insignificant elements in data streams (lossy algorithm) to minimize response time, storage, computation, and communication costs.

3. OVERVIEW OF THE CAG ALGORITHM

CAG, originally introduced in [Yoon and Shahabi 2005], requires only representative values to participate in aggregation by forming clusters leveraging spatial correlation of data. On the other hand, TAG, a landmark system performing in-network aggregation, requires every node to participate in aggregation.
The prevalence of spatial correlation in environmental phenomena makes it possible for CAG to ignore redundant data and quickly generate a synopsis of the data distribution with significant energy savings.

The CAG algorithm operates in two phases: query and response. During the query phase, CAG forms clusters, using a user-specified error threshold τ, while a TAG-like forwarding tree is built. In the response phase, CAG transmits a single value per cluster. CAG is a lossy clustering method in that only the clusterheads contribute to the aggregation.

Algorithm 1. Pseudocode of the CAG algorithm

    Function Query.Received:
        if (CR − Range × τ) ≤ v_ij < (CR + Range × τ) then
            clusterhead = FALSE;
            broadcast query Q;
        else
            CR = MR;
            clusterhead = TRUE;
            broadcast query Q;
        end if

    Function Response.Received:
        enqueue response packet R to the buff;

    Function Epoch.Fired:
        if clusterhead then
            forward aggregate(buff + MR);
        else if size(buff) > 0 then
            forward aggregate(buff);
        end if

A user-provided error threshold, τ, is used in CAG while building clusters. Each node decides to join a cluster based on the Clusterhead sensor Reading (CR) and My local sensor Reading (MR); if |MR − CR| < Range × τ, where Range = MaxValue − MinValue of the entire data set², then the sensor node joins the cluster. Forming clusters using τ is termed τ-approximation, because τ also functions as the error bound of the result such that

$$\frac{|\mathit{EstimatedResult} - \mathit{CorrectResult}|}{\mathit{Range}} < \tau.$$

² We assume knowledge of the maximum and minimum values of the entire data, or we can obtain these values by running CAG with τ = 0 in the beginning.

This is why τ, a user-provided error threshold, is interchangeable with a user-provided error-tolerance threshold.

Algorithm 1 shows the pseudocode of the CAG algorithm. For encoding a query, CAG augments the TAG syntax only with a threshold τ clause.
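The join rule in the Query.Received function of Algorithm 1 can be illustrated with a minimal Python sketch. It assumes a hypothetical single forwarding path instead of a full tree, and the names `on_query_received`, `form_clusters_along_path`, and `QueryDecision`, as well as the sample readings, are inventions for illustration rather than part of CAG.

```python
from dataclasses import dataclass


@dataclass
class QueryDecision:
    clusterhead: bool  # True if the node starts a new cluster
    cr: float          # clusterhead reading (CR) to forward with the query


def on_query_received(mr: float, cr: float, value_range: float, tau: float) -> QueryDecision:
    """Apply the Query.Received test of Algorithm 1 to one node."""
    half_width = value_range * tau
    if cr - half_width <= mr < cr + half_width:
        # Local reading MR is close enough to CR: join, forward CR unchanged.
        return QueryDecision(clusterhead=False, cr=cr)
    # Otherwise become a clusterhead and advertise MR as the new CR.
    return QueryDecision(clusterhead=True, cr=mr)


def form_clusters_along_path(readings, tau):
    """Propagate a query along one path; return the indices of clusterheads."""
    value_range = max(readings) - min(readings)  # Range = MaxValue - MinValue
    heads = [0]                                  # first node starts a cluster
    cr = readings[0]
    for index, mr in enumerate(readings[1:], start=1):
        decision = on_query_received(mr, cr, value_range, tau)
        if decision.clusterhead:
            heads.append(index)
        cr = decision.cr
    return heads


# Hypothetical temperature readings along a six-node path.
readings = [20.0, 20.5, 20.8, 24.0, 24.3, 29.0]
heads = form_clusters_along_path(readings, tau=0.1)
print(heads)  # [0, 3, 5]
```

With Range = 9.0 and τ = 0.1, a node joins only when its reading is within 0.9 of the forwarded CR, so three of the six nodes would transmit in the response phase of this toy example.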
A user initiates CAG by specifying a query UQ = <QueryID, O_i, τ> to be injected into the network with a threshold τ for the monitoring attribute O_i. CAG supports disseminating multiple queries with different QueryID and O_i. Subsequently, the base station broadcasts the query packet Q = <UQ, ParentID, MyID, level, CR>, where level is the depth of the current node in the forwarding tree. Note that CR is included in the query so that each receiving node can compare it with its own MR. Clusters are formed when the forwarding tree is built.

Once all the nodes receive the query packet, the response phase starts. At the end of each epoch, only clusterheads transmit packets with the following tuple: R = <ParentID, ChildrenID, MR, CR>. If the clusterheads cannot communicate with each other, the intermediate nodes, termed bridge nodes, are required to bridge the segments of the forwarding tree. Bridge nodes do not contribute their sensor readings to the aggregate by default, but they can optionally participate in the aggregation because they transmit the packets anyway. A more detailed example of the CAG algorithm execution is described in [Yoon and Shahabi 2005].

4. CAG’S TWO MODES OF OPERATION

Depending on the purpose of deployment, dynamics of networks, and the rate at which data changes, CAG can work in two modes: interactive and streaming. In the interactive mode, CAG generates response packets whenever a node receives a query message, whereas in the streaming mode, for a given query, CAG keeps on generating responses at a specified time interval. The interactive mode of CAG exploits only the spatial correlation of sensed data, whereas the streaming mode of CAG takes advantage of both spatial and temporal correlations of data.

4.1 Interactive Mode

The interactive mode of CAG alternates query and response phases for each user query. During the response phase, only clusterhead nodes transmit the query result, so the resulting aggregation is oblivious to the underlying number of sensor nodes.
This mode eases user interaction by allowing a user to interactively refine queries. While monitoring physical phenomena using a large sensor network, a user may initiate a query with a larger threshold value to get a rough picture of the whole data and may then set a smaller threshold value to fetch data at a finer granularity only from interesting regions. Fig. 1(a) depicts the interactive mode of CAG.

Fig. 1. Two modes of CAG’s operation: interactive and streaming modes. (a) Interactive mode of CAG: a single response per query. (b) Streaming mode of CAG: multiple responses (at times t1 < t2 < t3, etc.) per query. The solid lines indicate the query propagation and the dotted lines indicate the response. Black nodes are clusterheads, gray nodes are bridges, and white nodes are non-participating nodes.

The interactive mode is appropriate in an environment where the physical attributes being sensed, such as temperature and light, change rapidly, and the network connectivity changes unpredictably, because CAG, in this mode, builds a new forwarding tree each time a query is sent out. This newly formed clustered tree can address the dynamics of network and data on the fly. This adaptation makes it possible to use energy in a more balanced way compared to the streaming mode described in Section 4.2 because the clusterhead node is not fixed for each query.

However, the interactive mode requires the additional overhead of broadcasting a query each time a user wants new data from the network. Frequently rebuilding the tree can be wasteful if the sensed data is almost the same over time.
If data is unchanged, clusterhead nodes and the forwarding tree are likely to be the same. Moreover, in the interactive mode, CAG does not count the number of nodes within a cluster; this may trade the accuracy of the result for energy saving by reducing the number of packet transmissions. In our earlier experiments using the measured data from the Great Duck Island [Yoon and Shahabi 2005], we observed that CAG may result in an out-of-bound error, regardless of the error threshold and data correlation, when the data values do not follow the normal distribution. Even though CAG, in the interactive mode, provides results that are less accurate, the interactive mode can immediately reflect the changing environment in the result, which enables a user to estimate the changing pattern of physical phenomena. Note that the interactive mode only leverages the spatial correlation of data and cannot take advantage of the temporal correlation.

4.2 Streaming Mode

The streaming mode of CAG generates multiple responses per query. These responses are generated periodically. This saves query broadcasting overhead compared to the interactive mode, which requires flooding a new query to the network every time a result is desired. Fig. 1(b) depicts the streaming mode of CAG.

The streaming mode of CAG can be used to overcome the out-of-bound error (explained in Section 4.1) of the interactive mode. With the streaming mode, each member of the cluster sends a packet to the clusterhead so that the clusterhead can calculate the size of the cluster. The count is updated only when cluster adjustments happen. In an environment where sensor readings and network topology are static, clusters change, split, or merge rarely. Thus, model-based clustering or a clustering method optimized for a specific scenario can be successfully applied in the streaming mode in a static environment.

Table I. Comparison of two modes of CAG’s operation: interactive and streaming.

Description
  Interactive mode: Single response per query
  Streaming mode: Multiple responses per query (periodic or event driven)
Appropriate for
  Interactive mode: Dynamic environment (data and network topology)
  Streaming mode: Static environment (data and network topology)
Exploiting property
  Interactive mode: Spatial correlation
  Streaming mode: Spatial and temporal correlation
Clustering
  Interactive mode: Different cluster and clusterhead per query
  Streaming mode: Same cluster and clusterhead for a certain period of time
Advantages
  Interactive mode: 1) Good for reactive, interactive, one-shot query 2) Online approximation addressing dynamics 3) Good estimation when data is normally distributed 4) Fair energy usage: does not need clusterhead rotation
  Streaming mode: 1) Good for proactive/reactive query (periodic response or event driven) 2) Appropriate for model-based clustering 3) Saves query transmission overhead 4) Accurate: bounded error even without normally distributed data
Disadvantages
  Interactive mode: 1) Query broadcast overhead per query 2) Not accurate when data is not normally distributed
  Streaming mode: 1) Unfair energy usage: energy bottleneck at clusterhead without rotation technique 2) Extra overhead and complexity from clusterhead rotation policy and counting the number of nodes in a cluster

Counting the number of nodes within a cluster ensures an accurate aggregation result. The clusterhead node transmits the product of the clusterhead value and the size of its cluster as a response to the query. This transmission of the weighted aggregate, as opposed to the clusterhead value alone, ensures that the resulting error is always bounded by the threshold even with data that is not normally distributed. When we count the number of nodes per query, the energy overhead increases due to the increased communication. Fortunately, re-computing the size of the cluster will not happen often in static environments, so the benefit from reducing the redundant query messages outweighs this extra overhead.
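A small numeric sketch in Python shows why weighting by cluster size matters for an AVG query over unevenly sized clusters. The readings, the two-cluster layout, and the helper names `unweighted_avg` and `weighted_avg` are hypothetical and chosen only for illustration; they are not measurements or code from this study.

```python
def unweighted_avg(clusters):
    """Interactive-mode style estimate: every clusterhead value counts once."""
    return sum(cr for cr, _ in clusters) / len(clusters)


def weighted_avg(clusters):
    """Streaming-mode style estimate: each clusterhead reports CR * cluster size."""
    weighted_sum = sum(cr * len(members) for cr, members in clusters)
    node_count = sum(len(members) for _, members in clusters)
    return weighted_sum / node_count


# Hypothetical skewed deployment: (clusterhead reading, readings of all members).
clusters = [
    (20.0, [20.0, 20.2, 19.9, 20.1, 20.3, 19.8, 20.0, 20.1]),  # 8 nodes
    (29.0, [29.0, 28.8]),                                      # 2 nodes
]

all_readings = [v for _, members in clusters for v in members]
true_avg = sum(all_readings) / len(all_readings)
value_range = max(all_readings) - min(all_readings)

for name, estimate in (("unweighted", unweighted_avg(clusters)),
                       ("weighted", weighted_avg(clusters))):
    relative_error = abs(estimate - true_avg) / value_range
    print(f"{name}: estimate={estimate:.2f}, error/Range={relative_error:.3f}")
```

In this toy data set every member reading lies within Range × τ of its clusterhead reading for τ = 0.1, yet the equally weighted estimate is off by roughly 29% of the range, while the size-weighted estimate stays well below the 10% threshold.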
The clusterheads are fixed throughout the duration of the query in the streaming mode. Thus, the clusterheads can become an energy bottleneck. This mode thus needs a clusterhead rotation technique to maximize the network lifetime. The rotation policy can be in terms of the remaining energy [Heinzelman et al. 2000], number of links, etc.³ Incorporating an energy-efficient and computationally efficient clusterhead rotation technique is a challenging task. Table I compares these two modes of CAG operations.

³ This is a topic we plan to investigate later.

Fig. 2. Pictures of outdoor and indoor environments where the data is measured, with the map of the Great Duck Island. (a) Picture of the Exposition Park in Los Angeles. (b) 4th floor of Tutor Hall at USC. (c) Map of the Great Duck Island with sensor deployment.

4.2.1 Temporal Correlation. Spatial correlation of data is the primary property exploited by the CAG algorithm. The interactive mode, in fact, only takes advantage of the spatial correlation. Clusters from the streaming mode of CAG adjust locally depending on the data and topology changes, so this mode encompasses the temporal as well as the spatial correlation of the sensed data. Increased network and data dynamics can trigger frequent cluster changes. As the environment becomes more static, the cluster adjusting overhead becomes smaller. Efficient adaptive clustering is one of the research challenges in ad hoc networks. Existing works on dynamic clustering address node mobility and topology dynamics but not data dynamics.

5. MEASUREMENT, MODEL, AND ANALYSIS

In this section, we describe 1) our setup for the environmental data measurement and the collected data sets, 2) the spatial data model of the measured data, 3) the analysis of performance and accuracy.
5.1 Data Sets

For the model, analysis, and simulation, we used the following four data sets: 1) our own measured sensor data from the Exposition Park and Tutor Hall (node placement in a regular grid), 2) measured sensor data gathered from the Great Duck Island (irregular node placement), 3) synthetic data using a statistical model, and 4) synthetic data based on an ecological model. Even though the environmental attribute being sampled is continuous (e.g., temperature), discrete samples of the attribute are used in our study.

We used the semivariogram to compare the correlation property of each data set. The semivariogram⁴ is the most common way to characterize the correlation between pairs of points separated by a spatial distance [Dale et al. 2002]. In probabilistic notation, the variogram is defined as follows:

\[ \gamma(h) = \frac{1}{2} E\left[ \left( X(p) - X(p+h) \right)^{2} \right] \]

for all possible locations $p$, where $X(p)$ and $X(p+h)$ are the values at the head and tail of each pair of points at a distance $h$.

³ This is a topic we plan to investigate later.
⁴ We simply refer to the semivariogram as a variogram from now on. The difference between the two is the multiplicative factor of 2, which only affects the magnitude and not the trend of the statistics.

5.1.1 Sensor data measurement in a regular grid. We measured two modalities (light and temperature) of real data using mica2 motes and MTS300 sensor boards in two different environments: indoor on the 4th floor of Tutor Hall at USC and outdoor at Exposition Park in Los Angeles. At each position, the mote and the sensor board were placed on the ground and the data was measured twice to understand the sensitivity of the data to the sensor board's orientation: (1) the sensor board facing the sky and (2) the sensor board facing the ground.
Samples are taken at 200 millisecond intervals; the arithmetic mean of twenty values makes one reading, to avoid inaccurate data that might be generated by a single instance of a malfunctioning sensor.

- Outdoor environment (Exposition Park in Los Angeles): We measured the environmental data at 100 (10×10) locations using the mica2/MTS300 with 10 meter internode distance in the Exposition Park located in Los Angeles. Light and temperature readings were taken at four different times (1, 4, 6, 7 PM) of the day. We chose 10 meters as the internode distance because a spacing of less than 10 meters is too redundant for monitoring the outdoor environment and may therefore waste energy. Also, the mica2 mote radio (433 MHz Chipcon CC1000) is reliable up to a little more than 20 meters even with the default transmission power, as shown in Fig. 10(b).
- Indoor environment (4th floor of Tutor Hall at USC): We took measurements at 40 locations using the mica2/MTS300 with 5 meter internode distance in the rooms and hallway on the 4th floor of Tutor Hall at USC. Nodes sensed 2 modalities (light and temperature) at 7 PM. We deployed the sensors more densely than in the outdoor environment because the data is affected by various indoor equipment (desks, chairs) and by the building itself (walls, windows). These factors can weaken the radio transmission, too.

Part of the light and temperature data measured from the Exposition Park is presented in Fig. 3⁵. The variograms of the data measured from the Exposition Park (sensorboard up) and Tutor Hall are presented in Fig. 5.

5.1.2 Data with irregular mote placement on the Great Duck Island. The four modalities (humidity, temperature, light, and pressure) measured on the Great Duck Island [Mainwaring et al. 2005] constitute this data set. Different modalities are in different units, but we used raw values in all cases. Most variograms using the real sensor data from the Great Duck Island (Fig. 6(a)), Exposition Park (Fig. 5(a), 5(b)), and Tutor Hall (Fig.
5(e)) show a linear functional pattern similar to that of the ecological spatial pattern (Fig. 6(a)), at a different magnitude. This is evidence that the synthetic ecological data model (Section 5.1.4) captures the spatial property of real physical data. As these sensor nodes are not deployed in a grid (irregular internode distance), distances are subdivided into a number of intervals called lags to simplify variogram computation [Dale et al. 2002].

⁵ Here we do not show all the data we measured due to space constraints.

Fig. 3. Measured light and temperature data from the Exposition Park (raw data plotted over X and Y coordinates). All the sensorboards are facing the sky except (e) and (f), in which case the sensorboards are facing down. (a) Light data at 1 PM. (b) Light data at 4 PM. (c) Light data at 6 PM. (d) Light data at 7 PM. (e) Light data at 1 PM (sensorboard down). (f) Light data at 7 PM (sensorboard down). (g) Temperature data at 1 PM. (h) Temperature data at 4 PM. (i) Temperature data at 6 PM. (j) Temperature data at 7 PM.

Fig. 4. Synthetic data using statistical model (7h and 9h) and ecological model (spatial pattern). (a) 7h data. (b) 9h data. (c) Spatial pattern data.

5.1.3 Synthetic data from the statistical model. Sensor data is generated using the method suggested in [Jindal and Psounis 2004] for a 250m×250m two-dimensional grid. Five data sets with different degrees of correlation are generated with parameters $\alpha = 1/2^{i}$, $\beta$, and $h$ = 1, 3, 5, 7, and 9. The correlation coefficient $h$ determines the level of correlation; an $h$ of 1 generates data with almost no spatial correlation (similar to i.i.d. random), and a larger $h$ results in a higher spatial correlation. Our previous research [Yoon and Shahabi 2005] used this data to compare the effect of different spatial correlation levels on the efficiency and accuracy. In this paper we compare the two mathematical models based on the measured temperature data and the 7h synthetic data (Fig. 4(a))⁶.
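The variogram of Section 5.1, combined with the lag binning used for irregularly placed nodes, can be sketched as follows. This is an illustrative estimator under stated assumptions rather than the code used in the study: the function name, the fixed `lag_width` parameter, and the choice to report each lag by the mean distance of its pairs are all hypothetical.

```python
import math
from collections import defaultdict
from typing import List, Sequence, Tuple


def empirical_semivariogram(
    points: Sequence[Tuple[float, float]],
    values: Sequence[float],
    lag_width: float,
) -> List[Tuple[float, float]]:
    """Estimate gamma(h) = 1/2 * E[(X(p) - X(p+h))^2] from scattered samples.

    Pairs of points are grouped into lags of width `lag_width`; for each lag
    the function returns (mean pair distance, semivariance).
    """
    sq_diff = defaultdict(float)   # sum of squared value differences per lag
    dist_sum = defaultdict(float)  # sum of pair distances per lag
    count = defaultdict(int)       # number of pairs per lag
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            dist = math.dist(points[i], points[j])
            lag = int(dist // lag_width)
            sq_diff[lag] += (values[i] - values[j]) ** 2
            dist_sum[lag] += dist
            count[lag] += 1
    return [
        (dist_sum[lag] / count[lag], sq_diff[lag] / (2 * count[lag]))
        for lag in sorted(count)
    ]


# Three nodes on a line, 10 m apart, with a steadily increasing reading.
variogram = empirical_semivariogram([(0, 0), (10, 0), (20, 0)], [0.0, 1.0, 2.0], 10)
```

For a regular grid the lag width can simply equal the internode distance; for irregular deployments it trades resolution against the number of pairs per lag.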
5.1.4 Synthetic data from the ecological model. We used the spatially correlated data (Fig. 4(c)) generated based on the ecological (environmental) patterns model provided by [Lennon 2000] in a 250m×250m two-dimensional grid. Even though this data is synthetic, it contains realistic spatial patterns with known spatial properties bearing the fractal pattern in the environment. Fig. 6(a) includes the variogram of this pattern. This spatial pattern presents the fractal characteristic of the environment with a high correlation level between 7h (Fig. 4(a)) and 9h (Fig. 4(b)). The variogram of this fractal data follows a linear pattern. Note again that most of the variograms using the real sensor data (both from the Exposition Park and the Great Duck Island) also present linear patterns.

⁶ In [Yoon and Shahabi 2005], we reported the details on the different levels of spatially correlated data and the corresponding results.

5.2 The Spatial Data Model

In this Section, we mathematically model the property of spatial correlation using the measured sensor data from different environments.

Fractal structures are ubiquitous in nature, with the key property of self-similarity across a range of spatial scales [Halley et al. 2004]. Fractal objects or behavior often emerge in ecological models even if the models are not explicitly designed. Natural landscapes are not ideal fractals, but such models provide the simplest available means of simulating spatially complex landscapes and serve as a neutral habitat model. For a fractal pattern on a two-dimensional landscape, the relationship should be linear in a graph of $\log(\gamma(h))$ against $\log(h)$ [Halley et al. 2004].

Fig. 5. Variograms of light and temperature measurements ($\gamma(h)$ against distance $h$). (a) Variograms of measured temperature with sensorboard up at 1, 4, 6, 7 PM from the Exposition Park. (b) Variograms of measured light with sensorboard up at 1, 4, 6 PM from the Exposition Park. (c) Plot of $\log(\gamma(h))$ as a function of $\log(h)$ with measured temperature with sensorboard up at 1, 4, 6, 7 PM from the Exposition Park. (d) Variograms of measured light at 7 PM from Tutor Hall at USC (sensorboard up and down). (e) Variograms of measured temperature at 7 PM from Tutor Hall at USC (sensorboard up and down).

Fig. 5(a) shows the variograms of temperature data measured from the Exposition Park. Fig. 5(c) shows the log-log plot of Fig. 5(a), which presents different linear functions for the 4, 6, and 7 PM temperature data. Accordingly, these temperature data might be fractal data patterns in the long range. In order to infer fractality, one needs to show linearity over at least 2 orders of magnitude (the x-axis in Fig. 5(c) has to span from 10 to 1000). That means that on the x-axis (and y-axis), which has a log scale, there should be a fairly straight line for a distance of at least two. Not having enough data points to determine this data as a fractal model, we simply conclude that this measured data follows a linear model.

The variogram of the spherical model increases linearly in the beginning, then it reaches a sill, which is a plateau. That is, the expected difference of the sensed values between two points stops increasing at a certain point although the distance ($h$) between the nodes increases.
In the spherical pattern, data is correlated over a shorter distance than in linear or fractal patterns. All the temperature data from different times of the day show the linear property (6 PM is almost random) except for the 1 PM data, which shows spherical characteristics. Variograms with 0 slope present no correlation, as with i.i.d. random values.

Fig. 5(b) shows the variograms of light values measured from the Exposition Park. Each variogram from a different time of the day shows a different linear function (1 PM shows the least correlation (almost random) and 6 PM shows the strongest correlation).

The spatial correlation property holds for both orientations (face up and face down) of the sensorboard. For the light sensor value, the sensed value with the sensorboard facing up addresses the physical phenomenon better, and shows the spatial correlation more clearly than its counterpart. On the other hand, the temperature value was not sensitive to the orientation of the sensorboard; both orientations show negligible differences, as shown in Fig. 3.

Fig. 5(d) shows the variograms of indoor light bearing the spherical property with cyclicity. Cyclicity may arise from specific indoor environmental periodicity (because we measured the data by placing the mote on the floor of the hallway and offices with a lot of desks and chairs) or can be due to the limited amount of data points within the office environment (we did not measure beyond a lag distance ($h$) of 50 because the longest side of the building is a little more than 60 meters). Fig. 5(e) shows the variograms of indoor temperature presenting the Gaussian pattern (sensorboard down) and the spherical pattern (sensorboard up). Again, there is insufficient data (due to the limitation on the length of the building for our indoor measurement) to project what happens beyond 50 meters.
The equations for the spherical, linear, and Gaussian models used for variograms are given below, where $c$ is the nugget effect ($\gamma(0) = c$), $S$ is the sill, and $a$ is the range of influence [Schabenberger and Gotway 2005]. As the distance $h$ approaches zero, the measurement error and microscale variation induce the nugget effect.

Spherical:
\[
\gamma(h) =
\begin{cases}
c + (S - c)\left\{ \frac{3}{2}\frac{h}{a} - \frac{1}{2}\left(\frac{h}{a}\right)^{3} \right\}, & \text{for } 0 < h \le a \\
c + (S - c), & \text{for } h \ge a \\
0, & \text{otherwise}
\end{cases}
\]

Linear:
\[
\gamma(h) = c + (S - c)\frac{h}{a}
\]

Gaussian:
\[
\gamma(h) =
\begin{cases}
c + (S - c)\left( 1 - \exp\left( -\frac{3h^{2}}{a^{2}} \right) \right), & \text{for } h > 0 \\
0, & \text{otherwise}
\end{cases}
\]

These models are common for variograms used in practice in spatial statistics [Schabenberger and Gotway 2005]. For data from the indoor measurement, we cannot apply the Gaussian model because we cannot compute $S$ and $a$ reliably with insufficient data points. Thus, we only focus on the models for the data from the outdoor measurement in this paper.

Fig. 6. Variograms of measured and synthetic data ($\gamma(h)$ against distance $h$). (a) Variograms of spatial pattern and the measured data (light, temperature, humidity, pressure) from the Great Duck Island. (b) Variograms of temperature data from the Exposition Park and their models in the spatial statistics: spherical (1 PM) and linear (7 PM) models.

When we apply the temperature data to these equations, we get Equation (1) for the spherical model and Equation (2) for the linear model. Fig. 6(b) shows the two temperature variograms (1 PM and 7 PM) and their spatial statistical models for the spherical and linear model.
Spherical:
\[
\gamma(h) =
\begin{cases}
1254 \times \left( \frac{3}{2}\frac{h}{36} - \frac{1}{2}\left(\frac{h}{36}\right)^{3} \right), & \text{for } 0 < h \le 36 \\
1254, & \text{for } h \ge 36 \\
0, & \text{otherwise}
\end{cases}
\tag{1}
\]

Linear:
\[
\gamma(h) = \frac{425}{81} \times h \tag{2}
\]

As shown in Fig. 6(a), the variograms of the four modalities measured on the Great Duck Island and the variogram of the spatial pattern from the ecological model show a linear function. If we ignore the nugget effect [Schabenberger and Gotway 2005], which makes the variogram start at a non-zero value at $h = 0$, all variograms in Fig. 6(a) become similar in magnitude and pattern. Note that the important factor in the variogram is not the magnitude but the shape of the graph. The different magnitudes of each modality in the variogram are due to the difference in magnitudes of the raw sensor values across different modalities.

In short, the outdoor light readings at 1, 4, 6, and 7 PM, and the outdoor temperature readings at 4, 6, and 7 PM, show the linear property. The indoor temperature at 7 PM shows the Gaussian data model. The outdoor temperature at 1 PM and the indoor light at 7 PM show the spherical property. Because of the strong sunlight at 1 PM, the temperature values in the outdoor environment change by a large amount (480 ∼ 660) even at a short distance due to the shade under the several trees (spherical pattern in Fig. 3(g)). On the other hand, the range of light readings at 1 PM is small (1010 ∼ 1018) under the same shadows (almost random, as shown in Fig. 3(a)). The magnitude of the variogram for light at 7 PM is much larger than those at 1, 4, and 6 PM because of its huge data range (0 ∼ 900) from the effect of street lights (some places were bright and some were completely dark) at night (Fig. 3(d)), so it is not included in Fig. 5(b).

Fig. 7. Mathematical model of $P_c$ ($P_c$ against distance $h$). CAGn means CAG with $\tau = n\%$. Both (a) and (b) use 10 meter internode distance in a regular grid. (a) Simulation and mathematical model of $P_c$ with CAG4 and CAG10 with temperature data from the Exposition Park. The linear model (model 7PM) has a heavier tail than the spherical model (model 1PM). (b) Simulation model of $P_c$ with CAG4 and CAG10 using 7h and 9h synthetic data.

From all our measurements at different times of the day and in both indoor and outdoor environments, we conclude that creating and applying the proper models such as 1) linear, 2) spherical, and 3) Gaussian is appropriate if we synthetically (mathematically) model the data from environmental sensing⁷.

In order to analyze CAG, we model the measured spatial data using a PDF as a function of the internode distance $h$ and the data model. For this, we selected two data sets, the temperature data at 1 PM and 7 PM measured in the Exposition Park, which can be the representative environmental data models. We define $P_c$ to be the probability of two nodes being in the same cluster. We model $P_c$ by curve fitting the simulation result of CAG using $\tau$ = 4% and 10%. Fig. 7(a) shows the result of the simulations and their mathematical models⁸. For small $\tau$ (e.g., CAG with $\tau$ = 4% or less), we observed that both spherical and linear patterns present similar $P_c$: both follow the same polynomial, which is a function of $h$ as in Equation (4). For large $\tau$ (e.g., CAG with $\tau$ = 10%), we observed that $P_c$ of the linear model is more heavy-tailed than that of the spherical model; both depend on the data model and $h$. Thus, for relatively large $\tau$, we model $P_c$ of the linear pattern with a logarithmic function as presented in Equation (3), and $P_c$ of the spherical pattern with a polynomial function as in Equation (4).

⁷ We model the data mathematically for the purpose of our study. A more sophisticated modeling is out of the scope of this study. We will address this problem in our future work.
⁸ We did not measure the sensor data where $D_{ij}$ < 10m. Thus, we just connected the two measured points where $D_{ij}$ = 0m and 10m.

Fig. 8. Accuracy model as a function of threshold ($\tau$) and internode distance ($h$). (a) Simulation (streaming mode with perfect reliability) and analytical models of the absolute error of two temperature data patterns (linear (7 PM) and spherical (1 PM)). (b) Absolute error of two data patterns (linear (7 PM) and spherical (1 PM)) of temperature in terms of threshold, both with perfect reliability and measured reliability (using default radio transmit power). (c) Absolute error of two data patterns (linear (7 PM) and spherical (1 PM)) of temperature as a function of internode distance, both with perfect reliability and measured reliability (using default power) with fixed $\tau$ = 4%. (d) Absolute error of two data patterns (linear (7 PM) and spherical (1 PM)) of temperature as a function of threshold with measured reliability using default radio transmit power when the internode distance ($h$) is fixed at 10 and 20 meters.

\[
P_c = \frac{\tau}{2(\log(h) + h)} \quad \text{: for linear} \tag{3}
\]
\[
P_c = 2\tau h^{-2} \quad \text{: for spherical} \tag{4}
\]

In our earlier work [Yoon and Shahabi 2005], we observed that $P_c$ is larger than that in Fig.
7(a) using the synthetic 7h data, which corresponds to a correlation level similar to that of temperature from the Great Duck Island. This difference can be attributed to the changed measurement setting: in this study we use 10 meter internode distance in a regular grid, while in our earlier work [Yoon and Shahabi 2005] we used random positioning with an average internode distance of 10 meters. To help us compare these two sets of results under the same condition, we repeated the experiments from our previous work with nodes placed in a regular grid with 10 meter internode distance using the 7h and 9h data. We observed that our measured temperature from the Exposition Park (Fig. 7(a)) is more spatially correlated than the 7h synthetic data (Fig. 7(b)) in terms of $P_c$.

The absolute error of the streaming mode of CAG, $E_r$, is obtained empirically by curve fitting the graphs in Fig. 8(a). We model $E_r$ as a constant for the linear model in Equation (5) and as a logarithmic function for the spherical model in Equation (6).

\[
E_r = 0.41 = c, \text{ where } c \text{ is a constant} \quad \text{: for linear} \tag{5}
\]
\[
E_r = \frac{3}{2}\log(\tau + 1) \quad \text{: for spherical} \tag{6}
\]

We want to demonstrate that $E_r$ is bounded by the given threshold in the streaming mode when there is no packet loss. Thus, we did our first simulation in a topology with perfect reliability. We also ran the same experiments using the measured reliability (Fig. 10). Fig. 8(b) shows that the impact of packet loss on the absolute error is minor regardless of the threshold. With measured reliability, the error is larger than the threshold only when $\tau$ = 1%. For all other values of $\tau$, the error is bounded by $\tau$. Fig. 8(c) describes the absolute error of the spherical and linear models as a function of internode distance ($h$) with perfect reliability. The shape of this graph is the same as the one shown in Fig. 7(a), only with the left and right sides reversed.
Up until 50 meter internode distance, the absolute error is negligible. Fig. 8(d) shows that with a fixed internode distance of $h$ = 10 and 20 meters, the absolute error decreases as $\tau$ increases. We observed that $E_r$ of the linear pattern (7 PM) is smaller than that of the spherical pattern (1 PM) at a given distance. Also, it indicates that as the distance doubles, the absolute error also almost doubles.

5.3 Analysis of CAG: Efficiency and Accuracy

In this Section, we formally analyze the efficiency and accuracy of the upgraded CAG algorithm in terms of the number of transmissions and the absolute error, respectively. We analyze the efficiency of CAG for the interactive mode or a single response of the streaming mode, because the efficiency of the streaming mode additionally depends on the temporal correlation⁹. We analyze the efficiency of the upgraded CAG for both single hop and multiple hop network topologies using a single mathematical formula. This is in contrast to the theoretical result in our previous work [Yoon and Shahabi 2005], which showed that the mathematical formula for the number of transmissions is different for single hop and multiple hop topologies. Note that the upgraded CAG groups nodes in the same cluster if their sensor values are within a fixed range. We also prove that the absolute error is always bounded by the given threshold value in the streaming mode even when the data is not normally distributed.

⁹ The impact of the different patterns of temporal correlation on CAG is investigated in Section 6.2.3 with simulations.

Fig. 9. Examples of the query routing tree for the analysis of CAG. (a) Tree with an average branching factor $k$ and a depth $d$ for the analysis of the number of clusters. (b) An example of a set of clusters formed in a balanced binary tree.

In this Section, we define $v$ as the sensor reading at the root node and $v_i^j$ as the sensor reading of the $i$th child of $v$ in the $j$th level in the tree. We assume that the sensor value $v_i^j$ is an i.i.d. random variable in the range [0, 1]. We also assume that each node in the tree has an average branching factor of $k$. Fig. 9(a) depicts a $k$-ary balanced tree with depth $d$ and Fig. 9(b) shows the same tree annotated with the variables used in this analysis. Let $T$ be the entire tree, and $N_T$ be the number of clusters in $T$. $N_{v_i^j}$ is the number of clusters in the subtree rooted at node $v_i^j$. Even though our analysis does not address the bridge nodes, our simulation does (Fig. 12(b)).

To calculate the expected number of transmissions, we need to compute the expected number of clusters. We begin to build clusters from the root node with its single-hop children. As we assumed $v$ has $k$ children on average, $N_T$ is given by:

\[
N_T = 1 + N_{v_1^1} + \dots + N_{v_k^1} - n\{ v_i^1 \mid v_i^1 \text{ is in the same cluster with } v \} \tag{7}
\]

When we compute the expected value of Equation (7), we obtain the following equation.

\[
E[N_T] = 1 + \sum_{i=1}^{k} E[N_{v_i^1}] - \sum_{i=1}^{k} P_r[v_i^1 \text{ is in the same cluster with } v] \tag{8}
\]

Here, $P_c = P_r[v_i^j \text{ is in the same cluster with } v]$, the probability that $v_i^j$ is in the same cluster as $v$, is bounded by $2\tau$ as follows. By using the absolute range such that $|v_i^j - v| \le \tau$, with both values in [0, 1], we get

\[
P_r[-\tau \le v_i^j - v \le \tau \mid v] =
\begin{cases}
v + \tau, & \text{if } 0 < v < \tau \\
2\tau, & \text{if } \tau \le v \le 1 - \tau \\
1 - v + \tau, & \text{if } 1 - \tau < v \le 1
\end{cases}
\]
\[
P_c = \int_{0}^{\tau} (v + \tau)\,dv + (1 - 2\tau)\,2\tau + \int_{1-\tau}^{1} (1 - v + \tau)\,dv = \tau(2 - \tau) \le \frac{2\tau}{1} = 2\tau. \tag{9}
\]

Now we will show that the expression for the number of transmissions is the same regardless of the number of hops by analyzing both single-hop and multiple-hop scenarios.

As a special case, we can compute the number of clusters only with the single-hop nodes by extending Equation (8) and combining Equation (9) as follows.
In this scenario, all the nodes including the root node are within the single-hop radio range from any node in the network. Thus, the number of children of the root node is $k = |T| - 1$.

\[
\begin{aligned}
E[N_T] &= 1 + \sum_{i=1}^{k} E[N_{v_i^1}] - \sum_{i=1}^{k} P_c \\
&= 1 + k - k \times P_c \\
&= 1 + (1 - P_c)k \\
&= 1 + (1 - 2\tau + \tau^{2})(|T| - 1) \\
&= |T|(1 - 2\tau + \tau^{2}) + \tau(2 - \tau) \\
&= N(1 - 2\tau + \tau^{2}) + \tau(2 - \tau) \\
&= (N - 1)\tau(\tau - 2) + N
\end{aligned}
\tag{10}
\]

Now we show how to hand-compute the number of clusters in a multiple-hop topology to motivate the derivation of an expression for the number of clusters in a query distribution tree in CAG. The example balanced binary tree in Fig. 9(b) has depth $d = 2$, $k = 2$, and $P_c = \frac{1}{2}$.

\[
\begin{aligned}
E[N_T] &= 1 + \sum_{i=1}^{k} E[N_{v_i^1}] - \sum_{i=1}^{k} P_c = 1 + k - kP_c && \text{: for } d = 1 \\
&= 1 + 2 - 2 \times \tfrac{1}{2} = 2 \\
E[N_T] &= 1 + (k - kP_c) + k(k - kP_c) && \text{: for } d = 2 \\
&= 2 + 2\left(2 - 2 \times \tfrac{1}{2}\right) = 4
\end{aligned}
\]

In this way, we can generalize the single-hop scenario into the multiple-hop scenario by iteratively using Equation (8) for each node in the network up to the depth $d - 1$.

\[
\begin{aligned}
E[N_T] &= 1 + \sum_{i=1}^{k} E[N_{v_i^1}] - \sum_{i=1}^{k} P_c = 1 + k - kP_c && \text{: for } d = 1 \\
&= 1 + (k - kP_c) + k(k - kP_c) && \text{: for } d = 2 \\
&= 1 + (k - kP_c) + k(k - kP_c) + k^{2}(k - kP_c) && \text{: for } d = 3 \\
&= \dots
\end{aligned}
\]

Thus, $E[N_T]$ for a $k$-ary balanced tree with depth $d$ is given below:

\[
E[N_T] = 1 + \frac{(k - kP_c)(k^{d} - 1)}{k - 1} = 1 + \frac{k(1 - \tau(2 - \tau))(k^{d} - 1)}{k - 1} \tag{11}
\]

Equation (11) validates Equation (10), which is for the single-hop scenario, as a special case of (11) with $k = |T| - 1$ and $d = 1$. Thus, we propose a single expression, Equation (11), to estimate the number of clusters in both single-hop and multi-hop topologies. This is an improvement over our earlier work, in which we needed two different expressions for those two scenarios. As in Equation (11), the expected number of clusters depends on the correlation level $P_c$ and the branching factor $k$, where $P_c = \tau(2 - \tau)$ and $k < |T| = N$.
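The closed forms in Equations (9) to (11) are easy to check numerically. The sketch below is illustrative only: the function names are hypothetical, the Monte Carlo check assumes readings that are i.i.d. uniform on [0, 1] as in the analysis, and the geometric series in Equation (11) is summed term by term so that the same function also covers $k = 1$.

```python
import random


def p_same_cluster(tau: float) -> float:
    """Closed form of Equation (9): P_c = tau * (2 - tau)."""
    return tau * (2.0 - tau)


def simulate_p_same_cluster(tau: float, trials: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo estimate of Pr[|v_child - v| <= tau] for uniform readings."""
    rng = random.Random(seed)
    hits = sum(abs(rng.random() - rng.random()) <= tau for _ in range(trials))
    return hits / trials


def expected_clusters(k: int, d: int, p_c: float) -> float:
    """Equation (11): 1 + (k - k*P_c) * (1 + k + ... + k^(d-1))."""
    levels = sum(k ** level for level in range(d))
    return 1 + (k - k * p_c) * levels


def expected_clusters_single_hop(n: int, tau: float) -> float:
    """Equation (10) in the form N(1 - 2*tau + tau^2) + tau(2 - tau)."""
    return n * (1 - 2 * tau + tau ** 2) + tau * (2 - tau)


# Hand-computed example from the text: binary tree, P_c = 1/2.
clusters_d1 = expected_clusters(k=2, d=1, p_c=0.5)
clusters_d2 = expected_clusters(k=2, d=2, p_c=0.5)

# Single-hop special case: k = N - 1 and d = 1 with N = 100 nodes, tau = 0.1.
multi_hop_form = expected_clusters(k=99, d=1, p_c=p_same_cluster(0.1))
single_hop_form = expected_clusters_single_hop(n=100, tau=0.1)
```

With these inputs the two hand-computed values match the text (2 and 4 clusters), and the single-hop and general expressions agree, which is the consistency argued for Equation (11).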
The expected size of a cluster, $E[S_T]$, can be computed as follows:

\[
\begin{aligned}
E[S_T] &= \frac{\text{Total number of nodes}}{\text{Expected number of clusters}} = \frac{N}{E[N_T]} \\
&= \frac{N}{1 + \frac{(k - kP_c)(k^{d} - 1)}{k - 1}} \\
&= \frac{N(k - 1)}{(k - 1) + k(1 - \tau(2 - \tau))(k^{d} - 1)}
\end{aligned}
\tag{12}
\]

As shown in Equation (12), the size of a cluster, $E[S_T]$, is also a function of $P_c$ and $k$, where $P_c = \tau(2 - \tau) \le 2\tau$. In theory, $P_c$ is bounded by $2\tau$ as in Equation (9). In reality, however, the size of a cluster depends on the actual value of $P_c$ as defined in Equations (3) and (4) in Section 5.2. That is, $E[S_T]$ is a function of $P_c$ and $k$, where $P_c = \frac{\tau}{2(\log(h) + h)}$ for the linear data model and $P_c = 2\tau h^{-2}$ for the spherical data model.

Because we build clusters based on the absolute range (defined in Table II), the absolute error is always bounded by $\tau$, such that $|v_i^j - v| \le \tau$ where $v_i^j$ is normalized to [0, 1]. Thus, the maximum error per node is $\tau$. If the total number of nodes is $N$, the maximum error for the entire network becomes $N\tau$. Because we compute the AVERAGE as the aggregation operator, the maximum error we get becomes $\frac{N\tau}{N} = \tau$. If we compute the SUM aggregation operation, the maximum error becomes $N\tau$. Note that the actual magnitude of the absolute error without packet loss in the streaming mode is a constant for the linear model as in Equation (5) and a logarithmic function for the spherical model as in Equation (6). This result corresponds to the shape of the variograms of the linear and spherical data models (Fig. 5(a)), respectively.

We can formally prove that the absolute error $E_r$ is always bounded by $\tau$ in the streaming mode¹⁰. Assume that the values in each cluster are sorted. Let $v_{ij}$ be the $j$th unique value in cluster $i$, and $n(v_{ij})$ be the number of nodes with the $j$th unique value in cluster $i$. Let $C_i$ be the clusterhead value for cluster $i$, and let $n(i)$ be the number of nodes in cluster $i$.
Here, $n(i) = \sum_{j=1}^{d(i)} n(v_{ij})$, where $d(i)$ is the number of unique values in cluster $i$. Let $k$ be the number of clusters in the network, and $N$ be the total number of nodes in the network such that $N = \sum_{i=1}^{k} n(i)$. We can compute the AVERAGE operation correctly using TAG, and approximately using CAG, as in Equations (13) and (14), respectively.

\[
\mathit{CorrectResult} = \frac{\sum_{i=1}^{k} \sum_{j=1}^{d(i)} n(v_{ij}) \times v_{ij}}{N} \tag{13}
\]
\[
\mathit{EstimatedResult} = \frac{\sum_{i=1}^{k} n(i) \times C_i}{N} \tag{14}
\]

The absolute error, $E_r$, using CAG can be computed as follows, with values normalized to [0, 1] and $0 < \tau \le 0.5$.

¹⁰ We mentioned in Section 4.1 that the interactive mode may not provide the bounded error when the data is not normally distributed.

\[
\begin{aligned}
E_r &= |\mathit{EstimatedResult} - \mathit{CorrectResult}| \\
&= \frac{\left| \sum_{i=1}^{k} n(i) C_i - \sum_{i=1}^{k} \sum_{j=1}^{d(i)} n(v_{ij}) v_{ij} \right|}{N} \\
&= \frac{\left| \sum_{i=1}^{k} \sum_{j=1}^{d(i)} n(v_{ij}) C_i - \sum_{i=1}^{k} \sum_{j=1}^{d(i)} n(v_{ij}) v_{ij} \right|}{N} \\
&\le \frac{\sum_{i=1}^{k} \sum_{j=1}^{d(i)} n(v_{ij}) \, |C_i - v_{ij}|}{N} \\
&\le \frac{N\tau}{N} = \tau
\end{aligned}
\tag{15}
\]

As shown in Equation (15), $E_r$ is always bounded by $\tau$ when CAG computes the average in streaming mode, even when the data is not normally distributed.

6. EXPERIMENTAL STUDY

In this section, we describe 1) the evaluation metrics and experimental setup, 2) results of the interactive mode, 3) evaluation results of the streaming mode, and 4) evaluation results of exploiting temporal correlation.

6.1 Evaluation Metrics and Experimental Setup

The primary metric used for evaluation, the reduced number of transmissions, uses the number of transmissions to estimate the energy cost in WSN. This approach is reasonable because radio transmissions consume far more energy than any other operation in a sensor node [Hill et al. 2000]. In the interactive mode, the reduced number of transmissions is calculated as $\frac{nTX(TAG) - nTX(CAG)}{nTX(TAG)} \times 100$, where $nTX$ is the number of transmissions (note that $nTX(TAG)$ is the same as $nTX(CAG)$ with $\tau = 0$).
In the interactive mode, the number of packets transmitted excludes query packets because this number is the same in both TAG and CAG regardless of $\tau$.

In the streaming mode, we compute the reduced number of transmissions compared to the interactive mode as $\frac{nTX(Interactive) - nTX(Streaming)}{nTX(Interactive)} \times 100$. This reduction comes from the smaller number of queries and from the fewer packet transmissions that result from infrequent cluster adjustment.

We compute the average number of bridge nodes (per query) participating in aggregation to understand their contribution to the total communication overhead. Bridge nodes are used for topology connectivity, and they can optionally contribute to the aggregate. Note that the bridge nodes did not contribute to the aggregate in our simulations.

Another metric is the absolute error of the result for a given $\tau$, calculated as $\frac{\lvert \text{EstimatedResult} - \text{CorrectResult} \rvert}{Range} \times 100$, where $Range = MaxValue - MinValue$, assuming that we know the MaxValue and MinValue of the sensor readings for the entire network. This range can be determined in advance by surveying the data set. Note that the absolute error is computed using EstimatedResult and CorrectResult from the same query cycle.

Fig. 10. Scatter plot of distance (meters) vs. reception rate (%) using the 433 MHz Chipcon CC1000. (a) Reception rate with maximum transmission power (26.7 mA typical current consumption). (b) Reception rate with default transmission power (10.4 mA typical current consumption).

To understand the effect of packet loss on precision, we ran simulations on two types of topologies: 1) lossless topologies and 2) topologies constructed using empirical loss rates.
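For the lossless case, the bound of Equation (15) and the absolute-error metric defined in this subsection can be checked numerically. The sketch below is illustrative only: the function names, the threshold, and the sample readings are hypothetical, and the readings are assumed to be already normalized to [0, 1].

```python
def exact_average(readings):
    """TAG-style exact average over every node's reading (Equation 13)."""
    return sum(readings) / len(readings)


def cag_streaming_estimate(clusters):
    """CAG-style estimate (Equation 14): clusterhead values weighted by cluster size.

    `clusters` is a list of (clusterhead_value, member_readings) pairs.
    """
    total_nodes = sum(len(members) for _, members in clusters)
    weighted_sum = sum(head * len(members) for head, members in clusters)
    return weighted_sum / total_nodes


def absolute_error_pct(estimated, correct, min_value, max_value):
    """Absolute error as a percentage of Range = MaxValue - MinValue."""
    return abs(estimated - correct) / (max_value - min_value) * 100.0


# Hypothetical readings, already normalized to [0, 1], with tau = 0.1.
TAU = 0.1
clusters = [
    (0.20, [0.20, 0.25, 0.15, 0.28]),
    (0.70, [0.70, 0.65, 0.78]),
]
# Every member lies within tau of its clusterhead value.
assert all(abs(v - head) <= TAU for head, members in clusters for v in members)

all_readings = [v for _, members in clusters for v in members]
error_pct = absolute_error_pct(
    cag_streaming_estimate(clusters), exact_average(all_readings), 0.0, 1.0
)
print(error_pct <= TAU * 100)  # True
```

With packet loss, some cluster reports never reach the root, so the weighted sum above would be computed over an incomplete set of clusters and the bound no longer has to hold.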
Comparing results from the lossless and lossy topologies gives us insight into the difference between theoretical results and what might be observed in a real-world implementation of CAG. We used the loss profile from our own measurements with mica2 motes to assign reliabilities to links between nodes (Fig. 10). The reception rate was measured using the 433 MHz Chipcon CC1000 by counting the successfully received packets among 500 packet transmissions. We collected this data with two different radio strengths: 1) maximum radio transmission power (26.7 mA typical current consumption) (Fig. 10(a)) and 2) default radio transmission power (10.4 mA typical current consumption) (Fig. 10(b)). Although the maximum power uses more energy, it affects neither our topology construction nor the link reliability, because our internode distance is 10 meters and the reception rate was close to 100% up to 25 meters in both cases. Thus, we present results only from simulations with the default transmission power.

We apply two metrics to quantify the temporal correlation of data. To compute the efficiency of CAG with temporal as well as spatial correlation, we use the cluster change rate, which is the percentage of simulation timesteps in which at least one node changes cluster. Another metric for evaluating the streaming mode of CAG is the node change rate, which is the average number of nodes changing to a different cluster per timestep. Table II compares the metrics used to evaluate the previous version [Yoon and Shahabi 2005] and the current version of CAG.

We used the TOSSIM simulator of TinyOS 1.1.8 for our simulation study [Levis et al. 2003]. We used the temperature data collected with the sensorboard up at 1, 4, 6, and 7 PM from the Exposition Park. One hundred nodes were deployed in a regular 100 m × 100 m grid. We also used temperature readings collected on the Great Duck Island to study the efficiency of CAG with a long temporal history.
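The two temporal-correlation metrics defined above, cluster change rate and node change rate, can be sketched as follows. This is a minimal illustration rather than the authors' simulation code: the function name and the sample history are hypothetical, and the sketch assumes that the first timestep, which has no predecessor, is counted in the denominator as a timestep without change.

```python
def change_rates(assignments):
    """Return (cluster_change_rate_pct, node_change_rate).

    `assignments` holds one dict per timestep mapping node id -> clusterhead id.
    The first timestep has no predecessor and is counted as "no change"
    (an assumption of this sketch).
    """
    total_timesteps = len(assignments)
    if total_timesteps == 0:
        raise ValueError("need at least one timestep")
    # Number of nodes whose clusterhead id differs from the previous timestep.
    changed_per_step = [
        sum(1 for node, head in curr.items() if prev.get(node) != head)
        for prev, curr in zip(assignments, assignments[1:])
    ]
    steps_with_change = sum(1 for count in changed_per_step if count > 0)
    cluster_change_rate_pct = steps_with_change / total_timesteps * 100.0
    node_change_rate = sum(changed_per_step) / total_timesteps
    return cluster_change_rate_pct, node_change_rate


# Hypothetical history: three nodes over four timesteps.
history = [
    {1: 1, 2: 1, 3: 3},
    {1: 1, 2: 1, 3: 3},  # no change
    {1: 1, 2: 3, 3: 3},  # node 2 moves to cluster 3
    {1: 1, 2: 3, 3: 1},  # node 3 moves to cluster 1
]
print(change_rates(history))  # (50.0, 0.5)
```

Dividing by the number of transitions instead of the number of timesteps would be an equally defensible convention; the difference vanishes for long histories.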
Table II. Comparison of CAG metrics.

| Metric | CAG (old) | CAG (updated) |
|---|---|---|
| Cluster formation range | Relative range: $CR \pm CR \times \tau$ | Absolute range: $CR \pm Range \times \tau$, where $Range = MaxValue - MinValue$ |
| Reduced number of transmissions | $\frac{nTX(TAG) - nTX(CAG)}{nTX(TAG)} \times 100$ | $\frac{nTX(Interactive) - nTX(Streaming)}{nTX(Interactive)} \times 100$ |
| Accuracy of result | Relative error: $E_r = \frac{\lvert \text{EstimatedResult} - \text{CorrectResult} \rvert}{\text{CorrectResult}} \times 100$ | Absolute error: $E_r = \frac{\lvert \text{EstimatedResult} - \text{CorrectResult} \rvert}{Range} \times 100$ |
| Number of bridge nodes | Number of participating nodes − Number of clusterheads | Same as old CAG |
| Reduced number of transmissions per density | Three densities: moderate, dense, sparse | N/A |
| Cluster change rate | N/A | (Number of timesteps when at least one node changes cluster) / (Total number of timesteps) × 100 |
| Node change rate | N/A | E[number of nodes changing cluster per timestep] |

Each node at position (x, y) uses the value from the corresponding position in the empirical data sets. We generated four different topologies (the root node is placed at each corner of the 100 × 100 meter square) with the above configuration, and results are averaged over 20 runs for each topology. We observed that including the bridge nodes in the aggregation contributes only marginally to the precision of the result. We implemented the Average, Count, and Sum aggregation operators but report only the results for Average in this paper. We chose τ = 0, 1, 2, 4, 10, and 20% to be consistent with our previous work as well as to cover typical values of τ that users might be interested in.

6.2 Results

In this section, we report the results of our experiments in both interactive and streaming modes to understand: 1) the efficiency and precision tradeoff, 2) the effect of link reliability on precision, 3) the effect of bridge nodes on efficiency and accuracy, and 4) the impact of different patterns of temporal correlation on CAG's efficiency.
Fig. 11. A snapshot of the CAG tree with 375 nodes randomly placed in a 250 m × 250 m space with 9h synthetic data and τ = 20%. The big black square near the bottom-left corner is the root node, and the other small black circles are nodes in the root cluster. Clusterhead nodes (except the root node) are the small black squares, and the non-clusterhead nodes (except the root cluster) are small empty circles. The number beside each node indicates the clusterhead node id, and the arrow points to the parent node in the query routing tree.

6.2.1 Interactive mode of CAG. In this section, we present the performance and precision tradeoff in the interactive mode of CAG. In this mode, the clusterhead values are not weighted by the respective cluster sizes. Thus, if the data from all the nodes do not follow the normal distribution, we cannot guarantee that the error of the result is bounded by the given τ.

Fig. 11 shows a snapshot of the CAG tree with 375 nodes randomly placed in a 250 m × 250 m space with 9h synthetic data and τ = 20%. The root node (the big black square) is located in the bottom-left area, and all other nodes shown as black circles are in the same cluster as the root node. As expected, most new clusters are built along the diagonal band spanning from the top-left to the bottom-right (the clusterhead nodes, shown as small black squares, are spread there), where there is a large difference in the magnitude of the data. This figure confirms the validity of our implementation of the CAG algorithm.

Fig. 12. Performance and precision tradeoff in the interactive mode with measured temperature data from the Exposition Park (series: 1 PM, 4 PM, 6 PM, 7 PM; x-axis: threshold τ). (a) Reduced number of transmissions (%) with the reliability of default transmission power. (b) Average number of bridge nodes in one query cycle. (c) Absolute error (%) with the reliability of default transmission power. (d) Absolute error (%) with perfect link reliability (bounded by τ).

Fig. 12(a) shows the performance of CAG in terms of the reduced number of transmissions in the interactive mode using the default transmission power. Because of packet loss, the reduced number of transmissions is not 0 (it is 8.75% to 20%) even for CAG with τ = 0%. CAG with τ = 4% achieves a transmission saving of up to 37.5% at 6 PM, and CAG with τ = 10% achieves a transmission saving of up to 51.25% at 7 PM.

Fig. 12(b) shows the number of bridge nodes among all participating nodes (clusterhead nodes plus bridge nodes). As the error threshold increases, more nodes choose not to respond. Thus, we need more bridge nodes to keep the tree connected. Because a modest increase in the number of bridge nodes is accompanied by a dramatic reduction in the number of clusterheads, the total number of transmissions still decreases as the threshold value increases.

Fig. 12(c) shows the accuracy of the result obtained using the interactive mode of CAG with default transmission power (without retransmissions). Due to the unreliable links, the resulting error exceeds the bound when τ = 1% and 2%. The error caused by packet loss is 9.375% when τ = 10% at 4 PM. Fig. 12(d) shows the accuracy of the result obtained using the interactive mode of CAG with perfect link reliability.
Even though the resulting error is not guaranteed to be within the given threshold τ in the interactive mode, we observed that it was always bounded by τ. This may indicate that our measured temperature data from the physical world follows the normal distribution.

6.2.2 Streaming Mode of CAG. In the streaming mode, a cluster and its clusterhead do not change from one query cycle to the next as long as the sensor readings stay within the threshold. Therefore, if a cluster does not change, we do not need to re-count the number of nodes within the cluster. Fig. 13(a) depicts the reduction in the number of transmissions if no cluster changes for 100 timesteps (e.g., 1 hour and 40 minutes with one response per minute) and for 1800 timesteps (e.g., 30 minutes with one response per second). If no cluster changes for 100 query cycles, CAG reduces transmissions over the interactive mode by 50% to 70% as τ increases from 0% to 20% with the 1 PM data (53% to 75% with the 7 PM data). With 1800 query cycles, CAG reduces transmissions over the interactive mode by 94% to 97% as τ increases from 0% to 20% with the 1 PM data (95% to 98% with the 7 PM data).

This reduction in packet transmissions is validated using a more realistic scenario, as in Fig. 13(b). We found many instances in the temperature data from the Great Duck Island where the readings do not change for a long period of time and, when they do change, show a stair-wise pattern (Fig. 14(b)). This is probably due to shadows that keep places cool for a period of time even in the middle of the day. To emulate this stair-wise data pattern, we divided this data (1 PM to 6 PM) into two data sets: the first set covers 1 PM to 4 PM, and the second set spans the interval from 4 PM to 6 PM. As τ increases from 0% to 20%, CAG reduced the number of transmissions significantly with both data sets: 67% to 82% for the 1 PM to 4 PM data and 56% to 78% for the 4 PM to 6 PM data.
Fig. 13(c) shows the resulting error with packet loss using the default transmission power. Although the current mica2 radio is unreliable and CAG does not use retransmissions, CAG provides a bounded error except when τ = 1%. With a small threshold, even a few packet losses can easily degrade the accuracy, which makes it impossible for the error to be bounded for small threshold values such as 1%.

In both the interactive and streaming modes, the error in the result is always bounded by the threshold when τ ≥ 4% with packet loss using the default transmission power. This suggests that CAG with τ ≥ 4% might be resilient to packet loss in the real world. Thus, from our experiments, we empirically conclude that τ around 4% may be a reasonable trade-off point for achieving both small errors and energy savings in practice.

Fig. 13(d) shows that the error in the result is always bounded by the threshold when there is no packet loss. Note that the resulting error does not increase linearly with increasing τ. This is because the temperature data is highly correlated in the physical world: most sensor readings are close to the clusterhead value. Thus, the error increases sub-linearly with the increasing threshold.

Fig. 13. Performance and precision tradeoff in the streaming mode with measured temperature data from the Exposition Park (x-axis: threshold τ). (a) Improvement of streaming mode over interactive mode, as reduced number of transmissions (%), for given timesteps (series: 1 PM 100, 7 PM 100, 1 PM 1800, 7 PM 1800). (b) Improvement of streaming mode over interactive mode, as reduced number of transmissions (%), by applying the stair-wise data in Fig. 14(b) (series: 1 PM 3 hours, 4 PM 2 hours). (c) Absolute error (%) in the result obtained using the streaming mode of CAG with default transmit power reliability (series: 1 PM, 4 PM, 6 PM, 7 PM). (d) Absolute error (%) in the result obtained using the streaming mode of CAG with perfect link reliability (series: 1 PM, 4 PM, 6 PM, 7 PM).

We observed that packet loss can result in an error larger than τ, and that thresholds of 1% and 2% yield a smaller reduction in the number of transmissions than a bigger τ, especially when CAG is deployed for a short duration. Thus, we can get more benefit by relaxing τ slightly. This supports our claim that τ around 4% may provide the best energy-accuracy tradeoff.

Note that the updated CAG algorithm does not incur more overhead than the previous CAG algorithm in terms of the number of transmissions. We omit the results on performance and precision using the indoor environmental data due to space limitations and because they do not differ significantly from the results using the outdoor data. Our earlier work [Yoon and Shahabi 2005] described the impact of different levels of correlation and density on efficiency and accuracy.

Fig. 14. Linear and stair-wise data dynamics observed in the data from Great Duck Island (x-axis: time, 13 to 19 hours; y-axis: temperature). (a) Snapshot of temperature changing linearly from 1 PM to 7 PM (series: temperature, GDI; linear model). (b) Snapshot of temperature changing stair-wise from 1 PM to 7 PM (series: temperature, GDI).

6.2.3 Exploiting Temporal Correlation. In this section, we study the temporal correlation of data and its impact on the CAG algorithm.
The interactive mode of CAG is not affected by the temporal correlation because it is used for one-shot queries and a tree is built every time a new query is issued. However, the performance of the streaming mode of CAG is strongly influenced by the co-existing temporal and spatial correlations.

Fig. 14(a) shows an example of the temperature data measured on the Great Duck Island (taken every 5 minutes) which changes linearly from 1 PM to 7 PM. We model this data pattern as the linear function shown in the same figure. Fig. 14(b) shows an example of the temperature data measured on the Great Duck Island which changes stair-wise from 1 PM to 7 PM. Although both are temporally correlated, CAG with stair-wise data can achieve significantly more savings in the number of transmissions than with linear data. To quantify the impact of temporal correlation on energy efficiency, we evaluate CAG using the two metrics defined in Section 6.1: cluster change rate and node change rate.

To understand the impact of linear data on the performance of CAG, we used linear interpolation to convert our temperature data from the Exposition Park (from 4 PM to 6 PM) into a time series of two hours with a resolution of one minute. Sensor readings were taken every 5 minutes on the Great Duck Island, but our simulation uses a timestep of one minute. This finer granularity can better capture the dynamics of the data. For the stair-wise data, we predicted that it requires only as many cluster changes as there are corners in the stairs (roughly three in Fig. 14(b)), regardless of the threshold.

Fig. 15(a) shows the cluster change rate, which is the percentage of timesteps in which at least one node changes cluster. When τ is 1%, 2%, and 4%, the cluster change rate is 0%, 31%, and 70%, respectively. That is, cluster reformation is required in 70% of the timesteps (84 timesteps out of 121) when τ is 4%.
This seems like a heavy overhead and is counterintuitive at first glance. However, the average size of a cluster increases as the error threshold increases. Accordingly, this larger cluster size increases the probability that a small number of nodes will migrate to a different cluster when sensor readings change linearly. The node change rate shown in Fig. 15(a) illustrates that, on average, only 0%, 1.4%, and 6% of all nodes (0, 1.4, and 6 nodes out of 100) change clusters per timestep when τ is 1%, 2%, and 4%, respectively. If we fix these inconsistent cluster memberships locally with the nodes from neighboring clusters (see footnote 11), this overhead can be reduced significantly.

Fig. 15. Impacts of different spatial and temporal correlations on the performance of CAG (x-axis: threshold τ). (a) Cluster change rate and node change rate (%) using linear interpolation with temperature data from 4 PM to 6 PM from the Exposition Park. (b) Comparison of two methods for the node change rate (%), one-run, value-compare and diff-run, cid-compare, using the temperature data (4 PM and 6 PM, assuming they are consecutive timesteps) from the Exposition Park. We used diff-run, cid-compare in our simulations.

Fig. 15(b) compares two different methods of computing the node change rate. In this graph, we assume that the temperature data at 4 PM and 6 PM are adjacent data points in a time series with a two-hour interval. The first method (one-run, value-compare) runs CAG once with the 4 PM data and compares the resulting clusterhead value with the sensor reading of each node in the next timestep (6 PM) to determine whether the node still stays in the same cluster.
This method works correctly if the clusterhead values at 4 PM and 6 PM are identical, i.e., the data is static.

The second method (different-run, cid-compare) runs CAG with the 4 PM and 6 PM data sets separately and checks whether any node has a different clusterhead id. This assumes that the clusterhead value changes over time. Consequently, clusterhead nodes determine their cluster range at every timestep. In this method, the underlying routing protocol is required to support message exchange between the clusterhead node and the non-clusterhead nodes to keep track of the updated clusterhead value. This amplifies the tree maintenance overhead, defined as the number of transmissions due to nodes changing clusters, as shown in Fig. 15(b).

Because this method compares clusterhead ids from two separate runs (4 PM and 6 PM) of CAG on different data sets, most nodes are self-clusterhead nodes (node id = clusterhead id) when the τ value is small. This results in a very small number of nodes that change clusters. As τ becomes larger, however, fewer nodes keep the same clusterhead id (along with their cluster member nodes) from one timestep to the next, because the different data set at the next timestep gives a large number of nodes a different clusterhead id. This results in a larger number of nodes that change clusters.

Footnote 11: This is out of the scope of this paper. We will study this optimization in our future work.

Both methods are correct; they differ only in methodology, depending on the temporal data dynamics and on whether the underlying routing supports periodic or reactive dissemination. However, in reality, where data changes dynamically, the first method is not appropriate because it compares local sensor readings with the clusterhead value from the past.
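The two ways of counting nodes that change cluster can be contrasted in a short sketch. It is a simplified illustration under stated assumptions, not the authors' TOSSIM code: the function names and sample values are hypothetical, and the value-compare test assumes the absolute cluster range CR ± Range × τ from Table II.

```python
def value_compare(head_value_prev, readings_next, tau, value_range):
    """One-run, value-compare: count nodes whose next reading falls outside
    the absolute range around their *previous* clusterhead value."""
    limit = tau * value_range
    return sum(
        1
        for node, reading in readings_next.items()
        if abs(reading - head_value_prev[node]) > limit
    )


def cid_compare(head_id_prev, head_id_next):
    """Diff-run, cid-compare: count nodes whose clusterhead id differs
    between two separate clustering runs."""
    return sum(
        1 for node, head in head_id_next.items() if head_id_prev.get(node) != head
    )


# Hypothetical data for four nodes (illustrative only).
head_value_prev = {1: 20.0, 2: 20.0, 3: 25.0, 4: 25.0}  # clusterhead value seen at t
readings_next = {1: 20.5, 2: 22.0, 3: 25.2, 4: 24.9}    # own reading at t + 1
print(value_compare(head_value_prev, readings_next, tau=0.1, value_range=10.0))  # 1

head_id_prev = {1: 1, 2: 1, 3: 3, 4: 3}  # run on the data at t
head_id_next = {1: 1, 2: 2, 3: 3, 4: 4}  # separate run on the data at t + 1
print(cid_compare(head_id_prev, head_id_next))  # 2
```

The first function needs only local state, whereas the second presumes that each node learns its clusterhead id in every run, which is where the extra maintenance traffic comes from.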
Conversely, if the data (the clusterhead node value) does not change significantly over time, i.e., it is highly temporally correlated as in the stair-wise pattern, we can select the first method and achieve significant energy savings. In our simulations, we used the second method to compute the cluster change rate and the node change rate in order to address the worst-case scenario.

7. CONCLUSION AND FUTURE WORK

In this paper, we demonstrated the effectiveness of CAG, in both interactive and streaming modes, in performing energy-efficient in-network aggregation that leverages both spatial and temporal correlations. We mathematically modeled the spatial correlation using the measured data, and evaluated the efficiency and accuracy of CAG analytically and empirically. CAG takes advantage of stronger spatial and temporal correlations by increasing energy efficiency while providing results with predictable and bounded errors. These benefits are amplified as the density of nodes becomes higher. CAG is shown to scale gracefully as the number of nodes in the network grows. Moreover, CAG is shown to be resilient to packet loss. CAG is the first system that realizes the semantic broadcast to conserve energy while ensuring bounded approximation by leveraging the spatial and temporal correlations prevalent in nature.

We would like to extend this work by focusing on the design of a hybrid (proactive and reactive) clustering protocol which supports localized cluster adjustment whenever the cluster structure changes (change, merge, and split clusters). If a small number of nodes need to join a different cluster (due to communication problems or a large change in sensor readings), which happens frequently in our experiments, they can do so locally by communicating with the neighboring nodes associated with other clusters.
By avoiding global communications and adjusting clusters using only local communications, CAG can save a significant number of transmissions and hence energy, especially when only a few nodes change clusters. Whenever the cluster structure changes, CAG will send the aggregate value back up to the base station because at least one sensor is reporting data outside the current threshold. This mechanism enables reactive data acquisition. Our work can be extended into a hybrid algorithm which enables tree-based routing to support any-to-any routing.

Acknowledgment

The authors would like to thank Professor David Kempe for discussions about the updated algorithm, Professor Bhaskar Krishnamachari for useful discussions, and Sundeep Pattem for valuable feedback.

REFERENCES

Considine, J., Li, F., Kollios, G., and Byers, J. 2004. Approximate aggregation techniques for sensor databases. In International Conference on Data Engineering (ICDE).
Cristescu, R., Beferull-Lozano, B., and Vetterli, M. 2004. On network correlated data gathering. In IEEE INFOCOM.
Dale, M. R., Dixon, P., Fortin, M.-J., Legendre, P., Myers, D. E., and Rosenberg, M. S. 2002. Conceptual and mathematical relationships among methods for spatial analysis. In Ecography Vol. 25 No. 5.
Deshpande, A., Guestrin, C., Madden, S. R., Hellerstein, J. M., and Hong, W. 2004. Model-Driven Data Acquisition in Sensor Networks. In VLDB.
Fang, Q., Zhao, F., and Guibas, L. 2003. Lightweight Sensing and Communication Protocols for Target Enumeration and Aggregation. In MobiHoc.
Fernandez, C. and Green, P. J. 2002. Modeling spatially correlated data via mixtures: a Bayesian approach. In University of St Andrews, University of Bristol.
Ganesan, D., Greenstein, B., Perelyubskiy, D., Estrin, D., and Heidemann, J. 2003. An Evaluation of Multi-resolution Storage for Sensor Networks. In ACM Conference on Embedded Networked Sensor Systems (SenSys).
Goel, A. and Estrin, D. 2003. Simultaneous optimization for concave costs: single sink aggregation or single source buy-at-bulk. In ACM-SIAM Symposium on Discrete Algorithms (SODA).
Goel, S. and Imielinski, T. 2001. Prediction-based Monitoring in Sensor Networks: Taking Lessons from MPEG. In ACM Computer Communication Review (CCR).
Guestrin, C., Bodik, P., Thibaux, R., Paskin, M., and Madden, S. 2004. Distributed regression: an efficient framework for modeling sensor network data. In Information Processing in Sensor Networks (IPSN).
Gupta, H., Navda, V., Das, S. R., and Chowdhary, V. 2005. Efficient Gathering of Correlated Data in Sensor Networks. In ACM International Symposium on Mobile Ad Hoc Networking and Computing (MobiHoc).
Halley, J. M., Hartley, S., Kallimanis, A. S., Kunin, W. E., Lennon, J. J., and Sgardelis, S. P. 2004. Uses And Abuses Of Fractal Methodology In Ecology. In Ecology Letters.
Heinzelman, W. R., Chandrakasan, A., and Balakrishnan, H. 2000. Energy-Efficient Communication Protocol for Wireless Microsensor Networks. In Hawaii International Conference on System Sciences (HICSS).
Hill, J., Szewczyk, R., Woo, A., Hollar, S., Culler, D., and Pister, K. 2000. System Architecture Directions for Network Sensors. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
Husker, A., Kohler, M., and Davis, P. 2003. Seismic Amplitude Variations Due to Site and Basin Edge Effects in the Los Angeles Basin. In Trans American Geophysical Union.
Intanagonwiwat, C., Govindan, R., and Estrin, D. 2000. Directed Diffusion: A Scalable and Robust Communication Paradigm for Sensor Networks. In International Conference on Mobile Computing and Networking (Mobicom).
Jain, A., Chang, E. Y., and Wang, Y.-F. 2004. Adaptive Stream Resource Management Using Kalman Filters. In ACM SIGMOD.
Jindal, A. and Psounis, K. 2004. Modeling spatially-correlated sensor network data. In IEEE Communications Society Conference on Sensor and Ad Hoc Communications and Networks (SECON).
Lennon, J. 2000. Red-shifts and red herrings in geographical ecology. In Ecography Vol. 23 No. 1.
Levis, P., Lee, N., Welsh, M., and Culler, D. 2003. TOSSIM: Accurate and Scalable Simulation of Entire TinyOS Applications. In ACM Conference on Embedded Networked Sensor Systems (SenSys).
Madden, S. R., Franklin, M. J., Hellerstein, J. M., and Hong, W. 2002. TAG: Tiny AGgregation service for ad-hoc sensor networks. In Symposium on Operating Systems Design and Implementation (OSDI).
Mainwaring, A., Szewczyk, R., Polastre, J., and Anderson, J. 2005. Habitat monitoring on great duck island. In http://www.greatduckisland.net.
Manjeshwar, A. and Agrawal, D. P. 2001. TEEN: A Routing Protocol for Enhanced Efficiency in Wireless Sensor Networks. In International Workshop on Parallel and Distributed Computing, Issues in Wireless Networks and Mobile Computing (IPDPS).
Manjeshwar, A. and Agrawal, D. P. 2002. APTEEN: A Hybrid Protocol for Efficient Routing and Comprehensive Information Retrieval in Wireless Sensor Networks. In International Workshop on Parallel and Distributed Computing, Issues in Wireless Networks and Mobile Computing (IPDPS).
Nath, S., Gibbons, P., Anderson, Z., and Seshan, S. 2004. Synopsis Diffusion for Robust Aggregation in Sensor Networks. In ACM Conference on Embedded Networked Sensor Systems (SenSys).
Olston, C., Jiang, J., and Widom, J. 2003. Adaptive filters for continuous queries over distributed data streams. In ACM SIGMOD.
Pattem, S., Krishnamachari, B., and Govindan, R. 2004. The Impact of Spatial Correlation on Routing with Compression in Wireless Sensor Networks. In Information Processing in Sensor Networks (IPSN).
Rossi, L. A., Krishnamachari, B., and Kuo, C.-C. J. 2004. Distributed Parameter Estimation for Monitoring Diffusion Phenomena Using Physical Models. In IEEE Communications Society Conference on Sensor and Ad Hoc Communications and Networks (SECON).
Schabenberger, O. and Gotway, C. A. 2005. Statistical Methods for Spatial Data Analysis. In Chapman & Hall/CRC.
Sharaf, M. A., Beaver, J., Labrinidis, A., and Chrysanthis, P. K. 2003. TiNA: A Scheme for Temporal Coherency-Aware in-Network Aggregation. In ACM Workshop on Data Engineering for Wireless and Mobile Access (MobiDe).
Szewczyk, R., Mainwaring, A., Polastre, J., and Culler, D. 2004. An Analysis of a Large Scale Habitat Monitoring Application. In ACM Conference on Embedded Networked Sensor Systems (SenSys).
Tobler, W. R. 1970. A computer movie simulating urban growth in the Detroit region. In Economic Geography Vol. 46 No. 2.
Woo, A., Madden, S. R., and Govindan, R. 2004. Networking Support for Query Processing in Sensor Networks. In Communications of the ACM (CACM).
Xu, N., Rangwala, S., Chintalapudi, K. K., Ganesan, D., Broad, A., Govindan, R., and Estrin, D. 2004. A Wireless Sensor Network for Structural Monitoring. In ACM Conference on Embedded Networked Sensor Systems (SenSys).
Yao, Y. and Gehrke, J. 2003. Query Processing for Sensor Networks. In Biennial Conference on Innovative Data Systems Research (CIDR).
Yoon, S. and Shahabi, C. 2005. Exploiting Spatial Correlation Towards an Energy Efficient Clustered AGgregation Technique (CAG). In IEEE International Conference on Communications.
Zhao, J., Govindan, R., and Estrin, D. 2003. Computing aggregates for monitoring wireless sensor networks. In IEEE International Workshop on Sensor Network Protocols and Applications (SNPA).
Description
SunHee Yoon, Cyrus Shahabi. "An experimental study of the effectiveness of clustered aggregation (CAG) leveraging spatial and temporal correlations in wireless sensor networks." Computer Science Technical Reports (Los Angeles, California, USA: University of Southern California. Department of Computer Science) no. 869 (2005).
Asset Metadata
Creator
Shahabi, Cyrus (author), Yoon, SunHee (author)
Core Title
USC Computer Science Technical Reports, no. 869 (2005)
Alternative Title
An experimental study of the effectiveness of clustered aggregation (CAG) leveraging spatial and temporal correlations in wireless sensor networks (title)
Publisher
Department of Computer Science, USC Viterbi School of Engineering, University of Southern California, 3650 McClintock Avenue, Los Angeles, California, 90089, USA (publisher)
Tag
OAI-PMH Harvest
Format
36 pages (extent), technical reports (aat)
Language
English
Unique identifier
UC16270725
Identifier
05-869 An Experimental Study of the Effectiveness of Clustered AGgregation (CAG) Leveraging Spatial and Temporal Correlations in Wireless Sensor Networks (filename)
Legacy Identifier
usc-cstr-05-869
Rights
Department of Computer Science (University of Southern California) and the author(s).
Internet Media Type
application/pdf
Copyright
In copyright - Non-commercial use permitted (https://rightsstatements.org/vocab/InC-NC/1.0/)
Source
20180426-rozan-cstechreports-shoaf (batch), Computer Science Technical Report Archive (collection), University of Southern California. Department of Computer Science. Technical Reports (series)
Access Conditions
The author(s) retain rights to their work according to U.S. copyright law. Electronic access is being provided by the USC Libraries, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
USC Viterbi School of Engineering Department of Computer Science
Repository Location
Department of Computer Science. USC Viterbi School of Engineering. Los Angeles, CA, 90089
Repository Email
csdept@usc.edu
Inherited Values
Title
Computer Science Technical Report Archive
Description
Archive of computer science technical reports published by the USC Department of Computer Science from 1991 - 2017.
Coverage Temporal
1991/2017