RELIABLE LANGUAGES AND SYSTEMS FOR SENSOR NETWORKS

by

Ramakrishna Gummadi

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

August 2007

Copyright 2007 Ramakrishna Gummadi

Table of Contents

List Of Tables
List Of Figures
Abstract

Chapter 1: Introduction

Chapter 2: Atomicity Guarantees for TinyOS Programs
  2.1 Introduction
  2.2 Inter-Node Race Conditions in Sensor Network Applications
    2.2.1 The Dirty Read Problem
    2.2.2 The Lost Update Problem
  2.3 Design and Implementation of GAG
    2.3.1 Goals and Requirements
    2.3.2 The GAG API for Atomic Views
    2.3.3 Implementation of GAG
  2.4 Fixing Inter-Node Race Conditions
    2.4.1 Finding inter-node race conditions
    2.4.2 Using GAG to fix race conditions
  2.5 Evaluation
    2.5.1 Methodology and metrics
    2.5.2 Results
  2.6 Related Work
  2.7 Conclusions and Future Work

Chapter 3: Pleiades: A Language with Serializability Guarantees
  3.1 Introduction
  3.2 The Pleiades Language
    3.2.1 Design Rationale
    3.2.2 Parking Cars with Pleiades
    3.2.3 Parking Cars with nesC
      3.2.3.1 A Centralized nesC Implementation
      3.2.3.2 A Distributed nesC Implementation
    3.2.4 Other Features of Pleiades
  3.3 Implementation
    3.3.1 Program Partitioning and Migration
    3.3.2 Serializable Execution of cfors
  3.4 Evaluation
  3.5 Related Work
  3.6 Conclusions and Future Work

Chapter 4: Kairos: An Eventual-Consistency Programming Language
  4.1 Introduction
  4.2 Related Work
  4.3 Kairos Programming Model
    4.3.1 Kairos Abstractions and Programming Primitives
    4.3.2 Examples of Programming with Kairos
  4.4 Kairos Implementation
  4.5 Kairos Evaluation
  4.6 Conclusion and Future Work

Chapter 5: Fault-Tolerance Support for Kairos
  5.1 Introduction
  5.2 Motivation
  5.3 Generic Checkpoint Recovery in a Macroprogramming System
    5.3.1 An Overview of Kairos
    5.3.2 Recovery in Macroprograms
    5.3.3 Manual Failure Recovery for Macroprogramming
    5.3.4 Recovering from Partitions
  5.4 Automated Recovery Strategies
    5.4.1 Declarative Recovery Annotations
    5.4.2 Selecting and Managing Checkpoints
    5.4.3 Transparent Recovery
  5.5 Evaluation
    5.5.1 Methodology
    5.5.2 Results
  5.6 Related Work
  5.7 Conclusions and Future Work

Chapter 6: Conclusions

Bibliography

List Of Tables

2.1 Source size without and with GAG
3.1 Performance of Find-Nodecut
4.1 Performance of Vehicle Tracking in Kairos

List Of Figures

1.1 Design space of prior and our proposed systems with respect to readability, reliability, and resource usage
1.2 Taxonomy of programming languages and systems
2.1 Routing tree construction pseudocode
2.2 Illustration of a routing loop with MultiHopLQI
2.3 Localization pseudocode
2.4 Contaminant detection pseudocode
2.5 Multi-target tracking pseudocode
2.6 Data collection with filtering pseudocode
2.7 GAG API
2.8 Data structures used by GAG
2.9 Pseudocode for localization with GAG
2.10 Pseudocode for contamination detection with GAG
2.11 Pseudocode for multi-target tracking with GAG
2.12 Evaluation testbed of 40 Imote2 motes
2.13 Plots of distribution of routing loop sizes and the percentage of data lost as a result at each loop size
2.14 Plots of increase in routing tree convergence and average per-node message overhead incurred by GAG
2.15 Plot of localization error as a function of the localization graph size
2.16 Plots of localization latency increase and average per-node message overhead incurred by GAG
2.17 Plots of missed contaminant detection events with network size
2.18 Plots of increase in contamination detection latency and average per-node message cost incurred by GAG
2.19 Plot of percentage of target tracking failures with network size
2.20 Plots of increase in target tracking latency and average per-node message overhead incurred by GAG
2.21 Plot of unfiltered data from each data set
2.22 Plots of increase in filter instantiation latency and average per-node message overhead incurred by GAG
3.1 A street-parking application in Pleiades
3.2 Reliable but inefficient street-parking in nesC
3.3 Efficient but unreliable street-parking in nesC
3.4 Algorithm for determining nodecuts
3.5 Nodecuts generated for the street-parking example
3.6 Locking algorithm
3.7 Deadlock algorithm
3.8 PEG application error
3.9 Street parking latency
3.10 Street parking message cost
4.1 Taxonomy of Programming Models for Sensor Networks
4.2 Kairos Programming Architecture
4.3 Procedural Code for Building a Shortest-path Routing Tree
4.4 Procedural Code for Localizing Sensor Nodes
4.5 Procedural Code for Vehicle Tracking
4.6 Stargate with Mica2 as a NIC (left), Stargate Array (middle), and Ceiling Mica2dot Array (right)
4.7 Convergence Time (left), Overhead (middle), and OPP Stretch (right) for the Routing Tree Program
4.8 Average Error in Localization (L) and Localization Success Rate (R)
5.1 Distribution of outage durations in a real sensor network
5.2 Send and receive procedures for data aggregation in a node-level program with manual recovery
5.3 Example macroprogram for computing average temperature and light readings
5.4 Example macroprogram with manual recovery code
5.5 Example macroprogram for recovering from partitions
5.6 Common tasks and their merge functions
5.7 Example macroprogram to illustrate Declarative Recovery (DR)
5.8 A single Mica-Z controlled by a PC (left), a single Mica-Z attached to a Stargate (center), and Mica-Z's (circled) on the ceiling (right)
5.9 Availability comparison of TR-Software (TR-SW), TR-Hardware (TR-HW), and DR-Partition Recovery (DR-PR) strategies with increasing node failures
5.10 Availability comparison of TR-Software (TR-SW), TR-Hardware (TR-HW), and DR-Partition Recovery (DR-PR) strategies with increasing node failures
5.11 Accuracy comparison of TR-Software (TR-SW), TR-Hardware (TR-HW), and No Recovery (NR) strategies with increasing node failures
5.12 Accuracy comparison of TR-Software (TR-SW), TR-Hardware (TR-HW), DR-Partition Recovery (DR-PR), and No Recovery (NR) strategies with increasing node failures
5.13 Accuracy comparison of TR-Software (TR-SW), TR-Hardware (TR-HW), and DR-Partition Recovery (DR-PR) strategies with increasing node failures
5.14 Message overhead comparison of TR-SW, TR-HW, and DR-PR strategies
5.15 Memory overhead for CRR

Abstract

Sensor networks promise to allow the world around us to be observed, measured, and even controlled at a fine granularity. However, in order to realize the full potential of sensor networks, it is increasingly apparent that they should be easily, reliably, and efficiently programmable. Surprisingly, state-of-the-art programming languages and systems focus mostly on programmability and efficiency, and support reliability only poorly, if at all. In this thesis, we take the first step toward achieving all three goals by building three related languages and systems, each of which supports reliability.

First, we show how one can easily modify existing code, which is primarily designed for efficiency, in order to provide reliability. Since today's programming systems are not easily accessible to non-experts, we design and implement two languages that are easy to program, and that also offer trade-offs in terms of reliability and efficiency. Our experimental results from these three systems indicate that it is possible to build reliable and efficient systems that are also simple to program.

Chapter 1: Introduction

Wireless sensor networks consist of a system of distributed sensors embedded in the physical world. They promise to allow observation of previously unobservable phenomena. They are increasingly used in scientific and commercial applications [TPS+, KAB+, SMPC, GJV+].
However, constructing practical and reliable wirelessly-networked systems out of them is still a significant challenge. This is because the programmer must cope with severe resource, bandwidth, and power constraints on the sensor nodes, in addition to dealing with the various challenges of distributed systems, such as the need to maintain consistency and synchronization among numerous, asynchronous, loosely coupled nodes and the need for fault tolerance.

Originally, such resource constraints forced the developers of infrastructure software such as languages, operating systems, and network stacks to focus on optimizing resource consumption. The targeted resources are chiefly code size, data memory, and network messaging. Since the resulting abstractions are not always easy to use for non-expert programmers, later research has concentrated on providing programming primitives that are easier to use. However, reliability considerations were secondary, so support for reliable programming is slim to non-existent.

In this thesis, we first show that, in order to be practical (i.e., work correctly under real-world conditions), programming support for sensor networks needs to incorporate reliability as a primary goal (Section 2.2). By reliability, we mean a programming model that guarantees data consistency and correct ordering of distributed execution. For efficiency reasons, the programming model must provide various relaxed consistency and correctness semantics, and its implementation must optimize resources while respecting the semantics promised to the programmer.

Since support for reliability is not adequately met by existing languages or systems software, we have designed three systems, including two languages, to satisfy the reliability requirement. The primary reason for these three systems is to explore the design space of programmability, reliability, and resource usage requirements, as shown in Figure 1.1. These three systems are Atomic Views (or GAG) (Chapter 2), Pleiades (Chapter 3), and Kairos (Chapter 4). Atomic Views is a library that provides reliability while maintaining compatibility with existing code, Pleiades is a language that provides rigorous reliability guarantees, while Kairos allows the programmer to trade off resource usage for more relaxed reliability guarantees. Unlike existing languages and systems such as TinyOS [LMG+04] and TinyDB [MFHHb], all three systems provide some degree of programmer-controllable reliability.

The relationship between the three pieces of work presented in this thesis and other closely related work proposed in the literature is shown in Figure 1.2. This relationship is described in more detail in Section 4.2; here, we give a brief overview of these three systems. First, Atomic Views augments TinyOS programs in order to provide reliability, and is in the same class of node-level systems that provide additional safety guarantees for TinyOS programs, such as Maté [LC]. Second, Pleiades and Kairos are centralized programming languages for expressing the global behavior of an aggregate collection of sensor nodes. They provide a way to name and address nodes, so they are different from query-processing languages such as TinyDB (Figure 1.2).

Figure 1.1: Design space of prior and our proposed systems with respect to readability, reliability, and resource usage.

Figure 1.2: Taxonomy of programming languages and systems.

We make the following contributions in this dissertation.
• In Chapter 2, we examine race conditions in distributed sensor network programs, and describe the design, implementation, and evaluation of a library, called GAG (Global Atomicity Guarantor), that provides global atomicity guarantees for such programs. GAG supports atomic reads and writes of data shared across nodes. The library is designed to support the core requirement of atomicity from a sensor network application's perspective: a consistent view of shared data across all nodes in the network. The focus on this sole requirement makes GAG more lightweight than traditional transaction-based approaches to atomicity, which additionally must provide persistence and tolerate complex failure modes of nodes. We show that existing sensor network applications can be cheaply modified to use our atomicity library. Further, the resulting programs are correct under distributed execution and under common modes of failure seen in sensor networks, while incurring less than 20% additional overhead in messages and memory.

• In Chapter 3, we present the Pleiades programming language, its compiler, and its runtime. The Pleiades language extends the C language with constructs that allow programmers to name and access node-local state within the network and to specify simple forms of concurrent execution. The compiler and runtime system cooperate to implement Pleiades programs efficiently and reliably. First, the compiler employs a novel program analysis to translate Pleiades programs into message-efficient units of work implemented in nesC. The Pleiades runtime system orchestrates execution of these units, using TinyOS services, across a network of sensor nodes. Second, the compiler and runtime system employ novel locking, deadlock detection, and deadlock recovery algorithms that guarantee serializability in the face of concurrent execution. We illustrate the readability, reliability, and efficiency benefits of the Pleiades language through detailed experiments, and demonstrate that the Pleiades implementation of a realistic application performs similarly to a hand-coded nesC version that contains more than ten times as much code.

• Kairos is a language that provides relaxed consistency semantics while presenting the same centralized programming model as Pleiades. Kairos' compile-time and runtime subsystems expose a small set of programming primitives, while hiding from the programmer the details of distributed-code generation and instantiation, remote data access and management, and inter-node program flow coordination. In Chapter 4, we describe Kairos' programming model, and demonstrate its suitability, through actual implementation, for a variety of distributed programs typically encountered in the sensor network literature, spanning both infrastructure services and signal processing tasks: routing tree construction, localization, and object tracking. Our experimental results suggest that Kairos does not adversely affect the performance or accuracy of distributed programs, while our implementation experience suggests that it greatly raises the level of abstraction presented to the programmer.
• Since nodes in a sensor network are exposed to unpredictable environments, sensor-network applications must handle a wide variety of faults: software errors, node and link failures, and network partitions. The code to manually detect and recover from faults crosscuts the entire application, is tedious to implement correctly and efficiently, and is fragile in the face of program modifications. In Chapter 5, we investigate language support for modularly managing faults. Our insight is that such support can be naturally provided as an extension to existing "macroprogramming" systems for sensor networks. In such a system, a programmer describes a sensor network application as a centralized program; a compiler then produces equivalent node-level programs. We describe a simple checkpoint API for macroprograms, which can be automatically implemented in a distributed fashion across the network. We also describe declarative annotations that allow programmers to specify checkpointing strategies at a higher level of abstraction. We have implemented our approach in the Kairos macroprogramming system. Experiments show it to improve application availability by an order of magnitude while incurring low messaging overhead.

While we have explored various points in the possible design space of sensor network programs, we do not claim to provide optimal choices of reliability, readability, and resource consumption for every possible program. In fact, a single such system might not exist at all. So, instead of a one-size-fits-all approach, the programmer might benefit from a well-tested suite of options. While we have designed our systems with well-defined heuristics in mind, and implemented and evaluated them on real devices, we still need to gain more experience across real deployments with a large number of nodes in order to understand the implications of these choices better. Furthermore, several pieces of the work presented here can be extended or refined in order to yield better systems, and such improvements are described within the individual chapters.

Chapter 2: Atomicity Guarantees for TinyOS Programs

2.1 Introduction

The sensing and computational capabilities of sensor network platforms have recently evolved significantly. For example, the Intel Imote2 [imo] and UCLA LEAP [MHY+] platforms have 32-bit processors, modular low-power designs, and can support sophisticated sensors such as CMOS cameras and high-rate acoustic modules. Researchers have already taken advantage of these capabilities to build sophisticated distributed systems for self-calibrating localization [GLTE] and cooperative transcoding [GMP+]. A key feature of these systems is the complex interaction between application instances executing concurrently on different nodes. As sensor network platforms evolve, we expect such interactions to increase.

In general, instances of a distributed program read and modify data generated either by themselves or by others, and may share the modified data with other nodes. If these instances do not see consistent views of this shared data, distributed programs can produce incorrect results. For example, while building a multihop routing tree for data collection, a node looks for beacon messages among its neighbors and tries to select a node offering the best-quality path to the root of the tree as its parent.
However, if the programmer does not carefully ensure that the routing beacon being received by the node is not the result of a previous stale announcement originating from the listening node itself, transient routing loops can result. This is an instance of a dirty read problem, in which nodes use partially processed or stale data from other nodes. MultiHopLQI, the default TinyOS routing component, suffers from routing loops as a result of dirty reads (Section 2.2), and such routing loops can result in significant data loss in data gathering applications.

As another example, consider the scenario where nodes are trying to track two moving targets in a sensor field. Many proposed algorithms [CHZ, Coa] run particle filters for each target at a node designated as the master for that target, and the responsibility of being a master migrates from node to node as the target moves. If two targets happen to be near each other, two different master nodes can designate the same third node as the master for the next tracking period. During this update process, the first master-update message could be overwritten by the subsequent message, thereby making the first target untrackable from that point onward and leading to incorrect execution. This lost update problem occurs in general when nodes concurrently try to update the state of a third node and one of the updates is lost.

The two problems described above are fundamentally due to inter-node race conditions, which pertain to the interactions among distributed nodes. This situation is different from intra-node race conditions, which are caused by the interleaved execution on a single node of threads in multi-threaded systems (e.g., pthreads) or of tasks and events in event-driven systems (e.g., TinyOS). Solutions for detecting or preventing intra-node race conditions have been extensively studied and include nesC's support for atomic blocks and static race condition detection, TinyOS's guarantee that one task does not preempt another, and the mutual exclusion support in pthreads.

Unfortunately, techniques for handling intra-node race conditions do not address the problem of inter-node race conditions. This is because, even when data is being shared across two instances of the same program running on two different nodes, these two program instances execute completely independently of one another. Therefore, preventing inter-node race conditions requires a form of control-flow synchronization across the distributed nodes. Implementing such synchronization manually is tedious and error prone, requires large-scale application modifications, and can be prohibitively expensive in terms of synchronization overhead.

We observe that to avoid inter-node race conditions it suffices to provide globally atomic data operations that involve fetching remote data, reading and writing a node's local data, and making the computed results atomically available to other nodes. In this chapter we discuss the design and implementation of GAG (Global Atomicity Guarantor), a library that provides such global atomicity guarantees for sensor network programs. The GAG API allows programs to read from and write to a remote variable (i.e., a named variable residing at another node). Programmers declare named atomic views and associate each read or write operation with a specific atomic view. This approach provides fine-grained control over atomicity without altering the basic structure of the code. When an atomic view is committed through the GAG API, all variables in that view are updated in an all-or-nothing manner.
That is, either all variables are updated at all non-failed nodes, or none are. Any failed node is thus consistently isolated, and all non-failed nodes see consistent values, allowing the application to execute correctly under concurrency and failures. If the commit fails, the API gives the programmer the ability to later retry the entire set of read and write operations.

GAG internally uses locks for performing read and write operations. When a read for a remote variable is executed for the first time, GAG fetches the value of the remote variable, along with a read lock for that variable. Similarly, before writing to a remote variable for the first time, GAG requests a write lock from the remote node. Once GAG acquires the correct type of lock, all reads and writes to that variable are done locally. If the remote node cannot grant a lock because a conflicting operation is in progress, the read or write operation returns an error, and the programmer can retry later.

During the commit operation, any outstanding writes are committed to remote nodes. This operation succeeds on all non-failed nodes, thereby ensuring correctness for subsequent operations. For concurrent operations, dirty reads are avoided by disallowing a node from acquiring a read lock while another node holds a write lock. Lost updates are similarly avoided by disallowing a node from acquiring a write lock while another node holds a write lock.

We evaluate programs written using GAG on a testbed of 40 Imote2 motes equipped with 802.15.4 radios. We find that GAG corrects problems due to race conditions in five different sensornet applications. We also show that GAG requires only minor modifications to the original applications. By contrast, if a programmer were to manually implement similar functionality in an application, up to 40% of application code (for typical programs of more than 600 lines) would be devoted to ensuring global atomicity. Furthermore, we find that using GAG imposes less than 20% additional messaging and less than 15% additional memory (ROM and RAM) overhead compared to the originally incorrect programs.

GAG is inspired by prior work on atomicity in distributed systems and databases. However, GAG provides qualitatively different types of atomicity than those examined in these contexts. The database literature has focused on atomic access to persistent data, while GAG provides atomic updates to program state and does not need to worry about persistence or node failures. GAG also differs significantly from recent work on Transactional Memory [LR06], which provides atomicity for multi-core processors. Unlike multi-core processors, sensor networks consist of loosely coupled nodes, so GAG is designed to work correctly when some nodes fail during execution. This means nodes should never block on one another; they should also not share any explicit or long-term state information, in order to minimize fate sharing. GAG also meets the following other sensornet-specific goals and requirements: lower communication overhead than database transactions or transactional memories; low memory footprint; split-phase internal communication; user-configurable failure detection; and fully user-controlled application-level retries.

We make four main contributions in this chapter:

• We show that inter-node race conditions are present in sensor networks, and are caused either by dirty reads or lost updates. We show how five qualitatively different sensor network applications function incorrectly due to these same two underlying problems (Section 2.2).
• We describe the design and implementation of a simple API for global atomicity in sensor network programs that solves the dirty read and the lost update problems (Section 2.3).

• We show how to easily modify such applications to use the GAG API in order to correct them (Section 2.4).

• We extensively evaluate the modified programs using real-world data sets with ground truth, and experimentally verify that they do not encounter the inter-node race conditions that occur without GAG support. This robustness is achieved with a modest latency and messaging overhead (Section 2.5).

2.2 Inter-Node Race Conditions in Sensor Network Applications

In this section, we demonstrate the existence of inter-node race conditions in five qualitatively different sensor network applications: routing tree construction for data gathering, localization, contaminant detection, target tracking, and data filtering. The first three applications suffer from the dirty read problem, while the last two are affected by lost updates. These race conditions are inherent in the functionality of such applications, and, furthermore, existing code and algorithms do not fix them. In later sections (Section 2.4.2), we demonstrate how these race conditions can be prevented by using our GAG library.

2.2.1 The Dirty Read Problem

For many common sensor network applications, the programmer has a natural expectation that a certain set of writes performed by a node n be correlated, i.e., that other nodes can only see the results of n's write operations as a unit. If this expectation is violated, i.e., other nodes see the result of some of n's writes but not others, we say that a dirty read has occurred.

In its simplest form, the dirty read problem can be described using the following notation. Let r_n(x, n′) represent a read operation executed at node n on data x belonging to node n′, and w_n(x, n′) be a write operation executed at node n on x belonging to node n′. Consider the following sequence of reads and writes occurring at different nodes during execution:

    w_n(x, n′), r_n′′(x, n′), w_n(y, n′′′)        (2.1)

If node n intends that its writes to the variable x at n′ and to the variable y at n′′′ be correlated, then the intervening read of x by node n′′ constitutes a dirty read. Note that n, n′, and n′′′ could all be the same node.

Unfortunately, intra-node techniques for ensuring atomicity, such as nesC's atomic sections or TinyOS's task preemption and completion guarantees, are not sufficient to avoid the dirty read problem. This is because the problem spans multiple nodes in the network. In our example sequence of operations above, node n has no control over the actions of other nodes such as n′′, which means atomicity of the two writes cannot be guaranteed.

To better illustrate how the dirty read problem can impact the correctness of programs, we look at three different sensornet applications that exhibit the problem. For each example application, we explain how the example contains an instance of the dirty read problem. We then present pseudocode, based on actual previously proposed code and algorithms, that concretely illustrates the problem. We also give a quantitative sense of the impact of incorrect execution.

In what follows, we assume that sensor network programs are written using a facility that provides access to named remote variables. Many distributed applications either provide their own communication libraries to access state at remote nodes via a read/write abstraction, or use facilities such as Abstract Regions [WM] and Hoods [WSBC] that provide such an abstraction. Our own atomicity library, GAG, also provides such a facility, which is implemented using a routing layer that provides any-to-any routing (Section 2.3). A program's computation can therefore be modeled and represented through a read/write notation without any loss of generality.
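For illustration only, the following toy C model makes this read/write notation concrete: get(var, node) returns the current value of a named variable held by a node, and set(value, var, node) overwrites it, mirroring the convention used in the pseudocode figures below. The in-memory store, the node_id_t and var_id_t types, and the single int32_t value type are hypothetical scaffolding, not part of GAG or of any existing sensor network library.

#include <stdint.h>
#include <stdio.h>

#define MAX_NODES 8
#define MAX_VARS  4

typedef uint8_t node_id_t;   /* hypothetical node identifier */
typedef uint8_t var_id_t;    /* hypothetical named-variable index */

/* Toy stand-in for per-node program state: store[node][var]. */
static int32_t store[MAX_NODES][MAX_VARS];

/* r_n(x, n'): read variable x held by node n'. */
static int32_t get(var_id_t var, node_id_t node) {
    return store[node][var];
}

/* w_n(x, n'): write variable x held by node n'. */
static void set(int32_t value, var_id_t var, node_id_t node) {
    store[node][var] = value;
}

int main(void) {
    enum { X = 0, Y = 1 };        /* the two correlated variables of sequence (2.1) */
    node_id_t n1 = 1, n3 = 3;     /* stand-ins for nodes n' and n''' */

    set(42, X, n1);               /* w_n(x, n'): first of the two correlated writes */
    int32_t seen = get(X, n1);    /* r_n''(x, n'): another node reads x early... */
    set(7, Y, n3);                /* w_n(y, n'''): ...before the second write lands */

    printf("n'' saw x=%ld before y was written: a dirty read\n", (long)seen);
    return 0;
}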
Routing tree example. In routing tree computation, each node repeatedly broadcasts a routing beacon, which contains the node's current path metric to the root, and selects its routing parent based on beacon messages heard from neighbors. The programmer's expectation in writing a routing tree module is that all the neighbors hearing a beacon will process it consistently: i.e., there are no situations where one node uses stale beacon announcements while another node uses fresh beacons. If the program does not ensure this, routing loops can result. For example, consider nodes A, B, and C, where A is currently the parent of B and C. When A's parent fails, it advertises this fact to B and C, but because of processing delays and lost beacons, C does not process this update, while B does. So, C still thinks there is a valid path through A. B therefore transiently selects C as a parent and advertises that choice, which causes A to pick B as a parent, resulting in a three-node loop.

This is an instance of the dirty read problem because B has eagerly (or dirtily) read A's beacon and used this information to select C as the parent before C had a chance to process A's beacon as well. In terms of Definition 2.1, node n corresponds to A, n′ and n′′ to B, and n′′′ to C. Variables x and y refer to A's beacon announcing the route failure to nodes B and C, respectively. Ideally, both B and C should have processed A's beacon before B reads the beacon and reacts to it. If the beacon were processed at both B and C first, then B would not choose C as the parent, and in turn A would not choose B, thereby preventing the routing loop.

MultiHopLQI, the current default TinyOS routing layer, is susceptible to this dirty read problem. The pseudocode for how it builds a routing tree is shown in Figure 2.1. There are two independently triggered functions: process_routing_beacons, which is triggered whenever a node receives a routing beacon, and find_new_parent, which is triggered when the current parent is dead. These functions use a link quality metric known as LQI (Link Quality Indicator) in order to select a parent link, and, ultimately, a multihop path, to the root. The higher the LQI value, the better the offered quality of a link. A link's LQI value is embedded within the MAC header of a received packet. While LQI is thus a single-hop metric, it can be summed over multiple hops in order to derive a multihop metric.

process_routing_beacons(rt_beacon){
1:  if (LQI(rt_beacon)+LQI(contents(rt_beacon)) > cur_lqi_to_root) {
2:    parent=sender(rt_beacon);
3:    cur_lqi_to_root=LQI(rt_beacon)+LQI(contents(rt_beacon));
    }
4:  if (sender(rt_beacon)==parent) broadcast(routing_beacon);
}

find_new_parent(){
5:  parent=NULL; cur_lqi_to_root=0;
6:  broadcast(route_invalid_beacon);
7:  foreach (n in neighbor_nodes(self)){
8:    if(parent(n)!=NULL && LQI(self,n)+get(cur_lqi_to_root,n)>cur_lqi_to_root){
9:      parent=n;
10:     cur_lqi_to_root=LQI(self,n)+get(cur_lqi_to_root,n);
      }
    }
11: if (parent!=NULL) broadcast(route_valid_beacon);
}

Figure 2.1: Routing tree construction pseudocode.
process_routing_beacons takes a routing beacon parameter rt_beacon, which contains the summed LQI value from the sender of the beacon to the root. Thus, in line 1, we add the LQI value of the routing beacon to the LQI value of the beacon contents, which gives the LQI value from the current node to the root, and compare it against the currently stored LQI value, cur_lqi_to_root. If the received value is better, the node has found a new parent in line 2, and it adjusts its cur_lqi_to_root in line 3. Also, if the routing beacon happens to originate from the node's parent, the node broadcasts a new routing beacon (line 4), with an updated LQI value in the packet contents, so that the node's own neighbors can be aware of a potential path to the root.

In find_new_parent, the node tries to find a new parent after its current parent is dead. It can lose its current parent for two reasons: the link between the two could have gone bad, or the parent could have failed. After losing its parent, it resets its parent and cur_lqi_to_root (line 5). It then broadcasts a routing beacon invalidating routes through itself (line 6). In lines 7–10, the node searches for a potential parent among its available neighboring nodes. In line 8, the node first tests whether a neighbor has a valid parent, and whether it offers a higher-quality path to the root than all other neighbors it has considered so far. If so, it updates its parent value (line 9) and computes its new cur_lqi_to_root value appropriately (line 10). If it has found a parent, it broadcasts a route_valid_beacon to its own neighbors (line 11).

Dirty read is a problem here because the route_invalid_beacon write broadcast to the node's neighbors might not be read by all of them at the same time. For example, if A is the node executing find_new_parent, and B and C are its children, and B receives the route_invalid_beacon but C does not, then find_new_parent will be triggered at B, which causes it to use C as its parent (lines 7–10), because C still thinks there is a path through A. B will then announce a route_valid_beacon (line 11), which is received by A. This triggers process_routing_beacons at A, which means A selects B as its parent. Thus, we have a loop: C thinks A is its valid parent, B thinks C is its valid parent, and A thinks B is its valid parent. This problem arises fundamentally because A's writes of route_invalid_beacon to B and C were interleaved with B's read of C's parent in line 8.

Figure 2.2: Illustration of a routing loop with MultiHopLQI.

Figure 2.2 shows a 3-cycle routing loop occurring on a 50-node testbed consisting of TelosB motes embedded in the false ceiling of an office building. The wireless link conditions the nodes experience in this setup are typical of real deployments, consisting of both good and bad link conditions, which vary greatly with time. We ran a data collection application, which used MultiHopLQI to build the routing tree. In a 4-hour experiment on this testbed, we found 247 instances of routing loops ranging in length from 3 hops to 8 hops. In Section 2.5, we show that GAG can fix such problems with only minor source code modifications.

Localization example. We now consider ad-hoc localization of sensor node positions. In this application, every node in the network uses the estimated locations of its neighbors to iteratively improve its own estimate [SHS, MLRT]. The process continues at a node until its own estimate cannot be significantly improved.
Localization suffers from the dirty read problem because a remote node might read the intermediate value of a node's coordinates before these coordinates have converged in a particular iteration. In Definition 2.1, the two writes represent modifications to a node n's own location estimate while in the middle of an iteration of the algorithm. Therefore n′ and n′′′ refer to the same node n, while n′′ refers to a remote node that dirtily reads n's location before it has converged. Even in this special case where n, n′, and n′′′ refer to the same node (thus making the writes in Definition 2.1 local), techniques for preventing intra-node race conditions, such as nesC's atomic sections, are still insufficient. This is because in order for node n to refine its location estimate, it has to fetch data from other nodes, so the two local writes are separated by network calls.

Figure 2.3 shows the pseudocode for iterative localization. Actual localization systems use various ranging and bearing techniques to estimate their position relative to beacon nodes, and then use sophisticated algorithms, such as graph-based multilateration [SHS] or quadrilateral-based estimation [MLRT], to compute the node's own location. However, their communication behavior is still captured by the pseudocode in Figure 2.3.

iterative_localization(localization_beacon){
1:  foreach (n in neighbor_nodes(self)) {
2:    if(localized(n)) {
3:      x[n]=x_pos(n);
4:      y[n]=y_pos(n);
      }
    }
5:  recompute(x,y,x[n],y[n]);
6:  if(change(x,y)>threshold)
7:    broadcast(localization_beacon);
}

Figure 2.3: Localization pseudocode.

iterative_localization is triggered when a node hears a potential localization beacon. In lines 1–4, the node reads the latest position estimates of all neighbors that have valid coordinates, and stores them into the local arrays x[n] and y[n]. In line 5, it recomputes its own x and y coordinates using the collected position estimates of other nodes. If the new position estimate differs significantly from its old estimate (line 6), it broadcasts a new localization beacon itself, so that other nodes might also benefit from the node's new estimate. In this way, the nodes in the network converge to their correct coordinates.

The dirty read problem can affect correctness and convergence as follows. While n is executing iterative_localization, a neighboring node n′ could read n's coordinates anytime before the computation in line 5 completes (i.e., anytime during lines 1–5, which includes relatively long periods during which n reads the coordinates of neighboring nodes in lines 1–4). If n′ reads partially computed values of n's x and y coordinates while recompute (line 5) is executing, it will read incorrect results. If it reads them while n is executing lines 1–4, its convergence time increases, because it missed the opportunity to get a better position estimate of n had it waited until the end of line 5.

We ran the graph-based localization algorithm proposed in [SHS] on a testbed of 40 Imote2 nodes placed around an office floor 40 m x 25 m in size. We compared the computed values against ground truth, and found that the dirty read problem leads to inaccuracies of more than 6% in the computed locations. We describe a more detailed evaluation in Section 2.5.

Contaminant detection example. Our final example illustrates the dirty read problem in an application for detecting the spread of a contaminant. Every epoch, each sensor node tries to determine the presence of contamination by sampling its own sensor and comparing it with the status of a neighbor.
If both it and its neighbor are in the same state (i.e., both of them are submerged in the contaminant, or both of them are outside the contaminated region), the node does nothing, because a new contamination event is not required. However, if a neighbor is in a different state than its own (i.e., one of them is contaminated and the other is not), it raises a new detection event for the contaminant.

In order to make this example work correctly, the application writer would like to ensure that the two write actions of (a) updating the contaminant detection status at a node and (b) raising a contaminant detection event after testing the neighbor's status are uninterrupted. Otherwise, the contaminant will not be detected in some scenarios. Formally, contaminant detection suffers from dirty reads according to Definition 2.1 as follows: w_n(x, n′) corresponds to node n taking a sensor reading and updating its contaminant status, r_n′′(x, n′) corresponds to another node reading n's contaminant status, and w_n(y, n′′′) corresponds to n possibly signaling a new detection event.

contaminant_detect(){
1:  if (sensor_reading > threshold) contaminant_seen=TRUE;
2:  foreach (n in neighbor_nodes(self)) {
3:    contaminant_seen[n]=get(contaminant_seen,n);
4:    if(contaminant_seen[n]!=contaminant_seen){
5:      contaminant_detected=TRUE;
      }
    }
}

Figure 2.4: Contaminant detection pseudocode.

The pseudocode shown in Figure 2.4 is modeled after contaminant detection algorithms described previously [WM]. contaminant_detect is invoked periodically at each node. In line 1, the node checks its own sensor status. In lines 2–5, it compares its status against each of its neighbors. First, it reads the value of contaminant_seen from node n (line 3). If the two statuses differ (line 4), the node declares that a new contaminant has been detected (line 5).

Let us examine what this pseudocode does under the following scenario. In the first epoch, there is no contaminant, so n does not sense anything, and neither does its neighbor n′. Therefore neither node raises a detection event. In the second epoch, suppose the contaminant has spread fast enough to cover both n and n′. We would like at least one of the two nodes to detect this new contamination event. However, the pseudocode may not function correctly in this instance because both n and n′ change their status concurrently. This means n′ may read the changed (or dirty) status of n in line 3. Thus, the contaminant goes undetected by n′ in line 4. But the situation is symmetric for n, which means neither sensor may manage to detect the contamination event even though both of them sensed it.

We ran contaminant_detect on a 40-node testbed of Imote2 motes and found a 9% probability that a detection event might not be raised due to the dirty read problem. In a later section, we present the detailed results from this experiment, and show how GAG can be used to solve this dirty read problem.
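To make the symmetric interleaving described above concrete, here is a minimal, self-contained C sketch of two nodes running one epoch of contaminant_detect from Figure 2.4; the two-element arrays standing in for the two nodes are illustrative scaffolding, not code from the dissertation.

#include <stdbool.h>
#include <stdio.h>

int main(void) {
    /* Epoch 2 of the scenario in the text: the contaminant now covers both nodes. */
    bool contaminant_seen[2]     = { false, false };
    bool contaminant_detected[2] = { false, false };

    /* Line 1 of Figure 2.4 happens to run at BOTH nodes first... */
    contaminant_seen[0] = true;
    contaminant_seen[1] = true;

    /* ...and only then does each node compare against its neighbor (lines 3-4),
       so each performs a dirty read of the other's already-updated status. */
    for (int n = 0; n < 2; n++) {
        bool neighbor_status = contaminant_seen[1 - n];
        if (neighbor_status != contaminant_seen[n])
            contaminant_detected[n] = true;
    }

    /* Both flags remain false: the contamination event goes unreported. */
    printf("node 0 detected: %d, node 1 detected: %d\n",
           contaminant_detected[0], contaminant_detected[1]);
    return 0;
}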
2.2.2 The Lost Update Problem

The lost update problem arises when data written by one node is overwritten by a second node. More precisely, let

    r_n(x, n′), w_n(x, n′)

be a sequence of two operations executed at a node n, meaning that n reads and then writes a variable x at node n′. If another node n′′ also concurrently reads and writes the same variable x at n′ through another sequence of operations,

    r_n′′(x, n′), w_n′′(x, n′),

this could lead to the following overall execution ordering:

    r_n(x, n′), r_n′′(x, n′), w_n(x, n′), w_n′′(x, n′)        (2.2)

This ordering can produce incorrect results because node n's write is overwritten by n′′ (i.e., n's update was lost). Since this is an inter-node race condition, it too cannot be prevented through intra-node techniques such as nesC's atomic sections or task preemption guarantees in TinyOS: node n has no control over how n′ processes reads and writes, or over when node n′′ performs its reads and writes.

We illustrate this problem with two examples: a multi-target tracking application whose goal is to track two moving targets in a sensor field, and a data collection application which filters out faulty data before sending it down the data collection tree.

Multi-target tracking. Many target tracking algorithms [CHZ, Coa] use a master node to process sensor values using sophisticated particle filtering algorithms. In order to reduce communication overhead, the responsibility for this master computation migrates from node to node. Thus, a node close to the target is designated as the master in one epoch, and, as the target moves, a different node is designated as the master for the next epoch by the master node of the previous epoch. In doing so, the programmer does not want to lose track of a target at a master node because another node has concurrently tasked the master with tracking another target.

track(){
1:  foreach (n in neighbor_nodes(self))
2:    filter_states[n]=get(particle_filter_state,n);
3:  compute_target_position(filter_states[n]);
4:  new_master=find_new_master(filter_states[n]);
5:  remote_master_status=get(master_status,new_master);
6:  add(remote_master_status,target);
7:  set(remote_master_status,master_status,new_master);
}

Figure 2.5: Multi-target tracking pseudocode.

Figure 2.5 shows the pseudocode for target tracking, as proposed in [CHZ]. track is invoked at the current master. It retrieves the particle filter states of all its neighbors (lines 1–2). These states include the sensor values sampled in the recent epochs. Using the states of all of its neighbors, the master computes the likely position of the target (line 3). It also determines a new master that is best suited for tracking the target in the next epoch (line 4). Since a node such as new_master can be tracking multiple targets, the current master first reads a copy of the master_status list from new_master, which contains the list of all targets new_master is responsible for (line 5). It then locally adds the target to this list (line 6), and writes back the modified list in line 7. In the next epoch, new_master then correctly knows that it has an additional target to track.

Unfortunately, this approach can easily result in errors because of the lost update problem: if two different masters tracking two different targets read the same value of master_status in line 5 and try to write back their now-different locally modified lists in line 7, the first addition is lost, because it is completely overwritten by the master that executes line 7 second. In practice, we found that more than 4% of execution runs of target tracking on our 40-node Imote2 testbed lost a target and executed incorrectly due to this problem (Section 2.5).

Data collection with filtering. The lost update problem can also affect applications such as data collection.
Researchers are increasingly becoming aware of faulty data being returned by sensors under real-world deployment conditions [RSE+07, WAJR+, TPS+], where the usable data yield has been as low as 50%. Some of the reasons for anomalous data reported by nodes include physical sensors that are stuck at some values because of hardware interfacing problems, and sensors that return out-of-range or incorrect values because their calibration has drifted, or because the battery voltage supplied to the processor or sensor modules has become low. In response, researchers have proposed to rectify this problem by deploying fault detection and data filtering techniques directly inside the network [RBB+06]. For example, while collecting data along a tree, a parent node might choose to first filter data being received from children. This filtering needs to be done at a parent because only the parent can determine whether certain sensed readings, such as out-of-range values, are truly anomalous, by comparing them to values received from other nodes. Thus, whenever a new child joins a parent due to a routing change, it would then add itself to the filter list (the list of children of a node). In doing so, the programmer's requirement is that a node should not accidentally overwrite another node that is concurrently adding itself to the filter.

Conceptually, the list of children of a node is a data structure that should be maintained by a routing protocol. However, some sensor network routing protocols (e.g., MultiHopLQI) do not maintain this list, so an application that needs to filter out anomalous readings might need to implement the functionality to maintain it. Using a remote variable access facility, one way to maintain this list would be to have a child, when it finds a new parent, read that parent's list, add itself to that list, and write it back.

add_to_filter(){
1:  new_filter_list=get(filter_list,parent);
2:  add(self,new_filter_list);
3:  set(new_filter_list,filter_list,parent);
}

Figure 2.6: Data collection with filtering pseudocode.

However, such an implementation can easily lead to lost updates. Multiple children can simultaneously try to modify the filter list at the parent, and this could lead to some children being unable to add themselves to the list. Figure 2.6 shows the pseudocode for updating the filter list at a parent. add_to_filter is invoked at a node whenever there is a routing change. It reads the current filter list from the parent (line 1) into a local variable new_filter_list, updates it locally (line 2), and writes back the updated list (line 3). But if two nodes execute add_to_filter concurrently, a distributed race condition occurs, potentially leading to the lost update problem.

We used data sets from James Reserve, Great Duck Island, and Soil Monitoring [RSE+07] to simulate how correctly a data filtering component could function if it suffered from the lost update problem while maintaining the node membership list. We found that the lost update problem could cause more than 8% of the data to be unfiltered (Section 2.5).
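To make the race in add_to_filter concrete, the following minimal C sketch replays ordering (2.2) for two children of the same parent; the bitmask representation of the filter list and the child numbering are assumptions made only to keep the sketch self-contained.

#include <stdint.h>
#include <stdio.h>

/* Toy model of Figure 2.6: the parent's filter list as a bitmask of child IDs. */
static uint32_t parent_filter_list = 0;

int main(void) {
    /* Both children execute line 1 of add_to_filter before either reaches line 3,
       i.e., the ordering r_n, r_n'', w_n, w_n'' of (2.2). */
    uint32_t child2_copy = parent_filter_list;    /* r_n  (filter_list, parent) */
    uint32_t child3_copy = parent_filter_list;    /* r_n''(filter_list, parent) */

    child2_copy |= (1u << 2);                     /* add(self, ...) at child 2 */
    child3_copy |= (1u << 3);                     /* add(self, ...) at child 3 */

    parent_filter_list = child2_copy;             /* w_n  (filter_list, parent) */
    parent_filter_list = child3_copy;             /* w_n''(filter_list, parent): overwrites */

    /* Child 2's membership is lost; only bit 3 survives, so its data goes unfiltered. */
    printf("filter_list = 0x%x\n", parent_filter_list);
    return 0;
}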
2.3 Design and Implementation of GAG

In this section, we first describe the goals and requirements of the library, followed by GAG's API, and end with a description of its implementation.

2.3.1 Goals and Requirements

The design of GAG is driven by four goals. The most important goal of GAG is to allow application programmers to write programs that do not contain inter-node race conditions caused by dirty reads or lost updates. Deciding on the right abstraction is a challenge, since there are several available abstractions (and associated implementation choices), ranging from transactional approaches to optimistic concurrency methods. We provide a simple and lightweight abstraction called an atomic view, which guarantees freedom from inter-node race conditions, as explained below in Section 2.3.2.

A second goal is that GAG should make it easy for the programmer to add support for avoiding inter-node race conditions to an existing application that uses remote variable reads and writes but exhibits such race conditions. Such an approach enables reliable programming, allowing the programmer to read and update program logic without having to adopt a new programming paradigm.

A third, yet important, goal is that the implementation of GAG should respect the bandwidth and other resource constraints of sensor nodes. This requires careful attention to memory utilization and messaging cost, and a careful balance between keeping state in the GAG subsystem to simplify programming and relying on the programmer to manage potentially application-specific resources more efficiently.

Finally, we require that GAG be robust to node failures. This implies careful implementation of GAG internals, such as avoiding fate sharing by keeping the view state entirely local to the node requesting the view, using non-blocking communication, and having built-in support for simple yet effective failure detection through user-configurable acknowledgments and retries.

2.3.2 The GAG API for Atomic Views

GAG provides the abstraction of named atomic views. An atomic view is defined as a set of variables and their associated values. This set is implicitly associated with a single node (the node at which the atomic view is created). Thus, two nodes can independently create atomic views of the same set of variables. Each atomic view can be referenced by a locally unique identifier. The variables in an atomic view may belong to any node. When a variable v in an atomic view is first read, v's current value is read into the view. Subsequent reads or writes to v are reflected only in the atomic view, and not in the value of the original variable. Finally, GAG provides a facility to commit an atomic view. This operation updates all changed variables within the view in an all-or-nothing manner. Thus, if an atomic view contains two variables, v from node n and w from node n′, either both n and n′ are updated during the commit, or neither is.

An atomic view avoids the dirty read problem because variables added to a view are guaranteed not to have their updates seen by other views until a commit is called on the view. It avoids the lost update problem because a view either successfully commits if it does not conflict with any other view, or it does not commit at all, in case of conflicts. This allows the view to be retried until it commits, which guarantees that the committed view did not conflict with any other existing view. By avoiding these two problems, an atomic view guarantees correct execution for applications.

Figure 2.7 shows the API provided by GAG for using atomic views. There are calls to initialize an atomic view, associate a node's variable with the atomic view, read from and write to this variable, and commit writes consistently in order to guarantee the atomic view.
//initialize the atomic view identifier "aid"
result_t init(aid_t aid);
//associate "var" at "node" with "aid"
result_t assoc(aid_t aid, node_t node, var_t var);
//read value of "var" from "node" into local "lvar"
result_t read(aid_t aid, var_t *lvar, node_t node, var_t var);
//store value from local "lvar" into "var" at "node"
result_t write(aid_t aid, var_t lvar, node_t node, var_t var);
//commit all stored writes
result_t commit(aid_t aid, timeout_t timeout, uint retries);

Figure 2.7: GAG API.

All API calls take an atomic view ID (or aid) argument. The programmer can initialize a new view with a call to init, which initializes a new view identifier. This identifier has purely local visibility: GAG instances at remote nodes know nothing about identifier values used at the node. The assoc function associates a variable var at node node with a previously initialized aid. This action essentially adds the variable var to the set of variables that form the atomic view. This variable can be either a remote or a local variable. The type of this variable, var_t, can be user-defined. All other argument types are statically fixed by the API.

The programmer can read or write to a previously associated variable var residing at node using the read and write calls. These calls return success if the action completes successfully, or an error that the programmer can check for and take corrective action on. We describe the sources of error in these calls later, but a simple corrective action for some types of transient errors is to retry the operation until it succeeds. read reads the value of var residing at node into a local variable lvar; the remote variable must previously have been associated with the atomic view argument (i.e., aid). Similarly, write writes the value of the local variable lvar into the local atomic view for the remote variable var.

commit is the final operation in the API. It takes an aid argument, and atomically writes back all values previously written via write calls in that atomic view. The precise semantics of commit are that none of the nodes see the effects of previous writes to the view until commit is called, and all non-failed nodes are correctly updated after commit finishes. In order to correctly handle failed nodes, commit allows the programmer to specify a timeout, and the maximum number of retries it must attempt for any node within this timeout period. The API thus allows the programmer to define the semantics of failure, which considerably reduces the implementation burden while allowing application-level correctness, as further explained in the next section.

2.3.3 Implementation of GAG

The implementation of GAG has the following features: correct implementation of reads, writes, and commits; low communication overhead; low memory overhead; no fate sharing or blocking among nodes; split-phase internal communication; user-configurable failure detection; and fully user-controlled application-level retries. We discuss each of these features in turn.

Correct implementation of reads, writes, and commits. GAG uses read and write locks for enforcing atomic views. When a read operation is executed for a variable var at node n, GAG acquires a read lock before reading var. If the lock cannot be obtained because another node is writing to the variable, GAG returns an error. If the variable cannot be read because the remote node is no longer available, GAG returns a different error. If the lock was successfully obtained, the variable is read-locked at the remote node, and its value is returned to GAG.
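To make these calls concrete before continuing with the implementation details, the following sketch shows one way the racy add_to_filter routine of Figure 2.6 could be restructured around an atomic view. It is illustrative only: it reuses the names from Figure 2.6, follows the retry-and-backoff idiom of Section 2.4.2, and borrows the commit(aid,1,5) timeout and retry parameters used there for one-hop neighbors; it is not necessarily the version evaluated in Section 2.5.

add_to_filter(){
retry:
  init(aid);                                    //create a fresh local atomic view
  assoc(aid,parent,filter_list);                //add the parent's filter list to the view
  //read-lock and copy the parent's current list into the view; retry every 100ms
  while(!read(aid,&new_filter_list,parent,filter_list)) sleep(100);
  add(self,new_filter_list);                    //local update, visible only inside the view
  //stage the update; if the read lock cannot be upgraded, back off and start over
  if(!write(aid,new_filter_list,parent,filter_list)) {
    sleep(random(100));
    goto retry;
  }
  commit(aid,1,5);                              //atomic write-back: 1s timeout, 5 retries
}

If two children race, at most one write-lock upgrade can succeed; the loser backs off, rebuilds its view, and re-reads a list that already contains the winner's entry, so no update is lost.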
GAG also obtains a write lock the first time a write operation is executed for a variable. Write lock acquisition may fail for reasons similar to read lock acquisition: other nodes may be reading or writing the same variable, or the remote node may have failed, and GAG returns the appropriate error codes. All writes are done only to the atomic view. Locks are held until a commit is done, at which time the locks are released and any updates are written back reliably. If any node is unreachable during the commit operation (after a user-defined number of retries and a timeout, see below), updates to that node are ignored.

This execution is correct because reads, writes, and commits maintain the definition of an atomic view: it is safe for commit to write back updates because the locks acquired during reads and writes ensure the atomic view does not suffer from dirty reads or lost updates. This correctness property is maintained during lock-acquisition or node failures because no updates are committed at the time a read or write fails. This means the programmer can either retry (if it is a lock acquisition failure) or give up on the read (if it is a node failure) and take some application-specific action such as using a different acceptable node. If any node fails after a read or write but before a commit, correctness is still maintained, according to the definition of a commit: none of the nodes see the effects of previous writes to the view until commit is called, and all non-failed nodes are correctly updated after commit finishes. Since commit tries to reliably write back updates, and retries unacknowledged write attempts until the user-defined timeout and retries parameters are reached, commit meets a user-defined correctness definition. Reliable node failure detection is hard, and this design allows the programmer to specify the failure semantics.

Low communication overhead. GAG carefully minimizes the network overhead of maintaining the lock metadata. To reduce messaging, GAG combines both locking and data reading into one message, so further reads to the variable do not require network access. Also, once a write lock is obtained, GAG performs writes locally until commit is called. Finally, commit has no additional overhead if there are no failures (the normal case) because a write is done once, which is required even if the program does not use GAG. Thus, GAG implements the commit semantics cheaply, without requiring the costly and blocking two-phase commit protocol used in distributed database transactions. This is because GAG need not deal with data persistence and the accompanying failure recovery issues such as managing undo and redo actions [SH].

Low memory overhead. GAG maintains atomic views with minimal memory overhead by using three simple state tables, as shown in Figure 2.8. The Var table holds the state of the variables being used in the atomic views. This state information includes the name of the variable, whether the variable entry is valid, and the node at which the variable originally resides. The Aid table holds the association between an aid and the list of variables and their lock types (i.e., read or write lock) currently in the view. The Lock table holds the list of variables exported by the node to other remote nodes, including their lock types. Because these tables maintain only the essential information, the table entries are small, and the code for reads and writes can be inlined, GAG has a low memory footprint.
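Concretely, the three tables can be pictured as small fixed-size structures. The C sketch below is illustrative only: the aid_t, node_t, and var_t types are those of the API in Figure 2.7, while the field names, the lock_t encoding, and the MAX_VARS_PER_VIEW bound are assumptions for exposition rather than GAG's actual declarations.

#include <stdbool.h>

#define MAX_VARS_PER_VIEW 4      /* illustrative bound, not specified in the text */

typedef enum { LT_NONE, LT_READ, LT_WRITE } lock_t;

typedef struct {                 /* Var table: variables used in local atomic views */
  var_t  var;                    /* name of the variable */
  node_t node;                   /* node the variable originally resides at */
  bool   valid;                  /* whether this entry is in use */
} var_entry_t;

typedef struct {                 /* Aid table: view id -> {variable, lock type} list */
  aid_t aid;
  struct { var_t var; lock_t lt; } held[MAX_VARS_PER_VIEW];
} aid_entry_t;

typedef struct {                 /* Lock table: variables this node exports to others */
  var_t  var;
  lock_t lt;                     /* read or write; individual readers are not tracked */
} lock_entry_t;

Because the Lock table records only a variable and a lock type, and not the identity of remote readers or writers, its per-variable footprint stays constant regardless of how many nodes access the variable, which is consistent with the no-fate-sharing property described below.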
For example, both the ROM and RAM overheads for all applications considered in Section 2.5 are less than 20% of the application size.

No blocking or fate sharing. These are two important implementation features of GAG that provide robustness to node failures. First, read or write operations do not block at a remote node if a lock cannot be granted because of a conflict. This means GAG does not need to worry about deadlocks or priority inversions. The locking error condition returned by a read or a write means the programmer can make application-level decisions about how to retry reads or writes, which become optimization choices rather than correctness criteria.

Figure 2.8: Data structures used by GAG.

Second, no two nodes in the system need each other to function correctly, which means there is no fate sharing. The node using atomic views does not need any other node in order to function correctly, as discussed before. Also, the remote node sharing a variable does not maintain the list of readers or the writer in its Lock table. Instead, it only records whether a variable is accessed at all, and if so, whether the access is a read or a write. This means the readers or the writer may also crash, and the other nodes in the system will continue to function correctly. This is because a reader crash does not affect anyone else, while the crash of a writer is indistinguishable from an active writer, which means the remote variable owner immediately denies the lock to other nodes. The nodes can make an application-level decision about what to do next, just as they have to with normal lock-rejection replies.

Split-phase internal communication. GAG implements reads and writes using split-phase requests and replies, as shown in Figure 2.8. A queue manager is used to service outstanding requests and replies for reads, writes, and commits. A read request is processed by the remote queue manager, which returns the read lock and the variable data as a reply, after adding the variable to the Lock table. Requests for write locks during write operations are handled similarly. For commits, there can be multiple outstanding requests to write back values. Small queue sizes (16 queue buffers) are sufficient in practice for holding these requests and replies. The queue manager uses a loop-free version of MultiHopLQI routing in order to access remote nodes multiple radio hops away (Section 2.5).

User-configurable failure detection. The commit operation allows a programmer to specify the timeout and the retry count that it uses to declare a remote node as failed. For example, if the only writes performed by commit are to the node's local variables, the timeout and retries can be zero. In practice, given typical packet throughputs (up to ~100 packets/s), we found that a timeout of one second and a retry count of five work well for one-hop nodes. Correct estimation of timeout and retry parameters for failure detection of nodes multiple hops away is challenging, so we use a simple heuristic of two-second timeouts and eight retries, which has worked well in practice.

2.4 Fixing Inter-Node Race Conditions

In this section, we show how one can fix distributed race conditions using the GAG library. Before this, we describe a simple methodology that we have found useful for detecting inter-node race conditions.
2.4.1 Finding inter-node race conditions

In developing GAG, we have found it useful to build a simple semi-automated tool to find inter-node race conditions. This tool has been useful not just in detecting race conditions in programs that do not use GAG, but also in finding incorrect atomicity implementations that use GAG.

Basically, the tool logs every read and write from each instance of a running program. Then, it post-processes these logs to detect read-write sub-sequences that correspond to the dirty read or the lost update problem (these sequences are those that define the dirty read problem in Definition 2.1 and the lost update problem in Definition 2.2). The occurrence of such a sub-sequence indicates a potential race condition. We then manually analyze the application results to determine whether that particular read-write sub-sequence resulted in an incorrect application execution. While this approach is not perfect, it has worked well enough for us. We were able to find and fix problems in all applications this way. We also note that automated detection of race conditions is a very challenging problem in general [SBN+97], so we are content with simple rules that match typical usage scenarios of sensor networks.

2.4.2 Using GAG to fix race conditions

In this section, we take three of the five application examples discussed in Section 2.2 and show how to fix their inter-node race conditions using GAG. Space limitations prevent us from describing the other two applications (routing tree and data aggregation with filtering), but we have experimentally verified that our GAG modifications to these applications work correctly, and they require simple and localized changes similar to those described below. For any application, fixing an inter-node race condition using GAG involves wrapping the relevant reads and writes into atomic views, and determining where in the program to commit the atomic views.

Localization. Recall that the localization application can suffer from a race condition arising from a dirty read. To avoid this race condition, we need to ensure that writes done at a node are not visible to other nodes until all related later writes at the node have also committed. This allows other nodes to see the latest consistent values of the node's variables.

Figure 2.9 shows how GAG can be used to avoid the race condition in localization. The changes involve replacing the direct read of a variable with a call of the form read(aid,...), and the direct write to a variable with a call of the form write(aid,...), as explained below. The changes are straightforward and local, so the application does not change its code structure. Thus, the conceptual overhead of using GAG is low for the programmer. In Section 2.5, we show that the number of changed lines in the actual code is also small, amounting to less than 10% of the lines of code.

iterative_localization(localization_beacon){
 1: init(aid);
 2: assoc(aid,self,x); assoc(aid,self,y);
 3: foreach (n in neighbor_nodes(self)) {
 4:   if(localized(n)) {
 5:     assoc(aid,n,x); assoc(aid,n,y);
 6:     while(!read(aid,&x[n],n,x)){sleep(100)};
 7:     while(!read(aid,&y[n],n,y)){sleep(100)};
    } }
 8: recompute(x,y,x[n],y[n]);
 9: commit(aid,0,0);
10: if(change(x,y)>threshold)
11:   broadcast(localization_beacon);
}

Figure 2.9: Pseudocode for localization with GAG.

In lines 1–2, a node initializes an atomic view and associates its x and y values with it.
In lines 5–7, we replace the original statements x[n]=x_pos(n); y[n]=y_pos(n); with the corresponding read(aid,&x[n],n,x) and read(aid,&y[n],n,y) calls, after first associating these remote variables with the atomic view. The main difference is that read(aid,&x[n],n,x) can fail if some other node is writing to x at node n, in which case we retry the read every 100ms using a while loop. Note that these retries for read and write operations are application-level retries done by the programmer, and are conceptually different from the retries used internally by GAG for the commit operation. Finally, the node commits its updated x and y coordinates. Since these writes are purely local, we do not need timeouts or retries. Recall that the dirty read problem in localization arose because a remote node could attempt to read x and y before line 8 (Section 2.2). Committing these variables after line 8 avoids this.

Contaminant detection. Contaminant detection also suffers from a dirty-read race condition. Figure 2.10 shows the pseudocode for correctly detecting a contaminant under all scenarios. In order to avoid the race condition triggered by the dirty read problem, we would like to serialize the read action of one node with the write action of another node. This means at least one of the nodes will correctly determine the presence of the contaminant.

contaminant_detect(){
 1: init(aid);
 2: assoc(aid,self,contaminant_seen);
 3: if(sensor_reading > threshold)
 4:   while(!write(aid,TRUE,self,contaminant_seen)) sleep(100);
 5: foreach (n in neighbor_nodes(self)) {
 6:   assoc(aid,n,contaminant_seen);
 7:   while(!read(aid,contaminant_seen[n],n,contaminant_seen)) sleep(100);
 8:   if(contaminant_seen[n]!=contaminant_seen){
 9:     contaminant_detected=TRUE;
10: } }
11: commit(aid,0,0);
}

Figure 2.10: Pseudocode for contamination detection with GAG.

In lines 1–2, the node creates a new atomic view in aid and associates its local variable contaminant_seen with aid. In line 4, the write in the original statement "contaminant_seen = TRUE;" is replaced with the analogous write operation write(aid,TRUE,self,contaminant_seen). This write is retried in a while loop until the write lock is successfully acquired. Since this is a local write operation, the retry is cheap. In lines 6–7, the read in the original program, contaminant_seen[n]=get(contaminant_seen,n), is straightforwardly replaced with a read operation that is retried every 100ms if a read lock cannot be acquired. This read lock cannot be acquired only if node n is actively writing to its contaminant_seen, so this is an acceptable overhead in order to ensure correctness. Finally, in line 11, the node commits the write to its local contaminant_seen performed in line 4, so that other nodes may consistently see the result.

Let us revisit the contamination scenario described in Section 2.2.1 that led to an incorrect execution. In the first epoch, there is no contaminant at either node. In the second epoch, the contaminant affects both nodes. The fixed pseudocode correctly detects the contaminant in this case as well. This is because, with atomic views, the node that executes first will see the contaminant at itself while noticing that the other node has not yet detected it. So, it correctly sets contaminant_detected in line 9 and determines that there is a new contaminant.

Target tracking. For target tracking, we need to avoid the lost update problem caused when multiple masters concurrently designate the same node as the next master for multiple targets.
This involves making the read-modify-write sequence involved in the master-transfer process atomic. The required modifications are shown in Figure 2.11.

track(){
 1: foreach (n in neighbor_nodes(self))
 2:   filter_states[n]=get(particle_filter_state,n);
 3: compute_target_position(filter_states[n]);
 4: new_master=find_new_master(filter_states[n]);
 5: init(aid);
 6: assoc(aid,new_master,master_status);
 7: while(!read(aid,&remote_master_status,new_master,master_status)) sleep(100);
 8: add(remote_master_status,target);
 9: if(!write(aid,remote_master_status,new_master,
10:           master_status)) {
11:   sleep(random(100));
12:   goto 5; }
13: commit(aid,1,5);
}

Figure 2.11: Pseudocode for multi-target tracking with GAG.

In line 5, we initialize an atomic view as usual. In line 7, we substitute a normal read of master_status at new_master with a read associated with the atomic view. We copy the read data into a private variable remote_master_status, and update this private data as in the original program in line 8. In line 9, we write the updated value in remote_master_status into the atomic view containing the variable master_status. During this write, GAG upgrades the read lock for master_status acquired during the read in line 7 to a write lock. The modified data is then written back atomically by the commit in line 13.

This program prevents the lost update problem as follows. If two nodes concurrently try to modify the master_status list at new_master by first reading the same value of master_status in line 7, neither node will be able to incorrectly update master_status in line 9. This is because if both nodes already hold a read lock on master_status, then their attempts to upgrade their read locks to a write lock will each fail. Therefore, the two nodes must back off from this deadlock situation. This is an easy process when atomic views are supported. Each node retries the read-modify-write process from the beginning. In order to avoid being deadlocked on a retry, nodes use a simple randomized backoff procedure, sleeping for a random amount of time before contending for the write lock again. This procedure is shown in lines 10–12. In line 10, if the write is unable to acquire a write lock, the programmer retries execution from line 5 onward through the goto statement in line 12. This creates a new atomic view and discards the updates from the previous view. Before retrying, the program sleeps for a random amount of time between 0–100ms (line 11). Allowing the programmer to control retries this way removes the need for GAG to implement complicated deadlock detection and recovery algorithms for ensuring atomicity, and gives the programmer the flexibility to adjust the retries.

The first node to execute the retry can potentially grab the write lock on master_status while the second node is still asleep. Thus, we correctly serialize accesses in the presence of multiple writers that can potentially deadlock with one another, without imposing any message overhead when there is no contention for write locks.

2.5 Evaluation

We describe a detailed evaluation of the five applications presented in Section 2.2 in order to understand the impact of GAG in preserving application-level correctness, and what costs it incurs in terms of code modification complexity and performance metrics such as increases in application-level latency and message cost. We describe the methodology used to evaluate GAG, define our metrics, and then present the experimental results.

2.5.1 Methodology and metrics

We implemented GAG as a C library that can be linked to programs written for the Intel Imote2 platform.
The Imote2 nodes have a 320MHz XScale processor, connectors for sensor modules such as acoustic, magnetometer, and accelerometer sensors, a Zigbee 802.15.4 radio (CC2420), 32MB each of onboard flash and RAM, and can be battery-powered. The Imote2 runs both TinyOS and Linux, and we use Linux as the main evaluation environment in this paper. Our implementation ran on a testbed of 40 Imote2 nodes deployed around an office floor 40m x 25m in size, as shown in Figure 5.8.

We implemented each of the five applications described in Section 2.2 both with and without GAG support. In order to verify that the results we obtained were correct, we compared them against ground truths. For the routing tree and data collection applications, we also used data sets from three real sensor deployments: Great Duck Island, James Reserve, and a Soil Monitoring deployment in Bangladesh. The deployment topologies are also known. The smallest data set had over 25,000 timestamped samples from various sensors. For a consistent workload, we used 25,000 samples selected at random from the other two data sets for these two applications.

For the other three applications (localization, contaminant detection, and target tracking), the link connectivity between nodes in Figure 5.8 was obtained by setting the 802.15.4 radio power level to 0x7 (or -15dBm). We then ran our loop-free version of MultiHopLQI in order to build routing trees rooted at every node. Using this information, we obtained optimal any-to-any multihop routing topologies, and we statically hard-coded them into these applications.

We use two main metrics to evaluate GAG. They are justified by the design goals described in Section 2.3.1. The first is how much GAG improves application-level accuracy, and at what cost. This cost includes application-level latency and message overhead. Latencies and message overheads are measured at the application level, and therefore represent the metrics users care most about. They include all component costs of GAG, such as the cost of acquiring locks, performing commits, and application-level retries under contention.

Second, we examine how easily applications can be modified to use GAG. Since the task of fixing programs to avoid race conditions falls to the programmer, we would like to ensure that they are not unduly burdened. While programming effort is hard to quantify, we extensively examined three applications in Section 2.4.2 and showed that a small number of local fixes makes them correct. Here, we quantify the required changes.

Figure 2.12: Evaluation testbed of 40 Imote2 motes.

We also measured the ROM and RAM costs for the compiled binaries with and without GAG. We found that the overhead introduced by GAG was less than 15%. We do not show the detailed breakdown for each of the applications due to lack of space.

2.5.2 Results

Each of the five applications described in Section 2.2 exhibited correctness problems due to race conditions in some or all of our experimental runs, as quantified below. In each case, we ran the application augmented with GAG, and the errors due to race conditions were eliminated. Also, in all cases, the application-level cost imposed by GAG was less than 20% for both latencies and message overheads. In order to make these overheads explicit, we show them as absolute numbers in the results below. We used the procedure described in Section 2.4.1 for finding race conditions to verify that the errors we observed were in fact due to race conditions.
We also compared results both with and without GAG against ground truth to further verify that the errors were meaningful. We ran each experiment at least 10 times, and plot the averaged results along with 95% confidence intervals.

Routing loops. In this experiment, 40 nodes send their data to the root at a rate of 0.6 packet/s, which is equal to the sending rate averaged over the three data sets. We observed routing loops ranging from 3–7 nodes during this process. We plot the distribution of routing loops and their impact on data collection for a MultiHopLQI implementation in Figure 2.13. The primary y-axis shows the CDF of their frequency of occurrence, while the secondary y-axis shows the loss in data collection caused by these loops. Interestingly, while smaller loops occur more frequently, larger loops persist longer and have a larger impact on data loss when they occur. The net result is that data loss due to routing loops of any size ranges between 4–8%.

Figure 2.13: Plots of the distribution of routing loop sizes and the percentage of data lost as a result at each loop size.

GAG prevents these loops, and Figure 2.14 shows the increase in routing tree convergence time and the number of messages per node incurred as a result. The increased convergence time is under 2s, which compares well with the 30–60s required even without loops. The message overhead is also reassuring, because it grows linearly with the loop size and is less than 15% of the average packets per node in all cases. Since GAG prevents data loss once the routing tree is first established, applications such as data collection can thus benefit from it.

Figure 2.14: Plots of the increase in routing tree convergence time and the average per-node message overhead incurred by GAG.

We re-ran this routing tree experiment by substituting node failures for link failures in order to understand the behavior of GAG under node failures. We found no substantial difference in performance from Figure 2.14, which shows that GAG is also robust to node failures.

Localization. We ran the localization application described in Section 2.2.1 on the 40-node testbed. The number of neighbors available to a node, which forms its localization graph, ranged from 3–8. We plot the position error in localization caused by dirty reads in Figure 2.15. This error can be more than 6%, which can be unacceptable for some applications. Surprisingly, this error is larger if a node has more neighbors, because it has a higher chance of reading incorrect values from its neighbors, computing its coordinates incorrectly, and then propagating them to other nodes. Figure 2.16 shows the latency and message overheads incurred by a corrected version of the program that uses GAG. The latency increase for all localization graph sizes was less than 18% of the original application latency. The average message overhead per node was less than 12%. Furthermore, both these quantities increase linearly with the localization graph size.

Figure 2.15: Plot of localization error as a function of the localization graph size.

Contaminant Detection.
We ran experiments that modeled a contaminant originating near a random sensor node and spreading radially over the mesh of sensor nodes in the testbed. In Figure 2.17, we plot the number of contamination detection events lost as a function of the number of nodes in the network. When there are fewer total nodes, each node has fewer neighbors on average, which means the potential for race conditions is correspondingly lower. As the network size increases, the number of such missed instances can be as high as 9% of the total number of generated events. GAG can fix this scenario as well (Figure 2.18), with a latency overhead of less than 6% (or 0.2s), and an average message overhead across an entire experiment of less than 10 packets per node. Since GAG incurs these overheads only when there is an actual race condition, both these overheads increase more slowly with the network size than the contaminant detection miss rate itself.

Figure 2.16: Plots of the localization latency increase and the average per-node message overhead incurred by GAG.

Target tracking. We ran a target tracking experiment consisting of two targets moving in the sensor field formed by the testbed nodes. As explained in Section 2.2.1, there can be situations where a target might be completely lost due to the lost update problem. Each target moves once around the sensor field, in opposite directions. Figure 2.19 shows the percentage of time a target was lost in such an experiment as the network size increases. Interestingly, this percentage decreases with the number of nodes because the sensor nodes form a denser sampling field, which means the probability of two nodes overwriting each other decreases with network size. This failure rate stabilizes at around 5% because the rectangular layout of the testbed in Figure 5.8 means the two targets have to overlap for certain periods. GAG fixes all errors from such race conditions, with both latency and message overheads of less than 10% (Figure 2.20). Interestingly, these overheads keep decreasing with network size because our retry implementation (Section 2.4.2) ensures that contending nodes usually need at most one retry.

Figure 2.17: Plots of missed contaminant detection events with network size.

Filtered data collection. We ran a data collection experiment that fetched data from nodes according to the topologies and data rates specified in each of the three real data sets. Whenever a route changed, a node installed a filter for itself at its parent, as described in Section 2.2.1. These filters removed stuck-at and out-of-range data from the children. We measured the amount of unfiltered data seen at the root for each of these data sets (Figure 2.21). Such unfiltered data can be more than 8% of the total data, and depends primarily on the topology: the Great Duck Island set had fewer children on average than the Soil Monitoring set, which had highly clustered nodes. Here too, GAG prevented any unfiltered data from reaching the root. Figure 2.22 shows the overheads incurred as a result. This overhead is only incurred during the actual filter instantiation operation, and was less than 2% of the latency and message overhead of the entire application.
Figure 2.18: Plots of the increase in contamination detection latency and the average per-node message cost incurred by GAG.

Application              Source lines without GAG    Source lines with GAG
Routing tree                       370                        385
Localization                       641                        662
Contaminant detection              453                        479
Target tracking                    489                        503
Data collection                    243                        268

Table 2.1: Source size without and with GAG.

Code overhead. Table 2.1 shows the code size for the five applications with and without GAG. All examples required us to modify less than 15% of the source lines and added less than 10% to the final source code. Along with our detailed qualitative description of how one can easily convert an original program to one with GAG support, this result shows that GAG does not impose an undue burden on the programmer.

Figure 2.19: Plot of the percentage of target tracking failures with network size.

2.6 Related Work

We classify prior research related to support for correct execution into three main classes: several sensor network systems have addressed consistency and correctness issues, database systems have thoroughly explored transaction support, and consistency models have been examined in distributed systems.

Correctness in Sensor Networks. Levis et al. anticipate the need for addressing concurrency issues in sensor network applications [LGC], where they observe that programming abstractions must provide correct inter-node data synchronization in order to avoid race conditions during data sharing. However, they do not discuss solutions for such race conditions. To the best of our knowledge, the problem of providing correctness by preventing inter-node race conditions has not been studied before.

Figure 2.20: Plots of the increase in target tracking latency and the average per-node message overhead incurred by GAG.

There exists a large body of complementary literature that provides weak forms of correctness and consistency guarantees. For example, Trickle [LPCS] is a system for distributing code updates that uses eventual consistency semantics for correctness. The role assignment programming abstraction in [FR] allows a sensor network to be correctly and easily configured using a configuration language. Regiment [NMW] is a system that provides functional programming support for macroprogramming a large number of sensors. Since it has no side effects, it can provide correctness without worrying about dirty reads or lost updates. However, a functional-programming paradigm also means that applications that need in-place updates, such as routing tree construction, would be harder to implement. Pleiades [KGMG] is a language that provides strong consistency guarantees, but programs need to be written specifically in it. In contrast to all of these systems, GAG is a simple and efficient library that is usable by any node-level program, and it does not require the programmer to use new abstractions.

Finally, several node-level facilities have been proposed to guarantee correctness and simplify programming. nesC and TinyOS provide atomic sections within a task, detect potential race conditions using static program analysis, and guarantee that tasks run to completion and do not preempt each other [GLvB+]. Systems like the t-kernel [GSb] further enhance program reliability by using load-time processing to provide memory protection. However, as explained in Section 2.2, these facilities do not address inter-node race conditions because they have no control over when remote nodes read and write shared data.

Figure 2.21: Plot of unfiltered data from each data set.

Correctness in Databases. Databases [SH] provide stronger consistency guarantees than what GAG provides through atomic views. For example, they offer data persistence and allow nodes to recover from complex failures (such as cascading failures across nodes, and recursive failures during recovery). They achieve this through the transaction abstraction, which offers stronger consistency than atomic views: in addition to avoiding the dirty read and lost update problems, a transaction can provide stronger guarantees like repeatable reads, which means that when the transaction is re-executed from the beginning, the set of data read by the transaction is the same as before. Transactions also allow nesting, while atomic views are flat.

While suitable for database applications, such strong guarantees are unnecessary for sensor networks, and they add high overheads in terms of message cost, latency, and memory. They also sacrifice availability for consistency, being unable to function in the presence of network partitions. In contrast, GAG only avoids inter-node race conditions, which means the application writer can ensure the program works correctly under node failures and partitions. Also, as described in Section 2.3.3, several features of GAG make it more suitable for the sensornet domain.

Figure 2.22: Plots of the increase in filter instantiation latency and the average per-node message overhead incurred by GAG.

Correctness in Distributed Systems. Several distributed systems have examined weak and relaxed consistency models [AG] that meet the programmer's expectation of correctness. Such support includes gossip-based synchronization [DGH+], meeting application-specific consistency requirements [TTP+b], and providing strong consistency only for file metadata in distributed file systems [PGPH90]. By contrast, GAG provides strong consistency through atomic views, while allowing the programmer herself to define which variables need this consistency. This design prevents all inter-node race conditions, while retaining flexibility and efficiency.

While application-specific algorithms to handle inter-node race conditions, such as loop-free routing tree algorithms, have been well studied, they are not generic, unlike atomic views. Such algorithms also target optimality at the expense of high memory use, high communication cost, or algorithmic simplicity. For example, using a link-state routing approach costs too much memory for nodes because the entire network state is kept at each node, while using a path vector approach involves high communication costs because routing beacons have to contain full path information. In contrast, atomic views provide a generic and low-cost approach to consistency, while exposing several optimizations to the programmer.

Closest in spirit to GAG is the recent research on software Transactional Memory [LR06]. So far, Transactional Memory supports only single-node concurrency, such as for multi-core CPUs, while GAG tolerates node failures in a distributed setting. Also, the lossy and low-bandwidth radio links force GAG to use a pessimistic locking approach: locks are fetched during the initial read and write operations, instead of waiting until the commit operation as with Transactional Memory. This is so that, in case of conflicts, the node's program can recognize this condition and notify the programmer during the read/write operations themselves, instead of wasting bandwidth by doing additional reads and writes that would fail anyway. Transactional memory for multi-core CPUs, by comparison, typically trades off bandwidth for simpler read and write semantics.

2.7 Conclusions and Future Work

In this paper, we showed that inter-node race conditions exist in many common sensor network applications. These race conditions can cause significant errors in the applications and can confound programmer expectations. We classified inter-node race conditions into two categories, dirty reads and lost updates, and illustrated how they arise in several examples. These race conditions cannot be prevented by existing techniques for preventing intra-node race conditions in sensor networks.

We proposed atomic views as a simple but sufficient abstraction that prevents the dirty read and lost update problems, and we designed and implemented the GAG library. GAG has several properties important to sensor network environments, such as low resource requirements, robustness against failures, and user controllability. We showed how a programmer can find and fix errors cheaply using GAG. We evaluated GAG using real-world data sets and ground truths, and showed that it imposes a modest tradeoff in terms of application latency and messaging in return for providing correctness.

In future work, we would like to understand how well GAG works in real deployments, and we would like to optimize GAG both at the API and the implementation levels in order to further reduce overheads.

Chapter 3: Pleiades: A language with serializability guarantees

3.1 Introduction

Current practice in sensor network programming uses a highly concurrent dialect of C called nesC [GLvB+], which is a node-level language: a nesC program is written for an individual node in the network. nesC statically detects potential race conditions and optimizes hardware resources using whole-program analysis. nesC programs use the services of the TinyOS operating system [HSW+], which provides basic runtime support for statically linked programs. TinyOS exposes an event-driven execution and scheduling model and provides a library of reusable low-level components that encapsulate widely used functionality, such as timers and radios. TinyOS was designed for efficient execution on low-power, limited-memory sensor nodes called motes.

nesC and TinyOS provide abstractions and libraries that simplify node-level sensor-network application programming, but ensuring the efficiency and reliability of sensor network applications is still tedious and error prone (Section 3.2).
For example, the programmer must manually decompose a high-level distributed algorithm into programs for each individual sensor node, must ensure that these programs communicate with one another efficiently, must implement any necessary data consistency and control-flow synchronization protocols among these node-level programs, and must explicitly manage resources at each node.

We are pursuing an alternative approach to programming sensor networks that significantly raises the level of abstraction over current practice. The critical change is one of perspective: rather than writing programs from the point of view of an individual node in the network, programmers implement a central program that conceptually has access to the entire network. This change allows a programmer to focus attention on the higher-level algorithmics of an application, and the compiler automatically generates the node-level programs that properly and efficiently implement the application on the network. In the literature, this style of programming sensor networks is known as macroprogramming [WM].

We have instantiated our macroprogramming approach in the context of a modest extension to C called Pleiades, which augments C with constructs for addressing the nodes in a network and accessing local state from individual nodes. These features allow programmers to naturally express the global intent of their sensor-network programs without worrying about the low-level details of inter-node communication and node-level resource management. By default, a Pleiades program is defined to have a sequential thread of control, which provides a simple semantics for programmers to understand and reason about. However, Pleiades includes a novel language construct for parallel iteration called cfor, which can be used, for example, to iterate concurrently over all the nodes in the network or all one-hop neighbors of a particular node.

The Pleiades compiler translates Pleiades programs into node-level nesC programs that can be directly linked with standard TinyOS components and the Pleiades runtime system and executed over a network of sensor motes. The key technical challenge for Pleiades is the need to automatically implement high-level centralized programs in an efficient and reliable manner on the nodes in the network. The Pleiades compiler and runtime system cooperate to meet this challenge in a practical manner (Section 5.5.1). This chapter makes the following contributions:

1. Automatic program partitioning and migration for minimizing energy consumption. Energy efficiency is of primary concern for sensor nodes because they are typically battery-powered. Wireless communication consumes significant battery energy, so it is critical to minimize communication costs among nodes. Pleiades uses a novel combination of static and dynamic information in order to determine at which node to execute each statement of a Pleiades program. A compile-time analysis first partitions a program's statements into nodecuts, each representing a unit of work to be executed on a single node. The runtime system then uses knowledge of the actual nodes involved in a nodecut's computation to determine at which node it should be executed in order to minimize the communication overhead.

2. An easy-to-use and reliable concurrency primitive. Concurrent execution is a natural component of sensor network applications, since each sensor node can execute code in parallel. However, with concurrency comes the potential for subtle errors in synchronization that can affect application reliability.
To support concurrency while ensuring reliability, the Pleiades runtime system guarantees serializability for each cfor: the effect of a cfor loop always corresponds to some sequential execution of the loop. To achieve this semantics, the runtime system automatically synchronizes access to variables among cfor iterations via locks, relieving the programmer of this burden. Locking has the potential to cause deadlocks, so the compiler and runtime system also support a novel distributed deadlock detection and recovery algorithm for cfors.

3. A mote-based implementation and its evaluation. We have implemented Pleiades on the widely used, but highly memory-constrained, mote platform. The motes we use have 10kB of RAM for program variables and 48kB of ROM for compiled code. Our implementation generates event-driven node-level nesC code that is conceptually similar to what a programmer would manually write today. We evaluate three applications belonging to three different classes (Section 5.5). We first compare the performance of a sophisticated pursuit-evasion game macroprogram with that of a hand-coded nesC version written by others [GGJ+]. We find that the Pleiades program is significantly more compact (the source code is less than 10% as large), well-structured, and easy to understand. At the same time, the Pleiades implementation has performance comparable to the native nesC implementation. We then evaluate a car parking application that requires a strict notion of consistency and show that the Pleiades implementation of the concurrent execution is reliable. We finally demonstrate the utility of control flow migration with a simple network information gathering example.

Researchers have previously explored abstractions for programming sensor networks in the aggregate [WM, GGG, NKSI], as well as intermediate program representations to support compilation of such programs [NAW]. However, to our knowledge, a self-contained macroprogramming system for motes, one that generates the complete code necessary for stand-alone execution, has not previously been explored or reported on. Pleiades is also related to research on parallel and distributed systems. Unlike traditional parallel systems and research on automatic parallelization, we are primarily interested in achieving high task-level parallelism rather than data parallelism, given the loosely coupled and asynchronous nature of sensor networks. Further, we target concurrency support toward minimizing energy consumption rather than latency, since sensor networks are primarily power constrained. Unlike traditional distributed systems, Pleiades features a centralized programming model and pushes the burden of concurrency control and synchronization to the compiler and runtime. A more detailed comparison with related work is presented in Section 5.6.

3.2 The Pleiades Language

3.2.1 Design Rationale

Pleiades is designed to provide a simple programming model that addresses the challenges and requirements of sensor network programming. Pleiades' sequential semantics makes programs easy to understand and is natural when programming sensor networks in a centralized fashion. Concurrency is introduced in a simple manner appropriate to the domain, via the cfor construct for node iteration. At the same time, the sequential semantics is still appropriate for the purpose of programmer understanding, because Pleiades ensures serializability of cfors. This strong form of consistency and reliability is important for a growing class of sensor network applications, like car parking and the part of an application responsible for building a routing tree across the nodes. For these kinds of applications, we argue that Pleiades's sequential semantics is the right one. We have also used Pleiades for applications such as routing, localization, time synchronization, and data collection, which require consistency for at least some program variables. To our knowledge, no other macroprogramming system guarantees even weak forms of consistency.

While Pleiades provides a sequential semantics, it nonetheless efficiently and naturally supports event-driven execution. Pleiades has special language support for sensors and timers that provides a synchronous abstraction for event-driven execution. The synchronous semantics is easy for programmers to understand and fits well with the sequential nature of a Pleiades program. Under the covers, the language constructs are compiled to efficient event-driven nesC code.

3.2.2 Parking Cars with Pleiades

We illustrate the language features of Pleiades and the benefits they provide over node-level nesC programs through a small but realistic example application. It involves low-cost wireless sensors that are deployed on streets in a city to help drivers find a free space. (According to recent surveys [Sho], searching for a free parking spot already accounts for up to 45% of vehicular traffic in some metropolitan areas.) Each space on the street has an associated sensor node that maintains the space's status (free or occupied). The goal is to identify a sensor node with a free spot that is as close to the desired destination of the driver as possible. For ease of explanation, we define distance by hop count in the network, but it is straightforward to base this on physical distance.

We consider an implementation of this application in Pleiades as well as two node-level versions written in nesC [GLvB+]. We show that the Pleiades version is simultaneously readable, reliable, and efficient. Each of the two nesC versions is more complex and provides reliability or efficiency, but not both simultaneously.

Figure 3.1 shows the key procedure that makes up a version of the street-parking application written in Pleiades. When a car arrives near the deployed area, a space near the driver's indicated destination is found and reserved for it by invoking reserve, passing the car's desired location. The reserve procedure finds the closest sensor node to the desired destination and checks if its space is free. If so, the space is reserved for the car. If not, the node's neighbors are recursively and concurrently checked.

 1: #include "pleiades.h"
 2: boolean nodelocal isfree=TRUE;
 3: nodeset nodelocal neighbors;
 4: node nodelocal neighborIter;
 5: void reserve(pos dst) {
 6:   boolean reserved=FALSE;
 7:   node nodeIter,reservedNode=NULL;
 8:   node n=closest_node(dst);
 9:   nodeset loose nToExamine=add_node(n, empty_nodeset());
10:   nodeset loose nExamined=empty_nodeset();
11:   if(isfree@n) {
12:     reserved=TRUE; reservedNode=n;
13:     isfree@n=FALSE;
14:     return;
15:   }
16:   while(!reserved && !empty(nToExamine)){
17:     cfor(nodeIter=get_first(nToExamine);nodeIter!=NULL;
             nodeIter=get_next(nToExamine)){
18:       neighbors@nodeIter=get_neighbors(nodeIter);
19:       for(neighborIter@nodeIter=get_first(neighbors@nodeIter);
              neighborIter@nodeIter!=NULL;
              neighborIter@nodeIter=get_next(neighbors@nodeIter)){
20:         if(!member(neighborIter@nodeIter,nExamined))
21:           add_node(neighborIter@nodeIter,nToExamine);
22:       }
23:       if(isfree@nodeIter){
24:         if(!reserved){
25:           reserved=TRUE; reservedNode=nodeIter;
26:           isfree@nodeIter=FALSE;
27:           break;
28:         }
29:       }
30:       remove_node(nodeIter,nToExamine);
31:       add_node(nodeIter,nExamined);
32:     }
33:   }
34: }

Figure 3.1: A street-parking application in Pleiades.

The code in Figure 3.1 makes critical use of Pleiades's centralized view of a sensor network. We describe the associated language constructs in turn.

Node Naming. Pleiades provides a set of language constructs that allow programmers to easily access nodes and node-local state in a high-level, centralized, and topology-independent manner. The node type provides an abstraction of a single network node, and the nodeset type provides an iterator abstraction for an unordered collection of nodes. For example, variable n (line 8) in reserve holds the node that is closest to the desired position (the code for the closest_node function is not shown), and nToExamine (line 9) maintains the set of nodes that should be checked to see if the associated space is free.

The set of currently available nodes in the network is returned by invoking get_network_nodes(), which returns a nodeset. Pleiades also provides a get_neighbors(n) procedure that returns a nodeset containing n's current one-hop radio neighbors. In Figure 3.1, the reserve procedure uses get_neighbors (line 18) to add an examined node's neighbors to the nToExamine set. The Pleiades runtime implements get_neighbors by maintaining a set of sensor nodes that are reachable through wireless broadcast.

Node-Local Variables. Pleiades extends standard C variable naming to address node-local state. This facility allows programmers to naturally express distributed computations and eliminates the need for programmers to manually implement inter-node data access and communication. Node-local variables are declared as ordinary C variables but include the attribute nodelocal, as shown for the isfree variable (line 2) in Figure 3.1. The attribute indicates that there is one version of the variable per node in the network.

A node-local variable is addressed inside a Pleiades program using a new expression var@e, where var is a nodelocal variable and e is an expression of type node. For example, the reserve procedure uses this syntax to check if each node in nToExamine is free (line 23). An expression of the form var@e can appear anywhere that a C l-value can appear; in particular, a node-local variable can be updated through assignment.

All variables not annotated as nodelocal are treated as ordinary C variables, whose scope and lifetime respect C's standard semantics. In Pleiades, we call these central variables, to distinguish them from node-local variables. In our example code, reserved is a central variable (line 6), which is therefore shared across all nodes in the network.

Concurrency. By default, a Pleiades program has a sequential execution semantics. However, Pleiades also provides a simple form of programmer-directed concurrency. The cfor loop is like an ordinary for loop but allows for concurrent execution of the loop's iterations. A cfor loop can iterate over any nodeset, and the loop body will be executed concurrently for each node in the set. For example, the reserve procedure in Figure 3.1 concurrently iterates over the nodes in nToExamine (line 17) in order to check if any of these nodes is free.

While concurrency is often essential to achieve good performance, it can cause subtle errors that are difficult to understand and debug. For example, a purely concurrent semantics of the cfor in reserve can easily cause multiple free nodes to read a value of false for the reserved flag. This will have the effect of making each such node believe that it has been selected for the new car and is therefore no longer free. To help programmers obtain the benefits of concurrency while maintaining reliability, the Pleiades compiler and runtime system ensure that the execution of a cfor is always serializable: the effect of a cfor always corresponds to some sequential execution of the loop. In reserve, serializability ensures that only one free node will reserve itself for the new car; the other free nodes will see the updated value of the reserved flag at that point. Section 3.3.2 explains our algorithm for ensuring serializability of cfor loops.

Pleiades allows cfors to be arbitrarily nested. The serializability semantics of a single cfor is naturally extended to nested cfors. Intuitively, the inner cfor is serialized as part of the iteration of the serialized outer cfor. So, in Figure 3.1, the programmer could have replaced the simple for in line 19 with a cfor, and the execution would still be correct. It would also increase the available concurrency, because multiple threads from the nested cfor iterations would be active at a node. However, in this case it would not be efficient to use a cfor, because the message and latency overheads involved in starting and terminating the concurrent threads and remotely accessing nExamined and nToExamine would offset the potential concurrency gain from executing on multiple neighboring nodes of nodeIter. In general, a programmer must weigh the benefits of fine-grained concurrency through nested cfors against the start-up and finalization overheads of such concurrency.

Loose Variables. While serializability provides strong guarantees on the behavior of cfor loops, sensor network applications often have variables that do not need serializability semantics and can obtain timeliness and message efficiency benefits from a looser consistency model. Examples include routing beacons that are used to maintain trees for sensor data collection, and sensor values that need to be filtered or smoothed using samples from neighboring nodes. Pleiades lets a programmer annotate such variables as loose, in which case accesses to these variables are not synchronized within a cfor. The consistency model used for loose variables closely follows release consistency semantics [KCZ92]. Writes to a loose variable can be reordered. The beginning of a new cfor statement or the end of any active cfor statement acts as a synchronization point, ensuring that the current thread of control has no more outstanding writes.

In Figure 3.1, the variables nToExamine and nExamined are annotated as loose (lines 9 and 10) in order to gain additional concurrency and avoid lock overhead on them. These annotations are based on two observations: it is safe to examine a node in nToExamine multiple times, and only a cfor iteration on nodeIter can remove the candidate node nodeIter from nToExamine. Alternatively, the programmer can derive the same concurrency in this case without using loose by temporarily storing the set of nodes that would be added to nToExamine in line 21 and deferring the add_node operations on this set until after statement 31. In general, the programmer can derive maximum concurrency while ensuring serializability by organizing her code so that writes on serialized variables happen toward the end of a cfor.
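As an illustration of the kind of code the loose annotation is intended for, the following sketch computes a neighborhood average of a sensor sample at a node n. It is not taken from the thesis: the variables reading and smoothed, the smooth_at procedure, and the placement of the loose and nodelocal attributes are assumptions for exposition; only the cfor, nodelocal, @, and get_neighbors constructs described above are Pleiades features.

int nodelocal loose reading;     // per-node raw sample; unsynchronized access is acceptable
int nodelocal smoothed;          // per-node filtered value

void smooth_at(node n) {
  int sum=reading@n, cnt=1;      // central variables: serialized across cfor iterations
  node it;
  nodeset nbrs=get_neighbors(n);
  cfor(it=get_first(nbrs); it!=NULL; it=get_next(nbrs)) {
    sum=sum+reading@it;          // loose read: no lock traffic, value may be slightly stale
    cnt=cnt+1;
  }
  smoothed@n=sum/cnt;            // written once, after the cfor completes
}

Because reading is loose, the concurrent iterations fetch neighbor samples without acquiring locks, while the serialized updates to sum and cnt still yield a consistent total; this mirrors the smoothing use case mentioned above.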
By default, loose variables are still reliably accessed, but the programmer can further annotate a loose variable as unreliable, so that the implementation can use the wireless broadcast facility. In Section 3.4, we evaluate the street-parking example with reliable loose variables and a separate application that primarily uses unreliable loose variables.

Automatic Control Flow Migration. Ultimately a centralized Pleiades program must be executed at the individual nodes of the network. As described in Section 3.3.1, the Pleiades implementation automatically partitions a Pleiades program into units of work to be executed on individual nodes and determines the best node on which to execute each unit of work in order to minimize communication costs. For example, the first five statements of the code (lines 6-10) execute at the node invoking reserve. The implementation then migrates the execution of the statements in lines 11-16 to node n. This is because it is cheaper to simply transfer control to n than to first read isfree@n and later write it back if necessary. Similarly, each iteration of the cfor loop will execute at the node identified by the current value of nodeIter (line 17). While it does not happen in this example, the execution of a single cfor iteration can also successively migrate to other nodes.

3.2.3 Parking Cars with nesC

Pleiades provides several important advantages over the traditional node-level programming for sensor networks in use today. To make things concrete, we consider how the street-parking algorithm would be implemented in nesC. We describe two different nesC implementations: a centralized version that is relatively simple and reliable but highly inefficient, and a more complex distributed version that is efficient but unreliable. In contrast, the Pleiades version is both reliable and efficient.

3.2.3.1 A Centralized nesC Implementation

First, it is possible to implement a centralized version of the algorithm in nesC, wherein most of the algorithm is executed on a single node. The major advantage of this approach is its relative simplicity for programmers. However, this version is extremely inefficient in terms of both message cost and latency. Figure 3.2 shows the core functions that comprise such a program.

 1: module ReserveM {
 2:   uses { ... }
 3:   provides { ... }
 4: } implementation {
 5:   nodeset nToExamine, nExamined;
 6:   boolean reserved, isfree, is_remote_free;
 7:   node closest, reserved_node, req, iter, iter1;
 8:   pos dst;
 9:   task void reserve() {
10:     call Topology.closest_node(dst);
11:   }
12:   event void Topology.found_node(node n) {
13:     closest = n; req = TOS_LOCAL_ADDRESS;
14:     post transfer_control();
15:   }
16:   task void transfer_control() {
17:     uint8_t i;
18:     // Trigger remote doReserve() at "closest" node
19:     // Also, send "req" and "closest" node values
20:   }
21:   task void doReserve() {
22:     if (isfree) {
23:       reserved_node = TOS_LOCAL_ADDRESS;
24:       call MsgInt.send_reply(req, FOUND); }
25:     else {
26:       nToExamine = call Topology.get_neighbors();
27:       call RemoteRW.aread(nToExamine, ISFREE); }
28:   }
29:   event void RemoteRW.aread_done(done_t done) {
30:     if (done == ISFREE) continue_reserve();
31:     else if (done == NEIGHBORS) build_more_nodes();
32:   }
33:   void continue_reserve() {
34:     for (iter = get_first(nToExamine); iter != NULL; iter = get_next(nToExamine)) {
35:       remove_node(iter, nToExamine);
36:       add_node(iter, nExamined);
37:       if (is_remote_free = call RemoteRW.read(iter, ISFREE)) {
38:         reserved_node = iter; reserved = TRUE;
39:         call RemoteRW.awrite(iter, ISFREE, 0); }
40:     }
41:     if (!reserved)
42:       call RemoteRW.aread(nToExamine, NEIGHBORS);
43:   }
44:   void build_more_nodes() {
45:     nodeset nl;
46:     for (iter = get_first(nToExamine); iter != NULL; iter = get_next(nToExamine)) {
47:       nl = (call RemoteRW.read(iter, NEIGHBORS));
48:       for (iter1 = get_first(nl); iter1 != NULL; iter1 = get_next(nl))
49:         if (!member(iter1, nExamined))
50:           add_node(iter1, nToExamine); }
51:     call RemoteRW.aread(nToExamine, ISFREE);
52:   }
53: }

Figure 3.2: Reliable but inefficient street-parking in nesC.

 1: module ReserveM {
 2:   uses { ... }
 3:   provides { ... }
 4: } implementation {
 5:   boolean isfree, seen, reserved;
 6:   pos dst;
 7:   node start_node[], req, orig, reserved_node;
 8:   uint8_t cnt_start_node, hopcount;
 9:   task void reserve() {
10:     call Topology.closest_node(dst);
11:   }
12:   event void Topology.found_node(node n) {
13:     orig = TOS_LOCAL_ADDRESS;
14:     start_node[0] = n, req = n, hopcount = HOP_MAX;
15:     cnt_start_node = 1;
16:     post transfer_control();
17:   }
18:   task void transfer_control() {
19:     uint8_t i;
20:     for (i = 0; i < cnt_start_node; i++) {
21:       // Trigger remote doReserve() at every start_node[i]
22:       // Also, send each node our req, orig, hopcount values
23:     }
24:   }
25:   task void doReserve() {
26:     if (seen) return; else seen = TRUE;
27:     if (isfree) {
28:       reserved_node = TOS_LOCAL_ADDRESS;
29:       isfree = FALSE;
30:       call MsgInt.send_reply(req, FOUND); }
31:     else flood_neighbors();
32:   }
33:   void flood_neighbors() {
34:     nodeset nl = call Topology.get_neighbors();
35:     node iter;
36:     hopcount--;
37:     if (hopcount > 0) {
38:       cnt_start_node = 0;
39:       for (iter = get_first(nl); iter != NULL; iter = get_next(nl))
40:         start_node[cnt_start_node++] = iter;
41:       post transfer_control(); }
42:   }
43:   event void MsgInt.receive_reply(node rep, msg_t msg) {
44:     if (msg == FOUND) {
45:       if (!reserved) {
46:         reserved_node = rep;
47:         call MsgInt.send_reply(rep, ACCEPT);
48:         call MsgInt.send_reply(orig, FOUND); }
49:       else call MsgInt.send_reply(rep, REJECT); }
50:     else if (msg == REJECT) { isfree = TRUE; }
51:   }
52: } // end implementation

Figure 3.3: Efficient but unreliable street-parking in nesC.

The overall logic is similar to that of the Pleiades version from Figure 3.1. However, programmers must explicitly manage the details of inter-node communication. Because nesC uses an asynchronous, split-phase approach to such communication [GLvB+], the application's logic must be partitioned across multiple callback functions at remote read/write boundaries.

The control flow is as follows. A task reserve (line 9) is spawned on the node closest to the car, which, in turn, calls the closest_node function (line 10) in the Topology component (this component is not shown). Since all tasks in nesC run to completion, and since Topology.closest_node performs a split-phase lookup operation for the desired closest node, the callback function found_node is later invoked by Topology (line 12). The callback creates a new task transfer_control (line 14), which ultimately triggers doReserve on the closest node (line 21). The rest of the algorithm then runs centrally on the closest node. doReserve, executing on closest, either finds itself free (line 22) or creates the nToExamine set with its current neighbor set (line 26).
Next, it concurrently and asynchronously reads the isfree values at the nodes in nToExamine (line 27) using aread of the RemoteRW component (not shown). When the asynchronous read completes, it signals aread_done (line 29), and continue_reserve is called (line 30). Such reads are locally cached in the RemoteRW component, so that continue_reserve can synchronously read them in line 37. If no node with a free spot is found (lines 37-41), more neighboring nodes of the current nodes are searched using another asynchronous read (line 42), which ultimately calls build_more_nodes (line 31).

Since the code is executed on a single node, this approach maintains a relatively straightforward structure, similar to that of the Pleiades code. The main drawback of this approach to node-level programming is inefficiency. Message cost is high because the isfree value of every node is fetched and checked centrally at a single node. In contrast, the Pleiades version from Figure 3.1 uses a cfor to allow each node to locally process its own data, using the code migration techniques described in Section 3.3.1. Thus, even for small example topologies of two-hop radius, it can be shown that the Pleiades version requires around half the messages required by the nesC version; this message count for Pleiades includes all control overhead for code migration and for ensuring serializability of cfors. The concurrent cfor iterations in Pleiades also find a free spot earlier than is possible in the nesC version. In the nesC version, continue_reserve in line 42 waits on RemoteRW.aread for the neighbor sets of all nodes in nToExamine to be asynchronously read, and build_more_nodes in line 51 similarly waits until all remote isfree values in nToExamine are read.

3.2.3.2 A Distributed nesC Implementation

The Pleiades version of car parking in Figure 3.1 does a breadth-first search around the closest node, moving to the next depth in a distributed fashion only if no free slot is found in the current one. Unfortunately, a distributed implementation in nesC that provides the same behavior as the Pleiades version would be exceedingly complex. Such an implementation would require the programmer to manually implement many of the same concurrency control techniques that Pleiades automatically implements for cfors, as discussed in Section 3.3.2. For example, to ensure that exactly one free space is reserved for a car, the programmer would have to implement a form of distributed locking for conceptually central variables. In general, the use of locking would then require manual support for distributed deadlock detection or avoidance. Similarly, to ensure that the closest free space is always found, the programmer would have to manually synchronize execution across the nodes in the network, to ensure that a depth d is completely explored before moving on to depth d+1.

Therefore, in practice a distributed version in nesC would forgo synchronization, as shown in Figure 3.3, which performs a distributed flooding-based search around the closest node in order to find a free spot. The control flow is as follows. After reserve is invoked (line 9), doReserve is ultimately triggered, in a manner similar to the previous version. The only difference here is that doReserve may be active at multiple nodes that receive the flooding request and may be activated multiple times by several neighbors (lines 39-41). Since a node must process a request exactly once even if its doReserve is triggered multiple times by its neighbors, doReserve uses a flag seen (line 26) to ignore all but the first request.
To limit the number of duplicate requests at a node, the code also suppresses broadcasts to neighbors once the hopcount reaches 0 (line 37). This is an effective technique when the network diameter is unknown and when we want to ensure that the flooded requests prefer shorter hops from the flooding initiator (node req in line 14). receive_reply (line 43) is a callback that is invoked by the local message interface component MsgInt (not shown) whenever a remote node sends a message. When a spot is found at a remote node, that node sends FOUND to the flooding initiator (line 30), which rejects all but the first successfully replying node (lines 45-49). If a remote node is rejected, it sets itself back to free (line 50).

As described earlier, the Pleiades version performs a breadth-first search on the topology, distributedly determining whether there is a free slot at depth d before moving on to depth d+1. By contrast, the flooding approach starts the free-slot determination concurrently at all network nodes by flooding the transfer of control. Given this distinction, two things follow. First, the Pleiades approach is always more message efficient, since it avoids multiple requests to the same node. Second, the flooding approach has lower latency, since it can find a spot more quickly when the free spot is far away. The flooding approach is also much more efficient in terms of both messaging costs and latency than the centralized nesC version shown in Section 3.2.3.1.

Despite the latency advantage, the code in Figure 3.3 is significantly less understandable and reliable than the Pleiades version. The programmer is responsible for explicitly managing the communication among nodes. For efficiency, this requires maintaining information about hop counts and other network details. It also requires that conceptually "central" variables be packaged up and passed among the nodes explicitly, taking care to maintain consistency. For example, a special protocol is used in receive_reply (lines 44-50) to ensure a consistent view of the reserved flag, in order to avoid having multiple nodes be reserved for the same car. Similarly, in transfer_control (lines 21-22), a node explicitly sends the values of the node originating the request and the node closest to the destination that initiated the search. In the Pleiades version, the combination of central variables and cfors takes care of these low-level details automatically. Finally, the flooding version, unlike the other two versions, makes no guarantee that the first node to reply is the topologically closest node. So, if we want it to reliably return only a closest node, the req node executing MsgInt.receive_reply (line 43) must wait for an indeterminable amount of time before accepting a replying node, negating the latency advantage.

3.2.4 Other Features of Pleiades

Pleiades includes other language constructs to support the implementation of common sensor network idioms, which we briefly describe.

Sensors and Timers. As mentioned earlier, Pleiades uses special kinds of variables as an abstraction for sensors, which are critical components of sensor-network applications. Sensor readings are asynchronous events, and Pleiades provides a facility to synchronously wait for such an event to occur. In particular, Pleiades's wait function takes a sensor variable and returns when the sensor takes a reading. At that point, the associated variable contains the most recent reading and the program can take appropriate action.
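One possible sketch of this pattern for the parking application follows; the sensor variable occupancy, the reading value VACANT, and the monitor_spot procedure are illustrative assumptions, since this code is not shown in Figure 3.1:

sensor nodelocal occupancy;          /* assumed sensor variable bound to the spot's occupancy detector */

void monitor_spot(node spot) {       /* conceptually runs as its own module for each spot */
  while (TRUE) {
    wait(occupancy@spot);            /* blocks until that spot's sensor takes a reading */
    if (occupancy@spot == VACANT)    /* the most recent reading indicates the car has left */
      isfree@spot = TRUE;            /* re-enable the spot (Figure 3.1, line 2) for future
                                        reserve() requests */
  }
}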
In the car-parking application, for example, this mechanism is used to wait for notification that a parked car has left its spot, at which point the spot's sensor sets the associated isfree variable defined in line 2 of Figure 3.1 to TRUE (this operation is not shown), so that the spot can once again service remote reserve requests. A similar technique is used to model timers, which fire at some user-specified rate.

Modules. A Pleiades program consists of a number of modules, which are executed concurrently. Each module encapsulates a logically independent application-level computation, such as building a shortest-path tree rooted at a given node, computing an aggregate, or routing application data to a given node. A module is a set of functions that can invoke each other and define and use global and local variables of both central and nodelocal type. Since modules are meant to be independent tasks, we currently provide no synchronization among modules.

3.3 Implementation

This section describes the Pleiades compiler and runtime system. The Pleiades compiler is built as an extension to the CIL infrastructure for C analysis and transformation [NMRW]. Our compiler accepts a Pleiades program as input and produces node-level nesC code that can be linked with standard TinyOS components and the Pleiades runtime system. The Pleiades runtime system is a collection of TinyOS modules that orchestrates the execution of the compiler-generated nesC code across the nodes in the network.

The Pleiades compiler and runtime cooperate to tackle two key technical challenges. First, they must partition a Pleiades program into chunks that can be executed on individual nodes and determine at which node to run each chunk, striving to minimize communication costs. Second, they must provide concurrent but serializable execution of cfors. We discuss each challenge in turn.

3.3.1 Program Partitioning and Migration

Partitioning. The Pleiades compiler performs a dataflow analysis in order to partition a Pleiades program into a set of nodecuts. Each nodecut is then converted into a nesC task [GLvB+], to be executed by the Pleiades runtime system on a single node in the network. At one extreme, one could consider the entire Pleiades program to be a single nodecut and execute it at one node, fetching node-local and central variables from other nodes as needed (moving the data to the computation). The other extreme would be to consider each instruction in the Pleiades program as its own nodecut, executing it on the node whose local variables are used in the computation (moving the computation to the data). Both of these strategies lead to generated code with high messaging overhead and high latency: in the first case due to the on-the-fly fetching of individual variables, and in the second case due to the per-instruction migration of the thread of control.

We adopt a compilation strategy for Pleiades that lies in between these two extremes, involving both control flow migration and data movement. A nodecut can include any number of statements, but it must have the property that just before it is to be executed, the runtime system can determine the location of all the node-local variables needed for the nodecut's execution. We therefore define a nodecut as a subgraph of a program's control-flow graph (CFG) such that for every expression of the form var@e in the subgraph, the l-values in e have no reaching definitions within that subgraph.
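As a small illustration of this definition, consider lines 8 and 11-15 of Figure 3.1, reproduced here with explanatory comments of ours:

node n = closest_node(dst);   /* line 8: a definition of n */
if (isfree@n) {               /* line 11: n is used inside an @-expression, and the definition
                                 on line 8 reaches this use, so the two statements cannot share
                                 a nodecut; a new nodecut must begin at this test, which lets
                                 the runtime resolve the location of isfree@n before the
                                 nodecut starts executing */
  reserved = TRUE; reservedNode = n;
  isfree@n = FALSE;
  return;
}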
Given this property, the runtime system can retrieve all the necessary node-local and central variables concurrently, before beginning execution of a nodecut, which improves latency immensely over the first strategy above. At the same time, because the runtime system has information about the required node-local variables, it can determine the best node (in terms of messaging costs) at which to execute the nodecut, thereby obtaining the benefits of the second strategy above without the latency and message costs of per-statement migration.

Find-Nodecut(P)
 1  Compute the CFG G of the program P
 2  for all nodes n ∈ G
 3      do nodecut(n) ← entry(G)
 4  for all nodes n ∈ G
 5      do if n contains an expression of the form exp1@v
 6           then NC ← {n′ ∈ G | nodecut(n′) = nodecut(n)}
 7                RD ← {n′ ∈ NC | n′ contains a definition of v that reaches n}
 8                SUB ← the union, over rd ∈ RD, of the graph of all paths from rd to n
 9                D ← {n′ ∈ NC | n′ dominates n in NC and ∀ rd ∈ RD, n′ post-dominates rd in SUB}
10                pick some node d ∈ D as the entry node of a new nodecut
11                nodecut(d) ← d
12                ∀ n′ ∈ NC that are reachable from d in NC without traversing a back edge,
                    nodecut(n′) ← d
13  return the set of nodecuts formed

Figure 3.4: Algorithm for determining nodecuts.

Intuitively, the goal is to make each nodecut as large as possible, in order to minimize the control and data costs associated with a migration. Since a nodecut runs to completion without any further communication, this approach would statically minimize the total communication cost of a program. We make the goal of minimizing migrations precise by striving to minimize the total number of edges in the program's CFG that cross from one nodecut to another, since each such edge represents a migration of the dynamic thread of control from one sensor node to another. This optimization problem is exactly equivalent to the directed unweighted multi-cut problem, which is known to be NP-complete [CFR]. Therefore, instead of finding the optimal partition of a CFG into nodecuts, the Pleiades compiler uses a heuristic algorithm, shown in Figure 3.4, that works well in practice, as demonstrated in Section 3.4.

The algorithm starts by assuming that all CFG nodes are in the same nodecut and does a forward traversal through the CFG, creating new nodecuts along the way. For each CFG node n containing an expression of the form var@e, we find all reaching definitions of the l-values in e and collect the subset R of such definitions that occur within n's nodecut. If R is nonempty, we induce a new nodecut by finding a CFG node d that dominates node n and post-dominates all of the nodes in R. Node d then becomes the entry node of the new nodecut. Any such node d can be used, but our implementation uses simple heuristics that attempt to keep the bodies of conditionals and loops in the same nodecut whenever possible. The implementation also uses heuristics to increase the potential for concurrency. For example, the body of a cfor is always partitioned into nodecuts that do not contain any statements from outside the cfor, so that these nodecuts can be executed concurrently.

The five nodecuts computed by our algorithm for the street-parking example in Figure 3.1 are shown in Figure 3.5. Nodecut 2 is induced due to the use of isfree@n in line 11 of Figure 3.1, since n is defined in line 8. The transitions from nodecut 2 to 3 and from nodecut 3 to 4 are induced to keep the cfor body separate from statements outside the loop, as mentioned above. Further, an extra nodecut is induced within the cfor body (nodecut 5) to maximize read concurrency.
The heuristic attempts to separate read and written variables into different nodecuts, so that the acquisition of write locks, which is done before a nodecut starts execution, can be delayed until the write locks are actually required. In the current implementation we assume that a Pleiades program does not create aliases among node variables. Such aliasing has not been necessary in any of our experiments with the Pleiades language so far. It is straightforward to augment our algorithm for generating nodecuts to handle node aliasing by consulting a static may-alias analysis.

Control Flow Migration. The Pleiades runtime system is responsible for sequentially (ignoring cfor for the moment) executing each nodecut produced by the compiler across the sensor network. When execution of a nodecut C completes at some node n, that node's runtime system determines an appropriate node n′ at which to run the subsequent nodecut C′ and migrates the thread of control to n′. All of the Pleiades program's central variables migrate along with the thread of control, thereby making them available to C′. Because of the special property of nodecuts, the runtime system knows exactly which node-local variables are required by C′, so these variables are also concurrently fetched to n′ before execution of C′ begins.

To determine where the next nodecut should be executed, the runtime uses the overall migration cost as the metric. The runtime knows the number of node-local variables needed from each node for executing the next nodecut, as well as the distances (in radio hops) of these nodes relative to each other according to the current topology. The runtime chooses the node that minimizes the cost of transfers from within this set. For example, nodecut 2 (see Figure 3.5) accesses the node-local variable isfree@n as well as the two central variables reserved and reservedNode. The cost of running this nodecut at the node executing nodecut 1 is the cost of fetching the value of isfree from n at the beginning of nodecut 2 and writing it back if necessary. This cost is two reliable messages across multiple radio hops. By contrast, if the runtime at nodecut 1 hands off nodecut 2 to node n, the cost is that of transferring the thread of control along with the central variables, which is only one reliable message across the same number of hops. So, Pleiades executes nodecut 2 at n.

Since the nodecuts, along with the set of node-local variables accessed in each nodecut, are statically supplied by the compiler, our migration approach exploits a novel combination of static and dynamic information in order to optimize energy efficiency. We note that this approach does not require every node to keep a fully consistent topological map, but only the relative distances of the nodes involved in the nodecut. In our current implementation, nodes use a statically configured topological map in order to make the migration decision; we will explore lightweight, dynamic approaches to determining approximate topological maps as part of future work.

3.3.2 Serializable Execution of cfors

To execute a cfor loop, the Pleiades runtime system forks a separate thread for each iteration of the loop. We call the forking thread the cfor coordinator. Program execution following the cfor continues only once all the forked threads have joined. Each forked thread is initially placed at the node representing the value of the variable the cfor iterates over, and any subsequent nodecuts in the thread are placed using the migration algorithm for nodecuts described above.
A forked thread may itself execute a cfor statement, in which case that thread becomes the coordinator for the inner cfor, forking threads and awaiting their join.

To provide reliability in the face of concurrency, Pleiades ensures serializability of cfor loops. This allows programmers to correctly understand their Pleiades programs in terms of a sequential execution semantics. The Pleiades compiler and runtime ensure serializability by transparently locking the variables accessed in each cfor body. The use of locking has the potential to cause deadlocks, so we also provide a novel distributed deadlock detection and recovery algorithm.

Distributed Locking. To ensure serializability, the Pleiades implementation protects each node-local and central variable accessed within a cfor iteration with its own lock. We employ a pessimistic locking approach, since this consumes less memory than optimistic approaches such as versioning. To ensure serializability, a lock must be held until the end of the outermost cfor iteration being executed; thus, the implementation uses strict two-phase locking. However, locks are acquired on demand rather than at the beginning of the cfor iteration, thereby achieving greater concurrency. To further increase concurrency, our algorithm distinguishes between read and write locks. Readers can be concurrent with one another, while a writer requires exclusive access. The implementation acquires locks at the granularity of a nodecut. This allows the locks to be fetched along with the associated variables before the nodecut's execution, decreasing messaging costs.

Our algorithm acquires locks in a hierarchical manner. Each cfor coordinator keeps track of which locks it holds, the type of each lock (read or write), which of its spawned threads are currently using each lock, and which of its threads are currently blocked waiting for each lock. When a nodecut requires a particular lock, it asks the coordinator of its innermost enclosing cfor for the lock. If the coordinator has the lock, it either provides the lock or blocks the thread, depending on the lock's current status, and updates the lock information it maintains appropriately. If the coordinator does not have the lock, it recursively requests the lock from its own cfor coordinator, thereby handling arbitrarily nested cfors. Once the top-level cfor coordinator has been reached, it acquires the lock from the variable's owner and grants the lock to the requesting thread (which then grants the lock to its requesting thread, and so on down to the original requester). Once a thread has obtained the lock on a variable, it fetches the actual value of the variable directly from the owner. When a spawned thread joins, it returns its locks to its cfor coordinator, which may therefore be able to unblock threads waiting for these locks. Also, if any of the locks owned by the joining thread were write locks, before releasing the locks it writes back the current value of the variable at the owner. It is possible to argue that this locking scheme always results in a serializable execution of a cfor, but we omit the details due to space constraints.

Let us revisit the street-parking example in Figure 3.1. For each cfor iteration, the Pleiades runtime at the coordinator sends a message containing the fork command to each of the remote nodes selected for execution. Each node initially acquires a read lock and a write lock, respectively, on its own versions of the node-local variables isfree and neighbors.
isfree uses a read lock instead of a write lock, even though it can potentially be modified in line 26, because acquiring a read lock first and then upgrading it to a write lock if the conditional in line 23 succeeds significantly enhances concurrency. On receiving these locks, the threads fetch the variable values from the owners and begin concurrent execution of the initial nodecut of the cfor (nodecut 3 in Figure 3.5). Threads that run on nodes with an occupied parking space fail the if condition in line 23, release their locks, and join with the cfor coordinator. Threads on nodes that have a free space contend for a write lock on the central variables reserved and reservedNode and have to execute the second nodecut of the cfor sequentially. The first thread to do so is selected as the winner, and the other nodes do not change their isfree status.

Distributed Deadlock Detection and Recovery. While the locking algorithm ensures serializability of cfors, it can give rise to deadlocks. One possibility would be to prevent deadlocks altogether, for example via a static or dynamic global ordering on the locks. However, such an approach would be very conservative in the face of cfors containing multiple nodecuts, nested and conditional cfors, or cfors that contain updates to node variables, thereby overly restricting the amount of concurrency possible. Further, we expect deadlocks to be relatively infrequent. Therefore Pleiades instead implements a dynamic scheme for distributed deadlock detection and recovery. While such schemes can be heavyweight and tricky in general [Elm], we exploit the fork-join structure of a cfor to arrive at a simple and efficient state-based deadlock detection algorithm. Our algorithm requires only two bits of state per thread, does not rely on timeouts, and finds deadlocks as soon as it is safe to determine the condition. Furthermore, this algorithm is implemented by the compiler and runtime, without any programmer intervention.

We require every thread to record its state during execution, which is either executing, blocked, or joined. We define a cfor coordinator to be executing if at least one of the coordinator's spawned threads is executing, blocked if at least one of the coordinator's threads is blocked and none are executing, and joined if all of the coordinator's threads are joined. A thread can easily update its state appropriately as its locks are requested and released during the locking algorithm described above, in the process also causing the thread to recursively update the state of its cfor coordinator. The program is deadlocked if and only if the top-level cfor coordinator ever has its state set to blocked.

Once a deadlock has been detected, we use a simple recovery algorithm. Starting from the top-level cfor coordinator, we walk down the unique path to the highest thread in the tree of cfor coordinators that has at least two blocked child threads. We then release all locks held by these blocked threads and re-execute them in some sequential order. This simple approach guarantees that we will not encounter another deadlock after restart. To support re-execution, each thread records the initial values of all variables to which it writes, so that the variables previously updated at their owners can be rolled back appropriately during deadlock recovery. We assume that the iterations are idempotent, so there are no harmful side effects of re-execution. This is true of many sensor network programs, which primarily involve sensing and actuation as side effects.
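A minimal sketch of this state-propagation rule in C follows; the data-structure layout and function names are illustrative assumptions, not the Pleiades runtime's actual code:

#include <stddef.h>

typedef enum { EXECUTING, BLOCKED, JOINED } tstate;

typedef struct thread {
    tstate state;
    struct thread *coordinator;    /* NULL for the top-level cfor coordinator */
    struct thread **children;      /* threads spawned if this thread is a cfor coordinator */
    int nchildren;
} thread;

/* Recompute a coordinator's state from its children: executing if any child is
   executing, blocked if none are executing but at least one is blocked, and
   joined only when every child has joined. */
static tstate summarize(const thread *t) {
    int executing = 0, blocked = 0;
    for (int i = 0; i < t->nchildren; i++) {
        if (t->children[i]->state == EXECUTING) executing = 1;
        else if (t->children[i]->state == BLOCKED) blocked = 1;
    }
    if (executing) return EXECUTING;
    if (blocked)   return BLOCKED;
    return JOINED;
}

/* Record a thread's new state and propagate the change up the tree of cfor
   coordinators. Returns 1 when a deadlock is detected, i.e., when the
   top-level coordinator's state becomes blocked. */
int update_state(thread *t, tstate s) {
    t->state = s;
    if (t->coordinator == NULL)
        return s == BLOCKED;
    return update_state(t->coordinator, summarize(t->coordinator));
}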
3.4 Evaluation

We have implemented the Pleiades compiler and runtime described in Section 3.3. In this section, we describe an evaluation of this implementation for various applications, with Pleiades running on TelosB Tmote Sky motes. We first discuss the performance of a Pleiades application relative to a nesC implementation of that same application. Then, we quantify the performance of Pleiades support for serializability and nodecut migration.

Pleiades and nesC Comparison. We compare a Pleiades implementation of a Pursuit-Evasion Game (PEG) against a hand-coded node-level nesC implementation of the same application written by others [GGJ+] on a 40-node mote testbed. PEGs [SSW+] have been explored extensively in robotics research. In a PEG, multiple robots (the pursuers) collectively determine the location of one or more evaders using the sensor network, and try to corral them.

The mote implementation of this game consists of three components: a leader election module performs data fusion to determine the centroid of all sensors that detect an evader; a landmark routing module routes leader reports to a landmark node; in turn, the landmark routes reports to pursuers.

Application                                Optimal nodecuts    Nodecuts generated by Find-Nodecut
PEG                                        14                  21
Tree Building                              4                   6
Leader Election and Location Reporting     3                   4
Landmark Routing                           3                   6
Landmark-to-Pursuer Routing                4                   5
Street Parking                             3                   4
Sum/Max                                    2                   3

Table 3.1: Performance of Find-Nodecut.

The Pleiades version of PEG implements the leader election component of PEG and leverages the routing provided by the Pleiades runtime to route the leader reports directly to the pursuer. It is less than a tenth of the size of the nesC implementation in terms of lines of code (63 lines as opposed to 780). An important feature of this application is that it requires no serializability semantics for the core leader election module; in fact, the data we present below were obtained using a version of Pleiades that did not support serializability. We also implemented PEG on Pleiades with full serializability support for leader election, and found that it does not incur additional overhead due to locking, because leader election needs only read locks, which are acquired once at the beginning and retained until the end.

Figure 3.8 depicts the main application-perceived measure of performance, the error in position estimate on a topological (reduced) map of the environment [KB]. This figure is highly encouraging; the Pleiades program exhibits error comparable to the hand-crafted nesC program. The frequency of 2- and 3-hop errors is slightly higher for Pleiades-PEG than for mote-PEG. On the other hand, Pleiades-PEG does not incur the instances of 5-hop error that mote-PEG does.

We also measured the latency between when a mote detects an evader and when the corresponding leader report reaches the pursuer. Mote-PEG has noticeably lower latency than Pleiades-PEG, but for most nodes (about 80%), this latency difference is within a factor of two. This is because our implementation of Pleiades is unoptimized for handling cfor forks and joins, and because our nodecut placement implementation relies on relatively static hop-count information. There is scope for improving both significantly.

The average network overhead for mote-PEG is 193 messages per minute, while for Pleiades-PEG it is 243. The minimum and maximum network overhead is 137 and 253 messages per minute for mote-PEG, and 146 and 341 for Pleiades-PEG. While these results merit further study, they suggest that Pleiades performance can be comparable to that of node-level programming.
Serializability Evaluation. We ran the street-parking application of Figure 3.1 on a 10-node chain mote topology. This topology is an extreme configuration, and thus stresses our serial- izability implementation, because the efficiency of packet delivery in a chain of wireless nodes drops dramatically with the length of the chain. In our experiments, 10 requests for free spots arrive sequentially at the node in the center of the chain. To illustrate the power of Pleiades’s serializability guarantees, and to understand its performance, we ran four different versions of the application: SP-NL,inwhichweconfiguredthe Pleiadescompilerandruntimetodisablelocking; SP, which uses the complete Pleiades compiler and runtime for locking, deadlock detection and recovery; SPID-NR, in which we induced a deadlock into the application and configured the Pleiades runtime to disable deadlock recovery; and SPID, which uses the complete Pleiades im- plementation with the deadlock-induced application. To improve performance, we implemented message aggregation for lock requests and forwarded locks across consecutive nodecuts. As expected, SP and SPID execute correctly, assigning exactly one spot to each request. SPID-NR fails to allocate a spot to all but the first request; in the absence of recovery code, the programdeadlocksafterthefirstrequest. Finally,SP-NLviolatesthecorrectnessrequirementsof theapplication, correctlysatisfyingthefirstrequest, butassigningtwofreespotsineachdirection 75 of the center node for the next four requests; consequently, it also fails to satisfy the last four requests. Figure 3.9 plots the time taken to assign a spot to the request, and Figure 3.10 plots the total numberofbytestransmittedoverthenetworkforeachrequest. Thesamequalitativeobservations maybedrawnfrombothgraphs. SPandSPIDmessagecostandlatencyincreasesincesuccessive requests have to search farther out into the network to find a free spot. However, for the initial requests, the overhead of SP is comparable to that of SP-NL. Moreover, SPID message cost and latency are only moderately higher than SP. The difference is attributable to the sequential execution of the cfor threads during deadlock recovery, with rollback overhead being negligible. The periodic spikes in both plots arise because, for even-numbered requests, there are two free spots at the same distance away from the requester that contend to satisfy the request. These two free spots also cause a deadlock in the case of SPID. Finally, the latency and overhead of SP-NL flatten out for later requests because they each incur the same cost: they search the entire network for a free spot and fail, because spots were incorrectly over-allocated during earlier requests. Thus, our Pleiades implementation correctly ensures serializability and incurs moderate over- headfordeadlockdetectionandrecovery. Theabsoluteoverheadnumbersimplythatevenforthe request which encounters the highest overhead, the average bandwidth of a node used by Pleiades is around 250bps, with the maximum being 1kbps at the node where the requests come in. This is quite reasonable, considering that the maximum data rate for the TelosB motes is 250kbps. The absolute latency seems modestly high compared to the expected response time for human interactivity. For example, the last request takes almost a minute and a half to satisfy. 
This is an artifact of the end-to-end reliable transport layer that Pleiades currently uses, which waits for 2 seconds, before trying to resend a packet that has not been acknowledged as received. We believe that the overall latency can be significantly reduced by optimizing the transport layer. 76 The Benefits of Migration. Finally, we briefly report on a small experiment on a 5-node chain thatquantifiesthebenefitof Pleiades’scontrolflowmigration. Inthisapplication, anodeaccesses node-local nodesets from other nodes more than a hop away, so that application-level network information can be gathered. Without migration, the total message cost is 780 bytes, while, with migration, it is 120 bytes. Thus, we see that, even for small topologies, control flow migration can provide significant benefits. 3.5 Related Work Pleiadesisrelatedtomanyprogrammingconceptsdevelopedinparallelanddistributedcomputing. We classify related work into three broad categories. They are embedded and sensor systems languages, concurrent and distributed systems languages, and parallel programming languages. Embedded and Sensor Networks Languages. Several researchers have explored program- ming languages for expressing the global behavior of applications running on a network of less- constrained 32-bit embedded devices (e.g., iPAQs). Pleiades’s programming model borrows from our earlier work on Kairos [GGG], an extension to Python that also provides support for iter- ating over nodes and accessing node-local state. However, Kairos does not support automatic code migration or serializability. Kairos provides support for application-specific recovery mecha- nisms [GKMG], which Pleiades lacks. SpatialViews [NKSI] is an extension to Java that supports an expressive abstraction for defining and iterating over a virtual network. In SpatialViews, con- trol flow migrates to nodes that meet the application requirements. To avoid concurrency errors, SpatialViews restricts the programming model within iterators. Regiment [NW] is a functional programming language for centrally programming sensor net- works that models all sensor data generated within a programmer-specified region as a data stream. Regiment is a purely functional language, so the compiler can potentially optimize pro- gram execution extensively according to the network topology. On the other hand, since the 77 language is side-effect-free, it does not support the ability to update node-local state. For exam- ple, the car parking application would be much harder to write in Regiment. TinyDB [MFHHb] provides a declarative interface for centrally manipulating the data in a sensor network. This interface makes certain applications reliable and efficient but it is not Turing-complete. Because TinyDB lacks support for arbitrary computation at nodes, it cannot be easily used to implement the kinds of applications we support, like car parking. Research on Abstract Regions [WM] provides local-neighborhood abstractions for simplifying node-level programming. This work is focused on programmability and efficiency and does not provide support for consistency or reliability. Concurrent and Distributed Systems. Argus [Lis88] is a distributed programming lan- guage for constructing reliable distributed programs. 
Argus allows the programmer to define concurrent objects and guarantees their atomicity and recovery through a nested transactions facility, but makes the programmer responsible for ensuring serializability across atomic objects and for handling any application-level deadlocks. Recently, composable Software Transactional Memory (STM) [LR06] has been proposed as an abstraction for reliable and efficient concurrent programming. Also, Atomos [CMC + ] is a new programming language with support for implicit transactions and strong atomicity features. Our cfor construct, with its serializability semantics and nesting ability, is designed in a similar spirit—a concurrency primitive with simplicity, efficiency, reliability, and composability as goals. Unlike these systems, however, Pleiades derives concurrency from a set of loosely coupled and distributed, resource constrained nodes. Therefore, the Pleiades implementation of cfor emphasizes message and memory efficiency over throughput or latency. For the same reason, it uses a simple distributed locking algorithm for serializability and a novel low-state algorithm for distributed deadlock detection and recovery. Pleiades’ cfors are also similar to atomic sections in Autolocker [MZGB06]inthatbothimplementations usestricttwo-phase locking. ButAutolocker guarantees the absence of deadlocks through pessimistic locking, while Pleiades uses an optimistic 78 locking model in which locks are acquired or upgraded as needed, and any deadlocks are detected and recovered by the runtime. Approaches to automatic generation of distributed programs have also been explored. For example, Coign [HS] is a system for automatically partitioning coarse-grained components. Mag- netOS [LRW + ] also has support for partitioning a program written to a single system image abstraction. A program transformation approach for generating multi-tier applications from se- quentialprogramsisdescribedin[NT]. Allthesesystemsareprimarilymeantforpartitioningand distribution of programs into coarse-grained components, that can then be run concurrently on multiple nodes. Pleiades differs from these systems in generating nesC programs with fine-grained nodecuts and supporting lightweight control flow migration across such nodecuts. Parallel Processing Languages. Pleiades differs from prior parallel and concurrent program- ming languages such as Linda [GC] and Split-C [CADG + ] by obviating the need for explicit locking and synchronization code. Pleiades also differs from automatic parallelization languages such as High Performance Fortran [Koe92] by equipping the compiler and runtime with serial- izability facilities. This is because parallel programming languages focus on data parallelism on mostly symmetric processors, leaving to the programmer the responsibility of ensuring deadlock and livelock freedom at the application level. On the other hand, Pleiades offers task-level par- allelism, where data sharing among sensor nodes is common, and where it is desirable to offload the correct implementation of concurrency to the compiler and runtime. 3.6 Conclusions and Future Work Pleiades enables a sensor network programmer to implement an application as a central program that has access to the entire network. This critical change of perspective simplifies the task of programming sensor network applications on motes and can still provide application performance comparable to hand-coded versions. 
Pleiades employs a novel program analysis for partitioning central programs into node-level programs and for migrating control flow across the nodes. Pleiades also provides a simple construct that allows a programmer to express concurrency. This construct uses distributed locking along with simple deadlock detection and recovery to ensure serializability. Together, these features ensure that Pleiades programs are understandable, efficient, and reliable. Our implementation of these features runs realistic applications on memory-limited motes.

While our current Pleiades implementation is robust to one aspect of network dynamics (packet loss), the failure of a cfor coordinator can cause an application to fail. We are currently implementing support for handling node dynamics such as crashes and additions through a simple retry-based mechanism that extends the reliable routing and transport mechanisms already present in the runtime. The basic idea is that node failures trigger an undo mechanism similar to that already used for deadlock recovery, which allows the initiator of the computation to retry. This approach naturally fits the semantics of the cfor construct and complements our programmability, efficiency, and reliability contributions.

In future work, we intend to optimize the message and latency costs of our implementation by exploring more efficient message batching alternatives. We also plan to support various relaxed consistency models as alternatives to serializability. In addition, we would like to allow the programmer to easily trade off quality of results for time of distributed execution. Finally, we plan to examine approaches to specifying sophisticated power management policies in Pleiades.

Figure 3.5: Nodecuts generated for the street-parking example. (The figure shows the control-flow graph of the reserve procedure from Figure 3.1 partitioned into nodecuts 1 through 5, with transit edges marking migrations between nodecuts; the diagram itself is not reproduced in this transcript.)
execute(thread t)
 1  while true
 2    do switch next-operation(t)
 3         case read(x): if t does not have a read lock on x
 4                         then request-lock(x, read, t)
 5                              if lock not obtained
 6                                then set-state(t, blocked)
 7                                     suspend execution of t
 8         case write(x): if t does not have a write lock on x
 9                          then request-lock(x, write, t)
10                               if lock not obtained
11                                 then set-state(t, blocked)
12                                      suspend execution of t
13         case cfor(c): spawn-threads(c); return
14         case join: for each lock l owned by t
15                      do release-lock(l, t)
16                    send-join(t); return
17       execute-next-operation(t)

lock-granted(lock l, variable v, mode m, thread t)
 1  store lock l at t
 2  if t was suspended waiting for l
 3    then resume execution of t
 4         set-state(t, executing); return
 5  child_t = first in queue wanting a lock on v at t
 6  if mode of l == read and child_t wanted a write lock on v
 7    then return
 8  remove child_t from queue
 9  lock-granted(l, v, mode of child_t, child_t)
10  mark lock l as being used by child_t at t

release-lock(lock l, thread t)
 1  if t is the top-most level thread
 2    then copy l back to the owner of the variable locked by it
 3    else copy l back to cfor-coordinator(t)
 4  delete lock l at t
 5  mark l as unused at t
 6  if the queue of threads wanting l is non-empty at cfor-coordinator(t)
 7    then new_t = first in queue wanting l
 8         remove new_t from queue
 9         lock-granted(l, var of l, mode of new_t, new_t)
10         mark lock l as being used by new_t at cfor-coordinator(t)

Figure 3.6: Locking algorithm.

request-lock(variable v, mode m, thread t)
 1  if t is the top-most level thread
 2    then fetch lock l from the owner of the variable
 3         lock-granted(l, v, m, t); return
 4  if cfor-coordinator(t) does not have any locks on v
 5    then request-lock(v, m, cfor-coordinator(t))
 6         add t to the queue of threads wanting a lock on v
 7           at cfor-coordinator(t); return
 8  if m == read
 9    then if cfor-coordinator(t) has a read lock l on v
10           then lock-granted(l, v, m, t)
11                mark lock l as being used by t
12                  at cfor-coordinator(t)
13                return
14         if cfor-coordinator(t) has a write lock l on v
15           then if l is being used as a read lock or is free
16                  then lock-granted(l, v, m, t)
17                       mark lock l as being used by t
18                         at cfor-coordinator(t)
19                  else add t to the queue of threads wanting
20                         a lock on v at cfor-coordinator(t)
21                return
22  if m == write
23    then if cfor-coordinator(t) has a read lock l on v
24           then add t to the queue of threads wanting
25                  a lock on v at cfor-coordinator(t)
26                request-lock(v, m, cfor-coordinator(t))
27                return
28         if cfor-coordinator(t) has a write lock l on v
29           then if l is being used
30                  then add t to the queue of threads wanting
31                         a lock on v at cfor-coordinator(t)
32                       return
33                  else lock-granted(l, v, m, t)
34                       mark lock l as being used by t
35                         at cfor-coordinator(t)

send-join(thread t)
 1  terminate t
 2  if cfor-coordinator(t) has no more executing children
 3    then execute(cfor-coordinator(t))
 4  set-state(t, joined)

set-state(thread t, state s)
 1  // s = executing or s = blocked
 2  state(t) = s
 3  if s = executing
 4    then if t is the topmost-level thread
 5           then return
 6           else set-state(cfor-coordinator(t), executing)
 7    else detect-deadlock(t)

detect-deadlock(thread t)
 1  if t executes join
 2    then state(t) = joined
 3         detect-deadlock(cfor-coordinator(t))
 4  if t is a cfor coordinator
 5    then if ∄ c ∈ Children(t) with state(c) = executing
 6           then state(t) = blocked
 7  if state(t) = blocked
 8    then if t is the topmost-level thread
 9           then recover-deadlock(t)
10           else detect-deadlock(cfor-coordinator(t))

recover-deadlock(thread t)
 1  B = {c | c ∈ Children(t) and state(c) = blocked}
 2  if |B| > 1
 3    then for b ∈ B
 4           do for each lock l owned by b
 5                do release-lock(l, b)
 6              restart and execute b serially
 7    else if |B| = 1
 8           then recover-deadlock(b), where B = {b}

Figure 3.7: Deadlock algorithm.

Figure 3.8: PEG application error. (Plot of the fraction of reports versus position error for Mote-PEG and Pleiades-PEG; data not reproduced in this transcript.)

Figure 3.9: Street parking latency. (Plot of latency in seconds versus request ID for SP, SPID, SP-NL, and SPID-NR; data not reproduced.)

Figure 3.10: Street parking message cost. (Plot of total message cost in bytes versus request ID for SP, SPID, SP-NL, and SPID-NR; data not reproduced.)

Chapter 4: Kairos: An Eventual-Consistency Programming Language

4.1 Introduction

Two broad classes of programming models are currently being investigated by the community. One class focuses on providing higher-level abstractions for specifying a node's local behavior in a distributed computation. Examples of this approach include the recent work on node-local or region-based abstractions [WSBC, WM]. By contrast, a second class considers programming a sensor network in the large (this has sometimes been called macroprogramming). One line of research in this class enables a user to declaratively specify a distributed computation over a wireless sensor network, where the details of the network are largely hidden from the programmer. Examples in this class include TinyDB [MFHHa, MFHHb] and Cougar [FSG].

Kairos' programming model specifies the global behavior of a distributed sensornet computation using a centralized approach to sensornet programming. Kairos presents an abstraction of a sensor network as a collection of nodes (Section 4.3) that can all be tasked together simultaneously within a single program. The programmer is presented with three constructs: reading and writing variables at nodes, iterating through the one-hop neighbors of a node, and addressing arbitrary nodes. Using only these three simple language constructs, programmers implicitly express both distributed data flow and distributed control flow. We argue that these constructs are also natural for expressing computations in sensor networks: intuitively, sensor network algorithms process named data generated at individual nodes, often by moving such data to other nodes. Allowing the programmer to express the computation by manipulating variables at nodes allows us to almost directly use "textbook" algorithms, as we show later in detail in Section 4.3.2.

Given the single centralized program, Kairos' compile-time and runtime systems construct and help execute a node-specialized version of the compiled program for all nodes within a network. The code generation portion of Kairos is implemented as a language preprocessor add-on to the compiler toolchain of the native language. The compiled binary that is the single-node derivation of the distributed program includes runtime calls to translate remote reads and, sometimes, local writes into network messages. The Kairos runtime library that is present at every node implements these runtime calls, and communicates with remote Kairos instances to manage access to node state. Kairos is language-independent in that its constructs can be retrofitted into the toolchains of existing languages.
Kairos (and the ideas behind it) are related to shared-memory based parallel programming models implemented over message-passing infrastructures. Kairos is different from these in one important respect. It leverages the observation that most distributed computations in sensor networks will rely on eventual consistency of shared node state, both for robustness to node and link failure and for energy efficiency. Kairos' runtime loosely synchronizes state across nodes, achieving higher efficiency and greater robustness than alternatives that provide tight distributed program synchronization semantics (such as Sequential Consistency and variants thereof [AG]).

We have implemented Kairos as an extension to Python. Due to space constraints, we describe our implementation of the language extensions and the runtime system in detail in a technical report [GGG05b]. On Kairos, we have implemented three distributed computations that exemplify system services and signal processing tasks encountered in current sensor networks: constructing a shortest-path routing tree, localizing a given set of nodes [SHS], and vehicle tracking [JLZ]. We exhibit each of them in detail in Section 4.3 to illustrate Kairos' expressibility. We then demonstrate through extensive experimentation (Section 5.5) that Kairos' level of abstraction does not sacrifice performance, yet enables compact and flexible realizations of these fairly sophisticated algorithms. For example, in both the localization and vehicle tracking experiments, we found that the performance (convergence time and network message traffic) and accuracy of Kairos are within 2x of the reported performance of the explicitly distributed original versions, while the Kairos versions of the programs are more succinct and, we believe, easier to write.

4.2 Related Work

In this section, we give a brief taxonomy (Figure 4.1) of sensornet programming and place our work in the context of other existing work in the area. The term "sensornet programming" seems to refer to two broad classes of work that we categorize as programming abstractions and programming support. The former class is focused on providing programmers with abstractions of sensors and sensor data. The latter is focused on providing additional runtime mechanisms that simplify program execution. Examples of such mechanisms include safe code execution or reliable code distribution.

Figure 4.1: Taxonomy of Programming Models for Sensor Networks. (The figure divides the space into Abstractions and Support. Abstractions for global behavior are node-independent (TAG, Cougar, DFuse) or node-dependent (Kairos, Regiment, Split-C); abstractions for local behavior are data-centric (EIP, state-space) or geometric (Regions, Hood). Support covers composition (Sensorware, SNACK), distribution and safe execution (Maté, Tofu, Trickle, Deluge), and automatic optimization (Impala); the diagram itself is not reproduced in this transcript.)

We now consider the research on sensor network programming abstractions. Broadly speaking, this research can be sub-divided into two sub-classes: one sub-class focuses on providing the programmer abstractions that simplify the task of specifying the node-local behavior of a distributed computation, while the second enables programmers to express the global behavior of the distributed computation.

In the former sub-class, three different types of programming abstractions have been explored. For example, Liu et al. [LCL+] and Cheong et al. [CLLZ] have considered node group abstractions that permit programmers to express communication within groups sharing some common group state. Data-centric mechanisms are used to efficiently implement these abstractions.
By contrast, Mainland et al. [WM] and Whitehouse et al. [WSBC] show that topologically defined group abstractions ("neighborhoods" and "regions" respectively) are capable of expressing a number of local behaviors powerfully. Finally, the work on EIP [ABC+] provides abstractions for physical objects in the environment, enabling programmers to express tracking applications.

Kairos falls into the sub-class focused on providing abstractions for expressing the global behavior of distributed computations. One line of research in this sub-class provides node-independent abstractions—these programming systems do not contain explicit abstractions for nodes, but rather express a distributed computation in a network-independent way. Thus, the work on SQL-like expressive but Turing-incomplete query systems (e.g., TinyDB [MFHHb, MFHHa] and Cougar [FSG]) falls into this class. Another body of work provides support for expressing computations over logical topologies [BOP, BP] or task graphs [KWA+] which are then dynamically mapped to a network instance. This represents a plausible alternative to macroprogramming sensor networks. However, exporting the network topology as an abstraction can impose some rigidity in the programming model. It can also add complexity to maintaining the mapping between the logical and the physical topology when nodes fail.

Complementary to these approaches, node-dependent abstractions allow a programmer to express the global behavior of a distributed computation in terms of nodes and node state. Kairos, as we shall discuss later, falls into this class. As we show, these abstractions are natural for expressing a variety of distributed computations. The only other piece of work in this area is the recent Regiment system [NW]. While Kairos focuses on a narrow set of flexible, language-agnostic abstractions, Regiment explores how functional programming paradigms might be applied to programming sensor networks in the large. Split-C [CADG+] provides "split" local-global address spaces to ease parallel programming, a facility Kairos also provides through its remote variable access; however, Split-C confines itself to the C language, which lacks a rich object-oriented data model and a language-level concurrency model. Therefore, the fundamental concepts in these two works are language-specific.

Finally, quite complementary to the work on programming abstractions is the large body of literature devoted to systems in support of network programming. Such systems enable high-level composition of sensor network applications (Sensorware [BHS] and SNACK [GKE]), efficient distribution of code (Deluge [HC]), support for sandboxed application execution (Maté [LC]), and techniques for automatic performance adaptation (Impala [LSZM]).

4.3 Kairos Programming Model

In this section, we describe the Kairos abstractions and discuss their expressibility and flexibility using three canonical sensor network distributed applications: routing tree construction, ad-hoc localization, and vehicle tracking.

4.3.1 Kairos Abstractions and Programming Primitives

Kairos is a simple set of extensions to a programming language that allows programmers to express the global behavior of a distributed computation. Kairos extends the programming language by providing three simple abstractions. The first of these is the node abstraction. Programmers explicitly manipulate nodes and lists of nodes. Nodes are logically named using integer identifiers. The logical naming of nodes does not correspond to a topological structure.
Thus, at the time of program composition, Kairos does not require programmers to specify a network topology. In Kairos, the node datatype exports operators like equality, ordering (based on node name), and type testing. In addition, Kairos provides a node list iterator data type for manipulating node sets.

The second abstraction that Kairos provides is the list of one-hop neighbors of a node. Syntactically, the programmer calls a get_neighbors() function. The Kairos runtime returns the current list of the node's radio neighbors. Given the broadcast nature of wireless communication, this is a natural abstraction for sensor network programming (and is similar to regions [WM], and hoods [WSBC]). Programmers are exposed to the underlying network topology using this abstraction. A Kairos program typically is specified in terms of operations on the neighbor list; it may construct more complex topological structures by iterating on these neighbors.

The third abstraction that Kairos provides is remote data access, namely the ability to read from variables at named nodes. Syntactically, the programmer uses a variable@node notation to do this. Kairos itself does not impose any restrictions on which remote variables may be read where and when. However, Kairos' compiler extensions respect the scoping, lifetime, and access rules of variables imposed by the language it is extending. Of course, variables of types with node-local meaning (e.g., file descriptors and memory pointers) cannot be meaningfully accessed remotely.

Node Synchronization: Kairos' remote access facility effectively provides a shared-memory abstraction across nodes. The key challenge (and a potential source of inefficiency) in Kairos is the messaging cost of synchronizing node state. One might expect that nodes would need to synchronize their state with other nodes (update variable values at other nodes that have cached copies of those variables, or coordinate writes to a variable) often. In Kairos, only a node may write to its own variables; mutually exclusive access to remote variables is therefore not required, and this also eliminates the typically subtle distributed programming bugs that arise from managing concurrent writes.

Kairos leverages another property of distributed algorithms for sensor networks in order to achieve low overhead. We argue that, for fairly fundamental reasons, distributed algorithms will rely on a property we call eventual consistency: individual intermediate node states are not guaranteed to be consistent, but, in the absence of failure, the computation eventually converges. This notion of eventual consistency is loosely molded on similar ideas previously proposed in well-known systems such as Bayou [TTP+b]. The reason for this is, of course, that sensor network algorithms need to be highly robust to node and link failures, and many of the proposed algorithms for sensor networks use soft-state techniques that essentially permit only eventual consistency. Thus, Kairos is designed under the assumption that loose synchrony of node state suffices for sensor network applications. Loose synchrony means that a read from a client to a remote object blocks only until the referenced object is initialized and available at the remote node, and not on every read to the remote variable. This allows nodes to synchronize changed variables in a lazy manner, thereby reducing communication overhead. A reader might therefore be reading a stale value of a variable, but, because of the way distributed applications are designed for sensor networks, the nodes eventually converge to the right state.
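To make these read semantics concrete, the following minimal Python sketch models a cached remote variable under loose synchrony; the refresh interval and helper names are assumptions for illustration, not the actual runtime (Section 4.4).

# Sketch of loose-synchrony reads: block only until the remote variable exists,
# then serve possibly stale cached values and refresh them lazily.
import time

class CachedVariable:
    def __init__(self, fetch, refresh_interval=10.0):
        self.fetch = fetch                    # function that performs the network read
        self.refresh_interval = refresh_interval
        self.value = None
        self.initialized = False
        self.last_refresh = 0.0

    def read(self):
        if not self.initialized:
            # The first access blocks until the owner has initialized the variable.
            self.value = self.fetch()
            self.initialized = True
            self.last_refresh = time.time()
        elif time.time() - self.last_refresh > self.refresh_interval:
            # Later accesses refresh lazily; in between, a stale value may be returned.
            self.value = self.fetch()
            self.last_refresh = time.time()
        return self.value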
Where this form of consistency is inadequate, we provide a tighter consistency model, as described at the end of this section.

The Mechanics of Kairos Programming: Before we discuss examples of programming in Kairos, we discuss the mechanics of programming and program execution (Figure 4.2). As we have said before, the distinguishing feature of Kairos is that programmers write a single centralized version of the distributed computation in a programming language of their choice. This language, we shall assume, has been extended to incorporate the Kairos abstractions. For ease of exposition, assume that a programmer has written a centralized program P that expresses a distributed computation; in the rest of this section, we discuss the transformations on P performed by Kairos.

Figure 4.2: Kairos Programming Architecture

Kairos' abstractions are first processed using a preprocessor which resides as an extension to the language compiler. Thus, P is first pre-processed to generate annotated source code, which is then compiled into a binary P_b using the native language compiler. While P represents a global specification of the distributed computation, P_b is a node-specific version that contains code for what a single node does at any time, and what data, both remote and local, it manipulates. In generating P_b, the Kairos preprocessor identifies and translates references to remote data into calls to the Kairos runtime. P_b is linked to the Kairos runtime and can be distributed to all nodes in the sensor network through some form of code distribution and node re-programming facility [LPCS, HC]. When a copy is instantiated and run on each sensor node, the Kairos runtime exports and manages program variables that are owned by the current node but are referenced by remote nodes; these objects are called managed objects in Figure 4.2. In addition, it also caches copies of managed objects owned by remote nodes in its cached objects pool. Accesses to both sets of objects are managed through queues as asynchronous request/reply messages that are carried over a potentially multihop radio network.

The user program that runs on a sensor node calls synchronously into the Kairos runtime for reading remote objects, as well as for accessing local managed objects. These synchronous calls are automatically generated by the preprocessor. The runtime accesses these cached and managed objects on behalf of the program after suspending the calling thread. The runtime uses additional background threads to manage object queues, but this aspect is transparent to the application, and the application is only aware of the usual language threading model.

4.3.2 Examples of Programming with Kairos

We now illustrate Kairos' expressibility and flexibility by describing how Kairos may be used to program three different distributed computations that have been proposed for sensor networks: routing tree construction, localization, and vehicle tracking.

Routing Tree Construction

In Figure 4.3, we illustrate a complete Kairos program for building a routing tree with a given root node. We have implemented this algorithm, and evaluate its performance in Section 4.5.
Note that our program implements shortest-path routing, rather than selecting paths based on link-quality metrics [DABM]: we have experimented with the latter as well, as we describe below.

The code shown in Figure 4.3 captures the essential functionality involved in constructing a routing tree while maintaining brevity and clarity. It shows how a centralized Kairos task looks, and illustrates how the Kairos primitives are used to express such a task. Program variable dist_from_root is the only variable that needs to be remotely accessed in lines 18-19, and is therefore a managed object at a source node and a cached object at the one-hop neighbors of the source node that programmatically read this variable. The program also shows how the node and node_list datatypes and their APIs are used. get_available_nodes() in lines 6 and 13 instructs the Kairos preprocessor to include the enclosed code for each iterated node; it also provides an iterator handle that can be used for addressing nodes from the iterator's perspective, as shown in line 12. Finally, the program shows how the get_neighbors() function is used in line 12 to acquire the one-hop neighbor list at every node.

 1: void buildtree(node root)
 2:   node parent, self;
 3:   unsigned short dist_from_root;
 4:   node_list neighboring_nodes, full_node_set;
 5:   unsigned int sleep_interval=1000;
    //Initialization
 6:   full_node_set=get_available_nodes();
 7:   for (node temp=get_first(full_node_set); temp!=NULL; temp=get_next(full_node_set))
 8:     self=get_local_node_id();
 9:     if (temp==root)
10:       dist_from_root=0; parent=self;
11:     else dist_from_root=INF;
12:     neighboring_nodes=create_node_list(get_neighbors(temp));
13:   full_node_set=get_available_nodes();
14:   for (node iter1=get_first(full_node_set); iter1!=NULL; iter1=get_next(full_node_set))
15:     for(;;) //Event Loop
16:       sleep(sleep_interval);
17:       for (node iter2=get_first(neighboring_nodes); iter2!=NULL; iter2=get_next(neighboring_nodes))
18:         if (dist_from_root@iter2+1<dist_from_root)
19:           dist_from_root=dist_from_root@iter2+1;
20:           parent=iter2;
Figure 4.3: Procedural Code for Building a Shortest-path Routing Tree

The event loop between lines 15-20 that runs at all nodes eventually picks a shortest path from a node to the root node. Our implementation results show that the path monotonically converges to the optimal path, thereby demonstrating progressive correctness. Furthermore, the path found is stable and does not change unless there are transient or permanent link failures that cause nodes to be intermittently unreachable.

This event loop illustrates how Kairos leverages eventual consistency. The access to the remote variable dist_from_root need not be synchronized at every step of the iteration; the reader can use the current cached copy, and use a lazy update mechanism to avoid overhead. As we shall see in Section 4.5, the convergence performance and the message overhead of loose synchrony in real-world experiments are reasonable. We also tried metrics other than shortest hop count (such as fixing parents according to available bandwidth or loss rates, a common technique used in real-world routing systems [WTC]), and we found that the general principle of eventual consistency and loose synchrony can be applied to such scenarios as well.

Let us examine Figure 4.3 to see the flexibility that programming to the Kairos model affords.
If we want to change the behavior of the program to have the tree construction algorithm commence at a pre-set time that is programmed into a base station node with id 0, we could add a single line before the start of the for(){} loop at line 7: sleep(starting_time@0-get_current_time()). The runtime would then automatically fetch the starting_time value from node 0.

Distributed Localization using Multi-lateration

Figure 4.4 gives a complete distributed program for collaboratively fixing the locations of nodes with unknown coordinates. The basic algorithm was developed by Savvides et al. [SHS]. Our goal in implementing this algorithm in Kairos was to demonstrate that Kairos is flexible and powerful enough to program a relatively sophisticated distributed computation. We also wanted to explore how difficult it would be to program a "textbook" algorithm in Kairos, and compare the relative performance of Kairos with the reported original version (Section 4.5).

The goal of the "cooperative multi-lateration" algorithm is to compute the locations of all unknown nodes in a connected meshed wireless graph given ranging measurements between one-hop neighboring nodes and a small set of beacon nodes that already know their position. Sometimes, it may happen that there are not enough beacon nodes in the one-hop vicinity of an unknown node for it to mathematically multilaterate its location. The basic idea is to iteratively search for enough beacons and unknown nodes in the network graph so that, taken together, there are enough measurements and known co-ordinates to successfully deduce the locations of all unknown nodes in the sub-graph.

 1: void CooperativeMultilateration()
 2:   boolean localized=false, not_localizable=false, is_beacon=GPS_available();
 3:   node self=get_local_node_id();
 4:   graph subgraph_to_localize=NULL;
 5:   node_list full_node_set=get_available_nodes();
 6:   for (node iter=get_first(full_node_set); iter!=NULL; iter=get_next(full_node_set))
    //At each node, start building a localization graph
 7:     participating_nodes=create_graph(iter);
 8:     node_list neighboring_nodes=get_neighbors(iter);
 9:     while ((!localized || !is_beacon) && !not_localizable)
10:       for (node temp=get_first(neighboring_nodes); temp!=NULL; temp=get_next(neighboring_nodes))
        //Extend the subgraph with neighboring nodes
11:         extend_graph(subgraph_to_localize, temp, localized@temp||is_beacon@temp?beacon:unknown);
        //See if we can localize the currently available subgraph
12:       if (graph newly_localized_g=subgraph_check(subgraph_to_localize))
13:         node_list newly_localized_l=get_vertices(newly_localized_g);
14:         for (node temp=get_first(newly_localized_l); temp!=NULL; temp=get_next(newly_localized_l))
15:           if (temp==iter) localized=true;
16:         continue;
        //If not, add nodes adjacent to the leaves of the accumulated subgraph and try again
17:       node_list unlocalized_leaves;
18:       unlocalized_leaves=get_leaves(subgraph_to_localize);
19:       boolean is_extended=false;
20:       for (node temp=get_first(unlocalized_leaves); temp!=NULL; temp=get_next(unlocalized_leaves))
21:         node_list next_hop_l=get_neighbors(temp);
22:         for (node temp1=get_first(next_hop_l); temp1!=NULL; temp1=get_next(next_hop_l))
23:           extend_graph(subgraph_to_localize, temp1, localized@temp1||is_beacon@temp1?beacon:unknown);
24:           is_extended=true;
25:       if (!is_extended) not_localizable=true;
Figure 4.4: Procedural Code for Localizing Sensor Nodes

Figure 4.4 shows the complete code for the cooperative multi-lateration algorithm.∗
The code localizes non-beacon nodes by progressively expanding the subgraph (subgraph_to_localize) considered at a given node with the next-hop neighbors of unlocalized leaf vertexes (unlocalized_leaves), and is an implementation of Savvides' algorithm [SHS]. The process continues until either all nodes in the graph are considered (lines 20-25) and the graph is deemed unlocalizable, or until the initiator localizes itself (using the auxiliary function subgraph_check()) after acquiring a sufficient number of beacon nodes. This program once again illustrates eventual consistency, because the variable localized@node is a monotonic boolean, and eventually attains its correct asymptotic value when enclosed in an event loop. We also found interesting evidence of the value of Kairos' centralized global program specification approach: we encountered a subtle logical (corner-case recursion) bug in the original algorithm, which is described in [SHS] in a local (i.e., bottom-up, node-specific) manner, that became apparent in Kairos.

∗ Of course, we have not included the low-level code that actually computes the range estimates using ultrasound beacons. Our code snippet assumes the existence of node-local OS/library support for this purpose.

Vehicle Tracking

For our final example, we consider a qualitatively different application: tracking moving vehicles in a sensor field. The program in Figure 4.5 is a straightforward translation of the algorithm described in [JLZ]. This algorithm uses probabilistic techniques to maintain belief states at nodes about the current location of a vehicle in a sensor field. Lines 14-16 correspond to step 1 of the algorithm given in [JLZ, p. 7], where nodes diffuse their beliefs about the vehicle location. Lines 17-21 compute the probability of the observation z_{t+1} at every grid location given vehicle location x_{t+1} at time t+1 (step 2 of the algorithm) using the latest sensing sample and vehicle dynamics. Lines 23-25 compute the overall a posteriori probability of the vehicle position on the rectangular grid after incorporating the latest a posteriori probability (step 3 of the algorithm). Finally, lines 26-40 compute the information utilities, I_k's, at all one-hop neighboring nodes k for every node, and pick the k = argmax I_k that maximizes this measure (steps 4 and 5). This node becomes the new "master" node: i.e., it executes the steps above for the next epoch, using data from all other nodes in the process.

This program illustrates an important direction of future work in Kairos. In this algorithm, the latest values of p(z_{t+1}|x_{t+1})[x][y]@neighbors must be used in line 33 at the master, because these p(.)[x][y]'s are computed at each sensor node using the latest vehicle observation sample. With our loose synchronization model, we cannot ensure that the master uses these latest values computed at the remote sensor nodes, because stale cached values may be returned instead by the master's Kairos runtime, thereby adversely impacting the accuracy and convergence time of the tracking application. There are two possible solutions to this. One, which we have currently implemented in Kairos, is to provide a slightly tighter synchronization model that we call loop-level synchrony, where variables are synchronized at the beginning of an event loop (at line 11 of every iteration). A more general direction, which we have left for future work, is to explore temporal data abstractions.
These would allow programmers to express which samples of the time series p(.)[x][y] from remote nodes are of interest, while possibly allowing Kairos to preserve loose synchrony.

4.4 Kairos Implementation

We have implemented the programming primitives discussed in the previous section, and have experimented with the three distributed algorithms described therein. In this section, we sketch the details of our implementation of the Kairos extensions and the Kairos runtime support.

Implementation Platform. We have implemented the Kairos extensions to Python on the Stargate platform [Incc]. Our choice of the Stargate platform was dictated by expediency, since it allowed us to quickly prototype the main ideas behind Kairos, while using Mica2 [Inca] motes as "dumb" but realistic network interfaces; we describe the hardware details of our implementation in Section 4.5. However, we believe it is possible to extend nesC [GLvB+] and TinyOS [HSW+] to implement Kairos directly on the motes without requiring Python and Stargates to control them, but we have left that to future work.

Our choice of the language to implement Kairos is perhaps a little non-standard. Python is an interpreted language commonly used for scripting Internet services and system administration tasks, and is not the obvious choice for a sensor network programming language. We note that other research has proposed extending interpreted languages like Tcl [BHS] for ease of scripting sensor network applications. That was not our rationale for selecting Python, however. Rather, we selected Python because we were familiar with its internals, and because Python has good support for both embedding the language into a bigger program (Kairos, in our case) and dynamically extending the language data types, both of which enabled us to relatively easily implement the Kairos primitives.

In the current incarnation, Kairos consists of a preprocessor/parser for Python that dynamically introduces new data types, and uses Python extensibility interfaces [vRJ] to "trap" from the Python interpreter into the Kairos runtime, thereby redirecting accesses to Python objects handled by the Kairos runtime (these are the managed and cached objects in Figure 4.2). The Kairos runtime is implemented in C, and embeds the Python interpreter using Python's embedding APIs [vRJ]. It services read/write requests for managed objects and remote read requests for cached objects from Python. It also manages the object queues shown in Figure 4.2, and uses Mica2 motes for accessing the multihop wireless network. The Kairos and Python runtimes together use about 2MB of memory for the examples we tested (ignoring standard shared libraries like libc, etc.), and can fit comfortably on the Stargates. Thus, a Kairos program is simply a Python script that uses Kairos primitives in addition to the standard Python language features. It is first preprocessed by our preprocessor, and then interpreted by the embedded Python interpreter. In what follows, we describe the innards and actions of the preprocessor and the runtime from a language-neutral viewpoint, even though the specifics of the actual implementation may differ slightly from language to language depending on the language semantics and the external-world interfaces it provides.

The Kairos Preprocessor. Kairos' preprocessor transforms a centralized piece of code that expresses a distributed computation into a node-specific version by generating additional code that inserts calls into the runtime layer.
For example, consider lines 14-20 of the code in Figure 4.3, which create and use a node group iterator over the node's one-hop neighbors, which are only known at runtime. The dist_from_root variable at these remote nodes is accessed inside the loop. The Kairos preprocessor replaces accesses to this variable with inlined compilable code that invokes a binary messaging interface between the application and the runtime. This RBI (Runtime Binary Interface) specifies how the request, reply, and data messages for reads and writes are communicated between the application and the runtime. In the case of Python, this RBI essentially takes the form of synchronous object accesses whose methods are implemented in the Kairos runtime using well-defined external object access APIs.

Kairos' preprocessor recognizes the remote reference type and size from its declaration in the program. Since, for a variable to be accessed using the variable@node notation, a variable V should already have been declared and defined for the local node in the first place (as a malloc()'ed or static global variable, or as a local variable on the stack), the preprocessor simply creates space for exactly one additional copy of the variable. Then, for each read of a remote object of the form V@N, the preprocessor creates a structure in the RBI consisting of two slots: one specifying the node N, and one for the variable V. The preprocessor already knows the fixed size of the node slot from its ADT (Abstract Data Type) definition, and it dynamically builds the space for the variable slot depending on the variable type declaration. It then emits source code that, at runtime, copies the value of the node variable N from the application-private location into the first slot, and the identification (i.e., variable location, basic block number, and, for loop-level synchrony, loop iteration number) of the variable being accessed into the second slot. The runtime satisfies the read request from the remote node N if the cache is outdated or missing the particular variable, and returns a copy of the cached value to the application, which then proceeds to copy it into the second, application-private location allocated for V. Reads and writes to local managed objects work similarly, except that both of them are always executed non-blockingly by the runtime. The reason we first copy the returned object into application-private address space is that we can then let the host compiler or interpreter infrastructure statically or dynamically typecheck these remote object accesses, so that Kairos' compiler additions can be restricted to the preprocessor stage. This idea is key to keeping Kairos simple to implement and semantically conformant with the variable manipulation rules of the language.

The Kairos Runtime. Conceptually, the code generated by the preprocessor would be distributed using a reliable dissemination protocol [IGE] to all the nodes in the network. We have omitted this step in our implementation, and manually install the preprocessed code on our Stargates. We expect that doing so does not significantly reduce the realism of our experiments; Kairos assumes eventual consistency, and even if the code distribution protocol were to instantiate application code at different times on different nodes, correctness would be preserved.
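As a purely illustrative model of the two-slot read request described above, the preprocessor-generated code might hand the runtime something like the following; the field and function names are hypothetical, not the actual RBI.

# Illustrative model of an RBI read request; names are assumptions for this sketch.
from dataclasses import dataclass

@dataclass
class ReadRequest:
    node_id: int            # first slot: the node N in a V@N access
    var_location: str       # second slot: identification of the variable V ...
    basic_block: int        # ... the basic block the access occurs in
    loop_iteration: int = 0 # ... and, under loop-level synchrony, the loop iteration

def emit_read(runtime, node_id, var_location, basic_block, loop_iteration=0):
    # The preprocessor emits code equivalent to this call for each V@N read; the
    # runtime (satisfy_read here is a stand-in) answers from its cache, or fetches
    # from node N if the cache is outdated or missing the variable.
    req = ReadRequest(node_id, var_location, basic_block, loop_iteration)
    return runtime.satisfy_read(req)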
The key to Kairos' performance, as we have mentioned above, is that it maintains loose synchrony between copies of a variable at a node and at its remote counterpart. A read to a remote variable is always satisfied by immediately returning the locally cached value of the corresponding managed object (with one exception: the only time a read blocks is when the variable has not yet been instantiated). To implement read requests for remote objects, the runtime includes, in the network request message, the variable location and the basic block number of the code that the request appears in. The preprocessor allocates space for identifying the basic block number information associated with each runtime object, and generates code to update this associated information at each basic block entry and exit. We note that only a small fixed-size space is necessary for holding the current runtime object identification and annotation information, because this space is used only to store statically scoped information about the single object currently under consideration, and is thus independent of the dynamic program behavior.

When is the cached copy of the variable updated? There are two classical choices. When the remote node writes to the variable, it can push the changed value to all readers. This requires the writer to maintain state, but can be low overhead. Alternatively, a reader might choose to update its cache by polling the remote end every time a variable is accessed. If the variable is read infrequently, this is a preferable alternative. Kairos' runtime implements a hybrid model of cache coherence. In Kairos, a remote reader's runtime caches managed objects in its cached objects pool, as shown in Figure 4.2, for up to a certain timeout (currently, 10 seconds for loose consistency, and the beginning of a new loop iteration for loop-level consistency). It satisfies read requests from this cache until the cache is next refreshed at the new timeout. The owner also keeps track of the list of remote cached copies, and propagates local writes on a managed object to all cached copies at remote nodes using a callback mechanism. Thus, this cache consistency mechanism allows us to optimize network energy consumption by exploiting Kairos' synchrony semantics.

Finally, the Kairos runtime implements a reliable transport mechanism to update the value of a variable from the remote node. The protocol to do this is currently relatively simple and unoptimized. To read a variable from a remote node, the node floods a message through the network containing the node and variable names. It repeats this step periodically until it receives a reply. Intermediate nodes cache these request messages and route the reply back in the opposite direction. This mechanism is inspired by mechanisms in Directed Diffusion [IGE].

With multiple outstanding requests, the Kairos runtime needs to manage these requests carefully. As shown in Figure 4.2, the runtime queue manager at each node manages two queues. There is an incoming reply queue for read replies as well as for write updates that are propagated through callbacks from remote owners of these locally cached reads. There is also an outgoing request queue for remote read requests. The runtime services both queues in a simple asynchronous manner using FIFO scheduling. The incoming queue is straightforward to service: each read reply is serviced immediately. However, reliability issues must be addressed with respect to requests in the outgoing queue, and requests are retransmitted if there is no acknowledgment from the remote destination. The queue scheduler component of the runtime services each request in the outgoing queue in FIFO fashion, and associates a timeout and retry count with it.
After the head of the outgoing queue is scheduled, it is put back at the end of the queue after initiating the timeout counter associated with it. If the acknowledgment from the remote end arrives in the incoming queue, the request object is removed from the outgoing queue; if not, the request is retried when the object reaches the front of the queue and its associated timeout has expired. Currently, we retry a request three times before giving up and discarding the request silently. We rely on lower-layer MAC-level reliability and on application-level reliability to provide additional retries and to recover from such conditions, respectively.

4.5 Kairos Evaluation

We have implemented the programming primitives discussed in the previous section in Python, using its embedding and extensibility APIs [vRJ], and have experimented with the three distributed algorithms described therein. More discussion about our implementation and evaluation can be found in [GGG05b].

Our testbed is a hybrid network of ground nodes and nodes mounted on a ceiling array. The 16 ground nodes are Stargates [Incc] that each run Kairos. In this setup, Kairos uses Emstar [EBB+] to implement end-to-end reliable routing and topology management. Emstar, in turn, uses a Mica2 mote [Inca] mounted on the Stargate node (the leftmost picture in Figure 4.6 shows a single Stargate+Mica2 node) as the underlying network interface controller (NIC) to achieve realistic multihop wireless behavior. These Stargates were deployed in a small area (middle picture in Figure 4.6), making all the nodes reachable from any other node in a single physical hop (we created logical multihops over this set in the experiments below). The motes run TinyOS [HSW+], but with S-MAC [YH] as the MAC layer.

There is also an 8-node array of Mica2dots [Incb] mounted on a ceiling (rightmost picture in Figure 4.6), and connected through a multiport serial controller to a standard PC that runs 8 Emstar processes. Each Emstar process controls a single Mica2dot and is attached to a Kairos process that also runs on the host PC. This arrangement allows us to extend the size of the evaluated network while still maintaining some measure of realism in wireless communication. The ceiling Mica2dots and ground Mica2s require physical multihopping for inter-node communication. The Mica2dot portion of the network also uses physical multihopping for inter-node communication.

To conduct experiments with a variety of controlled topologies, we wrote a topology manager in Emstar that enables us to specify neighbors for a given node and blacklist/whitelist a given neighbor. Dynamic topologies were simulated by blacklisting/whitelisting neighbors while the experiment was in progress. The end-to-end reliable routing module keeps track of all the outgoing packets (on the source node) and periodically retransmits the packets until an acknowledgment is received from the destination. Hop-by-hop retransmission by S-MAC is complementary and is used as a performance enhancement.

Routing Tree Performance: We implemented the routing tree described in Section 4.3.2 in Kairos, and measured its performance. For comparison purposes, we also implemented One-Phase Pull (OPP) [HSE] routing directly in Emstar.
OPP forms the baseline case because it is the latest proposed refinement of directed diffusion that is designed to be traffic-efficient by eliminating exploratory data messages: the routing tree is formed purely based on interests (interest messages in directed diffusion) that are flooded to the network, and responses (data) are routed along the gradients set up by the interest. To enable a fair comparison of the Kairos routing tree with OPP, we also implemented reliable routing for OPP.

We varied the number of nodes in our network, and measured the time it takes for the routing tree in each case to stabilize (convergence time), and the overhead incurred in doing so. In the case of OPP, the resulting routing tree may not always be the shortest path routing tree (directed diffusion does not require that), while Kairos always builds a correct shortest path routing tree. So we additionally measure the "stretch" (the averaged node deviation from the shortest path tree) of the resulting OPP tree with respect to the Kairos shortest path tree. Thus, this experiment serves as a benchmark for efficiency and correctness metrics for Kairos' eventual consistency model.

We evaluated two scenarios: first, building a routing tree from scratch on a quiescent network, and second, studying the dynamic performance when some links are deleted after the tree is constructed. Figure 4.7 shows the convergence time ("K" is for Kairos, and "before" and "after" denote the two scenarios before and after link failures), overhead, and stretch plots for OPP and Kairos, averaged across multiple runs; for stretch, we also plot the OPP standard deviation. It can be seen that Kairos always generates a better quality routing tree than OPP (OPP stretch is higher, especially as the network size increases), while incurring only moderately higher convergence time (∼30%) and byte overhead (∼2x) than OPP.

Localization: We have implemented the collaborative multilateration algorithm described in Section 4.3.2. Since we did not have the actual sensors (ultrasound and good radio signal strength measurement) for ToA (Time of Arrival) ranging, we hard-coded the pairwise distances obtained from a simulation as variables in the Kairos program instead of acquiring them physically. We believe this is an acceptable artifact that does not compromise the results below. We perturbed the pairwise distances with white Gaussian noise (standard deviation 20mm, to match the experiments in [SHS]) to reflect the realistic inaccuracies incurred with physical ranging devices.

We consider two scenarios, in both of which we vary the total number of nodes. In the first case (left graph in Figure 4.8), we use topologies in which all nodes are localizable given a sufficient number and placement of initial beacon nodes, and calculate the average localization error for a given number of nodes. The average localization error in Kairos is within the same order shown in [SHS, Figure 9], thereby confirming that Kairos is competitive here. Note that this error decreases with increasing network size, as expected, because the Gaussian noise introduced by ranging is decreased at each node by localizing with respect to multiple sets of ranging nodes and averaging the results. In the second scenario (right graph in Figure 4.8), we vary the percentage of initial beacon nodes for the full 24-node topology, and calculate how many nodes ultimately become localizable. This graph roughly follows the pattern exhibited in [SHS, Figure 12], thereby validating our results again.
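The ranging-noise model used in these localization experiments can be sketched as follows; the harness, data structures, and units here are assumptions for illustration only.

# Sketch: perturb ideal pairwise ranges with white Gaussian noise of 20mm standard
# deviation, mirroring the methodology described above (distances assumed in mm).
import random

def perturb_ranges(true_distances_mm, sigma_mm=20.0):
    return {pair: d + random.gauss(0.0, sigma_mm)
            for pair, d in true_distances_mm.items()}

# Example usage with two hypothetical node pairs:
noisy = perturb_ranges({(1, 2): 1500.0, (2, 3): 2300.0})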
Vehicle Tracking: For this purpose, we use the same vehicle tracking parameters as used in [JLZ] (for grid size, vehicle speed, sound RMS, acoustic sensor measurement simulations, sensor placement and connectivity, and Direction-of-Arrival sensor measurements) for comparing how Kairos performs against [JLZ]. A direct head-to-head comparison against the original algorithm is not possible because we have fewer nodes than they have in their simulations, so we present the results of the implementation in a tabular form similar to theirs.

We simulate the movement of a vehicle along the Y-axis at a constant velocity. The motion therefore perpendicularly bisects the X-axis. We do two sets of experiments. The first one measures the tracking accuracy, as denoted by the location error $\|\hat{x}_{MMSE} - x\|$ and its standard deviation $\|\hat{x} - \hat{x}_{MMSE}\|^2$, as well as the tracking overhead, as denoted by the exchanged belief state, as we vary the number of sensors (K). The main goal here is to see whether we observe good performance improvement as we double the number of sensors from 12 to 24. As Table 4.1 shows, this is indeed the case: the error, error deviation, and exchanged belief state all decrease, and in the expected relative order.

K    Avg $\|\hat{x}_{MMSE} - x\|$    Avg $\|\hat{x} - \hat{x}_{MMSE}\|^2$    Avg Overhead (bytes)
12   42.39                           1875.47                                 135
14   37.24                           1297.39                                 104
16   34.73                           1026.43                                  89
18   31.52                            876.54                                  76
20   28.96                            721.68                                  67
22   26.29                            564.32                                  60
24   24.81                            497.58                                  54
Table 4.1: Performance of Vehicle Tracking in Kairos

In the second experiment, we vary the percentage of sensor nodes that are equipped with sensors that can do Direction-of-Arrival (DOA) based ranging, and not just direction-agnostic circular acoustic amplitude sensors. These results are described in [GGG05b].

4.6 Conclusion and Future Work

This paper should be viewed as an initial exploration into a particular model of macroprogramming sensor networks. Our contribution in this paper is introducing, describing, and evaluating this model on its expressivity, flexibility, and real-world performance metrics. Kairos is not perfect in that, at least in its current incarnation, it does not fully shield programmers from having to understand the performance and robustness implications of structuring programs in a particular way; nor does it currently provide handles to let an application control the underlying runtime resources for predictability, resource management, or performance reasons. Finally, while Kairos includes a middleware communication layer in the runtime service that shuttles serialized program variables and objects across realistic multihop radio links, today this layer lacks the ability to optimize communication patterns for a given sensornet topology. Therefore, we believe that Kairos opens up several avenues of research that will enable us to explore the continuum of tradeoffs between transparency, ease of programming, performance, and desirable systems features arising in macroprogramming a sensor network.
 1: void track_vehicle()
 2:   boolean master=true;
 3:   float z_{t+1}, normalizing_const;
 4:   float p(x_t|z_t)[MAX_X][MAX_Y], p(x_{t+1}|z_t)[MAX_X][MAX_Y], p(z_{t+1}|x_{t+1})[MAX_X][MAX_Y],
        p(x_{t+1},z^k_{t+1}|z_t)[MAX_X][MAX_Y], p(z^k_{t+1}|z_t), p(x_{t+1}|z_{t+1})[MAX_X][MAX_Y];
 5:   float max_I_k=I_k; node argmax_I_k, self=get_local_node_id();
 6:   node_list full_node_set=get_available_nodes();
 7:   for (node iter=get_first(full_node_set); iter!=NULL; iter=get_next(full_node_set))
 8:     for (int x=0; x<MAX_X; x++)
 9:       for (int y=0; y<MAX_Y; y++)
10:         p(x_t|z_t)[x][y]=1/(MAX_X×MAX_Y);
11:     for(;;)
12:       sleep();
13:       if (master)
14:         for (int x=0; x<MAX_X; x++)
15:           for (int y=0; y<MAX_Y; y++)
16:             p(x_{t+1}|z_t)[x][y]=[Σ_{0≤x'<MAX_X} Σ_{0≤y'<MAX_Y} δ(√(x'²+y'²)−√(x²+y²)−v)·p(x_t|z_t)[x'][y']] / [Σ_{0≤x'<MAX_X} Σ_{0≤y'<MAX_Y} δ(√(x'²+y'²)−√(x²+y²)−v)];
17:         z_{t+1}=sense_z();
18:         normalizing_const=0;
19:         for (int x=0; x<MAX_X; x++)
20:           for (int y=0; y<MAX_Y; y++)
21:             p(z_{t+1}|x_{t+1})[x][y]=(r/δ_a)·[Φ((a_hi−r·z)/(r·σ))−Φ((a_lo−r·z)/(r·σ))];
22:             normalizing_const+=p(z_{t+1}|x_{t+1})[x][y]·p(x_{t+1}|z_t)[x][y];
23:         for (int x=0; x<MAX_X; x++)
24:           for (int y=0; y<MAX_Y; y++)
25:             p(x_{t+1}|z_{t+1})[x][y]=p(z_{t+1}|x_{t+1})[x][y]·p(x_{t+1}|z_t)[x][y] / normalizing_const;
26:         node_list neighboring_nodes=get_neighbors(iter);
27:         append_to_list(neighboring_nodes, self);
28:         max_I_k=−∞; argmax_I_k=self;
29:         for (node temp=get_first(neighboring_nodes); temp!=NULL; temp=get_next(neighboring_nodes))
30:           p(z^k_{t+1}|z_t)=0;
31:           for (int x=0; x<MAX_X; x++)
32:             for (int y=0; y<MAX_Y; y++)
33:               p(x_{t+1},z^k_{t+1}|z_t)[x][y]=p(z_{t+1}|x_{t+1})[x][y]@temp·p(x_{t+1}|z_t)[x][y];
34:               p(z^k_{t+1}|z_t)+=p(x_{t+1},z^k_{t+1}|z_t)[x][y];
35:           for (int x=0; x<MAX_X; x++)
36:             for (int y=0; y<MAX_Y; y++)
37:               I_k+=log[ p(x_{t+1},z^k_{t+1}|z_t)[x][y] / (p(x_{t+1}|z_t)[x][y]·p(z^k_{t+1}|z_t)) ]·p(x_{t+1},z^k_{t+1}|z_t)[x][y];
38:           if (max_I_k<I_k) argmax_I_k=temp;
39:         if (argmax_I_k!=self) master=false;
40:         master@argmax_I_k=true;
Figure 4.5: Procedural Code for Vehicle Tracking

Figure 4.6: Stargate with Mica2 as a NIC (left), Stargate Array (middle), and Ceiling Mica2dot Array (right)

Figure 4.7: Convergence Time (left), Overhead (middle), and OPP Stretch (right) for the Routing Tree Program

Figure 4.8: Average Error in Localization (L: average error in cm vs. number of nodes) and Localization Success Rate (R: % of resolved nodes vs. % of beacons)

Chapter 5: Fault-Tolerance Support for Kairos

5.1 Introduction

We focus on the issue of fault tolerance for sensor networks. Maintaining application accuracy and availability in the face of faults is a nontrivial proposition. Software bugs can render a node partially or wholly unresponsive. Network and hardware dynamics such as node failures, burst losses on links, network partitions, and reconfiguration events involving node addition and deletion can completely disable nodes or alter their program state. For example, consider a vehicle tracking application, in which a group of nodes cooperatively and iteratively refines their estimate of the current position of a moving vehicle.
If one or more nodes should fail in the middle of a computation, the resulting estimate can be incorrect because only partial data from the operational nodes is used. Depending on the extent and location of failure, the application may not even be able to form an estimate, effectively rendering it unavailable. Such failures are far more likely in sensor networks, where a large number of nodes are exposed to an unpredictable environment, than in traditional distributed systems.

Researchers have made impressive strides in providing programming platforms (e.g., [HSW+, GLvB+]) and services (e.g., [LLWC, MSL]) that simplify the development of sensor network systems. However, to our knowledge, none of these systems provides special support for managing faults. Instead, the programmer must manually implement a failure recovery strategy that is appropriate for the application at hand. In our vehicle tracking example above, a programmer might insert code to track the dependencies among program variables and nodes within the algorithm, detect when a node has failed, and discard failed dependencies in the final output in order to maintain program correctness and availability.

The need for such ad hoc recovery code significantly complicates the development of robust sensor-network systems. Recovery code crosscuts the entire application and is intimately tangled with the application logic, making the system difficult to modify and maintain. Further, the recovery code is tedious to implement correctly, for example requiring synchronization among the nodes in the network while maintaining energy efficiency.

We aim to provide declarative support for modularizing the failure concern, allowing sensor-network programmers to easily and reliably identify and recover from faults. Our insight is that this can be achieved by extending existing macroprogramming systems for sensor networks [NW, GGG05a]. Unlike the traditional approach, in which programmers directly implement the programs to be run on individual nodes in the sensor network, macroprogramming makes it possible to write a centralized program to express a computation. The compiler then automatically produces node-level programs that implement the specified behavior in a distributed manner. Macroprogramming allows programmers to focus on the algorithmic aspects of their applications, without worrying about low-level details like the protocol for communication among nodes.

In this paper, we make the following contributions:

• We describe a simple API for checkpointing, a generic recovery approach for a broad class of common sensor-network failures, in the context of a macroprogramming language. The API leverages the macroprogram's centralized view to allow programmers to naturally specify application state to be checkpointed at desired points in the program. The programmer can later roll back to a previously created checkpoint in order to consistently undo the effects of failed nodes, and re-execute the rolled-back code with only the set of available nodes. This API is implemented by a novel low-cost distributed algorithm for checkpoint and rollback. The API also supports an important variant of recovery that is designed to preserve application work done during a network partition event, called Partition Recovery.

• Our generic recovery API described above is a distinct improvement over ad hoc recovery techniques used in traditional sensor network systems, but it still requires the programmer to explicitly interleave recovery logic with the macroprogram.
Our second contribution leverages the recovery API to support a form of automated recovery that we call Declarative Recovery. Declarative Recovery allows a programmer to provide modular code annotations that specify where checkpoints should be taken, and the macroprogramming system then automatically detects faults and rolls back execution appropriately. It also includes an algorithm to automatically determine at run time the nearest checkpoint to which it is sufficient to roll back in order for recovery to succeed.

• Finally, we push automated recovery even further, to explore a form of Transparent Recovery. In this recovery scheme, the system additionally automatically determines where checkpoints should be taken. We describe a simple set of heuristics for placing checkpoints that appropriately handles common macroprogramming patterns.

We have instantiated this approach to failure recovery in sensor networks as an extension of our macroprogramming system called Kairos [GGG05a]. We have implemented three qualitatively different sensor network applications using Kairos—localization, target tracking, and data aggregation—and have used them to evaluate the recovery API and the declarative recovery technique. Our primary metrics are the benefits of improvement in correctness and availability of a recovered application in comparison to an unrecovered application, and the performance costs of messaging and memory overheads. Our recovery strategies can improve application availability by an order of magnitude: in some cases, an application is unavailable for 30 times fewer reporting intervals than one which does not incorporate our recovery mechanisms. Our strategies fully preserve application accuracy for two common kinds of faults (software faults and network partitions), incur acceptable messaging overhead (less than 15% for vehicle tracking), and incur about a factor of two additional data memory for storing checkpoints.

To our knowledge, ours is the first work to explicitly address generic failure recovery methodologies for sensor networks. Techniques for detecting and concealing faults and for recovering from failures have been extensively considered in the distributed systems literature (e.g., [Gra, SC, MG, CKF+]). Such work, however, has not examined the kind of high-level recovery API and automated recovery techniques that we describe. We are able to support these techniques in a practical manner by leveraging the centralized view of a distributed computation provided by macroprogramming systems.

The rest of the paper is structured as follows. Section 5.2 motivates the need for failure recovery support in wireless sensor networks, and describes the complexity of manually implementing recovery within node-level programs. Section 5.3 provides an overview of Kairos, and describes our recovery API on top of Kairos and how it can be used for manual recovery. In Section 5.4, we describe how we can provide support for declarative and transparent recovery mechanisms. Section 5.5 details our evaluation of these recovery techniques for several classes of sensor network applications. We describe related work in Section 5.6. Section 5.7 concludes and discusses future work.

5.2 Motivation

Real-world sensor network deployments see significant failures. Figure 5.1 shows the distribution of failure durations in a real-world sensor network deployment at the James Reserve in Southern California [jr].
In this deployment, each sensor periodically sends readings to a base station; failure to receive any readings from a sensor corresponds to a failure of the sending node or of one or more other sensor nodes that would otherwise have forwarded the sender's data to the base station. The figure plots the duration of outages (intervals during which no data was received from a sensor) for a total of twenty sensor nodes over a period of several months in 2003 and 2004. During this period, each node transmitted data for a total of at least six months. There were a total of 543 outage events during this period.

Figure 5.1: Distribution of outage durations in a real sensor network (CDF of outage duration in hours).

The cumulative distribution function of the duration of outages converges slowly, and outages range from a few minutes to well beyond six hours, with most outages shorter than three hours. Thus, in a real-world sensor network deployment, applications are likely to see a range of node failure and recovery time-scales; there is no single time-scale that one can engineer for. As such, it is desirable for an application to incorporate mechanisms that allow it to function for short periods of time with a smaller set of nodes than it started with, and to re-use nodes that might have been down for extended periods. Such mechanisms can improve the quality of a sensor network computation.

One possibility, then, is for an application writer to manually program failure recovery in sensor network applications. To illustrate some of the problems that arise from manual failure recovery, consider an application in which sensor nodes periodically send both temperature and light readings to a designated base station node, which we assume to be the node with the lowest ID. The base station aggregates the data it receives in some fashion. Even in this simple scenario, failures must be considered carefully:

1. What should be done if a node fails in some period when the base station has only been able to obtain one of the two sensor values (temperature and light) from the node? For our example, we assume that the base station must remove the effect of the incomplete sensor reading from the aggregation.

2. What should be done if the base station fails? In that case, a new base station must be elected, by finding the live node with the lowest ID. Further, whenever an old base station comes back up, sensor data from the old and current base stations must be merged, and the node with the lower ID must become the new base station.

In a language like nesC [GLvB+], the default node-level programming language for the Berkeley sensor motes [mic], this application would typically be written as a collection of components, each pertaining to a different task, such as aggregation, leader election, and base-station merging. Every node has the code for all of the components and executes the appropriate procedures from these components depending on its state (i.e., whether it is a normal node, a current base station, or a rebooted old base station). For ease of presentation we focus on the functionality for data aggregation. Figure 5.2 shows pseudocode for the two main procedures. The aggregate_send procedure is invoked by every node and periodically sends temperature and light readings to the base station. The value of bs is set by the leader-election component, which is not shown.

The aggregate_receive procedure is invoked by the base station, in order to handle the receipt and aggregation of data from the nodes in the network.
In each period (or epoch), the base station obtains a list of what it believes to be the live nodes, via a call to the local procedure get_available_nodes() (line 15). This list is maintained by the leader-election component (not shown) in an efficient way through a simple membership management protocol. This protocol would be part of the leader election component, whereby every node periodically announces its liveness. The base station then uses a select() facility (line 21) to wait for temperature or light data from these nodes (sent via aggregate_send) and update local state appropriately. This process repeats until either all expected data from the live nodes has been received (line 25) or a timeout is received, indicating the end of the epoch (line 26).

The aggregate_receive procedure handles node failures through checkpoint and rollback, a standard failure recovery approach. The base station takes a checkpoint of its local state at the beginning of each epoch (line 13). When a timeout is signaled, indicating that some live nodes did not provide both sensor values, the base station restores this checkpoint (line 27). This has the effect of removing all data obtained in the current epoch from the aggregation, thereby ensuring consistency. (It is possible to perform finer-grained recovery, for example retaining sensor readings in the current epoch from any node for which both values were able to be obtained. However, doing this would require the programmer to manually track dependencies to ensure consistency, which is tedious and error prone.)

To handle base station failures, we assume that whenever a node determines that the base station has not broadcast its liveness, as part of the membership management described above, that node triggers leader election. To handle the situation when an old base station comes back, the current base station checks the live nodes at each epoch for a node with a lower ID, invoking the merge functionality if required (lines 16–17).

Manual recovery as illustrated by our example has a number of drawbacks:

1. The code for the recovery concern is tangled with the rest of the application logic. For example, the base station must explicitly check for the presence of an old base station after accessing the live nodes (line 16) and must explicitly restore a taken checkpoint upon detecting a failure in the middle of an epoch (line 27). Further, because a checkpoint could be restored at any point in its dynamic lifetime, managing checkpoints is non-modular. For example, if the inner while loop in Figure 5.2 were defined in its own function, the checkpoint ckpt would have to be restored from there, requiring it to either be a global variable (whose deletion would then need to be manually managed to save space) or to be explicitly passed to the function.

2. Proper recovery may require manual tracking of dependencies across nodes. In our example, only the base station's local state is of interest upon a node failure, so local checkpoint and recovery (take_local_ckpt and restore_local_ckpt) are sufficient. However, suppose the base station's local state had dependencies with local state at other nodes in the network. In that case, whenever the base station required a rollback, the failed dependencies at other nodes would also have to be tracked and removed to maintain consistency. Further, whenever these dependencies change, through program maintenance or extension, the recovery code must likewise be updated.
Dealing with network partitions also makes dependency tracking harder because a partition causes some nodes to be disconnected from others, thereby causing their states to drift as nodes in the two partitions work independently. After a partition is repaired, one option is for the programmer to simply discard the work done by nodes from one half of the partitioned network. However, this is sub-optimal, because work done by both sets of nodes can be integrated into the long-term state of the healed network, which improves the quality of the final results. But, again, this requires careful tracking of dependencies between data across nodes during and after the partition has occurred.

3. Similarly, proper recovery may require synchronization across nodes. If dependencies exist across nodes, requiring rollbacks at multiple nodes, the programmer must be sure to synchronize these rollbacks to ensure consistency. Otherwise, one node could restart its rolled-back execution before another node has been fully rolled back. Manual synchronization is difficult to implement both correctly and in an energy-efficient manner, which is critical on today's resource-constrained sensor nodes.

5.3 Generic Checkpoint Recovery in a Macroprogramming System

We first describe the particular macroprogramming language and system we use throughout this paper, called Kairos [GGG05a]. While we have concretely examined and evaluated the techniques described in this paper within Kairos, we believe that the key concepts can be adapted to other macroprogramming languages like Regiment [WM].

5.3.1 An Overview of Kairos

In this section, we briefly review the Kairos macroprogramming system; it is described in more detail elsewhere [GGG05a].

Kairos lets a programmer directly express the desired global behavior of a distributed computation. The programmer achieves this by writing a centralized program in which sensor network data can be manipulated as ordinary program variables. The Kairos compiler then translates the centralized program into programs that execute on individual nodes, with the support of the Kairos runtime.

We summarize the Kairos language abstractions here. Kairos augments a host language with a small number of new programming primitives, which allow a distributed computation on a sensor network to be expressed centrally. The programming model is analogous to that of mainstream imperative programming languages: Kairos has a sequential semantics by default and a centralized memory model. As such, Kairos fits well as an extension to commonly used languages. We have built a Kairos extension to Python, which we use in our implementation and experiments reported here. The detailed description of our compilation and runtime techniques is available in [GGG05a].

Kairos decouples a sensor network program from the underlying node topology, thereby making it instantiable on an arbitrary topology. The node data type is an abstraction of a network node. Nodes can be conveniently manipulated using a nodelist iterator data type that presents a set-based abstraction of a node collection. Kairos makes sure that the values contained in these variables are visible consistently and efficiently at all nodes. The function get_available_nodes() provides access to the nodelist representing all nodes in the network, while the get_neighbors(node) function returns the current list of node's radio neighbors. Given the broadcast nature of wireless communication, a neighbor list is a natural abstraction to build interacting groups of nodes in a program, and is similar to regions [WM] and hoods [WSBC].
Kairos provides a natural way to access the program state at any node from within the centralized program. A node-local variable is a program variable that is instantiated per node. A particular node's version of a variable can be accessed by the macroprogram through a var@node syntax. All other variables are instantiated only once within the network, and are called central variables. Kairos respects the scoping, lifetime, and access rules of variables imposed by the host language.

Figure 5.3 shows the Kairos code that uses these abstractions for continuously computing the sample averages for light and temperature readings and storing them at a base station node. It works as follows. In lines 1, 2, and 3, we declare variables to represent the list of nodes in the network, a temporary node, a node that will be chosen as the base station, and the time to sleep between averaging intervals. In lines 4 and 5, we declare node-local variables. temp and lt are special node-local variables (indicated by their sensor attribute) that are continuously updated with new readings from a node. For each of the node-local variables, a copy of the variable with the same name exists at each node in the network. In line 6, we store the list of active nodes in the network in the variable full_node_set, and in line 7 we instantiate the node with the lowest id as the base station node. Finally, in lines 8–12, we cause the network to repeatedly fetch temperature and light samples and store their average at the base station.

Kairos implements the distributed version of a macroprogram in a network-efficient manner. The Kairos runtime has a distributed caching layer that makes sure updates to central variables are visible across the network consistently. The caching layer also buffers updates from node-local variables so that the programmer can perform synchronous reads and writes. For example, in lines 11–12 of Figure 5.3, the macroprogram updates the base station's node-local variables in sequence, while in a node-level program such as Figure 5.2, the programmer is responsible for managing network messages that may arrive any time and out of order. Kairos minimizes communication overhead for both data and control through three techniques: by allowing asynchronous execution at nodes and minimizing their control flow synchronization; by exploiting relaxed data consistency semantics where possible in order to further reduce control traffic overhead; and by caching remote variables for reads and filtering unnecessary writes [GGG05a].

5.3.2 Recovery in Macroprograms

In addition to checkpointing, which was discussed in Section 5.2, there are two approaches for programming recovery into macroprograms. The simplest approach is for runtime support to provide error notifications, leaving it to the programmer to manually deal with failures. For example, the runtime can return an error when reads to a remote variable fail. Thus, in Figure 5.3, accesses to node-local variables av_t, av_l, count, temp and lt in lines 11–12 can return a special error code when a node is unavailable. But such a facility only solves one of the three problems with node-level manual recovery described in Section 5.2. While the programmer is relieved of the burden of manually synchronizing such accesses across nodes because the macroprogram's runtime implements such synchronization, the programmer must still deal with the first two problems: she would have to add checks around each access to a node-local variable (there would be five such checks in lines 11–12, for example), and manually track dependencies across such node-local states.
The second approach is to augment the language with a transaction facility. While such a facility could potentially solve all three problems of manual recovery, it would be heavyweight unless carefully implemented. Support for nested transactions would be necessary in order to minimize lost work during recovery, but such transactions are difficult to implement efficiently and correctly in a distributed setting because of their potential for causing deadlocks and livelocks [Kna87]. Thus, we use checkpoint-based recovery.

5.3.3 Manual Failure Recovery for Macroprogramming

In this section, we examine a checkpointing approach to manual failure recovery in the context of macroprogramming. In particular, we describe a small checkpointing API for Kairos. The key novelty is the way in which this API leverages Kairos' centralized view of the network: programmers specify checkpoints at the granularity of the macroprogram, and the runtime system carefully ensures the corresponding node-level programs take consistent checkpoints and roll back in a synchronized manner when a failure is detected. Programmers are still responsible for manually managing checkpoints, thereby suffering from some of the drawbacks described in Section 5.2. These drawbacks are addressed by our automated recovery strategies, which build on the checkpointing API and are described in the next section.

The Checkpointing API

In a Kairos macroprogram, the programmer may call the following function at any point:

Ckpt take_ckpt(nodelist nl);

This function takes a consistent checkpoint at every node in the specified nodelist. By a consistent checkpoint, we mean that no node in nodelist nl proceeds in the computation until it knows that all other nodes in nl have also taken the checkpoint. This call returns a handle to the checkpoint. To roll back to a checkpoint, a programmer may call the following function:

boolean restore_ckpt(Ckpt ckpt);

This function takes a previously created checkpoint as an argument and restores the state at each node (again, consistently) to that at the specified checkpoint. Execution of the restored program then resumes at the statement following the point where the specified checkpoint was taken.

Figure 5.4 shows how this API can be used to implement a version of fault-tolerant sensor averaging in Kairos. It meets the requirements described at the beginning of Section 5.2 except for merging data from old base stations, which we describe in the following subsection. The recovery code that is additional to the macroprogram in Figure 5.3 is shown in bold. The basic idea behind the recovery code is to use two checkpoints for the two failure scenarios that must be recovered from: when a base station crashes, ckpt1 created in line 6 is used, and when any other node crashes, ckpt2 created in line 13 is used. Whenever one or more nodes fail during the execution, the runtime ultimately triggers the recovery code in lines 18–24. This is because failures of nodes are detected in the background by the runtime and exposed through an internal variable called failed, which the programmer can check anywhere in the program. In case a node other than the base station crashes, the programmer restores checkpoint ckpt2 in lines 20–22, and in case the base station itself crashes, the programmer restores checkpoint ckpt1 in line 24. If ckpt1 is restored, execution resumes at line 7. The programmer takes another checkpoint in lines 7–9 if ckpt1 has been restored. If ckpt2 is restored, execution resumes at line 14.
Thus, as long as any node other than bs crashes, the state at bs is unaffected, because of its recovery via checkpoint ckpt2. Further, since restoring a checkpoint reinstates a previous state of the program, the actual values of node-local variables that a program uses in the time between taking and restoring a checkpoint are immaterial. Lines 16–17 of Figure 5.4 exploit this property by not checking for the return values of node-local variables. Our implementation currently returns a well-defined error code, which is useful if a programmer wants to implement finer-grained recovery.

A macroprogram written to use the checkpointing API is said to use checkpoint-rollback recovery (CRR). CRR solves the last two problems of node-level recovery described in Section 5.2 as follows:

1. Unlike the manual checkpoints taken in the pseudocode in Figure 5.2, which are local to a particular node, the Kairos checkpointing API provides globally consistent checkpoints. Therefore, the programmer is relieved from manually tracking dependencies across nodes: as long as all nodes that have dependencies among one another are checkpointed and the call to take_ckpt is placed at a globally consistent point, all dependencies will be properly handled automatically.

2. The Kairos runtime automatically synchronizes nodes when a checkpoint is taken or restored. For example, a node is only allowed to resume execution after restoring its local checkpoint once all other nodes have also restored their checkpoints. Therefore, the programmer is completely relieved of the burden of node synchronization.

Detecting Faults

Fault detection is a significant research challenge in its own right. In this paper, we assume non-malicious faults and use a simple, yet practical, fault detection strategy. A fault is said to occur when a read or write to a node fails after three successive retries. Variable reads and writes use a simple request/response protocol in Kairos. This protocol has a three-second timeout, a reasonable upper-bound on real-world latencies in sensor networks. When a fault is detected, the failed flag is set.

Implementing Checkpoints

Kairos implements the checkpointing API, and its various components, in its runtime. This task involves coordinating the relevant node-level runtimes to efficiently take and restore checkpoints. By the time a program invokes take_ckpt(nl), the runtime ensures that the value of nl is available at every node in the network. Each node within nl takes its own local checkpoint, sends out a completion message, and stops the node-level program execution until it hears the same message from other nodes. Nodes that are not in nl need not actually take a local checkpoint, but they still have to participate in a global consensus algorithm [EAWJ]. In the general case, this synchronization can be expensive, requiring O(N^2) reliable point-to-point transmissions for a network of size N.

We propose two novel optimizations for communication reduction. First, we exploit the broadcast nature of wireless sensor networks in order not to require every node to communicate with every other node. We build consensus using the following algorithm, which has two phases of execution. In the first phase, a node reliably broadcasts "Done/Wait" to its immediate neighbors after it takes its local checkpoint. Whenever it hears "Done/Wait" from all neighboring nodes, it enters the second phase of execution by reliably broadcasting "Done" in the local domain. Once it hears "Done" from all current neighbors, the node independently determines that a consistent global checkpoint has been taken. Intuitively, this algorithm works because a node would not have entered the second phase of the protocol if any neighbor had not yet completed or even entered its own first phase.
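The Python sketch below simulates this two-phase barrier within a single process; node objects, neighbor sets, and message delivery are in-memory stand-ins for reliable local radio broadcasts, and none of the names correspond to the actual Kairos runtime API.

# Minimal single-process simulation of the two-phase checkpoint barrier.
class Node:
    def __init__(self, nid, neighbors):
        self.nid = nid
        self.neighbors = set(neighbors)   # ids of radio neighbors
        self.heard_done_wait = set()      # neighbors heard in phase one
        self.heard_done = set()           # neighbors heard in phase two
        self.in_phase_two = False
        self.barrier_complete = False

def run_barrier(nodes):
    """Drive the barrier to a fixed point over a dict of id -> Node."""
    # Phase one: each node takes its local checkpoint (elided here) and then
    # broadcasts "Done/Wait" to its neighbors.
    for n in nodes.values():
        for nbr in n.neighbors:
            nodes[nbr].heard_done_wait.add(n.nid)
    # Iterate until no node changes state; this stands in for asynchronous,
    # message-driven execution at each node.
    changed = True
    while changed:
        changed = False
        for n in nodes.values():
            # Enter phase two once "Done/Wait" arrived from every neighbor.
            if not n.in_phase_two and n.heard_done_wait >= n.neighbors:
                n.in_phase_two = True
                for nbr in n.neighbors:
                    nodes[nbr].heard_done.add(n.nid)
                changed = True
            # Declare the global checkpoint consistent once "Done" arrived
            # from every neighbor.
            if n.in_phase_two and not n.barrier_complete and n.heard_done >= n.neighbors:
                n.barrier_complete = True
                changed = True

# Example: a four-node ring topology.
nodes = {i: Node(i, [(i - 1) % 4, (i + 1) % 4]) for i in range(4)}
run_barrier(nodes)
assert all(n.barrier_complete for n in nodes.values())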
The cost of this protocol is clearly at most 2N reliable local broadcasts. Another distinguishing feature of this protocol is that nodes only need to synchronize during this operation but otherwise execute completely asynchronously with respect to one another.

Second, we optimize a common case scenario in which a checkpoint is repeatedly taken over a single node. For example, in line 13 of Figure 5.4, we repeatedly checkpoint state at bs. In such cases, the runtime can avoid global synchronization by only taking a local checkpoint. The runtime implements the correct consistency semantics so that such a checkpoint is valid. Our local checkpointing implementation uses the libckpt library to save the private process state (the data, heap and stack segments) locally at a node.

When a program invokes restore_ckpt(ckpt), the runtime restores the program state from the local checkpoints of nodes stored in the ckpt structure. The distributed component of this operation uses the same machinery involved in take_ckpt(), except that the remaining nodes should obviously not expect protocol messages from the failed nodes. After the live nodes have agreed to consistently restore a ckpt, each node's runtime locally restores its own state.

Summary

The checkpointing API provides an abstraction that specifies recovery actions at the level of the macroprogram itself. The Kairos runtime carefully ensures that consistent checkpoints are taken at each local node. This checkpointing mechanism is conceptually similar to other distributed checkpointing techniques, all of which, including ours, are variants of Chandy and Lamport's algorithm [CL85]. The main novelty is that our implementation is asynchronous and optimized for the locally communicating and broadcast nature of sensor networks. Moreover, we are not aware of similar language-level recovery techniques that are tightly integrated with the underlying distributed programming system, a feature which is useful in providing support for partition recovery, as described in the next section.

5.3.4 Recovering from Partitions

Checkpointing can lose work between the last checkpoint and when a restore_ckpt() is called. If faults affect a single node, invoking CRR is the right choice if we only want to ensure consistency of the macroprogram state. In the case of many continuous-output applications, such as vehicle tracking, it also happens to be the optimal choice because it causes the macroprogram to respond both rapidly and correctly to network dynamics, and compute the continuous output with no loss of accuracy.

However, CRR on its own is not sufficient to properly handle network partitions. We define a partitioning as an event which causes one or more live nodes to be disconnected from the rest of the network. Suppose a network partition occurs anywhere between lines 11–15 in Figure 5.4. With CRR, Kairos would roll back the computation and resume it independently on both halves of the partition. There are now two bs values, each of which keeps accumulating averages. When the partition heals, we need a mechanism to let a programmer specify how to unify the work done by each side of the partition. We provide such a mechanism, called Partition Recovery, which combines the global work done by each group during the partition.
The goal is to preserve the values of variables representing the long-lived program state, such as av_l@bs, av_t@bs, and count@bs. The programmer invokes partition recovery by specifying a merge function along with the macroprogram. The runtime indicates that the partition has healed by setting a healed variable, and the programmer can detect this condition similarly to the check used for node failures in line 15 of Figure 5.4. In order to make the ensuing discussion clear, we show the entire code for the macroprogram in Figure 5.5. It is augmented with the code for dealing with partitions in bold.

Figure 5.5 works as follows. When a partition occurs, the test for failed in line 18 would succeed. One half of the partition that contains the base station would work with a smaller set of nodes because its runtime restores ckpt2 to line 14 of the macroprogram, while the other half that does not contain the current base station will additionally obtain a new base station when its runtime restores ckpt1 to line 7 instead. Thus, there would be two global runtimes, each of which is the union of the local runtimes of its constituent nodes. We note that the original base station does not lose its long-term state (i.e., its values of av_l, av_t, and count) because the rollback of its partition is only until line 14. As long as they are separate, both partitions work independently thereafter.

Later, when the two partitions merge, the runtimes of the two halves will independently detect this condition, because the two distributed runtimes maintain information about which nodes are available. Before the merge, each runtime maintains its own copy of the central variables, and the copies may become out of sync. After the merge, the programmer may require access to both copies, in order to determine an appropriate value to use for that central variable upon program resumption. A programmer can indicate a central variable var whose values from the two partitions should be saved by declaring two additional central variables var_P1 and var_P2. For example, in the beginning of Figure 5.5, the programmer declares bs_P1 and bs_P2. Just before the two runtimes of a partition merge, bs_P1 and bs_P2 are updated respectively with values from each of the two partitions.

The merge function is invoked by the programmer separately for each runtime in line 26, after she detects that the underlying partition has healed, by testing for the healed flag in line 25. The merge function synchronizes the two runtimes by making the first caller wait until the second caller has also invoked merge_av. merge_av first updates the global variable bs of the macroprogram to the lower-valued base station. It then updates the av_l, av_t, and count values at bs. When it exits, the two runtimes are considered unified, and the application resumes execution at line 11.

One observation we can make regarding the application domain of sensor networks is that it is often possible to write merge functions that follow a well-known idiom. Figure 5.6 describes some common example applications and suitable merge functions.

5.4 Automated Recovery Strategies

While our checkpointing API for macroprogramming is a significant step from node-local manual recovery, it still requires the programmer to manually create checkpoints, manage their lifetimes explicitly, and restore to the appropriate checkpoint at necessary places within the application logic.
5.4.1 Declarative Recovery Annotations

In order to relieve the programmer from dealing with such issues, and in order to allow her to reason about recovery modularly, we have designed a Declarative Recovery (DR) annotation technique. This annotation takes the following form: <nodelist, merge_func>, where nodelist is an expression that evaluates to a list of nodes potentially affected by a fault, and the optional merge_func specifies a merge function to be used after a partition. The nodelist argument is specified using a set-theoretic notation, with support for basic operations of union, intersection, and difference. Such an annotation may be placed at any line in the macroprogram, and more than one such annotation may be present in a given macroprogram.

When a programmer places an annotation <nodelist, merge_func> at some point in the program, she is indicating that the global program state is consistent at that point, and therefore that this is an appropriate point at which to take a checkpoint. When such an annotation is encountered during execution, the runtime automatically takes a checkpoint at all nodes in nodelist (using the checkpointing API described in the previous section). The runtime also starts a new recovery scope and watches for any failed or merged nodes in the background. This recovery scope lasts for the dynamic extent of the annotation's smallest enclosing program block.

When any remote access encounters a failure within a recovery scope, the runtime automatically rolls back the computation at each node in the macroprogram to the most recent relevant checkpoint. Relevance is determined by the nodelist argument to an annotation, which indicates that forward progress can be made from this point as long as at least one node in nodelist is live. Given this information, the runtime can automatically roll back to the checkpoint that discards the least amount of work while ensuring forward progress. We describe this rollback algorithm in more detail in the next subsection. Furthermore, when new nodes are added to the system or when a partition heals, the runtime also rolls back the computation, applies the specified merge function, and resumes the computation.

Figure 5.7 shows our averages example augmented with recovery annotations (lines 7 and 11). Lines 8 and 12 are additional code for ensuring that the program can make progress without the failed nodes, and are executed immediately after the macroprogram has been rolled back to the corresponding points. For simplicity, we do not show a merge function, which would be very similar to the one in Figure 5.5; thus, the second arguments of the annotations in lines 7 and 11 are NULL. In line 7, we annotate full_node_set as the set of nodes over which the checkpoint is defined. In line 11, we activate another recovery scope, defined only over the base station. This annotation indicates that forward progress can be made from line 11 as long as the base station is still live. Therefore, whenever any node other than bs fails during the annotation's recovery scope, the runtime rolls the program back only to line 12, and re-initializes the set of currently available nodes. If the base station fails, however, the runtime instead rolls back to line 8, and subsequently chooses a new base station in line 9.

These declarative recovery annotations eliminate the problems of manual checkpointing described earlier. A simple annotation tells the runtime where checkpoints should be taken.
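As a rough illustration of these semantics, the annotation on line 7 of Figure 5.7 can be read as something like the following hypothetical runtime calls; the class and function names below are ours, not the actual compiler output or runtime API.

# Illustrative reading of what encountering <nodelist, merge_func> might do.
class RecoveryScope:
    def __init__(self, ckpt, nodelist, merge_func=None):
        self.ckpt = ckpt              # checkpoint to restore on failure
        self.nodelist = nodelist      # nodes whose liveness permits progress
        self.merge_func = merge_func  # applied after a partition heals

def enter_annotation(runtime, nodelist, merge_func=None):
    """Take a consistent checkpoint over nodelist and open a recovery scope
    for the enclosing block (hypothetical sketch)."""
    ckpt = runtime.take_ckpt(nodelist)        # Section 5.3.3 API
    runtime.push_scope(RecoveryScope(ckpt, nodelist, merge_func))
    return ckpt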
The runtime then automatically creates and manages these checkpoints, detects failures, and determines an appropriate checkpoint to restore, even across function boundaries. In this way, the recovery code is much more insulated from the application logic and much more robust to application updates. Finally, the runtime also automatically garbage collects checkpoints as they become inactive, as described in the next subsection.

5.4.2 Selecting and Managing Checkpoints

In a macroprogram with several annotations, the checkpoints created by active annotations can be dynamically managed as a single list. In this list, a checkpoint A follows a checkpoint B if the line of code at which A was taken is executed after the line at which B was taken at run time. The runtime continuously tracks nodes' membership status in the background to discover if one or more nodes have failed or been partitioned. If it detects such a condition, it searches this checkpoint list for a checkpoint with an annotation that has specified at least one live node. Intuitively, the programmer intends each checkpoint to represent both a globally consistent state, and, orthogonally, a liveness condition that declares that the macroprogram can make forward progress if execution is retried from that point on, after discarding the effects of failed nodes during checkpoint recovery.

The runtime allocates and maintains memory for checkpoints in an efficient and distributed manner. Metadata associated with a checkpoint, which includes the list of nodes over which the checkpoint was taken and the checkpoint's parent checkpoint in the list, is replicated at every node. One valuable optimization is that, if this metadata does not change when the next checkpoint is taken at the same place, global synchronization is averted. Thus, in Figure 5.7, when the runtime repeatedly takes a checkpoint over bs's state, it can avoid global communication and synchronization after the first time. This is because of our observation that if two checkpoints are taken over the same node list, the older checkpoint can be safely replaced by the newer checkpoint: the older checkpoint will never be used in favor of the later checkpoint because (a) our liveness requirement during rollback applies equally to both checkpoints, and (b) our requirement to minimize work lost will cause the later checkpoint to preferentially be chosen over the older one.

The runtime reclaims storage allocated for checkpoints in the following simple fashion. Whenever execution encounters an annotation, the runtime takes a checkpoint at that location, and then discards any previously taken checkpoints at that code location. This strategy lazily discards checkpoints; an alternative would have been for the Kairos compiler to carefully discard checkpoints whenever execution exits the static program scope in which the annotation is declared, but our approach requires less work on the part of the compiler, and is comparably efficient. Furthermore, before restoring the state corresponding to a selected checkpoint, the runtime discards the saved memory associated with all later checkpoints in the list.
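The selection and garbage-collection rules above can be summarized by the following Python sketch, which assumes checkpoints are kept oldest-to-newest and that each entry records the nodelist named in its annotation; the data structures are illustrative rather than the runtime's actual representation.

# Sketch of checkpoint bookkeeping and rollback selection for DR.
from dataclasses import dataclass

@dataclass
class Checkpoint:
    location: str          # code location of the annotation
    nodelist: frozenset    # nodes named in the annotation
    # saved per-node state would accompany this in the real runtime

def record_checkpoint(ckpt_list, new_ckpt):
    """A new checkpoint at a location discards older ones from that location."""
    ckpt_list[:] = [c for c in ckpt_list if c.location != new_ckpt.location]
    ckpt_list.append(new_ckpt)

def select_rollback(ckpt_list, live_nodes):
    """Return the newest checkpoint whose annotation still names a live node,
    discarding all later checkpoints; None means rolling back to the start of
    the macroprogram."""
    for i in range(len(ckpt_list) - 1, -1, -1):
        if ckpt_list[i].nodelist & live_nodes:
            del ckpt_list[i + 1:]      # later checkpoints are discarded
            return ckpt_list[i]
    return None

# Example mirroring Figure 5.7 (node ids are illustrative; bs = node 1):
ckpts = []
record_checkpoint(ckpts, Checkpoint("line 7", frozenset({1, 2, 3, 4})))
record_checkpoint(ckpts, Checkpoint("line 11", frozenset({1})))
# If bs fails, only the network-wide checkpoint at line 7 remains usable.
assert select_rollback(ckpts, live_nodes={2, 3, 4}).location == "line 7"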
5.4.3 Transparent Recovery

While Declarative Recovery significantly simplifies programming recovery, it still requires the programmer to annotate code. In addition to identifying points in the code where consistent checkpoints can be taken, the programmer has to indicate what the minimal set of live nodes is at such points in order to optimize lost work. For example, in Figure 5.7, the second annotation's first argument must contain the base station in order to avoid losing bs state. An interesting question is whether it is possible to provide completely transparent recovery, without programmer involvement at all. We have taken a first step in this direction using a simple heuristic that we call Transparent Recovery (TR).

In Transparent Recovery, the need for supplying declarative annotations is eliminated, but the programmer must still supply a merge function for the program, because merge functions are inherently application-specific. We only allow one merge function to be provided; it is declared with a special attribute indicating that it is the merge function.

Transparent Recovery works as follows. The Kairos compiler generates code to take a checkpoint after each update to a variable of type node or nodelist. Thus, going back to the original example (Figure 5.3), the compiler would direct the runtime to take a checkpoint of full_node_set after line 6, of bs after line 7, and of iter after line 10.

Transparent recovery can be sub-optimal, losing more work than necessary, because it is not possible to infer the nodes that must be live at a given point to ensure that forward progress can be made. Therefore, TR's rollback strategy must be conservative: upon the failure of a node n, TR rolls back to the latest checkpoint C such that C and all earlier checkpoints do not include node n. Intuitively, it is safe to roll back to C if this condition is met, since nothing in the program execution up to C depended upon node n. If no such checkpoint exists, the runtime system simply rolls back to the beginning of the macroprogram.

Because all nodes are checkpointed at line 6 in Figure 5.3, this code represents a worst case of sorts for TR; failures always cause rollback to the beginning of the macroprogram. However, our experiments in Section 5.5 illustrate that the technique is practical for other common sensor-network applications. For example, TR is appropriate for many continuous-output applications like vehicle tracking, because nodes in such a network do not accumulate long-term state.
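TR's conservative rule can likewise be sketched in a few lines of Python, under the same assumption of an oldest-to-newest checkpoint list in which each entry records the set of nodes it captured; the structures are illustrative only.

# Sketch of TR's conservative rollback-target selection.
def tr_rollback_target(ckpt_nodesets, failed_node):
    """Return the index of the latest checkpoint C such that C and every
    earlier checkpoint exclude failed_node; None means restarting from the
    beginning of the macroprogram."""
    target = None
    for i, nodes in enumerate(ckpt_nodesets):
        if failed_node in nodes:
            break          # this and all later checkpoints are unsafe
        target = i
    return target

# As in Figure 5.3, when the first checkpoint covers every node, any failure
# forces a restart from the beginning:
assert tr_rollback_target([{1, 2, 3}, {1}], failed_node=2) is None
assert tr_rollback_target([{1}, {1, 2}], failed_node=2) == 0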
5.5 Evaluation

In this section, we describe the results of experiments conducted on a wireless testbed using an implementation of Kairos and the recovery mechanisms described in this paper. We quantify the efficacy of our recovery techniques along various dimensions: error in application quality, application availability, and messaging and memory overhead.

5.5.1 Methodology

Implementation: We implemented Kairos, and the recovery techniques for Kairos, partly in Python (using its embedding and extending APIs) and partly in C. The Kairos runtime uses EmStar [EBB+] to implement end-to-end reliable routing and topology management. Our Kairos implementation runs on 32-bit embedded platforms such as the Stargate [Incc], as well as on PCs. We have presented the details of our Kairos implementation in [GGG05a]. To this implementation, we added the recovery API described in Section 5.3, and the compiler and runtime support for Declarative Recovery (DR) and Transparent Recovery (TR) described in Section 5.4. All experiments reported in this paper use this implementation.

Applications: We evaluate the efficacy of recovery in Kairos using three representative sensor-net applications written in Kairos, the complete code for which is given in [GGG05a]. The three applications are: vehicle tracking, for which an explicitly distributed algorithm based on Bayesian belief propagation is given in Liu et al. [JLZ]; node localization in a sensor network, for which we macroprogram the distributed algorithm based on cooperative multi-lateration as given in Savvides et al. [SHS]; and quantile estimation, for which we macroprogram the distributed algorithm, based on the concept of a summarizing data structure called q-digests, as given in Shrivastava et al. [SBSA].

These applications place different demands on Kairos, yet, as we show below, Kairos is able to satisfactorily recover each application. Vehicle tracking is an instance of a locally-communicating, continuously-sensing, latency-sensitive, periodic (duty-cycling) application. Localization is an example of a globally-communicating, single-shot, latency-insensitive application, driven by network events such as node addition, deletion, mobility, and reconfiguration. Finally, q-digest is a network-wide locally-communicating application, whose output and latency sensitivity requirements depend on its use: for collecting statistics over a continuously changing sensor field, it can be configured to be a latency-tolerant, continuous-output application; however, for low-frequency rare event monitoring, it can be configured as a latency-sensitive single-shot application.

The Testbed: Our testbed consists of 36 nodes, of which 15 are Stargates with an attached Mica-Z mote (Figure 5.8). The remaining 21 are emulated nodes, each node being emulated by one EmStar process. Each emulated node uses a real (not emulated) Mica-Z mote for all communication. These Mica-Z motes are mounted on the ceiling of our laboratory (Figure 5.8). This setup allows us to simulate real-world multihop configurations without being constrained by the limited memory resources of the current generation of motes. In our testbed, all nodes are within a single physical hop of each other, but we configure nodes to multi-hop through other nodes in order to more closely mimic real deployments. Specifically, we arrange these nodes to form a 6x6 2D torus topology.

Experimental Setup: A single run of an experiment measures application performance metrics (described below) for N faults, where N ranges from 0 to 15. We inject three types of faults into the system, and in a given run, all injected faults are of the same type. In a software fault (SW), the application instance at a node is killed, leaving the Kairos runtime operational. In this case, for example, remote reads of raw (unprocessed) sensor data can be satisfied by the Kairos runtime. In a hardware fault (HW), the entire node is stopped, so that neither the Kairos runtime nor the application can send or receive messages. When injecting a software or hardware fault, we are careful to keep the network itself connected. Finally, we also inject a network partition (PR), where N nodes are partitioned from the system, and the partition then heals after 2 minutes. In all cases, the network is started with no faults, and faults are injected immediately after the first call to get_available_nodes() has succeeded.

We set algorithm parameters as follows. For vehicle tracking, we assume a constant-speed target moving randomly within the 6x6 grid. Other parameters of the algorithm in [JLZ] are scaled to fit our topological dimensions. For localization, coordinates of beacon nodes are randomly perturbed with Gaussian noise according to the parameters in [SHS]. The q-digest application uses a Kairos application to construct the routing tree along which the data digests are sent. We configure q-digest to periodically (every 100s) send digests.
In all experiments, the recovery latency for checkpointing, after failure detection, was less than a minute.

Comparing Recovery Strategies: We evaluate transparent recovery (TR) for software faults (TR-SW) and for hardware faults (TR-HW). In the programs we evaluate, TR-SW and TR-HW are respectively equivalent to DR-SW and DR-HW because they happen to roll back to the same checkpoint in each case, and thus exhibit identical performance. For evaluating the efficacy of recovery after partition healing, we evaluate Declarative Recovery with non-null merge functions in the annotations (DR-PR). Since Transparent Recovery is primarily meant for applications that don't accumulate long-lived state, we do not evaluate TR-PR. For DR-PR, we have inserted declarative annotations into each of the Kairos applications; none of our applications requires more than 5 annotations, although the largest of our applications is nearly 250 lines of macroprogramming code, which is roughly the size of large macroprogrammed tasks in practice.

We compare these strategies against two baseline cases: the performance of the application without any faults (NF), and the performance of the application without recovery (NR) in the presence of faults.

Metrics: Our comparison is based on the following four quantitative metrics. Our first metric is application availability, defined for two of our applications, vehicle tracking and q-digest. In these two applications, the application periodically (say, every T seconds) returns a result (the current location of the vehicle, or the current median). When a fault occurs, the application may or may not be able to return an answer at a given instance. Define U_F to be the fraction of intervals during which an unrecovered application (NR) did not return an answer. Define U_R analogously, but for an application with a recovery strategy applied. Then, we define application availability to be log10(U_F / U_R), a metric that is commonly used for representing availability. This logarithmic metric defines an application which returns an answer during 0.999 of the intervals to have 1x (or 100% more) availability compared to one which returns an answer during 0.99 of the intervals.

Our second metric measures application error. This metric applies to all three applications, of course, and is defined for vehicle tracking as |z_N - z_R| / |z_N|, where z_N is the approximation computed by NF, and z_R is the approximation computed by either TR-SW, TR-HW or DR-PR. The metric is similarly defined for our other applications.

Our last two metrics measure the messaging and memory overhead of recovery. They are defined as the additional fraction of messages sent, or memory used, relative to NF. We chose these metrics because, for recovery techniques to be practical in realistic multi-hop scenarios, they need to be lightweight in addition to being expressive.
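For concreteness, the availability and error metrics can be computed as in the following small Python example; the numeric values are illustrative only.

# Worked example of the availability and error metrics defined above.
from math import log10

def availability_gain(u_f, u_r):
    """Logarithmic availability increase: u_f and u_r are the fractions of
    intervals with no answer, without and with recovery respectively."""
    return log10(u_f / u_r)

def relative_error(z_n, z_r):
    """Application error of a recovered run against the fault-free run."""
    return abs(z_n - z_r) / abs(z_n)

# An application answering in 99.9% of intervals has 1x (an order of
# magnitude) more availability than one answering in 99% of them.
assert round(availability_gain(0.01, 0.001), 3) == 1.0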
5.5.2 Results

In Figure 5.9, we plot the availability of the vehicle tracking application as a function of the number of faults. Notice that the advantages of recovery are apparent even with one failure; the increase in availability is more than 1x, indicating that recovery strategies reduce the number of intervals during which no answer is available by a factor of 10. As the number of faults increases, this factor rises to nearly 30 (10^1.5). The availability for the q-digest application is qualitatively similar (Figure 5.10). For both applications, different strategies exhibit slightly different availabilities, mostly due to differences in the latency of recovery across the three approaches.

In Figure 5.11, we plot the relative error in the position estimate as a function of the number of faults for the vehicle tracking application. Our baseline for comparing our recovery strategies is the case where no recovery strategy is employed (NR). The relative error under NR is indicative of the loss of application fidelity inherent in failure; for example, when nodes fail, a vehicle tracking application is essentially left with a less dense network than before, adversely affecting tracking quality. Our main observation in this graph is that, while the application accuracy degrades linearly with increasing numbers of faults (TR-HW), this degradation is no worse than the relative error under NR. This indicates that the loss of application accuracy is entirely inherent in node failure, and recovery does not exacerbate this loss. On the contrary, recovery reduces application error: TR-HW has lower application error than NR, because the latter has lower availability resulting in missed readings and therefore a more erroneous track (since the tracking algorithm uses a smoothed history of position readings). Furthermore, note that TR-SW exhibits no relative error at all; when the Kairos runtime is able to respond with sensor readings, even if the application instance at the node itself is dead, the overall application is still able to preserve fidelity. Finally, we do not show DR-PR in this graph; partition healing is not relevant to an application in which the answer (the position estimate) is continuously changing.

In Figures 5.12 and 5.13, we plot the relative error of the various recovery strategies for q-digest and localization. The interesting difference between these graphs and Figure 5.11 is that, as expected, DR-PR also exhibits zero application error since partition recovery is able to recover lost work by merging two q-digests, or location estimates. Furthermore, since medians and location estimates are relatively less sensitive to node loss than vehicle tracking, the magnitude of error for TR-HW is lower. Even in this case, however, this error is comparable to the application error without fault recovery (we do not show application error under NR for localization because a single failure causes the application to not successfully terminate).

In Figure 5.14, we show the messaging overheads for the various recovery strategies for each of the applications. The error bars in the figure depict the variation in overhead with the number of faults. We see that the communication overhead of these mechanisms is independent of the severity of faults, and depends mostly on the nature of the application. TR-SW and TR-HW are almost equal because they share the same logic when invoked, and incur no more than 25% additional messaging overhead. DR-PR incurs twice as much overhead as TR-HW or TR-SW for some applications like q-digest that span the entire network, and, therefore, involve a large number of nodes. For applications like vehicle tracking in which, at any given time, only nodes within a certain locality are involved, the overhead of DR-PR is almost a constant and quite small (about 15%).

Finally, we see in Figure 5.15 that TR requires between 2.2–2.5 times the data memory of an application without recovery, which measures the amount of checkpointing state maintained. Today's sensor nodes have different program, SRAM, and flash memories.
Since SRAM (data) memory can be stored inside a (much larger) flash, this problem is less severe. Nevertheless, clearly, this is an aspect of our system that could benefit from some optimization. This memory overhead is independent of the number of faults. It depends only on application characteristics, specifically the average nesting depth of checkpoints. Interestingly, for our applications, this nesting depth happens to be comparable (slightly more than 2 for an average execution trial), hence the memory overhead appears to be the same across applications.

Summary. Our recovery strategies can improve application availability by an order of magnitude, while preserving application accuracy for certain kinds of faults (software faults, and network partitions). They incur acceptable messaging overhead (less than 15% for vehicle tracking), and a factor of two additional data memory for checkpointing. While TR works well for continuous-output applications like vehicle tracking, those requiring longevity, such as q-digest, benefit from having declarative annotations with merge functions for partition recovery.

5.6 Related Work

There are two primary approaches for generic rollback-based recovery schemes, a survey of which is given in [EAWJ]. Such schemes can be classified as either checkpoint-based or log-based. Log-based protocols tend to have unpredictable message logging requirements, which are hard to provision for in memory-restricted sensor nodes. However, log-based protocols are predominantly used in databases [Mos87] and file systems [RO] because they have access to large and persistent disk storage. CRR is checkpoint-based, and there is an extensive set of algorithms and implementations of distributed checkpoint schemes in a variety of domains ranging from loosely coupled message passing systems to tightly coupled multiprocessors [AM98, BQC98, Eln94, Joh90, KT87, NX95, Pla93, SS83, SY85]. A taxonomy and survey of such schemes is given in [KR]. Two important features of our checkpointing API are that (a) it is easier to use than most of these techniques because it leverages the macroprogramming abstraction, and (b) it is implemented efficiently over the broadcast facility in wireless sensor networks.

Declarative recovery through annotations is a novel aspect of our work. We are not aware of prior work similar to these language-level constructs, even though a growing body of literature exists for augmenting systems such as MPI with recovery APIs and libraries [BBC+02]. Also, recently, there is renewed interest in implementing systems components using declarative approaches [LHSR, GSa], but they do not directly deal with recovery.

Our partition recovery support is also novel. Others have proposed application-specific merge functions in varied contexts such as distributed file systems [PGPH90] and mobile computing [TTP+a]. However, such systems have not been widely popular mainly because their generality means merge functions are hard to write. We believe that it is simpler to write merge functions for sensor network applications because they are frequently numerical in nature. Madden et al. have proposed a form of merge functions for query processing in sensor networks [MFHHa], but those merge functions were for normal processing inside SQL queries, and not for recovery. Finally, we are not aware of any prior work that considered recovery in sensor networks, either in the context of macroprogramming systems such as [NW, WSBC, WM] or otherwise.
5.7 Conclusions and Future Work

Failures are a critical concern for sensor-network systems, and one that crosscuts entire applications. In this paper, we have described the problems with manual failure detection and recovery in sensor networks, and we have shown how the notion of macroprogramming can be used to largely untangle the failure concern from the application logic. First, we have designed a generic checkpointing API for macroprogramming systems that leverages the centralized view of a network to allow checkpoint and rollback to be specified at natural points in the overall application. Second, we explored two automated recovery strategies, which significantly raise the level of abstraction for specifying recovery and serve to further insulate the recovery concern from the rest of the application. We have implemented our checkpointing API and automated recovery strategies in the Kairos macroprogramming system, and experimental results illustrate their utility and practicality.

Several avenues for future work remain. First, it would be useful to gather more experience with our techniques on real-world deployments. Second, our work on transparent recovery is only a first step; we plan to examine a range of applications to better understand appropriate heuristics for transparent recovery that will be widely applicable. Finally, our recovery implementation is relatively unoptimized; in future work we will use program-analysis techniques to automatically minimize work lost in recovery and to minimize the memory overhead of checkpoints.

node bs;
//executed at every sender
void aggregate_send() {
1: uint temp,light;
2: for(;;) {
3:   sleep(SAMPLE_INTERVAL);
4:   sample(temp);
5:   sample(light);
6:   send_sample(temp,bs);
7:   send_sample(light,bs);
   }
}

Ckpt ckpt;
//executed at base station
void aggregate_receive() {
8: time next_epoch;
9: list node_list, received_list;
10:uint av_l, av_t, count, timeout;
11:boolean done;
12:for (;;) {
13:  ckpt=take_local_ckpt();
14:  next_epoch=get_cur_time()+SAMPLE_INTERVAL;
15:  node_list=get_available_nodes(),received_list=NULL;
     //check if node_list has an old base station
16:  if (hasLower(node_list,id())) {
17:    ...//invoke merge()...
     }
18:  timeout=SAMPLE_INTERVAL;
19:  done=FALSE;
20:  while (!done) {
       //wait till timeout or at least one node sends
21:    received_list=select(TEMP_T|LIGHT_T,node_list,&timeout);
22:    if(received_list!=NULL) {
23:      //read temp and/or lt values; compute averages...
24:      //remove node from node_list if bs got temp,lt...
25:      if (node_list==NULL) done=TRUE;
       }
26:    else{//bs timed out=>nodes in node_list are dead
         //restore node-local state to previous epoch
27:      restore_local_ckpt(ckpt);
       }
     }
28:  sleep(next_epoch-get_cur_time());
   }
}

Figure 5.2: Send and receive procedures for data aggregation in a node-level program with manual recovery.

void av() {
1: nodelist full_node_set;
2: node iter, bs;
3: uint sleep_interval=1000;
4: uint nodelocal count=1, av_t=0, av_l=0;
5: uint nodelocal sensor temp, lt;
6: full_node_set=get_available_nodes();
7: bs=get_first(sort(full_node_set));
8: for (;;) {
9:   sleep(sleep_interval);
10:  for (iter=get_first(full_node_set);iter!=NULL;
         iter=get_next(full_node_set)) {
11:    av_t@bs=(av_t@bs*(count@bs-1)+temp@iter)/count@bs;
12:    av_l@bs=(av_l@bs*(count@bs-1)+lt@iter)/count@bs++;
     }
   }
}

Figure 5.3: Example macroprogram for computing average temperature and light readings.
Ckpt ckpt1, ckpt2;
void av() {
1: nodelist full_node_set;
2: node iter, bs;
3: uint sleep_interval=1000;
4: uint nodelocal count=1, av_l=0, av_t=0, temp, lt;
5: full_node_set=get_available_nodes();
6: ckpt1=take_ckpt(full_node_set);
   //Check if we have to take another checkpoint
7: if (ckpt1.restored){
8:   full_node_set=get_available_nodes();
9:   ckpt1=take_ckpt(full_node_set);
   }
10:bs=get_first(sort(full_node_set));
11:for (;;) {
12:  sleep(sleep_interval);
13:  ckpt2=take_ckpt(bs);
14:  full_node_set=get_available_nodes();
15:  for (iter=get_first(full_node_set);iter!=NULL;
         iter=get_next(full_node_set)) {
16:    av_t@bs=(av_t@bs*(count@bs-1)+temp@iter)/count@bs;
17:    av_l@bs=(av_l@bs*(count@bs-1)+lt@iter)/count@bs++;
     }
18:  if (_failed) {
19:    full_node_set=get_available_nodes();
20:    if (member(bs,full_node_set)) {
21:      //bs still alive=>another node crashed
22:      restore_ckpt(ckpt2);
23:    } else {
24:      restore_ckpt(ckpt1);
       }
     }
   }
}

Figure 5.4: Example macroprogram with manual recovery code.

Ckpt ckpt1, ckpt2;
node bs, bs_P1, bs_P2;
uint nodelocal count=1, av_l=0, av_t=0;
void av() {
1: nodelist full_node_set;
2: node iter;
3: uint sleep_interval=1000;
4: uint nodelocal temp, lt;
5: full_node_set=get_available_nodes();
6: ckpt1=take_ckpt(full_node_set);
   //Check if we have to take another checkpoint
7: if (ckpt1.restored){
8:   full_node_set=get_available_nodes();
9:   ckpt1=take_ckpt(full_node_set);
   }
10:bs=get_first(sort(full_node_set));
11:for (;;) {
12:  sleep(sleep_interval);
13:  ckpt2=take_ckpt(bs);
14:  full_node_set=get_available_nodes();
15:  for (iter=get_first(full_node_set);iter!=NULL;
         iter=get_next(full_node_set)) {
16:    av_t@bs=(av_t@bs*(count@bs-1)+temp@iter)/count@bs;
17:    av_l@bs=(av_l@bs*(count@bs-1)+lt@iter)/count@bs++;
     }
18:  if (_failed) {
19:    full_node_set=get_available_nodes();
20:    if (member(bs,full_node_set)) {
21:      //bs still alive=>another node has crashed
22:      restore_ckpt(ckpt2);
23:    } else {
24:      restore_ckpt(ckpt1);
       }
     }
25:  if (_healed) {
26:    merge_av();
     }
   }
}

void merge_av() {
27:bs=min(bs_P1,bs_P2);
28:av_l@bs=(av_l@bs_P1*count@bs_P1+av_l@bs_P2
       *count@bs_P2)/(count@bs_P1+count@bs_P2);
29:av_t@bs=(av_t@bs_P1*count@bs_P1+av_t@bs_P2
       *count@bs_P2)/(count@bs_P1+count@bs_P2);
30:count@bs=count@bs_P1+count@bs_P2;
}

Figure 5.5: Example macroprogram for recovering from partitions.

State Type | Example | Merge Function
Aggregatable scalars | Sum, average, count | Simple aggregation
Linearly combinable vectors and matrices | Vector aggregates, auto- and cross-correlations, covariance, Fourier transforms | Textbook compositional formulae
Spatiotemporal state | Isobars, contours, etc. | Model/problem-specific but simple low-state spatiotemporal interpolated composition
Non-aggregatable scalars | Max, min, quantiles, histograms, etc. | Duplicate-insensitive counting/sketch theory, q-digests, approx. aggregates, etc.

Figure 5.6: Common tasks and their merge functions.
void av() {
1: nodelist full_node_set;
2: node iter, bs;
3: uint sleep_interval=1000;
4: uint nodelocal count=1, av_t=0, av_l=0;
5: uint nodelocal sensor temp, lt;
6: full_node_set=get_available_nodes();
7: <full_node_set,NULL>
8: full_node_set=get_available_nodes();
9: bs=get_first(sort(full_node_set));
10:for (;;) {
11:  <{bs},NULL>
12:  full_node_set=get_available_nodes();
13:  sleep(sleep_interval);
14:  for (iter=get_first(full_node_set);iter!=NULL;
         iter=get_next(full_node_set)){
15:    av_t@bs=(av_t@bs*(count@bs-1)+temp@iter)/count@bs;
16:    av_l@bs=(av_l@bs*(count@bs-1)+lt@iter)/count@bs++;
     }
   }
}

Figure 5.7: Example macroprogram to illustrate Declarative Recovery (DR).

Figure 5.8: A single Mica-Z controlled by a PC (left), a single Mica-Z attached to a Stargate (center), and Mica-Z's (circled) on the ceiling (right).

Figure 5.9: Availability comparison of TR-Software (TR-SW), TR-Hardware (TR-HW), and DR-Partition Recovery (DR-PR) strategies with increasing node failures. (x-axis: number of failures; y-axis: increase in vehicle tracking availability factor, logarithmic.)

Figure 5.10: Availability comparison of TR-Software (TR-SW), TR-Hardware (TR-HW), and DR-Partition Recovery (DR-PR) strategies with increasing node failures. (x-axis: number of failures; y-axis: increase in q-digest availability factor, logarithmic.)

Figure 5.11: Accuracy comparison of TR-Software (TR-SW), TR-Hardware (TR-HW), and No Recovery (NR) strategies with increasing node failures. (x-axis: number of failures; y-axis: vehicle tracking error %.)

Figure 5.12: Accuracy comparison of TR-Software (TR-SW), TR-Hardware (TR-HW), DR-Partition Recovery (DR-PR), and No Recovery (NR) strategies with increasing node failures. (x-axis: number of failures; y-axis: q-digest error %.)

Figure 5.13: Accuracy comparison of TR-Software (TR-SW), TR-Hardware (TR-HW), and DR-Partition Recovery (DR-PR) strategies with increasing node failures. (x-axis: number of failures; y-axis: localization error %.)

Figure 5.14: Message overhead comparison of TR-SW, TR-HW, and DR-PR strategies. (y-axis: message overhead %; bars for q-digest, localization, and vehicle tracking.)

Figure 5.15: Memory overhead for CRR. (x-axis: number of failures; y-axis: data memory overhead factor; curves for q-digest, localization, and vehicle tracking.)

Chapter 6: Conclusions

In this thesis, we have experimentally argued that reliability should be an important goal for programming languages and systems that support sensor network programmers. We described and evaluated three different systems that offered different trade-offs in terms of compatibility, reliability, programmability, and resource requirements. We demonstrated that all three systems are practical for today's hardware and applications.

Once we incorporate reliability as an important concern, the number of research avenues increases dramatically. For example, a large amount of existing distributed systems and fault-tolerant systems literature is devoted to examining reliability, as described in detail in the related work of each chapter. In particular, the two main systems challenges of consistency and availability that preoccupy a large number of current researchers in related fields pose particularly severe challenges for sensor networks.
At the same time, they present tantalizing possibilities because of the scale, distribution, concurrency, and resource-starvation of sensor networks. Dealing with the unexplored possibilities of reliability research, such as loose consistencies and consistency/availability tradeoffs, can therefore provide deep, impactful and satisfying results in the future.

Bibliography

[ABC+] T. Abdelzaher, B. Blum, Q. Cao, Y. Chen, D. Evans, J. George, S. George, L. Gu, T. He, S. Krishnamurthy, L. Luo, S. Son, J. Stankovic, R. Stoleru, and A. Wood. EnviroTrack: Towards an environmental computing paradigm for distributed sensor networks. ICDCS, 2004.
[AG] S. V. Adve and K. Gharachorloo. Shared memory consistency models: A tutorial. IEEE Computer, 1996.
[AM98] Lorenzo Alvisi and Keith Marzullo. Message logging: Pessimistic, optimistic, causal, and optimal. IEEE Trans. Softw. Eng., 24(2):149–159, 1998.
[BBC+02] G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. Fedak, C. Germain, T. Herault, P. Lemarinier, O. Lodygensky, F. Magniette, V. Neri, and A. Selikhov. MPICH-V: toward a scalable fault tolerant MPI for volatile nodes. In SC'02, pages 1–18, Los Alamitos, CA, USA, 2002. IEEE Computer Society Press.
[BHS] A. Boulis, C. Han, and M. B. Srivastava. Design and implementation of a framework for efficient and programmable sensor networks. MobiSys, 2003.
[BOP] A. Bakshi, J. Ou, and V. K. Prasanna. Towards automatic synthesis of a class of application-specific sensor networks. CASES, 2002.
[BP] A. Bakshi and V. K. Prasanna. Algorithm design and synthesis for wireless sensor networks. ICPP, 2004.
[BQC98] R. Baldoni, F. Quaglia, and B. Ciciani. A VP-accordant checkpointing protocol preventing useless checkpoints. In SRDS '98, page 61, Washington, DC, USA, 1998. IEEE Computer Society.
[CADG+] David E. Culler, Andrea C. Arpaci-Dusseau, Seth Copen Goldstein, Arvind Krishnamurthy, Steven Lumetta, Thorsten von Eicken, and Katherine A. Yelick. Parallel programming in Split-C. In Supercomputing, 1993.
[CFR] Gruia Calinescu, Cristina G. Fernandes, and Bruce Reed. Multicuts in unweighted graphs with bounded degree and bounded tree-width. LNCS, 1998.
[CHZ] M. Chu, H. Haussecker, and F. Zhao. Scalable information-driven sensor querying and routing for ad hoc heterogeneous sensor networks. International Journal of High Performance Computing Applications, 2002.
[CKF+] G. Candea, S. Kawamoto, Y. Fujiki, G. Friedman, and A. Fox. Microreboot - a technique for cheap recovery. OSDI, 2004.
[CL85] K. M. Chandy and L. Lamport. Distributed snapshots: Determining global states of distributed systems. ACM Trans. Comput. Syst., 3(1):63–75, 1985.
Abstract
Sensor networks promise to allow the world around us to be observed, measured, and even controlled at a fine granularity. However, in order to realize the full potential of sensor networks, it is increasingly apparent that they should be easily, reliably, and efficiently programmable. Surprisingly, state-of-the-art programming languages and systems focus mostly on programmability and efficiency, and support reliability only poorly, if at all. In this thesis, we take the first step toward achieving all three goals by building three related languages and systems, each of which supports reliability. First, we show how one can easily modify existing code, which is primarily designed for efficiency, in order to provide reliability. Second, since today's programming systems are not easily accessible to non-experts, we design and implement two languages that are easy to program and also offer trade-offs in terms of reliability and efficiency. Our experimental results from these three systems indicate that it is possible to build reliable and efficient systems that are also simple to program.