HIGH-PERFORMANCE DISTRIBUTED COMPUTING TECHNIQUES FOR WIRELESS IOT AND CONNECTED VEHICLE SYSTEMS

by

Kwame-Lante Wright

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

December 2019

Copyright 2019 Kwame-Lante Wright

Dedication

To my beloved family, for their constant support.

Acknowledgements

This has been a long journey and I would like to express my sincere gratitude for those who have helped me along the way. First, I would like to thank my advisor, Professor Bhaskar Krishnamachari, who welcomed me into his lab and has guided me throughout the PhD process. He has given me the freedom to explore and has encouraged me to develop independence in my academic life. I am grateful for the constant support he has provided throughout my time at USC.

Dr. Fan Bai has been my industry advisor for the past few years and I am thankful for his valuable feedback and advice on research projects and my career. I would also like to thank the rest of my dissertation committee, Professor Ramesh Govindan and Professor Konstantinos Psounis, for their insightful comments and contributions to my research direction. I also thank Professor Pierluigi Nuzzo for serving on my qualifying exam committee and providing helpful feedback, and Pranav Sakulkar for stimulating conversations and collaboration on my research work.

I am thankful for my labmates in the Autonomous Networks Research Group and other colleagues at USC with whom I've had thoughtful discussions on research and life. Their friendship has helped me to survive the PhD program both academically and personally. Our adventures together have helped make my time at USC enjoyable and I feel lucky to be part of such a wonderful cohort. I would also like to thank the staff of the Electrical Engineering department, especially Diana Demetras, Shane Goodoff, and Ted Low, for their help throughout my time at USC.

Over the years, my research work has been supported in part by the Annenberg Foundation, USC Graduate School, National Science Foundation, General Motors Research and the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001117C0053. Thank you. (Any views, opinions, and/or findings expressed are those of the author(s) and should not be interpreted as representing the official views or policies of GM, the Department of Defense or the U.S. Government.)

Table of Contents

Dedication
Acknowledgements
List Of Tables
List Of Figures
Abstract

Chapter 1: Introduction
  1.1 Terminology
    1.1.1 Polymorphic Computing
  1.2 Contributions

Chapter 2: Background and Related Works
  2.1 Computational Offloading
  2.2 Artificial Neural Networks
    2.2.1 Inference and Training
    2.2.2 Acceleration and Tradeoffs
  2.3 Knapsack Problems
    2.3.1 Common Variations
  2.4 The Publish-Subscribe Paradigm
    2.4.1 Request-Response Messaging
    2.4.2 Publish-Subscribe Messaging
  2.5 Macroprogramming Frameworks
  2.6 Privacy and Access Control
    2.6.1 Access Control in Publish-Subscribe Messaging

Chapter 3: Publish-Process-Subscribe Framework
  3.1 Message Queue Telemetry Transport
  3.2 Noctua System Design
    3.2.1 Architecture
    3.2.2 Implementation
    3.2.3 Macros
    3.2.4 Privacy
  3.3 Role-based Publishing
  3.4 Evaluation
    3.4.1 Hardware
    3.4.2 Weighted Moving Average
    3.4.3 Application Example: Localization as a Service
    3.4.4 Impact of Role-based Publishing
  3.5 Conclusion

Chapter 4: Polymorphic Stream Processing
  4.1 Data Deluge
  4.2 Driver Perception Augmentation Application
  4.3 Performance Metrics
  4.4 System Design
    4.4.1 Framework
    4.4.2 Frame Rate Adaptation
    4.4.3 Scheduling Algorithm
  4.5 Experimental Setup
    4.5.1 Mahimahi Network Emulation
    4.5.2 Image Processing Pipelines
    4.5.3 Benchmarks
    4.5.4 Hardware Specifications
  4.6 Results
    4.6.1 Scenario 1: Overhead
    4.6.2 Scenario 2: Scalability
    4.6.3 Scenario 3: Makespan Constraints
    4.6.4 Scenario 4: Intermittent Resource
  4.7 System Performance Model
    4.7.1 Model Design
    4.7.2 Results
  4.8 Conclusion

Chapter 5: Utility-based Scheduling Algorithms for Polymorphic Applications
  5.1 The Scheduling Problem
    5.1.1 System Model
    5.1.2 Problem Formulation
    5.1.3 Intractability
  5.2 The Usher Algorithms
    5.2.1 Simple Greedy Algorithm
    5.2.2 Bundle-based Greedy Algorithm
  5.3 Trace-based Evaluation
    5.3.1 Schedule Quality on an Object Detection Dataset
    5.3.2 Heterogeneous Links on a Random Dataset
    5.3.3 Multi-Objective Utility in Edge Computing Environment
    5.3.4 Runtime Performance
  5.4 Related Works
    5.4.1 Mobile Edge Computing
    5.4.2 Task Scheduling
    5.4.3 Knapsack Problems
  5.5 Conclusion

Chapter 6: Conclusions

Bibliography

List Of Tables
  3.1 Topic reference settings
  4.1 Hardware Specs
  4.2 VESPER Simulation Settings
  5.1 Performance Data and Resource Requirements of YOLO Variants
  5.2 Edge and Cloud Node Details

List Of Figures
  2.1 Artificial Neural Network
  3.1 Example data flow through an MQTT broker
  3.2 Noctua Architecture
  3.3 A macro for temperature conversion
  3.4 JSON object for HTTP POST request to create macro
  3.5 A macro for averaging two temperatures
  3.6 Privacy protection
  3.7 Role-Based Publishing with Noctua
  3.8 Various sensors on a CCI Testbed node
  3.9 Macro for a weighted moving average
  3.10 Minimum MQTT PUBLISH messages required until first result
  3.11 Delay between first PUBLISH message and first result
  3.12 Various implementations of a three-sensor system
  3.13 MQTT PUBLISH messages required for varying number of sensors
  3.14 MQTT PUBLISH messages required for varying number of subscribers
  3.15 Macro for localization using maximum-likelihood estimation
  3.16 A person's path through an indoor environment
  3.17 Localization using maximum-likelihood (ML) estimation using macro on Noctua
  3.18 Average calculation times for location prediction for macro on Noctua
  3.19 Average data loss due to setup latency associated with role-based publishing on Noctua
  4.1 Potential processing resources available to a car
  4.2 Illustration of polymorphic computing: two different vision processing pipelines (variations of YOLO) for vehicle detection in images that offer different tradeoffs between accuracy and performance
  4.3 Parameters affecting the real-time constraints
  4.4 Framework Architecture
  4.5 Network Topology
  4.6 Experimentation Testbed
  4.7 Scenario 1: Accuracy/Throughput Performance
  4.8 Scenario 1: Makespan Distribution
  4.9 Scenario 2: Accuracy/Throughput Performance
  4.10 Scenario 2: Makespan Distribution
  4.11 Scenario 2: Percentage of work by each device
  4.12 Scenario 3: Accuracy/Throughput Performance
  4.13 Scenario 3: Makespan Distribution
  4.14 VESPER performance over time in Scenario 4
  4.15 VESPER Performance vs. Latency Constraint
  4.16 VESPER Performance vs. Throughput Constraint
  4.17 VESPER Performance vs. Number of Pipelines
  4.18 VESPER Performance vs. RSU Spacing
  5.1 Example of scheduling jobs onto nodes. Implementation m_2 of Job 0 is assigned to Node 0 while implementation m_1 of Job 1 is assigned to Node 2
  5.2 Utility (latency-based) vs. the number of GPUs per node for various numbers of nodes and jobs
  5.3 Number of nodes used vs. the number of GPUs available per node
  5.4 Change in utility as (a) the number of nodes are varied and (b) the number of jobs are varied
  5.5 Network Topology
  5.6 Utility vs. the numbers of jobs from our random dataset
  5.7 Tradeoffs between accuracy, latency, energy and price obtained by four utility functions with different priorities
  5.8 Schedules resulting from different utility functions
  5.9 Runtime of Usher algorithms with five nodes

Abstract

Recent advances in machine learning and artificial intelligence have brought about a variety of new applications in the wireless IoT and vehicular environments, including those that employ speech recognition and image processing. Many of these applications are computationally intensive and may exceed the capacity of a single device. In such situations, these devices often rely on cloud computing to provide the processing power needed to run the applications. However, some of these applications are latency-sensitive due to their interactive nature or use in the operation of a vehicle. As the demand placed on cloud computing services and its network infrastructure grows, it will become more and more difficult to provide the performance guarantees required by these applications.

Edge computing is an offloading technique that has emerged as a solution to satisfy the growing demand of cloud computing services by providing the same services physically closer to where they are needed. This has the dual benefit of reducing network congestion and reducing application latency. However, due to their geographically distributed nature, edge computing resources are relatively difficult to manage and utilize efficiently. This area is still an open research problem, particularly for latency-sensitive applications.

In this work, we develop tools to facilitate the use of edge computing resources for latency-sensitive applications in both wireless IoT and connected vehicle systems. We begin by presenting Noctua, a framework that enables a publish-process-subscribe architecture for IoT applications.
Through a real-system implementation, we demonstrate and evaluate how Noctua can help IoT developers by enabling more efficient use of network resources and reducing the strain on edge devices by delivering to them more meaningful data. We illustrate Noctua's capability through application examples including aggregating multiple sensor flows and providing radio signal-strength-based localization as a real-time service.

Next, we introduce VESPER, a real-time processing framework and online scheduling algorithm designed to exploit dynamically-available distributed devices that are connected via wireless links. A significant feature of the VESPER algorithm is its ability to navigate the tradeoff between accuracy and computational complexity of modern machine learning tools by adapting the workload, while still satisfying latency and throughput requirements. We refer to this capability as polymorphic computing. VESPER also scales opportunistically to leverage the computational resources of external devices. We evaluate VESPER on an image-processing pipeline and demonstrate that it outperforms offloading schemes based on static workloads.

Finally, we present Usher, a framework for structuring and scheduling latency-sensitive applications to enable efficient utilization of computing resources across networked devices. Like VESPER, Usher also exploits the concept of polymorphic computing, but it supports multiple applications of a more general form. Equipped with the Usher framework, we formulate the underlying optimization problem. We show that this problem is NP-hard, but propose two heuristic solutions for it: a simple greedy algorithm and a more sophisticated bundle-based greedy algorithm. We present approximation ratios for these algorithms, and also evaluate them empirically over realistic as well as constructed workloads to demonstrate their performance over a range of settings. The proposed system is simple and conducive to implementation on real networked distributed systems.

Chapter 1: Introduction

Advances over the past several years have paved the way for the continued growth of Internet of Things (IoT) devices. These devices are faster and more capable than ever before, but their computational capability is still lacking compared to fixed computers or servers. Moreover, our computational demands on mobile devices are also increasing as the interest in applications such as augmented and virtual reality grows [1]. Unfortunately, we currently remain fairly constrained to a centralized approach for the processing of computationally intensive tasks. For example, interactive mobile applications, such as personal assistants using speech recognition, will often rely on the cloud to ensure that responses can be provided within a reasonable amount of time with a high level of accuracy. However, offloading to the cloud for delay-sensitive tasks is not a sustainable solution. As the number of devices continues to grow at an incredible rate, network congestion will increase and make it more difficult to offset the latency costs incurred when communicating with a data center over a wide-area network such as the Internet. As Internet latency is not likely to improve in the near future, new techniques are needed to mitigate this issue [2].

Edge computing has emerged as a solution to this problem by bringing relatively powerful computing resources closer to the IoT devices that need them.
When these resources are placed one hop away from IoT devices, the link quality experienced for offloading to these edge resources is significantly improved when compared to offloading to a data center, particularly with regards to latency. While the concept is simple, i.e. that placing systems closer together leads to improved communication performance, implementing it in practice is not trivial. These edge resources are not deployed in a centralized location like a data center, which can be easily managed, but are rather geographically distributed at the edge of a network. How these resources can be effectively managed and efficiently utilized is an open research question and the focus of this work.

In addition to the wireless IoT environment, similar challenges are also appearing in the vehicular context. We are witnessing increasing network connectivity of vehicles to infrastructure. With the growth of smart cities, this enables vehicles to access a wealth of information that was previously inaccessible. Such data can include congestion information provided by induction-loop sensors in roadways or video streams from traffic cameras. This data is in addition to the locally-sourced information already provided by the on-board sensors of today's vehicles. As more and more vehicles are equipped with advanced hardware such as cameras, RADAR, LIDAR, and ultrasonic sensors, the modern vehicle will have a wealth of information to digest.

Access to large amounts of data raises many new possibilities for vehicular applications. However, there is a fundamental challenge that comes along with it: how does a vehicle process all of this information? The autonomous vehicles currently under development are being equipped with advanced processors and accelerators, such as graphics processing units (GPUs). While it is simpler and more cost-effective to budget sufficient computational resources on a vehicle for it to process the data from its own on-board sensors, it is not so clear how data received from external sources should be accommodated to satisfy application demands. Once a vehicle becomes networked, the data it has access to is virtually unbounded. This led us to explore whether the network capability that has brought about this challenge can also help to solve it through the use of computational offloading techniques.

We attempt to build on existing cloud and edge computing techniques to tackle the offloading challenge in a vehicular context. When compared to the traditional and mobile computing environments of desktops, servers and mobile devices, the vehicular context brings about a unique set of challenges:

- Highly dynamic mobility: By their very nature, vehicles move around constantly at high speeds. This causes rapid changes in network connectivity and link quality. An effective system should not expect a reliable network or long-lived links, and must quickly adapt to leverage external resources as they become available.

- Real-time performance constraints: There is a growing set of applications to support advanced driver assistance systems (ADAS) and autonomous capabilities in vehicles. Such applications have strict requirements in terms of performance, such as latency and throughput. As more and more data becomes accessible to vehicles, we expect an increase in the number of such applications and consequently their total computational demand.

In the following sections, we define some terminology and illustrate our major contributions towards this effort.
1.1 Terminology

1.1.1 Polymorphic Computing

Inspired by approximate computing, we recognize that in some cases a reliable real-time processing system can be more important than an accurate one, and we introduce a concept we refer to as polymorphic computing for this domain. Here polymorphic (coined from poly-, "many", and -morphic, "forms") refers to different computational pipelines that could be used for a processing task and that offer different accuracy-performance tradeoffs. The fundamental idea is that at a given point in time, an application's workload (resulting from a particular processing pipeline that may be optimized for accuracy) may not be achievable within a given time frame to meet real-time requirements. This could be due to a number of reasons outside the control of the application. However, rather than simply failing, it would be more useful if the application could adapt its workload (in other words, change the actual processing pipeline itself) in a way that it could be completed within the time constraints, albeit at the cost of some reduction in accuracy. For example, in a neural network context, such a polymorphic computation can be implemented by varying the number of neurons (both in terms of the number of layers in the network and the number of neurons in each layer). Neural networks with more nodes will generally offer a higher accuracy at the cost of a higher computational requirement [3]. Therefore, in circumstances where a limited amount of time is available for processing, a smaller, less-accurate neural network can be used.

1.2 Contributions

In this work, we developed frameworks and systems to tackle the following problems regarding efficient offloading for wireless IoT and connected vehicle systems:

1. Publish-Process-Subscribe Messaging
2. Polymorphic Stream Processing Proof of Concept
3. Modeling and Analysis of Polymorphic Stream Processing
4. Utility-based Scheduling Algorithms for Polymorphic Applications

Our contributions are organized as follows. Chapter 3 describes Noctua, our framework for publish-process-subscribe messaging. The design, modeling, and analysis of VESPER is discussed in Chapter 4. We present Usher, our utility-based scheduling algorithm for polymorphic applications, in Chapter 5.

Chapter 2: Background and Related Works

Distributed computing is a field that has a rich history and encompasses many different topics including communication, scheduling, and security. This type of computing became possible in the 1980s with the development of distributed systems. During that time there was a departure from monolithic mainframe systems due to the growing availability of inexpensive yet powerful microprocessors. This, in combination with the invention of computer networks, led to the distributed systems that we know today [4, 5].

Tanenbaum and Van Steen define a distributed system as "a collection of independent computers that appears to its users as a single coherent system" [5]. While this definition may be appropriately simple, it belies the incredible amount of abstraction and coordination that takes place behind the scenes to make such an experience possible. In this thesis, we focus on specific aspects of offloading computation in certain types of distributed systems and touch on various topics related to communication, scheduling, and security.
2.1 Computational Offloading

For part of our work, we develop computational offloading techniques to handle the intermittent connectivity of devices, multiple processing pipelines, and real-time performance constraints. In general, offloading enables a device to perform computations that exceed its individual capacity by leveraging external resources. This section discusses some prior applications of computational offloading referred to as cyber foraging.

Cyber foraging refers to the idea of opportunistically utilizing nearby resources to improve the performance of a locally-running application. Balan et al. highlight two advantages of cyber foraging [6]: (a) data staging to reduce the cache-miss service time in mobile file access, and (b) remote execution to offload compute-intensive applications from a mobile device. These capabilities are dependent on the availability of support infrastructure, which Balan et al. refer to as surrogates. Surrogates are relatively powerful computers dedicated for this purpose and are therefore deployed in environments where they'll have close proximity to mobile devices.

The authors of Slipstream present a low-latency parallel processing framework called Sprout for interactive game applications [7]. The Sprout framework aims to automate the execution and parallelization of applications on a cluster and provides an API to deal with a parallel, distributed system. Odessa [8] is a system designed to improve the latency and throughput of an application through offloading and parallelism decisions using the Sprout framework. It accomplishes this by making greedy incremental scheduling decisions across a two-node network, consisting of a mobile device and a server. Cloud computing platforms MAUI [9] and CloneCloud [10] also provide solutions for offloading from a mobile device to the cloud. Hermes [11] presents a task scheduling algorithm for minimizing latency subject to a cost constraint, over an arbitrary network of devices. The latencies and costs of different devices are assumed to be known either deterministically or in distribution.

The Serendipity distributed mobile computing framework [12] proposes the use of nearby mobile devices' resources to collaboratively process the jobs of a mobile device. The authors propose greedy algorithms for collaborative computing among multiple intermittently connected mobile devices with the goal to either minimize the job completion time or maximize the lifetime of the mobile node [12]. Glimpse [13] is a real-time object recognition and tracking system built to run on mobile devices. It utilizes what the authors refer to as an active cache to mitigate the impact of latency variations between a mobile device and the cloud in order to provide continuous performance. While similar to our work in that it addresses real-time constraints, Glimpse's dependence on the cloud is a limitation we overcome.

2.2 Artificial Neural Networks

Artificial neural networks are a type of machine-learning tool designed to mimic the learning behavior of the human brain [14]. They take as input some data, such as an image, and perform some processing to make an inference based on what the network was trained to do. Neural networks have become the state-of-the-art tool for many areas including computer vision and natural language processing [15, 16]. They work well because of the large amounts of information they are able to ingest and learn from.
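To make the inference computation described in Section 2.2.1 concrete, the following is a minimal sketch of a forward pass through a tiny fully connected network; the layer sizes, weights, and input values are made up for illustration and are not taken from this thesis.

// Minimal forward-pass sketch (illustrative values only). Each neuron forms a
// weighted sum of the previous layer's outputs plus a bias, then applies an
// activation function (ReLU for the hidden layer, identity for the output).
const relu = (x) => Math.max(0, x);

function denseLayer(inputs, weights, biases, activation) {
  // weights[j][i] is the weight from input i to neuron j of this layer.
  return weights.map((row, j) =>
    activation(row.reduce((sum, w, i) => sum + w * inputs[i], biases[j]))
  );
}

// A toy network: 3 inputs -> 4 hidden neurons -> 2 outputs.
const hiddenW = [[0.2, -0.5, 0.1], [0.4, 0.3, -0.2], [-0.1, 0.6, 0.5], [0.3, -0.3, 0.2]];
const hiddenB = [0.1, 0.0, -0.1, 0.2];
const outputW = [[0.5, -0.4, 0.3, 0.1], [-0.2, 0.6, 0.2, -0.5]];
const outputB = [0.0, 0.1];

const input = [1.0, 0.5, -1.5];
const hidden = denseLayer(input, hiddenW, hiddenB, relu);
const output = denseLayer(hidden, outputW, outputB, (x) => x);
console.log(output); // the two output activations for this toy example

In a trained network, the weights and biases would come from the back-propagation procedure described below rather than being fixed by hand.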
[Figure 2.1: Artificial Neural Network, showing an input layer, hidden layers, and an output layer]

2.2.1 Inference and Training

In terms of their structure, neural networks are made up of multiple layers of neurons. An example of a neural network is shown in Figure 2.1. The first and last layers are referred to as the input and output layers, respectively. The remaining layers in between are the hidden layers. During the processing phase, referred to as inference, each neuron computes a linear combination of its inputs from the previous layer, applies an activation function such as sigmoid or ReLU [17], and passes its result to the subsequent layer.

The success of a network depends on its ability to generalize the task it is being trained to perform from the input data it is given. While the quality of training data has a significant impact on the success of training, certain neural network structures have been shown to perform better than others for certain tasks [18]. The training of neural networks involves a computationally-intensive step known as back-propagation. Back-propagation works by comparing the result of a neural network to the ground truth, in other words the value that is known to be correct. The difference in these values is sent backwards through the network and used to update the weights of the network in a manner which penalizes the neurons that had the largest impact on the error. After sufficient repetitions, the error can usually be minimized, resulting in a network that is useful for making inferences about data similar to what it was trained on.

2.2.2 Acceleration and Tradeoffs

In recent years, the availability of powerful GPUs has led to the growth of a machine learning field referred to as deep learning [19, 20, 21]. Deep learning involves the use of neural networks with many layers, much larger than were previously possible, due to the ability of GPUs to accelerate the training and use of artificial neural networks. An example of a class of such deep neural networks is known as convolutional neural networks (CNN). CNNs have become very common for computer vision related tasks, and various datasets, such as ImageNet [22] and PASCAL [23], have been released to the public to drive development. Some popular CNNs include AlexNet [24], ResNet [25], R-CNN [26, 27, 28], and YOLO [29, 30, 31].

Although high-end GPUs are fast at performing inference for neural networks, they are not available on all devices. For example, it would not be practical to have such a discrete GPU on a mobile device due to space and power constraints. To tackle this issue, there has been work on developing neural network architectures that can achieve reasonable performance on a mobile device [32, 33]. In general, various works have identified a tradeoff between the computational requirement and the accuracy of CNNs for computer vision tasks [3, 32]. By adjusting the size of a neural network, either the number of layers or the number of neurons within a layer, compute time can be exchanged for accuracy. Smaller networks generally execute faster than larger ones. The polymorphic architectures we explore in this thesis are motivated by this observation.

2.3 Knapsack Problems

In our work on utility-based scheduling algorithms for polymorphic applications presented in Chapter 5, we make use of a variation of a common binary decision problem known as the knapsack problem, or KP [34, 35]. We provide a brief introduction to the knapsack problem and some of its well-known variations here.
The basic knapsack problem is formulated as follows. There is a knapsack or bag with a certain total weight capacity c. There is a set of n items, each with certain weight and profit values, w_i and p_i, respectively. The goal of the problem is to fill the bag with a subset of the n items to maximize the total profit while making sure not to exceed the bag's capacity. Let x_i represent the binary decision variable, where x_i = 1 if item i is chosen and x_i = 0 otherwise. The knapsack problem can then be written as follows:

\begin{aligned}
\text{maximize} \quad & \sum_{i=1}^{n} p_i x_i \\
\text{subject to:} \quad & \sum_{i=1}^{n} w_i x_i \le c \\
& x_i \in \{0, 1\}, \quad i = 1, \ldots, n
\end{aligned} \qquad (2.1)

The knapsack problem can be applied in a variety of different contexts. For example, if the capacity constraint is associated with a resource, such as a processor or memory, then the knapsack formulation can be applied to job scheduling problems. In the following subsection, we introduce a few well-known variations of the knapsack problem and describe their formulations.

2.3.1 Common Variations

First, the multidimensional knapsack problem, or MKP, is a variation of the knapsack problem that takes into account two or more capacity constraints [36, 37]. To extend the bag example, a knapsack may not just be limited by the total weight it can hold but perhaps by volume as well, so finding the optimal solution to this modified problem is slightly more complicated. In the context of job scheduling, this formulation can be used to account for multiple resources such as processors, memory, disk space, etc.

Second, the multiple knapsack problem is an extension of the knapsack problem that allows for more than one knapsack. Instead of being limited to a single knapsack, there are multiple knapsacks that can be chosen from and used, each with their own, potentially unique, capacity constraint.

Finally, the multiple-choice knapsack problem is a variation that divides the set of items into categories. The added restriction for this variation is that at most one item may be chosen from each category [34].

2.4 The Publish-Subscribe Paradigm

The publish-subscribe communication paradigm lends itself well to the producer/consumer abstraction common in many IoT applications. Unlike the more common and intuitive request-response communication scheme, publish-subscribe messaging has the ability to decouple devices in both time and space [38]. Before going into more detail about publish-subscribe messaging, we first provide an explanation of request-response messaging to better explain the advantages of publish-subscribe for IoT systems.

2.4.1 Request-Response Messaging

The request-response communication scheme is what we use when we retrieve a webpage from the Internet. As the client, we send an HTTP message to a web server that is waiting for a request to arrive. The web server is associated with a port at some IP address, either already known or retrieved through a DNS lookup. This address is used to route the request. The server processes the request and sends back the appropriate response, at which point the interaction ends. For additional content, this process is repeated multiple times. If a client wants to check whether some content has changed, he or she would need to poll the server by sending more requests. This is a very inefficient method. We'll see in the following section how publish-subscribe messaging provides a better method for accomplishing this.
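To make the polling pattern described above concrete, the following is a small sketch of a client repeatedly asking a server for the latest value over HTTP; the endpoint URL and polling interval are hypothetical and used only for illustration.

// Request-response polling sketch (hypothetical endpoint). The client must keep
// asking the server for the current value even when nothing has changed, which
// wastes requests and adds latency between an update and its discovery.
async function pollTemperature() {
  const res = await fetch('http://sensors.example.com/home/tempF'); // hypothetical URL
  if (res.ok) {
    console.log('latest temperature:', await res.text());
  }
}

// Check for changes every 5 seconds, whether or not a new value was published.
setInterval(pollTemperature, 5000);

With publish-subscribe messaging, described next, the client instead registers its interest once and the broker pushes new values as they arrive.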
2.4.2 Publish-Subscribe Messaging

In contrast to request-response messaging, publish-subscribe messaging enables clients to specify interest in certain information. This interest is expressed through a subscription. When data relevant to that subscription is published, all subscribers will automatically receive an update containing the data. There are different variations of publish-subscribe [38], but in this work we will focus on the topic-based version, the type used by the MQTT protocol. Topics are names used to refer to certain data. We'll discuss MQTT further in Chapter 3.

Whereas a request-response system has clients and servers, a publish-subscribe system typically consists of publishers, subscribers, and a broker. The broker is an analogue to a server in this messaging scheme. Brokers are responsible for accepting and keeping track of subscriptions and relaying new data to the interested parties as it becomes available. Publishers are the source of data in this system. They send updates to the broker on the topics they are associated with. Subscribers, on the other hand, consume this information, specifically from the topics they have interest in.

2.5 Macroprogramming Frameworks

From the days of wireless sensor networks (WSNs), there has been an effort towards developing effective programming abstractions that has carried over into the IoT space. These efforts have resulted in various solutions ranging from new programming languages to macroprogramming frameworks. Here we discuss some of these works that bear similarity to our work on this topic.

The authors of T-Res [39] present a programming abstraction that facilitates in-network processing in IoT-based WSNs. In T-Res, the input, output, and processing components of Python tasks are decoupled and presented as network resources using the CoAP protocol [40], allowing them to be reconfigured dynamically. One key difference between T-Res and our work is that T-Res places the burden of managing asynchronicity of input data on the developer, who is required to maintain state between the executions of their tasks. A feature of our work is that developers can explicitly specify the constraints on the data they need and rely on the framework to trigger their macro only when these constraints are satisfied. This feature also enables users to access arbitrary amounts of historical data rather than just the last value.

PyoT [41] is a macroprogramming framework that, like T-Res, makes use of CoAP and Python. However, this framework assumes the existence of metadata on devices to enable searches and avoids the need to address devices directly. While this framework is designed for more interactive use, through a web interface that it provides, our work is focused on data-driven applications and doesn't expect much user involvement beyond its initial setup.

Flask [42] presents a language for data-driven sensor network applications. This language can be used to "wire" up data flow graphs, which consist of the operations that need to be performed to compute the desired output. A limitation of Flask is that communication and processing all get statically defined at compile time, leading to a strong coupling between data sources and sinks. It is also not possible to change the processing at runtime, as all participating devices would need to be reprogrammed.

There is a commercial product called PubNub that provides publish-subscribe messaging infrastructure as a service.
One of the features supported by this service is known as PubNub Functions [43], and it appears to implement a form of in-network processing. It allows users to manipulate their data in various ways as it is flowing through the network. These Functions are written in JavaScript. There are also third-party Functions, referred to as BLOCKS, that can be used as well. Since PubNub Functions is a proprietary product, it is difficult to speculate on what is going on behind the scenes, but it demonstrates that there is a real demand for this type of functionality.

Node-RED [44] is a web-based data flow tool that enables users to quickly create interactions among supported devices and web-based services through a visual editor. While it is built on similar technology, i.e. Node.js, Node-RED is not focused on publish-subscribe messaging and does not tackle the privacy issues we address in our work. Node-RED's Flows also cannot be dynamically reconfigured at runtime, a feature we provide in our work described in Chapter 3.

With regards to publish-subscribe messaging, there are a variety of software products that provide such capability. One such product is Apache Kafka [45], an open-source stream processing platform. While Kafka provides publish-subscribe semantics, it is more of an enterprise-grade messaging queue than it is a replacement for a lightweight protocol such as MQTT. Kafka is suited for operation in data center clouds and not designed for use at the network edge or in otherwise constrained environments, which we target in our work.

ZeroMQ [46] is a high-performance messaging library that supports a variety of messaging patterns. While ZeroMQ supports a pattern similar to publish-subscribe, its implementation does not involve a broker and so it loses out on some of the benefits of having one. For example, messages from a publisher are filtered at the subscriber since there is no broker to handle this task. This potentially results in unnecessary network traffic as all messages are broadcast to all subscribers.

2.6 Privacy and Access Control

Security is an important aspect of distributed systems [5]. Given that a distributed system can consist of tens, hundreds, or even thousands of devices, it is necessary to ensure that data is protected as it travels across a network, both from eavesdropping and tampering. While this level of data protection is typically provided by use of a secure channel at the network layer using technologies such as TLS [47] or SSL [48], it does not manage which users have access and how they interact with system processes.

Frameworks such as OpenIAM [49] and OAuth [50] make it easy to manage users and access controls in a way that is scalable to large distributed systems. OAuth, for example, implements an authorization protocol that allows authorization to be checked across different web services without the need for sharing user login credentials. It accomplishes this by generating access tokens which can be used by different services as proof of authorization.

User management can be further simplified by the use of role-based access controls (RBAC) [51]. RBAC creates a hierarchy where users are assigned roles and the roles are assigned different access privileges. This can greatly reduce the complexity and cost of access control administration.

2.6.1 Access Control in Publish-Subscribe Messaging

One of the advantages of publish-subscribe messaging is that it simplifies communication by decoupling publishers from subscribers.
A publisher can transmit data without having to worry about who is receiving it. However, there are some application scenarios where this data contains sensitive information or is being sold for profit. This creates a conflict for traditional publish-subscribe systems, which can't accommodate such situations. In light of this, some prior works have explored how access controls can be incorporated into publish-subscribe brokers [52, 53]. In our work on a publish-process-subscribe framework described in Chapter 3, we also explore how access controls could apply to the results of computations over publish-subscribe streams.

Chapter 3: Publish-Process-Subscribe Framework

(This chapter describes work that appears, in part, in the work by Wright et al. [54].)

As IoT devices continue to be adopted and their applications grow, there has been an increasingly diverse group of developers engaging with them. These developers have access to a variety of tools which remove the need for them to have extensive training or a background in hardware or software [42, 55]. In addition to the low cost of IoT devices, we believe this level of accessibility to people from a variety of backgrounds has contributed to the widespread adoption of such devices. People are now able to interact with and customize the experiences they have with their environments in ways never thought possible.

IoT devices have become much more powerful than the small embedded devices, or motes, typical of wireless sensor networks. Some IoT devices feature multi-core processors, embedded GPUs, large amounts of RAM, etc. While the hardware has improved, the fundamental limitations of the wireless medium over which these devices communicate have not changed. There is still great value in reducing the communication overhead of an application so that it can operate as efficiently as possible. To that end, tools and frameworks that help developers create efficient applications from the start are valuable.

A paradigm that has been gaining in prominence to support the rapid development and deployment of real-time IoT applications is publish-subscribe, implemented in protocols such as MQTT [56]. The benefit of the publish-subscribe approach is that it allows for fast and robust implementation of real-time many-to-many communications. It allows for asynchrony and is forgiving when faced with lossy and dynamic connectivity. However, the basic publish-subscribe paradigm doesn't provide mechanisms to enable efficient in-network computation that could be used to reduce bandwidth utilization and improve privacy.

To meet some of these major challenges associated with IoT systems, we advocate for an extension of the traditional publish-subscribe approach that we refer to as publish-process-subscribe, which allows for the en-route processing of data. There are several advantages and applications of the publish-process-subscribe paradigm:

- Sensor Data Analytics, Fusion and Aggregation: By allowing for en-route computation, data flowing from one or more publishers can be combined and processed together using data-analytics algorithms such as estimation, prediction, and other machine learning algorithms to provide a more meaningful stream of refined, analyzed data for a subscriber.

- Bandwidth, Latency, Energy Improvement: Related to the above point, by processing raw data within the network, the total amount of data that is streamed can be reduced, improving bandwidth utilization.
  By performing data computations at a more powerful server en-route rather than at compute-constrained and energy-constrained end points, the latency and energy expenditure associated with data computation could be reduced.

- Privacy: A raw data stream from one publisher could be processed through an anonymizing filter. Now, access controls could be set up via the broker so that certain authorized subscribers can have access to the raw data, while others are provided access only to the anonymized version.

- Computation for Automated Control: Computation specified over streaming published data could also be used to generate processed streams intended to control particular actuators. This could be useful when there is limited computation at the client side.

- Virtual Sensors: To simplify application design, data from physical sensors may first be transformed into a virtual sensor. This would allow for changing and upgrading physical sensors in a deployment over time while providing the same abstraction to higher layer applications.

In this chapter we describe an implementation of the publish-process-subscribe paradigm in the form of Noctua, a publish-subscribe broker that provides for flexible computation so that a client can subscribe to a processed version of published raw data from one or more publishing devices. Noctua is written in JavaScript and powered by Node.js. We demonstrate that this system enables the efficient use of network resources for a wide range of IoT applications. We also show how Noctua facilitates the automated implementation of a role-based access control mechanism to provide security and privacy for real-time IoT streams.

3.1 Message Queue Telemetry Transport

MQTT, or Message Queue Telemetry Transport, is an application-layer protocol for publish-subscribe messaging [56]. It is simple and lightweight, making it a popular choice for IoT applications. MQTT uses TCP as its transport layer protocol, but there is a variant, MQTT-SN (previously MQTT-S) [57], designed to run over UDP. Figure 3.1 provides an illustration of how data flows through an MQTT broker.

In MQTT, topics are specified as strings and provide a way for applications to refer to data they're interested in. An example of a topic is "car/temp". Slashes have a special meaning in MQTT. They are referred to as topic level separators and are used to specify a hierarchical structure to the data. This is taken into consideration when a subscriber uses a wildcard in their subscription. So for example, if a subscriber subscribes to "car/#", they will receive any data published to topics beginning with "car/", such as "car/temp" and "car/speed".

[Figure 3.1: Example data flow through an MQTT broker, with publishers on topics such as car/speed, car/temp, and car/fuel, and subscribers such as a dashboard (subscribed to car/temp and car/speed) and an AC unit (subscribed to car/temp)]

MQTT defines about a dozen message types. A few of particular interest are CONNECT, SUBSCRIBE, and PUBLISH. CONNECT messages are used when clients first establish a connection to an MQTT broker. If authentication is enforced, clients will be required to provide their login credentials in this message. The SUBSCRIBE message is used by clients to tell the broker which topics they are interested in. Finally, PUBLISH messages are used by clients, in particular data sources, to send updates to the relevant topics.

There are three QoS, or Quality of Service, levels supported by MQTT. These QoS levels, namely 0, 1, and 2, provide different end-to-end message delivery guarantees.
QoS 0 ensures that a receiver gets a published message at most once; QoS 1 ensures at least once; and QoS 2 guarantees exactly once. These QoS levels require different amounts of traffic to satisfy their guarantees, with QoS 0 using the least, just a single MQTT message. QoS 1 requires a minimum of two messages while QoS 2 requires a minimum of four messages.

[Figure 3.2: Noctua Architecture, consisting of an MQTT Broker, a Database (MongoDB), and a Compute Engine]

3.2 Noctua System Design

Noctua is essentially a publish-subscribe broker that has been augmented with computational capabilities. This computational component provides IoT software developers with a framework that makes it easier to develop more efficient applications. In the following subsections, we will provide an overview of the Noctua architecture, discuss its implementation, and show a simple example of how it can be used.

3.2.1 Architecture

At the core of Noctua is the Broker, as shown in Figure 3.2. The Broker is responsible for relaying messages to their subscribers, so by default it has access to all of the data that may be relevant to an application. The Compute Engine uses this data to process any macros that are registered with it. We use the term macro to refer to the computation that an application is asking Noctua to perform on its behalf. A macro consists of one or more references to the data it depends on as well as some code representing the calculations to be performed. Macros can be created and even updated anytime during the operation of the system without disruption. An example macro will be shown in the following subsection. Macros are provided with an expressive syntax for accessing current and past values of data. To support this capability, Noctua incorporates a database which is used to store historical values of data that passes through it.

3.2.2 Implementation

Noctua is written in JavaScript and powered by Node.js [58], a server-side runtime environment built on Google's V8 JavaScript engine. Node.js is designed with an event-driven architecture that is highly optimized for network applications. This makes it well suited for the task at hand, as operations in Noctua are triggered when data passes through it. The Node.js plugin aedes provides Noctua with the framework for an MQTT broker. We use the HTTP protocol to handle macro operations, such as creation, updates, and deletion. MongoDB [59] is used as the database to store published values for future reference.

3.2.3 Macros

3.2.3.1 Language

Rather than create a new language, macros for Noctua are written in JavaScript, a scripting language that is already well known [60, 61, 62]. Research has shown that the software engineering community has adopted JavaScript as one of the primary languages for web programming [63], a domain closely related to our work. The full JavaScript language is supported, so all of the typical methods for flow control are available, e.g. if statements and for loops. We expose a Noctua-specific JavaScript Object to the runtime context of the macro such that it is able to pull in data that has already been published to the broker. We refer to this object as a topic reference. This data can then be treated like any other value in the code. The use of this object also allows Noctua to automatically determine a macro's dependencies, requiring no additional effort from the application developer. We consider a macro to be activated when a value arrives for one of its dependencies. A macro is only triggered when all of its dependencies are satisfied.
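The text above does not spell out how the Compute Engine determines a macro's dependencies from its topic references. One possible mechanism, shown below as a sketch and not necessarily how Noctua actually implements it, is to evaluate the macro once with a stub Noctua object whose topic() method records every topic name it is asked for.

// Illustrative sketch of dependency discovery for a macro (an assumed
// mechanism, not taken from the thesis): run the macro with a stub Noctua
// object and record the topics it references.
function discoverDependencies(macroCode) {
  const topics = new Set();
  const stub = {
    topic(name, settings) {
      topics.add(name);
      return 0; // dummy value; only the topic names matter here
    },
  };
  try {
    // Evaluate the macro code as an expression with the stub in scope.
    new Function('Noctua', 'return (' + macroCode + ');')(stub);
  } catch (err) {
    // Ignore runtime errors; the goal is only to collect topic names.
  }
  return [...topics];
}

// Example with the temperature-conversion macro of Figure 3.3:
console.log(discoverDependencies("(Noctua.topic('home/tempF') - 32) * 5/9"));
// -> [ 'home/tempF' ]

A dry run like this would miss topic references hidden behind untaken branches, so a production implementation would need something more robust, but it illustrates why no extra effort is required from the developer to declare dependencies.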
Figure 3.3 shows a simple example of a macro that converts temperature data from Fahrenheit to Celsius. This macro takes the most recent value published to the MQTT topic 'home/tempF' and performs some arithmetic operations on it to convert from one unit to another. The macro is triggered whenever a new value arrives on the 'home/tempF' topic. The computed value will then be published to a topic that corresponds to the name of the macro, which can be seen in Figure 3.4. In this case, the result will be published to the topic "Noctua/home/tempC". By default, the "Noctua" topic level is reserved for use by the broker and is prepended to the name of the macros to determine their associated topic for publishing.

(Noctua.topic('home/tempF') - 32) * 5/9

Figure 3.3: A macro for temperature conversion

{
    "name": "home/tempC",
    "code": "(Noctua.topic('tempF') - 32) * 5/9"
}

Figure 3.4: JSON object for HTTP POST request to create macro

The JSON object shown in Figure 3.4 shows the information needed to register a new macro with Noctua over HTTP. If a name is specified, it must be unique; otherwise a hash value will automatically be generated for the macro and returned. The code field is the only field that is required, and it must contain valid JavaScript code.

3.2.3.2 Topic Reference Settings

Noctua supports an optional settings argument that can be specified in a topic reference. The options are listed in Table 3.1. We will first explain what the options are and then provide an illustrative example to show why this feature is important.

Table 3.1: Topic reference settings

    Option     Default   Description
    cutoff     0         Time in seconds that values are acceptable for
    required   1         Minimum number of values to retrieve
    limit      1         Maximum number of values to retrieve

The cutoff option can be thought of as the amount of time a published value is acceptable for (with respect to a particular macro). So for example, if a cutoff value of 5 is specified, only a value received in the last 5 seconds of when the macro is triggered will be returned by Noctua. This option can be applied to ensure that values used in calculations are either recent or, as we'll see later on, temporally correlated.

The required and limit options provide application developers with access to the historical values of a topic. The required option indicates the minimum amount of history needed by a macro for a topic while the limit option specifies the maximum. As an example, if the required option is set to 5, then an array of the 5 most recent values of a topic will be returned, assuming at least that many are available (otherwise the macro will not be triggered). If the required option is set to 5 and the limit option is set to 10, then an array of anywhere from 5 to 10 values may be returned. A macro may then iterate over those values to perform its calculation.

function() {
    var t1 = Noctua.topic('home/front_temp', { cutoff: 60 })
    var t2 = Noctua.topic('home/rear_temp', { cutoff: 60 })
    return (t1+t2)/2
}()

Figure 3.5: A macro for averaging two temperatures

Figure 3.5 shows an example of a macro making use of the cutoff option. With a cutoff value of 60, the macro is specifying that only values received within the last minute should be considered. If the last temperature value for one of those topics arrived more than 60 seconds prior to when the macro is activated, then the macro will not be triggered and no value will be published. The use of the cutoff option in this case provides two benefits.
First, applying a cutoff prevents data used and published by our macro from being stale. For example, if the last temperature value published to a topic is from yesterday (perhaps a sensor has a long periodicity or lost connectivity), then that value may no longer be relevant to the application. Second, since the macro is applying a cutoff for all of its topics, the macro can ensure that the values are temporally correlated. Due to the decoupled nature of publish-subscribe systems, the last values published to two different topics may have arrived at wildly different times. The use of the cutoff, in this case, restricts the values used in this macro to have arrived within 60 seconds of each other.

3.2.4 Privacy

Noctua extends the purpose of the credentials provided in the MQTT CONNECT message. By default, MQTT only uses the credentials to authorize a connection to the broker. Once a user connects, he or she will have access to every topic and is free to subscribe and/or publish to them. Noctua takes the credentials a step further and enables an administrator to specify permissions on a per-topic basis. This allows for more fine-grained control over the data a user can and can't see. This level of granularity makes Noctua well suited for heterogeneous IoT systems where many devices may be sharing the same broker but don't necessarily have the same level of trust.

[Figure 3.6: Privacy protection, in which an application is denied direct access to the GPS A and GPS B topics but is allowed access to the computed Distance(A, B)]

There is another feature that is gained from this high level of access control when it is combined with Noctua's computational capabilities. We illustrate this with an example. Imagine there is some particularly sensitive data being published to the broker, such as a person's GPS location. That person may not want others to know where he or she is, so access to that topic may be severely restricted. However, if there is an application that is not directly interested in that person's location, but rather the distance between that person and someone else, then that use case may be acceptable. That application could be restricted from directly accessing the person's location topic, but may be given permission to access the result it needs through Noctua's macro capability, as depicted in Figure 3.6.

Essentially, what this approach to privacy allows is indirect access to sensitive information in some aggregated form. This protects sensitive information without severely limiting flexibility in application development. The kind of aggregation or filtering needed to make data anonymous is left to the discretion of the data owner/administrator, as it is expected to be application specific.

Going beyond simple allow/deny permissions, Noctua can also automate the assignment of the "right stream of data" to each user based on their access role, which we refer to as role-based publishing. We describe this general framework in the following section.

[Figure 3.7: Role-Based Publishing with Noctua]

3.3 Role-based Publishing

Noctua's ability to process real-time data streams can be used to provide differently processed streams to different subscribing users, as a function of their role, greatly facilitating the use of role-based access control for IoT applications. We refer to this feature of Noctua as role-based publishing. At the outset, we note that role-based publishing is an optional functionality in Noctua that can be activated and instantiated for each topic. Figure 3.7 shows how Noctua's role-based publishing works.
The role-based authorization service could be implemented on the same system as Noctua or it could be an external authorization/role-based access control server (e.g. a service built using OpenIAM [49]| or OAuth [50]). The service is able to provide Noctua with the role associated with a given user. The data owner can upload one or more macros to Noctua to process the raw data stream, and also provides a role-based publishing specication (RBPS) that Noctua can use to determine which macros/processed streams can be accessed by which user (once the user's role has been determined). The RBPS may specify multiple streams that a role is allowed to access, but must also specify one of these streams as the default. When a user subscribes to an original topic, Noctua makes a call to the role- based authorization service to determine that user's role, then sends data to the user corresponding to the default processed stream that is specied for the role, by eectively subscribing the user to the permitted topic (original or processed). Some roles may even be denied access to any data from the stream. If a role is allowed access to multiple streams, the user may further directly subscribe to other topics corresponding to any of the permitted streams. Figure 3.7 illustrates a possible ow for a given topic and user: 35 1. The owner of data for topic t provides a macro called \t/anonymize" for anonymizing that data stream. This could be implemented, for example by adding noise to the data, removing certain labels, or applying a threshold to generate a coarse-grained version of the data. The owner also uploads to the Noctua server (as a JSON object posted using HTTP) a role-based publishing specication (RBPS) for this topic. This is a table, as shown in the bottom left of the Figure 3.7, that species which data streams each possible role is permitted to access, and for each role also species a default stream. 2. User 2 sends a subscription request for topic t. 3. Noctua queries the authorization service about User 2. The authorization service uses the user-role matrix to determine this user's role. 4. The service informs Noctua that the role for User 2 is \customer" 5. The data owner publishes data to topic t on a streaming basis. 6. The data from the publisher is available as topic t on Noctua and also pro- cessed into a new macro-based topic, called \Noctua/t/anonymized" (using the macro provided by the data owner in Step 1). 7. Since the second row of the RBPS for topic t species that the default stream for customers is anonymized, User 2 will receive data on the topic \Noctu- a/t/anonymized" from Noctua whenever it is available. 36 In this example, we see that User 2's subscription request for topic t is essen- tially automatically translated by Noctua to another stream of anonymized data. User 1 would get access to the raw stream by default due to his or her role as a \developer." However, User 1 can also directly subscribe to and receive data on \Noctua/t/anonymized" because the \'developer" role is permitted access to the anonymized stream as well. If a user with a role that is not authorized to ac- cess either the raw or anonymized streams subscribes, it is denied the subscription entirely (and cannot access \Noctua/t/anonymized" directly either). 3.4 Evaluation We have devised a set of experiments to analyze the capabilities and performance of Noctua as compared to traditional methods of implementation of IoT systems. 
In our first experiment, we evaluate Noctua on the campus-wide CCI IoT Testbed currently under development at the University of Southern California [64]. Next we apply Noctua to localize a person walking in an indoor environment. Finally, we take a look at the implications of Noctua's privacy features.

3.4.1 Hardware

One of the CCI Testbed nodes is shown in Figure 3.8. At the core of each node is a Raspberry Pi 3 computer [65], which handles data collection from all locally attached analog and digital Grove sensors [66]. Each node is equipped with several sensors, including temperature, humidity, light, noise, and a variety of gas sensors as shown in the figure. The nodes are connected to USC's campus network through WiFi.

Figure 3.8: Various sensors on a CCI Testbed node

3.4.2 Weighted Moving Average

In this experiment, we look at creating a weighted moving average of data from three CCI testbed nodes, specifically their temperature readings. A moving average is a simple technique for smoothing out time series data. Let f(t) represent the temperature at some time index t. For each node, we seek to perform the following computation, shown in Equation 3.1, over its past three values. This moving average gives more weight to more recent values.

WMA = (3/6) f(t) + (2/6) f(t-1) + (1/6) f(t-2)    (3.1)

The testbed is configured such that each node is publishing its temperature on an individual MQTT topic, one based on its hostname. The nodes are configured to transmit their temperature value once every second. QoS 0 is used for all messages so that we can determine a lower bound on the messages required. Once we've obtained the weighted moving average for each topic, we then average the three resulting values together using equal weights. Figure 3.9 shows the macro used for this experiment.

As shown in Figure 3.9, we are using the required option to specify that we need the last three values for each topic we've referenced. It should be noted that whenever the required or limit option specified is greater than 1, the value returned for a topic reference will be in array form. We are therefore able to iterate over the elements as we have done in the macro.

For comparison purposes, we've implemented this moving average calculation in three different ways, all based on publish-subscribe messaging. We refer to these implementations as: a) local processing, where the subscriber itself performs the calculation; b) application service, where a single standalone service performs the calculation on behalf of any subscribers; and c) Noctua, where the broker itself performs the calculation. The topologies for these implementations are shown in Figure 3.12.

3.4.2.1 Message Complexity

Figure 3.12 shows the message complexity in terms of the number of MQTT PUBLISH messages required between devices until the first output of the calculation is available, assuming the system starts from scratch (no topic history). In all cases, we need at least three values from each sensor, resulting in at least nine PUBLISH messages before any calculation can take place, regardless of the implementation. For the first case (Figure 3.12a), since the subscriber performs the calculation itself, it needs all nine messages forwarded to it, resulting in a total of 18 messages, as shown in Figure 3.10. Second, in Figure 3.12b, the application service needs to receive all of the sensor data. The service then publishes the result, which gets forwarded to the subscriber. This results in a total of 20 messages.
Lastly, we can see the results for the Noctua implementation in Figure 3.12c. The calculation is performed inside the broker itself, so only a single messsage, the result, needs to be transmitted to the subscriber. This results in a total of 10 messages, which is also the theoretical minimum for a broker-based system. 3.4.2.2 Delay To determine the overhead associated with using the Noctua framework, we mea- sured the end-to-end delay of the three implementations. Specically, we measured the amount of time it takes to get a calculated result once the rst MQTT PUBLISH message is sent. For each measurement, all three nodes were triggered to start 40 function(){ var temps = []; temps.push(Noctua.topic('ee499_7/temp', { required: 3 })); temps.push(Noctua.topic('ee499_8/temp', { required: 3 })); temps.push(Noctua.topic('ee499_9/temp', { required: 3 })); var weights = [3, 2, 1]; var denom = weights.reduce((a, b) => a + b, 0); var avg = 0; for (var i = 0; i < temps.length; ++i) { var weighted_sum = 0; for (var j = 0; j < weights.length; ++j) { weighted_sum += temps[i][j] * weights[j]; } avg += weighted_sum/denom; } return avg/temps.length; }() Figure 3.9: Macro for a weighted moving average Figure 3.10: Minimuim MQTT PUBLISH messages required until rst result 41 Figure 3.11: Delay between rst PUBLISH message and rst result sending their values simultaneously so that the results are comparable. Figure 3.11 shows the outcome of this experiment. The results are the averaged over a 100 iterations for each implementation. The maximum dierence between the delays of the implementations is small, < 0:2s. The local implementation performed the best with an average delay of 5:36s, followed closely by Noctua at 5:47s. With a delay of 5:54s, the application service took the longest, which makes sense as there is an additional link traversal involved with that implementation relative to the others, as can be seen in Figure 3.12b. For this moving average example, we can see that there is no signicant cost to using Noctua. 42 Broker Sensor Sensor Sensor 3 3 3 9 Message Complexity: Local Subscriber (a) Local processing Broker Sensor Sensor Sensor 3 3 3 1 Message Complexity: Application Service Subscriber Application Service 10 (9,1) (b) Processing through application service Broker Sensor Sensor Sensor 3 3 3 1 Message Complexity: Noctua Subscriber (c) Processing through Noctua Figure 3.12: Various implementations of a three-sensor system 43 3.4.2.3 Scalability To get a better sense of how Noctua may perform for dierent applications, we investigate its scalability in terms of message complexity. In Figures 3.13 and 3.14, we calculated the number of PUBLISH messages required for a varying number of sensors and a varying number of subscribers, respectively, for the dierent imple- mentations. For Figure 3.13, we imagine that the moving average example was extended to cover an increasing number of sensors. As before, we still require three values from each sensor for the formula shown in Equation 3.1. In this gure we assume there is still a single subscriber. We can see that the message growth rate is linear for all three implementations, but that the Noctua implementation grows much more slowly than the other two, demonstrating Noctua's superiority in terms of scalability. In Figure 3.14, we take a look at the opposite case. Instead of varying the number of sensors, we vary the number of subscribers interested in the output of the moving average. The number of sensors is xed at 50. 
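The counts behind these plots can be reproduced with a short calculation. The sketch below tabulates the PUBLISH messages each implementation needs until every subscriber receives one result, assuming QoS 0, the three-value history used above, and one result message per subscriber; the function and parameter names are ours, introduced only for illustration.

function publishCount(impl, sensors, subscribers, history = 3) {
  const toBroker = sensors * history;             // sensor -> broker
  if (impl === 'local')                           // broker forwards every raw value to every subscriber
    return toBroker + toBroker * subscribers;
  if (impl === 'service')                         // raw values forwarded to the service, one result fanned out
    return toBroker + toBroker + 1 + subscribers;
  return toBroker + subscribers;                  // Noctua: the broker computes the result itself
}

// Three sensors, one subscriber (Figure 3.10): 18, 20 and 10 messages.
console.log(['local', 'service', 'noctua'].map(i => publishCount(i, 3, 1)));

// Fifty sensors, varying numbers of subscribers (Figure 3.14).
[1, 10, 20].forEach(n =>
  console.log(n, ['local', 'service', 'noctua'].map(i => publishCount(i, 50, n))));

With 50 sensors, each extra subscriber costs the local implementation 150 additional forwarded messages, but only one additional message under either of the other two implementations.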
Here we can see that the local implementation performs much worse than the other two as we would expect. It is simply not practical to send every published message to every subscriber. The application service and Noctua implementations have an almost horizontal curve, although they are growing. Once again Noctua is able to achieve the best performance in terms of PUBLISH messages. 44 Figure 3.13: MQTT PUBLISH messages required for varying number of sensors Figure 3.14: MQTT PUBLISH messages required for varying number of subscribers 45 3.4.3 Application Example: Localization as a Service To demonstrate that Noctua is robust and can support applications involving cal- culations that are much more complicated than a moving average, we created a macro that can process RSSI values and make predictions as to where a person is located. The localization macro is shown in Figure 3.15. The idea is to have a user device publish its RSSI readings from multiple beacons to the broker, and have the Noctua broker use the macro to estimate the location and send back the estimated location stream back to the user. This demonstrates how Noctua can be used to rapidly build and deploy \localization as a service" (and by extension, many other similar applications where data analytics or machine learning algorithms can be used to rene and transform raw data into more meaningful insights in real-time). To show the macro functioning and evaluate its computational cost, we perform experiments with simulated data. In this experiment, we simulate a person walking through an indoor environment with a wireless device. This device measures the received signal strength, or RSSI, from multiple beacons located within the space. The blue line in Figure 3.16 shows the path that this person takes, while the hollow diamonds indicate the location of the beacons. We apply the log-distance path loss radio propagation model [67] to simulate the signal strengths the person would receive as they move about the space. This model is shown in Equation 3.2. P RX andP TX represent the received and transmitted signal powers in dBm, respectively; 46 K represents the path loss in dB at the reference distanced 0 ; andX g is a zero mean Gaussian random variable that represents fading. P RX =P TX K + 10 log 10 d d 0 +X g (3.2) For localization we apply the well-known maximum-likelihood (ML) estimation technique [68]. Our implementation of this technique evaluates the probability of observing the received RSSI values at dierent locations within the space and chooses the location with the highest probability as the prediction. We use the center point of each square in the grid as our search space. Figure 3.17 shows the predictions made by Noctua. As the gure shows, it is generally the case that using more beacons improves the prediction accuracy. Figure 3.18 shows the execution time of the Noctua macro when it uses a varying number of beacons in its calculation. The execution times are averaged over 100 iterations. 3.4.4 Impact of Role-based Publishing We next consider the performance impact of role-based publishing in Noctua, when activated for a topic. When a user makes a subscription to a topic that has a role- based publishing specication (RBPS) associated with it, Noctua incurs additional processing and communication time before data can be sent on that topic. 
This pertains to the communication and processing needed to contact and hear back from the authorization server (which may be external to Noctua) and the processing 47 function () { //omitted: definitions of beacons, ETA, SIG, K, TX var readings = Noctua.topic('person/rssi') result = new Array(readings.length) for (var i=0; i < readings.length; ++i) { result[i] = new Array(10); for (var j=0; j < 10; ++j) { result[i][j] = new Array(10) } } for (var b=0; b < readings.length; ++b) { rx = readings[b] for (var j=0; j < 10; ++j) { for (var i=0; i < 10; ++i) { y = 5 + 10*j x = 5 + 10*i d = Math.sqrt(Math.pow(beacons[b][0] - x, 2) + Math.pow(beacons[b][1] - y, 2)) fade = TX - rx - K - 10*ETA * Math.log10(d) pdf = (1/Math.sqrt(2 * Math.PI * Math.pow(SIG, 2))) * Math.exp(-Math.pow( fade, 2)/(2 * Math.pow(SIG, 2))) result[b][j][i] = pdf } } } max = 0 max_ind = [] for (var j=0; j < 10; ++j) { for (var i=0; i < 10; ++i) { combined = 1 for (var b=0; b < readings.length; ++b){ combined *= result[b][j][i] } if (combined > max) { max = combined max_ind = [i, j] } } } max_pos = [5 + 10*max_ind[0], 5 + 10*max_ind[1]] return max_pos }() Figure 3.15: Macro for localization using maximum-likelihood estimation 48 Figure 3.16: A person's path through an indoor environment incurred to determine and set up publication from the default data stream for the role that the given user corresponds to. During this additional time before the publication stream is set up, however, it is possible that the publisher sent data items to the broker that Noctua doesn't deliver to the user. We present a brief mathematical analysis of the expected number of lost data items due to the latency associated with role-based publishing. We model the role- based publishing set up latency, the sum of the query latency for user roles plus the local processing incurred to determine and set up the default topic for the user, as being a random variableT RBP that is exponentially distributed with mean (). We also assume that the data for the topic stream (raw or processed) is periodically sent at a deterministic frequency of times per second. Then the number of lost 49 Figure 3.17: Localization using maximum-likelihood (ML) estimation using macro on Noctua packets L =bT RBP c 1. It can be shown that L + 1 is a geometric random variable, with success parameter p = 1e 1 , and hence L has the following expected value: E[L] = 1 1e 1 1 (3.3) This is shown numerically in Figure 3.19. It can be seen that the data loss increases essentially linearly in the product of and , which can be signicant under some circumstances. Such data loss occurs to a signicant extent only if a) a role-based publication specication is provided for a data stream, b) querying for 50 Figure 3.18: Average calculation times for location prediction for macro on Noctua Figure 3.19: Average data loss due to setup latency associated with role-based publishing on Noctua 51 user roles incurs non-trivial latency (perhaps because it involves calls to a cloud- based server), and c) the published data stream has a very high frequency. 3.5 Conclusion This chapter presented Noctua, a framework enabling a new messaging paradigm we refer to as publish-process-subscribe. This paradigm addresses the observation that many IoT applications actuate on processed forms of data rather than just the raw data itself. This leads to a waste of network resources as raw data is shipped across the network unnecessarily. 
Noctua provides a mechanism, which we refer to as macros, by which application developers can address this issue without being burdened with managing low-level communication details. In summary, the goals of Noctua are to ease application development, while reducing network congestion, improving network lifetime, and protecting data pri- vacy. Noctua accomplishes these goals through the use of macros and exible access controls. Macros are portions of JavaScipt code that are ooaded to the Noctua broker. We have demonstrated that these macros are simultaneously eective at reducing the computational strain on edge devices and improving network conges- tion. And we have discussed how topic-level role-based permissions and role-based privacy-oriented real time data processing allow Noctua to support a diverse set of applications. 52 Chapter 4 Polymorphic Stream Processing Vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communication tech- nology promises to improve safety and reduce congestion on the road [70]. As these technologies become a reality, we believe there will be a wealth of information avail- able from various static and mobile sensing sources including video cameras and other sensor feeds (such as induction-loop or ultrasound based trac sensors), that could be used to assist drivers. However, presenting such streams of sensor data directly to a driver in a raw format would be overwhelming and impractical. What is needed is a vehicle perception augmentation system that employs machine-learning based tools to extract useful details from cameras and other sen- sors, for instance giving a driver just the count of vehicles on the road or a notica- tion of a lane blockage beyond visual range. However, a single vehicle may not be able to handle processing of all the incoming sensor data on its own. This is because This chapter describes work that appears (in part) in the work by Wright et al. [69]. 53 machine learning tools such as convolutional neural networks (CNNs), popular for object recognition and scene understanding in video streams, are computationally intensive and known for pushing traditional computing systems to their limits. In order to meet time constraints, such systems often employ specialized hard- ware, such as GPUs, or ooad to the cloud. However, static ooading schemes are not well suited for vehicular applications. Such schemes are susceptible to changes in resource availability, due to variations in wireless link quality as a vehicle drives around, and thus they will often fail to perform as desired, for instance incurring too high of a latency. Moreover, in this context, unlike traditional mobile applica- tions (where typically only two computation points are considered: on-mobile or in-cloud), there are many points where computation could take place, as depicted by the loops in Figure 4.1: on the sensor platform itself, (1) on a processor embedded in the vehicle, (2) additional computational devices brought in by drivers/passen- gers, (3) road-side units (RSU), or (4) in the cloud. These points generally provide a tradeo between latency and computational capability and availability, with the more computationally capable nodes (farther from the source of the sensor data) incurring a higher latency and relatively lower availability. A robust system for vehicle perception augmentation needs to constantly adapt the ooading decisions to eectively utilize all the available processing and communication resources that may be available. 
54 1 2 3 4 Increasing Latency Figure 4.1: Potential processing resources available to a car We propose a real-time vehicle sensing and perception augmentation system (VESPER) that incorporates dynamic ooading and polymorphic computing. VES- PER addresses the following critical issues faced while implementing a real-time vehicle perception augmentation system based on sensor data: Workload Adaptability: VESPER changes its processing pipeline based on the available computing resources and their link qualities. Intermittent Connectivity: External computing devices such as the cloud may not always be available to a car, and can come and go frequently. VES- PER is adaptive to the changes in the device connectivity. Scalability: VESPER is capable of handling multiple computing devices (in- cluding on the sensor platform, on car, roadside, and in the cloud) seamlessly. 55 In this chapter we dene the ooading problem we are trying to address in a vehicular context, introduce an application to help motivate this discussion, and identify the metrics by which the system is evaluated. 4.1 Data Deluge Modern vehicles are equipped with a wide variety of sensors which generate a wealth of information. While these vehicles may be provisioned to handle their locally generated data, the introduction of external sensor data will place an unpredictable strain on these vehicles' computational resources as this information needs to be processed in order to infer useful details about the vehicle and its environment. To stay relevant, this data needs to be processed in real-time. We imagine a setting where vehicles are wirelessly connected to external computing resources, such as a road-side unit. These connections are intermittent by nature and we have designed VESPER to be adaptive so that it may leverage these resources when possible. 4.2 Driver Perception Augmentation Application Currently, drivers make most of their driving decisions based on what can been seen directly through their windshields. These decisions may not be optimal and could be potentially improved if the driver is provided with a look-ahead into the future road and trac conditions [71]. As described in [72], if information about 56 the presence of a slow moving truck is available to cars in that lane in advance, they could start lane changing and distributing the trac across other lanes early, providing for a smoother ow of trac. In mountainous areas, if the information about oncoming trac or a land slide is conveyed to the drivers in advance, they can make better driving decisions. Augmenting a driver's perception by providing a look-ahead will improve the safety of passengers and can also help avoid trac jams. With various cameras and other forms of trac monitoring sensors installed across a city's infrastructure, the driver can receive a variety of data streams that can assist with driving. In this work, we consider a specic application where a drone is streaming im- ages of the road ahead to a car to provide the driver with trac information. The raw images are not particularly useful, and in fact can be harmful, if the driver needs to divert his/her attention to interpreting them. The system, therefore, re- quires the implementation of an image processing pipeline for intelligent detection of the vehicles on the road. In order for the system to be helpful, the results of this pipeline need to be computed and delivered to the driver in real-time. 
This real-time constraint necessitates that the system be capable of adapting to the dy- namic availability of computational resources and wireless link quality. If there are multiple pipelines available for image processing, the system also needs to dynam- ically select a pipeline depending on its computational requirements. The system 57 must be capable of intelligently ooading the pipeline so as to satisfy various con- straints, while providing the highest level of performance possible. We discuss these constraints and performance metrics in detail in the following section. Our image processing pipelines are based on YOLO [29], a real-time object de- tection system that uses a deep convolutional neural network to detect and localize objects in images. The YOLO architecture uses a single deep network and requires only one pass over the input image. YOLO was developed as a competitor for the Pascal VOC Challenge [73] and is capable of detecting multiple objects in an image at once. Here we use two variations of YOLO, namely TinyYOLO and YOLOv2, as our image processing pipelines to detect cars on a highway. As shown in Figure 4.2, TinyYOLO and YOLOv2 exemplify the performance tradeo at the core of polymorphic computing. TinyYOLO is a 16 layer CNN while YOLOv2 is 32 layers deep. On a Titan X GPU, TinyYOLO is capable of running at 207 frames-per-second (fps) but only achieves a mean-average-precision (mAP) of 57.1%. On the other hand, the slower YOLOv2 runs at 67 fps, but it achieves a mAP of 76.8%. 4.3 Performance Metrics When the vehicle receives external sensor data (images in our case), VESPER determines which processing pipeline to use and where in order to extract the most useful information from the image within the time constraints. Based on the 58 (a) Drones-eye view of a highway (b) TinyYOLO (57.1% mAP @ 207 fps*) (c) YOLOv2 (76.8% mAP @ 67 fps*) * on an NVIDIA Titan X GPU [30] Figure 4.2: Illustration of polymorphic computing: two dierent vision processing pipelines (variations of YOLO) for vehicle detection in images that oer dierent tradeos between accuracy and performance. 59 availability and connectivity of the computing devices, it will choose a scheduling assignment. The following metrics are used to evaluate the performance of the controller algorithm: Latency: Latency, or makespan, represents the time it takes an image to make its way through the processing pipeline. It is a function of the chosen pipeline, the resource availability, and the wireless link quality at the time of execution. In order for the system to be useful to a driver, the system needs to satisfy a latency constraint so that the driver has enough time to react to information he or she is provided. The latency constraint would be dictated by how far the drone is expected to travel in front of the car. Throughput: In order to maintain the driver's awareness of trac conditions ahead, the system should deliver updates to the driver at a reasonable rate, or throughput. The throughput constraint, measured in frames processed per second, is based on the visual range of the drone's camera and the number of images needed to maintain continuous coverage. This constraint determines the minimum rate at which the system needs to process images. This metric is a measure of the scheduler's ability to parallelize the workload based on available resources. Accuracy: The accuracy metric represents how well the output of the pro- cessing pipeline ts the ground truth in the real world. 
The scheduler will attempt to use the most accurate pipeline for the longest period of time while 60 the system is running. However, changing environmental conditions and re- source availability will force the scheduler to adapt in order to maintain a functioning system. A faster, but less accurate, pipeline may be selected in order to satisfy time constraints. These decisions will lead to changes in the expected accuracy of the system for the time in which that decision is in eect. Therefore, we believe that the time-averaged expected accuracy is a useful measure for algorithm performance. In this work we use mAP as our accuracy value. The primary goal of the VESPER algorithm is to maximize the system's accu- racy while ensuring that the throughput and latency constraints are satised. The throughput constraint is a lower bound while the latency constraint is an upper bound. In our envisioned application, these constraints would be dependent on the speed of the car and the distance to the drone as shown in Figure 4.3. These could be tracked by the car and provided as input into VESPER. For example, let the distance of the drone from the car be D, the speed of the car be v and the maximum reaction time of the driver be T r . The time needed to reach the drone's current location is therefore D v . Under these circumstances, the latency constraint should be set to D v T r . If this latency constraint is not satised, the driver will either receive the results after the car has passed the relevant section of the road or at a point where the driver has no time left to react to the situation, rendering the information useless. 61 v v R D H Figure 4.3: Parameters aecting the real-time constraints Let the range of the drone's camera be R. If the post-processing step of the system requires an area of the road to be covered by at leastO number of successive images for an ensemble model to merge the results, then the car should receive at leastO images in the time it takes the drone to traverse that area of road. Since the drone and car should be traveling at the same speed, this results in a requirement of O= R v or Ov R processed images per second. If the throughput does not satisfy this constraint, there might times when the driver has either no perception or an inaccurate perception of the road ahead. 4.4 System Design 4.4.1 Framework The VESPER framework consists of several components, namely the image source, scheduler, dispatcher, token manager, performance monitor, pipeline database, and 62 one or more devices. The framework components are connected as shown in Figure 4.4. Image Source The nature of our target application requires a constant stream of images. These images are provided by a drone over a wireless link. Due to the transient nature of wireless links and other unpredictable circumstances, we have implemented the ability to control the frame rate used by the drone when supplying the images. This provides some exibility for the system to adapt its throughput. We discuss this capability in more detail in Section 4.4.2. Scheduler The scheduler is the key component of our framework and is where the VESPER scheduling algorithm executes. The scheduling algorithm is respon- sible for selecting an image processing pipeline so as to maximize the accuracy of the system while satisfying the makespan and throughput constraints. 
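As a concrete illustration of how these constraints can be derived from the scenario parameters of Section 4.3, consider the following sketch; every value in it is an assumption chosen for the example rather than a measured system parameter.

// Illustrative derivation of the real-time constraints (all values assumed).
const D  = 150;   // distance between the drone and the car (m)
const v  = 30;    // vehicle speed (m/s)
const Tr = 4;     // maximum driver reaction time (s)
const R  = 25;    // visual range of the drone's camera (m)
const O  = 5;     // successive images required over a stretch of road

const makespanConstraint   = D / v - Tr;   // M0 = D/v - Tr  -> 1.0 s
const throughputConstraint = O * v / R;    // T0 = O*v/R     -> 6.0 fps
console.log(makespanConstraint, throughputConstraint);

Because D and v change as the car drives, the car can recompute these bounds on the fly and hand them to the scheduler as its current makespan and throughput constraints.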
Due to its critical nature, the scheduler is run on the car to make the system robust against intermittent device connectivity, since external devices such as an RSU may not always be connected. Section 4.4.3 describes the scheduler in more detail. Dispatcher Communication between the scheduler and any devices connected to the system take place through the dispatcher. The dispatcher relies on TCP to ensure that all messages are received. 63 Pipeline Profiles Scheduler Dispatcher Token Manager Devices Performance Monitor Image Source Figure 4.4: Framework Architecture Token Manager To ensure that we do not assign too much work to a device, we have implemented a token manager. Tokens are associated with a particular device and the number created depends on the amount of threads of work that a device can support. A token is required to assign work to a device. Performance Monitor The performance monitor tracks the various metrics used to assess system performance, including makespan ( ^ M) and throughput ( ^ T ). An exponential weighted moving average (EWMA) is used to average some of these values over time. The monitor also tracks per-device performance to allow the scheduler to make more informed decisions. This includes the device's processing rate and round-trip time (RTT). Pipeline Database We collect the execution times of our image-processing pipelines oine for each type of hardware and store this information in a database. The 64 scheduling algorithm uses this information along with live performance measure- ments to predict if a pipeline is feasible or not under the current operating condi- tions. Devices At any given point, there could be multiple computing devices connected to the car, such as nearby RSUs. Each device receives computing jobs from the dispatcher. For each incoming job, the device timestamps the start and the nish time of execution, which are then used by the performance monitor to prole the execution times of these devices and their corresponding link qualities. 4.4.2 Frame Rate Adaptation Queuing theory suggests that the average output rate of our system cannot exceed the average rate at which images are arriving at the input. Therefore, it is necessary to ensure that images are arriving suciently fast. Since the link between the drone and the car introduces varying delay between images, the VESPER algorithm applies frame rate adaptation to make sure that the system delivers the desired throughput performance to satisfy the application constraints. As shown in Algorithm 1, VESPER uses a proportional control system to make frame rate decisions. In Algorithm 1, T 0 represents the system's throughput con- straint. This constraint is determined by application requirements. The controller attempts to approach the throughput constraint by using the ratio T 0 = ^ T . If ^ T is too low, then this ratio will exceed unity and the requested frame rate will increase. 65 Algorithm 1 VESPER Frame Rate Adaptation 1: T 0 throughput constraint 2: ^ T measured system throughput 3: procedure UpdateFrameRate(T 0 , ^ T ) 4: T = 1.01 * T 0 5: newRate = (T / ^ T ) * T 6: newRate = max(T , min(newRate, 1:2T )) 7: RequestFrameRate(newRate) 8: end procedure On the other hand, if ^ T is too high, the ratio will be less than unity and lead to a reduction in the requested frame rate. The further away the reception rate is from the requested frame rate, the more drastic the change. We have modied the proportional controller slightly to make it more robust in practice. 
Firstly, we boost the throughput constraint by 1% to make the system more likely to satisfy this requirement. We've observed empirically that aiming exactly for the constraint would often cause the system to fall short. Secondly, we set a lower and upper bound on the requested frame rate. The lower bound is simply a logical restriction whereas the upper bound is included to prevent any excessive rates from being requested. 4.4.3 Scheduling Algorithm The scheduling component of VESPER determines which image processing pipeline and which devices to use in an attempt to maximize accuracy while ensuring the real-time constraints are satised. The scheduler runs on the car and makes its 66 Algorithm 2 VESPER Pipeline Selection 1: T 0 ;M 0 throughput and makespan constraints, respectively 2: pipelines ordered list of pipelines (fastest to slowest) 3: devices list of devices and their performance data 4: procedure SelectPipeline(T 0 , M 0 , pipelines, devices) 5: pipeline = 0 6: for p = 0 to len(pipelines) do 7: throughput = 0 8: for all d in devices do 9: est makespan = EstimateMakespan(d, pipelines[p]) 10: if est makespan<M 0 then 11: throughput += 1/est makespan 12: end if 13: end for 14: if throughputT 0 then 15: pipeline = p 16: else 17: break 18: end if 19: end for 20: SetPipeline(pipeline) 21: end procedure 22: procedure EstimateMakespan(device, pipeline) 23: makespan = pipeline.complexity[device] 24: / device.processing rate 25: makespan += device.rtt 26: return makespan 27: end procedure decisions based on the most recent performance data available. As images arrive, they are distributed to scheduled devices using a token-based queuing system. 4.4.3.1 Devices and Tokens Devices are the workhorse of the VESPER framework. When a device establishes connectivity to the car, the device generates tokens based on the number of GPUs available to be used on that device. These tokens are placed in a FIFO queue. 67 A token is needed by the scheduler in order to assign an image to a device for processing. These tokens essentially limit the number of images that can be in- process at a device at any point in time. A token is consumed from the token queue when a job is sent out to a device and is recreated when the device completes the work. If a device completes its work quickly, its token gets added back to the queue very frequently. VESPER rewards ecient work with more work. It is possible for a connected device to not be used at all if it fails to meet the system's makespan constraint. To ensure that performance data for such devices do not get stale, VESPER will periodically probe the device with fake work to get fresh measurements. If the link to a particular device is lost, the system may lose some frames that were assigned to that device, but the frames on other devices are still processed as normal. Loss of a few frames is acceptable as VESPER reacts quickly to ensure that the throughput and makespan constraints are still satised. Once a device is disconnected, the controller removes the token for that device and it is not considered for further scheduling until it reconnects. Note that there will always be at least one device available to the system, namely the car itself. 4.4.3.2 Performance Measurements The VESPER scheduler will periodically review the past performance of all the devices and determine the best pipeline to use for processing subsequent images. It accomplishes this by tracking the processing rates and link times for each device. 
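The next paragraph describes where the underlying timestamps come from; the bookkeeping they feed can be sketched as follows (the field names and the EWMA weight are ours, introduced only for illustration).

const ALPHA = 0.3;   // EWMA weight (assumed)
const ewma = (old, sample) =>
  old === undefined ? sample : (1 - ALPHA) * old + ALPHA * sample;

// Called whenever a completed job returns from a device.
function updateDeviceStats(device, job, pipelineComplexity) {
  const makespan = job.returnedAt - job.dispatchedAt;   // timestamps taken at the car
  const execTime = job.finishedAt - job.startedAt;      // timestamps taken at the device
  device.rtt             = ewma(device.rtt, makespan - execTime);                      // link contribution
  device.processing_rate = ewma(device.processing_rate, pipelineComplexity / execTime); // effective rate
}

These two running estimates are exactly the quantities that the EstimateMakespan procedure in Algorithm 2 consumes.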
68 Jobs are timestamped at the car when they leave and return to the Dispatcher. This makes it possible for the car to calculate a job's entire makespan. In addition, jobs are timestamped at the devices when the devices start and nish working on them, allowing VESPER to determine the execution time for a job on each device, without requiring time synchronization with the car. When combined with the pipeline proles, the execution time can be used to approximate an eective processing rate for each device. By subtracting the execution time from a job's makespan, the latency, or RTT, due to the link can also be determined. 4.4.3.3 Scheduler Logic As shown in Algorithm 2, VESPER uses the processing rate and link estimates to determine if a pipeline is feasible given the throughput and makespan constraints of the system. If a higher-accuracy pipeline is feasible given the available devices then VESPER will switch to that pipeline. If a device leaves or its performance worsens, VESPER may drop back to a faster, albeit less accurate, pipeline to ensure the real-time constraints are still satised. The design of the algorithm is such that it will terminate faster when the system is performing poorly. The scheduler checks the feasibility of pipelines in increasing order of computational complexity. When the algorithm reaches a pipeline that cannot be accommodated, it will terminate early. The algorithm is O(PD), where 69 Roadside Unit Roadside Unit Camera Car Roadside Unit WiFi WiFi Figure 4.5: Network Topology P is the number of pipelines available andD is the number of devices connected at the time of execution. The images sent by the drone are buered by the car until they expire or are scheduled to be processed by a device. The scheduler loop runs at a xed frequency and uses device performance data to make its decisions, as described in Algorithm 2. ^ M and ^ T are measured over xed time intervals and are continuously updated by the performance monitor. The monitoring frequency is three times as fast as the scheduler to ensure that the scheduler uses up-to-date measurements. 4.5 Experimental Setup We devise a set of experiments to evaluate the performance of VESPER under various conditions. As shown in Figure 4.5, we assume a star topology centered around the car. We believe this topology accurately represents a real-world scenario for a driver perception augmentation application. The camera, in our case, is assumed to be mounted on a drone. 70 There are various wireless technologies that can be used to provide connectivity between the devices. We imagine that the car would communicate to the drone and roadside unit(s) using Wi-Fi. Due to the dicultly of getting permission to y a drone around moving vehicles to test our system in a live environment, we instead use a tool called Mahimahi [74] to emulate the links in our experimental scenarios. 4.5.1 Mahimahi Network Emulation Mahimahi provides a set of network emulation tools which made it possible to run realistic experiments in our lab. In particular, Mahimahi's trace-driven link emu- lation tool, LinkShell, enabled us to test our system with realistic link conditions and allowed us to run reproducible experiments. While the links are emulated, we believe the use of traces obtained from real environments supports our condence that the results we obtain in-lab will be indicative of what we would expect to see in the real-world. 
The trace les read by LinkShell are a list of timestamps in milliseconds that represent an opportunity for a 1500-byte packet to be delivered across a link. We created our own trace les for the car-drone and car-RSU links. For the car-drone link trace le, we set up two Wi-Fi devices on the sidewalk of a busy main street midday with a constant ow of vehicles. For the car-RSU link trace le we placed one Wi-Fi device in a car and the other in a xed location. The car starts o by approaching the RSU from a distance, circles around the RSU for a few laps, 71 and then continues past. For both cases, the Linux tools iperf, tcpdump, and tshark were used to saturate the link, record the trac, and generate the trace le, respectively. A Netgear Gigabit Ethernet Switch served as the backbone for our experiments. 4.5.2 Image Processing Pipelines VESPER utilizes the reference implementations of the YOLO and TinyYOLO ob- jection detection pipelines, which are publicly available. Both networks are loaded into memory at runtime and await images for processing. We proled both pipelines on all of our devices, namely the car and the RSU. 4.5.3 Benchmarks To better assess the performance of VESPER, we implemented a static algorithm within our framework to use as a benchmark. The static algorithm has no capability for frame-rate or pipeline adaptation. This algorithm uses all connected devices regardless of their performance. 4.5.4 Hardware Specications Table 4.1 describes the hardware we used. In all of our experiments, we assign the number of tokens based on the number of GPUs present in each device. For the car and RSU we issued a single token. We did not consider using the drone for neural 72 Figure 4.6: Experimentation Testbed Device CPU Cores RAM GPU Drone (Raspberry Pi 3) ARM Cortex-A53 (1.2 GHz) 4 1 GB - Car (Desktop) Intel Core i7-4770 (3.4 GHz) 4 16 GB NVIDIA 1050 Ti RSU (Jetson TX2) ARM Cortex-A57, Denver 2 (2.0GHz) 4+2 8 GB NVIDIA Pascal Table 4.1: Hardware Specs network computation due to its power limitations, as it is running a Raspberry Pi 3 [65]. Figure 4.6 shows the Jetson TX2 [75] testbed we used for our experiments. Images are captured by a Logitech HD Pro C920 Webcam connected to the drone. The camera captures images at 1080p resolution. At the level of JPEG compression we use for transmission, images are about 20-30 KB in size. 4.6 Results Through a series of experimental scenarios, we evaluate the performance of VES- PER and demonstrate its capabilities. The scenarios are run for 30 minutes. Unless 73 otherwise stated, the makespan and throughput constraints for each experiment are M 0 = 0:8 seconds and T 0 = 8:0 frames/second (fps), respectively. By default, the frame rate adaptation and scheduling algorithms are run every 6 seconds. We refer to this as the control loop time. The control loop time is a tunable parameter that determines how quickly the algorithm takes actions. We've determined experimen- tally that 6 seconds works adequately. Through our experiments, we aim to answer the following questions: How well does VESPER perform and what overhead is incurred by using this framework? How well does VESPER scale to leverage external devices and how responsive is it to changes in computational resource availability and link quality? How well does VESPER adapt when devices are only intermittently con- nected? 
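All of the scenarios below run over these trace-driven links, so link quality varies over time rather than being a fixed parameter. For reference, the capacity a LinkShell trace encodes can be recovered directly from its packet-delivery timestamps; the sketch below does this per one-second window (the trace file name and the window length are illustrative).

const fs = require('fs');

// Each trace line is a millisecond timestamp at which one 1500-byte packet may cross the link.
function linkRateMbps(tracePath, windowMs = 1000) {
  const stamps = fs.readFileSync(tracePath, 'utf8')
    .split('\n').filter(l => l.trim().length > 0).map(Number);
  const rates = [];
  for (let start = 0; start <= stamps[stamps.length - 1]; start += windowMs) {
    const pkts = stamps.filter(t => t >= start && t < start + windowMs).length;
    rates.push(pkts * 1500 * 8 / (windowMs / 1000) / 1e6);   // Mbit/s in this window
  }
  return rates;
}

console.log(linkRateMbps('car-rsu.trace'));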
4.6.1 Scenario 1: Overhead In the rst scenario, we aim to assess the overhead of VESPER by comparing it to a static algorithm in an environment where the car is the only device available for computation. Figure 4.7 shows the average accuracy, as dened in Section 4.3, versus average throughput. The vertical red line represents the throughput constraint for the system. We know if the system is satisfying the throughput 74 Figure 4.7: Scenario 1: Accuracy/Throughput Performance Figure 4.8: Scenario 1: Makespan Distribution 75 Figure 4.9: Scenario 2: Accuracy/Throughput Performance Figure 4.10: Scenario 2: Makespan Distribution 76 Figure 4.11: Scenario 2: Percentage of work by each device Figure 4.12: Scenario 3: Accuracy/Throughput Performance 77 Figure 4.13: Scenario 3: Makespan Distribution constraint if it is located on or to the right of this line. In this scenario, we can see that both the Static and VESPER algorithms met this constraint. The two horizontal lines in the plot represent the accuracy of the two pipelines available for selection by the scheduler, namely TinyYOLO and YOLOv2. The average accuracy of our system will be located somewhere between those two ac- curacies, based on how long each pipeline is used during operation. The Static algorithm will always be on one of these lines, as it does not switch pipelines. In this case it is set to only use TinyYOLO. We can conclude from this plot that VESPER did not nd it possible to schedule YOLOv2 while satisfying the through- put constraint since it is also on the blue line. This is actually by design, as we wanted to focus on assessing VESPER's overhead for this scenario. From data we've gathered, the car takes about 0:188 seconds on average to process an image using 78 YOLOv2 and therefore can only support about about 5 fps using that pipeline. On the other hand, TinyYOLO takes about 0:098 seconds per image on the car. With a throughput constraint of T 0 = 8:0, VESPER determined that only TinyYOLO could be used. Through the empirical makespan CDF in Figure 4.8 we can see that there is little overhead with running the VESPER algorithm over the static case. This plot shows the latency of image processing, from image reception to result, with the vertical red line representing the makespan constraint. We know there were no late images since the CDF hits 1:0 before reaching this line. Comparing the two CDFs, there is no signicant penalty incurred when running the VESPER algorithm with no external devices present. The CDFs are fairly sharp since only local processing on the car is performed, meaning link quality does not play an important role here. 4.6.2 Scenario 2: Scalability In the second scenario, we introduce the RSUs as external computing resources and observe VESPER's ability to leverage them to improve its accuracy. Four RSUs are active for the entire experiment (meaning they are operational but still subject to varying link quality). In Figure 4.9, we observe that VESPER, while main- taining the throughput constraint, is able to achieve an average accuracy between TinyYOLO and YOLOv2. 79 Whenever a new device connects to the system, the VESPER controller probes the device initially to estimate the link quality and the device execution rate. Dur- ing this phase, the car is the only device executing the jobs. Since the car can only support the TinyYOLO pipeline while meeting the constraints, the accuracy during the initial phase is low. 
After the probing phase, which may end after just a single frame for a device, VESPER is occasionally able to change the pipeline to YOLOv2 when conditions allow, giving the system better accuracy. The time-averaged ac- curacy of VESPER is, therefore, better than TinyYOLO's accuracy. By leveraging the RSUs, VESPER is able improve the average accuracy of the system. Through the makespan CDF, show in Figure 4.10, we see that while VESPER achieves higher accuracy, this accuracy comes at the cost of makespan due to the heavier processing for YOLOv2. This plot helps to visualize the tradeo that VESPER is making. As long as it continues to satisfy the makespan constraint, VESPER may sacrice makespan in order to improve accuracy. In this scenario, we can also see that the static algorithm suers from not being able to adapt the frame rate of the camera. The varying link conditions introduce too much delay into the system and causes the throughput to suer. Without the ability to compensate for this, the static algorithm is not able to satisfy the throughput constraint. Finally, Figure 4.11 shows the percentage of images processed by each device. The car, being the fastest device available, does the largest percentage of work for 80 all algorithms. Since the RSUs are similar in speed and link quality, they each performed an equal portion of the remaining work. 4.6.3 Scenario 3: Makespan Constraints Scenario 3 demonstrates VESPER's ability to respect the makespan constraint. In cases where the makespan budget allows, VESPER is able to tradeo makespan for better accuracy. In this scenario, VESPER is run thrice with three dierent makespan constraints: 0:6, 0:7 and 0:8 seconds. The car and all four RSUs are active during this experiment. Through Figure 4.12, we can see that for the strictest makespan constraint of 0:6 seconds, VESPER determined that it was not feasible to run the YOLOv2 pipeline, leading to the lowest possible mAP performance; that of TinyYOLO. For the other two cases, VESPER was able to make use of the YOLOv2 pipeline. With the larger makespan budget, VESPER is able to get closer and closer to the performance of YOLOv2, meaning it could spend more time using that pipeline. In Figure 4.13, forM 0 = 0:6 we can see two humps corresponding to the average makespan of the TinyYOLO pipeline, specically 0:098 seconds and 0:275 seconds for the car and RSUs, respectively. As the makespan budget was increased, we observed that VESPER was able to make some use of the YOLOv2 pipeline to improve accuracy while still satisfying the throughput constraint. For the 0:8, an additional hump in the CDF plot can be observed around 0:638 seconds, which 81 Figure 4.14: VESPER performance over time in Scenario 4 corresponds to the average YOLOv2 makespan for the RSUs. The YOLOv2 time for the car is not as noticeable, since this hump gets smoothed out by the TinyYOLO makespans. 4.6.4 Scenario 4: Intermittent Resource For Scenario 4, we further investigate VESPER's ability to leverage an intermit- tent external resource to help improve its accuracy. In this experiment we use a throughput constraint ofT 0 = 5:5 fps but keep the makespan constraint atM 0 = 0:8 seconds. For this scenario, the car begins processing on its own. For a short window 82 of time from 180 seconds to 450 seconds, the RSU is connected to the car and is available for computation. Figure 4.14 shows the time series plot for this scenario. 
The rst plot in the gure shows the rate at which images are arriving at the car, while the second plot shows the throughput, the rate at which results are produced by the system. We can see in the second plot that VESPER does a decent job of maintaining the throughput constraint. The third plot (red) shows the processing makespan of the system, averaged over 2 second windows. We can see in this plot that there is an increase in the makespan when VESPER is using the YOLOv2 pipeline. The selected pipeline is shown in the fourth plot. We can see that the YOLOv2 pipeline was only used in the time period when the RSU was available. When the controller realized that it could support YOLOv2 and still satisfy the real-time constraints with the present devices, it changed the pipeline. The link to the RSU is, however, not always stable and VESPER drops the pipeline in response to those changes. This scenario demonstrates that VESPER can leverage extra resources and im- prove the accuracy of the system by using the YOLOv2 pipeline whenever possible. When the RSU becomes unavailable, after 450 seconds, VESPER sticks to using TinyYOLO. In the last plot of Figure 4.14, we can see how VESPER used the devices for parallel processing of the images. This plot shows a dot for every image completed by a device. At some points, the quality of the link to the RSU was bad and it received fewer images to process. 83 4.7 System Performance Model In our initial series of VESPER experiments, we emulated the performance of our system using trace data collected from our testbed. Since it would be dicult to perform a more thorough evaluation of our system in this manner, we developed a model to help us capture VESPER's salient properties. In this section we describe the design of this model and present the results we obtained through simulation. 4.7.1 Model Design We construct our model as follows. We assume that there is a car driving down a straight length of road, on which RSUs are evenly distributed at some spacing interval. As the car travels this road at some constant speed, it may enter or exit communication range of the RSUs. We use a disc model for determining connectivity between the car and an RSU, meaning if the RSU is less than a certain distance away from the car then it is capable of communicating with it and vice versa. While the car's motion is continuous, we make observations of the model at discrete points in time to make anaylsis tractable. We create time windows, which we refer to as frames, during which we determine how many RSUs are available and assume some statistics about the latency of carrying out computations on the car and the RSUs. These frame statistics incorporate the delays due to both processing time and link latency. VESPER uses these statistics to make its scheduling decision 84 for the duration of that frame. This captures how VESPER makes its scheduling decisions at periodic intervals based on past performance. To keep the model simple and avoid concern about queue variations, we ignore the drone part of the system and assume that images are always available at the car as needed. 4.7.2 Results Unless otherwise indicated, we use the default settings shown in Table 4.2 for our simulation. The number of frames indicates how many time windows we generated for our experiments. We select a certain number of pipelines that are available for processing and evenly distribute their mean latency in the range from 1/30 to 1/10, inclusively. 
The accuracy, or quality, of the pipelines is evenly distributed between 0 and 1, with 1 being assigned to the best, or most preferred, pipeline. For each frame, the mean latencies associated with executing a particular pipeline on a particular device are sampled from a uniform distribution around the pipeline's mean latency. For RSUs, this range is the pipeline's mean latency ±0.02, while for the car the range is ±0.01. This variation is included to capture the change in link quality that may occur as the car moves around within a frame's time span, as well as any resource contention occurring on the device which may introduce some additional delays. The RSUs were placed 40 m apart and we imagine that the car is traveling at 60 mph. The latency and throughput constraints are 0.10 s and 15 fps, respectively.

Setting                  Value
Frames                   100
Number of Pipelines      3
Pipeline Latency Mean    [1/30, 1/10]
RSU Spacing              40 m
Maximum Latency          0.10 s
Minimum Throughput       15 fps
Latency Variance         10^-6
Table 4.2: VESPER Simulation Settings

The variance of the total latency for all devices is 10^-6. We assume the latencies follow a normal distribution. For our simulation experiments, we compare VESPER to the performance of static algorithms that either always choose the least accurate pipeline or the most accurate pipeline, which we refer to as Static Low and Static High, respectively. These static algorithms serve as a benchmark for VESPER's performance.

4.7.2.1 Varying the Latency Constraint

In this experiment, we sought to observe the change in VESPER performance as the latency constraint is varied. Figure 4.15 shows the makespan CDF and average throughput plots for three sets of values, where the latency constraint is varied from 0.055 to 0.11 seconds. In the makespan CDF plots we can see that VESPER transitions from a CDF that closely resembles the Static Low pipeline to a CDF that closely resembles the Static High pipeline as the latency constraint is increased. This means that VESPER makes more use of the higher-accuracy pipelines since the larger latency constraint provides more time to do so.

Figure 4.15: VESPER Performance vs. Latency Constraint (makespan CDF and average throughput panels for L_max = 0.055, 0.08, and 0.11 s, comparing VESPER, Static Low, and Static High)

In the average throughput plots of Figure 4.15, the accuracy of VESPER increases as the latency constraint is increased. However, as VESPER makes more use of slower pipelines, the highest feasible throughput decreases. Just as in the makespan CDF plots, VESPER gets closer and closer to the Static High performance as the latency constraint allows it to do so. VESPER is designed to trade off latency for accuracy, and these results indicate that VESPER is capable of doing so.
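The behavior in Figure 4.15 follows directly from the per-frame feasibility test that the model borrows from Algorithm 2: a pipeline is adopted only if the devices that can finish it within the latency constraint jointly deliver the required throughput. A compact sketch of that test, with illustrative variable names:

// pipelines is ordered fastest-to-slowest; deviceLatencies[p] holds the sampled
// makespan of pipeline p on every in-range device for the current frame.
function selectPipeline(pipelines, deviceLatencies, latencyMax, throughputMin) {
  let chosen = 0;                                    // fall back to the fastest pipeline
  for (let p = 0; p < pipelines.length; p++) {
    const feasible = deviceLatencies[p].filter(m => m < latencyMax);
    const throughput = feasible.reduce((sum, m) => sum + 1 / m, 0);
    if (throughput >= throughputMin) chosen = p;     // slower but more accurate pipeline fits
    else break;                                      // later pipelines are slower still
  }
  return chosen;
}

// Example frame: car plus two RSUs, three pipelines, constraints from Table 4.2.
const sampled = [[0.035, 0.040, 0.045], [0.065, 0.070, 0.075], [0.095, 0.100, 0.105]];
console.log(selectPipeline([0, 1, 2], sampled, 0.10, 15));   // selects pipeline 1

Raising the latency constraint admits slower devices and pipelines into the feasible set, which is exactly the shift from the Static Low curve toward the Static High curve seen in the figure.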
4.7.2.2 Varying the Throughput Constraint

[Figure 4.16: VESPER Performance vs. Throughput Constraint. Panels show the makespan CDF and average throughput for T_min = 10, 20, and 30, comparing VESPER, Static Low, and Static High.]

Figure 4.16 shows how VESPER adapts to changes in the throughput constraint. On the left side of the figure, the low throughput constraint allows VESPER to make use of the most accurate pipeline for the majority of the time, as its proximity to the Static High algorithm in the throughput plot indicates. When the throughput constraint increases to 25, the accuracy drops as VESPER can no longer make as much use of the most accurate pipeline due to its high latency. Finally, when the throughput constraint increases to 45, the accuracy again drops as VESPER gradually approaches the performance of the Static Low benchmark. In summary, VESPER is able to maintain the throughput constraint by sacrificing accuracy.

4.7.2.3 Varying the Number of Pipelines

[Figure 4.17: VESPER Performance vs. Number of Pipelines. Panels show the makespan CDF and average throughput for three, four, and five pipelines, comparing VESPER, Static Low, and Static High.]

We theorized that the performance of VESPER would improve with the number of pipelines available to select from, assuming they offer a unique tradeoff between accuracy and latency. The result of our experiment to test this is shown in Figure 4.17, where we vary the number of pipelines from three to five. Note that the blue dashed lines in the throughput plots represent the accuracies of the pipelines. We observe that as the number of pipelines is increased, VESPER gets closer and closer to the highest possible accuracy.

4.7.2.4 Varying the RSU Spacing

[Figure 4.18: VESPER Performance vs. RSU Spacing. Panels show the makespan CDF and average throughput for spacing = 0.5, 1.5, and 3.0, comparing VESPER, Static Low, and Static High.]

Here we explore how the deployment of RSUs along a road might affect the performance of VESPER. Spacing refers to how far apart the RSUs are deployed from each other. Figure 4.18 shows the results for three values of spacing, specifically 0.5, 1.5, and 3 times the distance traveled by the car in one second.
The results indicate that as the RSUs are spaced further and further apart, VESPER loses its ability to support the most accurate pipeline. This makes sense, as there are fewer devices available for processing at a given time when the RSUs are spaced farther apart.

4.8 Conclusion

We have presented VESPER, a real-time polymorphic computing framework for driver perception augmentation. VESPER exploits the computational resources of devices connected wirelessly with the car to perform complex processing tasks. It handles intermittently connected devices and allows for workload adaptation, wherein the processing pipeline can be changed based on the available resources of the devices and their link qualities. We have developed the framework and demonstrated its performance using a computer vision task for identifying vehicles in drone images. Through our experiments we have shown that VESPER maximizes the accuracy of the system while satisfying the real-time constraints of latency and throughput for the application.

While the development of VESPER was focused on vehicles, the concepts that we have developed are more broadly applicable. They can be applied in any similar domain where there are dynamically-available resources external to a device. This dynamism could be due to the device's motion, the resources' motion, or a number of other factors. VESPER demonstrates that these resources can be utilized opportunistically by the device to improve its processing of complex tasks.

Chapter 5

Utility-based Scheduling Algorithms for Polymorphic Applications

With the growing deployment of various connected sensor devices, including cameras, LIDAR, RADAR and other data-rich modalities, in many environments from buildings to road-side to on board vehicles, many new applications are being developed that rely on the processing of the raw data. While the past decade or more has seen the consolidation of processing in the cloud, more recently there has come to be a growing awareness that processing may also need to be done at the edge, close to where the data is generated, in order to reduce the bandwidth costs associated with data hauling, for lower latency response, as well as for privacy reasons in some cases [77]. This chapter describes work that appears (in part) in the work by Wright et al. [76].

Even as the number of compute points available at the edge may grow, for instance with greater computation availability on board vehicles or road-side units in the case of mobility-oriented applications, trends suggest that the volume of data to be processed might grow even faster [78, 79]. As the ratio of data sources to available compute points increases, a growing bottleneck is the decision on where to perform each computation. While many prior works focused on computation offloading between a single mobile device and the cloud [80, 9, 8, 13], scheduling decisions will increasingly need to be made over multiple distributed edge computing as well as cloud-based computing resources. The different compute points may offer different levels of computational capability and may have different levels of latency between the source of data and the respective compute point. Scheduling algorithms are therefore needed at the edge to perform the task of allocating compute jobs in real time to the appropriate compute points. There is a small but growing literature on this kind of distributed/mobile edge computing [11, 81, 82].
Much of this literature focuses on optimizing the schedules with respect to a single metric such as latency or throughput. Further, much of the literature also considers a traditional computational model in which the application to be scheduled, in terms of the actual computations to be performed, is fixed and rigid. This more traditional model of computation is being challenged by developments in machine learning, specifically deep learning neural networks. The same data can be processed by deep learning networks with different architectures, with different numbers of neurons and layers. This creates a different, polymorphic application stack, in that the same application can be implemented in different ways. Different forms of the same computation may then provide different tradeoffs in terms of how much data they ingest/produce, how much computation they do, and what accuracy or other application-level performance they provide.

In this work, we tackle head-on two key problems: a) how to schedule modern data processing jobs that fit such a polymorphic computation framework, and b) how to allow the scheduling decision to be flexible enough to provide different tradeoffs between a multiplicity of performance metrics. Concretely, we do this by proposing Usher, a general, multi-dimensional, utility-based scheduling framework for polymorphic applications, which has the following key features:

- Usher allows for multiple different jobs to be scheduled simultaneously across a collection of multiple compute nodes.
- Usher allows for general utility functions that can capture (and be tuned by designers for) combinations of different metrics such as latency, throughput, energy usage, economic cost, and application-level performance (such as classification accuracy).
- Usher accounts for vector resources and corresponding constraints on them at each compute node, such as storage, memory, general-purpose processors, accelerators, as well as network bandwidth.
- Usher allows for multiple implementations of the same job, which can be optimized to accommodate different tradeoffs.

[Figure 5.1: Example of scheduling jobs onto nodes. Implementation m_2 of Job 0 is assigned to Node 0 while implementation m_1 of Job 1 is assigned to Node 2.]

5.1 The Scheduling Problem

In this section, we (a) describe our system model for representing jobs and (b) formulate the utility-aware job assignment as an optimization problem.

5.1.1 System Model

To establish our model, we assume that we have a set of jobs, for which multiple implementations exist. There is also a set of compute devices, or nodes. Each node is logically connected to the host across a network, as shown in Figure 5.1, and has some set of resources available for use. The host seeks to allocate jobs to these nodes for execution in the best possible manner according to the jobs' utility functions.

Let J = {J_0, J_1, ...} represent the set of jobs to be scheduled. Each job J_i may have one or more implementations, designed for different hardware and/or with different tradeoffs with regards to its performance. For example, consider that there may be three versions of a job: one written for a general-purpose CPU and two versions that have been GPU-accelerated. All versions of the job fulfill the same purpose but they may have different performance characteristics.

Having multiple versions available for a scheduler to choose from provides flexibility with regards to performance, and there are various benefits.
Let us continue with the example from above. On one hand, one of the GPU implementations may execute faster than the other by trading off the accuracy of its result for speed by using a different algorithm. On the other hand, the CPU implementation allows the job to be primarily executed on a different type of hardware, namely a CPU as opposed to a GPU. These two types of benefits increase the scheduling opportunities and can therefore improve the overall utility that can be realized.

The resource requirements of a job, such as the number of processing cores required or the amount of memory, depend on which node the job is executed on (for example, one node may be x86-based while another uses ARM) and which implementation or instance is used. For each job J_i, we use r(i, n, m, d) ∈ Z, d = 1, ..., D, to indicate the amount of resource d required by job J_i running on node n using implementation m. D represents the number of independent resources considered by the scheduler and can include items like processors or memory.

Each job J_i has a utility function u(i, n, m) ≥ 0 associated with it. As the arguments indicate, the utility of a job is a function of the implementation used and the node it is scheduled on. This allows user preferences to be taken into consideration at the job level during the scheduling process, and the utility can include metrics such as latency, accuracy, resource consumption, execution cost, and reliability. Some of these metrics are properties that are static or initially obtained through offline analysis, but these can later be updated based on recent executions of the job in our system. The advantage of defining utilities at the job level is that it allows the user to specify different preferences depending on the type of job being scheduled. For example, the priority for one job may be the speed of execution whereas for another it may be reliability. All of these considerations can be accommodated by our scheduler through the user-specified utility functions.

5.1.2 Problem Formulation

The goal of our system is to maximize the overall utility of a job schedule. Let X(i, n, m) ∈ {0, 1} be the binary allocation variable, where X(i, n, m) is 1 if job J_i is scheduled on node n using implementation m and 0 otherwise. We formulate our schedule optimization problem SOP as follows:

\begin{align*}
\text{maximize} \quad & \sum_{i}\sum_{n}\sum_{m} X(i,n,m)\, u(i,n,m) && \text{(5.1a)} \\
\text{subject to} \quad & \sum_{i}\sum_{m} X(i,n,m)\, r(i,n,m,d) \le R(n,d) \quad \forall n, d && \text{(5.1b)} \\
& \sum_{n}\sum_{m} X(i,n,m) \le 1 \quad \forall i && \text{(5.1c)} \\
& X(i,n,m) \in \{0,1\} \quad \forall i, n, m && \text{(5.1d)}
\end{align*}

where R(n, d) represents the total amount of resource d available at node n. The goal of our objective function, Equation 5.1a, is to maximize the overall utility of the schedule, which is a summation of the individual job utilities. The resource constraint, Equation 5.1b, ensures that jobs assigned to a node do not exceed its resources, while the assignment constraints, Equations 5.1c and 5.1d, ensure that a job is assigned at most once (one instance on one node). In our current formulation, jobs are treated as atomic units and may not be split across nodes.

It should be noted that the utility for each job can be arbitrarily defined and we don't make any assumptions about its properties. For example, the utility of a job can be based on the node chosen for execution, such as a usage cost, or a performance metric of the job, such as latency. The utility can also be some combination of the two.

5.1.3 Intractability

The scheduling problem we've formulated is at least as difficult as the multidimensional knapsack problem (MKP) [37].
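To make the structure of SOP concrete before turning to its hardness, the following is a minimal brute-force sketch that enumerates all assignments satisfying constraints (5.1b)-(5.1d) and keeps the best one. It is an illustration of the formulation, not the authors' implementation; the dictionary-based data layout and function name are hypothetical, and the search is practical only for very small instances.

    from itertools import product

    def solve_sop_bruteforce(jobs, nodes, u, r, R, D):
        """Exhaustively maximize total utility subject to (5.1b)-(5.1d).

        u[(i, n, m)]: utility of running implementation m of job i on node n
        r[(i, n, m)]: length-D resource demand vector for that choice
        R[n]:         length-D capacity vector of node n
        """
        # Each job may be skipped (None) or placed once on some (node, implementation).
        options = {i: [None] + [(n, m) for (ji, n, m) in u if ji == i] for i in jobs}

        best_value, best_schedule = 0.0, {}
        for choice in product(*(options[i] for i in jobs)):
            used = {n: [0] * D for n in nodes}      # resources consumed per node
            value, feasible, schedule = 0.0, True, {}
            for i, pick in zip(jobs, choice):
                if pick is None:
                    continue                         # constraint (5.1c): at most one placement
                n, m = pick
                demand = r[(i, n, m)]
                if any(used[n][d] + demand[d] > R[n][d] for d in range(D)):
                    feasible = False                 # violates resource constraint (5.1b)
                    break
                for d in range(D):
                    used[n][d] += demand[d]
                value += u[(i, n, m)]
                schedule[i] = (n, m)
            if feasible and value > best_value:
                best_value, best_schedule = value, schedule
        return best_value, best_schedule

The number of candidate schedules grows exponentially with the number of jobs, which is why such exhaustive search is only usable as a small-scale benchmark and which motivates the hardness result and the heuristics that follow.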
A special case of MKP is the maximum independent set problem, which is known to be NP-hard [35, 36]. This is easy to see from the ILP formulations of the problems.

Theorem 1. SOP is NP-hard.

Proof. We seek to show that MKP is polynomial-time reducible to SOP. For the structure of MKP we refer the reader to an explanation by Puchinger et al. [37]. We reduce MKP to a special case of SOP where there is only one node available for scheduling and only one implementation of each job. This special case of SOP, denoted SOP', can be written as:

\begin{align*}
\text{maximize} \quad & \sum_{i} X(i)\, u(i) \\
\text{subject to} \quad & \sum_{i} X(i)\, r(i,d) \le R(d) \quad \forall d \\
& X(i) \in \{0,1\} \quad \forall i
\end{align*}

In MKP terms, X(i) are the decision variables representing the selected items, while u(i) can be thought of as the profits. r(i, d) and R(d) are analogous to the weights and capacity constraints, respectively. The goal of MKP is to choose a set of items that maximizes profit while limiting the total weight to the capacity constraint. Using this mapping, we can generate the optimal solution for MKP, meaning:

MKP ≤_p SOP' ≤_p SOP.

5.2 The Usher Algorithms

Given the difficulty of our scheduling problem, we propose two greedy algorithms to solve it, which we refer to as the simple greedy (SG) and the bundle-based greedy (BG) algorithms.

5.2.1 Simple Greedy Algorithm

The first algorithm we present is the Simple Greedy (SG) algorithm, designed to generate a schedule for our optimization problem very quickly. As shown in Algorithm 3, the algorithm begins by enumerating every possible individual allocation of job implementations to nodes. These allocations are sorted by their associated utility value, highest to lowest. The algorithm then iterates over each allocation one by one and determines if it can be added to the schedule with the remaining resources. A job can only be scheduled once, so when it is assigned, all remaining allocations of that job in the enumerated list are ignored.

Algorithm 3: Simple Greedy (SG) Scheduling Algorithm

    procedure SG(N, J)                        ▷ schedule jobs J onto nodes N
        I ← instances                         ▷ all instances of all jobs
        A ← N × I                             ▷ all possible node assignments
        Sort(A) by utility, highest to lowest
        S ← ∅                                 ▷ initialize the schedule
        while A ≠ ∅ do                        ▷ consider each assignment
            a ← pop(A)
            if S ∪ a ∈ F then                 ▷ if the assignment is feasible,
                S ← S ∪ a                     ▷ add it to the solution
            end if
        end while
        return S
    end procedure

Let Q be the number of jobs being scheduled.

Theorem 2. The SG algorithm has a 1/Q approximation ratio relative to the optimal utility.

Proof. Without loss of generality, assume implementation m_1 of job J_1 running on node n_1 has the best utility value u(1, n_1, m_1) in a given scheduling problem and is not a part of the optimal schedule. By definition we have that:

u(1, n_1, m_1) ≥ u(i, n, m)   ∀ i, n, m

While m_1 is not a part of the optimal schedule, in the best case scenario every other job is a part of it. If there are Q jobs, then the best possible resulting utility not including m_1 is at most Q · u(1, n_1, m_1), consisting of Q - 1 jobs other than J_1 and possibly another implementation of J_1 itself. The worst possible schedule that includes J_1 is J_1 alone, with a utility of u(1, n_1, m_1). The ratio of these values yields our approximation of 1/Q.

Let N be the number of nodes available and M be the maximum number of implementations available for any job.

Theorem 3. The SG algorithm has a time complexity of O(NMQ log[NMQ]).

Proof. The SG algorithm enumerates all possible allocations of job implementations to nodes in linear time, specifically O(NMQ).
The sorting involved has the highest time complexity, which is O(NMQ log[NMQ]).

5.2.2 Bundle-based Greedy Algorithm

We have also developed another greedy heuristic algorithm with a provable performance guarantee, which we refer to as the Bundle-based Greedy (BG) algorithm. We observed that our scheduling problem is analogous to maximizing a submodular set function subject to transversal matroid constraints. Fisher et al. have developed a greedy heuristic for this type of problem with a provable approximation ratio [83]. In this section, we explain our algorithm and provide a proof of its performance guarantee.

We define a schedule S as the set representation of the binary allocation variables X(i, n, m). An instance of J_i is part of S if the corresponding element of X(i, n, m) is 1. The union or intersection of a set of instances with S has the intuitive effect of adding or removing instances from the schedule, by setting the corresponding values of X to 1 or 0, respectively. We define V(S) to represent the value, or total utility, of schedule S; in other words, V(S) = Σ_i Σ_n Σ_m X(i, n, m) u(i, n, m).

Lemma 1. V(S) is nondecreasing and submodular.

Proof. The job utilities u are non-negative functions by definition, and therefore adding an instance to a schedule S can only increase its value or leave it the same. Since job utilities are independent of each other and the total value function is a summation of them, we can apply the commutative and associative properties of addition to show that the following also holds for arbitrary schedules S and T:

V(S) + V(T) ≥ V(S ∪ T) + V(S ∩ T)

Hence, the total utility function V(S) is submodular.

Lemma 2. A valid schedule S is bound by one or more transversal matroid constraints.

Proof. Let N = (N_0, N_1, ...) represent the set of nodes that can be scheduled. We define a bundle b as a subset of the instances of jobs and B as the set of all possible such bundles. Note that a bundle cannot contain more than one instance of the same job. Using N and B, we can create a bipartite graph, where the edges indicate that it is feasible to schedule a bundle on a particular node with respect to a particular resource (i.e., the node has sufficient resources of that type). Since the edges encode our scheduling constraints, we can say that a valid schedule is therefore subject to a matching, or transversal, of this bipartite graph. We can also view this as a transversal matroid [84].

The pseudocode for our algorithm is shown in Algorithm 4. The variable A represents the set of all possible assignments of bundles to nodes. In other words, it is the Cartesian product of the set of nodes and the power set of all instances. For large problems, we can place a practical limitation on A's size to restrict the algorithm to execute in polynomial time without affecting the solution. This works if there is a limit on the maximum number of jobs that can be scheduled on a single node that is smaller than the total number of jobs to be scheduled. So, for instance, if each job is known to consume at least one unit of some resource (e.g., a CPU), then the maximum number of CPUs available on any node can be used to limit the size of A and therefore reduce the complexity of the algorithm from exponential to polynomial.

The schedule generated by our algorithm is stored in S. To generate the schedule, our algorithm loops through every assignment in A, ordered from highest to lowest impact on utility, and makes a greedy decision regarding its inclusion.
If adding assignment a to an existing schedule is feasible, meaning it is a member of F, then a is included, and the process continues until there are no more assignments left to consider.

Algorithm 4: Bundle-based Greedy (BG) Scheduling Algorithm

    procedure BG(N, J)                        ▷ schedule jobs J onto nodes N
        I ← instances                         ▷ all instances of all jobs
        B ← P(I)                              ▷ power set of all instances in J
        A ← N × B                             ▷ all possible bundle assignments
        Sort(A) by utility, highest to lowest
        S ← ∅                                 ▷ initialize the schedule
        while A ≠ ∅ do                        ▷ consider each assignment
            a ← pop(A)
            if S ∪ a ∈ F then                 ▷ if the assignment is feasible,
                S ← S ∪ a                     ▷ add it to the solution
            end if
        end while
        return S
    end procedure

The consideration of F allows us to incorporate all of our constraints as described previously. Bundles that contain multiple instances of the same job are excluded. This prevents redundant work from occurring. Likewise, separate bundles that contain instances of the same job cannot both be part of the same schedule.

Theorem 4. The BG algorithm achieves a 1/(D + 1) approximation ratio, where D is the number of resources considered.

Proof. We have shown that our value function V(S) is non-decreasing and submodular and that the solution to our problem is subject to transversal matroid constraints. Each resource in our problem is independent of the others and therefore requires a separate matroid. Our algorithm applies the Fisher et al. greedy heuristic [83] and therefore has an approximation ratio of 1/(D + 1).

Theorem 5. The BG algorithm has a time complexity of O(2^{MQ} N log[2^{MQ} N]).

Proof. The bundles, or power set of all instances, consist of 2^{MQ} elements. The BG algorithm considers assigning each possible bundle to each node, which takes O(2^{MQ} N) time to enumerate. These possible assignments are sorted by their utility value, leading to an overall complexity of O(2^{MQ} N log[2^{MQ} N]).

5.3 Trace-based Evaluation

Through a variety of trace-based experiments, we evaluate the performance of our algorithms in this section. We start our evaluation of the SG and BG algorithms by systematically varying some of the problem parameters while keeping the utility function the same. This will allow us to observe how the schedules might change with different inputs, giving us better insight into how the algorithms work. Next, we investigate if there are any network configurations that are preferred by the algorithms by varying the correlation of link quality with the resources available on heterogeneous nodes. Afterwards, in a realistic edge computing scenario, we evaluate the ability of the algorithms to adapt their solutions to accommodate the goals of different utility functions. Finally, we compare the runtimes of the two algorithms to the brute-force approach.

5.3.1 Schedule Quality on an Object Detection Dataset

Object detection is a common job performed by vehicles with camera-based Advanced Driver-Assistance Systems (ADAS) [85].

Table 5.1: Performance Data and Resource Requirements of YOLO Variants

    Implementation      CPUs   GPUs   RAM (MB)
    TinyYOLOv3 (CPU)    1      0      197
    TinyYOLOv3 (GPU)    1      1      361
    YOLOv2              1      1      534
    YOLOv3              1      1      610

    Implementation      Accuracy (mAP)   Latency (ms)
    TinyYOLOv3 (CPU)    33.1%            1100
    TinyYOLOv3 (GPU)    33.1%            151
    YOLOv2              48.1%            165
    YOLOv3              60.6%            202

The goal of object detection is to classify and localize objects within an image. We use an object detection workload to evaluate the quality of the schedules generated by our algorithms.
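As a concrete reading of the greedy loop that Algorithms 3 and 4 share, here is a minimal Python sketch of the SG variant, using the same hypothetical dictionary layout as the brute-force sketch in Section 5.1.3. It illustrates the sort-then-pack structure; it is not the authors' implementation.

    def sg_schedule(jobs, nodes, u, r, R, D):
        """Simple Greedy (Algorithm 3): sort every (job, node, implementation)
        assignment by utility and add each one that still fits."""
        assignments = sorted(u.keys(), key=lambda k: u[k], reverse=True)

        remaining = {n: list(R[n]) for n in nodes}     # leftover capacity per node
        scheduled, schedule = set(), {}
        for (i, n, m) in assignments:
            if i in scheduled:
                continue                                # each job is assigned at most once
            demand = r[(i, n, m)]
            if all(demand[d] <= remaining[n][d] for d in range(D)):
                for d in range(D):
                    remaining[n][d] -= demand[d]        # consume the node's resources
                schedule[i] = (n, m)
                scheduled.add(i)
        return schedule

The BG variant differs only in that the sorted list contains (node, bundle) pairs, where a bundle is a set of instances of distinct jobs, so a single greedy step can commit several jobs to one node at once. Driving a sketch like this with the measurements in Table 5.1, encoded as the u and r dictionaries, roughly mirrors the experiments that follow.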
Specifically, we created a job dataset based on YOLO [29], a popular object detection pipeline that utilizes a neural network. While the purpose of each YOLO variant is the same, namely object detection, the underlying neural network or execution is slightly different, yielding various tradeoffs in terms of performance, as can be seen in Table 5.1.

Table 5.1 shows the data we measured for four YOLO variants running on our server. Our server has an Intel Xeon E5-2620 CPU with 64GB of RAM and an Nvidia GeForce 1080 Ti (12GB) GPU. For our experiments, we create synthetic nodes based on these measurements. Unless otherwise stated, each node has 4 CPUs, 2 GPUs, and 4GB RAM.

5.3.1.1 Performance vs. number of GPUs per node

For this scenario, we scheduled one to five YOLO jobs on various sized sets of homogeneous nodes using our algorithms. Homogeneous in this case means that all of the nodes have the same hardware configuration, i.e., the same amount of CPUs, GPUs, and RAM. Figure 5.2 shows the latency-based utility results for when each node has 4 CPUs and 4 GB of RAM. Each row of the figure represents a different total number of nodes available for scheduling, while each column represents a different number of jobs to be scheduled. The number of GPUs per node is varied on the x-axis of each subplot. The exhaustive search, or brute force, algorithm is also presented as a benchmark.

Our latency-based utility function takes the negative of the latency (smaller is better) and shifts it to ensure that our utility is always non-negative, as we require. We increase the shift by +1 so that a job that isn't scheduled has the unique utility value of 0. This ensures that scheduling a job is always preferred over not scheduling it.

The plots presented in Figure 5.2 demonstrate how the algorithms perform relative to the number of GPUs, a critical resource for realizing the best potential of our object detection workload. GPUs also happen to be the most restrictive resource constraint. We believe this single dominating resource constraint is what enables a simple heuristic such as the SG algorithm to perform so well.

[Figure 5.2: Utility (latency-based) vs. the number of GPUs per node for various numbers of nodes (1, 3, 5) and jobs (1, 3, 5), comparing the Bundle, Simple, and Exhaustive algorithms.]

We can see that when the number of jobs exceeds the number of GPUs per node, the BG algorithm tends to perform worse than the SG algorithm. If we look at the first column of Figure 5.2, where only a single job is being scheduled, we can see that for any number of nodes the utility peaks early, at one GPU per node, and remains the same afterwards, as there are no jobs to utilize the additional hardware in this case. However, if we look at the column corresponding to five jobs, we can see that the BG algorithm consistently gets better as the number of GPUs per node is increased. The BG algorithm has a tendency to consolidate as many jobs as possible on each node it uses, as the algorithm is focused on creating bundles. Since having multiple jobs in a bundle is always more valuable than having only a single job on a node, the BG algorithm schedules more CPU implementations as opposed to taking advantage of the GPUs available on other nodes as the SG algorithm does. Another way of viewing this is that the BG algorithm has a tendency to use as few nodes as possible whereas the SG algorithm has a tendency to spread out.
While this spreading tendency may seem trivial for centralized scheduling as our system is currently designed, it could prove valuable for a distributed scheduler which has no global knowledge of the nodes being used, as the utilities in our formulation are only considered on a job-by-job basis.

To investigate this tendency further, for some of the scenarios we plot the number of nodes used by each algorithm versus the number of GPUs available per node in Figure 5.3. In this figure, we can see that as the number of jobs increases across the columns, the SG algorithm has a tendency to spread itself out across multiple nodes if there aren't sufficient GPU resources available on a single node.

[Figure 5.3: Number of nodes used vs. the number of GPUs available per node.]

If there are no GPUs, or when there are too many GPUs on a single node, then for a sufficiently small number of jobs the SG algorithm has no incentive to utilize multiple nodes, leading to the "peak" or "mountain" shape we observe in Figure 5.3. For 1 and 3 jobs, the BG algorithm can get away with squeezing all jobs onto a single node. However, at 5 jobs the BG algorithm is forced to include a second node.

5.3.1.2 Performance vs. total number of nodes

In Figure 5.4a, we take another look at the behavior of our algorithms with regards to nodes. This figure shows a plot of the utility as a function of the number of nodes available for scheduling. Here we can see that as the number of nodes increases, the SG algorithm leverages them the most after three nodes. At three nodes, the BG algorithm has scheduled four jobs on the first node, four jobs on the second node, and two jobs on the third node. When a fourth node is added, the BG algorithm continues to use the third node, assigning job instances that only require a CPU, instead of utilizing the available GPUs on the fourth node.

[Figure 5.4: Change in utility as (a) the number of nodes is varied and (b) the number of jobs is varied.]

5.3.1.3 Performance vs. total number of jobs

Figure 5.4b shows the utility when 1-10 jobs are scheduled on five nodes by the SG and BG algorithms. The algorithms perform the same for the one and two job cases, as both algorithms make use of the GPUs of a single node. After this point, however, the algorithms diverge. With regards to utility, the SG algorithm continues to increase linearly because it immediately allocates more nodes as the number of jobs increases so that the available GPUs can be used. In contrast, from 2-4 jobs, the BG algorithm continues to allocate jobs to the first node using the jobs' CPU implementations. The BG algorithm doesn't begin to use an additional node until no more jobs can fit on the previous one.

While it is not easily discernible in Figure 5.4b, the utility is actually always increasing for the BG algorithm. The flatter portions of the BG algorithm's curve are due to the significant performance difference, in terms of latency, between the CPU and GPU implementations of YOLO. Consequently, this makes it easier to observe what is happening. During these ranges, specifically during 2-4 jobs and 6-8 jobs, the BG algorithm loses out since it is effectively prioritizing increasing the size of a bundle over using all of the GPU hardware available in the system. At 5 jobs and 9 jobs, the BG algorithm has begun to use the GPUs of a new node.

5.3.2 Heterogeneous Links on a Random Dataset

We investigated how heterogeneous links with heterogeneous nodes would impact the performance of our algorithms.
Figure 5.5 depicts the setup of our system for this experiment. Logically connected to the host system are five nodes, N_0 - N_4, each with varying amounts of three types of resources as shown. Five jobs were randomly generated with resource costs anywhere in the range of 1-10 for each resource type. The utilities are based on latency, and the latency for each implementation was randomly generated from 0-100. We ran three experiments where the links were uncorrelated, positively correlated, and negatively correlated with the capabilities of the nodes they are connected to. Specifically, for the uncorrelated case, links A-E had 0 added latency; for the positively correlated case, links A-E had 50, 100, 150, 200, and 250 ms of added latency, respectively; and for the negatively correlated case, links A-E had 250, 200, 150, 100, and 50 ms of added latency, respectively.

[Figure 5.5: Network Topology. The host is connected to nodes N_0 through N_4 over links A through E; the nodes have resource vectors (10, 10, 10), (9, 9, 9), (8, 8, 8), (7, 7, 7), and (6, 6, 6), respectively.]

The results of the three scenarios are shown in Figure 5.6. For the uncorrelated case, the BG algorithm achieved optimal or close to it as the number of jobs varied. The SG algorithm did not fare so well, as its naive behavior causes it to lose out on being able to schedule one of the jobs, which occurred when there were 3 or 4 jobs to be scheduled.

In the positively correlated case, all algorithms achieved a lower utility, as expected due to the additional latency added to the system. Above 3 jobs, the BG algorithm started to deviate significantly from the optimal but still outperformed the SG algorithm. There happened to be a high utility job implementation that could only run on one node based on the resource requirements, but the BG algorithm prioritized a bundle of two other jobs instead, even though that bundle could have been assigned to another node. This reveals a limitation of the BG algorithm: it doesn't make considerations for using the least capable node that can satisfy a bundle in order to keep the most options open for future assignments.

[Figure 5.6: Utility vs. the number of jobs from our random dataset, for the uncorrelated, positively correlated, and negatively correlated cases, comparing the Bundle, Simple, and Exhaustive algorithms.]

Finally, in the negatively correlated case, the SG algorithm was the one to remain at or close to optimal for all different numbers of jobs. The BG algorithm made a non-optimal schedule at four jobs but quickly recovers at five jobs. In the four job case, the BG algorithm chose to pair up two jobs on a node with a relatively high latency link as opposed to splitting up the two jobs across two nodes with lower latency links. In this situation, the tendency of the BG algorithm to pack as many jobs as possible onto each node caused it to incur a high latency penalty due to the bad link.

Table 5.2: Edge and Cloud Node Details

    Name     Link       CPU (threads)   GPU   RAM (GB)
    Local    -          8               1     8
    Edge 1   Wi-Fi      16              2     32
    Edge 2   Cellular   16              2     32
    Cloud    Cellular   32              4     244

    Name     Xfer Time (ms)   CPU/GPU TDP (W)   Price ($/hr)
    Local    0                84/75             0.00
    Edge 1   14.5             85/250            1.53
    Edge 2   38.3             85/250            1.53
    Cloud    52.2             145/300           3.06

5.3.3 Multi-Objective Utility in Edge Computing Environment

The Usher algorithms allow the utility functions of the jobs to be specified by the user, so that the user may indicate what type of schedule is most preferred.
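As an illustration of such a per-job, user-specified utility, the sketch below implements the weighted, normalized form used in this section (Equation 5.2 below). The dictionary layout is hypothetical, and normalizing the energy and price terms in the same way as latency is an assumption made here for symmetry; only the accuracy and latency forms are given explicitly in the text.

    def weighted_utility(metrics, weights, maxima):
        """Multi-objective utility in the spirit of Equation 5.2: a weighted sum of
        normalized accuracy, latency, energy, and price terms.

        metrics: raw values for one (job, node, implementation) choice
        maxima:  per-job maxima (assumed positive) used for normalization
        weights: non-negative weights for 'acc', 'lat', 'enr', 'pri' summing to 1
        """
        u_acc = metrics["acc"] / maxima["acc"]           # higher accuracy is better
        u_lat = 1.0 - metrics["lat"] / maxima["lat"]     # lower latency is better
        u_enr = 1.0 - metrics["enr"] / maxima["enr"]     # lower energy is better (assumed form)
        u_pri = 1.0 - metrics["pri"] / maxima["pri"]     # lower price is better (assumed form)
        return (weights["acc"] * u_acc + weights["lat"] * u_lat
                + weights["enr"] * u_enr + weights["pri"] * u_pri)

    # Example: a purely latency-focused job, as in the third case of Figure 5.8.
    latency_only = {"acc": 0.0, "lat": 1.0, "enr": 0.0, "pri": 0.0}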
In this section, we investigate the capability of the algorithms to navigate tradeoffs in the edge/cloud computing environment we envision for vehicular applications. In particular, our utility functions consist of multiple objectives, specifically maximizing accuracy while reducing latency, energy, and price.

5.3.3.1 Setup

Our setup consists of four nodes as listed in Table 5.2, namely the local (or host) device, two edge servers, and a cloud server hosted by a provider such as Amazon's Elastic Compute Cloud (EC2) [86]. We profiled the YOLO pipelines on these devices for our simulation. The local device is represented by a workstation consisting of an Intel Core i7-4770 CPU, an Nvidia GeForce GTX 1050 Ti GPU, and 16 GB of RAM. The edge servers are represented by a server with an Intel Xeon E5-2620 v4 CPU, an Nvidia GeForce GTX 1080 Ti GPU, and 64 GB of RAM. The edge servers have similar hardware but use a different link type. Finally, the cloud server's hardware is based on the EC2 p3.2xlarge instance type, consisting of an Intel Xeon E5-2686 v4 CPU, an Nvidia Tesla V100 GPU, and 244 GB of RAM.

We assume for each job that there is 100 kB of data uploaded to the node for processing, which includes a JPEG image file. Also, 1 kB of application-specific detection data is sent back to the host. We ran a measurement study to collect this information regarding network latency. The transfer times for this data are shown in the "Xfer Time" column of Table 5.2.

Since CPUs and GPUs represent the largest consumers of power in a typical computer, the thermal design power (TDP) of the CPUs and GPUs is used to estimate the energy costs of a job. As an example of using the cloud, a CPU-only implementation of a job would have a total TDP of 145 W, whereas a GPU-based implementation would have a total TDP of 445 W. The duration of a job is multiplied by the TDP of any CPUs and GPUs it uses as a rough measure of the energy consumed during its execution.

We base the price of the cloud server on EC2's p3.2xlarge instance type. We made the assumption that edge servers would be operated by a third party and follow a similar pricing model to cloud servers. We estimate the cost of the edge servers to be half that of the cloud server.

There are five jobs in total to be scheduled. We define the utility function for each job as follows:

\begin{align*}
u(i,n,m) = {}& \alpha_{\mathrm{acc}}\, u_{\mathrm{acc}}(i,n,m) + \alpha_{\mathrm{lat}}\, u_{\mathrm{lat}}(i,n,m) \\
& + \alpha_{\mathrm{enr}}\, u_{\mathrm{enr}}(i,n,m) + \alpha_{\mathrm{pri}}\, u_{\mathrm{pri}}(i,n,m), && \text{(5.2)} \\
\text{where} \quad & \alpha_{\mathrm{acc}} + \alpha_{\mathrm{lat}} + \alpha_{\mathrm{enr}} + \alpha_{\mathrm{pri}} = 1.
\end{align*}

The subscripts 'acc', 'lat', 'enr' and 'pri' stand for accuracy, latency, energy and price, respectively, and the weights α indicate the relative priority given to each metric. Energy usage here refers to system-wide consumption. While the goal for accuracy is to maximize, the goal for the other three metrics is to minimize. Equations 5.3a and 5.3b provide examples for how the utilities for accuracy and latency are calculated, respectively.

\[
u_{\mathrm{acc}}(i,n,m) = \mathrm{acc}(i,n,m) / \mathrm{acc}_{\max}(i) \qquad \text{(5.3a)}
\]
\[
u_{\mathrm{lat}}(i,n,m) = 1 - \mathrm{lat}(i,n,m) / \mathrm{lat}_{\max}(i) \qquad \text{(5.3b)}
\]

5.3.3.2 Results

Figure 5.7 shows the tradeoffs made by the Usher algorithms when the utility function focuses solely on one of the four metrics, namely accuracy, latency, energy, or price, as indicated in the legend.

[Figure 5.7: Tradeoffs between accuracy, latency, energy and price obtained by four utility functions with different priorities.]

This demonstrates that our algorithms can
As we intended, our algorithms have the ability to make tradeos by varying where jobs are executed and which implementations are used. We show the generated schedules in more detail in Figure 5.8. For the rst case, in Figure 5.8a, the utility function is focused on minimizing the price of execution. The cheapest node is the local host ($0) and it can be seen that both algorithms assigned all jobs to it. However, in the second case where the priority is energy consumption, both algorithms chose to ooad as shown in Figure 5.8b. The gure shows that energy reduction is achieved by using the fastest pipeline, i.e. the GPU implementation of TinyYOLOv3, leading to the use of resources for a shorter period of time. While the BG algorithm used TinyYOLOv3 (GPU) for most of the jobs, its tendency to use as much of a single node as possible forced it to schedule TinyYOLOv3 (CPU) for one of the jobs. The third case, shown in Figure 5.8c, focuses on minimizing latency. Again, TinyYOLOv3 (GPU) is the implementation of choice due to its fast execution time. In this case, the SG algorithm makes use of the local node, but since that node only has a single GPU, it chooses to ooad the rest of the jobs to access more GPU resources. This proves to be more valuable than the incurred network delay of ooading. Once again the BG algorithm prefers the cloud due to its abundance of resources. In the fourth and nal case, shown in Figure 5.8d, the priority is accuracy. We can see that both the SG and BG algorithms make heavy use of YOLOv3 due to its high accuracy. 120 From this experiment, it is clear that both the SG and BG algorithms are capa- ble of accommodating various priorities through the user-specied utility function. However. the SG algorithm is better capable of leveraging resources across various nodes while the BG algorithm prefers to consolidate as much as possible. 5.3.4 Runtime Performance Figure 5.9 shows the runtime of each algorithm, including exhaustive search, as the number of jobs increases. As we expected, the SG algorithm has the best performance in terms of time, with BG placing second. The SG algorithm scales very well as the number of jobs increases. The exhaustive search algorithm takes over a minute at only ve jobs so it is not a very practical algorithm to use for the types of applications we are considering. 5.4 Related Works 5.4.1 Mobile Edge Computing In the area of mobile devices, edge computing has emerged as a viable solution to tackle a variety of issues that have developed with regards to the use of cloud services [87]. These issues include latency, network congestion, energy consumption, privacy, and more. For example, the long propagation delay between a mobile device and a cloud data center can adversely impact the performance of latency-sensitive 121 (a) Lowest price ( pri = 1) (b) Lowest energy ( enr = 1) (c) Lowest latency ( lat = 1) (d) Highest accuracy ( acc = 1) Figure 5.8: Schedules resulting from dierent utility functions 122 1 2 3 4 5 Jobs 10 3 10 2 10 1 10 0 10 1 10 2 Run-time (s) Bundle Simple Exhaustive Figure 5.9: Runtime of Usher algorithms with ve nodes. applications [13]. In addition, mobile devices are equipped with a variety of sensors that can generate large amounts of data, raising questions about where and how this data can be processed eciently (in terms of time and energy) and securely. 
Prior work in this area includes offloading schemes such as MAUI [9], Odessa [8], CloneCloud [10] and ThinkAir [88] for applications that seek to reduce energy consumption and/or improve latency. However, these schemes are inherently limited to two devices, a mobile device and a cloud-hosted server. In our formulation of Usher, we instead consider that there can be an arbitrary number of edge computing resources to assist with task execution.

DARE [89] is an offloading scheme designed for augmented reality applications. While DARE only considers two devices, a mobile device and an edge server, the system incorporates the ability to change the nature of the supported task in order to navigate the tradeoff between latency and accuracy in its application. This allows the system to maintain reliable performance despite adversely changing network and server conditions. While this capability is similar to that supported by our Usher algorithms, we take a more generic approach by allowing the user to specify what variations of a task can be used and when, based on a custom user-defined utility function. This per-job utility function enables the user to specify what optimization goal should be prioritized. This could include common metrics such as latency, accuracy, energy consumption, and cost (for network usage, computation time, etc.).

5.4.2 Task Scheduling

There are a variety of scheduling algorithms for tasks represented as task graphs or directed acyclic graphs (DAGs) [90]. Such algorithms include Hermes [11] and HEFT [81], which focus on minimizing latency. Unlike our Usher algorithms, Hermes and HEFT focus on scheduling the individual components of a single application across a network of devices for execution subject to dependency constraints. In contrast, the Usher algorithms are designed to schedule independent jobs simultaneously to find the best fit. Our algorithms treat each job as an atomic unit that can't be split and must be executed on a single device. The Usher algorithms can also support an arbitrary number of resource constraints. In addition, any metric can be used as a utility, as opposed to just latency in the case of Hermes and HEFT. In fact, our formulation allows each job to have independently defined utility functions.

5.4.3 Knapsack Problems

The scheduling problem we tackle has a strong resemblance to multidimensional packing problems [36] in that we are trying to assign or pack jobs onto nodes subject to various constraints, similar to filling a knapsack with items. In particular, our problem sits at the intersection of three variations of the knapsack problem, referred to as the multiple-choice, multidimensional, and multiple knapsack problems [34]. In the multiple-choice knapsack problem, there is a set of classes, from each of which exactly one item must be taken. The multidimensional variation refers to the requirement that each item must satisfy multiple constraints in order to fit, and the multiple variation refers to the number of knapsacks available to choose from. If the classes are thought of as different jobs and the items are their different implementations, then there is almost a direct mapping between the knapsack problem and the problem we address in this work. However, there is a difference in how the quality of a solution is measured. The knapsack problem deals with items that have a fixed value, whereas the problem we address in this work allows the jobs to have values that are a function of where they are executed.
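To spell out the correspondence sketched above, the pieces of the formulation from Section 5.1.2 line up with the knapsack terminology roughly as follows (a restatement for clarity, not an additional result):

\begin{align*}
\text{classes} &\longleftrightarrow \text{jobs } J_i \\
\text{items within a class} &\longleftrightarrow \text{implementations } m \text{ of } J_i \\
\text{knapsacks} &\longleftrightarrow \text{nodes } n \\
\text{item weights} &\longleftrightarrow r(i,n,m,d), \quad d = 1,\dots,D \\
\text{knapsack capacities} &\longleftrightarrow R(n,d) \\
\text{item profits} &\longleftrightarrow u(i,n,m), \text{ which depends on the node and implementation chosen}
\end{align*}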
5.5 Conclusion

We have presented Usher, a practical utility-based scheduling framework for polymorphic applications, and two heuristic-based scheduling algorithms, namely the Simple Greedy (SG) and Bundle-based Greedy (BG) algorithms. Approximation ratios are derived for both algorithms. During our evaluation, we observed that the SG algorithm performs very well for the types of vehicular applications that we are envisioning. Although we show that the underlying scheduling optimization problem is NP-hard, the SG algorithm performs optimally for our YOLO-based object detection dataset. The SG algorithm also runs in polynomial time, so it scales very well with the number of jobs that need to be scheduled.

We believe that our work on Usher is valuable even if the framework itself is not used. The results we obtained help give some insights into how schedules are affected based on the different priorities a developer may have for their applications. The results can also help companies decide how to best allocate their time and resources in building out such systems.

Although we have only presented two heuristic algorithms, the framework we have developed can accommodate any algorithm that follows the same model as Usher. This includes a more generalized form of the SG and BG algorithms, one that parameterizes the maximum number of jobs in each bundle. For example, the SG algorithm is equivalent to scheduling bundles with a maximum size of one job, while the BG algorithm has no effective limit. Optimizing the size of a bundle could be explored in future work.

Chapter 6

Conclusions

With the popularity of wireless IoT devices and the growing presence of connected vehicle systems, there is an abundance of applications that will benefit from or be enabled by the ability to efficiently leverage the computational resources available on a network. Such applications include those that allow users to provide input using their voice or gestures, and those that seek to perceive the environment using cameras. Due to the interactive nature of these applications, latency and throughput are significant constraints that must be satisfied for the application to be usable. In this thesis, we have presented a variety of frameworks and algorithms that help to satisfy the performance requirements of such demanding applications.

We began by presenting Noctua, a macroprogramming framework that implements a modification to the traditional publish-subscribe messaging paradigm that is popular in IoT systems. We refer to this new design as publish-process-subscribe. We presented some real-system implementations to demonstrate how Noctua could benefit its users. Noctua allows applications to subscribe to topics representing the data that users are interested in. The data can be the raw information itself or some transformation, which will automatically be taken care of by the publish-subscribe broker. This abstraction is effective for the application developer because it removes the burden of explicitly offloading and optimizing computation across available resources. We demonstrated the benefits Noctua could provide for a variety of scenarios. Noctua is currently implemented as a centralized broker, which can make it a bottleneck as there are limitations on bandwidth and computational resources. This centralization also means that Noctua has a single point of failure. For future work, we believe that a distributed version of Noctua could be developed to make it more scalable and resilient to failure.
Algorithms are also needed to efficiently disseminate messages throughout the system and to schedule the macroprocessing workloads.

Next we introduced VESPER, a real-time processing framework and online scheduling algorithm for image processing applications. In recognition that a reliable system can sometimes be more valuable than an accurate one, VESPER has the ability to trade off between accuracy and latency depending on system performance. VESPER achieves this by adapting the workload in an effort to maintain the latency and throughput requirements, a feature we refer to as polymorphic computing. Polymorphic computing is a technique that lends itself well to CNNs, as the size and depth of a CNN can be adapted to provide a variety of pipelines to choose from with the necessary tradeoffs in accuracy and speed. VESPER is able to quickly leverage edge resources as they become available, and we have demonstrated that it outperforms offloading schemes based on static workloads.

A potential future research direction for VESPER is support for multiple applications and multiple users. Our current work has only considered a single user executing a single streaming application. However, users could compete with their own applications if they are running more than one. This raises a question of priority, which we began exploring in our work on Usher. In addition, a real system is likely to be shared among multiple users, and VESPER may benefit if it considered such dynamics directly.

Finally, we presented Usher, our more general utility-based scheduling framework for polymorphic applications. While VESPER focuses on scheduling a single application, Usher is designed to schedule multiple applications simultaneously, and it prioritizes them based on user-defined utility functions. So, for example, one application may care more about latency while another cares more about cost. Usher accounts for vectors of resources available at each edge device and supports multiple implementations of each job. We prove that the schedule optimization problem for the Usher framework is NP-hard and present two heuristic-based algorithms, a simple greedy algorithm and a more complex bundle-based greedy algorithm. Despite its simplicity, we demonstrate that the simple greedy algorithm performs well for our YOLO-based object detection task dataset.

One of the advantages of the Usher framework is that it is not limited to the two heuristic algorithms we have presented. It allows for the development of additional algorithms as long as the same model is followed. On that note, we believe it would be worthwhile to explore a more generalized form of the SG and BG algorithms. It is also possible that better algorithms could be developed, at least for certain classes of applications. Another future direction for the Usher framework is to consider applications that may have some uncertainty in their resource requirements or utility value. For example, an application's runtime may vary due to some factor external to Usher.

In summary, we have developed and demonstrated the benefits of polymorphic computing. We believe that by designing computationally intensive applications to support different tradeoffs along relevant metrics such as latency and accuracy, these applications can be reliably supported in wireless IoT and connected vehicle systems using an appropriate framework.
Given the current trends in cloud and edge computing, we believe this body of work will prove valuable in the design of future applications and can help companies decide how to best invest their resources.

Bibliography

[1] Wuyang Zhang, Jiachen Chen, Yanyong Zhang, and Dipankar Raychaudhuri. Towards efficient edge cloud augmentation for virtual reality MMOGs. In Proceedings of the Second ACM/IEEE Symposium on Edge Computing, page 8. ACM, 2017.

[2] Mahadev Satyanarayanan, Victor Bahl, Ramón Cáceres, and Nigel Davies. The case for VM-based cloudlets in mobile computing. IEEE Pervasive Computing, 2009.

[3] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7310-7311, 2017.

[4] Gerard O'Regan. A brief history of computing. Springer Science & Business Media, 2008.

[5] Andrew S Tanenbaum and Maarten Van Steen. Distributed Systems: Principles and paradigms. Prentice-Hall, 2007.

[6] Rajesh Balan, Jason Flinn, Mahadev Satyanarayanan, Shafeeq Sinnamohideen, and Hen-I Yang. The case for cyber foraging. In Proceedings of the 10th workshop on ACM SIGOPS European workshop, pages 87-92. ACM, 2002.

[7] Padmanabhan S Pillai, Lily B Mummert, Steven W Schlosser, Rahul Sukthankar, and Casey J Helfrich. Slipstream: Scalable Low-Latency Interactive Perception on Streaming Data. In Proceedings of the 18th International Workshop on Network and Operating Systems Support for Digital Audio and Video, pages 43-48. ACM, 2009.

[8] Moo-Ryong Ra, Anmol Sheth, Lily Mummert, Padmanabhan Pillai, David Wetherall, and Ramesh Govindan. Odessa: enabling interactive perception applications on mobile devices. In Proceedings of the 9th international conference on Mobile systems, applications, and services, pages 43-56. ACM, 2011.

[9] Eduardo Cuervo, Aruna Balasubramanian, Dae-ki Cho, Alec Wolman, Stefan Saroiu, Ranveer Chandra, and Paramvir Bahl. MAUI: making smartphones last longer with code offload. In Proceedings of the 8th international conference on Mobile systems, applications, and services, pages 49-62. ACM, 2010.

[10] Byung-Gon Chun, Sunghwan Ihm, Petros Maniatis, Mayur Naik, and Ashwin Patti. CloneCloud: Elastic execution between mobile device and cloud. In Proceedings of the sixth conference on Computer systems, pages 301-314. ACM, 2011.

[11] Yi-Hsuan Kao, Bhaskar Krishnamachari, Moo-Ryong Ra, and Fan Bai. Hermes: Latency optimal task assignment for resource-constrained mobile computing. IEEE Transactions on Mobile Computing, 16(11):3056-3069, 2017.

[12] Cong Shi, Vasileios Lakafosis, Mostafa H Ammar, and Ellen W Zegura. Serendipity: enabling remote computing among intermittently connected mobile devices. In Proceedings of the thirteenth ACM international symposium on Mobile Ad Hoc Networking and Computing, pages 145-154. ACM, 2012.

[13] Tiffany Yu-Han Chen, Lenin Ravindranath, Shuo Deng, Paramvir Bahl, and Hari Balakrishnan. Glimpse: Continuous, real-time object recognition on mobile devices. In Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems, pages 155-168. ACM, 2015.

[14] Bayya Yegnanarayana. Artificial neural networks. PHI Learning Pvt. Ltd., 2009.

[15] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[16] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104-3112, 2014.

[17] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807-814, 2010.

[18] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211-252, 2015.

[19] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.

[20] D. Steinkraus, I. Buck, and P. Y. Simard. Using GPUs for machine learning algorithms. In Eighth International Conference on Document Analysis and Recognition (ICDAR'05), pages 1115-1120 Vol. 2, Aug 2005.

[21] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 16), pages 265-283, 2016.

[22] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248-255. IEEE, 2009.

[23] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The PASCAL Visual Object Classes (VOC) challenge. International journal of computer vision, 88(2):303-338, 2010.

[24] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097-1105. Curran Associates, Inc., 2012.

[25] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

[26] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.

[27] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE international conference on computer vision, pages 1440-1448, 2015.

[28] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91-99, 2015.

[29] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779-788, 2016.

[30] Joseph Redmon and Ali Farhadi. YOLO9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7263-7271, 2017.

[31] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.

[32] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications.
CoRR, abs/1704.04861, 2017. [33] Yiping Kang, Johann Hauswald, Cao Gao, Austin Rovinski, Trevor Mudge, Jason Mars, and Lingjia Tang. Neurosurgeon: Collaborative Intelligence Be- tween the Cloud and Mobile Edge. In Proceedings of the Twenty-Second In- ternational Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '17, pages 615{629, New York, NY, USA, 2017. ACM. [34] Hans Kellerer, Ulrich Pferschy, and David Pisinger. Knapsack Problems. Springer, Berlin, 2004. [35] Eva Tardos and Jon Kleinberg. Algorithm Design, 2006. [36] Chandra Chekuri and Sanjeev Khanna. On multidimensional packing prob- lems. SIAM journal on computing, 33(4):837{851, 2004. [37] Jakob Puchinger, G unther R Raidl, and Ulrich Pferschy. The multidimen- sional knapsack problem: Structure and algorithms. INFORMS Journal on Computing, 22(2):250{265, 2010. [38] Patrick Th Eugster, Pascal A Felber, Rachid Guerraoui, and Anne-Marie Kermarrec. The many faces of publish/subscribe. ACM computing surveys (CSUR), 35(2):114{131, 2003. [39] Daniele Alessandrelli, Matteo Petraccay, and Paolo Pagano. T-Res: Enabling recongurable in-network processing in IoT-based WSNs. In Distributed Com- puting in Sensor Systems (DCOSS), 2013 IEEE International Conference on, pages 337{344. IEEE, 2013. [40] Zach Shelby, Klaus Hartke, and Carsten Bormann. The constrained applica- tion protocol (CoAP). IETF, 2014. [41] Andrea Azzara, Daniele Alessandrelli, Stefano Bocchino, Matteo Petracca, and Paolo Pagano. PyoT, a macroprogramming framework for the Internet of Things. In Industrial Embedded Systems (SIES), 2014 9th IEEE International Symposium on, pages 96{103. IEEE, 2014. 134 [42] Georey Mainland, Matt Welsh, and Greg Morrisett. Flask: A language for data-driven sensor network programs. Harvard Univ., Cambridge, MA, Tech. Rep. TR-13-06, 2006. [43] PubNub. PubNub Functions for Serverless Compute. https://www.pubnub. com/products/functions/, Last accessed on 2019-07-31. [44] OpenJS Foundation. Node-RED. https://nodered.org/, Last accessed on 2019-07-31. [45] Apache Software Foundation. Apache Kafka. https://kafka.apache.org/, Last accessed on 2019-07-31. [46] iMatix. ZeroMQ. https://zeromq.org/, Last accessed on 2019-07-31. [47] E. Rescorla. The Transport Layer Security (TLS) Protocol Version 1.3. RFC 8446, Internet Engineering Task Force (IETF), August 2018. [48] A. Freier, P. Karlton, and P. Kocher. The Secure Sockets Layer (SSL) Protocol Version 3.0. RFC 6101, Internet Engineering Task Force (IETF), August 2011. [49] OpenIAM. OpenIAM. https://www.openiam.com/, Last accessed on 2019- 07-31. [50] Paul Fremantle, Benjamin Aziz, Jacek Kopeck y, and Philip Scott. Federated identity and access management for the Internet of Things. In Secure Internet of Things (SIoT), 2014 International Workshop on, pages 10{17. IEEE, 2014. [51] David Ferraiolo, D Richard Kuhn, and Ramaswamy Chandramouli. Role-based access control. Artech House, 2003. [52] Andr as Belokosztolszki, David M Eyers, Peter R Pietzuch, Jean Bacon, and Ken Moody. Role-based access control for publish/subscribe middleware ar- chitectures. In Proceedings of the 2nd international workshop on Distributed event-based systems, pages 1{8. ACM, 2003. [53] Jean Bacon, David M Eyers, Jatinder Singh, and Peter R Pietzuch. Access con- trol in publish/subscribe systems. In Proceedings of the second international conference on Distributed event-based systems, pages 23{34. ACM, 2008. [54] Kwame-Lante Wright, Bhaskar Krishnamachari, and Fan Bai. 
Noctua: A Publish-Process-Subscribe System for IoT. Technical Report ANRG-2019-01, USC ANRG, August 2019. [55] Asad Awan, Suresh Jagannathan, and Ananth Grama. Macroprogramming Heterogeneous Sensor Networks Using COSMOS. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, Eu- roSys '07, pages 159{172, New York, NY, USA, 2007. ACM. 135 [56] OASIS. MQTT Version 3.1.1. OASIS Standard, 2014. [57] Urs Hunkeler, Hong Linh Truong, and Andy Stanford-Clark. MQTT-S - A publish/subscribe protocol for Wireless Sensor Networks. In 2008 3rd Inter- national Conference on Communication Systems Software and Middleware and Workshops (COMSWARE'08), pages 791{798. IEEE, 2008. [58] Stefan Tilkov and Steve Vinoski. Node. js: Using javascript to build high- performance network programs. IEEE Internet Computing, 14(6):80{83, 2010. [59] MongoDB. MongoDB. https://www.mongodb.com/, Last accessed on 2019- 07-31. [60] Tommi Mikkonen and Antero Taivalsaari. Using JavaScript as a real program- ming language. Technical report, Sun Microsystems, Inc., Mountain View, CA, USA, 2007. [61] H. M. Kienle. It's About Time to Take JavaScript (More) Seriously. IEEE Software, 27(3):60{62, May 2010. [62] Douglas Crockford. JavaScript: The Good Parts. O'Reilly Media, Inc., 2008. [63] T. F. Bissyand, F. Thung, D. Lo, L. Jiang, and L. Rveillre. Popularity, in- teroperability, and impact of programming languages in 100,000 open source projects. In 2013 IEEE 37th Annual Computer Software and Applications Conference, pages 303{312, July 2013. [64] Center for Cyber-Physical Systems and the Internet of Things. Testbeds. https://cci.usc.edu/index.php/research/testbeds/, Last accessed on 2019-07-31. [65] Raspberry Pi Foundation. Raspberry Pi. https://www.raspberrypi.org/ products/, Last accessed on 2019-07-31. [66] Seeed Studio. Grove. https://www.seeedstudio.com/category/ Grove-c-1003.html, Last accessed on 2019-07-31. [67] Andreas F Molisch. Wireless Communications, volume 34. John Wiley & Sons, 2012. [68] Neal Patwari, Robert J O'Dea, and Yanwei Wang. Relative location in wireless networks. In Vehicular Technology Conference, 2001. VTC 2001 Spring. IEEE VTS 53rd, volume 2, pages 1149{1153. IEEE, 2001. [69] Kwame-Lante Wright, Pranav Sakulkar, Bhaskar Krishnamachari, and Fan Bai. VESPER: A Real-time Processing Framework for Vehicle Perception Augmentation. In Proceedings of the Third Workshop on Integrating Edge 136 Computing, Caching, and Ooading in Next Generation Networks (IECCO). IEEE, 2019. [70] National Highway Trac Safety Administration (NHTSA). U.S. DOT Advances Deployment of Connected Vehicle Technology to Prevent Hun- dreds of Thousands of Crashes. https://one.nhtsa.gov/About-NHTSA/ Press-Releases/nhtsa_v2v_proposed_rule_12132016, Last accessed on 2017-07-31. [71] Hang Qiu, Fawad Ahmad, Ramesh Govindan, Marco Gruteser, Fan Bai, and Gorkem Kar. Augmented Vehicular Reality: Enabling Extended Vision for Future Vehicles. In Proceedings of the 18th International Workshop on Mobile Computing Systems and Applications, pages 67{72. ACM, 2017. [72] Yihang Zhang and Petros A Ioannou. Combined Variable Speed Limit and Lane Change Control for Truck-Dominant Highway Segment. In 2015 IEEE 18th International Conference on Intelligent Transportation Systems (ITSC), pages 1163{1168. IEEE, 2015. [73] Mark Everingham, SM Ali Eslami, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The Pascal Visual Object Classes Chal- lenge: A Retrospective. 
International Journal of Computer Vision, 111(1):98{ 136, 2015. [74] Ravi Netravali, Anirudh Sivaraman, Keith Winstein, Somak Das, Ameesh Goyal, and Hari Balakrishnan. Mahimahi: A Lightweight Toolkit for Re- producible Web Measurement. ACM SIGCOMM Computer Communication Review, 44(4):129{130, 2015. [75] NVIDIA. NVIDIA Jetson TX2. https://www.nvidia.com/en-us/ autonomous-machines/embedded-systems/jetson-tx2/, Last accessed on 2019-07-31. [76] Kwame-Lante Wright, Bhaskar Krishnamachari, and Fan Bai. Usher: Utility- based Scheduling Algorithms for Polymorphic Applications. Technical Report currently unpublished, pending review, USC ANRG, August 2019. [77] Mahadev Satyanarayanan. The emergence of edge computing. Computer, 50(1):30{39, 2017. [78] Ben Zhang, Nitesh Mor, John Kolb, Douglas S Chan, Ken Lutz, Eric All- man, John Wawrzynek, Edward Lee, and John Kubiatowicz. The cloud is not enough: Saving IoT from the cloud. In 7thfUSENIXg Workshop on Hot Topics in Cloud Computing (HotCloud 15), 2015. 137 [79] Ala Al-Fuqaha, Mohsen Guizani, Mehdi Mohammadi, Mohammed Aledhari, and Moussa Ayyash. Internet of Things: A survey on enabling technolo- gies, protocols, and applications. IEEE communications surveys & tutorials, 17(4):2347{2376, 2015. [80] Cheng Wang and Zhiyuan Li. Parametric analysis for adaptive computation ooading. In Proceedings of the ACM SIGPLAN 2004 Conference on Pro- gramming Language Design and Implementation, PLDI '04, pages 119{130, New York, NY, USA, 2004. ACM. [81] Haluk Topcuoglu, Salim Hariri, and Min-you Wu. Performance-eective and low-complexity task scheduling for heterogeneous computing. IEEE Transac- tions on Parallel and Distributed Systems, 13(3):260{274, 2002. [82] Tolga Soyata, Rajani Muraleedharan, Colin Funai, Minseok Kwon, and Wendi Heinzelman. Cloud-vision: Real-time face recognition using a mobile-cloudlet- cloud acceleration architecture. In 2012 IEEE symposium on computers and communications (ISCC), pages 000059{000066. IEEE, 2012. [83] Marshall L Fisher, George L Nemhauser, and Laurence A Wolsey. An analysis of approximations for maximizing submodular set functions - II. In Polyhedral combinatorics, pages 73{87. Springer, 1978. [84] James G Oxley. Matroid Theory, volume 3. Oxford University Press, USA, 2006. [85] Adnan Shaout, Dominic Colella, and S Awad. Advanced driver assistance systems-past, present and future. In 2011 Seventh International Computer Engineering Conference (ICENCO'2011), pages 72{82. IEEE, 2011. [86] Amazon. Amazon EC2. https://aws.amazon.com/ec2/, Last accessed on 2019-07-31. [87] Weisong Shi, Jie Cao, Quan Zhang, Youhuizi Li, and Lanyu Xu. Edge comput- ing: Vision and challenges. IEEE Internet of Things Journal, 3(5):637{646, 2016. [88] Sokol Kosta, Andrius Aucinas, Pan Hui, Richard Mortier, and Xinwen Zhang. ThinkAir: Dynamic resource allocation and parallel execution in the cloud for mobile code ooading. In 2012 Proceedings IEEE Infocom, pages 945{953. IEEE, 2012. [89] Qiang Liu and Tao Han. DARE: Dynamic Adaptive Mobile Augmented Re- ality with Edge Computing. In 2018 IEEE 26th International Conference on Network Protocols (ICNP), pages 1{11. IEEE, 2018. 138 [90] Oliver Sinnen. Task Scheduling for Parallel Systems, volume 60. John Wiley & Sons, 2007. 139
Abstract
Recent advances in machine learning and artificial intelligence have brought about a variety of new applications in wireless IoT and vehicular environments, including those that employ speech recognition and image processing. Many of these applications are computationally intensive and may exceed the capacity of a single device. In such situations, devices often rely on cloud computing to provide the processing power needed to run the applications. However, some of these applications are latency-sensitive due to their interactive nature or their use in the operation of a vehicle. As the demand placed on cloud computing services and their network infrastructure grows, it will become increasingly difficult to provide the performance guarantees these applications require.

Edge computing is an offloading technique that has emerged to meet this growing demand by providing the same services physically closer to where they are needed, with the dual benefit of reducing network congestion and reducing application latency. However, because they are geographically distributed, edge computing resources are relatively difficult to manage and utilize efficiently. This remains an open research problem, particularly for latency-sensitive applications.

In this work, we develop tools that facilitate the use of edge computing resources for latency-sensitive applications in both wireless IoT and connected vehicle systems. We begin by presenting Noctua, a framework that enables a publish-process-subscribe architecture for IoT applications. Through a real-system implementation, we demonstrate and evaluate how Noctua helps IoT developers make more efficient use of network resources and reduce the strain on edge devices by delivering more meaningful data to them. We illustrate Noctua's capabilities through application examples, including aggregating multiple sensor flows and providing radio signal-strength-based localization as a real-time service.

Next, we introduce VESPER, a real-time processing framework and online scheduling algorithm designed to exploit dynamically available distributed devices connected via wireless links. A significant feature of the VESPER algorithm is its ability to navigate the tradeoff between the accuracy and computational complexity of modern machine learning tools by adapting the workload while still satisfying latency and throughput requirements; we refer to this capability as polymorphic computing. VESPER also scales opportunistically to leverage the computational resources of external devices. We evaluate VESPER on an image-processing pipeline and demonstrate that it outperforms offloading schemes based on static workloads.

Finally, we present Usher, a framework for structuring and scheduling latency-sensitive applications that enables efficient utilization of computing resources across networked devices. Like VESPER, Usher exploits the concept of polymorphic computing, but it supports multiple applications of a more general form. Equipped with the Usher framework, we formulate the underlying optimization problem, show that it is NP-hard, and propose two heuristic solutions: a simple greedy algorithm and a more sophisticated bundle-based greedy algorithm. We present approximation ratios for these algorithms and evaluate them empirically on both realistic and constructed workloads to demonstrate their performance over a range of settings. The proposed system is simple and conducive to implementation on real networked distributed systems.
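As a rough illustration of the polymorphic-computing idea summarized above, consider applications that can each run as one of several variants trading accuracy for compute, with a scheduler greedily assigning variants under a shared compute budget. The sketch below is a minimal Python illustration of that general idea only; the variant names, utility numbers, and additive cost model are hypothetical and are not the dissertation's actual VESPER or Usher algorithms.

```python
# Minimal sketch: greedily assign "polymorphic" workload variants to
# applications under a shared compute budget, maximizing total utility.
# All names and numbers are hypothetical illustrations.

from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Variant:
    name: str       # e.g., a small vs. large model for the same task
    utility: float  # offline-measured accuracy / usefulness score
    cost: float     # compute demand (e.g., GPU-seconds per frame)


def greedy_schedule(apps: Dict[str, List[Variant]],
                    budget: float) -> Dict[str, Optional[Variant]]:
    """Start each app at its cheapest variant, then repeatedly apply the
    upgrade with the best marginal utility per unit of extra compute that
    still fits in the remaining budget."""
    chosen: Dict[str, Optional[Variant]] = {}
    remaining = budget
    for app, variants in apps.items():
        cheapest = min(variants, key=lambda v: v.cost)
        if cheapest.cost <= remaining:
            chosen[app] = cheapest
            remaining -= cheapest.cost
        else:
            chosen[app] = None  # even the cheapest variant does not fit
    while True:
        best = None  # (utility gain per extra cost, app, variant)
        for app, variants in apps.items():
            current = chosen[app]
            if current is None:
                continue
            for v in variants:
                extra_cost = v.cost - current.cost
                extra_util = v.utility - current.utility
                if extra_util > 0 and 0 < extra_cost <= remaining:
                    density = extra_util / extra_cost
                    if best is None or density > best[0]:
                        best = (density, app, v)
        if best is None:
            break
        _, app, v = best
        remaining -= v.cost - chosen[app].cost
        chosen[app] = v
    return chosen


if __name__ == "__main__":
    catalog = {
        "lane-detection": [Variant("tiny", 0.6, 1.0),
                           Variant("large", 0.9, 4.0)],
        "sign-recognition": [Variant("tiny", 0.5, 0.5),
                             Variant("medium", 0.7, 1.5)],
    }
    # With a budget of 5.0, the upgrade with the better utility-per-cost
    # (sign-recognition "medium") is taken first.
    print(greedy_schedule(catalog, budget=5.0))
```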
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
AI-enabled DDoS attack detection in IoT systems
Optimizing task assignment for collaborative computing over heterogeneous network devices
Dispersed computing in dynamic environments
Enhancing collaboration on the edge: communication, scheduling and learning
Learning, adaptation and control to enhance wireless network performance
Adaptive resource management in distributed systems
On scheduling, timeliness and security in large scale distributed computing
Performant, scalable, and efficient deployment of network function virtualization
Detecting and mitigating root causes for slow Web transfers
Theoretical and computational foundations for cyber‐physical systems design
Federated and distributed machine learning at scale: from systems to algorithms to applications
QoS-aware algorithm design for distributed systems
Efficient and accurate in-network processing for monitoring applications in wireless sensor networks
A protocol framework for attacker traceback in wireless multi-hop networks
Using formal optimization techniques to improve the performance of mobile and data center networks
Efficient pipelines for vision-based context sensing
Cooperation in wireless networks with selfish users
Exploiting diversity with online learning in the Internet of things
Anycast stability, security and latency in the Domain Name System (DNS) and Content Deliver Networks (CDNs)
Efficient crowd-based visual learning for edge devices
Asset Metadata
Creator
Wright, Kwame-Lante (author)
Core Title
High-performance distributed computing techniques for wireless IoT and connected vehicle systems
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Publication Date
10/21/2019
Defense Date
08/27/2019
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
connected vehicles, distributed systems, edge computing, Internet of Things, OAI-PMH Harvest
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Krishnamachari, Bhaskar (committee chair), Bai, Fan (committee member), Govindan, Ramesh (committee member), Psounis, Konstantinos (committee member)
Creator Email
kwame.wright@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-227488
Unique identifier
UC11674120
Identifier
etd-WrightKwam-7870.pdf (filename), usctheses-c89-227488 (legacy record id)
Legacy Identifier
etd-WrightKwam-7870.pdf
Dmrecord
227488
Document Type
Dissertation
Rights
Wright, Kwame-Lante
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA