OPTIMIZING TASK ASSIGNMENT FOR COLLABORATIVE COMPUTING OVER HETEROGENEOUS NETWORK DEVICES

by

Yi-Hsuan Kao

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

May 2016

Copyright 2016 Yi-Hsuan Kao

Dedication

To my beloved family: Yao-Wu Kao, Yen-Ling Hung, Ru-Huei Shiao, Zi-Xin Kao and Mabli Kao

Acknowledgments

I would like to express my sincere gratitude to the many significant individuals who helped and contributed throughout my PhD study. First, I would like to thank my advisor, Professor Bhaskar Krishnamachari. He gave me extensive freedom to explore everything I found interesting, in both academic research and life beyond school. He gave me crayons to draw on a sheet of paper, but he never minded when I actually drew on the walls. I went through several different stages of my life during these years, and he gave me his full trust and understood my situation through each of these life events. He showed his excellence not only as a teacher and a researcher, but truly as an advisor.

Besides my advisor, I would like to thank Dr. Fan Bai for bringing me industrial insights and giving me valuable advice on my research projects. I would also like to thank the rest of my thesis committee, Professor John Silvester and Professor Leana Golubchik, for their insightful comments and contributions to my thesis. I would also like to thank Professor Alexandros G. Dimakis for his advice during my Master's degree studies. He spent a considerable amount of time with me on research and courses. His patience and encouragement helped me overcome the period when I was not comfortable speaking English.

My sincere thanks also go to other knowledgeable researchers: ANRG members, Professor Shaddin Dughmi, and Professor Rajgopal Kannan. I thank my ANRG labmates for comprehensive discussions on research and life.
We, as a self-motivated group, shared memorable times facing ups and downs, and kept faith for the best. I want to thank Professor Shaddin Dughmi for serving on my qualification committee and offering an enlightening course in Spring 2014, which opened my eyes to approximation algorithms. I also want to thank Professor Rajgopal Kannan, Dr. Moo-Ryong Ra, and Mr. Kwame Wright for stimulating discussions and collaboration on my research.

The work in this dissertation was supported in part by funding from NSF and GM Research. I would also like to express my appreciation to the Annenberg Fellowship Program, and to my country, Taiwan, for supporting my tuition and offering stipends during my PhD study. Moreover, I thank the Medical Program and the WIC Program for offering medical insurance and nutrition benefits to my dependents.

Last, I want to dedicate this thesis to my family: my parents, Yao-Wu Kao and Yen-Ling Hung, my wife, Ru-Huei Shiao, my brother, Yi-Jiun Kao, and my two kids, Zi-Xin and Mabli. Life is unpredictable. Without your support, it wouldn't have been possible for me to complete my PhD study. I love you.

Table of Contents

Dedication
Acknowledgments
List of Figures
List of Tables
Abstract

Chapter 1: Introduction
  1.1 General Settings
    1.1.1 Application Task Graph
    1.1.2 Resource Network
  1.2 Contributions
    1.2.1 Deterministic Optimization with Single Constraint
    1.2.2 Deterministic Optimization with Multiple Constraints
    1.2.3 Stochastic Optimization with Single Constraint
    1.2.4 Online Learning in Stationary Environments
    1.2.5 Online Learning in Non-stationary Environments

Chapter 2: Background
  2.1 Approximation Algorithms
    2.1.1 Two Examples: Set Cover and Vertex Cover
    2.1.2 Hardness of a Problem
      2.1.2.1 Complexity Analysis
      2.1.2.2 Proof of NP-hardness
    2.1.3 LP Relaxation and Rounding Algorithms
      2.1.3.1 Deterministic Rounding Algorithms
      2.1.3.2 Randomized Rounding Algorithms
    2.1.4 Other Approximation Algorithms
  2.2 Dynamic Programming and FPTAS
    2.2.1 The Knapsack Problem
    2.2.2 FPTAS: Fully Polynomial Time Approximation Scheme
    2.2.3 Guidelines to Derive FPTAS from DP
      2.2.3.1 When the optimum cannot be properly bounded
  2.3 Online Learning: The Multi-armed Bandit Problems
    2.3.1 Assumptions on a Bandit Process
    2.3.2 Performance Measurement
    2.3.3 An Example: The UCB1 Algorithm
    2.3.4 Adversarial Multi-armed Bandit Problems

Chapter 3: Related Work
  3.1 Task Assignment Formulations and Algorithms
  3.2 Multi-armed Bandit Problems
  3.3 Our Approach and Proposed Algorithms
  3.4 System Prototypes

Chapter 4: Deterministic Optimization with Single Constraint
  4.1 Problem Formulation
    4.1.1 Task Graph
    4.1.2 Cost and Latency
    4.1.3 Optimization Problem
  4.2 Proof of NP-hardness of SCTA
  4.3 Hermes: FPTAS Algorithms
    4.3.1 Tree-structured Task Graph
    4.3.2 Serial Trees
    4.3.3 Parallel Chains of Trees
    4.3.4 More General Task Graph
  4.4 Numerical Evaluation
    4.4.1 Algorithm Performance
    4.4.2 CPU Time Evaluation
    4.4.3 Benchmark Evaluation
    4.4.4 Discussion

Chapter 5: Deterministic Optimization with Multiple Constraints
  5.1 Problem Formulation
    5.1.1 Hardness of MCTA
  5.2 Sequential Randomized Rounding Algorithm
    5.2.1 Proof of Theorem 2
  5.3 A Bicriteria Approximation Algorithm for MCTA with Bounded Communication Costs¹
  5.4 Numerical Evaluation
  5.5 Conclusion

Chapter 6: Stochastic Optimization with Single Constraint
  6.1 Problem Formulation
  6.2 PTP: Probabilistic Delay Constrained Task Partitioning
  6.3 Numerical Evaluation
  6.4 Discussion

Chapter 7: Online Learning in Stationary Environments
  7.1 Why Online Learning?
  7.2 Models and Formulation
  7.3 The Algorithm: Hermes with DSEE
  7.4 Numerical Evaluation
  7.5 Discussion

Chapter 8: Online Learning in Non-stationary Environments
  8.1 Problem Formulation
  8.2 MABSTA Algorithm
  8.3 Proof of Theorem 5
    8.3.1 Proof of lemmas
    8.3.2 Proof of Theorem 5
  8.4 Polynomial Time MABSTA
    8.4.1 Tree-structure Task Graph
    8.4.2 More general task graphs
    8.4.3 Marginal Probability
    8.4.4 Sampling
  8.5 Numerical Evaluation
    8.5.1 MABSTA's Adaptivity
    8.5.2 Trace-data Emulation²
  8.6 Discussion

Chapter 9: Conclusion

Reference List

¹ This section is joint work with Dr. Rajgopal Kannan, University of Southern California.
² This section is joint work with Mr. Kwame Wright, University of Southern California.

List of Figures

2.1 A vertex cover problem
2.2 Relationship between quantized domain and original domain
2.3 An example of adversarial MAB
4.1 An example of application task graph
4.2 A tree-structured task graph with independent child sub-problems
4.3 Hermes' methodology
4.4 A task graph of serial trees
4.5 Hermes converges to the optimum as ε decreases
4.6 Hermes over 200 different application profiles
4.7 Hermes in dynamic environments
4.8 Hermes: CPU time measurement
4.9 Hermes: 36% improvement for the example task graph
4.10 Hermes: 10% improvement for the face recognition application
4.11 Hermes: 16% improvement for the pose recognition application
5.1 Collaborative computing on face recognition application
5.2 Graphical illustration of MCTA
5.3 SARA: optimal performance in expectation
5.4 BiApp: (α + 1, 2β + 2)-approximation (α = 3, β = 2)
6.1 PTP: error probability
7.1 The task graph with 3 edges as maximum matching
7.2 The performance of Hermes using DSEE in dynamic environment
7.3 Comparison of Hermes using DSEE with other algorithms
8.1 The dependency of weights on a tree-structure task graph
8.2 The dependency of weights on a serial-tree task graph
8.3 MABSTA has better adaptivity to changes than a myopic algorithm
8.4 Snapshots of measurement results
8.5 MABSTA's performance with upper bounds provided by Corollary 1
8.6 MABSTA compared with other algorithms for a 5-device network
8.7 MABSTA compared with other algorithms for a 10-device network

List of Tables

2.1 Common Functions in Complexity Analysis
3.1 Task Assignment Formulations and Algorithms
4.1 Notations of SCTA
5.1 SARA/BiApp: Simulation Profile
6.1 PTP: Simulation Profile
8.1 Parameters Used in Trace-data Measurement

Abstract

The Internet of Things promises to enable a wide range of new applications involving sensors, embedded devices and mobile devices. Different from traditional cloud computing, where centralized and powerful servers offer high-quality computing services, in the era of the Internet of Things there are abundant computational resources distributed over the network. These devices are not as powerful as servers, but they are easier to access, with faster setup and short-range communication. However, because of the energy, computation, and bandwidth constraints on smart things and other edge devices, it will be imperative to collaboratively run a computation-intensive application that a single device cannot support individually. As many IoT applications, like data processing, can be divided into multiple tasks, we study the problem of assigning such tasks to multiple devices, taking into account their abilities and the costs and latencies associated with both task computation and data communication over the network.

A system that leverages collaborative computing over the network faces a highly variable run-time environment. For example, the resources released by a device may suddenly decrease due to a change of state in its local processes, or the channel quality may degrade due to mobility. Hence, such a system has to learn the available resources, be aware of changes, and flexibly adapt its task assignment strategy to make efficient use of these resources.

We take a step-by-step approach to achieve these goals. First, we assume that the amounts of resources are deterministic and known.
We formulate a task assignment problem that aims to minimize the application latency (system response time) subject to a single cost constraint, so that we do not overuse the available resources. Second, we consider that each device has its own cost budget, and our new multi-constrained formulation clearly attributes the cost to each device separately. Moving a step further, we assume that the amounts of resources are stochastic processes with known distributions, and solve a stochastic optimization problem with a strong QoS constraint. That is, instead of providing a guarantee on the average latency, our task assignment strategy guarantees that p% of the time the latency is less than t, where p and t are arbitrary numbers. Finally, we assume that the amounts of run-time resources are unknown and stochastic, and design online algorithms that learn the unknown information within a limited amount of time and make competitive task assignments.

We aim to develop algorithms that make decisions efficiently at run time. That is, the computational complexity should be as light as possible, so that running the algorithm does not incur considerable overhead. For optimizations based on a known resource profile, we show that these problems are NP-hard and propose polynomial-time approximation algorithms with performance guarantees, where the performance loss caused by the sub-optimal strategy is bounded. For the online learning formulations, we propose lightweight algorithms for both stationary and non-stationary environments and show their competitiveness by comparing their performance with the optimal offline policy (solved by assuming the resource profile is known). We perform comprehensive numerical evaluations, including simulations based on trace data measured at application run time, and validate our analysis of the algorithms' complexity and performance against the numerical results.
In particular, we compare our algorithms with existing heuristics and show that in some cases the performance loss incurred by a heuristic is considerable due to its sub-optimal strategy. Hence, we conclude that, to efficiently leverage the distributed computational resources over the network, it is essential to formulate a sophisticated optimization problem that captures practical scenarios well, and to provide an algorithm that is light in complexity and gives a good assignment strategy with a performance guarantee.

Chapter 1: Introduction

We are entering an era of the Internet of Things (IoT) that connects billions of devices in the world [1]. From ubiquitous sensors and embedded devices to personal devices and cloud servers, computational resources are distributed over the network. As more and more promising applications leverage collaborative computation and data transmission across multiple devices [2], we are more aware of the requested content but less aware of how and where data communication and computation happen. This is the paradigm of pervasive computing, where computation is made to appear anytime and anywhere, and the boundaries between devices begin to blur [3, 4].

Suppose you wear a pair of intelligent glasses that capture several images of the world. With the limited processor, memory and storage located on the glasses, applications like face recognition, which require feature extraction techniques and matching against a large data set, are hardly supported by such a wearable device alone. But if you have a mobile phone, face recognition becomes possible through collaborative computation. The glasses first do some simple processing and send the images to the mobile phone, which can filter out possible faces and match them with the profiles stored on it, or can access a cloud server for matching over a larger database through the cellular network.
With devices capable of running light computation spread everywhere, we are thinking from a system perspective: how can we leverage these devices to jointly support a complicated application that a single device fails to run individually? Ideally, data communication and computation happen unobtrusively, so the system must deliver a good user experience given the resources available across the network. The system response time (latency) is a good metric of user experience. Processor cycles, storage and battery are the computation resources on a device; the channel bandwidth is the resource given by the network. We identify several key considerations on how a system can make good use of these resources:

- Awareness of data shipment: Data shipment uses network resources as well as device battery, and incurs transmission delay. A system should be aware of these costs and avoid unnecessary data transmission.

- Device-dependent task assignment: Some sensors and embedded devices are equipped with microcontrollers that are designed to perform specific tasks [5]. A system should make proper task assignments based on each device's ability.

- Flexible computing in dynamic environments: The resources on a device (e.g., CPU load) and the channel conditions vary with time. A system should adapt its task assignment to changes in the environment. For example, a sensor can transmit raw data when the channel is good, but consume local processor resources to reduce the data size when the channel is bad.

We take a step-by-step approach to achieve these goals. We partition an application into executions of multiple tasks and aim to find a good assignment of tasks to the available devices. First, given an application profile with the cost metrics and the potential data shipment requirements between tasks, we solve for an optimal task assignment that minimizes application latency subject to a single cost constraint.
Then, considering that each device may have its own cost budget, we formulate an optimization problem subject to individual constraints on each device. Taking a step further, in the scenario where the cost metrics and channel qualities vary with time, we model them as stochastic processes with known statistics and solve a stochastic optimization problem. Finally, we propose online learning algorithms that learn the unknown statistics and make competitive task assignments.

We envision that a future cyber foraging system [6] will be able to take requests, explore the environment and assign tasks to heterogeneous computing devices so as to satisfy the specified QoS requirements. Furthermore, the concept of macro-programming enables application developers to code by describing high-level, abstracted functionality that is platform independent [7]. An interpreter plays an important role here: it translates the high-level functional description into machine code for different devices. In particular, one crucial component closely related to system performance is how to partition an application into tasks and assign them to suitable devices with awareness of resource availability at run time. Hence, we are motivated to develop these pioneering algorithms to optimize the performance of collaborative computing. In the following, we introduce the general settings of the application model and the environment in our formulations, and then illustrate our major contributions to these problems.

1.1 General Settings

1.1.1 Application Task Graph

We partition an application into N task executions, which can be described by a directed acyclic graph (DAG). Each node specifies a computing task, and each directed edge specifies the data dependency between two tasks. For example, the existence of a directed edge (m, n) implies that task n relies on the execution result of task m. That is, task n cannot start until task m finishes.
Hence, the directed edges in the graph also imply the precedence constraints of these task executions.

1.1.2 Resource Network

We use the term "resource network" to describe the environment, including the devices' abilities and the channels' qualities. We are concerned with two metrics, latency and cost. We assume that all devices are connected; that is, the resource network is a mesh network of M connected devices. For each task, a device's ability is described by the task execution latency, subject to an execution cost of performing the task on that device. Each pair of devices defines a channel, whose quality is specified by its data transmission latency and cost. We assume a mesh network without loss of generality, in the sense that if two devices are not connected, the corresponding latency and cost are infinite.

1.2 Contributions

We propose efficient and competitive algorithms for the following task assignment problems.

1. Deterministic Optimization with Single Constraint
2. Deterministic Optimization with Multiple Constraints
3. Stochastic Optimization with Single Constraint
4. Online Learning in Stationary Environments
5. Online Learning in Non-stationary Environments

An efficient algorithm has reasonable complexity that scales well with the problem size. A competitive algorithm, on the other hand, provides a performance guarantee: if it cannot solve the problem exactly, it specifies how closely it can approximate the optimum. For the optimization formulations (problems 1, 2 and 3), there exist algorithms that solve them exactly, but their complexity grows exponentially with the problem size. For learning in unknown dynamic environments (problems 4 and 5), the problems cannot be solved exactly without full knowledge of the environment. Hence, we aim to develop efficient (polynomial-time) algorithms and show their competitiveness by provable performance guarantees.
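The task-graph model of Section 1.1.1 can be made concrete with a small sketch. Below is a minimal illustration (the task graph is hypothetical; this is not the dissertation's code) of a DAG given as a list of edges (m, n), together with a topological sort that recovers an execution order respecting the precedence constraints:

```python
from collections import deque

# A hypothetical application task graph: edge (m, n) means task n
# needs the output of task m, so m must finish before n can start.
N = 4
edges = [(1, 2), (1, 3), (2, 4), (3, 4)]

# Kahn's algorithm: repeatedly schedule a task all of whose predecessors are done.
indegree = {v: 0 for v in range(1, N + 1)}
succ = {v: [] for v in range(1, N + 1)}
for m, n in edges:
    succ[m].append(n)
    indegree[n] += 1

ready = deque(v for v in indegree if indegree[v] == 0)
order = []
while ready:
    v = ready.popleft()
    order.append(v)
    for n in succ[v]:
        indegree[n] -= 1
        if indegree[n] == 0:
            ready.append(n)

assert len(order) == N  # the graph is acyclic, so every task gets scheduled
print(order)  # → [1, 2, 3, 4], one valid execution order
```

Any schedule produced this way respects every precedence constraint; a task-assignment strategy additionally decides on which device each task in the order runs.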
1.2.1 Deterministic Optimization with Single Constraint

Given an application task graph described by a directed acyclic graph, we want to find a good task assignment strategy that assigns each task to a device. We measure the overall application latency, i.e., the time at which the last task finishes. The performance of a task assignment strategy depends on the devices' performance as well as on the channel qualities given by the resource network profile. Hence, we formulate an optimization problem that aims to minimize the overall latency subject to a single cost constraint. We show that the problem is NP-hard; that is, there is no polynomial-time algorithm that solves all instances of the problem unless P = NP [8]. Hence, we propose Hermes¹, which is a fully polynomial time approximation scheme (FPTAS) [9]. For all instances, Hermes guarantees a solution whose objective is no more than (1 + ε) times the minimum, where ε is a positive number, and its complexity is bounded by a polynomial in 1/ε and the problem size. Our analysis clearly illustrates the trade-off between Hermes' accuracy and its complexity: the closer the objective value is to the optimum (smaller ε), the more running time Hermes takes.

Our formulation generalizes the previous formulation in [10], where the application is modeled as a chain-structured task graph. Furthermore, instead of relying on a standard integer programming solver, we propose Hermes as a polynomial-time algorithm. We verify our results by comprehensive simulations based on trace data from several benchmarks, and compare Hermes with a heuristic proposed in [11] for dynamic environments. To the best of our knowledge, for this class of task assignment problems, Hermes applies to more sophisticated formulations than prior work and runs in time polynomial in the problem size, yet still provides near-optimal solutions with a performance guarantee.
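Hermes' own FPTAS construction is developed in Chapter 4. As a generic illustration of how an FPTAS trades accuracy for running time, the sketch below shows the classic profit-scaling FPTAS for the knapsack problem (the textbook construction reviewed in Section 2.2, not Hermes itself; the instance used is our own example):

```python
def knapsack_fptas(profits, weights, capacity, eps):
    """Classic profit-scaling FPTAS for 0/1 knapsack.

    Rounds profits down to multiples of K = eps * max(profits) / n, then
    runs the DP over total scaled profit: O(n^3 / eps) time overall.
    Returns K * (best scaled profit), a value that the computed solution
    achieves and that is guaranteed to be >= (1 - eps) * OPT.
    """
    n = len(profits)
    K = eps * max(profits) / n          # scaling factor
    scaled = [int(p // K) for p in profits]
    total = sum(scaled)

    INF = float("inf")
    # min_weight[q] = least total weight achieving scaled profit exactly q
    min_weight = [0.0] + [INF] * total
    for i in range(n):
        for q in range(total, scaled[i] - 1, -1):
            cand = min_weight[q - scaled[i]] + weights[i]
            if cand < min_weight[q]:
                min_weight[q] = cand

    best = max(q for q in range(total + 1) if min_weight[q] <= capacity)
    return K * best
```

On the instance profits (60, 100, 120), weights (10, 20, 30), capacity 50, the optimum is 220; with ε = 0.5 the returned value is guaranteed to be at least 110. Halving ε tightens the guarantee but enlarges the DP's profit range, mirroring the accuracy vs. complexity trade-off described above for Hermes.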
¹ Because of its focus on minimizing latency, Hermes is named for the Greek messenger of the gods with winged sandals, known for his speed.

1.2.2 Deterministic Optimization with Multiple Constraints

In the scenario where the resource network consists of multiple battery-operated devices, such as sensors and mobile phones, the single-constraint formulation has deficiencies: the assignment strategy given by Hermes may rely mostly on a single device and drain its battery. Hence, we formulate another problem that minimizes the overall application latency subject to individual cost constraints on each device. Our multi-constraint task assignment formulation clearly attributes the cost to each device separately and aims to find a task assignment strategy that considers both system performance and cost balancing. We show that our formulation is NP-hard, but propose two new polynomial-time algorithms that provide provable guarantees: SARA and BiApp. SARA rounds the fractional solution of an LP relaxation to achieve asymptotically optimal performance from a time-average perspective. BiApp is a bicriteria (α + 1, 2β + 2)-approximation with respect to the latency objective and the cost constraints for each data frame processed by the application (α and β are parameters quantifying bounds on the communication latency and the communication costs, respectively). Our simulation results show that these algorithms perform extremely well on a wide range of problem instances.

1.2.3 Stochastic Optimization with Single Constraint

We next go a step further to develop an algorithm that makes good task assignments in stochastic environments with known statistics. In a real environment, the CPU load on a device can increase dramatically due to other co-existing processes, and the wireless channel quality also varies with time. Hence, we solve a stochastic optimization problem subject to a probabilistic QoS constraint.
That is, our algorithm provides a task assignment strategy that guarantees the confidence with which the latency falls within a given value. For example, the QoS guarantee might be that more than 90% of the time the latency is less than 200 ms. Our algorithm runs in polynomial time, and its near-optimal performance is shown by simulation results.

1.2.4 Online Learning in Stationary Environments

At application run time, the on-device resources and channel qualities may be unknown and time-varying. Hence, we design an algorithm that efficiently probes the environment and makes task assignments based on the probing results. However, the probing itself incurs overhead and runs the risk of sampling devices and channels with poor performance, so we have to take these losses into account when analyzing the performance. We formulate this problem of learning in unknown dynamic environments as a multi-armed bandit (MAB) problem [12]. In a typical MAB formulation, each arm represents an unknown stochastic process. At each trial, an agent chooses from a set of arms, receives the payoff of the selected arms, and tries to learn statistical information from sensing them. This historical information helps with decisions in future trials. Given an online learning algorithm, its performance is judged by its regret [13]: the amount of payoff the algorithm loses, due to its sub-optimal strategy, compared to a genie who knows the statistics and can use the optimal strategy. The regret is expected to grow with the number of trials, but should slow down (grow sub-linearly) as the algorithm learns more information and adopts a strategy that converges to the optimal one.

In our setting of a dynamic resource network, we model each unknown device and unknown channel as an "arm". Our algorithm consists of two alternating phases, an exploration phase and an exploitation phase. In the exploration phase, we follow a deterministic sequence that samples all devices and channels thoroughly.
In the exploitation phase, we call Hermes to solve for the optimal assignment based on the sample means collected so far. We show that the performance loss (regret) can be bounded by O(ln T), which is a sub-linear function of the number of trials, and justify our analysis in simulation.

1.2.5 Online Learning in Non-stationary Environments

We move one more step closer to reality, where the resource network cannot be captured by stationary processes, due to the unpredictable resources released by a device at run time and the intermittent connections of a mobile network. Instead of assuming that the payoff of each arm is a stationary stochastic process, we model the payoff as a bounded sequence. At each trial, the value of the chosen sequence is revealed while the others remain unknown. Since a sequence can change arbitrarily over time, without being limited to evolving as a stationary process, our results apply to any dynamic environment.

We propose MABSTA (Multi-Armed Bandit based Systematic Task Assignment) as an online learning algorithm for non-stationary environments and show that its performance loss is bounded by O(√T). Furthermore, we compare MABSTA with other algorithms via simulations based on trace data measured on an IoT testbed [14], and confirm MABSTA's competitiveness and adaptivity to dynamic environments.

Chapter 2: Background

Our main focus is to design algorithms that solve the proposed optimization formulations. The theoretical background includes approximation algorithms and online learning algorithms for multi-armed bandit problems.

2.1 Approximation Algorithms

When we solve an optimization problem, we first want to know how hard it is. There are several classical problems whose hardness has been established [15]. A high-level approach is to map the optimization problem onto these "benchmarks" so that we can get an idea of its hardness.
Some problems are so hard that they cannot be solved at all, or cannot be solved in a reasonable time (i.e., a computer may take years to solve them). If we are unable to solve a hard problem exactly, we resort to solving it approximately: we propose an approximation algorithm [9] and attempt to give a performance guarantee on how closely it approximates the optimum. The following steps summarize how we study an optimization problem and propose a corresponding approximation algorithm.

1. Determine the hardness of the problem.
2. Propose an approximation algorithm (assuming the problem is not trivial).
3. Analyze the algorithm's complexity.
4. Prove the algorithm's performance guarantee.

In the following, we present related examples and some techniques that have been used in our work.

2.1.1 Two Examples: Set Cover and Vertex Cover

We use two well-known problems as examples. Given M subsets S_1, ..., S_M, where S_i ⊆ {1, ..., N} for all i, the set cover problem (SC) [16] is to find a minimum number of subsets that cover the universe [N]. That is,

    SC:  min   Σ_{i ∈ [M]} x_i
         s.t.  ∪_{i : x_i = 1} S_i = [N],
               x_i ∈ {0, 1},  ∀i ∈ [M].

[Figure 2.1: A vertex cover problem]

This is an integer programming (IP) formulation. The binary variable x_i denotes whether the i-th subset is selected or not. An input instance of SC can be described by the tuple {N, {S_1, S_2, ..., S_M}}, where N is the size of the universe and {S_1, S_2, ..., S_M} are the subsets of the universe. For example, if N = 4 and we have the subsets {1, 4}, {2}, {3} and {2, 3}, then we choose {1, 4} and {2, 3} as the minimum set cover.

Another well-known problem is the vertex cover problem (VC) [17]. We use G(V, E) to denote an undirected graph, where V = {1, ..., N} is the set of vertices enumerated from 1 to N, and E is the set of edges. If there exists an edge that joins vertices m and n, we use (m, n) to denote this edge.
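Before moving on to vertex cover: although SC is hard to solve exactly, the classic greedy heuristic, which repeatedly picks the subset covering the most still-uncovered elements, achieves an H_N ≤ ln N + 1 approximation. A minimal sketch on the example instance above (our illustration, not the dissertation's code):

```python
def greedy_set_cover(n, subsets):
    """Repeatedly pick the subset covering the most uncovered elements.
    Returns the indices of the chosen subsets; achieves an H_n <= ln(n) + 1
    approximation of the minimum set cover."""
    uncovered = set(range(1, n + 1))
    chosen = []
    while uncovered:
        # index of the subset covering the most uncovered elements
        i = max(range(len(subsets)), key=lambda j: len(subsets[j] & uncovered))
        if not subsets[i] & uncovered:
            raise ValueError("subsets do not cover the universe")
        chosen.append(i)
        uncovered -= subsets[i]
    return chosen

# The example instance from the text: N = 4, subsets {1,4}, {2}, {3}, {2,3}.
cover = greedy_set_cover(4, [{1, 4}, {2}, {3}, {2, 3}])
print(cover)  # → [0, 3], i.e. the subsets {1, 4} and {2, 3}
```

On this instance the greedy choice happens to coincide with the minimum cover; in general the ln N factor is essentially the best possible in polynomial time.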
Given an undirected graph $G(V, E)$, the vertex cover problem is to find the minimum set of vertices such that, for each edge, at least one of the two vertices it connects is in the set. That is,

VC: $\min \sum_{i \in [N]} x_i$
s.t. $x_m + x_n \geq 1, \forall (m, n) \in E$,
$x_i \in \{0, 1\}, \forall i \in [N]$.

Fig. 2.1 shows an example of finding the minimum vertex cover, where the filled circles are the selected vertices. Both solutions shown cover all the edges, with the solution on the right being the minimum vertex cover.

These two problems are "hard" problems in computational theory, and several approximation algorithms have been proposed for them. We will use these two problems as examples to introduce the approximation algorithm paradigm in this dissertation.

2.1.2 Hardness of a Problem

The hardness of a problem is measured by the computing resources required to solve it. If, for a problem, the best algorithm is still very expensive, then we know the problem is hard.

Theoretical research generally focuses on two major costs of an algorithm: computational cost and storage cost. Computational cost refers to how many CPU cycles an algorithm consumes, and storage cost refers to how much memory an algorithm uses. There can be a trade-off between these two costs. For example, algorithm A may use more CPU cycles than algorithm B but less memory. Most of the time, however, storage cost is positively correlated with computational cost in theoretical analysis. Hence, we focus our discussion on computational hardness and refer readers to the literature on space hardness in [18, 19].

2.1.2.1 Complexity Analysis

When analyzing the computational cost (complexity) of an algorithm, we count how many basic tasks are executed over the whole running process. Although the basic tasks may differ from algorithm to algorithm, they are usually cheap operations that take only a few CPU cycles. Hence, it is more convenient to count these basic tasks in complexity analysis rather than actual CPU cycles.
A problem may have different input instances. An input instance is characterized not only by its input parameter values but also by its size. For example, an instance of SC consists of a universe of size $N$ and $M$ subsets $S_1, \ldots, S_M$. The numbers $N$ and $M$ specify the problem size, and the contents of $S_1, \ldots, S_M$ are the input parameter values. Given two instances with the same size but different input values, an algorithm may execute a different number of basic tasks. Hence, we usually present a worst-case analysis. That is, we present an upper bound on the number of basic tasks necessary over all possible instances of the same size.

More importantly, we would like to know how an algorithm's complexity scales with problem size, so that we can tell whether the algorithm becomes prohibitively expensive on large problems. In sum, an algorithm's complexity is an upper bound on the number of basic tasks, expressed as a function of problem size. We use big-O notation to describe how an algorithm's complexity scales with the problem size. A function $f(N)$ is in $O(g(N))$ if there exist $c > 0$ and $N_0 > 0$ such that $f(N) \leq c\,g(N)$ whenever $N \geq N_0$. That is, $f(N)$ grows no faster than $g(N)$ and is surpassed by $c\,g(N)$ whenever $N$ is large enough. Hence, $O(g(N))$ is a set of functions that grow no faster than $g(N)$. If we say an algorithm's complexity belongs to $O(g(N))$, then the number of necessary basic tasks grows no faster than $g(N)$.

We use the binary search algorithm as an example. Given a sorted array, where each element is a positive integer between 1 and $N$, and an integer $n$, $1 \leq n \leq N$, we can perform binary search to check whether $n$ is an element of the array as follows. First, look at the middle element of the sorted array. If $n$ is smaller than the middle element, search the first half of the array; otherwise, search the last half. Repeat until $n$ is found, or an empty subarray is reached (in case $n$ is not an element of the original array).
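The procedure above can be sketched in a few lines of Python (a minimal illustration; the comparison counter makes the "basic task" count of the next paragraph explicit):

```python
def binary_search(arr, n):
    """Iterative binary search on a sorted array.
    Returns (found, comparisons): whether n is present and how many
    element comparisons ("basic tasks") were performed."""
    lo, hi = 0, len(arr) - 1
    comparisons = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if arr[mid] == n:
            return True, comparisons
        if n < arr[mid]:
            hi = mid - 1   # search the first half
        else:
            lo = mid + 1   # search the last half
    return False, comparisons

found, comps = binary_search([1, 3, 4, 7, 9, 12, 15], 9)
# An unsuccessful search over 1024 elements needs about log2(1024) = 10
# comparisons, illustrating the logarithmic scaling discussed next.
found2, comps2 = binary_search(list(range(1, 1025)), 0)
```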
If we define the basic task as "comparing two numbers," which can be done in one CPU cycle, then the complexity of binary search belongs to $O(\log_2 N)$: the algorithm stops after performing at most $\log_2 N$ comparisons. Since the complexity measurement is in units of a basic task, which is proportional to running time, we can also say the binary search algorithm runs in $O(\log_2 N)$ time, or that binary search is a logarithmic-time algorithm.

Table 2.1: Common Functions in Complexity Analysis

             log2 N   N (linear)   N^2 (polynomial)   N^3 (polynomial)   2^N (exponential)
  N = 10      3.32        10             100                1000               1024
  N = 100     6.64       100            10000             1000000          1.27 x 10^30
  N = 1000    9.96      1000            10^6                10^9              10^301

Table 2.1 lists common functions that describe an algorithm's complexity. We often speak of logarithmic algorithms, polynomial algorithms, etc., to describe scalability. The difference is significant when $N$ is large. Assume that a modern processor can complete $10^6$ basic tasks per second. Then for a small problem size ($N = 10$), an exponential algorithm finishes in the blink of an eye, just as a logarithmic algorithm does. However, for a big problem ($N = 1000$), an exponential algorithm would run essentially forever ($10^{290}$ years) while the other algorithms still finish in seconds. Hence, we definitely want to avoid running an exponential algorithm on a big problem. A polynomial algorithm is much better than an exponential one; if possible, a low-degree polynomial algorithm is better still.

In computational theory, we classify problems by their hardness. There is a class called NP-hard problems, for which researchers believe there is no polynomial time algorithm.¹ Since the running time of an exponential algorithm is not tolerable for big problem sizes, we study polynomial time approximation algorithms for the class of NP-hard problems.

¹ Formally, there is no polynomial time algorithm that solves an NP-hard problem unless P = NP [17].
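The entries of Table 2.1 are easy to reproduce (a quick sketch; the time estimate assumes the hypothetical $10^6$ tasks-per-second processor from the text):

```python
import math

def growth(n):
    """Evaluate the growth functions of Table 2.1 at a given N."""
    return {
        "log2": math.log2(n),
        "linear": n,
        "quadratic": n ** 2,
        "cubic": n ** 3,
        "exponential": 2 ** n,
    }

g = growth(100)
# At 10**6 basic tasks per second, 2**100 tasks already take about
# 4e16 years -- hopeless long before N reaches 1000.
years = g["exponential"] / 1e6 / (3600 * 24 * 365)
```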
For other classes of problems, we refer readers to the literature in [19].

2.1.2.2 Proof of NP-hardness

The first step in studying approximation algorithms for a problem is to determine whether the problem is NP-hard. We prove NP-hardness of a problem A by reducing another NP-hard problem B to it. In general, we say problem B is reducible to problem A if an algorithm for solving A can be "efficiently" used to solve B. For example, if we have a black box for solving A, then by using this black box a polynomial number of times, plus extra steps of polynomial complexity, we can solve B. Such a reduction implies that A is at least as hard as B; hence, if B is NP-hard, then A is NP-hard as well. Equivalently, the reduction implies that if the black box were a polynomial time algorithm, there would exist a polynomial time algorithm solving B; but no polynomial time algorithm is known for an NP-hard problem B, so none is known for A either.

As an example, we reduce the vertex cover problem (VC) to the set cover problem (SC) to show that SC is NP-hard. In VC, the universe to be covered is the set of edges, $E$, and each subset is represented by a vertex, whose elements are the edges that touch that vertex. In this way, we transform a VC instance into an SC instance. The solution of this SC problem is exactly the minimum set of vertices that covers all edges. Clearly, the transformation takes linear time ($O(|E|)$), and we call the algorithm for SC only once to solve the transformed instance. That is, VC is reducible to SC. Since we know VC is NP-hard, SC is NP-hard as well.

2.1.3 LP Relaxation and Rounding Algorithms

In this section, we introduce one approach to approximating a problem, called linear programming (LP) relaxation. For every combinatorial optimization problem, there exists a corresponding integer programming (IP) formulation [9].
Integer programming is NP-hard in general [20]. However, if we relax the integer constraints in the formulation, that is, if we allow fractional solutions, it becomes an LP formulation, for which polynomial time algorithms exist [21]. Hence, whenever we formulate an integer program, we can solve the corresponding LP relaxation and, if necessary, design a rounding algorithm that rounds a fractional solution to an integer one. In the following, we introduce two rounding algorithms, using VC and SC as examples.

2.1.3.1 Deterministic Rounding Algorithms

A deterministic rounding algorithm is a fixed mapping from fractional numbers to integers. We use VC as an example. In a VC problem, we choose the minimum number of vertices that cover all of the edges; the binary variable $x_i$ corresponds to selecting vertex $i$ or not. If we allow each binary variable to be any fractional value between 0 and 1, then we obtain an optimal solution of the LP, denoted by $x^\star_i$, $i = 1, \ldots, N$. Let $\hat{x}_i$, $i = 1, \ldots, N$, denote the rounded solution. We define the rounding algorithm as

$\hat{x}_i = 1$ if $x^\star_i \geq \frac{1}{2}$, and $\hat{x}_i = 0$ otherwise. (2.1)

The LP relaxation combined with this rounding algorithm outputs a vertex cover whose number of vertices is no more than 2 times the minimum [17]. First we verify that $\{\hat{x}_i\}_{i=1}^N$ is indeed a vertex cover. In the LP formulation, we have $x^\star_m + x^\star_n \geq 1$ for each edge $(m, n) \in E$. That is, either $x^\star_m \geq \frac{1}{2}$ or $x^\star_n \geq \frac{1}{2}$, which implies that either vertex $m$ or vertex $n$ is selected to cover edge $(m, n)$, as required.

Now we examine the performance of $\{\hat{x}_i\}$. Let $OPT$ denote the minimum objective of the IP. The minimum of the LP relaxation is always no larger than $OPT$, since the LP has fewer constraints. That is, $\sum_{i \in [N]} x^\star_i \leq OPT$. From the rounding algorithm, we have $\hat{x}_i \leq 2 x^\star_i$ for all $i$. Hence,

$\sum_{i \in [N]} \hat{x}_i \leq 2 \sum_{i \in [N]} x^\star_i \leq 2\,OPT.$ (2.2)

Solving an LP takes polynomial time, and this rounding algorithm runs in $O(N)$ time.
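The rounding step (2.1) is a one-line map once an LP solution is available. A minimal sketch (to avoid depending on an LP solver, the LP solution is hardcoded: for the triangle graph on three vertices, the all-1/2 vector is a known LP optimum with value 1.5, while the integer optimum is 2):

```python
def round_vc(x_star, edges):
    """Deterministic rounding (2.1): pick vertex i iff x*_i >= 1/2,
    then verify that the rounded solution covers every edge."""
    x_hat = [1 if x >= 0.5 else 0 for x in x_star]
    assert all(x_hat[m] + x_hat[n] >= 1 for (m, n) in edges)
    return x_hat

# Triangle graph: LP optimum x* = (1/2, 1/2, 1/2); integer OPT = 2.
edges = [(0, 1), (1, 2), (0, 2)]
x_hat = round_vc([0.5, 0.5, 0.5], edges)
cover_size = sum(x_hat)  # 3, which is <= 2 * OPT = 4 as bound (2.2) promises
```

Note that the triangle is a tight-ish case: rounding selects all three vertices even though two suffice, consistent with (but within) the factor-2 guarantee.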
Hence, the above algorithm is a 2-approximation.

2.1.3.2 Randomized Rounding Algorithms

The LP solution suggests how the variables should be set to achieve the optimal objective. For example, if a fractional solution is very close to 1, the corresponding binary variable is likely to be 1. Instead of using a fixed mapping between fractional numbers and integers, a randomized rounding algorithm rounds each fractional number by following a probability distribution, which usually depends on the optimal LP solution. Since the rounding process is not fixed, the performance may differ from run to run, and for some runs the rounding result may not even be feasible. Hence, we analyze the expected performance of a randomized rounding algorithm and provide a confidence bound on its feasibility.

We use the set cover problem SC as an example. In an SC problem, we choose the minimum number of subsets whose union covers the universe $[N]$; the binary variable $x_i$ corresponds to selecting the $i$th subset or not. First we solve the LP relaxation and obtain the optimal LP solution $\{x^\star_i\}_{i=1}^M$. Then we design the randomized rounding algorithm as

$\hat{x}_i = 1$ with probability $x^\star_i$, and $\hat{x}_i = 0$ with probability $1 - x^\star_i$. (2.3)

That is, we include the $i$th subset with probability $x^\star_i$. This algorithm gives the following expected performance:

$\mathbb{E}\{\sum_{i \in [M]} \hat{x}_i\} = \sum_{i \in [M]} \mathbb{E}\{\hat{x}_i\} = \sum_{i \in [M]} x^\star_i \leq OPT.$ (2.4)

Now, using this algorithm as a basic block, we repeat it for $2 \log N$ rounds. Let $C_r$ denote the output (the indices of the selected subsets) in the $r$th round. The final output, $C = \bigcup_{r=1}^{2 \log N} C_r$, includes all the subsets that have been selected from the first round to the final round. In [16], this algorithm is proposed and shown to be a $2 \log N$-approximation. From (2.4), $\mathbb{E}\{|C_r|\} \leq OPT$ for all $r$, and hence $\mathbb{E}\{|C|\} \leq 2 \log N \cdot OPT$, as required. Now we analyze the confidence that $C$ is a set cover. For any element $j$ in the universe and any $r$, we have

$\Pr\{j \text{ is not covered by } C_r\} = \prod_{i: j \in S_i} (1 - x^\star_i) \leq \prod_{i: j \in S_i} e^{-x^\star_i} \leq \frac{1}{e}.$ (2.5)

Here we use the fact that $1 - x \leq e^{-x}$ for all $x \in [0, 1]$, together with the LP constraint that every element $j$ is fractionally covered, $\sum_{i: j \in S_i} x^\star_i \geq 1$. Since the rounds are independent,

$\Pr\{j \text{ is not covered by } C\} = \prod_{r=1}^{2 \log N} \Pr\{j \text{ is not covered by } C_r\} \leq e^{-2 \log N} = \frac{1}{N^2}.$ (2.6)

Therefore,

$\Pr\{C \text{ is not a set cover}\} = \Pr\{\exists j \in [N] \text{ not covered}\} \leq N \cdot \frac{1}{N^2} = \frac{1}{N}.$ (2.7)

This algorithm involves solving one LP and running $2 \log N$ linear-time rounding rounds. Clearly, this is a polynomial time algorithm with a $2 \log N$ approximation guarantee that outputs a set cover with probability at least $1 - \frac{1}{N}$.

2.1.4 Other Approximation Algorithms

There are other commonly seen approximation approaches, such as greedy algorithms, local search, and dynamic programming (DP). We briefly introduce greedy algorithms and local search here, and discuss dynamic programming in the next section. We refer readers to [17] for more on these approximation algorithms.

A greedy algorithm makes sequential decisions without considering the future or the past; that is, it makes the decision that is best for the current state. For example, in a network routing problem where the goal is to find the shortest path from a source to a destination, a greedy algorithm chooses the closest neighbor of the current location as the next hop. It does not consider that the next hop may be a dead end, or farther from the destination than the current location.

A local search algorithm starts with a feasible solution and tries to improve it by "local" moves. It stops when no local move can further improve the performance. For the same network routing problem, local search starts with a path from source to destination, and then checks whether changing any intermediate node to another node would yield a better path. Typically, we have to show that a local search algorithm stops within a limited (polynomial) amount of time. Moreover, we have to bound the loss due to the algorithm being trapped at a local optimum.
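The repeated randomized rounding (2.3) can be sketched as follows. Again the LP solution is hardcoded to keep the sketch self-contained: for the hypothetical instance with universe $\{1,2,3\}$ and subsets $\{1,2\}, \{2,3\}, \{1,3\}$, the all-1/2 vector is an optimal LP solution by symmetry (each element is fractionally covered with weight exactly 1):

```python
import math
import random

def randomized_round_sc(x_star, universe, rng):
    """Randomized rounding (2.3), repeated for ceil(2 ln N) rounds:
    in each round include subset i with probability x*_i;
    return the union of all selected subset indices."""
    rounds = math.ceil(2 * math.log(len(universe)))
    chosen = set()
    for _ in range(rounds):
        for i, p in enumerate(x_star):
            if rng.random() < p:
                chosen.add(i)
    return chosen

subsets = [{1, 2}, {2, 3}, {1, 3}]
universe = {1, 2, 3}
rng = random.Random(0)  # fixed seed for reproducibility
# With ceil(2 ln 3) = 3 rounds and x* = 1/2, each element is missed with
# probability at most (1/4)^3, so most trials should yield a cover.
covers = 0
for _ in range(200):
    sel = randomized_round_sc([0.5, 0.5, 0.5], universe, rng)
    if set().union(*[subsets[i] for i in sel]) >= universe:
        covers += 1
```

Over the 200 trials, the empirical cover rate should comfortably exceed the $1 - 1/N$ style guarantee, while individual trials may still fail, which is exactly the behavior (2.6)-(2.7) quantify.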
2.2 Dynamic Programming and FPTAS

The dynamic programming (DP) method [22] breaks a problem down by first solving the necessary sub-problems and combining them in the end. These sub-problems are closely related, in that solving one sub-problem may rely on the solutions of other sub-problems. Hence, each sub-problem is solved once, and the result is stored in memory, accessible whenever necessary. This approach is often used in sequential decision-making processes in order to optimally reach a goal (destination, terminal state, etc.). We will use the knapsack problem as an example and illustrate the dynamic programming algorithm that solves it. Finally, we discuss an approximation scheme that is closely related to dynamic programming.

2.2.1 The Knapsack Problem

The knapsack problem (KP) seeks to pack items that are as valuable as possible into a backpack with limited volume [23]. Formally, given a set of items $1, \ldots, N$, with values $v_1, \ldots, v_N$ and sizes $c_1, \ldots, c_N$, we want to select a subset of items such that the total value is maximized and the total size is at most the backpack volume $C$. That is,

KP: $\max \sum_{i \in [N]} x_i v_i$
s.t. $\sum_{i \in [N]} x_i c_i \leq C$,
$x_i \in \{0, 1\}$.

The binary variable $x_i$ denotes whether the $i$th item is selected or not. We assume all input numbers are positive integers, and $c_i \leq C$ for all $i \in [N]$.

The dynamic programming method can solve KP exactly. We first define the sub-problems. Let $C[j, v]$ denote the minimum total size induced by packing a subset of items from $\{1, \ldots, j\}$ to achieve value $v$. Specifically,

$KP_{sub}$: $\min \sum_{i \in [j]} y_i c_i$
s.t. $\sum_{i \in [j]} y_i v_i = v$,
$y_i \in \{0, 1\}$.

We claim that if we solve all the sub-problems for $j = 1, \ldots, N$ and $v = 1, \ldots, N v_{max}$, where $v_{max} = \max_{i \in [N]} v_i$, then we can solve KP. Clearly, the optimal packing has value at most $N v_{max}$. Suppose we have already solved $C[N, v]$ for all $v \in \{1, \ldots, N v_{max}\}$. We can then find the optimal value by

$\max v$ s.t. $C[N, v] \leq C$. (2.8)

Let $V^\star$ be the optimal value given by (2.8), achieved at $C[N, V^\star]$.
We prove that $V^\star$ is optimal by contradiction. Assume $V^\star$ is not optimal; then there exists a packing that achieves value $V' > V^\star$ with total size $C' \leq C$, which implies $C[N, V'] \leq C' \leq C$. But since $V^\star$ is the maximum value satisfying (2.8), we must have $C[N, V'] > C$, which is a contradiction, as required.

The following recurrence is the core idea of the dynamic programming. We solve the sub-problem $C[j, v]$, packing a subset of items in $\{1, \ldots, j\}$ that uses the least volume and achieves value $v$. $C[j, v]$ relies on the packing strategies of $C[j-1, v]$ and $C[j-1, v - v_j]$. That is,

$C[j, v] = \min\{C[j-1, v],\; C[j-1, v - v_j] + c_j\}.$ (2.9)

For item $j$, we have two choices. If we do not select item $j$, then we have to pack a subset of $\{1, \ldots, j-1\}$ achieving total value $v$. On the other hand, if we select item $j$, then we have to find the least-size packing of $\{1, \ldots, j-1\}$ that achieves value $v - v_j$. Hence, we first solve $C[j-1, v]$ for all $v \in \{1, \ldots, N v_{max}\}$, and then we can solve $C[j, v]$. The dynamic programming procedure starts from the first item, given

$C[1, v] = c_1$ if $v = v_1$, and $C[1, v] = \infty$ otherwise. (2.10)

In total, we have to solve $N \cdot N v_{max}$ sub-problems. Since each sub-problem is as simple as choosing the lesser of two numbers, we define it as a basic task. Hence, the dynamic programming runs in $O(N^2 v_{max})$ time. The complexity depends not only on the problem size $N$ but also on $v_{max}$, which is determined by the input instance. This is not a polynomial time algorithm; we call it a pseudo-polynomial time algorithm. Not only may the algorithm run for a long time when the values are big, but if the values were positive real numbers, the number of sub-problems would be unbounded. Although pseudo-polynomial time is good enough for some problem instances, we are interested in a polynomial time algorithm for all instances, obtained by sacrificing some accuracy. That is, we seek a polynomial time approximation algorithm with a performance guarantee.
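The value-based DP of (2.8)-(2.10) can be sketched as follows, using a one-dimensional rolling array over the recurrence (2.9) (a standard space-saving equivalent of the two-dimensional table $C[j, v]$):

```python
INF = float("inf")

def knapsack_by_value(values, sizes, capacity):
    """DP over achievable values, as in (2.8)-(2.10):
    C[v] = minimum total size needed to reach value v exactly,
    updated item by item; the answer is the largest v with
    C[v] <= capacity."""
    n = len(values)
    v_max_total = n * max(values)
    C = [INF] * (v_max_total + 1)
    C[0] = 0
    for vj, cj in zip(values, sizes):
        # iterate downward so each item is used at most once
        for v in range(v_max_total, vj - 1, -1):
            C[v] = min(C[v], C[v - vj] + cj)
    return max(v for v in range(v_max_total + 1) if C[v] <= capacity)

# Toy instance: items with values (6, 10, 12) and sizes (1, 2, 3),
# capacity 5; the best packing is items 2 and 3, with value 22.
best = knapsack_by_value([6, 10, 12], [1, 2, 3], capacity=5)
```

The running time is $O(N \cdot N v_{max})$, matching the pseudo-polynomial bound discussed above.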
2.2.2 FPTAS: Fully Polynomial Time Approximation Scheme

A Fully Polynomial Time Approximation Scheme (FPTAS) [9] clearly illustrates the trade-off between solution accuracy and algorithm complexity: an FPTAS lets us improve the solution accuracy at the cost of increased complexity. Hence, an FPTAS can theoretically approach the optimum arbitrarily closely, though the growing complexity may become considerable.

Specifically, for a maximization problem, an FPTAS guarantees a $(1-\epsilon)$ approximation ratio. That is, it always outputs a solution that achieves at least $(1-\epsilon)$ times the maximum objective, where $\epsilon$ is a real number between 0 and 1. Its complexity is bounded by a polynomial in the problem size, $\frac{1}{\epsilon}$, and the number of bits needed to describe the problem instance, $|I|$. We have seen that the problem size is closely related to the complexity of an algorithm. Moreover, by the definition above, we can approach the optimum more closely by choosing a smaller $\epsilon$; however, the complexity increases with $\frac{1}{\epsilon}$. The last factor is $|I|$. Since a computer runs on binary operations, the number of basic operations increases with the number of bits needed to describe an input instance. For example, if we know all the numbers in an instance are bounded by $T$, then these numbers can be described by $O(N \log_2 T) = O(|I|)$ bits in total. Hence, an FPTAS runs in $\mathrm{poly}(N, \frac{1}{\epsilon}, \log_2 T)$ time.²

² One can see that $|I|$ also depends on the problem size. For the knapsack problem, there are $O(N)$ input numbers. If each of them can be described by $\log_2 v_{max}$ bits, then $|I| = O(N \log_2 v_{max})$. Hence, the formal definition of the complexity is that an FPTAS runs in $\mathrm{poly}(\frac{1}{\epsilon}, |I|)$ time [9]. We separate the problem size from $|I|$ for cleaner presentation.

In the following, we illustrate an FPTAS for KP [24]. In the dynamic programming approach, the number of sub-problems scales as $O(N^2 v_{max})$. For a fixed problem size, the FPTAS comes from quantizing the item values so that the number of sub-problems scales more slowly than $O(v_{max})$. By the definition above, we have to find a quantization step size such that the number of sub-problems is bounded by a polynomial in $\log_2 v_{max}$. We will see that the complexity of the following FPTAS turns out to be independent of $v_{max}$, which is even better than this minimum requirement.

Let $V^\star$ be the maximum value given by the optimal packing of KP. There is a simple bound on $V^\star$: $v_{max} \leq V^\star \leq N v_{max}$. Clearly, the maximum value is at least $v_{max}$, achieved by packing only the most valuable item, and is at most $N v_{max}$. We design the quantization step size $\Delta$ and the quantizer $q_\Delta$ as

$\Delta = \frac{\epsilon v_{max}}{N},$ (2.11)

$q_\Delta(x) = k, \quad \text{if } k\Delta \leq x < (k+1)\Delta.$ (2.12)

Now we consider the quantized dynamic programming as follows. We solve $C[j, k]$ by

$C[j, k] = \min\{C[j-1, k],\; C[j-1, k - q_\Delta(v_j)] + c_j\}.$ (2.13)

In the quantized dynamic programming, we no longer have to solve $N^2 v_{max}$ sub-problems: each quantized value falls in the dynamic range

$0 \leq q_\Delta(v_j) \leq q_\Delta(v_{max}) \leq \lceil \tfrac{N}{\epsilon} \rceil.$ (2.14)

Hence, the number of sub-problems is bounded by $O(N \cdot N q_\Delta(v_{max})) = O(N^3 \frac{1}{\epsilon})$.

The quantized dynamic programming incurs a performance loss. Let $\hat{V}'$ be the total
That is, the objective value given by quantized DP is at least (1)V ? . As we have shown, the FPTAS runs in O(N 3 1 ) time and guarantees (1) approximation ratio. Note that in (2.11), could be less than one for small and v max , in which we don't need quantized DP because it increases the number of sub-problems. Hence, for instances with small v max , the original DP works just ne. However, if v max is big enough such that the running time of original DP is considerable, we need the quantized DP, which has shorter running time and provable approximation guarantee. 31 quantizedvalueachievedbythepackingstrategysolvedbythequantizedDP,which implies its original value V 0 is at least ˆ V 0 . Let I ? be the set of items selected by the optimalstrategy. AssumethepackingstrategysolvedbythequantizedDPisdi↵ er- ent from I ? (If they are the same, then the quantized DP achieves the optimum). Since the quantized DP outputs a di↵ erent strategy, we have P i2 I ? q(v i ) ˆ V 0 . Otherwise, the quantized DP would rather pick I ? . Hence, we have V 0 ˆ V 0 X i2 I ? q(v i ) V ? N V ? ✏v max (1 ✏)V ? . (2.15) kv V ? ˆ V ? ˆ V ? V 0 ˆ V 0 ˆ V 0 V ? l (1 ✏)V ? (2.16) WeusethefactthatinquantizedDP,eachquantizedvaluelosesatmost compared to its original value. Hence, comparing V ? and P i2 I ? q(v i ), the total loss is at most N = ✏v max . Since v max V ? , we achieve the final result. That is, the objective value given by quantized DP is at least (1 ✏)V ? . As we have shown, the FPTAS runs in O(N 31 ✏ )timeandguarantees(1 ✏) approximation ratio. Note that in (2.11), could be less than one for small ✏ and v max , in which we don’t need quantized DP because it increases the number of sub-problems. Hence, for instances with small v max , the original DP works just fine. However, if v max is big enough such that the running time of original DP is considerable, we need the quantized DP, which has shorter running time and provable approximation guarantee. 
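The quantized DP of (2.11)-(2.13) can be sketched by running the value-based DP on the quantized values and reporting the original value of the chosen items (a minimal illustration, not the dissertation's implementation):

```python
INF = float("inf")

def knapsack_fptas(values, sizes, capacity, eps):
    """FPTAS sketch: quantize values with step delta = eps * v_max / N
    as in (2.11)-(2.12), run the value-based DP on quantized values per
    (2.13), then return the original value of the selected items."""
    n = len(values)
    delta = eps * max(values) / n              # step size (2.11)
    q = [int(v // delta) for v in values]      # quantizer (2.12)
    k_max = n * max(q)
    # C[k] = min size to reach quantized value k; P[k] = items used
    C = [INF] * (k_max + 1)
    C[0] = 0
    P = [[] for _ in range(k_max + 1)]
    for j, (kj, cj) in enumerate(zip(q, sizes)):
        for k in range(k_max, kj - 1, -1):     # recurrence (2.13)
            if C[k - kj] + cj < C[k]:
                C[k] = C[k - kj] + cj
                P[k] = P[k - kj] + [j]
    best_k = max(k for k in range(k_max + 1) if C[k] <= capacity)
    return sum(values[j] for j in P[best_k])

# Same toy instance as before; the optimum is 22 (items 2 and 3).
v = knapsack_fptas([6, 10, 12], [1, 2, 3], capacity=5, eps=0.2)
```

The guarantee (2.15) says the returned value is at least $(1-\epsilon)$ times the optimum; on this tiny instance the quantized DP happens to recover the exact optimum.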
Figure 2.2: Relationship between quantized domain and original domain

2.2.3 Guidelines to Derive FPTAS from DP

Since dynamic programming is often used to solve a large set of optimization problems but results in pseudo-polynomial time complexity, we want to know, given an original DP, whether there exists an FPTAS for it. In this section, we present some general guidelines for deriving an FPTAS from the original DP. We refer the readers to [25] for sufficient conditions under which a DP formulation possesses an FPTAS. We start from Fig.
2.2 to examine the values on the quantized domain and the original domain, and see how the performance bound was derived for the knapsack problem in the last section. First, in the quantized domain, we leverage the fact that the quantized DP outputs a strategy whose quantized value V̂' is no less than the quantized objective V̂* achieved by the optimal packing strategy. Mapping these numbers back to the original domain, we have V̂' ≥ V̂*. Our final goal is the result that V' ≥ (1 − ε)V*, where V' is the value achieved by the quantized DP's strategy and V* is the optimum. On one hand, since the quantizer always underestimates an input value, we have a lower bound on V', i.e., V' ≥ V̂'. On the other hand, the quantization loss of V* is no more than lδ, where l is the number of quantizations that were performed when solving the quantized DP. Since V̂* is the sum of the quantized values of the packed items, we have l ≤ N. Hence, we have

V' ≥ V* − lδ ≥ V* − Nδ = V* − εV_low ≥ (1 − ε)V*.   (2.16)

We can see that if V_low is a lower bound on V*, we arrive at the final result V' ≥ (1 − ε)V*.

The quantization step size δ plays an important role in the performance bound and the complexity of a quantized DP. The above example gives us some ideas on how to design the step size by reverse engineering. We list the guidelines as follows.

- Find V_low and V_up such that V_low ≤ V* ≤ V_up, where V_up = poly(N)·V_low.
- Identify l and set δ as εV_low/l.

To summarize, first we need an upper bound and a lower bound on V*. Given a problem instance, we try to sandwich V* by the input numbers of the instance. For a KP with item values v_1, …, v_N, we have v_max ≤ V* ≤ N·v_max, where v_max = max_{i∈[N]} v_i. The upper bound V_up specifies the dynamic range; that is, we only have to solve the sub-problems with values falling in [0, V_up]. The lower bound V_low should appear in the numerator of δ. We notice that V_up/δ dominates the number of sub-problems of the quantized DP.
Hence, if we can find V_up and V_low such that V_up = poly(N)·V_low (a factor bounded by a polynomial of the problem size), then the number of sub-problems would be independent of the input values v_i. Finally, by designing δ as mentioned and following the derivation, we achieve the desired result.

Algorithm 1 Binary Search FPTAS
1: procedure FPTAS(N, ε, V_up)
2:   for r ← 1, …, ⌈log₂ V_up⌉ do
3:     V_r ← V_up/2^(r−1) + c,  δ_r ← εV_up/(l·2^r)
4:     V' ← DP(q, V_r, δ_r)        ▷ solve space [0, V_r] using step size δ_r
5:     if V' ≥ V_up/2^r then       ▷ lower bound found, return
6:       return
7:     end if
8:   end for
9: end procedure

2.2.3.1 When the optimum cannot be properly bounded

Sandwiching V* by the bounds as required is sometimes non-trivial. Given a problem instance, finding an upper bound on V* is easy, but we then need a non-trivial lower bound satisfying V_up = poly(N)·V_low. In the cases where we fail to find a suitable lower bound, we perform a binary search that iteratively searches for the lower bound of V* in the active space and adapts the quantization step size to guarantee that its solution is accurate enough.

Algorithm 1 summarizes the process of searching for V_low. It calls the quantized DP to solve the spaces [0, V_up], [0, V_up/2], [0, V_up/4], …, until V_low has been found. We see in line 5 that the termination criterion implies V* ≥ V_up/2^r. Assume that this algorithm stops at the r-th round, using δ_r = εV_up/(l·2^r). From (2.16), with the newly found lower bound on V*, we have

V' ≥ V* − lδ_r = V* − εV_up/2^r ≥ (1 − ε)V*.   (2.17)

Now we bound the running time of Algorithm 1. In the r-th round, Algorithm 1 calls the quantized DP to solve the space [0, V_r], where V_r = V_up/2^(r−1) + c, with step size δ_r. We add a small constant c to avoid missing the optimal packing strategy. That is, in round r−1, if V' < V_up/2^r, then V* could possibly sit in the range [V_up/2^r, V_up/((1−ε)2^r)]. Hence, adding the constant c enables the algorithm to check both V' and V* in the r-th round.
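Algorithm 1's search loop can be sketched as follows. The `quantized_dp` callback, the default constant `c`, and the return convention are assumptions about the interface, not the thesis's exact procedure: the callback is assumed to solve the quantized DP over the value space [0, v_range] with step size delta and return the original (unquantized) objective value of the strategy it finds.

```python
import math

def binary_search_fptas(quantized_dp, v_up, eps, l, c=1.0):
    """Iteratively shrink the active value space [0, V_r] until a lower
    bound on the unknown optimum V* is certified by the returned solution."""
    for r in range(1, math.ceil(math.log2(v_up)) + 1):
        v_r = v_up / 2 ** (r - 1) + c          # small constant c guards the boundary
        delta_r = eps * v_up / (l * 2 ** r)    # step size adapted to the shrunken range
        v_sol = quantized_dp(v_r, delta_r)
        if v_sol >= v_up / 2 ** r:             # V* >= V' >= V_up / 2^r: lower bound found
            return v_sol
    return v_sol                               # smallest precision reached
```

Each round costs O(poly(N)·V_r/δ_r) = O(poly(N)·l/ε) regardless of r, so the total cost only picks up the extra log₂ V_up factor from the number of rounds.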
Clearly, for the r-th round, the running time of the quantized DP can be bounded by

O(poly(N)·V_r/δ_r) = O(poly(N)·l/ε).   (2.18)

That is, the running time is independent of r. Moreover, the poly(N) factor is related to the original DP, and l can usually be bounded by another polynomial of N (as we have seen, l = N for KP). Since Algorithm 1 stops either when the criterion is met or when the smallest precision is reached, i.e., after ⌈log₂ V_up⌉ rounds, the total running time can be bounded by

O(poly(N)·(l/ε)·log₂ V_up).   (2.19)

The term log₂ V_up is exactly the number of bits needed to represent V_up, which can be bounded by |I|. Hence, Algorithm 1 is an FPTAS. In Chapter 4, we design Hermes to solve the task assignment problem for minimum latency by following this binary search approach.

2.3 Online Learning: The Multi-armed Bandit Problems

The multi-armed bandit (MAB) problem was introduced by Robbins [26]. It is a sequential decision problem where at each time an agent chooses over a set of "arms", gets the payoff from the selected arms, and tries to learn statistical information from sensing them, which is helpful for future decisions. The payoff of each arm is usually modeled as an unknown stochastic process, which is also called a bandit process. These formulations have been widely used in scenarios where one tries to learn an unknown environment efficiently and aims to find the optimal policy to maximize the benefit.

An interesting example is a gambler facing a rack of slot machines. He has a limited number of chips and wants to use a strategic approach, not gambling on luck, to earn as much payoff as possible. Which machines should he pull? First, without knowing anything about these machines, he definitely needs some trials to learn whether a machine produces good payoffs or not. Suppose he tries machine A once and the result is bad. Should he try it one more time or try other machines? It depends on how valuable the information is and how many
If machine A is memoryless and highly variant, then only one bad result means little. He might want to giveA another chance to spit out high payo (or he shouldn't spend time on this machine because it is too risky). If machine A has memory, then maybe he would assume that it is in bad state and try other machines. He can wait for sometime and go back toA with the belief that the state has changed, or wait for other gamblers playing it for few more times to change its state. If he has a lot of chips, then it may be worth to spend several chips on each machine, get the sample averages and nd out the best one. If he has limited chips, then he might want to focus on two machines, get samples and satisfy with the payo given by the better one. These are not good strategies if the gambler know nothing about the machines. So let's wish him the best of luck! Mathematically, we simplify the tricky situation, make some assumptions on the machines and focus on designing a strategy and analyzing how good it is. To be consistent with our gambler example, we will continue using a machine as the \arm" whose payo is characterized by a bandit process. 2.3.1 Assumptions on a Bandit Process In the gambler example, the decision varies from whether a machine has memory or not. In the simplest case, we assume that the payo is modeled as an i.i.d. process. For each machine, the payo is independent between trials and is independent of 37 the payo given by other machines. Hence, the best strategy would be keep playing the machine with highest mean, so that we would get maximum total payo in the long run. However, we can only get limited samples from each machine, so we are facing a dilemma. If we spend only few trials on each machine and play the machine with highest sample average, then there could be a good machine which has been underestimated, or the best machine we select might be overestimated. Hence, we might not be playing the best machine. 
On the other hand, if we try to get lots of samples from each machine, we could get the best machine with higher condence, however, we might waste too many chips on sampling and hence we don't play the best machine often enough. This is a well-known trade-o between exploration and exploitation mentioned in the MAB literature [13]. A Markov process is often used to model a machine with memory. That is, its current state depends on the last state. For example, at each time slot, a machine can be in good state with higher payo or bad state with less payo. The transition probabilities are unknown and hence the goal is to learn these probabilities and adapt the optimal policy given an observation on a current state. Hence, if machine A is in bad state, based on the transition probabilities we learn, we decide to play it again (observe the next state) or play other machines. A bandit process characterized by states can be restful or restless. For a restful bandit, its state remain the same if there is no observation. Hence, we have to wait for other gamblers to play machineA and change its state. On the other hand, for a restless 38 bandit, its state changes with time even if there is no observation. Hence, we only know the machine's state from the last observation. Given machine A was in the bad state 10 minutes ago, we have to make decision with the belief that machine A is in good state now. The longer the time has passed from the last observation, the less condence we have on the current state. To learn multiple unknown and restless Markov processes is highly non-trivial. It is the hardest model that have been studied so far. Even when the bandits are i.i.d., the problem is still very challenging and there have been some well-known algorithms that deal with the bandits in this space [27]. In the following, we will brie y introduce some algorithms and discuss their performance. 
2.3.2 Performance Measurement

We measure the performance of an MAB algorithm by how much we lose due to not knowing the statistics of the bandit processes, called the "regret" [13]. For example, suppose the bandit processes are i.i.d. over time. If a genie knew which machine has the highest mean, he would simply play that machine all the time and hence would obtain the highest total payoff in the long run. Hence, an algorithm's performance is measured by the accumulated loss from not playing the best machine. For Markov processes, if the genie knows the transition probabilities, then he knows the best next action upon observing a machine's state; hence, we compare the performance with this optimal policy. Since the environment is dynamic, the regret is always measured in expectation.

As an MAB problem is a sequential decision problem, we want to know how the regret accumulates over time. If the regret grows as O(T) (linearly), it implies that we do not learn anything; the regret grows at the same rate as at the beginning, when we knew nothing about the environment. Hence, for an algorithm with learning ability, the regret must grow sub-linearly; for example, we would expect functions like O(√T), or even better, O(log T). Sub-linear regret implies that the longer the time that has passed, the more slowly the regret accumulates. Moreover, the per-slot regret vanishes asymptotically, which implies that the strategy given by the algorithm converges to the optimal strategy.

2.3.3 An Example: The UCB1 Algorithm

We assume the simplest case, where the bandit processes are unknown but bounded and i.i.d. over time, and introduce the well-known UCB1 algorithm [28] in this section. Under the i.i.d. assumption, we want to identify the machine with the highest average payoff.
Hence, we aim to get sample averages that are believed to be accurate enough, and to exploit the best machine, so that we can catch up with the performance of a genie who always plays the machine with the highest mean. However, we cannot take enough samples from each machine to make every sample average arbitrarily close to its actual mean, because every performance loss counts while we are not playing the best one. A reasonable first approach is to spend an ε fraction of the time sampling all machines thoroughly, and the remaining (1−ε) fraction playing the machine with the highest sample average (0 < ε < 1). This is called the ε-greedy strategy [27]. As the number of samples increases, we gain more confidence about which machine has a high payoff on average. Is it still worth spending an ε fraction of the time sampling every machine, as we did at the very beginning? The ε-greedy strategy has a performance loss that grows linearly with time, due to the fact that it spends a fixed fraction of time (ε) on sampling. Hence, it is more reasonable to sample the arms more frequently at the beginning, and to exploit the best arm once we have accumulated confidence. That is, an ε-decreasing strategy reduces the sampling frequency and focuses on exploitation as time goes on [27], in order to catch up with the performance of the genie. Auer et al. [28] propose UCB1, which is an ε-decreasing strategy shown to be competitive, with sub-linear regret. UCB1 starts by playing each machine once, and then plays the machine j* with the maximum index, defined as follows:

j* = arg max_j { x̄_j + √(2 ln T / T_j) },   (2.20)

where x̄_j is the average payoff obtained in the past, T_j is the number of times machine j has been played, and T is the total number of plays so far. The bandit processes are bounded between 0 and 1 without loss of generality.
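The index rule in (2.20) takes only a few lines to implement. Below is a minimal Python sketch; the interface (machines as zero-argument callables returning payoffs in [0, 1]) is an illustrative assumption.

```python
import math

def ucb1(arms, horizon):
    """UCB1: play each machine once, then play the machine j maximizing
    the index  x_bar_j + sqrt(2 ln T / T_j)  from (2.20).
    `arms` is a list of zero-argument callables returning payoffs in [0, 1]."""
    n = len(arms)
    counts = [0] * n            # T_j: number of times machine j has been played
    sums = [0.0] * n            # running payoff totals, so x_bar_j = sums[j] / counts[j]
    total_payoff = 0.0
    for t in range(1, horizon + 1):
        if t <= n:
            j = t - 1           # initialization: try each machine once
        else:
            j = max(range(n),
                    key=lambda k: sums[k] / counts[k]
                    + math.sqrt(2.0 * math.log(t) / counts[k]))
        x = arms[j]()
        counts[j] += 1
        sums[j] += x
        total_payoff += x
    return total_payoff, counts
```

With two deterministic machines paying 1 and 0, the confidence term √(2 ln T / T_j) lets the worse machine be played only O(ln T) times before its index falls below the better machine's mean, illustrating the O(ln T) regret discussed below.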
Figure 2.3: An example of an adversarial MAB

The index contains two terms. The first term implies that a machine with a higher sample mean is better. The second term implies that a machine that has been played less often has a higher index. These two terms balance exploration and exploitation. At the beginning, ln T grows fast, so T_j has to grow with it; this is when every machine is played frequently. As T increases, ln T slows down; hence, if a machine has a high sample mean, T_j can keep growing, while if a machine has a low sample mean, T_j has to slow down. To summarize, as time goes on, the machine with the higher sample mean is played much more frequently. UCB1 has a performance guarantee stated as bounded regret: its regret can be bounded by O(ln T). Since ln T grows slowly when T is large, which implies the regret grows slowly as well, UCB1 must be playing the best machine most of the time in the end. Hence, UCB1 is an index policy, which is lightweight and converges to the optimal policy. More detailed analysis and variations of UCB1 can be found in [28].

2.3.4 Adversarial Multi-armed Bandit Problems

In the physical world, things change over time, and not necessarily by following a stationary process. Hence, the restless and Markov assumptions, which appear in the most sophisticated MAB formulations, are still very strong. For example, we will never know how a slot machine works internally, as it is kept confidential. The machine's payoff distribution might change over time, or when certain events happen. Hence, in some scenarios we have to relax the assumptions. Auer et al. [29] present the adversarial MAB problem, where the payoff of each machine is modeled by a bounded sequence. The sequence is unknown, and does not have to evolve by following a stationary stochastic process.
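One standard algorithm for this adversarial setting is Exp3 of Auer et al. [30], discussed in the following paragraphs. A minimal Python sketch is given below; the interface (machines as callables `arms[j](t)` returning payoffs in [0, 1]) and the final rescaling line (a numerical safeguard against weight overflow) are implementation assumptions.

```python
import math
import random

def exp3(arms, horizon, gamma=0.1, seed=0):
    """Exp3: keep a weight per machine, mix the weight distribution with
    uniform exploration, and exponentially boost the weight of the played
    machine by an importance-weighted payoff estimate."""
    rng = random.Random(seed)
    m = len(arms)
    w = [1.0] * m                                  # one weight per machine
    total = 0.0
    for t in range(horizon):
        sw = sum(w)
        p = [(1.0 - gamma) * wj / sw + gamma / m for wj in w]
        j = rng.choices(range(m), weights=p)[0]    # draw a machine from p
        x = arms[j](t)
        total += x
        xhat = x / p[j]                            # unbiased estimate of machine j's payoff
        w[j] *= math.exp(gamma * xhat / m)         # only the played machine's weight changes
        mx = max(w)
        w = [wj / mx for wj in w]                  # rescale: avoids overflow, leaves p unchanged
    return total
```

The γ/m term keeps every machine's selection probability bounded away from zero, so the importance-weighted estimate x/p_j stays bounded no matter which sequences the adversary chooses.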
From another perspective, given an algorithm, an adversary could hurt it with arbitrary sequences, as long as they are bounded. Hence, the algorithm must adapt to all possible cases and give a performance guarantee on the regret. Fig. 2.3 shows an example of an adversarial MAB problem, where each machine's payoffs are represented by a sequence. At each time slot, an algorithm selects a machine and gets its payoff, while the payoffs of the other machines remain unknown. For example, the circled numbers are the payoffs exposed by selecting machines B, C, C, A, B, A in order. The algorithm aims to learn these sequences and remain competitive with the best machine; that is, with a genie who knows all the sequences beforehand and plays the best machine all the time.

Auer et al. later present the Exp3 algorithm and show its competitiveness in [30]. In Exp3, every machine has a weight w_j(t), which is updated upon getting a new sample, and Exp3 chooses over the machines with a probability distribution that depends on their weights. Specifically, at time slot t, Exp3 chooses machine j out of M machines with probability

p_j(t) = (1 − γ)·w_j(t)/Σ_{k∈[M]} w_k(t) + γ/M,   (2.21)

where γ is a constant between 0 and 1. Similar to an ε-greedy policy, Exp3 spends a γ fraction of time exploring all machines uniformly, and exploits the machines with higher scores more frequently for the rest of the time. Upon getting a new sample from machine j, w_j(t) is updated with an exponential function whose exponent depends on the sampled value; the weights of the other machines remain the same. Auer et al. show that the regret of Exp3 is bounded by O(√T) and propose some variations of Exp3 in [30]. First, compared to UCB1, whose regret is bounded by O(ln T), the regret of Exp3 grows faster, but its growth rate still diminishes when T is very large. This implies that some extreme sequences must hurt Exp3 badly; nevertheless, Exp3 can still learn these sequences and reduce the single-slot regret as time goes on. Second, one could ask whether an ε-decreasing strategy could do better than Exp3. Since the sequence might change drastically in a short time, without following a stationary stochastic process, Exp3 has to keep spending a fixed fraction of time exploring each machine. Hence, for machines whose payoffs evolve by following
Second, one would ask if an -decreasing could do better than Exp3. Since the sequence might change drastically in a short time, not following a stationary stochastic process, Exp3 has to keep spending a xed amount of time on exploring each machine. Hence, for machines whose payos evolve by following 44 stationary processes, an -decreasing policy could do better than Exp3. But in general, Exp3 has a better worse-case performance guarantee. In Chapter 8, we present MABSTA that extends Exp3 to learn the devices' performance and channel qualities over the resource network, and derive its performance guarantee with new techniques. 45 Chapter 3 Related Work We compare deterministic task assignment formulations on optimizing computa- tional ooading or collaborative computing over connected devices. For online learning to unknown environments, we compare our algorithms with the existing multi-armed bandit algorithms. Finally, we introduce several systems prototypes of collaborative computing. 3.1 TaskAssignmentFormulationsandAlgorithms Computational ooading, sending intensive tasks to more resourceful servers, is a solution to augment computing on resource-constrained devices [31]. This approach is similar to collaborative computing over multiple devices except that computa- tional ooading focuses on a scenario where a resource-constrained local device seeks help from remote servers or other devices. The formulation of task assign- ment problems in this context generally applies to scenarios where we have multiple 46 connected devices and want to nd out how to leverage these resources to achieve the objective, either to best benet a local device from computational ooading aspect, or to make most ecient usage of available devices from a network perspec- tive. Hence, we compare these related works, focusing on optimization formulations and algorithms. 
Table 3.1: Task Assignment Formulations and Algorithms

Algorithm    | Task Graphs       | Objectives    | # of Constraints   | # of Devices | Complexity  | Performance
MAUI [10]    | serial            | energy cost   | single             | two          | exponential | optimal
min k-cut [32] | DAG             | channel usage | none               | multiple     | exponential | optimal
Hermes (1)   | subset of DAG (2) | latency       | single             | multiple     | polynomial  | near-optimal
SARA (3)     | serial            | latency       | multiple           | multiple     | polynomial  | expected optimal
BiApp (3)    | serial            | latency       | multiple           | multiple     | polynomial  | bicriteria approximation
DTP (4)      | tree              | energy cost   | one, deterministic | two          | polynomial  | near optimal
PTP (4)      | tree              | energy cost   | one, probabilistic | two          | polynomial  | near optimal
Unknown & Dynamic Environments
DSEE (5)     | subset of DAG     | latency       | single             | multiple     | polynomial  | O(ln T) regret
MABSTA (6)   | subset of DAG     | cost          | applicable         | multiple     | polynomial  | O(√T) regret

(1) Hermes is based on our work in [33].
(2) We define a subset of DAG, called parallel chains of trees, in [33].
(3) SARA and BiApp are based on our work in [34].
(4) DTP and PTP are based on our work in [35].
(5) DSEE is based on our work in [36].
(6) MABSTA is based on our work in [37].

Table 3.1 summarizes the existing works and our algorithms. Of all optimization formulations, integer linear programming (ILP) is the most common, due to its flexibility and intuitive interpretation of the optimization problem [10, 38, 39]. In the well-known MAUI work, Cuervo et al. [10] propose an ILP formulation with a single latency constraint. However, ILP problems are generally NP-hard; that is, there is no polynomial-time algorithm to solve all instances of ILP unless P = NP [20]. In addition to ILP, graph partitioning is another approach [32]. The minimum cut on the weighted edges specifies the minimum communication cost and cuts the nodes into two disjoint sets: one is the set of tasks to be executed at the remote server, and the other is the set of tasks that remain on the local device. However, this approach is not applicable to constrained optimization. Furthermore, for task assignment over multiple devices, solving the generalized problem, minimum k-cut, is NP-hard [40].
We identify the fact that existing formulations are not applicable to some real scenarios. First, an application may have some tasks that can run in parallel. In general, the task graph can be described by a directed acyclic graph (DAG) [41]. Leveraging collaborative computing, a system can assign multiple parallel tasks to different devices that execute them separately, which increases efficiency compared to sequential processing. Second, the resource-constrained devices in the network may have their own limitations. Instead of greedily boosting the system performance, we have to consider a constrained optimization formulation. MAUI [10] has a constrained formulation but only applies to serial task graphs. On the other hand, the graph partitioning method [32] works for general task graphs but does not consider any constraints. Hence, it is necessary to relax MAUI's assumption on the task graph and consider the constraints on the resource network. Moreover, we aim to design efficient approximation algorithms to solve our more sophisticated formulation.

Existing works have focused on optimization formulations that assume the run-time environment is static and known. However, in real scenarios, the on-device resources and the network channel qualities may vary with time. Moreover, in some cases, they may vary in unpredictable ways, as when a device leaves the network or connections between mobile devices are intermittent. Hence, a stochastic optimization formulation is necessary. More importantly, we seek an algorithm that learns the unknown run-time environment and adapts the task assignment strategy to the changes.

3.2 Multi-armed Bandit Problems

The multi-armed bandit (MAB) problem is a sequential decision problem where at each time an agent chooses over a set of "arms", gets the payoff from the selected arms, and tries to learn statistical information from sensing them.
These formulations have been considered recently in the context of opportunistic spectrum access for cognitive radio wireless networks, but those formulations are quite different from ours in that they focus only on channel allocation and not also on allocating computational tasks to servers [42, 43]. Given an online algorithm for an MAB problem, its performance is measured by a regret function, which specifies how much the agent loses due to the information unknown at the beginning [13]. For example, we can compare the performance to that of a genie who knows the statistics of the payoff functions and selects the arms based on the best policy. If we always make blind decisions without considering sampling results, the accumulated loss grows linearly with time. Hence, we seek sub-linear regret, like O(log T) or O(√T), to show that the algorithm has learning ability: the accumulated performance loss slows down with time.

The exploration-exploitation trade-off arises when making optimized decisions: balancing between staying with the option that gave the highest payoffs in the past and exploring other options that might give higher payoffs in the future [44]. Hence, an MAB algorithm has to adopt a strategy balanced between the two phases. For example, in [45], the DSEE (deterministic sequencing of exploration and exploitation) algorithm clearly separates the exploration phase and the exploitation phase. When exploring, it samples every arm thoroughly by following a deterministic order. When exploiting, it plays the arm that has the highest sample mean.

Stochastic MAB problems model the payoff of each arm as a stationary random process and aim to learn the unknown information behind it. If the distribution is unknown but is assumed to be i.i.d. over time, Auer et al. [28] propose UCB algorithms to learn the unknown distribution with bounded regret. However, the assumption of i.i.d. processes does not always apply to the real environment.
On the other hand, Ortner et al. [46] assume the distribution is known to be a Markov process and propose an algorithm to learn the unknown state transition probabilities. However, the large state space of the Markov process causes the problem to be intractable. Adversarial MAB problems, in contrast, do not make any assumptions on the payoffs. Instead, an agent learns from a sequence given by an adversary who has complete control over the payoffs [29]. Beyond well-behaved stochastic processes, an algorithm for adversarial MAB problems gives a solution that applies to all bounded payoff sequences and provides a worst-case performance guarantee. To summarize, we have to identify the properties of the dynamic environment and choose the formulation that best matches it. Adversarial MAB applies to the most general environment; however, it has a weaker performance guarantee, since the analysis has to consider the most extreme cases. On the other hand, for application scenarios that fit the stationary process model, stochastic MAB may be applicable and give a stronger performance guarantee.

3.3 Our Approach and Proposed Algorithms

We take a step-by-step approach towards our goal, starting from deterministic optimization formulations with known profiles and moving to online learning in unknown environments. We consider more sophisticated formulations than the existing works. We show that these problems are NP-hard, and propose polynomial-time approximation algorithms to solve them with performance guarantees on the approximation ratios. Moving on to online learning problems, we adapt multi-armed bandit formulations and propose algorithms that run in polynomial time and give strategies whose performance loss is well-bounded. First, we relax the serial task graph assumption and propose Hermes, which aims to minimize the application latency within a feasible cost budget.
We show that Hermes is a fully polynomial time approximation scheme (FPTAS) [9] that gives a (1 + ε) approximation ratio with complexity bounded by a polynomial in 1/ε. Hence, compared to the existing works, Hermes applies to more general formulations, runs in polynomial time, and gives a near-optimal solution.

Second, we formulate a task assignment problem with multiple individual constraints on each device. We identify the fact that Hermes, which assigns tasks over multiple devices considering only an overall cost constraint, may rely mostly on one device, so that the heavy workload drains its battery. Hence, in the multi-constrained task assignment problem, we clearly attribute the cost to each device separately and aim to find a task assignment strategy that considers both system performance and cost balancing. We propose two polynomial-time algorithms with provable performance guarantees: SARA and BiApp. SARA rounds the fractional solution of an LP relaxation to achieve asymptotically optimal performance from a time-average perspective. BiApp is a bicriteria (α + 1, 2β + 2) approximation with respect to the latency objective and the cost constraints for each data frame processed by the application, where α and β are parameters quantifying bounds on the communication latency and the communication costs, respectively.

In addition to deterministic formulations, we consider task assignment in stochastic environments. Since average performance is not strong enough in some highly variant environments, we formulate a stochastic optimization problem with a probabilistic QoS constraint, P{Latency ≤ t ms} > p. That is, we have confidence that more than a fraction p of the time, the latency is less than t ms under the stochastic environment, where p and t are arbitrary numbers. We propose PTP, which runs in polynomial time and gives a near-optimal solution.
Last, we consider that the resources distributed over the network, including on-device resources and network bandwidth, may vary with time, and hence we have to learn this information and adapt to changes at run time. We formulate the online learning scenarios as multi-armed bandit problems, where devices and channels are modeled as arms that give payoffs with unknown statistics as their performance metrics. Different from the existing MAB algorithms, which can freely probe the desired arm at each time slot, our task assignment not only makes decisions on selecting devices but also affects the channel usage. Hence, we have to develop algorithms that jointly consider probing the devices and the channels between them. For stationary bandit processes, we adapt the sampling method DSEE of [45] and design a new sequence to sample both devices and channels. Our performance analysis shows that the new sampling method achieves O(ln T) regret, which is of the same order as the guarantee provided by DSEE. For non-stationary bandit processes, we adopt the adversarial MAB formulation, which does not make any assumptions on the stochastic processes. Since the Exp3 algorithm proposed in [30] is not applicable to our task assignment scenario, we propose a new algorithm, MABSTA, which jointly learns the performance of unknown devices and channels. Our performance analysis also shows that MABSTA achieves the same order of regret, O(√T), as Exp3 does.

3.4 System Prototypes

We classify the system prototypes by the number of devices (servers) that participate in collaborative computing. One extreme is the typical computational offloading approach, where there exists a one-to-one connection between a local device and a cloud server. MAUI [10] and CloneCloud [38] are systems that leverage the resources in the cloud. Odessa [11] identifies the bottleneck stage at run time, suggests an offloading strategy, and leverages data parallelism to mitigate the load on a mobile phone.
Another extreme is to exploit the computational resources on multiple connected devices. Shi et al. [47] investigate nearby mobile helpers reached by intermittent connections. CWC [48] uses idle mobile devices connected in the network, like mobile devices held by company employees, to run certain tasks as an alternative to enterprise servers. Between these two extremes, MapCloud [49] is a hybrid system that makes run-time decisions on using a "local" cloud with less computational resources but faster connections, or a "public" cloud that is distant but has more powerful servers at the price of longer communication delay. Cloudlets [50] is a 3-tier system that introduces a middle tier between mobile devices and the cloud, which forwards intensive tasks or data to the cloud if necessary, or takes the job itself if possible, considering QoS constraints. COSMOS [39] finds the customized and economical cluster in its size and its setup time considering the task complexity.

These system prototypes have demonstrated the promising applications of collaborative computing over multiple devices. However, an optimization formulation is essential so that a system can make intelligent decisions based on the run-time environment. As the numerical results presented in Chapter 4 show, the optimized strategy significantly outperforms the heuristic strategy in scenarios where staying at the local optimum incurs considerable performance loss. Hence, we are positive on the integration of these pioneering algorithms with the systems to make the best of collaborative computing.

Chapter 4
Deterministic Optimization with Single Constraint

In this study, assuming the application profile and the resource profiles are known and deterministic, we formulate a task assignment problem to minimize the application latency, subject to a cost constraint.
We show that this formulation is NP-hard and propose Hermes, a Fully Polynomial Time Approximation Scheme (FPTAS) that provides a solution with latency no more than (1 + ε) times the minimum, while incurring complexity that is bounded by a polynomial in the problem size and 1/ε. We evaluate the performance by using a real data set collected from several benchmarks, and show that Hermes improves the latency by 16% compared to a previously published heuristic while increasing CPU computing time by only 0.4% of the overall latency. This chapter is based on our works in [33, 36].

Table 4.1: Notations of SCTA

Notation       Description
m_i            workload of task i
d_mn           the amount of data exchange between task m and n
G(V, E)        task graph with set of nodes V and set of edges E
C(i)           set of children of node i
l              the depth of the task graph (the longest path)
d_in           the maximum indegree of the task graph
δ              quantization step size
[N]            the set {1, 2, …, N}
x ∈ [M]^N      assignment strategy of tasks 1, …, N
T_i^(j)        latency of executing task i on device j
T_mn^(jk)      latency of transmitting data between tasks m and n from device j to k
C_i^(j)        cost of executing task i on device j
C_mn^(jk)      cost of transmitting data between tasks m and n from device j to k
D(i, x)        accumulated latency when task i finishes, given strategy x

4.1 Problem Formulation

We call our formulation the Single-Constrained Task Assignment problem (SCTA), as a comparison with the Multi-Constrained Task Assignment problem (MCTA) that is presented in the next chapter. Table 4.1 summarizes the notations used in SCTA. We introduce each component of the formulation as follows.

4.1.1 Task Graph

An application profile can be described by a directed graph G(V, E) as shown in Fig. 4.1, where nodes stand for tasks and directed edges stand for data dependencies. A task precedence constraint is described by a directed edge (m, n), which implies

Figure 4.1: An example of an application task graph

that task n relies on the result of task m. That is, task n cannot start until it gets the result of task m. The weight on each node specifies the workload of the task, while the weight on each edge shows the amount of data communication between the two tasks. In addition to the application profile, there are some graph parameters that appear in our complexity analysis. We use N to denote the number of tasks and M to denote the number of available devices in the network. For each task graph, there is an initial task (task 1) that starts the application and a final task (task N) that terminates it. A path from the initial task to the final task can be described by a sequence of nodes, where every pair of consecutive nodes is connected by a directed edge. We use l to denote the maximum number of nodes in a path, i.e., the length of the longest path. Finally, d_in denotes the maximum indegree of the task graph. Using Fig. 4.1 as an example, we have l = 7 and d_in = 2.

4.1.2 Cost and Latency

Let C_i^(j) be the execution cost of task i on device j and C_mn^(jk) be the transmission cost of the data between tasks m and n through the channel from device j to k. Similarly, the latency consists of the execution latency T_i^(j) and the transmission latency T_mn^(jk).
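These per-task and per-link profiles can be represented directly in code. The following minimal sketch (a hypothetical three-task chain on two devices, with made-up numbers; it anticipates the additive cost and recursive latency definitions formalized in the next subsection) evaluates a given assignment:

```python
# Toy profiles for a 3-task chain 1 -> 2 -> 3 on M = 2 devices (all numbers hypothetical).
# T_exec[i][j]: latency of executing task i on device j; C_exec[i][j]: its cost.
# T_tx[(m, n)][j][k]: latency of sending the (m, n) data from device j to k; C_tx likewise.

T_exec = {1: [2.0, 1.0], 2: [4.0, 1.5], 3: [2.0, 1.0]}
C_exec = {1: [0.0, 2.0], 2: [0.0, 3.0], 3: [0.0, 2.0]}
edges = [(1, 2), (2, 3)]
T_tx = {e: [[0.0, 1.0], [1.0, 0.0]] for e in edges}   # zero latency on the same device
C_tx = {e: [[0.0, 0.5], [0.5, 0.0]] for e in edges}

children = {1: [], 2: [1], 3: [2]}                    # C(i): tasks whose results task i needs

def total_cost(x):
    """Total cost of assignment x (x[i] = device of task i): additive over nodes and edges."""
    return (sum(C_exec[i][x[i]] for i in T_exec)
            + sum(C_tx[(m, n)][x[m]][x[n]] for (m, n) in edges))

def latency(i, x):
    """Accumulated latency D(i, x): slowest child branch plus task i's own execution."""
    branch = max((latency(m, x) + T_tx[(m, i)][x[m]][x[i]] for m in children[i]),
                 default=0.0)
    return branch + T_exec[i][x[i]]

x = {1: 0, 2: 1, 3: 0}        # run task 2 remotely, tasks 1 and 3 locally
print(total_cost(x))          # → 4.0
print(latency(3, x))          # → 7.5
```

Note that `total_cost` and `latency` are illustrative helpers, not the thesis' implementation; they simply evaluate one candidate strategy, whereas the chapter's contribution is searching over all M^N strategies efficiently.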
Given a task assignment strategy x ∈ {1, …, M}^N, where the i-th component, x_i, specifies the device that task i is assigned to, the total cost can be described as follows:

Cost = Σ_{i∈[N]} C_i^(x_i) + Σ_{(m,n)∈E} C_mn^(x_m x_n).    (4.1)

As described in the equation, the total cost is additive over the nodes (tasks) and edges of the graph. For a tree-structured task graph, the accumulated latency up to task i depends on its preceding tasks. Let D(i, x) be the accumulated latency when task i finishes, given the assignment strategy x, which can be recursively defined as

D(i, x) = max_{m∈C(i)} { D(m, x) + T_mi^(x_m x_i) } + T_i^(x_i).    (4.2)

We use C(i) to denote the set of children of node i. For example, in Fig. 4.2, the children of task 6 are task 4 and task 5. For each branch led by node m, the latency accumulates as the latency up to task m plus the latency caused by the data transmission between m and i. D(i, x) is determined by the slowest branch.

4.1.3 Optimization Problem

Given an application, described by a task graph, and a resource network, described by {C_i^(j), C_mn^(jk), T_i^(j), T_mn^(jk)}, our goal is to find a task assignment strategy x that minimizes the total latency and satisfies the cost constraint, that is,

SCTA:  min_{x ∈ [M]^N} D(N, x)   s.t.  Cost ≤ B.

The Cost and D(N, x) are defined in (4.1) and (4.2), respectively. The constant B specifies the cost constraint, for example, the energy consumption of battery-operated devices. In Section 4.3, we propose an approximation algorithm based on dynamic programming to solve this problem and show that its running time is bounded by a polynomial of 1/ε with approximation ratio (1 + ε).

4.2 Proof of NP-hardness of SCTA

We reduce the 0-1 knapsack problem to a special case of SCTA, where a binary partition is made on a serial task graph without considering data transmission. Since the 0-1 knapsack problem is NP-hard [15], SCTA is at least as hard.
Assuming that C_i^(0) = 0 for all i, the special case of SCTA can be written as

SCTA′:  min_{x_i ∈ {0,1}}  Σ_{i=1}^N [ (1 − x_i) T_i^(0) + x_i T_i^(1) ]
        s.t.  Σ_{i=1}^N x_i C_i^(1) ≤ B.

Given N items with values {v_1, …, v_N} and weights {w_1, …, w_N}, one wants to decide which items to pack so as to maximize the overall value while satisfying the total weight constraint, that is,

KP:  max_{x_i ∈ {0,1}}  Σ_{i=1}^N x_i v_i
     s.t.  Σ_{i=1}^N x_i w_i ≤ B.

Now KP can be reduced to SCTA′ by the following encoding:

T_i^(0) = 0,  T_i^(1) = −v_i,  C_i^(1) = w_i,  ∀i,

so that minimizing the total latency in SCTA′ is equivalent to maximizing the packed value in KP. By giving these inputs to SCTA′, we can solve KP exactly; hence,

KP ≤_p SCTA′ ≤_p SCTA.    (4.3)

4.3 Hermes: FPTAS Algorithms

We first propose the approximation scheme to solve SCTA for a tree-structured task graph and prove that this simplest version of the Hermes algorithm is an FPTAS. Then we solve more general task graphs by calling the proposed algorithm for trees a polynomial number of times.

4.3.1 Tree-structured Task Graph

We propose a dynamic programming method to solve the problem on tree-structured task graphs. For example, in Fig. 4.2, the minimum latency when task 6 finishes depends on when and where tasks 4 and 5 finish. Hence, prior to solving the minimum latency of task 6, we want to solve both task 4 and task 5 first. We exploit the
fact that the sub-trees rooted at task 4 and task 5 are independent. That is, the assignment strategy on tasks 1, 2 and 4 does not affect the strategy on tasks 3 and 5. Hence, we can solve the sub-problems independently and combine them when considering task 6.

Figure 4.2: A tree-structured task graph with independent child sub-problems

Figure 4.3: Hermes' methodology

We define the sub-problem as follows. Let C[i, j, t] denote the minimum cost when finishing task i on device j within latency t. We will show that by solving

Algorithm 2 Find maximum latency given the problem instance
1: procedure FIND_Δ(N)
2:   q ← BFS(G, N)              ▷ run BFS from node N and store visited nodes in order
3:   for i ← q.end, q.start do  ▷ start from the last element in q
4:     if i is a leaf then      ▷ L[i, j]: max latency finishing task i on device j
5:       L[i, j] ← T_i^(j)  ∀j ∈ [M]
6:     else
7:       for j ← 1, M do
8:         L[i, j] ← T_i^(j) + max_{m∈C(i)} max_{k∈[M]} { L[m, k] + T_mi^(kj) }
9:       end for
10:    end if
11:  end for
12:  Δ ← max_{j∈[M]} L[N, j]
13: end procedure

Algorithm 3 Hermes FPTAS for tree-structured task graph
1: procedure FPTAS_tree(N, ε)  ▷ solve sub-problems for task N
2:   Δ ← FIND_Δ(N)             ▷ find the dynamic range [0, Δ] that covers all assignment strategies
3:   q ← BFS(G, N)             ▷ run BFS from node N and store visited nodes in order
4:   for r ← 1, log₂ Δ do
5:     Δ_r ← 2^(1−r) Δ;  δ_r ← εΔ / (l · 2^r)
6:     x̃ ← DP(q, Δ_r, δ_r)     ▷ solve sub-problems in [0, Δ_r] using step size δ_r
7:     if L(x̃) ≥ (1 + ε) · 2^(−r) Δ then  ▷
L(~ x)D(N; ~ x) 8: return 9: end if 10: end for 11: end procedure 12: 13: procedure DP (q;T up ;) 14: K d Tup e 15: for i q.end; q.start do . start from the last element in q 16: if i is a leaf then . initialize C values of leaves 17: C[i;j;k] ( C (j) i 8kq (T (j) i ) 1 otherwise 18: else 19: for j 1;M, k 1;K do 20: Calculate C[i;j;k] from (4.7) 21: end for 22: end if 23: end for 24: k min min j2[M] k s.t. C[N;j;k]B 25: end procedure all of the sub-problems for i2 [N], j2 [M] and t2 [0; ] with suciently large , the optimal strategy can be obtained by combining the solutions of these sub- problems. Fig. 4.3 shows our methodology. Each circle marks the performance given by an assignment strategy, with x-component as cost and y-component as latency. Our goal is to nd out the red circle, that is, the strategy that results in minimum latency and satises the cost constraint. Under each horizontal line 65 y = t, we rst identify the circle with minimum x-component, which species the least-cost strategy among all of strategies that result in latency at most t. These solutions are denoted by the lled circles. In the end, we look at the one in the left plane (xB) whose latency is the minimum. Instead of solving innite number of sub-problems for allt2 [0; ], we discretize the time domain by using the quantization function q (x) =k; if (k 1)<xk: (4.4) It suces to solve all the sub-problems for k 2 f1; ;Kg, where K = d e. We will analyze how the performance is aected due to the loss of precision by doing quantization and the trade-o with algorithm complexity after we present our algorithm. Suppose we are solving the sub-problem C[i;j;k], given that all of sub-problems of the preceding tasks have been solved, the recursive relation can be described as follows. 
C[i, j, k] = C_i^(j) + min_{x_m: m∈C(i)} { Σ_{m∈C(i)} ( C[m, x_m, k − k_m] + C_mi^(x_m j) ) },    (4.5)

k_m = q_δ( T_i^(j) + T_mi^(x_m j) ).    (4.6)

That is, to find the minimum cost within latency k at task i, we trace back to its child tasks and find the minimum cost over all possible strategies, with the latency excluding the execution delay of task i and the data transmission delay. As the cost function is additive over tasks and the decision on each child task is independent of the others, we can further reduce the solution space from M^z to zM, where z is the number of child tasks of task i. That is, by making the decision on each child task independently, we have

C[i, j, k] = C_i^(j) + Σ_{m∈C(i)} min_{x_m∈[M]} { C[m, x_m, k − k_m] + C_mi^(x_m j) }.    (4.7)

After solving all the sub-problems C[i, j, k], we solve for the optimal strategy by performing the combining step

min_{j∈[M], k} k  s.t.  C[N, j, k] ≤ B.    (4.8)

Let |I| be the number of bits required to represent an instance of our problem. As an FPTAS runs in time bounded by a polynomial of the problem size |I| and 1/ε [9], we have to bound K by choosing Δ large enough to cover the dynamic range, and by choosing the quantization step size δ to achieve the required approximation ratio. To find Δ, we solve an unconstrained problem for the maximum latency given the input instance. We propose a polynomial-time dynamic programming method that solves this problem exactly, summarized in Algorithm 2. To see how the solution provided by Hermes approximates the minimum latency, we take an iterative approach and reduce the dynamic range and step size in each iteration until the solution is close enough to the minimum.

We summarize Hermes for tree-structured task graphs in Algorithm 3. In the r-th iteration, we solve for half of the dynamic range with half of the step size compared to the last iteration. The procedure DP solves for the minimum quantized latency based on the dynamic programming described in (4.7).
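A runnable sketch of this quantized dynamic program, covering the leaf initialization, the child-wise minimization in (4.7), and the combining step in (4.8), on a hypothetical three-task chain (all numbers are made up; `hermes_tree` is an illustration of the recursion, not the thesis' implementation, and it takes Δ and δ as fixed inputs rather than iterating as Algorithm 3 does):

```python
import math

INF = float("inf")

# Toy chain 1 -> 2 -> 3 on M = 2 devices (hypothetical numbers), listed topologically.
M, tasks = 2, [1, 2, 3]
children = {1: [], 2: [1], 3: [2]}
T_exec = {1: [2.0, 1.0], 2: [4.0, 1.5], 3: [2.0, 1.0]}
C_exec = {1: [0.0, 2.0], 2: [0.0, 3.0], 3: [0.0, 2.0]}
T_tx = {(1, 2): [[0.0, 1.0], [1.0, 0.0]], (2, 3): [[0.0, 1.0], [1.0, 0.0]]}
C_tx = {(1, 2): [[0.0, 0.5], [0.5, 0.0]], (2, 3): [[0.0, 0.5], [0.5, 0.0]]}

def hermes_tree(B, delta, Delta):
    """Return the minimum quantized latency achievable within budget B.

    C[i][j][k]: min cost of finishing task i on device j within latency k*delta.
    """
    K = math.ceil(Delta / delta)
    q = lambda t: math.ceil(t / delta)          # quantization index of latency t, per (4.4)
    C = {}
    for i in tasks:
        C[i] = [[INF] * (K + 1) for _ in range(M)]
        for j in range(M):
            for k in range(1, K + 1):
                if not children[i]:             # leaf initialization
                    if k >= q(T_exec[i][j]):
                        C[i][j][k] = C_exec[i][j]
                    continue
                total = C_exec[i][j]
                for m in children[i]:           # decide each child independently, per (4.7)
                    best = INF
                    for xm in range(M):
                        km = q(T_exec[i][j] + T_tx[(m, i)][xm][j])   # per (4.6)
                        if k - km >= 0 and C[m][xm][k - km] < INF:
                            best = min(best, C[m][xm][k - km] + C_tx[(m, i)][xm][j])
                    total += best
                C[i][j][k] = total
    # Combining step (4.8): smallest quantized latency whose minimum cost meets the budget.
    for k in range(1, K + 1):
        for j in range(M):
            if C[tasks[-1]][j][k] <= B:
                return k * delta
    return None
```

With δ = 0.5 every latency in this toy instance is a multiple of the step size, so no precision is lost: under budget B = 4.0 the cheapest-enough strategy runs task 2 remotely (latency 7.5), while B = 0 forces everything local (latency 8.0).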
Let x̃ be the output strategy suggested by the procedure and L(x̃) be its total latency. Algorithm 3 stops when L(x̃) ≥ (1 + ε)2^(−r)Δ, or after running log₂ Δ iterations, which implies that the smallest precision has been reached.

Theorem 1. Algorithm 3 runs in O(d_in N M² (l/ε) log₂ Δ) time and admits a (1 + ε) approximation ratio.

Proof. From Algorithm 3, each DP procedure solves NMK sub-problems, where K = ⌈Δ_r/δ_r⌉ = O(l/ε). Let d_in denote the maximum indegree of the task graph. For solving each sub-problem in (4.7), there are at most d_in minimization problems over M devices. Hence, the overall complexity of a DP procedure can be bounded by

O(NMK · d_in M) = O(d_in N M² l/ε).    (4.9)

Algorithm 3 involves at most log₂ Δ iterations; hence, it runs in O(d_in N M² (l/ε) log₂ Δ) time. Since both l and d_in of a tree can be bounded by N, and log₂ Δ is bounded by the number of bits needed to represent the instance, Algorithm 3 runs in time polynomial in the problem size |I| and 1/ε.

Now we prove the performance guarantee provided by Algorithm 3. For a given strategy x, let L̂(x) denote the quantized latency and L(x) denote the original one; that is, L(x) = D(N, x). Assume that Algorithm 3 stops at the r-th iteration and outputs the assignment strategy x̃. As x̃ is the strategy with the minimum quantized latency solved by Algorithm 3, we have L̂(x̃) ≤ L̂(x*), where x* denotes the optimal strategy. For a task graph with depth l, at most l quantization procedures are taken along any path, and by the quantization defined in (4.4), each over-estimates by at most δ_r. Hence, we have

L(x̃) ≤ L̂(x̃) ≤ L̂(x*) ≤ L(x*) + lδ_r.    (4.10)

Since Algorithm 3 stops at the r-th iteration, we have

(1 + ε)2^(−r)Δ ≤ L(x̃) ≤ L(x*) + lδ_r = L(x*) + ε2^(−r)Δ.    (4.11)

That is,

2^(−r)Δ ≤ L(x*).    (4.12)

From (4.10), we achieve the approximation ratio as required:

L(x̃) ≤ L(x*) + lδ_r = L(x*) + ε2^(−r)Δ ≤ (1 + ε)L(x*).    (4.13)

Algorithm 4 Hermes FPTAS for serial trees
1: procedure FPTAS_path(N)  ▷ min.
cost when task N finishes at devices 1, …, M within latencies 1, …, K
2:   for root i_l, l ∈ {1, …, n} do  ▷ solve the conditional sub-problem for every tree
3:     for j ← 1, M do
4:       call FPTAS_tree(i_l) conditioned on j with the modification described in (4.14)
5:     end for
6:   end for
7:   for l ← 2, n do
8:     perform the combining step in (4.15) to solve C[i_l, j_l, k_l]
9:   end for
10: end procedure

Figure 4.4: A task graph of serial trees

As a chain is a special case of a tree, Algorithm 3 also applies to the task assignment problem of serial tasks. Instead of using the ILP solver to solve the formulation for serial tasks proposed previously in [10], we have therefore provided an FPTAS to solve it.

4.3.2 Serial Trees

In [11], several applications are modeled as task graphs that start from a unique initial task, then split into multiple parallel tasks and, finally, merge into one final task. Hence, the task graph is neither a chain nor a tree. In this section, we show that by calling Algorithm 3 a polynomial number of times, Hermes can solve task graphs that consist of serial trees.

The task graph in Fig.
4.4 can be decomposed into 3 trees connected serially, where the first tree (a chain) terminates in task i_1 and the second tree terminates in task i_2. In order to find C[i_3, j_3, k_3], we solve each tree independently, conditioned on where the root task of the preceding tree ends. For example, we can solve C[i_2, j_2, k_2 | j_1], which is the strategy that minimizes the cost such that task i_2 ends at device j_2 within delay k_2, given that task i_1 ends at device j_1. Algorithm 3 can solve this sub-problem with the following modification for the leaves:

C[i, j, k | j_1] = { C_i^(j) + C_{i_1 i}^(j_1 j)   ∀ k ≥ q(T_i^(j) + T_{i_1 i}^(j_1 j)),
                   { ∞                              otherwise.   (4.14)

To solve C[i_2, j_2, k_2], the minimum cost up to task i_2, we perform the combining step

C[i_2, j_2, k_2] = min_{j∈[M]} min_{k_x + k_y = k_2} C[i_1, j, k_x] + C[i_2, j_2, k_y | j].   (4.15)

Similarly, combining C[i_2, j_2, k_x] and C[i_3, j_3, k_y | j_2] gives C[i_3, j_3, k_3]. Algorithm 4 summarizes the steps for solving the assignment strategy for serial trees. Solving each tree involves M calls under different conditions. Further, the number of trees n is bounded by N. The latency of each tree is within (1 + ε) of optimal, which leads to the (1 + ε) approximation of the total latency. Hence, Algorithm 4 is also an FPTAS.

4.3.3 Parallel Chains of Trees

We take a step further and extend Hermes to more complicated task graphs that can be viewed as parallel chains of trees, as shown in Fig. 4.1. Our approach is to solve each chain by calling FPTAS_path, conditioned on the task where the chains split. For example, in Fig. 4.1 there are two chains that can be solved independently by conditioning on the split node. The combining procedure consists of two steps. First, solve C[N, j, k | j_split] by (4.7), conditioned on the split node. Then C[N, j, k] can be solved similarly by combining two serial blocks via (4.15). Since FPTAS_path is called at most d_in times, this proposed algorithm is also an FPTAS.
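The combining step in (4.15) is essentially a min-plus combination over devices and over all ways of splitting the quantized latency budget. The following sketch (hypothetical table layout and function names; latencies are assumed to be already quantized to integer indices 0, …, K as in Algorithm 3) illustrates the operation:

```python
INF = float("inf")

def combine_serial(C1, C2_cond, M, K):
    """Combining step for two serially connected trees, in the spirit of (4.15).

    C1[j][k]           : min cost so that the first tree's root task ends on
                         device j within quantized latency k.
    C2_cond[j1][j2][k] : min cost of the second tree, its root ending on
                         device j2 within latency k, given the first tree's
                         root ended on device j1 (cf. the modification (4.14)).
    Returns C with C[j2][k2] = min over j and k_x + k_y = k2 of
    C1[j][k_x] + C2_cond[j][j2][k_y].
    """
    C = [[INF] * (K + 1) for _ in range(M)]
    for j2 in range(M):
        for k2 in range(K + 1):
            best = INF
            for j in range(M):            # device where the first root ended
                for kx in range(k2 + 1):  # split of the latency budget
                    best = min(best, C1[j][kx] + C2_cond[j][j2][k2 - kx])
            C[j2][k2] = best
    return C
```

Each combined entry takes O(MK) work, consistent with the polynomial combining overhead claimed above.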
4.3.4 More General Task Graphs

The Hermes algorithm can in fact be applied to even more general graphs, albeit with weaker guarantees. In this section, we outline a general approach based on identifying the "split nodes": nodes in the task graph with more than one outgoing edge. In the three categories of task graph we have considered so far, each split node is only involved in the local decision of two trees. That is, in the combining stage shown in (4.15), there is only one variable on the node that connects two serial trees. Hence, the decision on this device can be made locally. Our general approach is to decompose the task graph into chains of trees and call the polynomial-time procedure FPTAS_path to solve each of them. If a split node connects two trees from different chains, then we cannot resolve this conditioning variable and have to keep it until we make the decision on the node where all of the involved chains merge. We use the task graph in Fig. 4.1 as an example: as the node (marked with split) splits over two chains, we have to keep it until we make decisions on the final task, where the two chains merge. On the other hand, there are some nodes that split locally, which can be resolved within the FPTAS_path procedure. A node that splits across two different chains requires O(M) calls of FPTAS_path. Hence, the overall complexity of Hermes on such graphs is O(M^S), where S is the number of "global" split nodes.

If the task graph contains cycles, a similar argument can be made by classifying them into local cycles and global cycles. A cycle is local if all of its nodes are contained in the same chain of trees, and global otherwise. For a local cycle, we solve the block that contains it, conditioning on the node with the edge that enters the cycle and the node with the edge that leaves it. However, if the cycle is global, more conditions have to be kept at the global split nodes, and hence the complexity is no longer bounded by a polynomial.
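As an illustration of the classification just described, the sketch below labels each split node of a task DAG as local or global, assuming a precomputed chain-of-trees decomposition is supplied as a node-to-chain map (the edge list and the `chain_of` map are hypothetical inputs for illustration, not part of Hermes itself):

```python
from collections import defaultdict

def classify_split_nodes(edges, chain_of):
    """Label split nodes (out-degree > 1) of a task DAG as local or global.

    edges    : list of (u, v) directed edges of the task graph.
    chain_of : assumed precomputed map from each node to the id of the
               chain of trees it belongs to.
    A split node is 'local' when all its outgoing edges stay within its
    own chain, and 'global' otherwise (requiring extra conditioning).
    """
    out = defaultdict(list)
    for u, v in edges:
        out[u].append(v)
    labels = {}
    for u, succs in out.items():
        if len(succs) > 1:  # more than one outgoing edge: a split node
            same_chain = all(chain_of[v] == chain_of[u] for v in succs)
            labels[u] = "local" if same_chain else "global"
    return labels
```

With S global labels returned, the O(M^S) complexity estimate above follows directly from counting the unresolved conditioning variables.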
The structure of a task graph depends on the granularity of the partition. If an application is partitioned into methods, many recursive loops are involved. If an application is partitioned into tasks, where a task is a block of code consisting of multiple methods, the structure is simpler. As we show in the following, Hermes can tractably handle practical applications whose graph structures are similar to the benchmarks in [11].

4.4 Numerical Evaluation

Figure 4.5: Hermes converges to the optimum as ε decreases
Figure 4.6: Hermes over 200 different application profiles
Figure 4.7: Hermes in dynamic environments

First, we verify that Hermes provides a near-optimal solution with tractable complexity and a performance guarantee. Then, we use a real data set of several benchmark profiles to evaluate the performance of Hermes and compare it with the heuristic Odessa approach proposed in [11].

4.4.1 Algorithm Performance

From our analysis in Section 4.3, the Hermes algorithm runs in O(d_in N M² (l/ε) log₂ T) time with approximation ratio (1 + ε). In the following, we provide numerical results to show the trade-off between complexity and accuracy.

Figure 4.8: Hermes: CPU time measurement

Given the task graph shown in Fig. 4.1 and M = 3, the performance of Hermes versus different values of ε is shown in Fig. 4.5. When ε = 0.4, the performance converges to the minimum latency. Fig. 4.5 also shows the worst-case performance bound as a dashed line. The actual performance is much better than the (1 + ε) bound.

We examine the performance of Hermes on different problem instances. Fig. 4.6 shows the performance of Hermes on 200 different application profiles.
Each profile is selected independently and uniformly from an application pool with different task workloads and data communications. The result shows that for every instance we have considered, the performance is much better than the (1 + ε) bound and converges to the optimum as ε decreases.

If the means of these stochastic processes are known, Hermes can solve for the best strategy based on these means. Fig. 4.7 shows how the strategies suggested by Hermes perform in a dynamic environment. The average performance is taken over 10000 samples. From Fig. 4.7, the solution converges to the optimal one as ε decreases, which minimizes the expected latency and satisfies the expected cost constraint.

4.4.2 CPU Time Evaluation

Fig. 4.8 shows the CPU time for Hermes to solve for the optimal strategy as the problem size scales. We use an Apple MacBook Pro equipped with a 2.4 GHz dual-core Intel Core i5 processor and 3 MB cache as our testbed, and use the Java management package for CPU time measurement. For each problem size, we measure Hermes' CPU time over 100 different problem instances and show the average, with vertical bars marking the standard deviation. As the number of tasks (N) increases in a serial task graph, the CPU time needed by the brute-force algorithm grows exponentially, while Hermes scales well and still provides a near-optimal solution (ε = 0.01). From our complexity analysis, for a serial task graph l = N, d_in = 1, and with M = 3 fixed, the CPU time of Hermes can be bounded by O(N²).

4.4.3 Benchmark Evaluation

In [11], Ra et al. present several benchmarks of perception applications for mobile devices and propose a heuristic approach, called Odessa, to improve both makespan and throughput with the help of a cloud-connected server. They call each edge and node in the task graph a stage and record timestamps on each of them.
To improve the performance, for each data frame Odessa first identifies the bottleneck, evaluates each strategy with simple metrics, and finally selects the potentially best one to mitigate the load on the bottleneck. However, this greedy heuristic does not offer any theoretical performance guarantee; as shown in Fig. 4.9, Hermes improves the performance by 36% for the task graph in Fig. 4.1. Hence, we further choose two of the benchmarks, face recognition and pose recognition, to compare the performance of Hermes and Odessa. Taking the timestamps of every stage and the corresponding statistics measured in real executions, as provided in [11], we emulate the executions of these benchmarks and evaluate the performance.

In dynamic resource scenarios, since Hermes' complexity is not as light as the greedy heuristic (86.87 ms on average) and its near-optimal strategy need not be updated from frame to frame under similar resource conditions, we propose the following online update policy: similar to Odessa, we record the timestamps for online profiling. Whenever the latency difference between the current frame and the last frame exceeds a threshold, we run Hermes on the current profiling data to update the strategy. By doing so, Hermes always gives a near-optimal strategy for the current resource scenario and enhances the performance at the cost of a reasonable CPU time overhead due to re-solving for the strategy.

As Hermes provides better latency but a larger CPU time overhead when updating, we define two metrics for comparison. Let Latency(t) be the normalized latency advantage of Hermes over Odessa up to frame number t. On the other hand, let CPU(t) be the normalized CPU advantage of Odessa over Hermes up to frame number t.
That is,

Latency(t) = (1/t) Σ_{i=1}^{t} [L_O(i) − L_H(i)],   (4.16)

CPU(t) = (1/t) [Σ_{i=1}^{C(t)} CPU_H(i) − Σ_{i=1}^{t} CPU_O(i)],   (4.17)

where L_O(i) and CPU_O(i) are the latency and update time of frame i given by Odessa; the notations for Hermes are similar, except that we use C(t) to denote the number of times that Hermes updates the strategy up to frame t.

To model the dynamic resource network, the latency of each stage is selected independently and uniformly from a distribution with its mean and standard deviation given by the statistics of the data set measured in real applications. In addition to this small-scale variation, the link coherence time is 20 data frames. That is, for some periods, the link quality degrades significantly due to possible fading. Fig. 4.10 shows the performance of Hermes and Odessa for the face recognition application. Hermes improves the average latency of each data frame by 10% compared to Odessa and increases CPU computing time by only 0.3% of the overall latency. That is, the latency advantage provided by Hermes well compensates for its CPU time overhead. Fig. 4.11 shows that Hermes improves the average latency of each data frame by 16% for the pose recognition application and increases CPU computing time by 0.4% of the overall latency. When the link quality is degrading, Hermes updates the strategy to reduce the data communication, while Odessa's sub-optimal strategy results in significant extra latency. Considering that CPU processing speed keeps increasing under Moore's law while network conditions do not improve as fast, Hermes provides a promising approach to trade more CPU time for less network consumption cost.

4.4.4 Discussion

We have formulated a task assignment problem and provided an FPTAS algorithm, Hermes, to solve for the optimal strategy that balances latency improvement against the energy consumption of battery-operated devices.
Compared with previous formulations and algorithms, to the best of our knowledge, Hermes is the first polynomial-time algorithm to address the latency-resource trade-off problem with a provable performance guarantee. Moreover, Hermes is applicable to more sophisticated formulations of the latency metrics, considering more general task dependency constraints as well as multi-device scenarios. The CPU time measurements show that Hermes scales well with problem size. We have further emulated the application execution using a real data set measured on several mobile benchmarks, and shown that our proposed online update policy, integrated with Hermes, is adaptive to dynamic network changes. Furthermore, the strategy suggested by Hermes performs much better than the greedy heuristic, so that the CPU overhead of Hermes is well compensated.

Figure 4.9: Hermes: 36% improvement for the example task graph (average latency 26.49 ms for Hermes vs. 36.01 ms for Odessa)
Figure 4.10: Hermes: 10% improvement for the face recognition application (average latency 621 ms vs. 682 ms)
Figure 4.11: Hermes: 16% improvement for the pose recognition application (average latency 5414 ms vs. 6261 ms)

Chapter 5
Deterministic Optimization with Multiple Constraints

In this study, we follow the assumption of known and deterministic profiles and formulate an optimization problem to find the best task assignment that minimizes the overall latency, subject to individual constraints on each device.
This multi-constrained formulation clearly attributes the cost to each device separately. Hence, we can avoid assignments that rely mostly on a single device and drain its battery. We show that our formulation is NP-hard, propose two polynomial-time approximation algorithms with provable performance guarantees with respect to the optimum, and verify our analysis through numerical simulation. This chapter is based on our work in [34].

Figure 5.1: Collaborative computing on the face recognition application
Figure 5.2: Graphical illustration of MCTA

5.1 Problem Formulation

Fig. 5.1 takes the face recognition application from [11] as an example¹, which consists of several major stages (tasks) such as face detection and classifiers. The system processes each incoming image and outputs the result as a set of names, which significantly reduces the data size compared to the raw data. Formally, suppose a data processing application consists of N stages (tasks), where each data frame goes through the N stages in order to be processed. There are M available devices in the network. Our goal is to find the optimal task assignment over these devices. That is, for each task i, find a device j to execute it such that the overall latency for finishing all N tasks is minimized. Furthermore, each device has an individual cost constraint. We can view this problem as finding the optimal path from the 1st stage to the Nth stage of the trellis diagram shown in Fig. 5.2. For cleaner presentation, each edge on the trellis diagram is denoted by a 3-tuple (i, j, k). Our decision variable is a binary variable x_ijk ∈ {0, 1}, which equals 1 if task i is assigned to device j and task i + 1 to device k.
If edge (i, j, k) is selected, the induced latency is denoted by T_ijk, which is the sum of the execution latency of task i on device j and the potential data transmission latency from device j to k if j ≠ k. Furthermore, let C_ij denote the cost of executing task i on device j. Let C^e_ijk be the data emission cost of transmitting the intermediate result of task i from device j to k. Similarly, let C^r_ijk be the data reception cost induced on device k. Moreover, we use the notation [N] to denote the set {1, 2, …, N}. The multi-constrained task assignment (MCTA) problem can be formulated as the following integer linear program.

MCTA:  min Σ_{i∈[N]} Σ_{j,k∈[M]} x_ijk T_ijk

s.t.  Σ_{i∈[N]} Σ_{k∈[M]} x_ijk (C_ij + C^e_ijk) + x_ikj C^r_ikj ≤ B_j,  ∀ j ∈ [M]   (5.1)
      Σ_{j∈[M]} x_ijk = Σ_{l∈[M]} x_{i+1,k,l},  ∀ i ∈ [N − 1], k ∈ [M]   (5.2)
      Σ_{j,k∈[M]} x_ijk = 1,  ∀ i ∈ [N]   (5.3)
      x_ijk ∈ {0, 1}   (5.4)

Since the application ends at the Nth task, we have C^e_Njk = C^r_Njk = 0. Eq. (5.1) is the cost constraint for each device. Constraint (5.2) implements the rule that if we assign task i + 1 to device k, then for the next task we have to pick an edge starting from k. Constraint (5.3) implies that we have to pick exactly one edge for each stage in the trellis diagram.

¹ We neglect some control signals exchanged between stages, as they are relatively small compared to the data frame.

5.1.1 Hardness of MCTA

We first reduce an NP-hard problem, the generalized assignment problem (GAP) [51], to MCTA. That is, MCTA is at least as hard as GAP; therefore, MCTA is also NP-hard. In GAP, there are N items and M bins. The reward of packing item i into bin j is p_ij. The goal is to pack each item into exactly one bin such that the total reward is maximized, subject to cost constraints on each bin. The integer programming formulation of GAP is as follows.

GAP:  max Σ_{i∈[N]} Σ_{j∈[M]} x_ij p_ij

s.t.
      Σ_{i∈[N]} x_ij w_ij ≤ B_j,  ∀ j ∈ [M]   (5.5)
      Σ_{j∈[M]} x_ij = 1,  ∀ i ∈ [N]   (5.6)
      x_ij ∈ {0, 1}   (5.7)

Given an instance of GAP, we map the instance to an input of MCTA and transfer the corresponding solution back to a solution of GAP. An instance of MCTA can be described as {T_ijk, C_ij, C^e_ijk, C^r_ijk, B'_j, N', M'}. Solving the GAP problem with instance {p_ij, w_ij, B_j, N, M} is equivalent to solving MCTA under the following mapping:

N' = N,  M' = M,  B'_j = B_j, ∀ j ∈ [M]
T_ijk = P_max − p_ij,  ∀ i ∈ [N], j, k ∈ [M]   (5.8)
C_ij = w_ij,  ∀ i ∈ [N], j ∈ [M]   (5.9)
C^e_ijk = C^r_ijk = 0   (5.10)

By defining P_max = max_{i,j} p_ij, Eq. (5.8) transfers the maximization objective of GAP to the minimization objective of MCTA. Eq. (5.9) maps the cost of packing item i into bin j to the execution cost of task i on device j in MCTA. Given the solution {x_ijk} of MCTA, the solution of GAP is to pack item i into bin j if x_ijk = 1. Since {x_ijk} satisfies constraint (5.3), for each i there exists one and only one tuple (j, k) for which x_ijk = 1. That is, for each item i, we can find exactly one bin j in which to pack the item, as suggested by the solution of MCTA (k is irrelevant). By these mappings of input parameters and solutions, we have shown that GAP can be reduced to MCTA. Hence, MCTA is NP-hard.

We further establish an upper bound on the hardness of MCTA. We notice from Fig. 5.2 that solving MCTA is equivalent to finding an optimal and feasible path on the trellis diagram. Hence, we relate MCTA to the multi-constrained path selection problem (MCP) [52], where one wants to find an optimal and feasible path on a directed acyclic graph (DAG) given a starting node and a destination. Literally, MCTA falls within the set of formulations of MCP. We omit the details of the reduction from MCTA to MCP and refer the reader to the literature on MCP problems [53]. To summarize, we can bound the hardness of MCTA by

GAP ≤_p MCTA ≤_p MCP.   (5.11)

The relation A ≤_p B states that problem A is polynomial-time reducible to problem B.
That is, A can be solved by calling a solver for B a polynomial number of times. Since these problems are NP-hard, researchers aim to find polynomial-time algorithms that solve them approximately, with a performance guarantee on the sub-optimal solution. There have been several approximation results for GAP and MCP. Fleischer et al. [51] propose an LP rounding algorithm that approximates GAP within a factor (1 − 1/e). For the MCP problem, if the number of constraints is a constant, Xue et al. [54] propose a fully polynomial time approximation scheme (FPTAS [9]) that approximates MCP within (1 + ε), with complexity bounded by a polynomial in 1/ε and the problem size. However, in MCTA the number of constraints grows with the network size M. Hence, directly applying the FPTAS algorithm to MCTA results in exponential complexity. The authors in [55] propose an algorithm that admits an M-approximation ratio, which grows with the number of constraints. Having justified the hardness of MCTA, we propose two polynomial-time algorithms with provable performance guarantees.

Algorithm 5 Sequential Randomized Rounding Algorithm
1: procedure SARA({x*_ijk})
2:   Choose v_1, v_2 w.p. x*_{1jk}
3:   for i ← 2, …, N − 1 do
4:     Given v_i, choose v_{i+1} = k w.p. x*_{i,v_i,k} / Σ_l x*_{i,v_i,l}
5:   end for
6:   Return v_1, …, v_N
7: end procedure

5.2 Sequential Randomized Rounding Algorithm

In this section, we solve the LP-relaxation of MCTA and design a randomized rounding algorithm based on the LP solution. For each data frame, our algorithm sequentially assigns each task to a device in order. Unlike most LP rounding algorithms, which independently round the fractional solution to integers, our algorithm rounds the assignment of task i depending on the rounding result of task i − 1. Hence, we call it the Sequential rAndomized Rounding Algorithm (SARA). If we relax (5.4) to allow each variable to fall in the interval [0, 1], then we have the LP-relaxation of MCTA.
We first solve the LP-relaxation and design a simple rounding algorithm to round the fractional values to integers, either 0 or 1. Let {x*_ijk} denote the optimal solution of the LP-relaxation. We propose SARA, shown in Algorithm 5. SARA takes {x*_ijk} as input and outputs a sequence of vertices V_1, …, V_N, which specifies the selected path on the trellis diagram in Fig. 5.2. The corresponding task assignment assigns task i to device V_i for each i. Since SARA is a randomized algorithm, we use capital letters to denote the outputs as random variables. On the other hand, we use lowercase letters v_1, …, v_N to denote constant values. Furthermore, we use E_i = (i, V_i, V_{i+1}) to denote the random edge selected in the ith stage of Algorithm 5.

SARA makes the task assignment as follows. First, select an edge on the trellis diagram representing the assignment of tasks 1 and 2, with the distribution implied by x*_{1jk}. Starting from task 3, select the device based on the conditional probability given the assignment of the previous task,

P{V_{i+1} = v | V_i = u} = x*_{iuv} / Σ_{l∈[M]} x*_{iul}.   (5.12)

Theorem 2. The expected performance of Algorithm 5 is

Σ_{i∈[N]} Σ_{j,k∈[M]} x*_ijk T_ijk,   (5.13)

which is the minimum objective of the LP-relaxation. Furthermore, the expected cost on each device j is

Σ_{i∈[N]} Σ_{k∈[M]} x*_ijk (C_ij + C^e_ijk) + x*_ikj C^r_ikj.   (5.14)

We first prove the following lemma.

Lemma 1. For all i ∈ {2, …, N}, v ∈ {1, …, M}, implementing Algorithm 5 results in

P{V_i = v} = Σ_{u∈[M]} x*_{i−1,u,v}.   (5.15)

Proof. We prove this lemma by induction. For i = 2, we have

P{V_2 = v} = Σ_u P{V_1 = u, V_2 = v} = Σ_u x*_{1uv}.   (5.16)

Assume the case i = n is true. For i = n + 1,

P{V_{n+1} = v} = Σ_u P{V_n = u, V_{n+1} = v}   (5.17)
             = Σ_u P{V_n = u} P{V_{n+1} = v | V_n = u}   (5.18)
             = Σ_u Σ_w x*_{n−1,w,u} · x*_{nuv} / Σ_s x*_{nus}   (5.19)
             = Σ_u x*_{nuv}.   (5.20)

Eq. (5.19) uses the fact that the optimal solution {x*_ijk} satisfies constraint (5.2); that is, Σ_w x*_{n−1,w,u} = Σ_s x*_{nus}.
Hence, we get the result as required.

5.2.1 Proof of Theorem 2

Let T_i be the latency induced by selecting edge (i, V_i, V_{i+1}), which is a random variable depending on V_i and V_{i+1}. The expected objective value given by Algorithm 5 can be written as

E{Σ_{i∈[N]} T_i} = Σ_{i∈[N]} E{T_i}.   (5.21)

For i = 1, Algorithm 5 implies E{T_1} = Σ_{j,k} x*_{1jk} T_{1jk}. For i = 2, …, N − 1, we have

E{T_i} = Σ_j P{V_i = j} E_{V_{i+1}}{T_i | V_i = j}   (5.22)
       = Σ_j Σ_u x*_{i−1,u,j} Σ_k T_ijk · x*_ijk / Σ_l x*_ijl   (5.23)
       = Σ_{j,k} x*_ijk T_ijk.   (5.24)

Eq. (5.23) again uses the fact that {x*_ijk} satisfies constraint (5.2). Summing the stage-wise expected values, we obtain the expected performance given in Theorem 2.

The expected cost on each device can be derived in a similar way. Let D_ij be the cost induced on device j by selecting edge (i, V_i, V_{i+1}). That is,

D_ij = { C_ij + C^e_ijk             if V_i = j, V_{i+1} = k (k ≠ j)
       { C^r_ikj                    if V_i = k, V_{i+1} = j (k ≠ j)
       { C_ij + C^e_ijj + C^r_ijj   if V_i = j, V_{i+1} = j
       { 0                          otherwise.   (5.25)

The expected cost on device j can be written as

E{Σ_{i∈[N]} D_ij} = Σ_{i∈[N]} E{D_ij}.   (5.26)

For i = 1, E{D_1j} = Σ_k x*_{1jk}(C_1j + C^e_{1jk}) + x*_{1kj} C^r_{1kj}. For i = 2, …, N − 1, we have

E{D_ij} = Σ_{u,v∈[M]} P{V_i = u} P{V_{i+1} = v | V_i = u} E{D_ij | V_i = u, V_{i+1} = v}   (5.27)
        = Σ_{u,v} Σ_w x*_{i−1,w,u} · x*_{iuv} / Σ_l x*_{iul} · E{D_ij | V_i = u, V_{i+1} = v}   (5.28)
        = Σ_{v≠j} x*_ijv (C_ij + C^e_ijv) + Σ_{u≠j} x*_iuj C^r_iuj + x*_ijj (C_ij + C^e_ijj + C^r_ijj)   (5.29)
        = Σ_k x*_ijk (C_ij + C^e_ijk) + x*_ikj C^r_ikj.   (5.30)

Theorem 2 implies that SARA achieves the optimum of the LP and induces a feasible cost on each device in expectation. That is, if we run SARA for each data frame, then in the long run the average latency per frame converges to the optimum of the LP, by the strong law of large numbers. Since the optimal solution of MCTA is feasible for its LP-relaxation, SARA achieves the optimal performance on average.
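A minimal sketch of SARA's sampling stage (Algorithm 5), assuming the fractional LP solution is supplied as a nested list x_star[i][j][k] over the N − 1 trellis stages (0-indexed here, whereas the text is 1-indexed; solving the LP-relaxation itself is omitted):

```python
import random

def sara(x_star, N, M, rng=None):
    """Sequential randomized rounding (Algorithm 5 sketch).

    x_star[i][j][k] is the fractional weight of assigning task i+1 to
    device j and task i+2 to device k (0-indexed stages).
    Returns a list of N devices, one per task.
    """
    rng = rng or random.Random()
    # Stage 1: pick (v1, v2) with probability x*_{1jk}.
    pairs = [(j, k) for j in range(M) for k in range(M)]
    weights = [x_star[0][j][k] for j, k in pairs]
    v1, v2 = rng.choices(pairs, weights=weights)[0]
    path = [v1, v2]
    # Later stages: pick v_{i+1} conditioned on v_i, as in (5.12);
    # rng.choices renormalizes the row x*_{i, v_i, .} automatically.
    for i in range(1, N - 1):
        u = path[-1]
        path.append(rng.choices(range(M), weights=x_star[i][u])[0])
    return path
```

Because the conditional step renormalizes the row x*_{i,u,·}, flow constraint (5.2) yields the marginals of Lemma 1, which is what makes the per-frame expectation match the LP optimum.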
The LP-relaxation of MCTA has NM² variables, and Algorithm 5 runs in O(N) time. Hence, SARA, including solving the LP, runs in polynomial time. On the other hand, one could naively formulate an ILP in which, for each assignment strategy i, a binary variable y_i denotes whether the corresponding assignment is selected. Similarly, one could solve the LP-relaxation and design the rounding algorithm to select assignment i with probability y*_i. This naive algorithm also achieves the optimal performance in expectation. However, there are M^N assignment strategies, which results in solving an LP whose number of variables grows exponentially with the problem size, and in sampling over exponentially many assignment strategies (O(M^N)). Hence, we propose SARA, which runs efficiently and achieves the same performance guarantee.

5.3 A Bicriteria Approximation Algorithm for MCTA with Bounded Communication Costs²

In this section, we assume that device communication costs are bounded; specifically, we assume ∀(i, j, k): C^e_ijk, C^r_ijk ∈ [C, λC]. We make no assumptions about any overall bounds on task latencies or execution costs, and let C_ij, T_ijk ∈ R⁺. Note that T_ijk is a combination of two different metrics, task computing latency and data transmission latency, and therefore we can consider separating these two components. More specifically, define F_ikl as the latency involved in forwarding the results of task i, received at device k, to a device l. It is reasonable to assume that packet forwarding can be handled directly by the network interface at a device with low overhead, and therefore we assume that the packet forwarding latency satisfies F_ikl ≤ λT̂ for all tasks and devices, where T̂ = min_{i,j,k} T_ijk and λ > 1 is a small constant. It is reasonable to assume λ is small, since T̂ involves both task computing and data transmission latencies while F involves only data forwarding latency.

² This section is joint work with Dr. Rajgopal Kannan, University of Southern California.
We then define and solve a modified LP-relaxation of the original MCTA integer program by separating the functionalities of task execution and data forwarding, and convert the resulting fractional assignment of tasks to devices into an integral assignment using a technique similar to one used for minimum-cost makespan scheduling [56]. We show that the resulting assignment is a (λ + 1, 2λ + 2)-approximation to the optimal integral solution: the total latency of our assignment is within a (λ + 1) factor of the optimal latency, while all device energy costs are within a (2λ + 2) factor of their original budgets.

Consider the following modified version of the original MCTA integer program:

MCTA2:  min Σ_{i∈[N]} Σ_{j,k∈[M]} x_ijk T_ijk

s.t.  Σ_{i∈[N]} Σ_{k∈[M]} x_ijk (C_ij + C^e_ijk + C̄^r_{i−1,j}) + x_ikj (C^r_ikj + C̄^e_ij) ≤ (λ + 1) B_j,  ∀ j ∈ [M]   (5.31)
      Σ_{j,k∈[M]} x_ijk = 1,  ∀ i ∈ [N]   (5.32)
      x_ijk ∈ {0, 1},  ∀ i ∈ [N], ∀ (j, k) ∈ [M]²   (5.33)

where C̄^r_{i−1,j} and C̄^e_{i,j} in (5.31) represent the maximum reception and emission costs of tasks i − 1 and i at device j, respectively. The integer program MCTA2 selects the best transmit-receive pair of devices (j, k) for each task i ∈ [N] so as to minimize the total latency over all N tasks while satisfying the (λ + 1)-upscaled device budget constraints. Once the results of a task are received at a device, they are forwarded to the device executing the next task. Let M' = {(i, j'_i, k'_i)} denote the optimal solution to MCTA2. Under this notation, task i is executed at device j'_i and the emitted results are received by device k'_i, which then forwards them to j'_{i+1}, the computing device for task i + 1 (necessary only if k'_i ≠ j'_{i+1}). Similarly, let M* = {(i, j*_i, k*_i)} denote the optimal solution to the original MCTA problem,
Proof. Consider task assignments (i 1;j i1 ;j i ) and (i;j i ;k i ) fromM . Using (5.31), the cost to device j i in MCTA2 of this assignment is: (C ij i +C e ij i k i +C r i1;j i ) + (C r i1;j i1 j i +C e ij i ) (5.34) ( + 1) C ij i +C e ij i k i +C r i1;j i1 j i (5.35) where (5.35) follows from the bounds on communication costs. The last term of (5.35) represents the energy cost to device j i for implementing task assignments (i 1;j i1 ;j i ) and (i;j i ;k i ) in the optimal MCTA solutionM . When summing this up over all tasks inM , this sum is less than B j sinceM is optimal and the result follows. Lemma 3. The total latency of the optimal solutionM 0 is at most + 1 times the optimal latency ofM . 98 Proof. The optimal solution derived fromM 0 requires forwarding for all tasks i such that k 0 i 6= j 0 i+1 with associated cost F ik 0 i j 0 i+1 . Thus the latency of this solution is at most X i2[N] T ij 0 i k 0 i +F ik 0 i j 0 i+1 X i2[N] T ij 0 i k 0 i + ^ T ijk (5.36) X i2[N] ( + 1)T ij 0 i k 0 i ( + 1) X i2[N] T ij i k i ; (5.37) where the rst part of (5.37) is because ^ T ijk is the minimum latency for any task i and the second part follows from lemma 2. Now consider the LP-relaxation of MCTA2 with the additional constraints x ijk = 0 if (C ij +C e ijk >B j orC r ijk >B k ) (5.38) x ijk 0 8i2 [N];8(j;k)2 [M] (5.39) Constraint (5.38) ensures that in the LP-relaxation, there is no fractional assign- ment of taski to devicej with emission to devicek if either the combined execution and data emission costs exceed the budget of device j or the data reception costs exceed the budget of device k. Letf~ x ijk g and ~ T denote the optimal solution and the optimal objective func- tion value obtained from the LP-relaxation dened above. Note that ~ T T 0 = 99 P i2[N] T ij 0 i k 0 i . We convert this fractional optimal assignment to an integral multi- constrained task assignment as follows: For devicej2 [M], letn j =d P i2[N] P k2[M] ~ x ijk e. 
represent the net integral weight of tasks assigned for execution at device j. We define a bipartite graph G = (U, V, E) as follows: For each task i, add a node v_i to V, and for each device j, add a vertex set U_j = \{u^1_j, u^2_j, ..., u^{n_j}_j\} to U. Next, for each device j, define a logical vertex set V_j = \{v_{ijk}\} consisting of one logical vertex for each task-device index ijk such that \tilde{x}_{ijk} > 0. This logical set represents all tasks i \in [N] with a positive fractional assignment \tilde{x}_{ijk} > 0 to device j, with results emitted to device k \in [M]. Each logical vertex v_{ijk} is mapped to an actual task vertex v_i \in V and has the following attributes: a communication weight defined as C_{ij} + C^e_{ijk} and an assignment weight defined as \tilde{x}_{ijk}.

Sort the logical set V_j in non-increasing order of communication weights. Let L_j = [v_{l_1}, v_{l_2}, ...] be this sorted list of vertices. Note that each actual vertex v_i \in V may appear several times in this list. For notational convenience henceforth, let b_t denote the assignment weight of a typical vertex v_{l_t} \in L_j, t = 1, 2, .... Divide L_j into n_j groups of consecutive vertices as follows: The first group G^1_j consists of vertices (v_{l_1}, v_{l_2}, ..., v_{l_r}) where \sum_{t=1}^{r-1} b_t < 1 and we can divide b_r = b'_r + \hat{b}_r, where \hat{b}_r \ge 0, such that \sum_{t=1}^{r-1} b_t + b'_r = 1. If \hat{b}_r > 0, then the second group G^2_j consists of (v_{l_r}, v_{l_{r+1}}, ..., v_{l_{r+k}}). Again, for the last vertex in the group, divide b_{r+k} = b'_{r+k} + \hat{b}_{r+k}, where \hat{b}_{r+k} \ge 0, such that \hat{b}_r + b'_{r+k} + \sum_{t=1}^{k-1} b_{r+t} = 1. However, if \hat{b}_r = 0, then the second group G^2_j does not include v_{l_r} and instead consists of (v_{l_{r+1}}, ..., v_{l_{r+k}}). Again, for the last vertex in the group, divide b_{r+k} = b'_{r+k} + \hat{b}_{r+k}, where \hat{b}_{r+k} \ge 0, such that b'_{r+k} + \sum_{t=1}^{k-1} b_{r+t} = 1. This process is repeated for each of the n_j groups.
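The grouping step above can be sketched as follows (a minimal illustration of ours, not the dissertation's code): given the assignment weights already sorted by non-increasing communication weight, we greedily pack them into consecutive groups of total weight 1, splitting a weight across two groups exactly as b'_t and \hat{b}_t are split in the text.

```python
def unit_groups(fracs):
    """Split a sorted list of fractional assignment weights into consecutive
    groups whose (possibly split) weights each sum to 1; only the last group
    may sum to less. Returns a list of groups of (item_index, portion)."""
    groups, current, room = [], [], 1.0
    eps = 1e-9
    for t, b in enumerate(fracs):
        while b > eps:
            take = min(b, room)   # b'_t stays in this group, the rest spills over
            current.append((t, take))
            b -= take
            room -= take
            if room < eps:        # group is full: close it and start a new one
                groups.append(current)
                current, room = [], 1.0
    if current:
        groups.append(current)    # trailing partial group
    return groups

# five fractional weights summing to 2.0 -> n_j = 2 groups; item 1 is split
groups = unit_groups([0.6, 0.5, 0.4, 0.3, 0.2])
```

Here the split weight of item 1 (0.4 in the first group, 0.1 in the second) plays the roles of b'_t and \hat{b}_t.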
In general, if a vertex v_{l_t} appears in two consecutive groups q and q+1 as the last and first vertex, respectively, then its assignment weight is split as b'_t and \hat{b}_t (as defined above) among groups q and q+1. WLOG, we use the notation b' to denote the (possibly split) fractional contribution of the last vertex to its group and \hat{b} to denote the (possibly split) fractional contribution of the first vertex to its group. Note that, with the possible exception of the last group n_j, the (split) assignment weights of all vertices in a group add up to 1.

Once the groups have been formed, draw weighted edges in E between the vertices in L_j and U_j as follows: From each vertex v_{l_t} in group G^q_j, we draw an edge to a single vertex u^q_j \in U_j, 1 \le q \le n_j, with edge weight T_{ijk}, where ijk is the label of this logical vertex. (Again, note that logical vertex v_{l_t} corresponds to some actual vertex v_i \in V from which we draw the edge.) The degree of each vertex in U_j is at least 1. Each task i appears exactly once as vertex v_i in V but is mapped to several logical vertices in L_j; for each group G^q_j to which it is mapped, we have an edge (v_i, u^q_j) \in E.

Consider a minimum weighted matching M on the bipartite graph G such that every vertex v_i \in V (task i) is matched to exactly one vertex in U (some device j). Further, every vertex u^q_j \in U_j is matched to at most one vertex (task) in group
Hence, the minimum weighted matching can be further interpreted by tuning the fractional weightsf~ x ijk g given by the LP-relaxation of MCTA2 without being limited by the budget constraints on each device. Therefore, its resulting minimum value is less than ~ T and hence is less than T 0 , which is the optimum of MCTA2. For any two consecutive task assignments (i;j;k)2M, (i + 1;l;m)2M, task i are executed at j, the result is emitted to k and forwarded to l. We now show that this violates device budget constraints by at most a 2 + 2 factor. Lemma4. For each devicej2 [M], the energy cost of the solution obtained through matchingM is (2 + 2)B j . 102 Proof. For each group G q j , 1qn j , let v fq;jb and v lq;jd denote the rst and last (logical) vertices with fq;lq2 [N] and b;d2 [M]. The energy cost to j for task execution and emission in matchingM is therefore bounded by ~ B M j n j X q=1 C fq;j +C e fq;jb +C r fq1;j ( + 1) n j X q=1 C fq;j +C e fq;jb (5.40) ( + 1)(B j + n j 1 X q=1 C lq;j +C e lq;jk ) (5.41) ( + 1)(B j + n j 1 X q=1 X v ijk 2G q j nfv fq;jb ;v lq;jd g (C ij +C e ijk )b i + (C fq;j +C e fq;jb ) ^ b fq + (C lq;j +C e lq;jd )b 0 lq ) (5.42) ( + 1)(B j + X v ijk 2V j (C ijk +C e ijk )~ x ijk ) (5.43) Eqs. (5.40) and (5.41) are due to the fact that at most one task vertex from a group G q j gets matched; vertices are sorted in non-increasing order of weights; C f1;j +C e f1;jb B j , and C r fq1;j C e fq;jb from our assumption on bounded com- munication costs. Eq. (5.42) follows by putting the fractional weights (b t ;b 0 t ; ^ b t ) on the sorted list of vertices induces more cost than putting all integral weights on the vertices with least cost in each group. Eq. (5.43) holds since b t corresponds to ~ x ijk and so as the sum of b 0 t and ^ b t in the splitting case. 103 Device j also incurs cost in receiving and forwarding the results of task i in some task assignment (ikj), for some i2 [N], and some k2 [M]. 
Its energy costs from matching M can be simply bounded as

\sum_{k \in [M]} \sum_{q=1}^{n_k} (\bar{C}^r_j + \bar{C}^e_j) \le (\rho+1) \sum_{v_{ikj} \in V} (C^r_{ikj} + \bar{C}^e_j) \tilde{x}_{ikj}   (5.44)

Adding (5.43) and (5.44) and using the device energy constraint (5.31) of MCTA2 gives the result.

Putting Lemma 3 and Lemma 4 together, we get the main result:

Theorem 3. M is a (\delta+1, 2\rho+2)-approximation to the optimal multi-constrained task assignment (MCTA) problem.

If we add dummy nodes and edges with infinite weights to the bipartite graph to make it balanced and complete, then the minimum weighted matching is equivalent to finding a minimum-weight perfect matching in the complete weighted bipartite graph, which can be solved in polynomial time [57]. Hence, our proposed bicriteria approximation (BiApp), which consists of solving an LP-relaxation with NM^2 variables and a minimum weighted bipartite matching with O(N) nodes, runs in polynomial time and provides performance guarantees on both the objective and the constraints.

Table 5.1: SARA/BiApp: Simulation Profile

Parameter                 Interval
N / M                     10 / 3 (fixed)
\delta                    3 (fixed)
T_{ijk}                   Uniform [0, 18]
C_{ij}                    Uniform [0, 12]
C^e_{ijk}, C^r_{ijk}      Uniform [4, 8] (\rho = 2)
B_j                       Uniform [50, 100]

Figure 5.3: SARA: optimal performance in expectation (running averages of the overall latency and of the cost on devices 1-3 over 200 frames, compared against the fractional optimum and the expected costs)

5.4 Numerical Evaluation

We simulate the performance of SARA and BiApp on comprehensive problem instances. Table 5.1 summarizes our simulation profile.
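As background for the SARA results that follow, its per-frame randomized rounding can be sketched as follows (our own minimal illustration, not the dissertation's code): each frame independently samples an integral assignment with the fractional LP weights as probabilities, so the expected per-frame latency equals the fractional LP objective.

```python
import random

def round_once(frac, rng):
    """Sample one integral choice per task, with the fractional LP weights as
    probabilities (constraint (5.32) makes each task's weights sum to 1)."""
    plan = []
    for weights in frac:
        choices = list(weights)
        plan.append(rng.choices(choices, weights=[weights[c] for c in choices])[0])
    return plan

def frame_latency(plan, latency):
    return sum(latency[i][c] for i, c in enumerate(plan))

# toy instance: two tasks; a choice is a (compute device, receive device) pair
frac = [{("A", "B"): 0.5, ("B", "B"): 0.5}, {("B", "A"): 1.0}]
latency = [{("A", "B"): 4.0, ("B", "B"): 6.0}, {("B", "A"): 3.0}]

rng = random.Random(0)
frames = 20000
running_avg = sum(frame_latency(round_once(frac, rng), latency)
                  for _ in range(frames)) / frames
expected = sum(w * latency[i][c] for i, ws in enumerate(frac)
               for c, w in ws.items())   # fractional LP objective
```

The running average converges to the fractional objective, which is exactly the convergence behavior reported for SARA below.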
The application task graph consists of 10 stages and the resource network contains 3 devices. For each problem instance, the single-stage costs and latencies are uniformly and independently drawn from the prescribed intervals, as are the individual cost constraints B_j.

Figure 5.4: BiApp: (\delta+1, 2\rho+2)-approximation (\delta = 3, \rho = 2); ratios of the induced latency to the optimum and of each device's induced cost to its budget, over 50 problem instances, for both BiApp and SARA

Fig. 5.3 shows the performance of SARA. For each data frame, SARA implements the randomized rounding algorithm based on the solution \{x*_{ijk}\} of the LP-relaxation, and gives the task assignment for processing this data frame. Hence, we present the simulation results as running averages. Let Latency(t) denote the overall latency of frame t. The running average at t = T can be calculated as

avg(T) = (1/T) \sum_{t=1}^{T} Latency(t).   (5.45)

In Fig. 5.3, the red lines show the performance of the optimal fractional solution. That is, we set the variables in MCTA to \{x*_{ijk}\} and obtain the overall latency and the costs induced on each device. We can see that the average performance of SARA converges to these numbers. Furthermore, since the minimum objective of an LP-relaxation is always smaller than the minimum objective of its original IP, SARA achieves the optimal performance and induces feasible cost on each device asymptotically on average.

Fig. 5.4 shows the performance of BiApp over 50 randomly selected problem instances. Given an instance, we present the performance of BiApp as the ratio of BiApp's induced latency to the optimum. Similarly, we present the ratio of BiApp's induced cost to the budget B_j on each device. We can see that BiApp achieves its (\delta+1, 2\rho+2)-approximation guarantee.
That is, the overall latency is no more than (\delta+1) times the optimum, and the induced cost on each device is no more than (2\rho+2) times the budget. In our simulation setting, \delta = 3 and \rho = 2; hence, BiApp is a (4, 6)-approximation algorithm. We can see that the actual performance is much better than our theoretical worst-case bound. Note that for some instances, BiApp even outperforms the optimum. This is because BiApp solves the relaxed LP, which drops the constraint that the device receiving the result of task i-1 be the same device that executes task i (constraint (5.2) in MCTA). Hence, there may be a more economical strategy that first transmits the result of task i-1 to another device and then forwards the data to the device that executes task i.

We examine the same instances with SARA. However, unlike BiApp, which uses a fixed assignment for a given instance, SARA varies the assignment across data frames. Hence, we run SARA and average the performance over 200 data frames, with the vertical bars showing the standard deviation. Fig. 5.4 shows that SARA achieves the optimal performance on average and incurs tolerable variance for the instances we have considered.

5.5 Conclusion

Given an application that is partitioned into multiple tasks, we have formulated MCTA, which aims to minimize the execution latency while satisfying the budget constraint on each device, considering the on-device resources and the potential data communication overhead over the network. Compared to SCTA, presented in Chapter 4, MCTA considers both system performance and cost balancing. Hence, it avoids assigning most of the tasks to a single device, which could drain its battery. We have proved that our formulation is NP-hard and proposed two algorithms with provable guarantees with respect to the optimum: SARA and BiApp. SARA is an LP rounding algorithm that achieves the optimal performance in expectation.
BiApp is a bicriteria approximation that has provable performance guarantees on both the objective and the constraints. Simulation results have shown SARA's optimality and justified BiApp's performance guarantees. In particular, for the comprehensive problem instances we have considered, BiApp performs much better than the theoretical worst-case analysis suggests.

Chapter 6

Stochastic Optimization with Single Constraint

In this study, we assume that the task execution latencies and channel transmission latencies are i.i.d. stochastic processes with known distributions. We formulate a stochastic optimization problem that partitions the tasks into two sets: one set of tasks to be executed at the remote server, and another that remains at the local device. Our task partition provides a probabilistic QoS guarantee. That is, our task assignment strategy guarantees that at least p% of the time the application latency is no more than t, where p and t are arbitrary numbers. This chapter is based on our work in [35].

6.1 Problem Formulation

Consider a tree-structured task graph G(V, E) with N nodes as tasks and edges specifying data dependency. For task i, we use a binary variable x_i such that either task i is sent to the remote server (x_i = 1) or remains at the local device (x_i = 0). We use the same set of notations as in Chapter 4. However, since we only have two devices, a local device and a remote server, we use superscripts l and r instead of an enumeration. For example, let T^l_i denote the latency of executing task i on the local device, and T^r_i denote the latency of executing task i on the remote server. We model the data transmission latency between two tasks (if necessary) as

T^{lr}_{mn} = T^{rl}_{mn} = d_{mn} T_c,   (6.1)

where T_c is the latency of transmitting a unit amount of data between the local device and the remote server. WLOG, for illustrative purposes, we assume that the channel is symmetric.
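The transmission model of (6.1) can be sketched as follows (our own illustration; it also sets the latency to zero when the two endpoint tasks share a placement, since no transfer is needed in that case).

```python
def edge_latency(x_m, x_n, d_mn, t_c):
    """Transmission latency on edge (m, n): d_mn * T_c when the two tasks sit
    on different devices (local = 0, remote = 1), and 0 when co-located."""
    return d_mn * t_c if x_m != x_n else 0.0

# a 2 MB intermediate result over a symmetric 0.1 s-per-MB channel
lat = edge_latency(0, 1, 2.0, 0.1)   # local -> remote transfer
```

Symmetry means the local-to-remote and remote-to-local directions cost the same, as assumed in the text.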
If task m and task n are executed at the same place, we have T^{ll}_{mn} = T^{rr}_{mn} = 0 for all (m,n) \in E, as no data transmission happens between them. We focus on the energy cost of the local device. That is, we assume C^r_i = 0 for all i \in [N]. Furthermore, we model the energy cost as

C^l_i = p_c T^l_i,   (6.2)
C^{lr}_{mn} = C^{rl}_{mn} = p_{RF} T^{lr}_{mn},   (6.3)

where p_c is the computing power of the local device and p_{RF} is the power consumption of the RF components.

Let D^{(i)} denote the accumulated latency when finishing task i, which depends on its child tasks and can be described by the recursive relation

D^{(i)} = \max_{m \in C(i)} (D^{(m)} + T^{x_m x_i}_{mi}) + T^{x_i}_i,   (6.4)

where C(i) denotes the set of child tasks of i. In the dynamic environment, we consider T^l_i, T^r_i and T_c as i.i.d. stochastic processes with known distributions. We aim to find the task partition that minimizes the expected cost subject to the probabilistic QoS constraint. That is,

PTP: \min E\{\sum_{i \in [N]} C^{x_i}_i + \sum_{(m,n) \in E} C^{x_m x_n}_{mn}\}
s.t. P\{D^{(N)} \le t_max\} > p_obj   (6.5)
x_N = 0   (6.6)
x_i \in \{0,1\}, \forall i \in [N].   (6.7)

We assume the root (last) task always remains at the local device. Eq. (6.5) specifies the QoS guarantee that at least p_obj% of the time the latency is no more than t_max.

Algorithm 6 Probabilistic delay constrained Task Partitioning (PTP)
1: procedure PTP(N, p_obj, t_max)   ▷ min. cost from N s.t. P{D^{(N)} \le t_max} > p_obj
2:   q ← BFS(G, N)   ▷ run BFS from node N and store visited nodes in order in q
3:   for n ← q.end, ..., q.start do   ▷ start from the last element in q
4:     if n is a leaf then   ▷
initialize OPT values of leaves
5:       OPT^l[n, p_k] ← p_c T^l_n if p_k = q(F_{T^l_n}(t_max)); \infty otherwise
6:       OPT^r[n, p_k] ← 0 if p_k = q(F_{T^r_n}(t_max)); \infty otherwise
7:     else
8:       for all combinations (OPT[1, p_{k_1}], ..., OPT[d, p_{k_d}]) do
9:         link to OPT^l[n, p*] if

p* = q(\int_0^{t_max} \prod_{m \in C(n)} F_{D^{(m)}}(t - x_m d_{mn} T_c) f_{T^l_n}(t_max - t) dt).   (6.8)

10:        link to OPT^r[n, p*] if

p* = q(\int_0^{t_max} \prod_{m \in C(n)} F_{D^{(m)}}(t - (1 - x_m) d_{mn} T_c) f_{T^r_n}(t_max - t) dt).   (6.9)

11:      end for
12:      for k ← 1, ..., K do
13:        calculate OPT^l[n, p_k], OPT^r[n, p_k] by choosing the minimum over their links
14:      end for
15:    end if
16:  end for
17:  trace back the optimal decision from min_{k \in \{1,...,K\}} OPT^l[N, p_k]
18: end procedure

Instead of arguing about the average latency, this probabilistic constraint is stronger, especially in a highly variant environment. In the following, we propose an efficient algorithm that gives a near-optimal solution to this problem.

6.2 PTP: Probabilistic Delay Constrained Task Partitioning

We adopt the dynamic programming approach and use quantization to bound the number of sub-problems. Similar to Hermes in Chapter 4, we solve the sub-problems from the leaves to the root. We define OPT^l[i, p_k] as the minimum cost of finishing task i locally under the constraint

q(P\{D^{(i)} \le t_max\}) = p_k,   (6.10)

where p_k is the k-th quantization step in [p_obj, 1] and the quantizer is defined as

q(x) = p_k, \forall x \in (p_k, p_{k+1}], \forall k \in \{1, ..., K\}.   (6.11)

Since the latency accumulates as we solve the sub-problems from the leaves to the root, it is sufficient to deal only with the interval [p_obj, 1]. However, for an edge (m, n), instead of simply excluding the delays induced after node m as in the deterministic analysis, the links between OPT[m, p_m] and OPT[n, p_n] (where the superscript can be l or r) are not obvious for arbitrary p_m and p_n. To find OPT^l[n, p_k], for illustrative purposes, we assume that node n has two children, m_1 and m_2, and we derive the case when both of them are executed at the local device.
In this example, we have to identify all possible cases OPT^l[m_1, p_{k_1}], OPT^l[m_2, p_{k_2}] satisfying

q(P\{D^{(m_1)} \le t_max\}) = p_{k_1},   (6.12)
q(P\{D^{(m_2)} \le t_max\}) = p_{k_2},   (6.13)
q(P\{D^{(n)} \le t_max\}) = p_k.   (6.14)

Let F represent the cumulative distribution function (CDF) and f represent the probability density function (PDF). Since D^{(n)} depends on D^{(m_1)} and D^{(m_2)} as shown in (6.4), given F_{D^{(m_1)}} and F_{D^{(m_2)}} calculated from the optimal solutions of the sub-problems, we can find F_{D^{(n)}}(x) as follows:

P\{D^{(n)} \le x\} = P\{\max\{D^{(m_1)}, D^{(m_2)}\} + T^l_n \le x\}   (6.15)
= \int_0^x f_{T^l_n}(t) P\{\max\{D^{(m_1)}, D^{(m_2)}\} \le x - t\} dt   (6.16)
= \int_0^x F_{D^{(m_1)}}(x - t) F_{D^{(m_2)}}(x - t) f_{T^l_n}(t) dt   (6.17)

As D^{(m_1)} and D^{(m_2)} are independent, the CDF of their maximum is the product of the individual CDFs; hence, Eq. (6.17) extends to multiple children. By calculating the integral in (6.17), we can find the pairs (p_{k_1}, p_{k_2}) such that the latencies D^{(m_1)} and D^{(m_2)} satisfy (6.14).

We must also consider the variation of the channel state T_c, the transmission latency per unit of data. As the total latency is additive over the tasks of each branch, the derivation of the CDF also involves convolutions. For illustrative purposes, we present our equations for the case when T_c is constant, in which Eq. (6.17) involves shifts of the CDFs whenever a constant data transmission delay is induced.

Suppose node n has d children. We have to consider all possible assignments of its children and all possible p_k's. We write the set of all possible combinations as

(OPT[1, p_{k_1}], OPT[2, p_{k_2}], ..., OPT[d, p_{k_d}]).   (6.18)

Each superscript can be independently chosen from \{l, r\} and each k_m can be independently chosen from \{1, ..., K\}, for all m in \{1, ..., d\}. Hence, creating links for a node n involves, in the worst case, (2K)^d integrals of the form (6.17). We summarize our algorithm, PTP, in Algorithm 6. PTP runs in O(N K^{d_in}) time, where d_in is the maximum in-degree of the task graph.
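Equation (6.17) can be checked numerically; the sketch below (ours, not the dissertation's code) discretizes the integral with a midpoint rule. As a sanity case, when both children finish instantly the integral must collapse to the CDF of T^l_n itself.

```python
import math

def cdf_of_finish(x, child_cdfs, pdf_T, steps=2000):
    """Numerically evaluate (6.17):
    P{D(n) <= x} = int_0^x prod_m F_{D(m)}(x - t) * f_T(t) dt (midpoint rule)."""
    dt = x / steps
    total = 0.0
    for s in range(steps):
        t = (s + 0.5) * dt
        prod = 1.0
        for F in child_cdfs:
            prod *= F(x - t)
        total += prod * pdf_T(t) * dt
    return total

exp_pdf = lambda t: math.exp(-t)                 # T_n^l ~ Exp(1)
exp_cdf = lambda y: 1 - math.exp(-max(y, 0.0))
instant = lambda y: 1.0 if y >= 0 else 0.0       # a child that finishes at time 0

# with instantly-finishing children, (6.17) reduces to F_T(x)
approx = cdf_of_finish(2.0, [instant, instant], exp_pdf)
exact = exp_cdf(2.0)
```

With genuinely random children the computed CDF is strictly smaller than exact, reflecting the extra delay contributed by max{D^{(m_1)}, D^{(m_2)}}.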
We further investigate the value of K required to guarantee that PTP runs in polynomial time. Let the smallest precision of the probability measurement be \epsilon. That is, the system is no longer sensitive to any confidence difference smaller than \epsilon. For example, a partition that gives 90% confidence may not perform noticeably differently from one that gives 90.1% confidence. Let \Delta be the quantization step size, i.e., K = (1 - p_obj)/\Delta. Since the quantizer defined in (6.11) always underestimates the probability, a solution error occurs when the quantization error is large enough that PTP judges the optimal solution to be infeasible. Let l be the longest path from a leaf node to the root. Since the quantization error accumulates over the tasks on each path, the maximum quantization error is at most l\Delta. As the probability constraint in our optimization problem is a strict inequality, the probability of the optimal solution is at least \epsilon away from p_obj if we neglect the fractions below \epsilon. Hence, we choose the quantization step size \Delta = \epsilon/l to guarantee that, given any instance, the optimal solution is always considered feasible by PTP. In other words, K can be bounded by O(l/\epsilon). In the worst case, when the task graph is a chain (l = N), PTP runs in O(N^{d_in + 1} (1/\epsilon)^{d_in}) time. In the next section, numerical results show that PTP does not need as small an \epsilon as our theoretical analysis suggests, but provides the optimal solution most of the time.

6.3 Numerical Evaluation

Table 6.1: PTP: Simulation Profile

Notation    Value
T^l_i       Exponential, mean 1/\lambda ~ Uniform [10, 100] ms
T^r_i       Exponential, mean 1/\lambda ~ Uniform [1, 10] ms
T_c         100 ms
d_{mn}      Uniform [0.1, 10] MB
p_c         0.3 W
p_{RF}      0.7 W

We verify the accuracy of PTP on comprehensive problem instances. Our simulation is done as follows.
For every problem instance, we fix the task graph as a perfect binary tree while the simulation profile varies. Table 6.1 summarizes how we choose the simulation profile. To model the dynamic resource network, T^l_i and T^r_i follow independent exponential distributions with means drawn uniformly from the given intervals. Similarly, the amount of data exchange, d_{mn}, is drawn uniformly from its interval. The other parameters remain constant, as shown in Table 6.1. Given the chosen profile, we solve the optimal partition by PTP and compare the solution with the one given by the brute-force algorithm, which simply checks all partitions and chooses the optimal feasible one.

Figure 6.1: PTP: error probability (overall error probability and missing-solution probability versus K, for a task graph of depth d = 3)

Fig. 6.1 shows the performance of PTP when solving the stochastic optimization problem for a task graph with depth 3. In general, we classify the solution errors into three types: first, PTP may provide a feasible solution which is not the optimal one; second, PTP may not find any feasible solution although at least one exists; third, PTP may provide an infeasible solution.
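The quantizer of (6.11) can be sketched as follows (our own illustration, with hypothetical helper name make_quantizer): probabilities in (p_k, p_{k+1}] map down to the grid point p_k, so the quantized value never exceeds the true one.

```python
def make_quantizer(p_obj, K):
    """Grid p_1 = p_obj, ..., p_{K+1} = 1 over [p_obj, 1]; q maps
    x in (p_k, p_{k+1}] to p_k, i.e., it rounds probabilities down."""
    step = (1.0 - p_obj) / K
    grid = [p_obj + k * step for k in range(K + 1)]
    def q(x):
        for k in range(K, 0, -1):     # find the largest grid point below x
            if x > grid[k - 1]:
                return grid[k - 1]
        return grid[0]
    return q

q = make_quantizer(0.9, 10)   # step size 0.01, as in the simulation below
```

Because q only rounds down, a quantized confidence that still clears p_obj implies the true confidence does too, which is why the third error type cannot occur.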
6.4 Discussion We have formulated a task partition problem, Probabilistic delay constrained Task Partitioning, and have provided an algorithm, PTP, to solve the scenarios when the application task graph is a tree. Our partition provides strong performance guarantee that at least p% of time the application latency is less than t in the dynamic environment, wherep andt are arbitrary numbers. Furthermore, instead of relying on an integer programming formulation that may not be solve in polynomial time for all problem instances, we have shown that PTP runs in polynomial time with the problem size and provides the optimal solution most of the time. 119 Chapter 7 Online Learning in Stationary Environments In this study, we consider the scenario when the devices' computation ability and channel qualities are unknown and dynamic. We formulate the task assignment as a multi-armed bandit problem, where the performance on each device and channel are modeled as stationary processes that evolve i.i.d. with time. We propose an online algorithm that learns the unknown environment and makes competitive task assignment over the network. This chapter is based on our work in [36]. 7.1 Why Online Learning? Given an application that consists of multiple tasks, we want to assign them on mul- tiple devices, considering the resource availability so that the system performance can be improved. These resources that are accessible by wireless connections form a resource network, which is subject to frequent topology changes and has the following features: 120 Dynamic device behavior: The quantity of the released resource varies with devices, and may also depend on the local processes that are running. Moreover, some of devices may carry microporcessors that are specialized in performing a subset of tasks. Hence, the performance of each device varies highly over time and dierent tasks and is hard to model as a known and stationary stochastic process. 
Heterogeneous network with intermittent connections: The devices' mobility makes the connections intermittent, and their quality changes drastically within short time periods. Furthermore, different devices may use different protocols to communicate with each other. Hence, the performance of the links between devices is also highly dynamic and variable, and hard to model as a stationary process.

From the discussion above, since the resource network is subject to drastic changes over time and is hard to model with stationary stochastic processes, we need an algorithm that applies to all possible scenarios, learns the environment at run time, and adapts to changes. Existing works focus on solving optimization problems given a known deterministic profile or known stochastic distributions [58, 10]. These problems are hard to solve. More importantly, algorithms that lack learning ability can be badly harmed by statistical changes or by mismatch between the profile (offline training) and the run-time environment. Hence, we use an online learning approach, which takes into account the performance during the learning phase, and we aim to learn the environment quickly and adapt to changes. We start from stronger assumptions on the environment. That is, the resource network is
The dynamic environment (resource network) is described by devices' compu- tation performance and channels' bandwidth. We use the same set of notations in Chapter 4. Let T (j) be the latency of executing a unit task on device j, and T (jk) be the latency of transmitting a unit amount of data from device j to device k. Hence, the task execution latency and data transmission latency are T (j) i =m i T (j) ; (7.1) T (jk) mn =d mn T (jk) : (7.2) 122 We assume that T (j) and T (jk) are i.i.d. processes with unknown means (j) and (jk) , respectively. For some real applications, like the ones considered in [11], a stream of video frames comes as input to be processed frame by frame. For example, a video- processing application takes a continuous stream of image frames as input, where each image comes and goes though all processing tasks. Since the performance on each device and channel are unknown, an online algorithm aims to learn the devices and channels by making dierent task assignments for each data frame (ex- ploration), and make use of the ones with better performance (exploitation). How to balance between exploration and exploitation signicantly aects an algorithm's competitiveness [44]. 7.3 The Algorithm: Hermes with DSEE We adapt the sampling method, deterministic sequencing of exploration and ex- ploitation (DSEE) [45], to learn the unknown environment and derive the per- formance bound. The DSEE algorithm consists of two phases, exploration and exploitation. During the exploration phase, DSEE follows a xed order to probe (sample) the unknown distributions thoroughly. Then, in the exploitation phase, DSEE exploits the best strategy based on the probing result. In [45], learning the unknown environment is modeled as a multi-arm banded (MAB) problem, where at each time an agent chooses over a set of \arms", gets 123 the payo from the selected arm and tries to learn the statistical information from sensing it, which will be considered in future decision. 
The goal is to gure out the best arm from exploration and exploit it later on. However, the exploration costs some price due to the mismatch between the payos given by the explored arm and the best one. Hence, we have to eciently explore the environment and compare the performance with the optimal strategy (always choose the best arm). The authors in [45] prove the performance gap compared to the optimal strategy is bounded by a logarithmic function of number of trials as long as each arm is sampled logarithmically often. That is, if we get enough samples from each arm (O(lnV )) compared to total trialsV , we can make good enough decision such that the performance loss ats out with time, which implies we can learn and exploit the best arm without losing noticeable payo in the end. We adapt DSEE to sample all devices and channels thoroughly at the explo- ration phase, calculate the sample means, and applies Hermes to solve the optimal assignment based on sample means. During the exploration phase, we design a xed assignment strategy to get samples from devices and channels. For example, if task n follows after the execution of task m, by assigning task m to device j and assigning task n to device k, we could get one sample of T (j) , T (k) and T (jk) . Since sampling all the M 2 channels implies that all devices have been sampled M times, we focus on sampling all channels using as less executions of the application as possible. That is, we would like to know, for each frame (an execution of the 124 0 T T/2 T/4 1st round 2nd round B A A C B C selected edge Figure 7.1: The task graph with 3 edges as maximum matching application), what is the maximum number of dierent channels we can sample from. This number depends on the structure of the task graph, which, in fact, is lower-bounded by the matching number of the graph. A matching on a graph is a set of edges, where no two of which share a node [59]. 
The matching number of a graph is then the maximum number of edges that pairwise share no node. Taking an edge from such a set, which connects two tasks in the task graph, we can assign these two tasks arbitrarily to get a sample of data transmission over our desired channel.

Figure 7.1: The task graph with 3 edges as maximum matching, showing the device assignments (A, B, C) used in the 1st and 2nd sampling rounds, with the selected (matched) edges highlighted

Fig. 7.1 illustrates how we design the task assignment to sample as many channels as possible in one execution. First, we treat every directed edge as undirected and find that the graph has matching number 3. That is, we can sample at least 3 channels (AB, CA, BC) in one execution. Some tasks are left blank; we can assign them to other devices to get more samples.

Algorithm 7 Hermes with DSEE
1: procedure HERMES_DSEE(w)
2:   r ← \lceil d_max M^2 / |E| \rceil
3:   A(0) ← \emptyset   ▷ A(v) defines the set of exploration epochs up to v
4:   for v ← 1, ..., V do
5:     if |A(v-1)| < \lceil w ln v \rceil then   ▷ exploration phase
6:       for t ← 1, ..., r do   ▷ each epoch contains r frames
7:         sample the channels with strategy \hat{x}
8:       end for
9:       calculate the sample means \bar{\mu}^{(j)}(v) and \bar{\mu}^{(jk)}(v), for all j, k \in [M]
10:      A(v) ← A(v-1) \cup \{v\}
11:    else   ▷ exploitation phase
12:      solve for the best strategy \tilde{x}(v) with input T^{(j)}_i = m_i \bar{\mu}^{(j)}(v), T^{(jk)}_{mn} = d_{mn} \bar{\mu}^{(jk)}(v)
13:      for t ← 1, ..., r do
14:        exploit the assignment strategy \tilde{x}(v)
15:      end for
16:    end if
17:  end for
18: end procedure

In every exploration epoch, we want to get at least one sample from every channel. Hence, we want to know how many frames (executions) are needed in one epoch; we derive a bound for the general case. For a DAG, the matching number is shown to be lower-bounded by |E| / d_max, where d_max is the maximum degree of a node [60]. For example, the matching number of the graph in Fig. 7.1 is lower-bounded by 10/5 = 2. Hence, to sample each channel at least once, we require at most r = \lceil d_max M^2 / |E| \rceil frames. Algorithm 7 summarizes how we adapt Hermes to the dynamic environment. We separate the time (frame) horizon into epochs, each of which contains r frames.
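The epoch schedule of Algorithm 7 can be sketched as follows (our own illustration): epoch v is an exploration epoch whenever fewer than \lceil w ln v \rceil exploration epochs have occurred so far, so the number of exploration epochs grows only logarithmically in the horizon V.

```python
import math

def dsee_schedule(V, w):
    """Return the set A(V) of exploration epochs under the DSEE rule:
    epoch v explores iff |A(v-1)| < ceil(w * ln v)."""
    A = set()
    for v in range(1, V + 1):
        if len(A) < math.ceil(w * math.log(v)):
            A.add(v)
    return A

A = dsee_schedule(10000, w=2)   # exploration is a vanishing fraction of epochs
```

Early epochs are mostly exploration (the threshold grows faster than the count), while late epochs are almost all exploitation, matching the logarithmic-sampling requirement from [45].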
Let $\mathcal{A}(v-1) \subseteq \{1, \ldots, v-1\}$ be the set of exploration epochs prior to $v$. At epoch $v$, if the number of exploration epochs is below the threshold ($|\mathcal{A}(v-1)| < \lceil w \ln v \rceil$), then epoch $v$ is an exploration epoch. Algorithm 7 uses a fixed assignment strategy $\hat{x}$ to get samples. After $r$ frames have been processed, Algorithm 7 has at least one new sample from each channel and device, and updates the sample means. At an exploitation epoch, Algorithm 7 calls Hermes to solve for the best assignment strategy $\tilde{x}(v)$ based on the current sample means, and uses this strategy for the frames in the epoch.

In the following, we derive the performance guarantee of Algorithm 7. First, we present a lemma from [45], which gives a probability bound on the deviation of a sample mean.

Lemma 5. Let $\{X(t)\}_{t=1}^{\infty}$ be i.i.d. random variables drawn from a light-tailed distribution; that is, there exists $u_0 > 0$ such that $\mathbb{E}[\exp(uX)] < \infty$ for all $u \in [-u_0, u_0]$. Let $\bar{X}_s = \frac{1}{s}\sum_{t=1}^{s} X(t)$ and $\mu = \mathbb{E}[X(1)]$. Then, given $\delta > 0$, for all $\zeta \in [0, u_0]$ and $a \in (0, \frac{1}{2}]$,

    P\{|\bar{X}_s - \mu| \geq \delta\} \leq 2 \exp(-a \zeta \delta^2 s).    (7.3)

Lemma 5 implies that the more samples we get, the much smaller the chance that the sample mean deviates from the actual mean. From (4.2), the overall latency is the sum of single-stage latencies ($T_i^{(j)}$ and $T_{mn}^{(jk)}$) across the slowest branch. Hence, we would like to use Lemma 5 to bound the deviation of the total latency. Let $\Delta$ be the maximum latency solved by Algorithm 2 with the following input instance:

    T_i^{(j)} = m_i \delta, \forall i \in [N], j \in [M];    T_{mn}^{(jk)} = d_{mn} \delta, \forall (m,n) \in E, j, k \in [M].

Hence, if all the single-stage sample means deviate no more than $\delta$ from their actual means, then the overall latency deviates no more than $\Delta$. In order to prove the performance guarantee of Algorithm 7, we identify an event and bound its probability in the following lemma.

Lemma 6. Assume that $T^{(j)}$, $T^{(jk)}$ are independent random variables drawn from unknown light-tailed distributions with means $\mu^{(j)}$ and $\mu^{(jk)}$, for all $j, k \in [M]$.
Let $a$, $\zeta$ be numbers that satisfy Lemma 5. For each assignment strategy $x$, let $\bar{\Lambda}(x, v)$ be the total latency accumulated over the sample means calculated at epoch $v$, and let $\Lambda(x)$ be the actual expected total latency. We have, for each $v$,

    P\{\exists x \in [M]^N : |\bar{\Lambda}(x,v) - \Lambda(x)| > \Delta\} \leq \sum_{n \in [M^2+M]} \binom{M^2+M}{n} (-1)^{n+1} 2^n e^{-n a \zeta \delta^2 |\mathcal{A}(v-1)|}.    (7.4)

Proof. We want to bound the probability that there exists a strategy whose total deviation (accumulated over sample means) is greater than $\Delta$. We work with the complement event, that the total deviation of every strategy is at most $\Delta$. That is,

    P\{\exists x \in [M]^N : |\bar{\Lambda}(x,v) - \Lambda(x)| > \Delta\} = 1 - P\{|\bar{\Lambda}(x,v) - \Lambda(x)| \leq \Delta \; \forall x \in [M]^N\}.    (7.5)

We further use the fact that if every single-stage deviation is at most $\delta$, then the total deviation is at most $\Delta$ for every strategy $x \in [M]^N$. Hence,

    1 - P\{|\bar{\Lambda}(x,v) - \Lambda(x)| \leq \Delta \; \forall x\}    (7.6)
    \leq 1 - P\{(\cap_{j \in [M]} |\bar{\mu}^{(j)} - \mu^{(j)}| \leq \delta) \cap (\cap_{j,k \in [M]} |\bar{\mu}^{(jk)} - \mu^{(jk)}| \leq \delta)\}    (7.7)
    = 1 - \prod_{j \in [M]} P\{|\bar{\mu}^{(j)} - \mu^{(j)}| \leq \delta\} \prod_{j,k \in [M]} P\{|\bar{\mu}^{(jk)} - \mu^{(jk)}| \leq \delta\}    (7.8)
    \leq 1 - [1 - 2 e^{-a \zeta \delta^2 |\mathcal{A}(v-1)|}]^{M^2+M}    (7.9)
    = \sum_{n \in [M^2+M]} \binom{M^2+M}{n} (-1)^{n+1} 2^n e^{-n a \zeta \delta^2 |\mathcal{A}(v-1)|}.    (7.10)

Leveraging the independence of the random variables and Lemma 5, where at epoch $v$ we have at least $|\mathcal{A}(v-1)|$ samples of each unknown distribution, we arrive at (7.9). Finally, we use the binomial expansion to obtain (7.10).

In the following, we compare the performance of Algorithm 7 with the optimal strategy, which is obtained by solving Problem SCTA with the input instance

    T_i^{(j)} = m_i \mu^{(j)}, \forall i \in [N], j \in [M];    T_{mn}^{(jk)} = d_{mn} \mu^{(jk)}, \forall (m,n) \in E, j, k \in [M].

Theorem 4. Let $\Delta = \frac{c}{2}$, where $c$ is the smallest precision such that for any two assignment strategies $x$ and $y$ we have $|\Lambda(x) - \Lambda(y)| > c$ whenever $\Lambda(x) \neq \Lambda(y)$. Let $R_V$ be the expected performance gap accumulated up to epoch $V$, which can be bounded by

    R_V \leq r \Lambda_{\max} (w \ln V + 1) + r \Lambda_{\max} \sum_{n \in [M^2+M]} \binom{M^2+M}{n} (-1)^{n+1} 2^n \left(1 + \frac{1}{n a \zeta \delta^2 w - 1}\right).    (7.11)

Proof.
The expected performance gap consists of two parts: the expected loss due to the use of the fixed strategy during exploration ($R_V^{fix}$) and the expected loss due to the mismatch of strategies during exploitation ($R_V^{mis}$). During the exploration phase, the expected loss of each frame can be bounded by $\Lambda_{\max}$, which can be obtained by Algorithm 2 presented in Chapter 4, with $m_i \mu^{(j)}$ and $d_{mn} \mu^{(jk)}$ as the input instance. Since the number of exploration epochs $|\mathcal{A}(v)|$ never exceeds $(w \ln V + 1)$, we have

    R_V^{fix} \leq r \Lambda_{\max} (w \ln V + 1).    (7.12)

On the other hand, $R_V^{mis}$ accumulates during the exploitation phase whenever the best strategy given by the sample means differs from the optimal strategy, where the per-frame loss can again be bounded by $\Lambda_{\max}$. That is,

    R_V^{mis} \leq \mathbb{E}\{\sum_{v \notin \mathcal{A}(v)} r \Lambda_{\max} \mathbb{I}(\tilde{x}(v) \neq x^\star)\} = r \Lambda_{\max} \sum_{v \notin \mathcal{A}(v)} P\{\tilde{x}(v) \neq x^\star\}    (7.13)
    \leq r \Lambda_{\max} \sum_{v \notin \mathcal{A}(v)} P\{\exists x \in [M]^N : |\bar{\Lambda}(x,v) - \Lambda(x)| > \Delta\}    (7.14)
    \leq r \Lambda_{\max} \sum_{v \notin \mathcal{A}(v)} \sum_{n \in [M^2+M]} \binom{M^2+M}{n} (-1)^{n+1} 2^n e^{-n a \zeta \delta^2 |\mathcal{A}(v-1)|}    (7.15)
    \leq r \Lambda_{\max} \sum_{n \in [M^2+M]} \binom{M^2+M}{n} (-1)^{n+1} 2^n \sum_{v=1}^{\infty} v^{-n a \zeta \delta^2 w}    (7.16)
    \leq r \Lambda_{\max} \sum_{n \in [M^2+M]} \binom{M^2+M}{n} (-1)^{n+1} 2^n \left(1 + \frac{1}{n a \zeta \delta^2 w - 1}\right).    (7.17)

In (7.14), we bound the probability that the best strategy based on sample means is not the optimal strategy. We identify the event that there exists a strategy $x$ whose deviation is greater than $\Delta$. If this event does not happen, then in the worst case the difference between any two strategies deviates by at most $2\Delta = c$; hence $\bar{\Lambda}(x^\star, v)$ is still the minimum, which implies Algorithm 7 still outputs the optimal strategy. We then use Lemma 6 in (7.15), and (7.16) follows from the fact that epoch $v$ being in the exploitation phase implies $|\mathcal{A}(v-1)| \geq w \ln v$. Finally, selecting $w$ large enough that $a \zeta \delta^2 w > 1$ guarantees the result in (7.17).

Theorem 4 shows that the performance gap consists of two parts, one of which grows logarithmically with $V$ while the other remains constant as $V$ increases. Hence, the growth of the performance gap becomes negligible for large $V$, which implies that Algorithm 7, although it starts with no knowledge of the environment, learns the optimal strategy as time goes on. Furthermore, Theorem 4 provides an upper bound on the performance loss based on a worst-case analysis, in which $w$ is a parameter left to the user of Algorithm 7. A smaller $w$ leads to less probing (exploration) and hence reduces the accumulated loss during exploration, but may increase the chance of missing the optimal strategy during exploitation. In the next section, we compare Algorithm 7 with other algorithms by simulation.

7.4 Numerical Evaluation

To measure the performance of Algorithm 7 in a dynamic environment, we simulate an application that processes a stream of data frames. The resource network consists of 3 devices with unit processing time $T^{(j)}$ on device $j$. The devices form a mesh network with unit data transmission time $T^{(jk)}$ over the channel between devices $j$ and $k$. We model $T^{(j)}$ and $T^{(jk)}$ as stochastic processes that are uniformly distributed with given means and evolve i.i.d. over time. Hence, for each frame, we draw the samples from the corresponding uniform distributions, and get the single-stage latencies by (7.1) and (7.2).
[Figure 7.2: The performance of Hermes using DSEE in a dynamic environment (frame latency, frame cost, and gap to optimal versus frame number, with running averages and the optimal values)]

[Figure 7.3: Comparison of Hermes using DSEE with other algorithms (latency versus frame number for Hermes updated frame by frame, Hermes with random exploration, Hermes with DSEE, and the optimum)]

We adapt Algorithm 7 to probe the devices and channels and exploit the strategy that is best according to the sample means. Fig. 7.2 shows the performance of Hermes using DSEE as the sampling method. We see that the average latency per frame converges to the minimum, which implies that Algorithm 7 learns the optimal strategy and exploits it most of the time. On the other hand, during the exploration phase Algorithm 7 uses a strategy that costs less but performs worse than the optimal one. Hence, the average cost per frame is slightly lower than the cost induced by the optimal strategy. Finally, we measure the performance gap, which is the extra latency caused by sub-optimal strategies accumulated over frames. The gap flattens out in the end, which implies that the growth of the extra latency becomes negligible.

We compare Algorithm 7 with two other algorithms in Fig. 7.3. First, we propose a randomized sampling method as a baseline. During the exploration phase, Algorithm 7 uses a fixed strategy to sample the devices and channels thoroughly, whereas the baseline randomly selects an assignment strategy and gathers the samples. The biased sample means result in significant performance loss during the exploitation phase. We also consider an algorithm that re-solves for the best strategy every frame.
That is, at the end of each frame, it updates the sample means and runs Hermes to solve for the best strategy for the next frame. We can see that by updating the strategy every frame, the performance is slightly better than Algorithm 7. However, Algorithm 7 only runs Hermes at the beginning of each exploitation phase, which adds only a tolerable amount of CPU load but provides competitive performance.

7.5 Discussion

We have formulated a multi-armed bandit problem to make task assignments in an unknown and dynamic environment, where the performance of each device and channel is modeled as a stationary process that evolves i.i.d. with time. We have proposed an algorithm that combines Hermes, presented in Chapter 4, with the DSEE sampling method. Our algorithm uses a fixed sampling sequence to sample all the devices and channels thoroughly during the exploration phase, and runs Hermes to make the optimal task assignment based on the sampling results during the exploitation phase. Simulation results have validated our performance analysis and shown that our proposed algorithm makes competitive task assignments compared to the optimal one.

Chapter 8
Online Learning in Non-stationary Environments

In this study, we follow the online learning scenario considered in Chapter 7 but make no stochastic assumptions on the bandit processes. Instead, we adopt the adversarial multi-armed bandit formulation, in which each arm's payoff is given by an arbitrary but bounded sequence. Hence, our proposed algorithm applies to any dynamic environment, including non-stationary ones. This chapter is based on our work in [37].

8.1 Problem Formulation

Suppose a data processing application consists of $N$ tasks, whose dependencies are described by a directed acyclic graph (DAG) $G = (V, E)$. There is an incoming data stream to be processed ($T$ data frames in total), where each data frame $t$ is required to go through all the tasks and leave afterwards. There are $M$ available devices. The assignment strategy for data frame $t$ is denoted by a vector $x^t = (x_1^t, \ldots, x_N^t)$, where $x_i^t$ denotes the device that executes task $i$. Given an assignment strategy, stage-wise costs apply to each node (task) for computation and each edge for communication. The cost can correspond to the resource consumption for a device to complete a task, for example, energy consumption. In the following formulation we follow the tradition in the MAB literature and focus on maximizing a positive reward instead of minimizing the total cost; these are of course mathematically equivalent, e.g., by setting reward = maxCost − cost. That is, instead of minimizing the total latency, we have an equivalent formulation that maximizes the total time saved compared to the worst case.

When processing data frame $t$, let $R_i^{(j)}(t)$ be the reward of executing task $i$ on device $j$, and let $R_{mn}^{(jk)}(t)$ be the reward of transmitting the data of edge $(m,n)$ from device $j$ to $k$. The reward sequences are unknown but bounded between 0 and 1. Our goal is to find the assignment strategy for each data frame based on previously observed samples, and to compare the performance with a genie that uses the best assignment strategy for all data frames. That is,

    R_{total}^{max} = \max_{x \in \mathcal{F}} \sum_{t=1}^{T} \left( \sum_{i=1}^{N} R_i^{(x_i)}(t) + \sum_{(m,n) \in E} R_{mn}^{(x_m x_n)}(t) \right),    (8.1)

where $\mathcal{F}$ represents the set of feasible solutions. The genie, who knows all the reward sequences, can find the best assignment strategy; not knowing these sequences in advance, our proposed online algorithm aims to learn this best strategy and remain competitive in overall performance.

Algorithm 8 MABSTA
1: procedure MABSTA($\gamma$, $\alpha$)
2:   $w_y(1) \leftarrow 1 \; \forall y \in \mathcal{F}$
3:   for $t \leftarrow 1, 2, \ldots, T$ do
4:     $W_t \leftarrow \sum_{y \in \mathcal{F}} w_y(t)$
5:     Draw $x^t$ from the distribution
         p_y(t) = (1 - \gamma) \frac{w_y(t)}{W_t} + \frac{\gamma}{|\mathcal{F}|}.    (8.2)
6:     Get rewards $\{R_i^{(x_i^t)}(t)\}_{i=1}^N$, $\{R_{mn}^{(x_m^t x_n^t)}(t)\}_{(m,n) \in E}$.
7:     $C_i^{ex} \leftarrow \{z \in \mathcal{F} \mid z_i = x_i^t\}, \forall i$
8:     $C_{mn}^{tx} \leftarrow \{z \in \mathcal{F} \mid z_m = x_m^t, z_n = x_n^t\}, \forall (m,n)$
9:     for all $j \in [M]$, $i \in [N]$ do
         \hat{R}_i^{(j)}(t) = \begin{cases} R_i^{(j)}(t) / \sum_{z \in C_i^{ex}} p_z(t) & \text{if } x_i^t = j; \\ 0 & \text{otherwise.} \end{cases}    (8.3)
10:    end for
11:    for all $j, k \in [M]$, $(m,n) \in E$ do
         \hat{R}_{mn}^{(jk)}(t) = \begin{cases} R_{mn}^{(jk)}(t) / \sum_{z \in C_{mn}^{tx}} p_z(t) & \text{if } x_m^t = j, x_n^t = k; \\ 0 & \text{otherwise.} \end{cases}    (8.4)
12:    end for
13:    Update, for all $y$,
         \hat{R}_y(t) = \sum_{i=1}^{N} \hat{R}_i^{(y_i)}(t) + \sum_{(m,n) \in E} \hat{R}_{mn}^{(y_m y_n)}(t),    (8.5)
         w_y(t+1) = w_y(t) \exp(\alpha \hat{R}_y(t)).    (8.6)
14:  end for
15: end procedure

8.2 MABSTA Algorithm

We propose MABSTA (Multi-Armed Bandit based Systematic Task Assignment), summarized in Algorithm 8, which learns the environment and makes task assignments at run time. For each data frame $t$, MABSTA randomly selects a feasible assignment (arm $x \in \mathcal{F}$) from a probability distribution that depends on the weights of the arms ($w_y(t)$). Then it updates the weights based on the reward samples. From (8.2), MABSTA randomly switches between two phases: exploitation (with probability $1 - \gamma$) and exploration (with probability $\gamma$). In the exploitation phase, MABSTA selects an arm based on its weight; an arm with higher reward samples is chosen more often. In the exploration phase, MABSTA selects an arm uniformly, without considering its performance. The fact that MABSTA keeps probing every arm makes it adaptive to changes in the environment, in contrast to a static strategy that plays the previously best arm all the time without noticing that other arms may now perform better.

The commonly used performance measure for an MAB algorithm is its regret. In our case it is defined as the difference in accumulated rewards ($\hat{R}_{total}$) compared to a genie that knows all the rewards and selects a single best strategy for all data frames ($R_{total}^{max}$ in (8.1)). Auer et al. [30] propose Exp3 for the adversarial MAB.
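The three ingredients of MABSTA — the mixed distribution (8.2), the importance-weighted reward estimates (8.3)–(8.4), and the multiplicative update (8.6) — can be sketched in miniature as follows. This is an illustrative mock-up over a small explicit arm set, not the dissertation's implementation; the function names and the toy rewards are ours, and `alpha` plays the role of $\alpha$.

```python
import math

def mabsta_probs(weights, gamma):
    # (8.2): mix the normalized exponential weights with uniform exploration.
    W = sum(weights.values())
    K = len(weights)
    return {y: (1 - gamma) * w / W + gamma / K for y, w in weights.items()}

def iw_estimate(rewards, probs, chosen):
    # (8.3)-(8.4) in miniature: only the chosen arm's reward is observed;
    # dividing by the probability of observing it keeps the estimate
    # unbiased, E[r_hat_y] = r_y, which is the content of Lemma 8 below.
    return {y: rewards[y] / probs[y] if y == chosen else 0.0 for y in probs}

def weight_update(w_y, r_hat_y, alpha):
    # (8.6): multiplicative update with the importance-weighted estimate.
    return w_y * math.exp(alpha * r_hat_y)
```

With two arms of weights 1 and 3 and $\gamma = 0.1$, the probabilities are $0.275$ and $0.725$; averaging `iw_estimate` over the draw of the chosen arm recovers the true reward exactly, which is why the analysis can treat $\hat{R}_y(t)$ as an unbiased proxy for $R_y(t)$.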
However, if we apply Exp3 to our online task assignment problem, since we have an exponential number of arms ($M^N$), the regret bound grows exponentially. The following theorem shows that MABSTA guarantees a regret bound that is polynomial in the problem size and $O(\sqrt{T})$.

Theorem 5. Assume all the reward sequences are bounded between 0 and 1. Let $\hat{R}_{total}$ be the total reward achieved by Algorithm 8. For any $\gamma \in (0, 1)$, letting $\alpha = \frac{\gamma}{M(N + |E|M)}$, we have

    R_{total}^{max} - \mathbb{E}\{\hat{R}_{total}\} \leq (e-1) \gamma R_{total}^{max} + \frac{M(N + |E|M) \ln M^N}{\gamma}.    (8.7)

Here $N$ is the number of nodes (tasks) and $|E|$ is the number of edges in the task graph. We leave the proof of Theorem 5 to Section 8.3. Applying the appropriate value of $\gamma$ and using the upper bound $R_{total}^{max} \leq (N + |E|)T$, we have the following corollary.

Corollary 1. Let $\gamma = \min\{1, \sqrt{\frac{M(N + |E|M) \ln M^N}{(e-1)(N + |E|)T}}\}$. Then

    R_{total}^{max} - \mathbb{E}\{\hat{R}_{total}\} \leq 2.63 \sqrt{(N + |E|)(N + |E|M) M N T \ln M}.    (8.8)

In the worst case, $|E| = O(N^2)$, so the regret can be bounded by $O(N^{2.5} M T^{0.5})$. Since the bound is a concave function of $T$, we define the learning time $T_0$ as the time when its slope falls below a constant $c$. That is,

    T_0 = \frac{1.73}{c^2} (N + |E|)(N + |E|M) M N \ln M.    (8.9)

This learning time is significantly improved compared with applying Exp3 to our problem, where $T_0 = O(M^N)$. As we will show in the numerical results, MABSTA performs significantly better than Exp3 in the trace-data emulation.

8.3 Proof of Theorem 5

We first prove the following lemmas. We use condensed notations such as $\hat{R}_i^{(y_i)}$ for $\hat{R}_i^{(y_i)}(t)$ and $\hat{R}_{mn}^{(y_m y_n)}$ for $\hat{R}_{mn}^{(y_m y_n)}(t)$ in the proofs wherever the result holds for each $t$.

8.3.1 Proof of lemmas

Lemma 7. For all $t = 1, \ldots, T$, we have

    \sum_{y \in \mathcal{F}} p_y(t) \hat{R}_y(t) = \sum_{i=1}^{N} R_i^{(x_i^t)}(t) + \sum_{(m,n) \in E} R_{mn}^{(x_m^t x_n^t)}(t).    (8.10)

Proof.
    \sum_{y \in \mathcal{F}} p_y(t) \hat{R}_y(t) = \sum_{y \in \mathcal{F}} p_y \left( \sum_{i=1}^{N} \hat{R}_i^{(y_i)} + \sum_{(m,n) \in E} \hat{R}_{mn}^{(y_m y_n)} \right) = \sum_i \sum_y p_y \hat{R}_i^{(y_i)} + \sum_{(m,n)} \sum_y p_y \hat{R}_{mn}^{(y_m y_n)},    (8.11)

where

    \sum_y p_y \hat{R}_i^{(y_i)} = \sum_{y \in C_i^{ex}} p_y \frac{R_i^{(x_i^t)}}{\sum_{z \in C_i^{ex}} p_z} = R_i^{(x_i^t)},    (8.12)

and similarly,

    \sum_y p_y \hat{R}_{mn}^{(y_m y_n)} = R_{mn}^{(x_m^t x_n^t)}.    (8.13)

Applying these results to (8.11) completes the proof.

Lemma 8. For all $y \in \mathcal{F}$, we have

    \mathbb{E}\{\hat{R}_y(t)\} = \sum_{i=1}^{N} R_i^{(y_i)}(t) + \sum_{(m,n) \in E} R_{mn}^{(y_m y_n)}(t).    (8.14)

Proof.

    \mathbb{E}\{\hat{R}_y(t)\} = \sum_{i=1}^{N} \mathbb{E}\{\hat{R}_i^{(y_i)}\} + \sum_{(m,n) \in E} \mathbb{E}\{\hat{R}_{mn}^{(y_m y_n)}\},    (8.15)

where

    \mathbb{E}\{\hat{R}_i^{(y_i)}\} = P\{x_i^t = y_i\} \frac{R_i^{(y_i)}}{\sum_{z \in C_i^{ex}} p_z} = R_i^{(y_i)},    (8.16)

and similarly,

    \mathbb{E}\{\hat{R}_{mn}^{(y_m y_n)}\} = R_{mn}^{(y_m y_n)}.    (8.17)

Lemma 9. If $\mathcal{F} = \{x \in [M]^N\}$, then for $M \geq 3$ and $|E| \geq 3$,

    \sum_{y \in \mathcal{F}} p_y(t) \hat{R}_y(t)^2 \leq \frac{|E|}{M^{N-2}} \sum_{y \in \mathcal{F}} \hat{R}_y(t).    (8.18)

Proof. We first expand the left-hand side of the inequality as

    \sum_{y \in \mathcal{F}} p_y(t) \hat{R}_y(t)^2 = \sum_{y \in \mathcal{F}} p_y \left( \sum_{i=1}^{N} \hat{R}_i^{(y_i)} + \sum_{(m,n) \in E} \hat{R}_{mn}^{(y_m y_n)} \right)^2    (8.19)
    = \sum_{y \in \mathcal{F}} p_y \left( \sum_{i,j} \hat{R}_i^{(y_i)} \hat{R}_j^{(y_j)} + \sum_{(m,n),(u,v)} \hat{R}_{mn}^{(y_m y_n)} \hat{R}_{uv}^{(y_u y_v)} + 2 \sum_i \sum_{(m,n)} \hat{R}_i^{(y_i)} \hat{R}_{mn}^{(y_m y_n)} \right).    (8.20)

In the following, we derive an upper bound for each term in (8.20), for all $i \in [N]$ and $(m,n) \in E$:

    \sum_y p_y \hat{R}_i^{(y_i)} \hat{R}_j^{(y_j)} = \sum_{y \in C_i^{ex} \cap C_j^{ex}} p_y \frac{R_i^{(x_i^t)} R_j^{(x_j^t)}}{\sum_{z \in C_i^{ex}} p_z \sum_{z \in C_j^{ex}} p_z}    (8.21)
    \leq \frac{R_j^{(x_j^t)} R_i^{(x_i^t)}}{\sum_{z \in C_i^{ex}} p_z} = R_j^{(x_j^t)} \hat{R}_i^{(x_i^t)} \leq \frac{1}{M^{N-1}} \sum_y \hat{R}_i^{(y_i)}.    (8.22)

The first inequality in (8.22) follows because $C_i^{ex} \cap C_j^{ex}$ is a subset of $C_j^{ex}$, and the last inequality follows because $\hat{R}_i^{(y_i)} = \hat{R}_i^{(x_i^t)}$ for all $y$ in $C_i^{ex}$.
Hence,

    \sum_{i,j} \sum_y p_y \hat{R}_i^{(y_i)} \hat{R}_j^{(y_j)} \leq \frac{1}{M^{N-2}} \sum_y \sum_i \hat{R}_i^{(y_i)}.    (8.23)

Similarly,

    \sum_{(m,n),(u,v)} \sum_y p_y \hat{R}_{mn}^{(y_m y_n)} \hat{R}_{uv}^{(y_u y_v)} \leq \frac{|E|}{M^{N-2}} \sum_y \sum_{(m,n)} \hat{R}_{mn}^{(y_m y_n)}.    (8.24)

For the last term in (8.20), a similar argument gives

    \sum_y p_y \hat{R}_i^{(y_i)} \hat{R}_{mn}^{(y_m y_n)} = \sum_{y \in C_i^{ex} \cap C_{mn}^{tx}} p_y \frac{R_i^{(x_i^t)} R_{mn}^{(x_m^t x_n^t)}}{\sum_{z \in C_i^{ex}} p_z \sum_{z \in C_{mn}^{tx}} p_z}    (8.25)
    \leq \frac{R_{mn}^{(x_m^t x_n^t)} R_i^{(x_i^t)}}{\sum_{z \in C_i^{ex}} p_z} = R_{mn}^{(x_m^t x_n^t)} \hat{R}_i^{(x_i^t)} \leq \frac{1}{M^{N-1}} \sum_y \hat{R}_i^{(y_i)}.    (8.26)

Hence,

    \sum_i \sum_{(m,n)} \sum_y p_y \hat{R}_i^{(y_i)} \hat{R}_{mn}^{(y_m y_n)} \leq \frac{|E|}{M^{N-1}} \sum_y \sum_i \hat{R}_i^{(y_i)}.    (8.27)

Applying (8.23), (8.24) and (8.27) to (8.20) gives

    \sum_{y \in \mathcal{F}} p_y(t) \hat{R}_y(t)^2 \leq \sum_{y \in \mathcal{F}} \left[ \sum_i \left( \frac{1}{M^{N-2}} + \frac{2|E|}{M^{N-1}} \right) \hat{R}_i^{(y_i)} + \sum_{(m,n)} \frac{|E|}{M^{N-2}} \hat{R}_{mn}^{(y_m y_n)} \right]    (8.28)
    \leq \frac{|E|}{M^{N-2}} \sum_{y \in \mathcal{F}} \hat{R}_y(t).    (8.29)

The last inequality follows from the fact that $\frac{1}{M^{N-2}} + \frac{2|E|}{M^{N-1}} \leq \frac{|E|}{M^{N-2}}$ for $M \geq 3$ and $|E| \geq 3$. For $M = 2$, we have

    \sum_{y \in \mathcal{F}} p_y(t) \hat{R}_y(t)^2 \leq \frac{M + 2|E|}{M^{N-1}} \sum_{y \in \mathcal{F}} \hat{R}_y(t).    (8.30)

Since we are interested in the regime where (8.29) holds, we will use this result in our proof of Theorem 5.

Lemma 10. Let $\alpha = \frac{\gamma}{M(N + |E|M)}$. If $\mathcal{F} = \{x \in [M]^N\}$, then for all $y \in \mathcal{F}$ and all $t = 1, \ldots, T$, we have $\alpha \hat{R}_y(t) \leq 1$.

Proof. Since $|C_i^{ex}| = M^{N-1}$ and $|C_{mn}^{tx}| = M^{N-2}$ for all $i \in [N]$ and $(m,n) \in E$, and each arm has probability at least $\gamma / M^N$, each term in $\hat{R}_y(t)$ can be upper bounded as

    \hat{R}_i^{(y_i)} \leq \frac{R_i^{(y_i)}}{\sum_{z \in C_i^{ex}} p_z} \leq \frac{1}{M^{N-1} \cdot \gamma / M^N} = \frac{M}{\gamma},    (8.31)
    \hat{R}_{mn}^{(y_m y_n)} \leq \frac{R_{mn}^{(y_m y_n)}}{\sum_{z \in C_{mn}^{tx}} p_z} \leq \frac{1}{M^{N-2} \cdot \gamma / M^N} = \frac{M^2}{\gamma}.    (8.32)

Hence, we have

    \hat{R}_y(t) = \sum_{i=1}^{N} \hat{R}_i^{(y_i)} + \sum_{(m,n) \in E} \hat{R}_{mn}^{(y_m y_n)} \leq N \frac{M}{\gamma} + |E| \frac{M^2}{\gamma} = \frac{M(N + |E|M)}{\gamma}.    (8.33)

Letting $\alpha = \frac{\gamma}{M(N + |E|M)}$ achieves the result.

8.3.2 Proof of Theorem 5

Proof. Let $W_t = \sum_{y \in \mathcal{F}} w_y(t)$. We denote the sequence of decisions drawn at each frame as $x = [x^1, \ldots, x^T]$, where $x^t \in \mathcal{F}$ denotes the arm drawn at step $t$.
Then for each data frame $t$,

    \frac{W_{t+1}}{W_t} = \sum_{y \in \mathcal{F}} \frac{w_y(t)}{W_t} \exp(\alpha \hat{R}_y(t))    (8.34)
    = \sum_{y \in \mathcal{F}} \frac{p_y(t) - \gamma/|\mathcal{F}|}{1 - \gamma} \exp(\alpha \hat{R}_y(t))    (8.35)
    \leq \sum_{y \in \mathcal{F}} \frac{p_y(t) - \gamma/|\mathcal{F}|}{1 - \gamma} \left[ 1 + \alpha \hat{R}_y(t) + (e-2) \alpha^2 \hat{R}_y(t)^2 \right]    (8.36)
    \leq 1 + \frac{\alpha}{1 - \gamma} \left( \sum_{i=1}^{N} R_i^{(x_i^t)}(t) + \sum_{(m,n) \in E} R_{mn}^{(x_m^t x_n^t)}(t) \right) + \frac{(e-2) \alpha^2}{1 - \gamma} \frac{|E|}{M^{N-2}} \sum_{y \in \mathcal{F}} \hat{R}_y(t).    (8.37)

Eq. (8.36) follows from the fact that $e^x \leq 1 + x + (e-2)x^2$ for $x \leq 1$, which applies by Lemma 10. Applying Lemma 7 and Lemma 9, we arrive at (8.37). Using $1 + x \leq e^x$ and taking logarithms on both sides,

    \ln \frac{W_{t+1}}{W_t} \leq \frac{\alpha}{1 - \gamma} \left( \sum_{i=1}^{N} R_i^{(x_i^t)}(t) + \sum_{(m,n) \in E} R_{mn}^{(x_m^t x_n^t)}(t) \right) + \frac{(e-2) \alpha^2}{1 - \gamma} \frac{|E|}{M^{N-2}} \sum_{y \in \mathcal{F}} \hat{R}_y(t).    (8.38)

Summing from $t = 1$ to $T$ gives

    \ln \frac{W_{T+1}}{W_1} \leq \frac{\alpha}{1 - \gamma} \hat{R}_{total} + \frac{(e-2) \alpha^2}{1 - \gamma} \frac{|E|}{M^{N-2}} \sum_{t=1}^{T} \sum_{y \in \mathcal{F}} \hat{R}_y(t).    (8.39)

On the other hand,

    \ln \frac{W_{T+1}}{W_1} \geq \ln \frac{w_z(T+1)}{W_1} = \alpha \sum_{t=1}^{T} \hat{R}_z(t) - \ln M^N, \; \forall z \in \mathcal{F}.    (8.40)

Combining (8.39) and (8.40) gives

    \hat{R}_{total} \geq (1 - \gamma) \sum_{t=1}^{T} \hat{R}_z(t) - (e-2) \alpha \frac{|E|}{M^{N-2}} \sum_{t=1}^{T} \sum_{y \in \mathcal{F}} \hat{R}_y(t) - \frac{\ln M^N}{\alpha}.    (8.41)

Eq. (8.41) holds for all $z \in \mathcal{F}$. Choose $x^\star$ to be the assignment strategy that maximizes the objective in (8.1). Now we take expectations on both sides with respect to $x^1, \ldots, x^T$ and use Lemma 8. That is,

    \sum_{t=1}^{T} \mathbb{E}\{\hat{R}_{x^\star}(t)\} = \sum_{t=1}^{T} \left[ \sum_{i=1}^{N} R_i^{(x_i^\star)}(t) + \sum_{(m,n) \in E} R_{mn}^{(x_m^\star x_n^\star)}(t) \right] = R_{total}^{max},    (8.42)

and

    \sum_{t=1}^{T} \sum_{y \in \mathcal{F}} \mathbb{E}\{\hat{R}_y(t)\} = \sum_{t=1}^{T} \sum_{y \in \mathcal{F}} \left( \sum_{i=1}^{N} R_i^{(y_i)}(t) + \sum_{(m,n) \in E} R_{mn}^{(y_m y_n)}(t) \right) \leq M^N R_{total}^{max}.    (8.43)

Applying these to (8.41) gives

    \mathbb{E}\{\hat{R}_{total}\} \geq (1 - \gamma) R_{total}^{max} - \alpha |E| M^2 (e-2) R_{total}^{max} - \frac{\ln M^N}{\alpha}.    (8.44)

Letting $\alpha = \frac{\gamma}{M(N + |E|M)}$, we arrive at

    R_{total}^{max} - \mathbb{E}\{\hat{R}_{total}\} \leq (e-1) \gamma R_{total}^{max} + \frac{M(N + |E|M) \ln M^N}{\gamma}.    (8.45)

8.4 Polynomial-Time MABSTA

In Algorithm 8, since there are exponentially many arms, a naive implementation results in exponential storage and complexity. In the following, however, we propose an equivalent but efficient implementation.
We show that when the task graph belongs to a subset of DAGs that appear in practical applications (namely, parallel chains of trees), Algorithm 8 can run in polynomial time with polynomial storage.

We observe that in (8.5), $\hat{R}_y(t)$ relies on the estimates of each node and each edge. Hence, we rewrite (8.6) as

    w_y(t+1) = \exp\left( \alpha \sum_{\tau=1}^{t} \hat{R}_y(\tau) \right)    (8.46)
    = \exp\left( \alpha \left( \sum_{i=1}^{N} \tilde{R}_i^{(y_i)}(t) + \sum_{(m,n) \in E} \tilde{R}_{mn}^{(y_m y_n)}(t) \right) \right),    (8.47)

where

    \tilde{R}_i^{(y_i)}(t) = \sum_{\tau=1}^{t} \hat{R}_i^{(y_i)}(\tau), \quad \tilde{R}_{mn}^{(y_m y_n)}(t) = \sum_{\tau=1}^{t} \hat{R}_{mn}^{(y_m y_n)}(\tau).    (8.48)

Algorithm 9 Calculate $\omega_N^{(j)}$ for a tree-structured task graph
 1: procedure ($N$, $M$, $G$)
 2:   $q \leftarrow$ BFS($G$, $N$)    ▷ run BFS from node $N$ and store visited nodes in order
 3:   for $i \leftarrow q$.end, $\ldots$, $q$.start do    ▷ start from the last element
 4:     if $i$ is a leaf then    ▷ initialize $\omega$ values of leaves
 5:       $\omega_i^{(j)} \leftarrow e_i^{(j)}$
 6:     else
 7:       $\omega_i^{(j)} \leftarrow e_i^{(j)} \prod_{m \in \mathcal{N}_i} \sum_{y_m \in [M]} e_{mi}^{(y_m j)} \omega_m^{(y_m)}$
 8:     end if
 9:   end for
10: end procedure

To calculate $w_y(t)$, it suffices to store $\tilde{R}_i^{(j)}(t)$ and $\tilde{R}_{mn}^{(jk)}(t)$ for all $i \in [N]$, $(m,n) \in E$ and $j, k \in [M]$, which costs $O(NM + |E|M^2)$ storage. Eqs. (8.3) and (8.4) require the knowledge of the marginal probabilities $P\{x_i^t = j\}$ and $P\{x_m^t = j, x_n^t = k\}$. Next, we propose a polynomial-time algorithm to calculate them. From (8.2), the marginal probability can be written as

    P\{x_i^t = j\} = (1 - \gamma) \frac{1}{W_t} \sum_{y : y_i = j} w_y(t) + \frac{\gamma}{M}.    (8.49)

Hence, without calculating $W_t$, we have

    \left( P\{x_i^t = j\} - \frac{\gamma}{M} \right) : \left( P\{x_i^t = k\} - \frac{\gamma}{M} \right) = \sum_{y : y_i = j} w_y(t) : \sum_{y : y_i = k} w_y(t).    (8.50)

[Figure 8.1: The dependency of weights on a tree-structured task graph]

8.4.1 Tree-structured Task Graph

Now we focus on how to calculate the sum of weights in (8.50) efficiently. We start from tree-structured task graphs and solve more general graphs by calling the proposed algorithm for trees a polynomial number of times. We drop the time index $t$ in our derivation whenever the result holds for all time steps $t \in \{1, \ldots, T\}$.
For example, $\tilde{R}_i^{(j)} \equiv \tilde{R}_i^{(j)}(t)$. We assume that the task graph is a tree with $N$ nodes, where the $N$th node is the root (final task). Let $e_i^{(j)} = \exp(\alpha \tilde{R}_i^{(j)})$ and $e_{mn}^{(jk)} = \exp(\alpha \tilde{R}_{mn}^{(jk)})$. Hence, the sum of the exponentials in (8.47) can be written as a product of the $e_i^{(j)}$ and $e_{mn}^{(jk)}$. That is,

    \sum_y w_y(t) = \sum_y \prod_{i=1}^{N} e_i^{(y_i)} \prod_{(m,n) \in E} e_{mn}^{(y_m y_n)}.    (8.51)

For a node $v$, we use $\mathcal{D}_v$ to denote the set of its descendants, and let the set $\mathcal{E}_v$ denote the edges connecting its descendants. Formally,

    \mathcal{E}_v = \{(m,n) \in E \mid m \in \mathcal{D}_v, n \in \mathcal{D}_v \cup \{v\}\}.    (8.52)

The set of $|\mathcal{D}_v|$-dimensional vectors $\{y_m\}_{m \in \mathcal{D}_v}$ denotes all possible assignments of its descendants. Finally, we define the sub-problem $\omega_i^{(j)}$, which calculates the sum of the weights of all possible assignments of task $i$'s descendants, given that task $i$ is assigned to device $j$. That is,

    \omega_i^{(j)} = e_i^{(j)} \sum_{\{y_m\}_{m \in \mathcal{D}_i}} \prod_{m \in \mathcal{D}_i} e_m^{(y_m)} \prod_{(m,n) \in \mathcal{E}_i} e_{mn}^{(y_m y_n)}.    (8.53)

Figure 8.1 shows an example of a tree-structured task graph. Tasks 4 and 5 are the children of task 6, where we have $\mathcal{D}_6 = \{1, 2, 3, 4, 5\}$ and $\mathcal{E}_6 = \{(1,4), (2,4), (3,5), (4,6), (5,6)\}$. From (8.53), if we have $\omega_4^{(k)}$ and $\omega_5^{(l)}$ for all $k$ and $l$, then $\omega_6^{(j)}$ can be solved by

    \omega_6^{(j)} = e_6^{(j)} \sum_{k,l} e_{46}^{(kj)} \omega_4^{(k)} e_{56}^{(lj)} \omega_5^{(l)}.    (8.54)

In general, the relation between the weights of task $i$ and its children $m \in \mathcal{N}_i$ is given by

    \omega_i^{(j)} = e_i^{(j)} \sum_{\{y_m\}_{m \in \mathcal{N}_i}} \prod_{m \in \mathcal{N}_i} e_{mi}^{(y_m j)} \omega_m^{(y_m)} = e_i^{(j)} \prod_{m \in \mathcal{N}_i} \sum_{y_m \in [M]} e_{mi}^{(y_m j)} \omega_m^{(y_m)}.    (8.55)

[Figure 8.2: The dependency of weights on a serial-tree task graph]

Algorithm 9 summarizes our approach to calculating the sum of weights of a tree-structured task graph. We first run breadth-first search (BFS) from the root node. Then we solve the sub-problems starting from the last visited node, so that when solving task $i$, all of its child tasks are guaranteed to have been solved.
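The bottom-up recursion (8.55) behind Algorithm 9 can be sketched as a small dynamic program. This is an illustrative mock-up, not the dissertation's code: `children` maps each task to its child tasks, and `e_node[i][j]` and `e_edge[(m, i)][k][j]` stand for the per-node and per-edge exponentials $e_i^{(j)}$ and $e_{mi}^{(kj)}$ defined above.

```python
def tree_weights(children, e_node, e_edge, root, M):
    # omega[i][j]: sum of weights of all assignments of task i's descendants,
    # given task i runs on device j (Eq. (8.55)), solved children-first.
    omega = {}

    def solve(i):
        for m in children.get(i, []):
            solve(m)
        omega[i] = []
        for j in range(M):
            w = e_node[i][j]
            for m in children.get(i, []):
                # Sum out the child's device choice against the edge factor.
                w *= sum(e_edge[(m, i)][k][j] * omega[m][k] for k in range(M))
            omega[i].append(w)

    solve(root)
    return omega[root]
```

For a two-task chain $1 \to 2$ with $M = 2$ and all exponentials equal to 1, every one of the $M^N = 4$ assignments has weight 1, and each root entry $\omega_2^{(j)}$ collects the $M^{N-1} = 2$ assignments of task 1, so the routine returns `[2.0, 2.0]`.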
Let $d_{in}$ denote the maximum in-degree of $G$ (i.e., the maximum number of incoming edges of a node). Running BFS takes polynomial time. Each sub-problem involves at most $d_{in}$ products of summations over $M$ terms, and Algorithm 9 solves $NM$ sub-problems in total. Hence, Algorithm 9 runs in $O(d_{in} N M^2)$ time.

8.4.2 More general task graphs

All of the nodes in a tree-structured task graph have only one outgoing edge. For task graphs in which some node has multiple outgoing edges, we decompose the task graph into multiple trees, solve them separately, and combine the solutions in the end. In the following, we use the example of a task graph consisting of serial trees to illustrate our approach.

Figure 8.2 shows a task graph that has two trees, rooted at tasks $i_1$ and $i_2$ respectively. Let the sub-problem $\omega_{i_2|i_1}^{(j_2|j_1)}$ denote the sum of weights given that $i_2$ is assigned to $j_2$ and $i_1$ is assigned to $j_1$. To find $\omega_{i_2|i_1}^{(j_2|j_1)}$, we follow Algorithm 9 but account for the assignment of task $i_1$ when solving the sub-problems at each leaf $m$. That is,

    \omega_{m|i_1}^{(j_m|j_1)} = e_{i_1 m}^{(j_1 j_m)} e_m^{(j_m)}.    (8.56)

The sub-problem $\omega_{i_2}^{(j_2)}$ now becomes the sum of weights over all possible assignments of task $i_2$'s descendants, including task $i_1$'s descendants, and is given by

    \omega_{i_2}^{(j_2)} = \sum_{j_1 \in [M]} \omega_{i_2|i_1}^{(j_2|j_1)} \omega_{i_1}^{(j_1)}.    (8.57)

For a task graph consisting of serial trees rooted at $i_1, \ldots, i_n$ in order, we can solve $\omega_{i_r}^{(j_r)}$ given the previously solved $\omega_{i_r|i_{r-1}}^{(j_r|j_{r-1})}$ and $\omega_{i_{r-1}}^{(j_{r-1})}$. From (8.57), to solve $\omega_{i_2}^{(j_2)}$ we have to solve $\omega_{i_2|i_1}^{(j_2|j_1)}$ for each $j_1 \in \{1, \ldots, M\}$. Hence, it takes $O(d_{in} n_1 M^2) + O(M d_{in} n_2 M^2)$ time, where $n_1$ (resp. $n_2$) is the number of nodes in tree $i_1$ (resp. $i_2$). Hence, solving a serial-tree task graph takes $O(d_{in} N M^3)$ time.
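The combination step (8.57) is just a matrix–vector product over device choices. The sketch below is illustrative (names are ours): `omega_cond[j2][j1]` stands for $\omega_{i_2|i_1}^{(j_2|j_1)}$ and `omega_prev[j1]` for $\omega_{i_1}^{(j_1)}$.

```python
def combine_serial_trees(omega_cond, omega_prev, M):
    # (8.57): omega_{i2}[j2] = sum over j1 of
    #         omega_{i2|i1}[j2][j1] * omega_{i1}[j1].
    return [sum(omega_cond[j2][j1] * omega_prev[j1] for j1 in range(M))
            for j2 in range(M)]
```

The $M$ sums of $M$ terms each make this step $O(M^2)$, which is where the extra factor of $M$ in the $O(d_{in} N M^3)$ serial-tree bound comes from.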
Our approach generalizes to more complicated DAGs, such as those containing parallel chains of trees (parallel connections of Figure 8.2), in which we solve each chain independently and combine them at their common root $N$. Most real applications can be described by these families of DAGs, for which we have proposed polynomial-time MABSTA. For example, in [11], the three benchmarks fall into the category of parallel chains of trees. In wireless sensor networks, an application typically has a tree-structured workflow [61].

8.4.3 Marginal Probability

From (8.50), we can calculate the marginal probability $P\{x_i^t = j\}$ if we can solve the sum of weights over all possible assignments given that task $i$ is assigned to device $j$. If task $i$ is the root (node $N$), then Algorithm 9 solves $\omega_i^{(j)} = \sum_{y : y_i = j} w_y(t)$ exactly. If task $i$ is not the root, we can still run Algorithm 9 to solve $[\omega_p^{(j')}]_{y_i = j}$, which fixes the assignment of task $i$ to device $j$ when solving from $i$'s parent $p$. That is,

    [\omega_p^{(j')}]_{y_i = j} = e_p^{(j')} e_{ip}^{(j j')} \omega_i^{(j)} \prod_{m \in \mathcal{N}_p \setminus \{i\}} \sum_{y_m} e_{mp}^{(y_m j')} \omega_m^{(y_m)}.    (8.58)

Hence, in the end, we can solve $[\omega_N^{(j')}]_{y_i = j}$ from the root, and

    \sum_{y : y_i = j} w_y(t) = \sum_{j' \in [M]} [\omega_N^{(j')}]_{y_i = j}.    (8.59)

Similarly, $P\{x_m^t = j, x_n^t = k\}$ can be obtained by solving the conditional sub-problems on both tasks $m$ and $n$.

8.4.4 Sampling

Since we can calculate the marginal probabilities efficiently, we propose an efficient sampling policy, summarized in Algorithm 10. Algorithm 10 first selects a random number $s$ between 0 and 1. If $s$ is less than $\gamma$, the draw falls in the exploration phase,

Algorithm 10 Efficient Sampling Algorithm
 1: procedure Sampling($\gamma$)
 2:   $s \leftarrow$ rand()    ▷ get a random number between 0 and 1
 3:   if $s < \gamma$ then
 4:     pick an $x \in [M]^N$ uniformly
 5:   else
 6:     for $i \leftarrow 1, \ldots, N$ do
 7:       $[\omega_i^{(j)}]_{x_1^t, \ldots, x_{i-1}^t} \leftarrow$ ($N$, $M$, $G$) with tasks $1, \ldots, i-1$ fixed to $x_1^t, \ldots, x_{i-1}^t$
 8:       $P\{x_i^t = j \mid x_1^t, \ldots, x_{i-1}^t\} \propto [\omega_i^{(j)}]_{x_1^t, \ldots, x_{i-1}^t}$
 9:     end for
10:   end if
11: end procedure

in which MABSTA simply selects an arm uniformly. Otherwise, MABSTA selects an arm based on the probability distribution $p_y(t)$, which can be written as

    p_y(t) = P\{x_1^t = y_1\} P\{x_2^t = y_2 \mid x_1^t = y_1\} \cdots P\{x_N^t = y_N \mid x_1^t = y_1, \ldots, x_{N-1}^t = y_{N-1}\}.    (8.60)

Hence, MABSTA assigns each task in order based on the conditional probability given the assignment of the previous tasks. For each task $i$, the conditional probability can be calculated efficiently by running Algorithm 9 with fixed assignments on tasks $1, \ldots, i-1$.

8.5 Numerical Evaluation

In this section, we first examine how MABSTA adapts to a dynamic environment. Then, we perform trace-data emulation to verify MABSTA's performance guarantee and compare it with other algorithms.

8.5.1 MABSTA's Adaptivity

Here we examine MABSTA's adaptivity to a dynamic environment and compare it to the optimal strategy that relies on an existing profile. We use a two-device setup, where the task execution costs of the two devices are characterized by two different Markov processes. We neglect the channel communication cost, so that the optimal strategy is the myopic strategy: assigning the tasks to the device with the highest belief of being in the "good" state [62]. We run our experiment with an application that consists of 10 tasks and processes the incoming data frames one by one. The environment changes at the 100th frame, where the transition matrices of the two Markov processes are swapped with each other. From Figure 8.3, there exists an optimal assignment (dashed line) such that the performance remains as good as it was before the 100th frame. However, the myopic strategy, holding the wrong information about the transition matrices, fails to adapt to the changes. From (8.2), MABSTA not only relies on the results of previous samples but also keeps exploring uniformly (with probability $\frac{\gamma}{M^N}$ for each arm). Hence, when the performance of
Hence, when the performance of 156 0 50 100 150 200 250 300 350 400 450 500 0 2 4 6 8 10 frame number cost myopic MABSTA offline opt Figure 8.3: MABSTA has better adaptivity to the changes than a myopic algorithm Table 8.1: Parameters Used in Trace-data measurement Device ID # of iterations Device ID # of iterations 18 U(14031; 32989) 28 U(10839; 58526) 21 U(37259; 54186) 31 U(10868; 28770) 22 U(23669; 65500) 36 U(41467; 64191) 24 U(61773; 65500) 38 U(12386; 27992) 26 U(19475; 44902) 41 U(15447; 32423) one device degrades at 100 th frame, the randomness enables MABSTA to explore another device and learn the changes. 8.5.2 Trace-data Emulation 1 To obtain trace data representative of a realistic environment, we run simulations on a large-scale wireless sensor network / IoT testbed. We create a network using 10 IEEE 802.15.4-based wireless embedded devices, and conduct a set of experi- ments to measure two performance characteristics utilized by MABSTA, namely channel conditions and computational resource availability. To assess the channel conditions, the time it takes to transfer 500 bytes of data between every pair of 1 This section is a joint work with Mr. Kwame Wright, University of Southern California. 
157 0 200 400 600 800 1000 0 2000 4000 6000 device 18 latency (ms) 0 200 400 600 800 1000 0 2000 4000 6000 device 28 latency (ms) 0 200 400 600 800 1000 0 1 2 x 10 4 channel 21 −> 28 frame number latency (ms) avg = 1881, std = 472 avg = 2760, std = 1122 avg = 1798, std = 2093 Figure 8.4: Snapshots of measurement result 0 1 2 3 4 5 x 10 5 0 0.5 1 1.5 2 2.5 3 3.5 4 x 10 5 T regret bound (10,5) MABSTA (10,5) bound (10,3) MABSTA (10,3) bound (5,5) MABSTA (5,5) bound (5,3) MABSTA (5,3) Figure 8.5: MABSTA's performance with upper bounds provided by Corollary 1 158 0 1 2 3 4 5 x 10 5 0 0.5 1 1.5 2 2.5 3 x 10 5 Application, N = 5, M = 5 regret 0 1 2 3 4 5 x 10 5 0.75 0.8 0.85 0.9 0.95 1 frame number ratio to opt Randomized Exp3 MABSTA, fixed γ MABSTA, varing γ MABSTA, varying γ MABSTA, fixed γ Figure 8.6: MABSTA compared with other algorithms for 5-device network 0 1 2 3 4 5 x 10 5 0 0.5 1 1.5 2 2.5 3 x 10 5 Application, N = 5, M = 10 regret 0 1 2 3 4 5 x 10 5 0.75 0.8 0.85 0.9 0.95 1 frame number ratio to opt Randomized Exp3 MABSTA, fixed γ MABSTA, varing γ MABSTA, varying γ MABSTA, fixed γ Figure 8.7: MABSTA compared with other algorithms for 10-device network 159 motes is measured. To assess the resource availability of each device, we measure the amount of time it takes to run a simulated task for a uniformly distributed number of iterations. The parameters of the distribution are shown in Table 8.1. Since latency is positively correlated with device's energy consumption and the ra- dio transmission power is kept constant in these experiments, it can also be used as an index for energy cost. We use these samples as the reward sequences in the following emulation. We present our evaluation as the regret compared to the oine optimal solution in (8.1). For real applications the regret can be extra energy consumption over all nodes, or extra processing latency over all data frames. Figure 8.5 validates MABSTA's performance guarantee for dierent problem sizes. 
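The regret metric used in these plots can be computed from any cost trace. The sketch below is illustrative (the function name is ours, not the dissertation's code); it assumes per-frame total costs per candidate assignment and benchmarks against the best single assignment in hindsight, as in (8.1).

```python
def cumulative_regret(costs, chosen):
    """costs[t][k]: total cost of candidate assignment k at frame t.
    chosen[t]: index of the assignment the algorithm picked at frame t.
    Returns regret(t) for t = 1..T: the algorithm's cumulative cost minus
    the cumulative cost of the best single fixed assignment in hindsight."""
    T, K = len(costs), len(costs[0])
    alg_cum, arm_cum, regret = 0.0, [0.0] * K, []
    for t in range(T):
        alg_cum += costs[t][chosen[t]]
        for k in range(K):
            arm_cum[k] += costs[t][k]
        regret.append(alg_cum - min(arm_cum))
    return regret

# Toy trace: 3 frames, 3 candidate assignments; the algorithm always picks 0.
trace = [[3.0, 2.0, 4.0],
         [3.0, 2.0, 4.0],
         [1.0, 2.0, 4.0]]
print(cumulative_regret(trace, [0, 0, 0]))  # [1.0, 2.0, 1.0]
```

In the emulation, the per-frame costs would be the measured latencies (or the energy proxy described above) summed over all tasks and transfers of the chosen assignment.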
From the cases we have considered, MABSTA's regret scales as O(N^1.5 M). We further compare MABSTA with two other algorithms, as shown in Figure 8.6 and Figure 8.7. Exp3 was proposed for the adversarial MAB in [30]. The randomized baseline simply selects an arm uniformly at random for each data frame. Applying Exp3 to our task assignment problem results in a learning time that grows exponentially, as O(M^N). Hence, Exp3 is not competitive in our setting, where its regret grows nearly linearly with T, as the randomized baseline's does. In addition to the original MABSTA, we propose a more aggressive scheme by tuning the γ provided in MABSTA. That is, for each frame t, we set

γ_t = min{ 1, √[ M(N + |E|M) ln M^N / ((e − 1)(N + |E|) t) ] }.   (8.62)

From (8.2), the larger γ is, the more chance that MABSTA will explore. Hence, by exploring more aggressively at the beginning and exploiting the best arm as γ_t decreases with t, MABSTA with varying γ learns the environment even faster and remains competitive with the offline optimal solution, with the ratio reaching 0.9 at an early stage. That is, after the first 5000 frames, MABSTA already achieves at least 90% of the optimal performance. In sum, these empirical trace-based evaluations show that MABSTA scales well and outperforms the state of the art in adversarial online learning algorithms (Exp3). Moreover, it typically does significantly better in practice than its theoretical performance guarantee.

8.6 Discussion

With an increasing number of devices capable of computing and communicating, the concept of collaborative computing enables complex applications that a single device cannot support individually. However, the intermittent and heterogeneous connections and diverse device behavior make the performance highly variant with time. In this study, we have proposed a new online learning formulation that does
We have presented MABSTA and proved that it can be implemented eciently and provides performance guarantee for all dynamic environments. The trace-data emulation has shown that MABSTA is competitive to the optimal oine strategy and is adaptive to changes of the environment. A more general category than multi-armed bandit problems, called reinforce- ment learning, is to learn how to map the situations (environments) to actions [63]. An interesting and essential research problem is to speed up the learning process, in which not only can we learn from the actions we take but also learn from the actions not taken [64, 65, 66]. 162 Chapter 9 Conclusion As more and more intelligent devices are being connected in the era of the Internet of Things (IoT), there is an abundant amount of computing resources distributed over the network. Hence, it is promising to perform collaborative computing over multiple devices to jointly support a complex application that a single device can- not support individually. On one hand, multiple devices can run tasks in parallel to speed up the process. On the other hand, spreading the tasks over multiple de- vices from a cost-balancing perspective extends the battery lifetime on each device. However, collaborative computing over the network has communication overhead. The extra data transmission incurs both energy consumption and latency. Because of energy, computation, and bandwidth constraints on smart things and other edge devices, we have to eciently leverage the resource to optimize system performance, considering devices' availabilities, and the costs as well as latencies associated with computation and communication over the network. 163 In this thesis, we have proposed a step by step approach to optimize task assign- ment over multiple devices in realistic scenarios so as to make most ecient usage of available resource, while satisfying QoS constraints like energy consumption and application latency. 
We have partitioned an application into multiple tasks with their dependencies described by a directed acyclic graph (DAG), and studied the best task assignment strategy in different environments. We started by assuming that the amount and the cost of resources, like CPU cycles and channel bandwidth, are known and deterministic. We have solved for an optimal task assignment that minimizes the application latency subject to a single cost constraint. Then, considering that each device may have its own cost budget, we have formulated an optimization problem subject to individual constraints on each device. Taking a step further, in the scenario where the amount and the cost of resources vary with time, we have modeled them as stochastic processes with known statistics and solved a stochastic optimization problem. Finally, considering that the resource states may be unknown and highly variant at application run time, we have proposed online learning algorithms that learn the unknown statistics and make competitive task assignments that adapt to the changes in dynamic environments. Specifically, we have proposed the following formulations:

- Deterministic Optimization with a Single Constraint
- Deterministic Optimization with Multiple Constraints
- Stochastic Optimization with a Single Constraint
- Online Learning in Stationary Environments
- Online Learning in Non-stationary Environments

We have focused on designing computation-light algorithms to make decisions on task assignment so that they do not incur considerable CPU overhead. For the optimization formulations, we have shown that these problems are NP-hard, and proposed polynomial-time approximation algorithms with provable performance guarantees (approximation ratios). For the online learning formulations, we have proposed polynomial-time algorithms that make competitive task assignments compared with the optimal strategy. We have performed comprehensive simulations, including trace-data simulations, to validate our analysis of the algorithms' performance and complexity.

We envision that a future cyber foraging system will be able to take requests, explore the environment, assign tasks to heterogeneous computing devices, and satisfy the specified QoS requirements. Furthermore, the concept of macro-programming enables application developers to code by describing high-level, abstracted functionality that is independent of platforms. An interpreter plays an important role here, translating the high-level functional description into machine code for different devices. In particular, one crucial component closely related to system performance is how to partition an application into tasks and assign them to suitable devices with awareness of resource availability at run time. As we have seen several potential system prototypes that demonstrate the benefit of collaborative computing, we are positive that these innovative algorithms can be incorporated into real systems to optimize the performance of collaborative computing.

Reference List

[1] O. Vermesan and P. Friess, Internet of Things: Converging Technologies for Smart Environments and Integrated Ecosystems. River Publishers, 2013.

[2] "50 sensor applications for a smarter world." http://www.libelium.com/top_50_iot_sensor_applications_ranking/. Accessed: 2015-12-08.

[3] M. Satyanarayanan, "Pervasive computing: Vision and challenges," Personal Communications, IEEE, vol. 8, no. 4, pp. 10–17, 2001.

[4] L. Atzori, A. Iera, and G. Morabito, "The internet of things: A survey," Computer Networks, vol. 54, no. 15, pp. 2787–2805, 2010.

[5] M. A. M. Vieira, C. N. Coelho Jr., D. da Silva, and J. M. da Mata, "Survey on wireless sensor network devices," in Emerging Technologies and Factory Automation, 2003. Proceedings. ETFA'03. IEEE Conference, vol. 1, pp. 537–544, IEEE, 2003.

[6] R. Balan, J. Flinn, M.
Satyanarayanan, S. Sinnamohideen, and H.-I. Yang, "The case for cyber foraging," in Proceedings of the 10th Workshop on ACM SIGOPS European Workshop, pp. 87–92, ACM, 2002.

[7] L. Mottola and G. P. Picco, "Programming wireless sensor networks: Fundamental concepts and state of the art," ACM Computing Surveys (CSUR), vol. 43, no. 3, p. 19, 2011.

[8] C. H. Papadimitriou and K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity. Courier Corporation, 1998.

[9] G. Ausiello, Complexity and Approximation: Combinatorial Optimization Problems and Their Approximability Properties. Springer, 1999.

[10] E. Cuervo, A. Balasubramanian, D.-k. Cho, A. Wolman, S. Saroiu, R. Chandra, and P. Bahl, "MAUI: making smartphones last longer with code offload," in ACM MobiSys, pp. 49–62, ACM, 2010.

[11] M.-R. Ra, A. Sheth, L. Mummert, P. Pillai, D. Wetherall, and R. Govindan, "Odessa: enabling interactive perception applications on mobile devices," in ACM MobiSys, pp. 43–56, ACM, 2011.

[12] J. Gittins, K. Glazebrook, and R. Weber, Multi-armed Bandit Allocation Indices. John Wiley & Sons, 2011.

[13] S. Bubeck and N. Cesa-Bianchi, "Regret analysis of stochastic and nonstochastic multi-armed bandit problems," arXiv preprint arXiv:1204.5721, 2012.

[14] "Tutornet: A low power wireless IoT testbed." http://anrg.usc.edu/www/tutornet/. Accessed: 2015-12-08.

[15] R. M. Karp, Reducibility Among Combinatorial Problems. Springer, 1972.

[16] L. Lovász, "On the ratio of optimal integral and fractional covers," Discrete Mathematics, vol. 13, no. 4, pp. 383–390, 1975.

[17] D. P. Williamson and D. B. Shmoys, The Design of Approximation Algorithms. Cambridge University Press, 2011.

[18] J. Canny, "Some algebraic and geometric computations in PSPACE," in Proceedings of the Twentieth Annual ACM Symposium on Theory of Computing, pp. 460–467, ACM, 1988.

[19] S. Arora and B. Barak, Computational Complexity: A Modern Approach. Cambridge University Press, 2009.

[20] G. L. Nemhauser and L.
A. Wolsey, Integer and Combinatorial Optimization, vol. 18. Wiley, New York, 1988.

[21] G. Nemhauser and L. Wolsey, "Polynomial-time algorithms for linear programming," Integer and Combinatorial Optimization, pp. 146–181.

[22] E. V. Denardo, Dynamic Programming: Models and Applications. Courier Corporation, 2012.

[23] K. Dudziński and S. Walukiewicz, "Exact methods for the knapsack problem and its generalizations," European Journal of Operational Research, vol. 28, no. 1, pp. 3–21, 1987.

[24] O. H. Ibarra and C. E. Kim, "Fast approximation algorithms for the knapsack and sum of subset problems," Journal of the ACM (JACM), vol. 22, no. 4, pp. 463–468, 1975.

[25] G. J. Woeginger, "When does a dynamic programming formulation guarantee the existence of a fully polynomial time approximation scheme (FPTAS)?," INFORMS Journal on Computing, vol. 12, no. 1, pp. 57–74, 2000.

[26] H. Robbins, "Some aspects of the sequential design of experiments," in Herbert Robbins Selected Papers, pp. 169–177, Springer, 1985.

[27] J. Vermorel and M. Mohri, "Multi-armed bandit algorithms and empirical evaluation," in Machine Learning: ECML 2005, pp. 437–448, Springer, 2005.

[28] P. Auer, N. Cesa-Bianchi, and P. Fischer, "Finite-time analysis of the multi-armed bandit problem," Machine Learning, vol. 47, no. 2-3, pp. 235–256, 2002.

[29] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, "Gambling in a rigged casino: The adversarial multi-armed bandit problem," in Foundations of Computer Science, 1995. Proceedings., 36th Annual Symposium on, pp. 322–331, IEEE, 1995.

[30] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, "The nonstochastic multiarmed bandit problem," SIAM Journal on Computing, vol. 32, no. 1, pp. 48–77, 2002.

[31] K. Kumar, J. Liu, Y.-H. Lu, and B. Bhargava, "A survey of computation offloading for mobile systems," Mobile Networks and Applications, vol. 18, no. 1, pp. 129–140, 2013.

[32] C. Wang and Z.
Li, "Parametric analysis for adaptive computation offloading," ACM SIGPLAN, vol. 39, no. 6, pp. 119–130, 2004.

[33] Y.-H. Kao, B. Krishnamachari, M.-R. Ra, and F. Bai, "Hermes: Latency optimal task assignment for resource-constrained mobile computing," in IEEE INFOCOM, pp. 1894–1902, IEEE, 2015.

[34] Y.-H. Kao, R. Kannan, and B. Krishnamachari, "Flexible in-network data processing on IoT devices," submitted to IEEE SECON 2016.

[35] Y.-H. Kao and B. Krishnamachari, "Optimizing mobile computational offloading with delay constraints," in IEEE GLOBECOM, IEEE, 2014.

[36] Y.-H. Kao, B. Krishnamachari, M.-R. Ra, and F. Bai, "Hermes: Latency optimal task assignment for resource-constrained mobile computing," submitted to IEEE Transactions on Mobile Computing.

[37] Y.-H. Kao, B. Krishnamachari, F. Bai, and K. Wright, "Online learning for wireless distributed computing," in preparation.

[38] B.-G. Chun, S. Ihm, P. Maniatis, M. Naik, and A. Patti, "CloneCloud: elastic execution between mobile device and cloud," in ACM Computer Systems, pp. 301–314, ACM, 2011.

[39] C. Shi, K. Habak, P. Pandurangan, M. Ammar, M. Naik, and E. Zegura, "COSMOS: computation offloading as a service for mobile devices," in ACM MobiHoc, pp. 287–296, ACM, 2014.

[40] O. Goldschmidt and D. S. Hochbaum, "A polynomial algorithm for the k-cut problem for fixed k," Mathematics of Operations Research, vol. 19, no. 1, pp. 24–37, 1994.

[41] A. Gerasoulis and T. Yang, "On the granularity and clustering of directed acyclic task graphs," Parallel and Distributed Systems, IEEE Transactions on, vol. 4, no. 6, pp. 686–701, 1993.

[42] W. Dai, Y. Gai, and B. Krishnamachari, "Online learning for multi-channel opportunistic access over unknown markovian channels," in IEEE SECON, pp. 64–71, IEEE, 2014.

[43] K. Liu and Q. Zhao, "Indexability of restless bandit problems and optimality of whittle index for dynamic multichannel access," IEEE Transactions on Information Theory, vol. 56, no. 11, pp. 5547–5567, 2010.

[44] J.
D. Cohen, S. M. McClure, and J. Y. Angela, "Should I stay or should I go? How the human brain manages the trade-off between exploitation and exploration," Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 362, no. 1481, pp. 933–942, 2007.

[45] S. Vakili, K. Liu, and Q. Zhao, "Deterministic sequencing of exploration and exploitation for multi-armed bandit problems," Selected Topics in Signal Processing, IEEE Journal of, vol. 7, no. 5, pp. 759–767, 2013.

[46] R. Ortner, D. Ryabko, P. Auer, and R. Munos, "Regret bounds for restless markov bandits," in Algorithmic Learning Theory, pp. 214–228, Springer, 2012.

[47] C. Shi, V. Lakafosis, M. H. Ammar, and E. W. Zegura, "Serendipity: enabling remote computing among intermittently connected mobile devices," in ACM MobiHoc, pp. 145–154, ACM, 2012.

[48] M. Y. Arslan, I. Singh, S. Singh, H. V. Madhyastha, K. Sundaresan, and S. V. Krishnamurthy, "CWC: A distributed computing infrastructure using smartphones," Mobile Computing, IEEE Transactions on, 2014.

[49] M. R. Rahimi, N. Venkatasubramanian, S. Mehrotra, and A. V. Vasilakos, "MAPCloud: mobile applications on an elastic and scalable 2-tier cloud architecture," in IEEE/ACM UCC, pp. 83–90, IEEE, 2012.

[50] M. Satyanarayanan, "Cloudlets: at the leading edge of cloud-mobile convergence," in ACM SIGSOFT, pp. 1–2, ACM, 2013.

[51] L. Fleischer, M. X. Goemans, V. S. Mirrokni, and M. Sviridenko, "Tight approximation algorithms for maximum general assignment problems," in ACM-SIAM Symposium on Discrete Algorithms, pp. 611–620, Society for Industrial and Applied Mathematics, 2006.

[52] T. Korkmaz and M. Krunz, "Multi-constrained optimal path selection," in IEEE INFOCOM, vol. 2, pp. 834–843, IEEE, 2001.

[53] F. Kuipers, P. Van Mieghem, T. Korkmaz, and M. Krunz, "An overview of constraint-based path selection algorithms for QoS routing," IEEE Communications Magazine, 40(12), 2002.

[54] G. Xue, W. Zhang, J. Tang, and K.
Thulasiraman, "Polynomial time approximation algorithms for multi-constrained QoS routing," IEEE/ACM Transactions on Networking (ToN), vol. 16, no. 3, pp. 656–669, 2008.

[55] G. Xue, A. Sen, W. Zhang, J. Tang, and K. Thulasiraman, "Finding a path subject to many additive QoS constraints," IEEE/ACM Transactions on Networking (TON), vol. 15, no. 1, pp. 201–211, 2007.

[56] V. V. Vazirani, Approximation Algorithms. New York, NY, USA: Springer-Verlag New York, Inc., 2001.

[57] J. B. Orlin, "A faster strongly polynomial minimum cost flow algorithm," Operations Research, vol. 41, no. 2, pp. 338–350, 1993.

[58] X. Chen, S. Hasan, T. Bose, and J. H. Reed, "Cross-layer resource allocation for wireless distributed computing networks," in RWS, IEEE, pp. 605–608, IEEE, 2010.

[59] H. N. Gabow, "An efficient implementation of Edmonds' algorithm for maximum matching on graphs," Journal of the ACM (JACM), vol. 23, no. 2, pp. 221–234, 1976.

[60] Y. Han, "Tight bound for matching," Journal of Combinatorial Optimization, vol. 23, no. 3, pp. 322–330, 2012.

[61] H. Viswanathan, E. K. Lee, and D. Pompili, "Enabling real-time in-situ processing of ubiquitous mobile-application workflows," in IEEE MASS, pp. 324–332, IEEE, 2013.

[62] Y. M. Dirickx and L. P. Jennergren, "On the optimality of myopic policies in sequential decision problems," Management Science, vol. 21, no. 5, pp. 550–556, 1975.

[63] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.

[64] K. Tumer and N. Khani, "Learning from actions not taken in multiagent systems," Advances in Complex Systems, vol. 12, no. 04n05, pp. 455–473, 2009.

[65] N. Khani and K. Tumer, "Learning from actions not taken: a multiagent learning algorithm," in Proceedings of The 8th International Conference on Autonomous Agents and Multiagent Systems - Volume 2, pp. 1277–1278, International Foundation for Autonomous Agents and Multiagent Systems, 2009.

[66] N. Khani and K.
Tumer, "Fast multiagent learning: Cashing in on team knowledge," Intel. Engr. Systems Through Artificial Neural Nets, vol. 18, pp. 3–11, 2008.