OPTIMIZING DISTRIBUTED STORAGE IN CLOUD ENVIRONMENTS

Copyright 2013 by Maheswaran Sathiamoorthy

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

December 2013

Dedication

To the giants whose shoulders I stand on.

Acknowledgments

The quote 'There is no such thing as a self-made man' springs to my mind. This work would not have been possible without the help of a number of people. The opportunities presented to me and the help extended to me by so many people have been invaluable, and I would like to take this opportunity to thank them. First, I thank my advisors Dr. Bhaskar Krishnamachari and Dr. Alexandros Dimakis, who have been more than mentors to me. I have learned a lot from them, and they have been influential well beyond my professional life. I am forever indebted to them. This dissertation would not have been possible without their vision, guidance, and help. I would like to thank all the faculty members at USC with whom I have had the opportunity to take courses. I especially thank Dr. Michael Neely for the queueing theory class I took. His clarity of presentation and carefully designed problem sets are unparalleled. I extend my thanks to my dissertation and qualifying examination committee members, who include my advisors and Dr. Michael Neely, Dr. Shahram Ghandeharizadeh, Dr. Minlan Yu, Dr. Rahul Jain and Dr. Fan Bai. I would like to thank my labmates Sundeep Pattem, Ying Chen, Scott Moeller, Pai Han Huang, Yi Gai, Yi Wang, Suvil Deora, Nachikethas, Joon Ahn, Parisa, Keyvan and Amitabha Ghosh, in no particular order. Life outside the lab and my stay in Los Angeles was kept fun by many friends, especially Prithviraj, Pramod, Kartik, Harsh and Srividya.
My time here at USC went smoothly, thanks to the Annenberg Fellowship and the resourceful people who took care of the administrative burdens: Margery Berti, Tracy Charles, Diane Demetras, Shane Goodoff, and Danielle Hamra. And finally I would like to thank my two elder sisters and my parents for their constant support throughout this journey.

Table of Contents

Dedication
Acknowledgments
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Content Access from Vehicles
    1.1.1 Bottleneck: Content Access using Cellular Networks
    1.1.2 Proposal: Content Access From Vehicular Cloud
    1.1.3 First Challenge: High Latencies due to Node Mobility
    1.1.4 Solution: Optimizing Latency using Distributed Storage Codes
    1.1.5 Second Challenge: Helper Node Allocation
    1.1.6 Solution: Optimizing Helper Node Allocation Using Dissemination Utility as Metric
  1.2 Data Center Networks
    1.2.1 Bottleneck: Storage
    1.2.2 Proposal: Use Erasure Codes
    1.2.3 First Challenge: Repair Problem
    1.2.4 Solution: Optimizing Repair Traffic Using Locally Repairable Codes
    1.2.5 Second Challenge: Placement Problem
    1.2.6 Solution: Optimizing Placement Using MTTDL as Metric
  1.3 Contributions
  1.4 Thesis Organization

Chapter 2: Background and Related Work
  2.1 Background
    2.1.1 Erasure Coding
    2.1.2 Vehicular Networks
    2.1.3 Data Centers
    2.1.4 Brief Introduction to MTTDL
  2.2 Related Work
    2.2.1 Vehicular Cloud
      2.2.1.1 Dissemination Schemes
      2.2.1.2 Coding for Vehicular Clouds
      2.2.1.3 Content Dissemination using Helper Nodes
    2.2.2 Data Centers
      2.2.2.1 Erasure Coding for Data Center Cloud
      2.2.2.2 Data Placement Schemes for Storage

I Vehicular Cloud

Chapter 3: Optimizing Content Access Latency Using Distributed Storage Codes
  3.1 Introduction
  3.2 Model and Problem Setup
  3.3 Theoretical Analysis
    3.3.1 Uncoded File Storage
    3.3.2 Coded File Storage
    3.3.3 The Benefits of Coding: Summary
  3.4 Trace Based Experiments
    3.4.1 Dataset Description
    3.4.2 Performance Metrics
    3.4.3 Experiment Methodology
    3.4.4 Realistic Radio Link Model
    3.4.5 Choice of the coding parameter k
    3.4.6 Discussion of the Results
      3.4.6.1 Effect of file size
      3.4.6.2 Effect of the number of files and the capacity
    3.4.7 Effect of File Distribution
    3.4.8 Absolute File Download Latency
  3.5 Chapter Summary

Chapter 4: Optimizing Helper Node Allocation Using Dissemination Utility as Metric
  4.1 Introduction
  4.2 Problem Formulation
    4.2.1 Notation
    4.2.2 Contact and Dissemination Model
  4.3 Understanding Dissemination of a Single File
    4.3.1 Modeling the Markov Chain
      4.3.1.1 States of the Markov Chain
      4.3.1.2 Transition Probabilities
      4.3.1.3 Transition matrix A
      4.3.1.4 Occupancy of the Markov Chain
    4.3.2 A Few Definitions
    4.3.3 M1: Understanding the Expected Number of Satisfied Demands
    4.3.4 M2: Understanding the Expected Completion Time
  4.4 Understanding Dissemination of Multiple Files
    4.4.1 M1: Compute Expected Number of Satisfied Demands
    4.4.2 M2: Compute Completion Time
    4.4.3 Understanding M1 and M2
  4.5 Social Allocation
  4.6 Market Allocation
    4.6.1 Game Formulation
      4.6.1.1 Players
      4.6.1.2 Actions
      4.6.1.3 Allocation Policy
      4.6.1.4 Payoffs
    4.6.2 Existence of Nash Equilibria
    4.6.3 Three Player Game Example
  4.7 Chapter Summary
  4.8 Appendix: Existence of Nash Equilibrium

II Data Center Cloud

Chapter 5: Optimizing Repair Traffic Using Locally Repairable Codes
  5.1 Motivation
    5.1.1 Importance of Repair
  5.2 Locally Repairable Codes
  5.3 Xorbas: System Description
    5.3.1 HDFS-Xorbas
      5.3.1.1 Encoding
      5.3.1.2 Decoding & Repair
  5.4 Evaluation
    5.4.1 Evaluation Metrics
    5.4.2 Amazon EC2
      5.4.2.1 HDFS Bytes Read
      5.4.2.2 Network Traffic
      5.4.2.3 Repair Time
      5.4.2.4 Repair under Workload
    5.4.3 Facebook's cluster
  5.5 Chapter Summary

Chapter 6: Optimizing Placement by Maximizing Mean Time To Data Loss (MTTDL)
  6.1 Motivation
  6.2 Models and Assumptions
    6.2.1 Data Center Model
    6.2.2 Storage Model
    6.2.3 Failure Model
    6.2.4 Repair Model
    6.2.5 Block Storage Simulator
  6.3 Load Balancing
    6.3.1 Motivation
    6.3.2 Proposed Solution
  6.4 Evaluation
    6.4.1 Effect of Node Failure Rate
    6.4.2 Effect of Rack Failures
    6.4.3 Effect of the load exponent (β)
    6.4.4 Effect of number of blocks stored
  6.5 Chapter Summary

III Epilogue

Chapter 7: Conclusions and Future Work
  7.1 Summary
    7.1.1 Vehicular Cloud
    7.1.2 Data Center Cloud
  7.2 Future Directions
  7.3 Vehicular Cloud
  7.4 Data Center Cloud

References

List of Figures

1.1 The explosive growth of video consumption on mobile devices [6].
1.2 Number of failed nodes over a single month period in a 3000-node production cluster of Facebook.
2.1 The Markov chain used for RAID4/RAID5 reliability analysis. Here N is the number of disks.
3.1 A vehicular network with N nodes and redundancy α represented in the balls and bins framework. Each node is represented as a square, with the shaded squares containing copies of the file the sink is interested in.
3.2 The balls and bins model for the coded case for integer β ≥ 1.
3.3 The balls and bins setup when β ≤ 1.
3.4 Maps of the routes traced by a few randomly selected nodes in the Beijing and the Chicago datasets. We limited the number of nodes so as to not clutter the image. Colors are chosen randomly for each node by the tool we used to plot the routes.
3.5 Density of moving taxis vs. time.
3.6 Evaluating the performance of distributed storage codes in the default setting consisting of 2,500 files each of size 1GB stored in nodes each having 100GB storage for both the Beijing and Chicago datasets. There are 1,000 nodes in total in the Beijing dataset and 1,608 nodes in the Chicago dataset.
3.7 Plots showing how various parameters affect the full-recovery probability. In each of the cases, one parameter is varied while keeping the others constant. Typical values used are a storage capacity of 100GB, 2,500 files and file size 1GB.
3.8 Plots showing the impact of different parameters on the average file download percentage. The parameters are the same as in Fig. 3.7.
3.9 Effect of File Distribution.
4.1 The expected number of demands satisfied in percentage as a function of the number of encounters.
4.2 We consider N = 50 nodes that could have multiple files, and show various statistics for file 1.
4.3 Finding the globally optimal (social welfare maximizing) helper node allocation.
4.4 Three player example (with one agent and two content providers). Here c0 = 1, c2 = 50, nd(1) = 4, nd(2) = 8.
4.5 The bids for different values of c0.
4.6 Payoff of the Central Agent for various c0 for M1 with deadlines 250 and 400, and M2. The Central Agent will fix c0 to maximize its payoff, and it depends on the scheme.
4.7 A set of cases for model M1 (deadline T = 250 encounters) when quasiconcavity holds for both P1 and P2.
4.8 A set of cases for model M2 when quasiconcavity holds for both P1 and P2.
4.9 Demonstrating a case where quasiconcavity does not hold. Here c0 = 1, dw = 10.
5.1 Locally repairable code implemented in HDFS-Xorbas. The four parity blocks P1, P2, P3, P4 are constructed with a standard RS code and the local parities provide efficient repair in the case of single block failures. The main theoretical challenge is to choose the coefficients ci to maximize the fault tolerance of the code.
5.2 The metrics measured during the 200 file experiment. Network-in is similar to Network-out and so it is not displayed here. During the course of the experiment, we simulated eight failure events; the x-axis gives details of the number of DataNodes terminated during each failure event, and the number of blocks lost is displayed in parentheses.
5.3 Measurements in time from the two EC2 clusters during the sequence of failing events.
5.4 Measurement points of failure events versus the total number of blocks lost in the corresponding events. Measurements are from all three experiments.
5.5 Completion times of 10 WordCount jobs: encountering no block missing, and ~20% of blocks missing on the two clusters. Dotted lines depict average job completion times.
6.1 The datacenter topology used for the simulations.
6.2 The datacenter topology used for the simulations.
6.3 100 nodes start with 10 blocks each, followed by a number of failure-repair cycles. The nodes are sorted by the descending order of the loads at the end of the failure-repair cycles. More than 500 failures follow a trend similar to 500 failures.
6.4 Effect of node failure rate for various placement schemes.
6.5 Effect of rack failure rate for various placement schemes.
6.6 The improvement in MTTDL in percentage as the load exponent parameter is varied. In this case, we used a node failure rate of 1E-4.
6.7 Effect of number of blocks stored in the system for various placement schemes. Here an (8, 5) code is used. The performance of {3, 2, 1, 1, 1} is similar to that of {3, 1, 1, 1, 1, 1}. The node failure rate is 5E-5 and the rack failure rate is 0.

Abstract

Cloud storage, in the context of this research, is defined to be the abstraction of storage spanning multiple machines into a single storage pool that end-users can access without knowing the internal details of where or how the storage is maintained. Traditionally, cloud storage is used to refer to the storage pool in data centers. In our work, in addition to data center based cloud storage, we also consider a vehicular network based cloud storage: storage obtained by pooling together the storage on vehicles, typically connected by a vehicular network. In this thesis, we optimize the distributed storage in these two cloud environments. Specifically, we identify two challenges each in the two cloud environments and propose solutions to these challenges. In Chapter 3, we consider the first important challenge in the vehicular cloud, namely the high latencies of on-demand content access. We investigate the benefits of using erasure codes in reducing the content access latencies through both analysis and realistic trace-based simulations.
We show that a key parameter affecting the file download latency is the ratio of file size to download bandwidth. When this ratio is small, so that a file can be communicated in a single encounter, we find that coding techniques offer very little benefit over simple file replication. However, we analytically show that for large ratios, for a memoryless contact model, distributed erasure coding yields a latency benefit of N/α over uncoded replication, where N is the number of vehicles and α the redundancy factor. Effectively, in this regime, coding yields the same performance as replicating all the files at all other vehicles, but using much less storage. We also evaluate the benefits of coded storage using large real vehicle traces of taxis in Beijing and buses in Chicago. These simulations, which include a realistic radio link quality model for an IEEE 802.11p dedicated short range communication (DSRC) radio, validate the observations from the analysis, demonstrating that coded storage dramatically speeds up the download of large files in vehicular networks.

In Chapter 4, we consider the second challenge, namely the problem of helper node allocation. In order to relay a file from a node that has the file to another that wants the file, it may be necessary to enlist the help of other relaying nodes. When there are multiple types of files, an existing pool of helper nodes cannot help the dissemination of all the files due to storage and bandwidth constraints. In this chapter, we formulate and mathematically address this fundamental problem of resource allocation in the form of helper nodes in disseminating multiple contents. We consider a stochastic homogeneous contact process for the nodes in the vehicular network, or more generally an intermittently connected mobile network.
We consider and solve two variations of the problem: one in which the goal is to maximize the expected number of demands satisfied, and another in which the goal is to minimize the time taken to disseminate the files. Besides the global optimization perspective, we also examine the problem from a game-theoretic perspective in which a central agent auctions the storage to competing content providers, and show how self-interested decisions impact the social welfare.

In the second half of the thesis, the data center cloud is considered. In Chapter 5, we investigate how to optimize the repair traffic in data centers while keeping the storage overhead as low as possible. Node failures are frequent in data centers, and repairing failed nodes consumes network traffic (called repair traffic). Replication has the lowest possible repair traffic; however, it has a large storage overhead. The storage overhead can be reduced by using Reed-Solomon codes, but they generate significantly more repair traffic than replication. We implement in Hadoop HDFS a new class of erasure codes called Locally Repairable Codes (LRCs) that can reduce the repair traffic by approximately 2x as compared to Reed-Solomon codes, while only requiring 14% more storage (for our particular implementation). The last challenge we consider is the problem of placement of blocks in a data center. When placing replicas or erasure-encoded blocks in a data center, the common approach is to place them on separate racks. While this can help reduce the probability of permanent data loss, it creates cross-rack traffic when repairing failed nodes. This can slow down repair, thereby affecting reliability. In Chapter 6, we identify a tradeoff between fault tolerance and repair speed when placing data in a data center, and capture this tradeoff in a single metric called Mean Time To Data Loss (MTTDL). We use this metric to determine how to store blocks to maximize reliability.
Even though vastly different, the two cloud environments offer some similarities in the challenges, or in the approaches we can take to handle these challenges. For example, we advocate the use of erasure codes in both these storage environments for storing certain types of data.

A different perspective from which to understand the work presented in this thesis is to consider the solutions to the four challenges presented here as the answers to the questions of how to store data and where to store data in each of these two distributed environments. While Chapters 3 and 4 cater to the how and where questions for the vehicular cloud, Chapters 5 and 6 do the same for the data center cloud.

Chapter 1

Introduction

In 1967, the computer architect Gene Amdahl devised a method to measure the maximum possible speedup of a program when parallelized over multiple processors [13]. Now known as Amdahl's law [2], it can be used to understand the maximum possible overall speedup of a system when parts of it are improved. More importantly, a consequence of the law is that the speedup is limited by the improvement of the slowest part. Patterson et al. [76] in 1988 used Amdahl's law to point out that the rapid processor and memory speed improvements were squandered by the relatively slower disk input/output (I/O) improvements, and therefore proposed RAID (Redundant Array of Inexpensive Disks) to alleviate the effect of the disk I/O bottleneck. Motivated by this methodology, we use the following strategy for two scenarios: we identify the bottleneck(s), formulate a proposal to alleviate the bottleneck, identify a few challenges that are part of the proposal, and formulate solutions to solve or mitigate these specific challenges. We consider content access from vehicles and data center networks in this thesis as the two scenarios. In the context of RAID, the strategy is as follows. The bottleneck is disk I/O and the proposal is to use an array of inexpensive disks.
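Amdahl's law can be made concrete with a short calculation. The sketch below (an illustrative example, not taken from the dissertation) computes the overall speedup when a fraction p of the running time is sped up by a factor s, and shows how the untouched part caps the benefit:

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the original running time
    is sped up by a factor s (Amdahl's law): 1 / ((1 - p) + p / s)."""
    return 1.0 / ((1.0 - p) + p / s)

# Doubling the speed of 90% of the work far from doubles the system:
print(amdahl_speedup(0.9, 2))      # about 1.82

# Even an enormous speedup of that 90% is capped by the untouched 10%:
# as s grows, the overall speedup approaches 1 / (1 - 0.9) = 10.
print(amdahl_speedup(0.9, 1e9))    # just under 10
```

This is exactly the observation behind RAID: improving processors and memory while leaving disk I/O slow leaves the whole system limited by the disk term.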
A challenge arising out of this proposal is lower reliability because of disk failures. Therefore, a solution to improve the reliability is to use some form of redundancy (replication in RAID-1, XOR parity in RAID-5 or, more generally, erasure codes in RAID-6). We argue that content access from vehicles does not scale when using cellular networks. Therefore, we propose the use of vehicular networks as a backbone for content storage and define a new distributed storage architecture called the vehicular cloud. We identify two challenges in such a system: the latencies of content access can be very high, and the allocation of resources for helping dissemination can be non-trivial. We formulate these challenges and outline the solutions in Chapters 3 and 4 respectively. In the case of data centers, where the primary goal is to store as much data as possible, we identify that the bottleneck is in the storage to be used. When reliability has to be maintained in the face of failures, the total storage is generally three or more times the actual storage required. To circumvent this problem we propose the use of erasure codes. Now, two challenges arising out of the use of erasure codes are that the failure-repair cycles create a lot of network traffic, and that the placement of erasure-encoded blocks of data within a data center affects reliability. Again, we formulate these challenges and outline the solutions in Chapters 5 and 6 respectively.

1.1 Content Access from Vehicles

As smartphones and touch-based tablets become increasingly popular, similar devices are finding their way into cars and other vehicles. Dubbed in-vehicle infotainment systems [4], they allow drivers and passengers to navigate, and to access music, videos, movies and other relevant content from within the vehicle.
While some of these may already be stored in the vehicle, others may need to be downloaded or refreshed periodically, as in the following examples: (1) taxis may want to download location-based ads that are meaningful only around that location, (2) new episodes can be downloaded only after they are aired and may become stale, and therefore replaceable, after a certain period, (3) traffic updates, GPS map updates, software updates, etc. need to be downloaded as and when available. One way to download content is to use the cellular network, but we will next argue that this will be expensive and difficult to scale.

1.1.1 Bottleneck: Content Access using Cellular Networks

Mobile data traffic has been growing and is expected to grow exponentially. According to the Cisco Visual Networking Index [6], global mobile data traffic grew 2.3x in 2011, continuing the trend of doubling every year for the fourth year in a row. This trend is expected to continue, at least for the next few years. See Figure 1.1. Owing to this exponential growth in demand, cellular data bandwidth is likely to remain not only limited but also expensive [33, 77]. Therefore, it is critical to find alternatives to cellular networks for content access from vehicles.

1.1.2 Proposal: Content Access From Vehicular Cloud

A new class of networks has emerged in the past few years: networks where vehicles are connected to one another and to roadside infrastructure units. These are called vehicular networks, or sometimes just connected vehicles. One of the important use cases of vehicular networks is in fostering better safety. Some examples of applications
geared towards safety are: avoiding rear-end collisions, extended braking [14, 52], and detecting and disseminating information about potholes, bumps and other anomalous road conditions [31].

Figure 1.1: The explosive growth of video consumption on mobile devices [6]. (Chart: exabytes per month, 2011-2016, by traffic category; mobile video is projected at 70.5% of traffic share in 2016. Source: Cisco VNI Mobile, 2012; 78% CAGR 2011-2016.)

Road-side infrastructure units (also called Access Points or APs) could cache content so that vehicles passing by can download content from them instead of using cellular networks. However, there can be a few disadvantages with road-side units which prevent them from becoming a good alternative to cellular networks. First, they may be hard to deploy in high densities, owing to the cost and other overhead associated with maintaining them (such as electricity, security, etc.). Second, since these are not mobile, a vehicle quickly passing by an AP might not have sufficient time to download its desired data. In contrast, the WAVE/WAVE BSS modes of IEEE 802.11p allow for rapid vehicle-to-vehicle file transfers [52] over potentially longer contact durations (e.g., if the vehicles are traveling in the same direction). Therefore, in this work, we investigate the possibility of exploiting inter-vehicular communication to enable P2P file sharing. Peer-to-peer file sharing, by its very nature, does not guarantee real-time content access. But we first note that the content requested from vehicles can be of two types: real-time content such as traffic updates, and delay-tolerant content such as multimedia. A majority of real-time content will usually be small in size and can be served easily through cellular networks. In contrast, delay-tolerant content will not be a good fit for download through cellular networks due to its large size.
In fact, since they are delay-tolerant, they are better suited to be served through peer-to-peer file sharing. We propose that a limited repository of content be stored on the vehicles themselves. The repository is large enough that each vehicle cannot store all the contents locally, but can only store a fraction of the repository. If a user in a vehicle requests a content not in the vehicle, it may be downloaded from another passing vehicle if that vehicle has the content. Henceforth, we will refer to requests as being made by vehicles rather than the users in the vehicles, for simplicity.

Proposed Architecture for Vehicular Cloud

Hitherto, most of the literature on vehicular networks and ICMNs (also referred to as delay-tolerant networks (DTNs)) has focused primarily on a flat, decentralized architecture. We argue that in the real world, whether deployed on personal mobile devices or in vehicles, the omnipresent availability of cellular infrastructure makes possible an alternative, hybrid two-tier architecture. Analogous to the OpenFlow architecture [68], such a hybrid network would have separate control and data planes. The control plane would provide low-overhead bidirectional control messaging between nodes and a server via the cellular infrastructure that allows for centralized resource allocation. The data plane would be where heavier amounts of data are disseminated during encounters. The model we use is as follows. We assume that content is stored only on those vehicles that have subscribed to the content access application. Next, we assume that there is a central agent that can keep track of the storage status of vehicles for the purposes of this application. The storage across all these nodes is what we call the vehicular cloud.¹ There can be multiple types of content.
For each content, the nodes can be divided into three categories: seed nodes that contain the content, demanding nodes that want to download the content, and helper nodes that may cache the content and help demanding nodes get the content. We consider a Central Agent (CA) who can keep track of the storage on each node through the control plane. It could also send various commands to the helper nodes to inform them which content they need to act as a relay for.

¹ The term cloud is used in the following sense of storage in our work: the storage is abstracted out for the end-user (in our case, a user in a vehicle). It is up to the cloud storage vendor (in our case, the central agent) to maintain the details of where and how the files are stored, and other issues such as scalability, reliability, etc.

1.1.3 First Challenge: High Latencies due to Node Mobility

We define latency as the expected time it takes for a node requesting a content to download the content completely. Due to node mobility, the demanding node may encounter the seed nodes only occasionally, and/or the demanding node may be able to download only portions of the file each time it meets a seed node due to short contact durations. This could result in very high latencies, especially for large files.

1.1.4 Solution: Optimizing Latency using Distributed Storage Codes

In an attempt to reduce the latencies, we propose the use of distributed storage codes. The content to be stored is first encoded using a suitable erasure code and then distributed across the seed nodes. We show that doing this can reduce the latencies significantly.

1.1.5 Second Challenge: Helper Node Allocation

Recall that bandwidth as well as storage are limited resources in vehicular networks. Let us consider a simple example to motivate this challenge. Let there be two contents. There are two seed nodes and one demanding node for the first content, and one seed node and two demanding nodes for the second content.
Let there be N remaining helper nodes. Under storage and bandwidth limitations, how best should the agent allocate the helper nodes for the optimal dissemination of both contents?

1.1.6 Solution: Optimizing Helper Node Allocation Using Dissemination Utility as Metric

To approach this problem, we capture the outcome of the dissemination by assigning utilities. Since the overall goal is to satisfy the demanding nodes, the utility must either capture how many demanding nodes get satisfied after a certain time, or capture how long it took for all the demanding nodes to be satisfied. Therefore, we can have two models to capture the outcome: given a deadline, we can determine the average number of demanding nodes satisfied by the deadline, or we can determine the average number of encounters it takes for the dissemination to complete (i.e., all demanding nodes are satisfied). With the help of suitable models for the encounters and dissemination, we can compute these utilities for a given helper node allocation. Therefore, we can determine the helper node allocation that maximizes the utilities. The exact utility to be used depends on the application. If the content expires after some duration, one may use the former utility. Otherwise, if the content needs to be delivered to as many nodes as possible, then the latter utility will be suitable.

1.2 Data Center Networks

Data centers consist of anywhere between hundreds and tens of thousands of servers interconnected by a number of switches, all under one roof. Each server has compute and storage capabilities. The storage across all these servers is what we call the data center cloud. The net storage capacity can run into hundreds of petabytes.
For instance, one of Facebook's production clusters (Facebook is an online social network with more than one billion users) has at least 80 petabytes (PB) of total storage capacity across more than 3000 servers [85].

Figure 1.2: Number of failed nodes over a single month period in a 3000-node production cluster of Facebook.

Due to the large number of nodes associated with these clusters, failure is the norm rather than the exception. Figure 1.2 shows a trace of node failures in one of the production clusters of Facebook. It can be seen that it is quite typical to have 20 or more node failures per day, even when most repairs are delayed to avoid transient failures. This calls for reliable storage.

1.2.1 Bottleneck: Storage

Replication is the widely used mechanism to provide high reliability. For example, in the Google File System [36] and Hadoop HDFS [18], the default replication is three. When a block of data is lost due to a component failure, another copy of its replica is made, thereby restoring the lost block. Replication has a number of benefits, such as ease of implementation, ease of repair, good reliability and efficient reads. However, 3x replication entails an overhead of 200%, which reflects on the cost of the cluster. When the amount of data to be stored runs into tens or hundreds of petabytes, 3x replication starts to become a major expense for data center operators. And as the amount of managed data grows faster than the data center infrastructure, data center operators are hard-pressed to use alternatives to replication.
1.2.2 Proposal: Use Erasure Codes

Due to this bottleneck in storage, Facebook and many other data center operators are transitioning to erasure coding techniques (typically, classical Reed-Solomon codes) to introduce redundancy while saving storage [23, 34, 56], especially for data that is more archival in nature. Using the parameters of Facebook clusters, the data blocks of each large file are grouped in sets of 10, and for each such set, 4 parity blocks are created. The resulting 14 blocks are considered part of a stripe. This system (called RS (10, 4)) can tolerate any 4 block failures and has a storage overhead of only 40%, as opposed to the 200% of replication. RS codes are therefore significantly more robust and storage-efficient compared to replication. In fact, this storage overhead is the minimum possible for this level of reliability [29]. Codes that achieve this optimal storage-reliability tradeoff are called Maximum Distance Separable (MDS) [107], and Reed-Solomon codes [79] form the most widely used MDS family.

1.2.3 First Challenge: Repair Problem

Classical erasure codes are suboptimal for distributed environments because of the so-called repair problem: when a single node fails, typically one block is lost from each stripe stored on that node. RS codes are usually repaired with the simple method that requires transferring 10 blocks and recreating the original 10 data blocks even if a single block is lost [83], hence creating a 10x overhead in repair bandwidth and disk I/O.

1.2.4 Solution: Optimizing Repair Traffic Using Locally Repairable Codes

Recently, information theoretic results established that it is possible to repair erasure codes with much less network bandwidth compared to this naive method [27]. There has been a significant amount of very recent work on designing such efficiently repairable codes; see Section 2.2.2 for an overview of this literature.
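The overhead and repair numbers quoted above can be checked with a few lines of arithmetic. The sketch below uses the RS(10, 4) parameters from the text; the 256 MB block size is an illustrative assumption, not a number taken from any particular cluster:

```python
BLOCK_MB = 256          # hypothetical block size, for illustration only
k, parities = 10, 4     # RS(10, 4): 10 data blocks + 4 parity blocks
n = k + parities

# Storage overhead: extra bytes stored per byte of user data.
replication_overhead = 3 - 1          # two extra copies -> 200%
rs_overhead = parities / k            # 4 parities per 10 data blocks -> 40%

# Network traffic to repair a single lost block.
replication_repair_mb = BLOCK_MB      # fetch one surviving replica
rs_naive_repair_mb = k * BLOCK_MB     # fetch k blocks and re-encode: 10x

print(f"replication: {replication_overhead:.0%} overhead, "
      f"{replication_repair_mb} MB repair traffic per lost block")
print(f"RS(10, 4):   {rs_overhead:.0%} overhead, "
      f"{rs_naive_repair_mb} MB repair traffic per lost block")
```

The 10x repair blowup is independent of the block size, which is why it dominates cluster network traffic as the coded fraction of data grows.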
In this work, we implement a new family of erasure codes called Locally Repairable Codes (LRCs), which are efficiently repairable both in terms of network bandwidth and disk I/O. Our goal is to implement more efficient coding schemes that could allow a large fraction of data to be coded without facing this repair bottleneck. This could save petabytes of storage overheads and significantly reduce cluster costs.

1.2.5 Second Challenge: Placement Problem

In addition to storage and computation, another major component of data centers is the network. Data centers can be designed based on a number of network topologies, such as the single-rooted tree topology, the Fat-tree topology [10], the Clos network topology [42], etc. All these topologies have a tree-like structure, which means that cross-rack traffic shares the core switches. And since the core switches are limited in number, multiple flows could share the same set of core switches. This causes the cross-rack bandwidth to be much lower than the intra-rack bandwidth. For each stripe, existing systems place one block per fault domain (generally a rack): for example, there are 20 fault domains and 10 upgrade domains in Windows Azure Storage, and no two blocks of a stripe (referred to as a coding group) are placed in the same domain [48]. In GFS [36], replicas are distributed one per rack, and for erasure encoded chunks, a rack-aware placement policy makes sure that no two chunks of the same stripe are placed in the same rack [34]. Note that placing one block per rack (or fault domain) will create cross-rack repair traffic. This could in effect slow down the rate of repair, and could possibly reduce the reliability of data, defeating the very purpose of placing blocks one per rack. The other approach would be to place all the blocks on the same rack to speed up repair. But if the rack fails, all the blocks of the stripe become unavailable.
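The two extreme placements can be made concrete with a small sketch. This is our own illustration, not code from any system; `blocks_per_rack` is a hypothetical knob that caps how many blocks of one stripe may share a rack:

```python
def place_stripe(n_blocks, blocks_per_rack):
    """Assign block i of a stripe to a rack id, filling racks left to right."""
    return [i // blocks_per_rack for i in range(n_blocks)]

n = 14  # an RS(10, 4) stripe, as in the systems described above

spread = place_stripe(n, blocks_per_rack=1)     # one block per rack
colocated = place_stripe(n, blocks_per_rack=n)  # whole stripe on one rack

# Spread placement: any single rack failure erases at most one block of the
# stripe, but repairing a block must read from other racks (cross-rack traffic).
assert max(spread.count(r) for r in set(spread)) == 1

# Colocated placement: repair reads stay within the rack, but one rack failure
# erases all 14 blocks -- more than the 4 erasures RS(10, 4) tolerates.
assert colocated.count(colocated[0]) == n
```

Intermediate settings of `blocks_per_rack` trade cross-rack repair traffic against fault tolerance, which is exactly the design space explored later with MTTDL.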
It appears that both these storage schemes have shortcomings, and we are therefore interested in a methodology to systematically determine good storage schemes.

1.2.6 Solution: Optimizing Placement Using MTTDL as Metric

Mean Time To Data Loss (MTTDL) is widely used to capture the reliability of data storage schemes. As the name suggests, it represents how long it takes for permanent data loss to occur after repeated failure and repair cycles. The two opposing factors, namely fault tolerance and repair speed, are both accounted for when computing the MTTDL, and therefore MTTDL is a suitable metric for determining whether a storage scheme is good or not. The approach we follow is to determine the various possible placement schemes for a given set of parameters, evaluate the MTTDL of each scheme, and select the best scheme (the one that maximizes the MTTDL).

1.3 Contributions

There are several metrics that can be used to analyze and optimize distributed storage for cloud environments. In this work, we consider latency and a generic utility metric for the vehicular cloud, and total storage used, repair traffic and MTTDL for data center clouds.

• In Chapter 3, we propose the use of erasure codes for the vehicular cloud and prove analytically, as well as using trace-based simulations, that the use of erasure codes can help reduce the latency of file downloads from the vehicular cloud. Our analytical contribution is a novel probabilistic analysis of the latency of replicated and encoded distributed storage for vehicular networks. We analyze the expected delay for a vehicle trying to collect pieces to reconstruct a desired file by meeting other vehicles according to a memoryless process. We show how both replicated and encoded storage correspond to different balls and bins processes, and using stochastic dominance and coupling arguments on these processes, we bound the expected download time.
We identify three regions of interest depending on how the communication bandwidth per vehicle interaction d compares to the file size M and the vehicle storage capacity per file C/m (here C is the storage per node and m is the number of files; the setup is described in detail below). Our most surprising result is in the bandwidth limited regime, when d < C/m. For this case we show that distributed erasure coding yields a latency of M/d, which is equivalent to replicating all the files in all the vehicles. While coding uses much less storage, the equivalent setup of uncoded replication performs N/α times worse, where N is the number of vehicles and α is the redundancy factor. Beyond our analytical model, we present a comprehensive performance analysis using real vehicle traces consisting of 1,000 taxis in Beijing and 1,608 buses in Chicago, combined with a realistic 802.11p DSRC Packet Delivery Ratio (PDR) model. These simulations validate the key insights from the analysis, demonstrating that coded storage substantially improves the timeliness of file downloads, particularly in the bandwidth limited regime for large files. For instance, when using the Beijing dataset, we show that for downloading 1GB files, by the time 80% of the nodes are able to completely download the file under coded storage, only 4.4% of the nodes succeed if uncoded replication is used.

• In Chapter 4, we formulate and address mathematically the fundamental problem of resource allocation in the form of helper nodes in disseminating multiple contents in a hybrid intermittently connected mobile network under a general stochastic homogeneous contact process. We derive mathematical expressions for computing the expected time to satisfy all demands and the expected number of satisfied demands by a given deadline for a given helper node storage allocation, under a homogeneous stochastic encounter model with a general inter-encounter time distribution.
Using this, we show how to compute the node allocation that maximizes the social welfare under both metrics. We show some interesting trends: for instance, helper nodes have diminishing returns and are less effective at large deadlines, and an increase in demand is actually beneficial in reducing the expected delay in dissemination. We also formulate the problem of helper node allocation from a game theoretic perspective and show that when the central agent tries to maximize its profit under a proportional allocation policy, the resulting system generally has a price of anarchy greater than 1. We also find that, somewhat counter-intuitively, a content provider with lower demand may need to pay more to the agent.

• In Chapter 5, we motivate how regular MDS erasure codes are ill-suited for data center storage applications, and therefore implement a new class of erasure codes called Locally Repairable Codes in Hadoop HDFS. These codes are efficiently repairable both in terms of network bandwidth and disk I/O. We also design and implement HDFS-Xorbas, a module that replaces Reed-Solomon codes with LRCs in HDFS-RAID. We evaluate HDFS-Xorbas using experiments on Amazon EC2 and a cluster at Facebook. Our experiments show that Xorbas enables approximately a 2x reduction in disk I/O and repair network traffic compared to the Reed-Solomon code currently used in production. The disadvantage of the new code is that it requires 14% more storage compared to RS, an overhead that is information-theoretically optimal for the obtained locality. Less network and disk I/O implies that Xorbas repairs failures faster than HDFS-RAID.

• In Chapter 6, we identify the opposing tradeoffs in a data center when determining how to place blocks across a data center: the tradeoff between fault tolerance and repair speed.
We introduce a family of storage schemes, called placement schemes, that guide how blocks are placed in a data center, and systematically describe how to determine the best placement scheme among the many possible schemes. The two opposing tradeoffs are captured by a widely used metric called Mean Time To Data Loss, which can be used to gauge the effectiveness of any given placement scheme (higher MTTDL values are better). Therefore, we formulate the problem of identifying good placement schemes as an optimization problem, with MTTDL as the objective to be maximized. We have also developed a Monte Carlo data center simulator that simulates the failures and repairs typical in a data center, and it is used to determine the MTTDL of various placement schemes. Using the results of this simulator, we show how the best placement scheme varies with the different parameters associated with a data center. We also show cases where non-intuitive placement schemes do best, and we illustrate why certain schemes do better than others, which may help data center operators decide which placement to choose.

1.4 Thesis Organization

We first begin with some background on erasure coding for vehicular clouds and data centers, and compare and contrast our work with other related works in Chapter 2. The rest of the thesis is divided into three parts. Part I primarily concerns optimizing distributed storage in vehicular clouds. In particular, in Chapter 3, we show how distributed storage codes can help us tackle the challenge of high latencies in vehicular clouds. In Chapter 4, we give details on how to allocate helper nodes optimally in a vehicular cloud by using content dissemination utility as a metric. In Part II, we optimize distributed storage in data center clouds. In Chapter 5, we give details about our implementation of Locally Repairable Codes, which help optimize the repair traffic in data centers.
Next, in Chapter 6, we use MTTDL as a metric to optimize data placement in data centers. Finally, we conclude in Part III with a chapter on conclusions and future work (Chapter 7).

Chapter 2: Background and Related Work

As described in Chapter 1, the focus of this thesis is on optimizing various challenges involved in distributed storage for cloud environments. The two cloud environments considered here are the vehicular cloud and the data center cloud. This chapter is divided into two sections: a section on background (Section 2.1) and another on related work (Section 2.2).

2.1 Background

This section gives a brief introduction to erasure coding, which will be a recurring topic throughout the thesis, followed by introductions to vehicular networks, data centers, and an important metric used to measure the reliability of data: Mean Time To Data Loss (MTTDL).

2.1.1 Erasure Coding

Erasure coding is widely used in traditional distributed storage environments such as RAID (Redundant Array of Inexpensive Disks). But until recently, erasure codes had not found widespread adoption in large scale distributed storage environments, such as the data center and vehicular network clouds. In this section, we give a brief overview of erasure codes. For more details, refer to the book by Richardson et al. [82]. Erasure codes are generally identified by two parameters n and k, with n > k. To encode a file of size M bits using an (n, k) code, it is first split into k blocks of equal size (M/k bits each), and then n blocks are generated, each of size M/k bits. The code is called MDS (Maximum Distance Separable) if any k blocks are sufficient to reconstruct the file. So the M bits get expanded into nM/k bits. We call these n blocks of data a stripe. These n blocks can be stored on one or more storage nodes. The Reed-Solomon code [79] is an example of an MDS code.
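As a concrete (if toy) illustration of these definitions, the smallest MDS code is a single XOR parity, i.e. n = k + 1: any k of the n blocks recover the file. The sketch below is our own illustration, not a scheme used in the thesis; Reed-Solomon generalizes the same idea to arbitrary n − k parities:

```python
from functools import reduce

def encode(data: bytes, k: int):
    """Split `data` into k equal blocks (zero-padded) plus one XOR parity,
    forming an n = k + 1 stripe."""
    size = -(-len(data) // k)  # ceiling division
    blocks = [data[i*size:(i+1)*size].ljust(size, b"\0") for i in range(k)]
    parity = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))
    return blocks + [parity]

def recover(stripe, lost: int):
    """Rebuild the one erased block as the XOR of the k surviving blocks."""
    survivors = [b for i, b in enumerate(stripe) if i != lost]
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*survivors))

stripe = encode(b"hello world!", k=4)   # 4 data blocks + 1 parity block
assert recover(stripe, lost=2) == stripe[2]
```

This code tolerates exactly one erasure (n − k = 1); tolerating more erasures, as RS(10, 4) does, requires parities built over a larger finite field rather than plain XOR.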
In Chapter 3 and Chapter 6, we will consider the use of MDS erasure codes, where we use only the MDS property of the coding scheme and do not consider the actual coding scheme. In Chapter 5, we will use a non-MDS erasure code called a Locally Repairable Code. Erasure codes can also be classified into two categories: systematic and non-systematic. When a systematic code is used, k of the n blocks are the same as the data blocks. Therefore, these codes can be seen as containing k data blocks and n − k parity blocks. Non-systematic codes, however, do not guarantee that the data blocks will be among the n blocks. Systematic codes are invariably used for storage because they give faster access to the data blocks. Many implementations typically use the classical Reed-Solomon (RS) codes.

Suppose each block is stored on a different storage device and one storage device fails. We say the block is lost, or that it has been erased. In such a scenario, it is possible to determine exactly which block was lost. Note that this is different from the communication context, where one or more blocks might get corrupted and it is not immediately clear which blocks are affected. These are called errors in the context of communication (and therefore error control coding), as opposed to erasures in the context of storage (and therefore erasure coding). Note that a classical coding scheme such as Reed-Solomon coding can be used for both applications, except that the number of erasures that can be tolerated is larger than the number of errors that can be tolerated.

Earlier we said that the M bits of a file get divided into k parts. However, this is generally not the case in real systems. The block size is fixed first. Let us consider an example with a block size of 1MB. If a (14, 10) code is used, a 10MB file will be split into 10 blocks (each of size 1MB) and 14 blocks will be generated.
If a 20MB file is to be encoded, then each 10MB of the file will be encoded as above. If a file's size is not a multiple of 10MB, it can be padded with zero bits to make the size a multiple, so that the above procedure can be applied (in which case the actual file size will have to be tracked using metadata or other means).

2.1.2 Vehicular Networks

The recent development of the IEEE 802.11p WAVE (Wireless Access in Vehicular Environment) protocol [52] and the allocation of Dedicated Short Range Communications (DSRC) spectrum specifically for vehicular use are two big steps towards the practicality of vehicular networks. In the United States, the FCC has allocated 75MHz of spectrum in the 5.9GHz band exclusively for vehicular networks, and in Europe, the ETSI (European Telecommunications Standards Institute) has allocated a 20MHz range in the same band. These bands enable vehicle-to-vehicle communication as well as vehicle-to-infrastructure (and vice versa) communication, and capabilities like these open up a number of possibilities. Most applications focus on safety, such as avoiding rear-end collisions; extended braking [14, 52]; and detecting and disseminating information about potholes, bumps and other anomalous road conditions [31]. Recently, applications that concern entertainment and file sharing are also receiving attention and involve different challenges (e.g., AdTorrent [70], CarTorrent [59], FleaNet [60], C2P2 [54]).

2.1.3 Data Centers

A data center is essentially a large scale storage system. It can house anywhere from hundreds to tens of thousands of servers and host petabytes of data. All these servers are connected by a network topology. A detailed description of data centers is out of the scope of this thesis; we mainly consider the data center as a storage system backed by a network. There are many factors that can be used to quantify the performance of a storage system.
Among them, two that are relevant here are availability and reliability. Availability is the probability that a desired piece of information is available at any point in time, and is computed as the percentage of time the information has been available. Since the availability of high-performance systems is generally above 99%, it is expressed in terms of the number of nines in the percentage (for example, 99.999% availability is "five nines"). Reliability is often computed by a metric called mean time to data loss (MTTDL), which measures how long, in expectation, it takes for data to be lost. It may be expressed in hours or even years, and is generally useful for comparison purposes; the relative order matters rather than the absolute value. Thus an MTTDL with six more zeros means a million times more reliability.

Figure 2.1: The Markov chain used for RAID4/RAID5 reliability analysis. Here N is the number of disks.

2.1.4 Brief Introduction to MTTDL

Reliability can be measured using several metrics. The probability of permanent data loss (when a fraction of nodes fail) is one such metric, and it is sometimes just called reliability (see Greenan's thesis [40] for a precise definition). But this metric ignores that most failures do not cause permanent data loss, and that the corresponding repairs could take a long time, leaving the system vulnerable to permanent data loss if a subsequent failure occurs. Mean time to data loss, or MTTDL, is a widely used metric to capture the reliability of storage systems [20, 40, 76, 101, 108]. Note that it is sometimes called MTTF or availability. Storage systems typically have some kind of redundancy and offer repair mechanisms to recover lost data from failures. MTTDL captures the expected duration it takes for a non-zero amount of data to be lost such that it is not recoverable following the standard repair model of the system.
The idea is that the longer it takes for a system to reach this stage of permanent data failure, the more reliable it is. Among similar storage systems or schemes, if one offers higher values of MTTDL, then it is considered more reliable. Note that these values are not meaningful when considered in isolation, but are useful for purposes of comparison. There are two main factors that affect the MTTDL:

1. Rate of failures: The lower the rate of failure, the higher the MTTDL. If multiple blocks of a stripe are all placed on the same node, the rate of losing blocks becomes very high.

2. Rate of repair: The faster the repair, the higher the MTTDL. If lost data can be recovered fast, the window of vulnerability is reduced, thereby providing a higher MTTDL.

MTTDL is canonically calculated using a Continuous Time Markov Chain (CTMC) (see the book by Norris [72] for a detailed introduction to CTMCs). The states of the Markov chain represent the number of failures, with one or more states representing permanent data loss. The MTTDL is then the hitting time to reach the permanent data loss state starting from the initial state. Figure 2.1 shows the Markov model used to evaluate the MTTDL of RAID4 and RAID5 systems. Here each state represents the number of failed disks. Since RAID4/RAID5 cannot tolerate two disk failures, state 2 represents the state of permanent data loss. Refer to the work by Xin et al. for examples that show the calculation of MTTDL.

2.2 Related Work

This section is mainly divided into two parts. In the first part, we compare and contrast our work on the vehicular cloud with relevant existing literature on vehicular networks and vehicular clouds. In the second part, we present existing literature on data center clouds related to our work on data centers.

2.2.1 Vehicular Cloud

When it comes to files stored on vehicles, we are mainly focused on the dissemination of files. We divide this section into three categories.
First we talk about how various dissemination schemes can be divided into two main types (push-based and pull-based), next we talk primarily about the use of coding for dissemination, followed by a description of the existing literature on content distribution using helper nodes.

2.2.1.1 Dissemination Schemes

We identify two basic file/data sharing and dissemination schemes: one is the well-studied push-based mechanism (e.g., [97, 99]) and the other is a pull-based retrieval scheme. In push-based file transfer, a node with the file will push it into the network, and other nodes can download and/or relay the file if interested. In pull-based retrieval, a node that needs a file will try to download it from other nodes that have it. We note that push-based mechanisms are well-suited for traffic updates and other small files, whereas pull-based mechanisms are well-suited for large files. This is because push-based schemes work by replicating the same file over multiple relays, which can be quite inefficient for large files (e.g., epidemic routing [99], spray and wait [97]). We consider pull-based dissemination for large files in Chapter 3, and push-based dissemination in Chapter 4.

2.2.1.2 Coding for Vehicular Clouds

Vehicular networks can be considered examples of Intermittently Connected Mobile Networks (ICMNs) or Delay/Disruption Tolerant Networks (DTNs) because they involve intermittent encounters. There is a rich body of literature on routing and content dissemination for ICMNs, but most of it consists of push-based heuristics [51, 59, 61, 70, 97, 99, 103]. As noted earlier, push-based mechanisms work well for small-sized data transfers such as traffic updates, pothole monitoring and other content that might be of interest to all users, but would fail to perform well for large file transfers of interest to only a few users (e.g., movies, long videos).
This can be seen easily because traditional push-based schemes work by replicating the same file over multiple relays (e.g., epidemic routing [99], spray and wait [97], file swarming [59, 70]), which can be quite inefficient for large files. Other methods have been proposed, such as the use of erasure coding [51, 103] and network coding [61], and even though they help reduce the delay and enhance the reliability, they fundamentally work by pushing more data than the file size into the network. In Chapter 3, we primarily focus on large-file sharing and thus employ a pull-based scheme. In contrast to prior work such as that by Kapadia et al. [54], where the authors assume that the content in such a system is stored using uncoded replication, we advocate and analyze the performance when the content is stored using erasure codes.

Erasure coding and uncoded replication have previously been compared in other contexts. For instance, Weatherspoon et al. [105] compare coding and uncoded replication for distributed storage in a wired system, and argue that coding is a clear winner as it provides mean times to failure that are orders of magnitude higher than those provided by replication. Another work, by Rodrigues et al. [84], analyzes and compares these two approaches from the perspective of their ability to provide content availability in a P2P distributed hash table. They indicate that, given its complexity, erasure coding is useful only when the servers are extremely unreliable. Our work on vehicular networks has a different focus: not on ensuring availability in the face of server failures, but rather on reducing latency in the face of sparse and short-duration vehicular encounters. In this setting, we argue that coding is indispensable, particularly for large files.
Closest in spirit to our work are two previous studies with an overlapping set of authors, who have examined the use of erasure codes in DTNs for reliably routing information between a particular source-destination pair [51, 103]. These studies provide a comparative analysis showing that the use of erasure coding can provide significant robustness to en-route path/node failures (the focus of Jain et al. [51]), as well as reduced latency (the focus of Wang et al. [103]), for push-based networks. In contrast, our emphasis in this work is on evaluating the latency and reliability of erasure coding for a pull-based network, specifically suitable for large files. Also related to our work are papers that advocate the use of network coding for content dissemination or distribution. The use of network coding, in the form of mixing of packets at intermediate nodes, for content distribution was first proposed in the context of a content delivery system called Avalanche [37, 38]. Several researchers have extended this idea, adopting network coding to handle content distribution in vehicular networks, e.g., CodeTorrent [61], VANETCODE [8], CodeOn [63], and VCD [90]. Again, an essential distinction is that these works focus primarily on pushing small files and messages to other nodes, whereas our focus is on pull-based retrieval of large files. Our theoretical analysis relies on balls and bins processes (see, e.g., the book by Mitzenmacher et al. [69]) and stochastic dominance arguments [43, 100] that are used to obtain bounds on the expected delay for coded storage. If a Maximum Distance Separable (MDS) erasure code [82] is used, any k out of the n encoded chunks suffice to reconstruct the original file. One specific family of codes that are almost-MDS and are suitable for our application are digital fountain codes. Initially proposed by Byers et al.
[21] and later developed by Luby [67] and Shokrollahi [92], digital fountain codes are binary near-MDS codes (almost all sets of k(1 + ε) chunks suffice to reconstruct the file with high probability, but we neglect ε for simplicity) and have very fast and simple encoding and decoding algorithms. Using ideas related to this work, fountain code designs were introduced for sensor network problems by Dimakis et al. [28] and Kamra et al. [53].

2.2.1.3 Content Dissemination using Helper Nodes

As previously described, vehicular networks fall into the category of Intermittently Connected Mobile Networks (ICMNs) or Delay Tolerant Networks (DTNs). Our goal in Chapter 4 is to optimally allocate the helper nodes to relay various contents. The utility we use depends on successfully disseminating the content to the nodes that demand it. In our work, we take a mathematical approach to understanding the dissemination. Since a number of works have focused on a mathematical treatment of content dissemination in DTNs, we briefly survey some of them here.

One of the first works to analyze message dissemination in DTNs is by Groenevelt et al. [44]. It analyzes the expected delay in propagating a single message in a DTN using Markov models. We follow a similar approach of using Markov models, but our work is a generalization of theirs, since we consider dissemination of multiple files simultaneously (multi-content), each possibly to multiple nodes (multicast). Furthermore, we also analyze the expected number of demands satisfied by a deadline, a metric that might be useful for certain types of content. The work by Groenevelt et al. [44] led to the work by Zhang et al. [109], where the authors characterize the message delay approximately using ODEs. While we could have used the ODE approximations rather than the Markov model, we chose the latter because of the more general nature of the Markov model. Spyropoulos et al.
[95] are primarily concerned with analyzing the expected inter-encounter durations between nodes for various mobility models, as well as for a more realistic mobility model that they derive. In the work by Krifa et al. [58], the authors derive optimal policies for buffer management, since the amount of buffer and bandwidth is limited. While the authors make use of a similar contact model as well as similar metrics, their goal is to decide when to drop items from the buffer. The problem of disseminating news and other dynamic content to a mobile-phone-based ICMN is considered by Ioannidis et al. [49]. It is shown how to determine an optimal allocation of the service provider's bandwidth to maximize the social welfare of the network, such that the content at the users is as fresh as possible. The work of Altman et al. [12] has a similar flavor, but the goal is to design efficient ways of distributing dynamic content when the participating nodes could be either cooperative or non-cooperative. Reich et al. [80] and Ioannidis et al. [50] consider optimal distributed cache allocation strategies in ICMNs, along the lines of the well-known square root replication scheme [26]. While the scheme proposed by Reich et al. [80] is optimal at equilibrium, convergence is not guaranteed; Ioannidis et al. [50] overcome this issue. Even though these two works and ours share the similar objective (among others) of reducing access delay to content, their problem formulation and setting are fundamentally different from ours. Recently, DTNs have started to be analyzed in a game theoretic setting. The primary argument here is that the helper (relay) nodes may be inherently selfish, and so may not be willing to relay content for other nodes.
Some works consider fully selfish nodes [91, 106], whereas others consider socially selfish nodes [64], where nodes may be willing to carry content for other nodes depending on social ties (for example, nodes may prefer to help friends rather than strangers). In our work, we do not consider fully autonomous nodes, but assume that a central agent can control the nodes. We instead consider self-interested content providers and a revenue-maximizing central agent.

2.2.2 Data Centers

The second half of the thesis deals with storage in data centers. Chapters 5 and 6 deal with the use of novel erasure codes for data centers, and with the placement problem in data centers, respectively. Correspondingly, in Section 2.2.2.1 we present existing literature on erasure coding for data centers, and in Section 2.2.2.2 we present existing literature relevant to placement in data centers.

2.2.2.1 Erasure Coding for the Data Center Cloud

As mentioned earlier, erasure coding is widely used in storage systems such as RAID [76]. Next, we consider works that use erasure coding for large scale storage systems, and then give details on new erasure code designs that are specifically useful in such systems.

Erasure coding for storage systems: Weatherspoon et al. [105] characterize the availability and reliability of erasure coded systems analytically, and show that erasure coding helps achieve orders of magnitude better reliability. One of the key drawbacks of the analysis is that it assumes that the MTTDL of the system equals the MTTDL of a stripe divided by the number of stripes. This, however, does not hold, especially in the context of data centers.

Locally Repairable Codes: Optimizing code designs for efficient repair is a topic that has recently attracted significant attention due to its relevance to distributed systems. There is a substantial volume of work, and we only give a high-level overview here.
The interested reader can refer to the survey by Dimakis et al. [29] and references therein. The first important distinction in the literature is between functional and exact repair. Functional repair means that when a block is lost, a different block is created that maintains the (n, k) fault tolerance of the code. The main problem with functional repair is that when a systematic block is lost, it will be replaced with a parity block. While global fault tolerance to n − k erasures remains, reading a single block would now require access to k blocks. While this could be useful for archival systems with rare reads, it is not practical for our workloads. Therefore, we are interested only in codes with exact repair, so that we can keep the code systematic. Dimakis et al. [27] showed that it is possible to repair codes with network traffic smaller than the naive scheme that reads and transfers k blocks. The first regenerating codes [27] provided only functional repair, and the existence of exact regenerating codes matching the information-theoretic bounds remained open. A substantial volume of work (e.g., [29, 78, 98] and references therein) subsequently showed that exact repair is possible, matching the information-theoretic bound of Dimakis et al. [27]. The code constructions are separated into exact codes for low rates k/n ≤ 1/2 and high rates k/n > 1/2. For rates below 1/2 (i.e., storage overheads above 2), beautiful combinatorial constructions of exact regenerating codes were recently discovered [78, 89]. Since replication has a storage overhead of three, for our applications storage overheads around 1.4–1.8 are of most interest, which rules out the use of low-rate exact regenerating codes. For high-rate exact repair, our understanding is currently incomplete.
The problem of the existence of such codes remained open until two groups independently [22] used Interference Alignment, an asymptotic technique developed for wireless information theory, to show the existence of exact regenerating codes at rates above 1/2. Unfortunately this construction is only of theoretical interest since it requires exponential field size and performs well only in the asymptotic regime. Explicit high-rate regenerating codes are a topic of active research, but no practical construction is currently known to us. A second related issue is that many of these codes reduce the repair network traffic but at a cost of higher disk I/O. It is not currently known whether this high disk I/O is a fundamental requirement or whether practical codes with both small disk I/O and small repair traffic exist. Another family of codes optimized for repair has focused on relaxing the MDS requirement to improve on repair disk I/O and network bandwidth (e.g., [39, 47, 57]). The metric used in these constructions is locality, the number of blocks that need to be read to reconstruct a lost block. The codes we use are optimal in terms of locality and match the bound shown by Yekhanin et al. [39]. We note that optimal locality does not necessarily mean optimal disk I/O or optimal network repair traffic, and the fundamental connections between these quantities remain open.

2.2.2.2 Data Placement Schemes for Storage

We classify the existing literature into a number of categories as below.

Placement of Blocks in Data Centers: As mentioned in Section 1.2.5, most existing cloud-storage systems place one block per fault domain [34, 36, 48], with some minor variations. The default block placement policy of Hadoop is to place two replicas in one rack and the third replica in a second rack. But the Hadoop deployment at Facebook is modified to allow a more intelligent placement scheme [19].
In this work, the authors consider logical groups of racks and nodes, and replicas are placed randomly only within the chosen groups. Their particular implementation uses a group size of 2 racks and 5 machines, which helps reduce the probability of data loss by a hundred times. The main idea here is that the probability of data loss decreases with decreasing group size. Copysets [25] generalizes this idea and additionally incorporates scatter width to help improve the repair times. This is because with decreasing group sizes, even though the probability of data loss decreases, the repair time increases, since the repair now gets distributed over fewer nodes. Our work is complementary to these works and also considers erasure-encoded storage. We consider the entire data center to be one group and then determine the placement scheme. If instead multiple groups were to be considered, placement would still be an issue within each group. Ford et al. [34] consider availability and MTTF (Mean Time To Failure, the same as MTTDL when considered for data) at Google's clusters and describe some failure statistics using a year's worth of measurements. Unlike other works, they do consider erasure-coded storage, but not the tradeoff between fault tolerance and repair bandwidth, and therefore they place one block per fault domain.

Other placement issues in Data Centers: A closely related issue is the placement of jobs across the data center. Bodik et al. [17] consider this issue and identify the tradeoff between job placement and fault tolerance. They consider an initial placement of jobs and consider moving the services around to minimize bandwidth while maximizing the failure tolerance. In contrast, we consider placement of blocks, and moreover, we combine the fault tolerance and the repair rate (which depends on the bandwidth) into a single metric called MTTDL. In Volley [7], the focus is on placing application data across globally distributed datacenters.
They use an iterative optimization algorithm based on data access patterns, client locations, etc. In addition, Volley's focus is on application-level placement, not block-level placement.

Placement issues in other storage systems: Placement has been considered extensively in other storage systems, especially those related to RAID. RAID-1 has a very simple placement scheme: mirroring. In RAID-5 over N disks, N − 1 data blocks are placed on one disk each, followed by a parity block on disk N. This layout is then rotated (so that the parity block is now on disk N − 1, with the data blocks on the rest). RAID-6 uses RS coding and follows a similar rotated layout. The goal of these placements is to reduce disk bottlenecks, rather than to improve the MTTDL (if, say, parities are always placed on the same disk, then that disk will become a bottleneck). Douceur et al. [30] propose a distributed file replication mechanism using hill-climbing replica placement strategies to improve availability. In particular, they focus on maximizing the worst-case availability of files by repeatedly swapping files among different machines. However, their work concentrates on improving the availability of replicated data, while we focus on improving the reliability of erasure-coded data. Lian et al. [65] consider the tradeoff between rebuild speed and fault tolerance under randomized placement versus sequential placement, and offer an analytical framework to study the MTTDL of the system under both placements. They do not consider correlated failures due to the complicated nature of modeling them, which is an important factor in our work.

Miscellaneous: Wang et al. [102] implement MRPerf to explore, model, and simulate the MapReduce design space, such as node, rack, and network configurations, to best predict the expected MapReduce performance.
This work demonstrates the importance of application deployments in various network/node topology choices and their performance differentiation. Leong et al. [62] discuss distributed storage allocations for maximum reliability and optimal delay, but in the context of a model where nodes fail with some probability. A more realistic model for failures in data centers is to consider failure rates associated with the nodes, together with recovery rates.

Part I: Vehicular Cloud

Chapter 3: Optimizing Content Access Latency Using Distributed Storage Codes

We¹ investigate the benefits of distributed storage using erasure codes for file sharing in vehicular networks through both analysis and realistic trace-based simulations. We show that the key parameter affecting the on-demand file download latency is the ratio of file size to download bandwidth. When this ratio is small, so that a file can be communicated in a single encounter, we find that coding techniques offer very little benefit over simple file replication. However, we analytically show that for large ratios, under a memoryless contact model, distributed erasure coding yields a latency benefit of N/α over uncoded replication, where N is the number of vehicles and α the redundancy factor. Effectively, in this regime, coding yields the same performance as replicating all the files at all other vehicles, but using much less storage. We also evaluate the benefits of coded storage using large real vehicle traces of taxis in Beijing and buses in Chicago. These simulations, which include a realistic radio link quality model for an IEEE 802.11p dedicated short range communication (DSRC) radio, validate the observations from the analysis, demonstrating that coded storage dramatically speeds up the download of large files in vehicular networks.

¹ Some of the content in this chapter has also been presented in [86, 87]
3.1 Introduction

Here, we consider a P2P file sharing application where the files are stored in the nodes as a distributed repository and interested users retrieve these files on demand. The amount of data downloaded by a node is no more than the file size (excluding control data), and thus such a pull-based scheme is not only efficient but also scales well with the number of nodes, file size, etc. But in order to improve the latency and reliability of content access, intuitively the files should be stored with some redundancy. Thus, in order to reduce the latency of file access, we shift the burden from the expensive bandwidth to the relatively inexpensive storage, thereby enabling additional applications to run. Previously, Kapadia et al. [54] have suggested a similar scheme, where the content is stored using simple uncoded replication. The novel contribution of our work is that we recommend the use of erasure codes, especially for large files, and we quantify the performance improvement of coded storage compared to uncoded replication. We consider out of scope of this work the orthogonal problem of how the file repository is initially created and maintained. For this purpose, other previously proposed schemes such as coded dissemination [90] or direct infrastructure download could be used. When the inter-vehicle communication data rates are high, or when the communicated files are sufficiently small, we find that simply storing multiple copies of each file has almost identical performance to an optimized erasure-coded representation. However, we show that in other cases, when the file sizes are large compared to the download bandwidth, a distributed coded representation offers very substantial benefits and decreases the average download time by orders of magnitude.
3.2 Model and Problem Setup

In this section, we present a simplified model of a basic file sharing system together with a set of assumptions governing the model, making it amenable to analysis, but more importantly, giving us crucial insights into the system. Some of the simplifying assumptions (e.g., regarding mobility) will be relaxed later when we consider numerical simulations over realistic vehicular traces. We assume there are N identical participating seed vehicles (or nodes), each with a storage capacity of C bits allocated for the file sharing application. The total number of different files stored in the system is denoted by m; for simplicity, we assume that all the files have the same size of M bits (assume C ≥ M) and are equally likely to be requested. It is desired to distribute these m files to as many nodes as possible. It is assumed that the total available storage exceeds the total size of all files, i.e., NC ≥ mM. Denote α = NC/(mM) and note that we can store each file α ≥ 1 times throughout the system and saturate the available capacity in the system. Typically, we will have α < N, which means that each file will not be stored in all the nodes. We refer to α as the system redundancy, since it is the number of times each bit is stored in the system. In this chapter, we consider and analyze the expected delay in downloading files when the uncoded replication scheme and the coded storage scheme are used. For the uncoded replication scheme (also called uncoded storage), we simply store each file α times in the nodes, ensuring that a node doesn't store the same file multiple times (maximal spreading). On the other hand, for the coded storage scheme, an (n, k) MDS code is used: each file is split into k chunks and encoded into n chunks of the same size. We set n/k = α, equal to the total system redundancy.
This is because, as the effective size of each file after coding is nM/k, in order to saturate the system capacity we need NC = m(nM/k), yielding α = n/k. We focus on the latency experienced by a given sink vehicle that is trying to download one of the m files. For the analysis, we assume an i.i.d. encounter model in which the sink is an external node that encounters any of the N nodes uniformly at random at each encounter. We impose a key communication constraint: whenever the sink meets any other vehicle, it can download at most d bits of data. We refer to d as the bandwidth constraint. Note that this parameter implicitly incorporates both the duration of the contact and the link rate. In our numerical simulations, we relax these simplifying assumptions, as we use encounters based on real vehicular traces and the download bandwidth is not a constant but rather a random variable that depends on the contact duration and the link quality model used. Given all other parameters, we would like to determine the optimal values of n and k for coding. In order to do so, we note that each chunk has size M/k and so we want to choose k such that the chunk is downloadable within the bandwidth constraint d. Thus we want k ≥ M/d, but since higher k equates to higher coding complexity, we use k = ⌈M/d⌉. The chunk size is therefore either M or d, whichever is lower (in practice we will have d < M for large files). Note that k = 1 in fact corresponds to not using any coding at all.

Variable   Brief Description
N          Number of nodes
m          Number of files
C          Storage capacity of each node (bytes)
M          File size (bytes)
d          Bandwidth limitation (bytes)
(n, k)     Coding parameters
α          Redundancy factor
β          Number of chunks from a file in the same node
D          Random variable denoting the delay in downloading a file

Table 3.1: List of common variables used.

Now, once k is fixed, choose n = αk. Since there are N nodes, each node will contain β = n/N chunks of a file.
β > 1 implies there is at least one node which contains two different chunks of the same file. We define the delay or latency D as the number of encounters needed before being able to fully reconstruct a file, and in the next section we quantify the expected latency E[D] for both storage schemes. This can be multiplied by the expected inter-encounter time to give the latency in units of time. We have listed the commonly used variables along with brief descriptions in Table 3.1.

3.3 Theoretical Analysis

Given the number of nodes N, the storage per node C, the file size M, and the number of files m, we can compute the redundancy of each file α = NC/(mM). The files are stored according to coded or uncoded storage as described above, and the goal here is to analyze the expected delay in downloading a file from such a system. Since all files are equally popular, it is sufficient to consider any one file.
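To make the parameter choices concrete, the quantities α = NC/(mM), k = ⌈M/d⌉, n = αk, and β = n/N can be computed as in the following sketch. The numerical values used here are illustrative assumptions, not parameters from the thesis:

```python
import math

def coding_parameters(N, C, M, m, d):
    """Sketch of the parameter choices described in the text:
    alpha = NC/(mM) is the system redundancy, k = ceil(M/d) makes each
    chunk fit within one encounter, n = alpha*k saturates the total
    storage, and beta = n/N is the number of chunks per node."""
    alpha = (N * C) / (m * M)
    k = max(1, math.ceil(M / d))   # k = 1 recovers uncoded replication
    n = round(alpha * k)
    beta = n / N
    chunk_size = min(M, d)         # chunk size is M or d, whichever is lower
    return alpha, k, n, beta, chunk_size

# Hypothetical system: 1000 nodes with 1 GB each, 100 files of 100 MB,
# and 10 MB downloadable per encounter.
alpha, k, n, beta, chunk = coding_parameters(
    N=1000, C=10**9, M=10**8, m=100, d=10**7)
print(alpha, k, n, beta, chunk)
```

With these hypothetical numbers the system redundancy is α = 100, each file is coded with a (1000, 10) code, and β = 1, so each node holds exactly one chunk of each file.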
In this scheme, all the files are stored 'as such' in various nodes. We assume the system redundancy a to be an integer (recall that a is the number of times each file is stored in the system). Since the capacity C > file size M, each file can be stored completely in a node. When the sink meets a node, it can download a maximum of d bits or M bits (entire file) whichever is lower. So depending on the values of d and M, we can have two cases. If d 2 M, then there is no bandwidth constraint at all. So, IF[a node is good] = a/N, where a good node is one which contains the required file. Thus the number of nodes to be seen before encountering a good node is a geometric random 42 variable with mean IE[D] = ~. But if d < M, only a fraction diM of the file will be downloaded every time the sink meets a node. So IE[D] = ~ ( A:J). 1 "-D_D __ o __ o_o_D___.I o o o o D 1 2 N Figure 3.1: A vehicular network with N nodes and redundancy o: represented in the balls and bins framework. Each node is represented as a square, with the shaded squares containing copies of the file the sink is interested in. Alternatively, in the balls and bins framework of Fig 3.1, this corresponds to throwing balls into N bins where each ball can land into any one bin with equal probability. If there is no bandwidth restriction, we are interested in counting the average number of balls to be thrown before a ball lands into one of the shaded bins (which is N I o: as above). But when there is bandwidth restriction, we want to determine the number of balls in expectation that must be thrown until the first o: bins contain M Id balls total. In this case, once a ball lands in any of the shaded bins, we repeat the experiment again. Note that a ball can fall into the same bin multiple times, which is equivalent to meeting the same node multiple times; but at each time the sink can download a different portion of the file. So IE [D] = ~ ( ~) which is the same as that obtained above. 
Thus we have

E[D] = N/α, if d ≥ M (i.e., M/d ≤ 1),
E[D] = (N/α)(M/d), if d < M (i.e., M/d > 1).

Combining,

E[D] = (N/α) max(1, M/d).   (3.1)

Hence, the expected delay is increased by a factor of M/d when there is a bandwidth constraint.

3.3.2 Coded File Storage

In this section, we analyze the expected delay in reconstructing a file under a coded storage scheme. As explained before, when using an (n, k) code, each file is split into k chunks and then coded into n chunks and distributed to the nodes. In order to reconstruct the file, the sink has to download any k out of the n chunks. Whereas each file is stored α times in uncoded replication, it is expanded α = n/k times when using coding. Thus, analyzing the delays of both cases for the same α makes a fair comparison.

Balls and Bins model

Figure 3.2: The balls and bins model for the coded case for integer β ≥ 1

Recall that β is the number of chunks per file that a node gets to store (β = n/N). As before, each node can be represented as a bin and thus we have N bins. Balls thrown into the bins are equivalent to the sink meeting a node at each time step. Whenever the sink meets a node, it can only download a single chunk since the chunk size is equal to the
In the balls and bins process, we are interested in finding out the number of balls to be thrown in expectation, so that there are k total balls in all of the bins. We make a note that since the bins have limited capacity, they could overflow and hence the required expectation is not always k. Let us analyze the delay by considering three different cases based on whether f3 :::; 1, 1 < f3 < k or f3 2:': k. Case 1: f3 :::; 1 1 .... D_D __ o __ o_o _D___.I o o o o D 1 2 n N Figure 3.3: The balls and bins set up when f3 :::; 1 In order to understand the capacity f3 being less than 1, we note that this in fact corresponds ton :::; N, since f3 = n/N. Thus only n out of theN nodes store the chunks. Without loss of generality, consider the first n nodes to contain the chunks. In the balls and bins process, we are interested in counting the number of balls in expectation to be thrown until there are k balls in any of the first n bins (see Fig 3.3). The expected 45 number of balls to be thrown before the first ball lands into any of then bins is N In; the second ball takes N I ( n - 1) in expectation and so on. Thus, N N N IE[D] = - + --1 + ... + k 1 n n- n- + = N(Hn- Hn-k) "'Nlog [nl(n- k)], since Hn "'log n. Using a = nlk, we get, IE[D] "'Nlog [al(a- 1)]. (3.2) Even though this equation does not depend on parameters like M, C etc., there is an implicit dependence, since for example, a depends on N, C, M, m and we need to have the chunk size M I k to be equal to the bandwidth constraint d. Case II: 1 < (3 < k To recall, (3 is the capacity per bin, or the number of times the same node can be seen before the sink runs out of useful chunks. Let us assume that (3 is an integer. We are interested in finding out the expected number of throws to get k balls into the system. Deriving an exact expression for IE[D] seems hard and so we upper bound the expected delay. Also note that IE[D] 2 k. 
Let us define the state of the system S at any time as the arrangement of the balls in the bins, and |S| to be the number of balls in the system. For example, when |S| = 2, valid states include S = {2, 0, 0, ..., 0}, {1, 1, 0, ..., 0}, etc., where the jth element in the set corresponds to the number of balls in the jth bin. For a general i, there are an exponential number of states S such that |S| = i. Let T_{i→i+1} be the number of balls required to add one more ball to the system, given that there are already i balls in the system. The expected delay is then

E[D] = Σ_{i=0}^{k−1} E[T_{i→i+1}].   (3.3)

We first note that the distribution of T_{i→i+1} can be determined if the current state is given; otherwise it is extremely difficult. For example, given that S = {0, 0, ..., 0} (i.e., i = 0 and there are no balls in the system), T_{0→1} is 1 with probability 1 (or geometric with failure probability 0); and given that the state is S = {β, 0, 0, ..., 0} (i.e., the first bin is full with β balls), then T_{β→β+1} is geometric with failure probability 1/N. Thus once we know the state, we can determine the distribution of T_{i→i+1}. But what can we say about T_{i→i+1} without conditioning on the state? As an example, suppose there are i = β balls in the system; then there is a finite probability q with which one of the bins may be full, in which case the distribution is geometric with failure probability 1/N, and with the remaining probability (1 − q), the distribution is geometric with failure probability 0. Thus we can express P[T_{β→β+1} = z] = q P[Geom(1/N) = z] + (1 − q) P[Geom(0) = z], where Geom(x) is a geometric random variable with failure probability x. We make a note that T_{β→β+1} is a probabilistic mixture of two geometric random variables with mixing probabilities q and 1 − q. For a general i, T_{i→i+1} is a probabilistic mixture of at most N geometric random variables.
Even though it is difficult to determine the mixing probabilities for every i, we can effectively eliminate them, for which we use the concept of stochastic dominance (see [43, 100] for more details):

Definition 1 (Stochastic Dominance). Consider two random variables X and Y, possibly defined on different probability spaces. X is stochastically smaller than Y if for every z ∈ ℝ the probability inequality P(X ≤ z) ≥ P(Y ≤ z) holds, or, in terms of the cumulative distribution functions, F_X(z) ≥ F_Y(z). This is denoted as X ≼ Y, i.e., X is stochastically dominated by Y.

Another concept we need below is that of coupling.

Definition 2 (Coupling). For a given set of random variables X_1, X_2, ..., X_n, a coupling is defined as a new set of random variables (X̂_1, X̂_2, ..., X̂_n) over the same probability space such that the marginal distribution of X̂_i is the same as that of X_i for i = 1, 2, ..., n. Thus for all measurable subsets E of ℝ, P(X_i ∈ E) = P(X̂_i ∈ E).

Remark. If X ≼ Y, then E[X] ≤ E[Y]. This is seen by noting that E[X] = Σ_z (1 − F_X(z)) ≤ Σ_z (1 − F_Y(z)) = E[Y].

We list three useful lemmas below.

Lemma 3.3.1. A random variable X is stochastically dominated by another random variable Y if and only if there exists a coupling (X̂, Ŷ) of X and Y such that P(X̂ ≤ Ŷ) = 1. Refer to Lemma 2.11 in Hofstad et al. [100] for the proof.

Lemma 3.3.2. Let X ~ Geom(p) and Y ~ Geom(q), where p and q are the failure probabilities. If p ≤ q, then X ≼ Y.

Proof. At each step (starting from 0), a real number is selected uniformly at random from [0, 1], and X̂ and Ŷ denote the first step at which the chosen number falls outside [0, p] or [0, q], respectively. We note that if X̂ succeeds at step x, so that the number chosen at that step is for the first time higher than p, then Ŷ could not have succeeded before x, i.e., P(Ŷ ≥ x | X̂ = x) = 1. Hence P(X̂ ≤ Ŷ) = Σ_x P(Ŷ ≥ x | X̂ = x) P(X̂ = x) = Σ_x (1) P(X̂ = x) = 1. Thus P(X̂ ≤ Ŷ) = 1, giving X ≼ Y.
Hence a geometric random variable is always stochastically dominated by another geometric random variable with a higher failure probability. □

Lemma 3.3.3. Let X_1, X_2, ..., X_l be l random variables with X_j ≼ X_1 for all j = 2, 3, ..., l. If X is a probability mixture of X_1, X_2, ..., X_l, such that p_X(z) = Σ_{j=1}^{l} a_j p_{X_j}(z) with constants a_j ≥ 0 (j = 1, 2, ..., l) and Σ_{j=1}^{l} a_j = 1, then X ≼ X_1.

Proof. Since X_j ≼ X_1 for every j, we have F_{X_j}(z) ≥ F_{X_1}(z), and so F_X(z) = Σ_{j=1}^{l} a_j F_{X_j}(z) ≥ F_{X_1}(z) for all z, which implies X ≼ X_1. □

In other words, this lemma states that a probabilistic mixture of geometric random variables is stochastically dominated by the constituent geometric random variable with the biggest failure probability. Back to the case when i = β: since the biggest failure probability is 1/N, the corresponding geometric random variable stochastically dominates the other geometric random variables (Lemma 3.3.2), and so from Lemma 3.3.3 we see that T_{β→β+1} ≼ Geom(1/N), conveniently removing the dependence on q. Thus we note that when T_{i→i+1} is a probabilistic mixture of geometric random variables with biggest failure probability p, then T_{i→i+1} ≼ Geom(p). We are now all set to derive the upper bound on the latency.

Theorem 3.3.4. The expected delay due to coded storage in the case when 1 < β < k is upper bounded by n(H_N − H_{⌈N(1−1/α)⌉}), where H_N is the Nth harmonic number.

Proof. Consider a random variable D' = Σ_{i=0}^{k−1} T'_{i→i+1}, where T'_{i→i+1} is a geometric random variable with failure probability p'_i. By suitably choosing p'_i, we will first prove that T_{i→i+1} is stochastically dominated by T'_{i→i+1} for each i. When the first β balls are thrown, none of the bins could have overflowed, and so T_{i→i+1} is geometric with failure probability 0 for i = 0, 1, ..., β − 2, β − 1. We choose p'_i = 0 in these cases. Once there are β balls in the system, as explained above, T_{i→i+1} is a probabilistic mixture of geometric random variables with the biggest failure probability 1/N.
From the insight above, we set p'_β = 1/N, so that T_{β→β+1} ≼ Geom(1/N). Also, when β ≤ i ≤ 2β − 1, it is not possible to have two or more bins full, and so in all these cases the biggest failure probability is 1/N; thus we set p'_i = 1/N for β ≤ i ≤ 2β − 1. Further, it is not difficult to see that we should set p'_{2β} = 2/N. Using similar arguments, we set p'_i = j/N for jβ ≤ i < (j + 1)β, for j = 0, 1, ..., (k/β − 1) (assuming k/β = x to be an integer; otherwise use x = ⌊k/β⌋ above). For the last case, when i = k − 1, since x or more bins cannot be full, set p'_{k−1} = (x − 1)/N. We note that

x = k/β = kN/n = N/α.

For each i, since we have chosen p'_i to be at least as big as the biggest failure probability in the probabilistic mixture of geometric random variables that constitute T_{i→i+1}, we can note that T_{i→i+1} ≼ Geom(p'_i) using Lemmas 3.3.2 and 3.3.3. This implies E[T_{i→i+1}] ≤ E[Geom(p'_i)], and thus E[D] ≤ E[D'] by summing over all i. Noting that E[Geom(p)] = 1/(1 − p),

E[D'] = Σ_{i=0}^{k−1} E[T'_{i→i+1}]
      = Σ_{j=0}^{x−1} Σ_{i=jβ}^{(j+1)β−1} E[T'_{i→i+1}]
      = Σ_{j=0}^{x−1} Σ_{i=jβ}^{(j+1)β−1} E[Geom(j/N)]
      = Σ_{j=0}^{x−1} β/(1 − j/N)
      = βN Σ_{j=0}^{x−1} 1/(N − j)
      = n(H_N − H_{N−x}) = n(H_N − H_{⌈N(1−1/α)⌉}).

Since k ≤ E[D] ≤ E[D'], we get the claimed bound. □

Note that when using this expression, it does not matter whether x or β is an integer or not. Also, for high values of α, n(H_N − H_{⌈N(1−1/α)⌉}) ≈ βN log[α/(α − 1)], since β = n/N. The above analysis is not restricted to 1 < β < k. We can note that when β ≥ k, α = n/k ≥ N, and thus for large values of α the average delay expression can be approximated as k ≤ E[D] ≤ k[1 + 1/(2α) + o(1/α²)], thus agreeing with the result below.

Case III: β ≥ k

Since the capacity is sufficiently large, no bin can get full in k throws, and so E[D] = k. To summarize,

E[D] ≈ N log[α/(α − 1)], if β ≤ 1,
E[D] ≤ βN log[α/(α − 1)], if β > 1.

Combining, we obtain our final bound,

E[D] ≤ max(1, β) N log[α/(α − 1)] ≈ (N/α) max(1, (M/d)(α/N)),   (3.4)

where the approximation holds for high values of α.
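The Theorem 3.3.4 bound can be checked against a direct simulation of the capacitated balls-and-bins process. The sketch below uses small illustrative parameters (an assumption for demonstration, with integer β) and verifies k ≤ E[D] ≤ n(H_N − H_{⌈N(1−1/α)⌉}):

```python
import math
import random

def harmonic(n):
    # n-th harmonic number H_n
    return sum(1.0 / i for i in range(1, n + 1))

def theorem_bound(N, n, k):
    """Upper bound of Theorem 3.3.4: n*(H_N - H_{ceil(N*(1-1/alpha))})."""
    alpha = n / k
    return n * (harmonic(N) - harmonic(math.ceil(N * (1 - 1 / alpha))))

def sim_coded_delay(N, n, k, trials=5000, seed=7):
    """Simulate the capacitated balls-and-bins process of Case II:
    each of N bins accepts at most beta = n/N useful balls; count
    throws until k useful balls (distinct chunks) are collected."""
    rng = random.Random(seed)
    beta = n // N
    total = 0
    for _ in range(trials):
        filled = [0] * N
        got = steps = 0
        while got < k:
            steps += 1
            b = rng.randrange(N)
            if filled[b] < beta:       # this node still has a new chunk
                filled[b] += 1
                got += 1
        total += steps
    return total / trials

N, n, k = 20, 40, 10                   # beta = 2, alpha = 4, so 1 < beta < k
est, bound = sim_coded_delay(N, n, k), theorem_bound(N, n, k)
print(round(est, 2), round(bound, 2))
```

For these numbers the bound evaluates to about 11.18 throws, and the simulated mean lands between k = 10 and the bound, as the theorem requires.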
A note on the choice of k: in order to derive all the above expressions, we have chosen k = M/d. But what would happen if k were chosen any higher? First, consider the case when n is a multiple of N. By choosing k twice its actual value, for instance, each node will have only half the chunk size as before, but twice the number of chunks, so the amount of data per node is the same, and thus the average delay will be the same. Now if n < N, by increasing k we also increase n, i.e., more nodes contain desired data, but each contains less of it. Overall, we did not observe any improvement in the average delay by increasing k.

3.3.3 The Benefits of Coding: Summary

By comparing eqn (3.1) and eqn (3.4), it is clear that the expected delay with coding is at least as good as or better than uncoded replication. The interesting values of d are those where M/d = 1 and (M/d)(α/N) = 1; the former gives d = M and the latter gives d = Mα/N = C/m. Thus we have the following three regimes:

• (High bandwidth regime) d ≥ M: The expressions for the latencies of the coded and uncoded storage schemes become almost equal, with E[D_uncoded] = N/α ≈ E[D_coded], and so coding performs the same as uncoded replication (note that since k = 1, coding is equivalent to uncoded file replication).

• (Intermediate bandwidth regime) C/m ≤ d ≤ M: From the expressions, we have E[D_uncoded] = (N/α)(M/d) and E[D_coded] ≲ N/α. Thus the improvement from using coding is M/d. Each node cannot store chunks from all the files (since C/m ≤ d), and so the sink has to wait to meet good nodes, which is the only factor contributing to the delay.

• (Bandwidth limited regime) d ≤ C/m: We obtain E[D_coded] ≤ M/d (and since E[D_coded] ≥ k = M/d, we have E[D_coded] = M/d). Because E[D_uncoded] = (N/α)(M/d), the improvement here is N/α. Thus under such a severe bandwidth constraint, coding performs as if complete files were available in all the nodes, limited only by the bandwidth.
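The three regimes can be summarized in a small helper. The function below (an illustrative sketch; the parameter values are hypothetical) classifies d and returns the predicted latency improvement of coded over uncoded storage:

```python
def regime_and_gain(N, C, M, m, d):
    """Classify the bandwidth regime described above and return the
    predicted improvement factor E[D_uncoded]/E[D_coded].
    alpha = NC/(mM); the regime boundaries are d = M and d = C/m."""
    alpha = (N * C) / (m * M)
    if d >= M:
        return "high-bandwidth", 1.0           # coding ~ replication
    elif d >= C / m:
        return "intermediate", M / d           # improvement M/d
    else:
        return "bandwidth-limited", N / alpha  # improvement N/alpha

# Hypothetical system: N=1000, C=1 GB, M=100 MB, m=100 files (C/m = 10 MB).
print(regime_and_gain(1000, 10**9, 10**8, 100, d=10**8))  # d = M
print(regime_and_gain(1000, 10**9, 10**8, 100, d=10**7))  # C/m <= d < M
print(regime_and_gain(1000, 10**9, 10**8, 100, d=10**6))  # d < C/m
```

With these hypothetical numbers, shrinking the per-encounter bandwidth moves the system from no coding gain, to a gain of M/d, and finally to the maximum gain of N/α.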
3.4 Trace Based Experiments

We now turn to an empirical evaluation of the benefits of coded storage, using real vehicular traces. We use GPS traces of 1,000 taxis in Beijing and 1,608 buses in Chicago. We assume that the nodes continue to run their application throughout the day. Note that we do not assume the nodes to be moving throughout the day, only that the on-board radio and computer may continue to work even if the node is stopped. For inter-vehicular communication, we used a realistic model of IEEE 802.11p from [15], the details of which are given in Section 3.4.4 below. We next present the description of the datasets used.

3.4.1 Dataset Description

The Beijing dataset consists of GPS traces collected from 00:00 hrs to 23:59 hrs on Jan 5, 2009 local time, recorded every minute for a total of 2,927 taxis. The GPS coordinates span 32.1223 to 42.7413 in latitude and 111.6586 to 126.1551 in longitude. Of these 2,927 taxis, we chose a thousand randomly for our simulations. In Fig 3.4a, we show² the routes taken by a randomly chosen subset of these thousand taxis. Note that we display only about 8 taxis to avoid clutter. For the Chicago dataset, we collected data starting from Nov 1, 2010 at 11:06 hrs (Chicago local time), sampled every 30 seconds, and used the first 24 hours of data. The latitudes and longitudes of this dataset range from 41.6440 to 42.0651 and -87.8866 to -87.5256 respectively. The routes taken by a random subset of these nodes (7 buses) are shown in Fig 3.4b. Since the routes are carefully planned ahead, one can see the difference between the routes in Fig 3.4a and Fig 3.4b. We also note that many routes overlap with each other (spatially if not temporally) and so not all can be seen clearly from the map. In Fig 3.5, the density of the nodes is shown for each dataset, for which we plot the number of moving nodes versus time.
In Fig 3.5a and Fig 3.5b, the 0 minute corresponds to the time the dataset begins, and hence for the Beijing dataset it is 00:00 hrs, whereas for the Chicago dataset it is 11:06 hrs (both local time).

² We used gpsvisualizer.com and google.com to obtain these plots.

Figure 3.4: Maps of the routes traced by a few randomly selected nodes in the (a) Beijing and (b) Chicago datasets. We limited the number of nodes so as to not clutter the image. Colors are chosen randomly for each node by the tool we used to plot the routes.

Figure 3.5: Density of moving taxis vs time, for the (a) Beijing and (b) Chicago datasets.

We determine that a node is not moving if its coordinates do not change for a continued duration (about two minutes). We note that the average duration that a taxi is moving is 9.4 hours in the Beijing dataset and, quite remarkably, this value is 9.6 hours for the Chicago dataset. It can be seen in Fig 3.5a that, as the dataset starts at 12 am, the density drops to its lowest at around 4 am, then starts to pick up and reaches a peak between 8 am and 10 am. It drops after that but again reaches a peak between 4 pm and 6 pm, after which it starts decreasing rapidly. Correspondingly in Fig 3.5b, which starts at about 11 am, the density peaks between 4 pm and 6 pm, then drops very low at around 4 am, and then picks up again to reach a peak at 8 am.

3.4.2 Performance Metrics

In order to characterize the performance of the system, we cannot simply use the average delay in downloading a file as a figure of merit.
This is because, since the traces are time limited, there could be files that do not get fully reconstructed by the end of the duration of the trace, and so it is hard to quantify the delay of such incompletely downloaded files. Thus, we rely primarily on two metrics: one is the full-recovery probability, which measures the probability that a file can be fully recovered by a sink by a given time, and the other is the average file download percentage, which measures, on average, how much of a file is downloaded by a given time. Thus, for example, a full-recovery probability of 0.9 means that the nodes were able to successfully download full files 90% of the time, and an average file download percentage of, say, 95 means that the nodes were able to download 95% of the file on average.

3.4.3 Experiment Methodology

The nodes are indexed from 1 to N, where N is the number of nodes (N = 1000 for the Beijing dataset, and N = 1608 for the Chicago dataset). The files are indexed 1 through m. Since the end goal is to deploy a file sharing system in a vehicular network, we try to make reasonable choices of the various parameters involved. A capacity of 100GB per node is assumed as a default, unless specified otherwise. Similarly, by default, files are assumed to be of size 1GB, typical of movie clips, and we consider a default of 2,500 files in the system. As explained before, the two primary metrics of performance are the full-recovery probability and the average file download percentage, both characterized as functions of time. Therefore, our experimental methodology is to carry out a number of experiments, in each of which a sink tries to download a file. We record these metrics of interest along time and average across experiments.
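Both metrics are straightforward to compute from per-experiment download traces. The following is a minimal sketch (the data layout and function name are hypothetical, not from the dissertation's simulator): given the fraction of the requested file that each experiment has downloaded by some time t, it returns the two metrics at that time.

```python
def metrics_at_time(download_fractions):
    """Given the fraction (0.0-1.0) of the requested file downloaded by
    each experiment at some time t, return (full-recovery probability,
    average file download percentage) as defined above."""
    n = len(download_fractions)
    full_recovery_prob = sum(1 for f in download_fractions if f >= 1.0) / n
    avg_download_pct = 100.0 * sum(download_fractions) / n
    return full_recovery_prob, avg_download_pct

# Hypothetical outcomes for 5 experiments at some time t:
p, pct = metrics_at_time([1.0, 1.0, 0.5, 0.25, 1.0])
# p = 0.6 (3 of 5 fully recovered), pct = 75.0
```

Evaluating this at every time slot yields the curves plotted in the figures of this section.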
Figure 3.6: Evaluating the performance of distributed storage codes in the default setting consisting of 2,500 files, each of size 1GB, stored in nodes each having 100GB storage, for both the Beijing and Chicago datasets. There are 1,000 nodes in total in the Beijing dataset and 1,608 nodes in the Chicago dataset. Panels: (a) full-recovery probability (Beijing); (b) average file download percentage (Beijing); (c) full-recovery probability (Chicago); (d) average file download percentage (Chicago).

Each experiment consists of the following three steps: first, the files are allocated to the nodes; then, a sink-file pair is determined; and this is followed by a simulation of the encounters between the sink and the rest of the nodes using the trace. We describe these aspects in detail next. The first step is that of storing the files onto the nodes. If coding is not used, files are not transformed; but if coding is used, files are encoded to get chunks. In both the uncoded and the coded storage schemes, the files and the chunks are stored by ensuring maximal spreading. That is, in the case of uncoded storage, we make sure not to store the same file twice or more in the same node; and in the case of coded storage, multiple chunks of the same file are not stored in the same node, unless all other nodes have been used.
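The maximal-spreading rule for coded chunks can be sketched as a simple round-robin placement, so that a node receives a second chunk of a file only after every other node already holds one. This is an illustrative sketch (our own function names; a real deployment might also randomize the node order):

```python
import itertools

def spread_chunks(num_nodes, chunks):
    """Place chunks of one file with maximal spreading: no node gets a
    second chunk of the file until every node already holds one.
    Returns {node_index: [chunk_ids]}."""
    placement = {node: [] for node in range(num_nodes)}
    node_cycle = itertools.cycle(range(num_nodes))
    for chunk in chunks:
        placement[next(node_cycle)].append(chunk)
    return placement

# 5 chunks over 3 nodes: nodes 0 and 1 get a second chunk only after
# all three nodes hold one.
print(spread_chunks(3, ["c0", "c1", "c2", "c3", "c4"]))
# {0: ['c0', 'c3'], 1: ['c1', 'c4'], 2: ['c2']}
```

The uncoded case follows the same idea at file granularity, never repeating a file on a node.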
In fact, we found that with random storage of files/chunks, coding still performed virtually the same, whereas the performance of the uncoded storage scheme decreased slightly. Thus, we decided to use maximal spreading so as not to worry about the performance degradation introduced by randomization, even though random storage may be more realistic. Next, a random node is selected to be a sink by selecting a random index from {1, 2, ..., N}, and it tries to download a random file (by choosing a random index from {1, 2, ..., m}). The third step involves the simulation of the contacts between the sink and the other nodes, so that the sink can download the chosen file opportunistically. Note that the day-long trace is divided into intervals of length one minute each for the Beijing dataset and 30 seconds each for the Chicago dataset, resulting in a total of 1440 and 2880 slots for the Beijing and the Chicago datasets respectively. The choice of the granularity is dictated by the dataset. At each slot, we determine the distance between the given sink and every other node, and apply the radio model (described below) to find out the number of packets transferred, if any. For each experiment, we keep track of the percentage of the file downloaded and whether the file download is complete or not at each time step. When presenting the results, we average over 50 random sinks, and for each sink, we run the entire simulation 100 times, choosing a different file each time.

3.4.4 Realistic Radio Link Model

The IEEE 802.11p standard specifies the data rate to range from 1.5Mbps to 27Mbps, with the default being 3Mbps, which we use in our simulations. For inter-vehicular communication, we use an empirical model of packet delivery characteristics obtained from [15].
The authors characterize the packet delivery ratio (PDR) against various parameters such as the separation between two nodes, their relative velocity, etc., in a number of different environments, and the overall experiments lasted for about 30 hours. Of the various environments in which their experiments were conducted, the closest match to our dataset is the Suburban Road (SR) environment. Thus we use their PDR vs. separation distance data (Fig 3(a) in Bai et al. [15]) to carry out our simulations. It may also be emphasized that the authors found that the relative velocity between two nodes does not significantly affect the PDR the way inter-vehicular distance does. We choose packet sizes of 380 bytes with a payload of 300 bytes. Additionally, a protocol set-up time of about 1 ms is considered (based on the work by Bai et al. [15]).

3.4.5 Choice of the coding parameter k

From the Beijing dataset, we observed an average contact duration of 55.6 s (assuming a radio range of 500 m), leading to an average data transfer of 21MB (at 3Mbps under ideal conditions). Since it is desirable to be able to transfer multiple chunks per encounter, we choose a safe chunk size of 1MB. We use this same chunk size for the Chicago dataset too.

Figure 3.7: Plots showing how various parameters affect the full-recovery probability. Panels: (a) file size varied; (b) number of files varied; (c) capacity per node varied.
In each of the cases, one parameter is varied while keeping the others constant. Typical values used are a storage capacity of 100GB, 2,500 files, and a file size of 1GB.

Figure 3.8: Plots showing the impact of different parameters on the average file download percentage. The parameters are the same as in Fig 3.7. Panels: (a) file size varied; (b) number of files varied; (c) capacity per node varied.

3.4.6 Discussion of the Results

Our most important results are shown in Fig 3.6, in which we consider a typical file sharing scenario with 2,500 files, each of size 1GB, and each node having about 100GB storage. Such a system is implemented atop both the datasets, and both the full-recovery probability and the average file download percentage are measured at each time step. While we primarily discuss the results with respect to the Beijing dataset, similar discussions follow for the Chicago dataset. We note that coding offers significant benefits compared to uncoded replication. For example, at the end of 24 hours, files are reconstructed fully 98% of the time by using coding, whereas without coding, only 19% of the files are reconstructed fully (see Fig 3.6a). The corresponding values for average file download percentage are 99% and 61% respectively.
If we were to consider the instant when 80% of nodes are able to complete their downloads, this corresponds to about 600 minutes in the trace when coding is used, but only 4.4% of nodes are successful in full downloads by 600 minutes if coding is not used. An interesting observation is that since the Beijing trace begins in the middle of the night with relatively little traffic, one can see from Fig 3.6 that the rate at which files are completed starts to slow down around 60 minutes (1 a.m.) and then picks up again at 400 minutes (7 a.m.). No such trend can be seen in the Chicago dataset, because that dataset starts at around 11 a.m. Chicago local time. Another factor affecting the rate towards the end is the scarcity of new chunks (similar to the coupon collector problem). Further, we performed a number of experiments to thoroughly understand the effect of various parameters on the performance of the system, by systematically varying the parameters M, C and m. In our evaluations, we keep two parameters constant and vary the third. The results are shown for the Beijing dataset; those of the Chicago dataset are omitted for brevity since they display similar trends.

3.4.6.1 Effect of file size

As file size increases, since system storage remains constant, we are effectively decreasing the system redundancy, which should adversely impact latency. This is observed for both coded and uncoded storage, but there are clear differences in relative performance. We notice from Fig 3.7a and Fig 3.8a that when the file size is very small (100MB in the figures), coding offers no benefit at all. But as the file size is increased to 1GB, coding offers tremendous improvements by being able to download full files most of the time (98% of the time in Fig 3.7a), whereas this happens only about a fifth of the time (Fig 3.7a) without coding.
When the file size is increased further to 5GB, the performance of coding suffers, but not drastically, whereas in the absence of coding the probability of full recovery drops almost to zero (from Fig 3.8a, we note that many sinks have been able to download about a tenth of the file on average, but not a complete file).

3.4.6.2 Effect of the number of files and the capacity

Figs 3.7b and 3.8b show the impact of the number of files on the system performance. As the number of files increases, the system redundancy decreases, and hence the full-recovery probabilities and the file download percentages both start to decrease. And, as the capacity increases from 10GB to 100GB to 500GB, files can be replicated many more times, and hence the full-recovery probabilities and the file download percentages both start to get better (Fig 3.7c and Fig 3.8c). An interesting observation is that the curve corresponding to the case of 25,000 files with 100GB storage per car in Fig 3.7b and the curve corresponding to 2,500 files with 10GB storage per car in Fig 3.7c (or Fig 3.8b and Fig 3.8c) are identical (if we choose the same set of sink-file pairs). This is because having 25,000 files on nodes with 100GB has the same system redundancy as having 2,500 files on 10GB nodes. Also note that some of the probabilities or percentages for uncoded replication start non-zero, since some of the sinks already contain the files they are interested in, whereas when coding is used, no node can contain a full file by itself, and so all the probabilities and percentages are 0 to begin with.

3.4.7 Effect of File Distribution

We have thus far considered that files get placed in their entirety when storing without coding. In contrast, when coding is used, files get encoded into smaller chunks that get stored on different nodes. Therefore, an alternate way of storing uncoded files would be to split them into smaller chunks and place the chunks on different nodes.
In this section, we study the effect of file distribution: uniform vs. non-uniform. Consider that a file is split into multiple chunks. Uniform distribution refers to distributing these chunks across various nodes uniformly at random, whereas non-uniform distribution refers to placing all the chunks in the same node, which is what has been studied so far. Note that when coding is used, since it does not make sense to store all the coded chunks in the same node, uniform distribution is the only option for coded storage.

Figure 3.9: Effect of File Distribution. Panels: (a) average file download percentage; (b) full-recovery probability.

Fig 3.9 shows the effect of file distribution on the average file download percentage and the full-recovery probability. We used a subset of nodes (632 total) from the Beijing dataset for this experiment (the dataset is described in the work by Ahn et al. [9]). As noted before, uncoded non-uniform file distribution has been studied so far. In Fig 3.9a, we can see that by splitting the file and distributing it to various nodes (uniform uncoded distribution), we can improve the average file download percentage drastically. But coded storage still performs better. Uniform distribution performs better than non-uniform distribution since the sink node now has many nodes to collect the chunks from. Fig 3.9b paints a different picture. Here, it can be seen that uniform file distribution performs worse than non-uniform file distribution.
The main reason is that in the case of non-uniform distribution, some sinks already contain the file they require (since during the simulations, sinks randomly decide which file to download).

3.4.8 Absolute File Download Latency

A cautionary note is in order when interpreting our results in this section in terms of the absolute numbers, which suggest that downloading a large 1GB-sized file in a vehicular network is likely to take six to ten hours even with coding. We note that our traces, though they involve on the order of 1,000 nodes, are still relatively sparse in terms of encounters, as they cover large areas in Beijing and Chicago. Thus the latency values presented in our study in terms of absolute numbers may not be representative of what might be possible with much denser vehicular network deployments (say 100,000+ vehicles in a large city) during high-traffic hours. But the dramatic gaps observed between the performance of coded and uncoded storage in these simulations indicate strongly that the use of coding is essential for speeding up large file downloads in encounter-based vehicular networks, regardless of vehicular density.

3.5 Chapter Summary

We have studied the effect of coded storage on the latency of on-demand, pull-based content access in an intermittently connected vehicular network. We developed a mathematical model to study the relative benefits, and proved that optimized coded storage is never worse than uncoded storage and can significantly improve the latency performance in the case of large files and bandwidth limitations. We have further validated our findings using realistic simulations based on large-scale vehicular traces involving taxis in Beijing and buses in Chicago. Our numerical results confirm that file download latency (particularly for large files) is improved dramatically when the content is stored using erasure codes.
Chapter 4

Optimizing Helper Node Allocation Using Dissemination Utility as Metric

In the previous chapter, we described how erasure codes can be suitably used to optimize the latency in a vehicular cloud. While we focused there on a setting with a single demanding node and no helper nodes, here we consider a general case with one or more demanding nodes and zero or more helper nodes. As described in Chapter 1, the underlying network for the vehicular cloud is a vehicular network. However, we¹ work through this chapter by considering an intermittently connected mobile network (ICMN) as the underlying network. Since an ICMN has a broader scope, the following work applies to any type of ICMN, including vehicular networks.

¹ This work is being done in collaboration with Keyvan Rezaei Moghadam of USC and Dr. Fan Bai of GM R&D, under the guidance of Prof. Bhaskar Krishnamachari.

4.1 Introduction

With the increasing availability of Wifi-Direct equipped mobile devices and the planned introduction of wireless access for vehicular environments (WAVE) radios in vehicles in the near future, there is continuing interest in mobile applications involving encounter-based intermittently connected mobile networks (ICMN) that can off-load the dissemination of delay-tolerant content from the increasingly congested and expensive cellular data infrastructure. The fundamental question we explore in this work, through mathematical modeling under a general stochastic homogeneous encounter model, is: how should the agent most efficiently allocate the helper nodes? We consider this question under two different metrics: one that aims to maximize the number of satisfied demands by a deadline, and another that aims to minimize the time taken to satisfy all demands. And we explore this question under two settings. Initially, we consider a social welfare model where the agent is trying to maximize the total utility for all content providers.
While this is the typical "engineering" approach to the problem, in the real world the content provider and the resource-managing agent are often different entities with differing interests. Therefore, we also consider an economic model where the agent is assumed to be self-interested and trying to maximize its revenue from all content providers, which in turn have to balance their gain in terms of the content dissemination utility against the payment they must make to the agent. Our game-theoretic model allows us to examine the price of anarchy: the ratio between the total utility achieved when the system is operated based on maximizing revenue versus when the resources are allocated to maximize social welfare.

The high delays and reliability issues associated with single-copy routing in ICMN have motivated several researchers to develop multi-copy dissemination approaches in which the number of nodes assisting in relaying each content is either limited statically or carefully adapted [11, 16, 71, 94]. In terms of the resource that is limited, many of these works emphasize the bandwidth limitation of ICMN [11, 94, 96], but several others have emphasized the storage limitation [66, 87, 88, 93], as we also do in this work. Our work is closest in spirit to RAPID [16], whose authors make the important observation that efficient dissemination in ICMN should fundamentally be formulated as a utility-based resource allocation problem of how the limited storage resources on the nodes should be managed towards maximizing well-defined system objectives. They rightly distinguish their work from prior literature as being the first where the resource management has an intentional effect on desired performance metrics related to average delay or delivery rate, as opposed to prior schemes that have only an incidental effect on these metrics.
Our work closely follows this top-down philosophy, and we thus start by clearly identifying the relevant performance metrics and then derive the optimal storage allocation to maximize those metrics. However, our work is distinguished from [16] in that they propose heuristic mechanisms and evaluate them via simulations, while our objective here is to undertake a theoretical treatment of the problem in order to study rigorously the nature of optimal allocations under a well-defined, tractable, mathematical model. This study provides an important set of benchmarks, and provides analytical building blocks and tools for further investigations. We derive mathematical expressions for computing the expected time to satisfy all demands and the expected number of satisfied demands by a given deadline for a given helper node storage allocation, under a homogeneous stochastic encounter model with a general inter-encounter time distribution. Using this, we show how to compute the node allocation that maximizes the social welfare under both metrics.

Table 4.1: List of variables used.

  $N$ — number of nodes in the system.
  $m$ — number of files / number of content providers.
  $[k]$ — equivalent to $\{1, \ldots, k\}$ for integer $k$.
  $n_{d,i}$, $n_{h,i}$ — number of demands and number of helper nodes allocated for file $i \in [m]$.
  $c_{d,i}$, $c_{h,i}$ — number of completed demands and number of completed helper nodes for file $i \in [m]$, with $0 \le c_{d,i} \le n_{d,i}$ and $0 \le c_{h,i} \le n_{h,i}$.
  $\{X_j\}$ — random variables denoting the inter-encounter times.
  $U_i$ — utility for content provider $i \in [m]$.
  $A$ — the transition matrix of content dissemination for a fixed file, of size $n_S \times n_S$.
  $g_i^T(c_d, c_h)$ — the probability that $c_d$ of the demanding nodes and $c_h$ of the helper nodes (and not any more) have got the file at encounter $T$, for file $i \in [m]$.
We show some interesting trends: for instance, helper nodes have diminishing returns and are less effective at large deadlines, and an increase in demand is actually beneficial in reducing the expected delay in dissemination. We also formulate the problem of helper node allocation from a game-theoretic perspective and show that when the central agent tries to maximize its profit under a proportional allocation policy, the resulting system generally has a price of anarchy greater than 1. We also find that, somewhat counter-intuitively, a content provider with lower demand may need to pay more to the agent.

4.2 Problem Formulation

The model we consider here has a central agent that has control over allocating storage on a set of $N$ nodes, and $m$ content providers with one file each. Each content provider is interested in disseminating its file to some of the nodes that may be interested in the content. All the nodes are assumed to be homogeneous and are identical in their storage capabilities. We focus on a storage-constrained scenario, and for ease of exposition and tractability we assume that each node is constrained to carrying at most one file (this assumption can be relaxed, but the resulting model would have the total number of states growing exponentially with the storage size). The central agent can get content (files) from the content providers and can place them on one or more nodes, called the seed nodes of the file (assume this is done offline). Let the number of seed nodes for each file $i$ be $n_{s,i}$, with $n_{s,i} \ge 1$. We will assume for simplicity that the central agent places each file onto different seed nodes. This keeps with our assumption of homogeneity of the nodes (so all nodes are capable of storing a single file), and more importantly, it lets us decouple the analysis of the dissemination of different files. Even if we assume that seed nodes store multiple files, all the analysis should be extendable with suitable bookkeeping.
There is a set of demanding nodes for each file. Assume that the number of demands for file $i$ is $n_{d,i}$. The goal is to get file $i$ to the $n_{d,i}$ demanding nodes. Since the total number of nodes is $N$, we require $\sum_{i=1}^{m} (n_{s,i} + n_{d,i}) \le N$. Denote by $n_d = \sum_{i=1}^{m} n_{d,i}$ the total number of demand nodes, and by $n_s = \sum_{i=1}^{m} n_{s,i}$ the total number of seed nodes. The remaining nodes, $N - n_s - n_d$ in number, are called helper nodes, as they can potentially help in the dissemination of the files. Thus the number of helper nodes is $n_h = N - n_s - n_d$. Since each node can only carry a single file, each helper node can only help one file at a time. The central agent uses the control plane in advance of the dissemination process to inform each helper node which file it should be helping. When two nodes encounter each other, it is assumed that each node can download a full file, if the other node carries anything of interest. The encounter model will be made clear later. It is of interest to determine which set of helper nodes must assist in the dissemination of each file, and for the metric over which we want to optimize, we consider the following two formulations:

• Metric 1 (M1): Maximize the total expected number of demands satisfied for each file, by a deadline $T$.

• Metric 2 (M2): Minimize the maximum expected time to satisfy all the demands for each file.

The metric to be used depends on the application. For example, M1 might be a useful metric when the content expires after a certain duration.

4.2.1 Notation

Indices start with 1. All vectors are column vectors unless specified otherwise. $(A)_{(x,y)}$ refers to the element at row $x$ and column $y$ of matrix $A$, and $(v)_x$ refers to the element at location $x$ of the vector $v$. Note that all the vectors considered here are of appropriate sizes, i.e., when a matrix $B$ multiplies a vector $v$, the number of columns of $B$ equals the number of rows of $v$. $[m]$ denotes $\{1, 2, \ldots, m\}$ for an integer $m$.
A list of the symbols used in this chapter is presented in Table 4.1.

4.2.2 Contact and Dissemination Model

We assume that content dissemination occurs through a series of encounters between nodes. In particular, we assume that an encounter always involves exactly a pair of nodes and no more. Furthermore, we assume a homogeneous contact process whereby each pair of nodes is equally likely to be involved in an encounter. Since there are $N(N-1)/2$ possible pairs, in each encounter each particular pair is involved with probability $p \triangleq \frac{2}{N(N-1)}$. The inter-encounter times are assumed to be independent and identically distributed (i.i.d.). Let $\{X_j\}$ be the i.i.d. inter-encounter times, with mean $\mathbb{E}[X]$. Thus, the expected time for, say, $T$ encounters is $T\,\mathbb{E}[X]$. Let $F_X(t)$ be the cdf of $X$, i.e., $F_X(t) = \Pr[X \le t]$. We also define $f^T(t)$, the distribution of the time taken by $T$ encounters. These quantities depend on the underlying mobility. For example, if the inter-encounter times are i.i.d. exponential with rate $\lambda$, then $F_X(t) = 1 - \exp(-\lambda t)$ and $f^T(t)$ is Erlang with parameters $\lambda$ and $T$:

$$f^T(t) = \frac{\lambda^T t^{T-1} e^{-\lambda t}}{(T-1)!}.$$

We note that our contact model captures the widely used model in the DTN community. For example, [44, 58, 93, 109] all assume that node encounter times between pairs of nodes are independent of each other and follow a Poisson process with rate $\lambda$. This means that the inter-encounter durations are i.i.d. exponential with rate $\lambda$, and thus each pair is equally likely to encounter, whenever there is one. Our analysis does not rely on the exponential assumption, and can capture a general inter-encounter distribution. Since the state of the system changes only during an encounter, when presenting results we will count the number of encounters instead of the time. Finally, as mentioned before, it is assumed that the encounters are long enough to transmit a full file.
As our primary focus is on large files and constrained storage, one way to satisfy this assumption is to have large bandwidth. In settings where both bandwidth and storage are constrained, this assumption is equivalent to assuming that only long-duration encounters are explicitly considered in our model. These long-duration encounters would typically correspond to individuals that are standing or sitting near each other, or to vehicles either temporarily parked near each other or following each other in the same direction.²
² Such long encounters are not rare. Prior work has shown that encounter durations for human contacts follow a heavy-tailed distribution in which long encounters are reasonably common [104], and that the contact durations for certain groups of vehicles are bimodal, with long durations corresponding to vehicles parked side by side [55].
In principle, the model could be extended in an approximate way to consider even partial file transmissions during short-duration encounters, by appropriately scaling down the probability of transferring a file in each encounter between a node that has the file and an interested node; we defer a careful exploration of such an approximation to future work.
4.3 Understanding Dissemination of a single file
Let us first analyze the dissemination of a single file, with the assumption that all the nodes in the system are participating in the dissemination, i.e., they are either seeds, demands, or are willing to help (helper nodes). If $n_s, n_d, n_h$ denote the numbers of seeds, demands, and helpers, then we assume here that the total number of nodes is $N = n_s + n_d + n_h$. Since there is only a single file, we do not use notation like $n_{d,1}$ for the parameters of the first file. We will use the solution of this section as the building block in the next, where we consider multiple files. There, we will also handle the case of a single file with $n_s + n_d + n_h < N$.
This can be handled by assuming that the remaining nodes participate in the dissemination of a non-existent file.
4.3.1 Modeling the Markov Chain
The dissemination of the file can be modeled using a finite-state discrete-time absorbing Markov chain.
4.3.1.1 States of the Markov Chain
The states of the Markov chain indicate how many demanding nodes and helper nodes have got the file. A state is represented by a tuple $(c_d, c_h)$ when $0 \le c_d \le n_d$ demands and $0 \le c_h \le n_h$ helpers have got the file. The state $(0,0)$, for example, indicates that none of the demands or helper nodes have got the file. The number of states is $n_S = (n_h + 1)(n_d + 1)$. Note that since our goal is to be done with $n_d$ demands, all of the states $(n_d, j)$ could have been made absorbing states, or they could all have been combined into one absorbing state. We do not make this distinction when writing down the Markov chain, because the present form lends itself to an easier analysis and it can capture the distribution of the number of helper nodes completed when all the demands are done.
4.3.1.2 Transition Probabilities
When in state $(c_d, c_h)$, the total number of nodes that have the file is $n_s + c_d + c_h$. The numbers of demands and helper nodes that do not have the file are thus $n_d - c_d$ and $n_h - c_h$ respectively. From a state $(c_d, c_h)$, the Markov chain can either go to state $(c_d + 1, c_h)$ or $(c_d, c_h + 1)$, or stay in the same state. The corresponding transition probabilities are:
$$P_{(c_d,c_h) \to (c_d+1,c_h)} = p\,(n_s + c_d + c_h)(n_d - c_d) \quad (4.1)$$
$$P_{(c_d,c_h) \to (c_d,c_h+1)} = p\,(n_s + c_d + c_h)(n_h - c_h) \quad (4.2)$$
where $p = 2/N(N-1)$. The transition probability to the same state (self-loop) is
$$P_{(c_d,c_h) \to (c_d,c_h)} = 1 - P_{(c_d,c_h) \to (c_d+1,c_h)} - P_{(c_d,c_h) \to (c_d,c_h+1)}.$$
If the adjacent states $(c_d + 1, c_h)$ and/or $(c_d, c_h + 1)$ do not exist, the corresponding transition probabilities are set to zero. Note that $(n_d, n_h)$ is an absorbing state.
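The transition structure just described can be written down directly. A minimal sketch (the function name and state encoding are our own):

```python
def transition_probs(cd, ch, ns, nd, nh):
    """Out-transition probabilities from state (cd, ch) of the dissemination
    chain, with p = 2/(N(N-1)) and N = ns + nd + nh."""
    N = ns + nd + nh
    p = 2.0 / (N * (N - 1))
    have = ns + cd + ch                       # nodes currently holding the file
    out = {}
    if cd < nd:                               # a demanding node gets the file
        out[(cd + 1, ch)] = p * have * (nd - cd)
    if ch < nh:                               # a helper node gets the file
        out[(cd, ch + 1)] = p * have * (nh - ch)
    out[(cd, ch)] = 1.0 - sum(out.values())   # self-loop
    return out
```

Summing the two out-probabilities and the self-loop always gives 1, and the state $(n_d, n_h)$ has only the self-loop, confirming that it is absorbing.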
4.3.1.3 Transition matrix A
Next, we construct the transition matrix $A$, which is of size $n_S \times n_S$, where $n_S = (n_h + 1)(n_d + 1)$ is the number of states. The indexing of the rows and columns of the matrix starts from 1. Each state of the Markov chain corresponds to a row or a column of the matrix at a particular index. Let $In(c_d, c_h)$ denote the index of state $(c_d, c_h)$ in the transition matrix $A$. We use the mapping $In(c_d, c_h) = (n_h + 1)c_d + c_h + 1$. Conversely, given the index $i \in [n_S]$, the state is recovered as $c_d = \mathrm{quotient}(i-1, n_h+1)$ and $c_h = \mathrm{remainder}(i-1, n_h+1)$. The transition probability from a state $(c_d, c_h)$ to $(c_d', c_h')$ is stored at location $(In(c_d, c_h), In(c_d', c_h'))$ of the matrix. The index of the absorbing state $(n_d, n_h)$ is $n_S$, and so $A$ can be written as
$$A = \begin{bmatrix} A_0 & a \\ \mathbf{0} & 1 \end{bmatrix},$$
where $\mathbf{0}$ is the zero vector of appropriate size and $a$ is a column vector. Let $\mathbf{1}$ represent the all-ones vector of appropriate size. It can be proven by induction that for any $k$,
$$A^k = \begin{bmatrix} A_0^k & (I - A_0^k)\mathbf{1} \\ \mathbf{0} & 1 \end{bmatrix}. \quad (4.3)$$
4.3.1.4 Occupancy of the Markov Chain
Since the content dissemination starts at state $(0, 0)$, which corresponds to index $In(0, 0) = 1$, the initial probability distribution over the states is $e_1$. After $\tau$ encounters, the corresponding probability distribution is $e_1^T A^\tau$. Thus, the probability of being at state $(c_d, c_h)$ is the element at index $i = In(c_d, c_h)$ of $e_1^T A^\tau$, which can also be represented as $(e_1^T A^\tau)_i$ and evaluates to $e_1^T A^\tau e_i$. Define $g_\tau(c_d, c_h)$ to be the probability that $c_d$ demanding nodes and $c_h$ helper nodes have got the file at the end of $\tau$ encounters. Then,
$$g_\tau(c_d, c_h) = e_1^T A^\tau e_{In(c_d, c_h)}. \quad (4.4)$$
4.3.2 A Few Definitions
We will define three column vectors $u_1, u_2, u_3$ here. Let $u_1$ be a column vector of length $n_S$ defined as follows:
$$u_1 = \sum_{c_d=0}^{n_d} \sum_{c_h=0}^{n_h} c_d\, e_{In(c_d, c_h)},$$
where $e_i$ is a column vector of length $n_S$ with 1 at location $i$ and all other locations set to 0. The right-hand side is also equivalent to $\sum_{i=1}^{n_S} \mathrm{quotient}(i-1, n_h+1)\, e_i$, i.e.,
$u_1$ has its first $n_h + 1$ entries equal to 0, the second $n_h + 1$ entries equal to 1, and so on, up to the last $n_h + 1$ entries set to $n_d$. Let $u_2$ be a column vector of length $n_S$ defined as follows:
$$u_2 = \sum_{c_h=0}^{n_h} p\,(n_s + n_d - 1 + c_h)\, e_{In(n_d - 1, c_h)}. \quad (4.5)$$
Note that $(u_2)_{n_S} = 0$. Furthermore, define $u_3$ to be a column vector of length $n_S - 1$, containing the first $n_S - 1$ entries of $u_2$:
$$(u_3)_i = (u_2)_i \quad \forall i = 1, 2, \ldots, n_S - 1.$$
4.3.3 M1: Understanding the Expected Number of Satisfied Demands
Here, we will derive an expression for the expected number of demands satisfied (also called completed demands) at the end of $\tau$ encounters. Then, depending on the contact model, we will derive the same quantity given a deadline. We slightly abuse notation and write $C_D(\tau)$ and $C_D(T)$ for the number of completed demands after $\tau$ encounters and after time $T$, respectively.
Lemma 4.3.1. The expected number of completed demands after $\tau$ encounters is
$$E[C_D(\tau)] = e_1^T A^\tau u_1. \quad (4.6)$$
Proof. At the end of $\tau$ encounters, the probability distribution over the states is given by $g_\tau(\cdot, \cdot)$. When in state $(c_d, c_h)$, the number of demands satisfied is $c_d$, and therefore $E[C_D(\tau)] = \sum_{c_d, c_h} c_d\, g_\tau(c_d, c_h)$. From equation 4.4, and then using the definition of $u_1$, we get the desired expression. □
Theorem 4.3.2. Given a deadline $T$, the expected number of demands satisfied by the deadline is
$$E[C_D(T)] = \int_0^T \left[1 - F_X(T - t)\right] e_1^T \sum_{\tau=0}^{\infty} A^\tau f_\tau(t)\, u_1\, dt. \quad (4.7)$$
Proof. Conditioning on the number of encounters by time $T$, we have $E[C_D(T)] = \sum_{\tau=0}^{\infty} E[C_D(\tau)] \Pr[\tau \text{ encounters by time } T]$. Exactly $\tau$ encounters occur by time $T$ if the $\tau$-th encounter occurs at some time $t \le T$ and the next inter-encounter time exceeds $T - t$. By conditioning on the completion time of the $\tau$ encounters, whose density is $f_\tau(t)$, and noting that the $X_j$ are independent, we can write
$$\Pr[\tau \text{ encounters by time } T] = \int_0^T \Pr[X_{\tau+1} > T - t]\, f_\tau(t)\, dt = \int_0^T \left[1 - F_X(T - t)\right] f_\tau(t)\, dt.$$
Thus, $E[C_D(T)] = \int_0^T [1 - F_X(T - t)] \sum_{\tau=0}^{\infty} E[C_D(\tau)] f_\tau(t)\, dt$. Using equation 4.6 from Lemma 4.3.1 completes the proof. □
4.3.4 M2: Understanding the Expected Completion Time
Theorem 4.3.3. Given the transition matrix $A$ for the dissemination of a file, the expected delay (in units of time) to disseminate the file to all $n_d$ demands is
$$E[D] = E[X]\, e_1^T (I - A_0)^{-2} u_3, \quad (4.8)$$
where $A_0$ is the $(n_S - 1) \times (n_S - 1)$ submatrix of $A$, $e_1$ is the unit vector $[1, 0, \ldots, 0]^T$ of size $n_S - 1$, and $u_3$ is defined above.
Proof.
We will first count the expected number of encounters needed to satisfy all demands, and then multiply it by $E[X]$ to get $E[D]$. If the last encounter, the one that satisfies the final demand, is encounter $\tau$, then $n_d - 1$ demands must have been satisfied at the end of $\tau - 1$ encounters. Hence
$$E[D] = E[X] \sum_{\tau=1}^{\infty} \tau\, \Pr[C_D(\tau) = n_d,\ C_D(\tau - 1) = n_d - 1].$$
By conditioning on the number of helper nodes done at the end of $\tau - 1$ encounters,
$$\Pr[C_D(\tau) = n_d,\ C_D(\tau-1) = n_d - 1] = \sum_{c_h=0}^{n_h} \Pr[C_D(\tau) = n_d \mid C_D(\tau-1) = n_d - 1,\ C_H(\tau-1) = c_h]\; g_{\tau-1}(n_d - 1, c_h),$$
where $C_H(\tau)$ is the random variable indicating the number of helper nodes done at the end of $\tau$ encounters. The conditional probability here is just the transition probability $p(n_s + n_d - 1 + c_h)$. Using this, and then using equation 4.4, we get $E[D] = E[X] \sum_{\tau=1}^{\infty} \tau\, e_1^T A^{\tau-1} u_2$. We cannot write $\sum_{\tau=1}^{\infty} \tau A^{\tau-1} = (I - A)^{-2}$, since $I - A$ is singular. But since the last entry of $u_2$ is 0, post-multiplying $e_1^T A^{\tau-1}$ by $u_2$ is equivalent to $e_1^T A_0^{\tau-1} u_3$ (where $e_1$ is now of length $n_S - 1$). Applying the Taylor series $\sum_{\tau=1}^{\infty} \tau A_0^{\tau-1} = (I - A_0)^{-2}$ completes the proof. □
In Fig. 4.1 we show the expected number of demands satisfied as a function of the number of encounters so far (which can also be viewed as the deadline), obtained using equation 4.6. The expected number of encounters to satisfy all the demands is found using equation 4.8. There are $N = 50$ nodes, $n_s = 1$ seed, $n_d = 30$ demands, and the remaining nodes are helpers. We also used a custom-built simulator with the specified node mobilities. The average number of encounters to satisfy all demands as per the simulation was 206.47, closely matching the theoretical value of 207.31 obtained from equation 4.8.
Figure 4.1: The expected number of demands satisfied, in percentage, as a function of the number of encounters. The expected delay to satisfy all demands is 207.31 encounters.
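Both results are easy to sanity-check numerically on a small instance. The sketch below (our own illustration) builds the chain, propagates the occupancy distribution of Section 4.3.1.4, reads off $E[C_D(\tau)]$ as in Lemma 4.3.1, and, instead of the matrix inverse of equation 4.8, accumulates the expected number of encounters to absorption through the tail-sum identity $E[D]/E[X] = \sum_{\tau \ge 0} \Pr[D > \tau]$, truncated at a large horizon:

```python
def expected_demands_and_delay(ns, nd, nh, max_steps=2000):
    """E[C_D(tau)] for tau = 0..max_steps-1, plus the expected number of
    encounters to satisfy all demands via the truncated tail sum."""
    N = ns + nd + nh
    p = 2.0 / (N * (N - 1))
    S = (nd + 1) * (nh + 1)
    idx = lambda cd, ch: (nh + 1) * cd + ch          # 0-based In(., .)
    A = [[0.0] * S for _ in range(S)]
    for cd in range(nd + 1):
        for ch in range(nh + 1):
            i = idx(cd, ch)
            if cd < nd:
                A[i][idx(cd + 1, ch)] = p * (ns + cd + ch) * (nd - cd)
            if ch < nh:
                A[i][idx(cd, ch + 1)] = p * (ns + cd + ch) * (nh - ch)
            A[i][i] = 1.0 - sum(A[i])
    g = [1.0] + [0.0] * (S - 1)                      # start in state (0, 0)
    exp_demands, e_delay = [], 0.0
    for _ in range(max_steps):
        exp_demands.append(sum(g[idx(cd, ch)] * cd
                               for cd in range(nd + 1)
                               for ch in range(nh + 1)))
        e_delay += 1.0 - sum(g[idx(nd, ch)] for ch in range(nh + 1))
        g = [sum(g[i] * A[i][j] for i in range(S)) for j in range(S)]
    return exp_demands, e_delay

demands_curve, mean_encounters = expected_demands_and_delay(1, 3, 2)
```

Run with the Fig. 4.1 parameters ($n_s = 1$, $n_d = 30$, $n_h = 19$), the same routine can be compared against the 207.31-encounter value, although the dense pure-Python propagation is slow at that size.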
4.4 Understanding Dissemination of multiple files
Not only do we generalize to multiple-file dissemination here, but we also generalize to the case where the numbers of seeds, demands, and helper nodes do not have to add up to the total number of nodes in the system. We consider all the nodes associated with each file $i$ as a group, such that the number of nodes in group $i$ is $N_i = n_{s,i} + n_{h,i} + n_{d,i}$, with $\sum_{i=1}^{m} N_i = N$. The general idea is to apply the previous results for M1 or M2 to each of the groups, as if only the dissemination within that group were going on, and then combine the results carefully to obtain the results for M1 or M2 when considering all the files together.
Figure 4.2: We consider $N = 50$ nodes that could have multiple files, and show various statistics for file 1. (a) $E[C_{D,1}(T)]$ vs. $n_{h,1}$ for $n_{d,1} = 10$; (b) $E[C_{D,1}(T)]$ vs. $n_{h,1}$ for $n_{d,1} = 20$; (c) $E[C_{D,1}(T)]$ vs. $n_{d,1}$ for $T = 200$ encounters; (d) $E[C_{D,1}(T)]$ vs. $n_{d,1}$ for $T = 400$ encounters; (e) $E[D_1]$ vs. $n_{h,1}$; (f) $E[D_1]$ vs. $n_{d,1}$.
85 4.4.1 Ml: Compute Expected Number of Satisfied Demands Given a deadline T (in units of time), we want to compute the expected number of satisfied demands E[Cv(T)]. This can be expressed in terms of E[Cv(T)], the expected number of demands satisfied after T encounters in the system. If E[Cv(T)] is known, by the same approach as outlined in Theorem 4.3.2 we can get T oo E[Cv(T)] = l [1- Fx(T- t)] ~E[Cv(T)]fT(t)dt (4.9) Let us next compute E[Cv(T)]. When there was a single file before, after T encounters in the system, the corresponding Markov chain would have taken T steps. But when there are m files, we will have m Markov chains that are not independent of each other. Given an encounter in the system, the probability that the encounter is useful for file i is Pi= 7f~~i-N (when nodes within this group encounter). This is the probability that the Markov chain corresponding to file i will make a move when there is an encounter in the system. If we know that the number of encounters pertaining to group-i is Ti, the results from the previous section can help us determine the expected number of demands satisfied E[Cv,ih)J (use Lemma 4.3.1 with appropriate parameters, e.g. the total number of nodes will be Ni etc.). Thus, conditioned on the number of encounters that could have helped in the dis- semination of file i, the expected number of demands satisfied for file i at the end of T encounters in the system will be :>::~Fo (~)p~i(1- Pi?-TiE[Cv,ih)J. 86 Since Cv( T) is just the sum of the number of demands satisfied for each file, by linear- is computed, and then using equation 4.9, we can compute the expected number of de- mands satisfied by deadline T, E[Cv(T)]. 
After $\tau$ encounters, the expected number of demands satisfied for file $i$ is $E[C_{D,i}(\tau)]$, and so, similar to equation 4.9, given a deadline $T$ the corresponding expectation is
$$E[C_{D,i}(T)] = \int_0^T \left[1 - F_X(T - t)\right] \sum_{\tau=0}^{\infty} \sum_{\tau_i=0}^{\tau} \binom{\tau}{\tau_i} p_i^{\tau_i} (1 - p_i)^{\tau - \tau_i} E[C_{D,i}(\tau_i)]\, f_\tau(t)\, dt. \quad (4.10)$$
4.4.2 M2: Compute Completion Time
As in M1, we divide the nodes into groups for each file, and we can use Theorem 4.3.3 to compute the expected time $E[D_i]$ and the expected number of encounters $E[D_i]/E[X]$ within group $i$ to satisfy all its demands. Since each encounter in the system is an encounter within group $i$ with probability $p_i = \frac{N_i(N_i - 1)}{N(N - 1)}$, the expected number of encounters in the system required to disseminate to all demands in group $i$ will be $E[D_i]/(E[X]\, p_i)$, and the expected time will be $E[D_i]/p_i$. We are interested in the maximum expected delay, $\max_{i \in [m]} E[D_i]/(E[X]\, p_i)$.
Figure 4.3: Finding the globally optimal (social welfare maximizing) helper node allocation. (a) M1: $n_{d,1} = 4$, $n_{d,2} = 8$; $U$ is maximum when $n_{h,1} = 12$ and so $n_{h,2} = 24$ ($T = 250$). (b) M1: Optimal number of helpers for file 1 for various deadlines. (c) M2: $n_{d,1} = 4$, $n_{d,2} = 8$; $U$ is maximum when $n_{h,1} = 18$ and so $n_{h,2} = 18$.
4.4.3 Understanding M1 and M2
In Fig. 4.2, we show some characterizations of the dissemination under M1 and M2 for a single file.
As previously mentioned, we only count the number of encounters rather than the actual time, which would depend on the inter-encounter distribution. We show results for the first file. For M1, when the deadline $\tau = T$, the number of encounters in the system, is given, $\sum_{\tau_1=0}^{\tau} \binom{\tau}{\tau_1} p_1^{\tau_1} (1 - p_1)^{\tau - \tau_1} E[C_{D,1}(\tau_1)]$ captures the expected number of demands satisfied, whose percentage is plotted (obtained by dividing by $n_{d,1}$ and scaling by 100). For M2, $E[D_i]/(E[X]\, p_i)$ is plotted. In all cases, we consider one seed. Fig. 4.2a through Fig. 4.2d show plots when M1 is used, whereas the rest show plots when M2 is used. In Fig. 4.2a, we plot the percentage of demands that get satisfied by various deadlines as a function of the number of helper nodes allocated, when the total number of demands for the file is 10. As can be seen, the effect of the helper nodes is highest for medium deadlines. When the deadline is small, even if a large number of helpers are allocated, the number of demands satisfied does not improve by as much as it does for medium deadlines. Further, when the deadline is large enough, most of the demands are completed by virtue of the long deadline, and the helpers offer only diminishing returns. Fig. 4.2b shows a similar plot for 20 demands instead of 10. We can see that the curves have shifted upwards, because having more demands actually helps the dissemination, as we will see next. In Fig. 4.2c and Fig. 4.2d we fix the deadlines to $T = 200$ and $T = 400$ encounters respectively, and study the percentage of demands completed in expectation as a function of the number of demands. These figures reveal that as the number of demands increases, the percentage completed also increases, even though this means that more demands have to be satisfied by the deadline. This happens because as the demands increase, they also contribute to the dissemination: if a few demands get the file before the deadline, they can still help other demands. We next turn to studying M2 in Fig. 4.2e and Fig.
4.2f, where we plot the expected delay (in number of encounters) as a function of either the number of helpers or the number of demands. In both plots, it can be seen that increasing either the helpers or the demands can only help. From Fig. 4.2e it can be seen that when the demands are few, even a slight increase in the number of helpers allocated brings down the expected delay significantly. Furthermore, it can be seen that the helper nodes have diminishing returns. Fig. 4.2f offers an interesting view of the role of demands. As before, the overall trend is that as the demands increase, the expected delay continues to decrease, except at a few places. When the number of helper nodes is very low (or zero), any increase in the demands lends itself favorably, whereas when the number of helper nodes is higher, a slight increase in the demands registers itself as an extra burden before starting to help.
4.5 Social Allocation
Having studied the dissemination of multiple files, we now turn to finding an optimal allocation of the helper nodes to help in the dissemination of the files. In order to do so, depending on M1 or M2, we will define utilities for the dissemination of each of the files and then define the system utility. An allocation of the helper nodes, written $(n_{h,1}, n_{h,2}, \ldots, n_{h,m})$, indicates that $n_{h,i}$ helper nodes are allocated to help disseminate file $i$.
As before, we consider N = 50 nodes and two files with seeds ns,l = ns,2 = 1 and demands nd,l = 4, nd,2 = 8. Thus the number of remaining nodes are 36, which can be considered as the pool of helper nodes (nh = 36). Since we are interested in allocating all the helper nodes to help either of the files, nh,l + nh,2 = 36. In the plots, we vary nh,l and nh,2 can be inferred. From Fig 4.3a and Fig. 4.3c, the utilities of the individual files as well as the system utility are plotted as a function of the number of helper nodes allocated for file 1. Clearly, as nh,l increases, the utility of file 1 increases and that of the second file decreases because 91 nh,2 decreases. It can be seen that the utilities are maximized when (nh,l, nh,2) = (12, 24) and (18, 18) respectively. Fig 4.3b shows an interesting effect of the deadline on the optimal allocation. We plot the optimal number of nodes allocated for the file 1 against various deadlines (in number of encounters). We note that when the deadline is very small, the helper nodes may not have much influence on file 1 anyway, so all of them are being allocated to file 2. But as the deadline increases, more and more are allocated until the optimal allocation stops at (18, 18) for the nodes. 4.6 Market Allocation To reiterate, there are m content providers (CP), numbered i = 1, 2, ... , m. Content provider i has a single file i, and is interested in disseminating the file to nd,i demanding nodes. Given a pool of helper nodes, so far we have studied the problem of allocating these helper nodes for efficient content dissemination from a social setting. Next, we would like to study the market-based scenario where the central agent requires the content providers to pay for helper nodes to help in the dissemination of their files 3 Based on the bids placed by each content provider (CP), the central agent (CA) will allocate the sets of helper nodes to each of the CPs so as to maximize its own revenue. 
Note that we could have a formulation where the CPs each also pay for the seed nodes in addition to the helper nodes (albeit at a higher cost); we refrain from doing so for ease of exposition. It is assumed that the central agent gets the files from the CPs and places the $m$ files in $m$ distinct nodes, which form the seed nodes for the $m$ CPs.
³ We are assuming that in this two-tier network the storage in the nodes can be centrally controlled and managed by a single economic entity, the agent. There may be an additional layer of economic interaction whereby this agent pays each individual node for each use of its storage. This payment can be absorbed into our model as a fixed cost for the central agent, as it would still be interested in maximizing its revenue.
Figure 4.4: Three player example (with one agent and two content providers). Here $c_0 = 1$, $c_2 = 50$, $n_{d,1} = 4$, $n_{d,2} = 8$. (a) Number of helpers allocated to each content provider; (b) Utilities obtained by each content provider under M1 (Deadline = 250); (c) Utilities obtained by each content provider under M2.
This allocation can be done via the control tier of the network. Thus $n_{s,i} = 1$ for all $i \in [m]$, irrespective of the bid placed by CP $i$.
4.6.1 Game Formulation
We next describe the elements that constitute the game, namely the players, the allocation policy, and the payoffs.
4.6.1.1 Players
In our game formulation, we have $m + 1$ players in total: the agent and the $m$ CPs with one file each. We denote by player 0 the agent, and by player $i$ the CP with file $i$, for $i \in [m]$.
4.6.1.2 Actions
The action of each player is a price. In the case of the agent, this price $c_0 > 0$ is the minimum unit price for the helper nodes that it requires the CPs to pay; the agent announces this value to all the other players. The action of each other player is then the price it bids, $c_i \ge 0$ for $i \in [m]$. Note that the agent moves first: it fixes $c_0$ and informs the CPs. We thus have a Stackelberg game. We assume that all the players have complete knowledge of the system: the agent knows the demands for all the contents, and each CP knows not only the demand of its own content but also those of the other contents. Furthermore, each CP knows $N$ and the unit price $c_0$. Since the players will not bid arbitrarily high values, we can restrict $c_i \le c_{\max}$. Therefore, a row vector $c = [c_0, c_1, \ldots, c_m] \in [0, c_{\max}]^{m+1}$ forms the strategy space. Since $[0, c_{\max}]^{m+1}$ is a hypercube, it is a compact and convex subset of $\mathbb{R}^{m+1}$. All the players determine their actions to maximize their payoffs, which are explained below.
4.6.1.3 Allocation Policy
The minimum price for each helper node is $c_0$. So if CP $i$ pays $c_i$ with $c_i < c_0$, it gets no helper nodes. If the price $c_i \ge c_0$, the player could possibly get $\lfloor c_i / c_0 \rfloor$ helper nodes, subject to the availability of the helper nodes and the prices bid by the other players.
When the bid values are sufficiently high that $\sum_{i=1}^{m} \lfloor c_i / c_0 \rfloor$ exceeds the number of helper nodes available, the agent allocates the helper nodes proportionally to the bids $c_i$; thus, each CP could possibly get $\lfloor n_h c_i / (\sum_{j=1}^{m} c_j) \rfloor$. Combining both cases, the number of helper nodes allocated to CP $i$, as a function of the payment $c_i$ made by the CP and the payments $c_{-i}$ of the other players, can be expressed as follows:
$$n_{h,i}(c_i, c_{-i}) = \begin{cases} \min\left\{\left\lfloor \frac{c_i}{c_0} \right\rfloor,\ \left\lfloor \frac{n_h c_i}{\sum_{j=1}^{m} c_j} \right\rfloor\right\} & \text{if } c_i \ge c_0, \\ 0 & \text{otherwise.} \end{cases} \quad (4.11)$$
4.6.1.4 Payoffs
The agent's utility is the sum of the prices accrued from each of the players minus the maintenance cost. Since the maintenance cost is constant, it does not affect the equilibrium calculation, and we omit it for simplicity. Thus, the payoff of the agent is $P_0 = U_0 = \sum_{i=1}^{m} c_i$. As a Stackelberg leader, the agent will set $c_0$ to maximize its payoff. By bidding $c_i$, CP $i$ gets $n_{h,i}$ helper nodes allocated to help in the dissemination of file $i$. Then, depending on M1 or M2, the CP reaps a utility $U_i(n_{h,i})$, which may be computed since we know the number of seeds $n_{s,i} = 1$, helpers $n_{h,i}$, and demands $n_{d,i}$. Thus the net payoff of CP $i \in [m]$ is
$$P_i = w\, U_i(n_{h,i}) - c_i, \quad (4.12)$$
and it is in the interest of the CP to maximize this. Here $w$ is a weighting parameter that dictates how much the CP values the outcome compared to the cost, and it depends on the metric. It could vary from player to player, but we do not consider this distinction.
4.6.2 Existence of Nash Equilibria
We numerically observe the existence of multiple Nash equilibria for the cases we consider (more details in the next subsection). We investigate the existence of Nash equilibria in more detail in Appendix 4.8, where we rely on the quasiconcavity property of the payoffs to identify cases where Nash equilibria are guaranteed to exist [81].
4.6.3 Three Player Game Example
We next turn to numerically understanding a three-player game consisting of an agent and two CPs. Let there be $N = 50$ nodes in total, of which $n_{d,1} = 4$ want content 1 and $n_{d,2} = 8$ want content 2.
Irrespective of the bids placed by the content providers, the central agent guarantees one seed each: $n_{s,1} = n_{s,2} = 1$. There remain $n_h = 36$ helper nodes. We need to pick $w$ for M1 and M2 to get the expressions for the payoffs. Since the utility values under M2 are about two orders of magnitude larger than those under M1, we pick $w = 100$ for M1 while keeping $w = 1$ for M2. Given all this information, we are interested in determining what the central agent should fix $c_0$ to, and, once this is fixed, what prices $(c_1, c_2)$ the content providers will bid. Let us first understand the best-response dynamics, fixing $c_0 = 1$ and $c_2 = 50$. The number of helper nodes allocated according to equation 4.11, and the utilities of both content providers, are shown for various values of $c_1$ in Fig. 4.4. As can be seen, the number of helper nodes allocated increases as content provider 1 pays more and more, and it eventually gets almost all the helper nodes (the curve saturates at 35 rather than 36 because of the floor function in equation 4.11). Correspondingly, its payoff first increases, since it is getting more helper nodes, but after a certain threshold the payoff starts to fall due to the increasing cost. The best response of content provider 1 is simply the price $c_1$ at which its payoff is maximized. Under both M1 and M2, the best response of content provider 1 is to set $c_1 = 70$. But since both players know that they will play best responses to each other, they will play according to a Nash equilibrium, if one exists. In fact, we numerically see that there are multiple pure Nash equilibria when $c_0 = 1$. We show this in Fig. 4.5. In each figure, a circle represents a bid $(c_1, c_2)$ that could be a possible NE, and the circles are color-coded according to $c_0$ (refer to the color bar legend for approximate values of $c_0$). There are in fact many NE, but we do not show them all in the illustrations.
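The best-response computation just described can be sketched as follows. The allocation rule is that of equation 4.11; the utility argument is a placeholder for the M1/M2 utilities (the linear utility used in the checks below is purely illustrative):

```python
from math import floor

def helpers_allocated(i, c, c0, nh):
    """Allocation rule of equation (4.11): CP i bids c[i] against the minimum
    unit price c0 and the other bids in c."""
    if c[i] < c0:
        return 0
    return min(floor(c[i] / c0), floor(nh * c[i] / sum(c)))

def best_response(i, c, c0, nh, utility, w=1.0, grid=range(301)):
    """CP i's best response: the grid bid maximizing w*U_i(n_{h,i}) - c_i."""
    def payoff(ci):
        bids = list(c)
        bids[i] = ci          # replace CP i's bid, keep the others fixed
        return w * utility(helpers_allocated(i, bids, c0, nh)) - ci
    return max(grid, key=payoff)
```

With $w = 0$ the payoff is pure cost and the best response is to bid 0; as $w$ grows, the best response climbs until the proportional cap in equation 4.11 makes further bidding unprofitable.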
The general trend to note is that as $c_0$ increases, the equilibrium points shift slowly to the upper right (i.e., the bids increase), but beyond a certain point one of the bidders realizes that it is too costly and starts to bid zero (while the other bids a nonzero amount); these equilibria are shown along the x-axis or y-axis. When $c_0$ is too high, both bidders bid 0. Another thing to note is that there are several equilibria when $c_0$ is low, but for higher values there appears to be only one NE.
Figure 4.5: The bids for different values of $c_0$. (a) M1 (Deadline = 250); (b) M1 (Deadline = 400); (c) M2.
As the central agent increases $c_0$, it earns bigger payoffs up to a certain point, depending on the scheme, after which the payoff starts to decrease and eventually reaches zero; see Fig. 4.6. To plot this curve, after fixing the model, for each $c_0$ we determine the possible Nash equilibria $(c_1, c_2)$ numerically and use the one that gives the lowest $c_1 + c_2$.
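The numerical search for pure Nash equilibria behind Fig. 4.5 can be sketched as a brute-force mutual-best-response check over a bid grid (shown with a toy payoff that has a dominant strategy per player, rather than the full M1/M2 payoffs):

```python
def pure_nash_equilibria(payoff, grid):
    """All bid pairs (c1, c2) on the grid where each bid is a best response
    to the other, i.e. the pure-strategy Nash equilibria of the CP subgame."""
    def best_responses(i, other):
        scores = [payoff(i, (b, other) if i == 0 else (other, b)) for b in grid]
        top = max(scores)
        return {b for b, s in zip(grid, scores) if s == top}
    return [(c1, c2) for c1 in grid for c2 in grid
            if c1 in best_responses(0, c2) and c2 in best_responses(1, c1)]

# toy payoff with a dominant strategy for each CP (3 for CP 1, 5 for CP 2)
toy_ne = pure_nash_equilibria(lambda i, c: -(c[i] - (3 if i == 0 else 5)) ** 2,
                              range(11))
```

With the actual payoffs of equation 4.12 substituted in, the same exhaustive check recovers the (possibly multiple) equilibria plotted in Fig. 4.5; the grid resolution bounds how finely the bid space is explored.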
Since the optimal allocation that maximizes the system utility is in fact (12, 14), we should have that the system utility be 2.10 + 7.10 = 9.20, where 2.10 and 7.10 are the expected number of demands satisfied for each at the end of 250 encounters with the help of 12 and 24 helper nodes. Thus, the price of anarchy is 9.20/9.17 = 1.0033. For M1 with deadline 400, the corresponding co = 13. The bids here are (130, 156) and the number of helper nodes allocated are (10, 12). Note that all the helper nodes did not get allocated since the price minimum price is very high. Here, the utilities for the content providers are 3.1093 and 7.3683 respectively, and the overall system utility is 10.4776. The optimal allocation of helper nodes for this deadline is (18, 18), which guarantees a system utility of 3. 7895 + 7.8259 = 11.6154. Thus, the price of anarchy is 1.1086, higher than before. Given the sufficiently long deadline, the improvement in utility brought by helper nodes for the second content provider is quite low as compared to the price it has to pay. Therefore, it will bid lower than it did when the deadline was shorter (when it knew that the helpers would indeed help). In M2, the agent fixes co to be 24 at which the bids will be (240, 168). 10 and 7 helper nodes will be allocated for each content provider. In fact the content provider 2 bids much lower partly because of the much higher minimum price c 0 and partly because it knows that the demands can help themselves. The expected delay for content provider 1 is 422.3303 (and so its utility is -422.3303), and that of content provider 2 is 457.9848 (utility is -457.9848). The utility of the system is then -457.9848. If the allocation of the 100 450~--------~--------~--------~----------. 400 .... c Q) 350 0> <( 300 ~ c 250 Q) u 200 '+- 0 ~ 150 0 ~ 100 a_ 50 - M2 - M1 (Deadline=250) -<I- M1 (Deadline=400) 50 150 200 Figure 4.6: Payoff of the Central Agent for various co for M1 with deadlines 250 and 400 and M2. 
If the allocation of helper nodes were instead chosen to maximize social welfare, the content providers would have received 18 helpers each, in which case the maximum delay would have been 297.2508 (the two delays being 297.2508 and 294.4246). Thus the price of anarchy in this case is 1.5407, the highest of the three.

4.7 Chapter Summary

In this chapter, we have formulated and analyzed the fundamental problem of resource allocation in the form of helper nodes in a vehicular cloud and, more generally, in an intermittently connected mobile network. We have assumed a general stochastic homogeneous encounter model between the nodes in the network. We believe this analysis advances our theoretical understanding of the impact of various parameters and designs for resource allocation in ICMN.

4.8 Appendix: Existence of Nash Equilibrium

Figure 4.7: A set of cases for model M1 (deadline T = 250 encounters) when quasiconcavity holds for both P_1 and P_2. (a) c_2 fixed at 50, c_0 = 1; (b) c_1 fixed at 50, c_0 = 1; (c) c_2 fixed at 50, c_0 = 10; (d) c_1 fixed at 50, c_0 = 10.

While previously we were able to show the existence of multiple Nash equilibria for parameters that were of interest to us, here we investigate the existence of Nash equilibria further in a more general setting. Specifically, we determine regions of w which guarantee the existence of Nash equilibria for a few c_0 values. As before, we consider the case of
As before we consider the case of 103 three players - one agent and two content providers for ease of understanding. Therefore m = 2 here. We use the result from Theorem 2.2 from the work by Reny et al. [81] here to discuss sufficient conditions for the existence of Nash equilibrium. We first note that the set of pure strategies is com pact and convex. Consider player i E [m] who bids Ci. We fix the strategies of other players to c_i and would like to determine the conditions under which the payoff of player i, Pi is quasiconcave with respect to ci (from the theorem, quasiconcavity guarantees the existence of Nash Equilibria). Pi (a function of ci) is quasiconcave if either (i) it is non-decreasing, (ii) it is non- increasing, Or (iii) there exists a cr SUCh that pi is non-decreasing for Ci < cr and DOll increasing for Ci > ci. While this can be checked by taking a derivative of Pi with respect to Ci, since the expression for Pi is not suitable for differentiation, we resort to numerically study the quasiconcavity property. Note that this can be verified rather easily numerically. In Fig 4. 7, we show a few plots of P1 and P2 for co = 1 and co = 10 for various w, where quasiconcavity is satisfied for model M1 (when the deadline is T = 250). Fig 4.8 shows a similar set of plots for model M2. For the model M1, for the choices of co = 1 and co = 10, we see that P1, P2 are quasiconcave (in this case non increasing) when w S 4. For higher values of w, quasicon cavity does not hold (see Fig 4.9. Similarly, for the model M2, for the choices of c 0 = 1 and c 0 = 10, quasiconcavity is observed for w S 0.008. Since quasiconcavity is only a sufficient condition, higher values of w do not necessarily preclude the existence of Nash equilibria. 104 Or---~----~--~----~---, -20 0.... -40 0 .__ E. -60 - - 0 ~ -80 0.... -100 --w = 0.001 ---w = 0.008 -1200~--~----~--~----~--~ 20 40 60 80 1 00 0 ._,_ -20 0.... -40 0 .__ E. -60 - 0 ~ -80 0.... 
Figure 4.8: A set of cases for model M2 when quasiconcavity holds for both P_1 and P_2. (a) c_2 fixed at 50, c_0 = 1; (b) c_1 fixed at 50, c_0 = 1; (c) c_2 fixed at 50, c_0 = 10; (d) c_1 fixed at 50, c_0 = 10.

Fig. 4.9 shows a case where the quasiconcavity property does not hold for either P_1 or P_2. Note that even though the curve in Fig. 4.9a looks like it is decreasing, there are cases where the payoff decreases and then increases. Analytical investigation of the quasiconcavity of the payoffs is left as future work.

Figure 4.9: Demonstrating a case where quasiconcavity does not hold. Here c_0 = 1 and w = 10. (a) c_2 fixed at 50; (b) c_1 fixed at 50.

Part II

Data Center Cloud

Chapter 5

Optimizing Repair Traffic Using Locally Repairable Codes

Distributed storage systems for large clusters typically use replication to provide reliability. Recently, erasure codes have been used to reduce the large storage overhead of three-replicated systems. Reed-Solomon codes are the standard design choice, and their high repair cost is often considered an unavoidable price to pay for high storage efficiency and high reliability. In this chapter, we^1 overcome this limitation by using a new class of erasure codes called Locally Repairable Codes (LRCs). We implement LRC in Hadoop HDFS and compare to a currently deployed HDFS module that uses Reed-Solomon codes.
Our modified HDFS implementation shows a reduction of approximately 2x in repair disk I/O and repair network traffic. The disadvantage of the new coding scheme is that it requires 14% more storage compared to Reed-Solomon codes, an overhead shown to be information-theoretically optimal for obtaining locality. Because the new codes repair failures faster, they also provide higher reliability, orders of magnitude higher compared to replication.

^1 Some of the content in this chapter has also been presented in [85].

5.1 Motivation

MapReduce architectures are becoming increasingly popular for big data management due to their high scalability properties. At Facebook, large analytics clusters store petabytes of information and handle multiple analytics jobs using Hadoop MapReduce. Standard implementations rely on a distributed file system that provides reliability by exploiting triple block replication. The major disadvantage of replication is the very large storage overhead of 200%, which reflects on the cluster costs. This overhead is becoming a major bottleneck as the amount of managed data grows faster than data center infrastructure.

For this reason, Facebook and many others are transitioning to erasure coding techniques (typically, classical Reed-Solomon codes) to introduce redundancy while saving storage [23, 56], especially for data that is more archival in nature. In this chapter we show that classical codes are highly suboptimal for distributed MapReduce architectures. We introduce new erasure codes that address the main challenges of distributed data reliability, and information-theoretic bounds that show the optimality of our construction. We rely on measurements from a large Facebook production cluster (more than 3000 nodes, 30 PB of logical data storage) that uses Hadoop MapReduce for data analytics. Facebook recently started deploying an open source HDFS module called HDFS RAID [3, 32] that relies on Reed-Solomon (RS) codes.
In HDFS RAID, the replication factor of "cold" (i.e., rarely accessed) files is lowered to 1 and a new parity file is created, consisting of parity blocks.

5.1.1 Importance of Repair

At Facebook, large analytics clusters store petabytes of information and handle multiple MapReduce analytics jobs. In a 3000-node production cluster storing approximately 230 million blocks (each of size 256 MB), only 8% of the data is currently RS-encoded ('RAIDed'). Fig. 1.2 shows a recent trace of node failures in this production cluster. It is quite typical to have 20 or more node failures per day that trigger repair jobs, even when most repairs are delayed to avoid transient failures. A typical data node stores approximately 15 TB, and the repair traffic with the current configuration is estimated at around 10-20% of the total average of 2 PB/day cluster network traffic. As discussed, (10,4) RS-encoded blocks require approximately 10x more network repair overhead per bit compared to replicated blocks. We estimate that if 50% of the cluster were RS-encoded, the repair network traffic would completely saturate the cluster network links. Our goal is to design more efficient coding schemes that allow a large fraction of the data to be coded without facing this repair bottleneck. This would save petabytes of storage overhead and significantly reduce cluster costs.

There are four additional reasons why efficiently repairable codes are becoming increasingly important in coded storage systems. The first is the issue of degraded reads. Transient errors with no permanent data loss correspond to 90% of data center failure events [34, 56]. During the period of a transient failure event, block reads of a coded stripe will be degraded if the corresponding data blocks are unavailable. In this case, the missing data block can be reconstructed by a repair process, which is aimed not at fault tolerance but at higher data availability. The only difference from standard repair is that
The only difference with standard repair is that 110 the reconstructed block does not have to be written in disk. For this reason, efficient and fast repair can significantly improve data availability. The second is the problem of efficient node decommissioning. Hadoop offers the de commission feature to retire a faulty data node. Functional data has to be copied out of the node before decommission, a process that is complicated and time consuming. Fast repairs allow to treat node decommissioning as a scheduled repair and start a MapReduce job to recreate the blocks without creating very large network traffic. The third reason is that repair influences the performance of other concurrent MapRe duce jobs. Several researchers have observed that the main bottleneck in MapReduce is the network [24]. As mentioned, repair network traffic is currently consuming a non negligible fraction of the cluster network bandwidth. This issue is becoming more sig nificant as the storage used is increasing disproportionately fast compared to network bandwidth in data centers. This increasing storage density trend emphasizes the impor tance of local repairs when coding is used. Finally, local repair would be a key in facilitating geographically distributed file systems across data centers. Ceo-diversity has been identified as one of the key future directions for improving latency and reliability [41]. Traditionally, sites used to distribute data across data centers via replication. This, however, dramatically increases the total storage cost. Reed-Solomon codes across geographic locations at this scale would be completely impractical due to the high bandwidth requirements across wide area networks. Our work makes local repairs possible at a marginally higher storage overhead cost. Replication is obviously the winner in optimizing the four issues discussed, but requires a very large storage overhead. 
On the opposing tradeoff point, MDS codes have minimal 111 storage overhead for a given reliability requirement, but suffer in repair and hence in all these implied issues. One way to view the contribution of this chapter is a new intermediate point on this tradeoff, that sacrifices some storage efficiency to gain in these other metrics. The remainder of this chapter is organized as follows: We initially present an overview of Locally Repairable Codes in Section 5.2. Then, Section 5.3 presents the HDFS-Xorbas architecture and Section 5.4 discusses our experimental evaluation on Amazon EC2 and Facebook's cluster. We finally conclude with a summary in Section 5.5. 5.2 Locally Repairable Codes Maximum distance separable (MDS) codes are often used in various applications in communications and storage systems [107]. A (n, k)-MDS code ofrate R =~takes a file of size M, splits it in k equally sized blocks, and then encodes it in n coded blocks each f . M 0 SIZe T· A (n, k)-MDS code has the property that any k out of then coded blocks can be used to reconstruct the entire file. It is easy to prove that this is the best fault tolerance possible for this level of redundancy: any set of k blocks has an aggregate size of M and therefore no smaller set of blocks could possibly recover the file. Fault tolerance is captured by the metric of minimum distance. Definition 3 (Code Minimum Distance:). The minimum distanced of a code of length n, is equal to the minimum number of erasures of coded blocks after which the file cannot be retrieved. 112 MDS codes, as their name suggests, have the largest possible distance which is dMDS = n- k + 1. For example the minimum distance of (10,4) RS is n- k + 1 = 5 which means that five or more block erasures are needed to create data loss. The second metric we will be interested in is Block Locality: Definition 4 (Block Locality). 
An (n, k) code has block locality r when each coded block is a function of at most r other coded blocks of the code.

Codes with block locality r have the property that, upon any single block erasure, fast repair of the lost coded block can be performed by computing a function on r existing blocks of the code. This concept was recently and independently introduced in [39, 73, 74]. When we require small locality, each single coded block should be repairable by using only a small subset of existing coded blocks, r << k, even as n and k grow. The following fact shows that locality and good distance are in conflict:

Lemma 5.2.1. MDS codes with parameters (n, k) cannot have locality smaller than k.

Lemma 5.2.1 implies that MDS codes have the worst possible locality, since any k blocks suffice to reconstruct the entire file, not just a single block. This is exactly the cost of optimal fault tolerance.

We identify the family of LRC codes by the parameters (n, k, r), where n and k are as before and r is the locality. We now describe the explicit (16, 10, 5) LRC code we implemented in HDFS-Xorbas. For each stripe, we start with 10 data blocks X_1, X_2, ..., X_10 and use a (14, 10) Reed-Solomon code over a binary extension field F_{2^m} to construct 4 parity blocks P_1, P_2, ..., P_4. This is the code currently used in production clusters at Facebook, and it can tolerate any 4 block failures thanks to the RS parities. The basic idea of LRCs is very simple: we make repair efficient by adding additional local parities. This is shown in Figure 5.1.

Figure 5.1: Locally repairable code implemented in HDFS-Xorbas, with two local groups of 5 file blocks each and one group of 4 RS parity blocks. The four parity blocks P_1, P_2, P_3, P_4 are constructed with a standard RS code, and the local parities provide efficient repair in the case of single block failures. The main theoretical challenge is to choose the coefficients c_i to maximize the fault tolerance of the code.

With this construction, any single lost block can be repaired by accessing only 5 other blocks.
For example, if block X_3 is lost (or is the subject of a degraded read while unavailable), it can be reconstructed as

X_3 = c_3^{-1} (S_1 - c_1 X_1 - c_2 X_2 - c_4 X_4 - c_5 X_5),   (5.1)

where S_1 is the local parity of the first group. The multiplicative inverse of the field element c_3 exists as long as c_3 is non-zero, which is the requirement we will enforce for all the local parity coefficients. It turns out that the coefficients c_i can be selected to guarantee that all the linear equations will be linearly independent. A randomized and a deterministic algorithm to construct such coefficients are presented by Papailiopoulos et al. [75]. We note that the complexity of the deterministic algorithm is exponential in the code parameters (n, k) and therefore useful only for small code constructions.

The disadvantage of adding these local parities is the extra storage requirement. While the original RS code stores 14 blocks for every 10, the three local parities increase the storage overhead to 17/10. There is one additional optimization that we can perform: it can be shown that the coefficients c_1, c_2, ..., c_10 can be chosen so that the local parities satisfy an additional alignment equation S_1 + S_2 + S_3 = 0 (proof in [75]). We can therefore avoid storing the local parity S_3 and instead consider it an implied parity. Note that to obtain this in the figure, we set the corresponding coefficients c' to 1.

When a single block failure happens in an RS parity, the implied parity can be reconstructed and used to repair that failure. For example, if P_2 is lost, it can be recovered by reading the 5 blocks P_1, P_3, P_4, S_1, S_2 and solving the equation

c_2' P_2 = S_1 + S_2 - c_1' P_1 - c_3' P_3 - c_4' P_4.   (5.2)

Papailiopoulos et al. [75] show how to find non-zero coefficients c_i (which must depend on the parities P_i but are not data dependent) for the alignment condition to hold. It is also shown that for the Reed-Solomon code implemented in HDFS RAID, choosing c_i = 1 and therefore performing simple XOR operations is sufficient.
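Since setting all coefficients c_i = 1 (plain XOR) suffices for this code, the local-repair mechanics can be sketched in a few lines. This is a toy sketch, not the HDFS-Xorbas Java implementation: the block size, the `xor_blocks` helper, and the use of random bytes as stand-ins for the RS parities P_1..P_4 are all assumptions here, and the implied-parity alignment S_1 + S_2 + S_3 = 0 (which requires the real RS structure) is not modeled, so S_3 is computed and kept explicitly.

```python
# Sketch of the local-XOR repair idea behind the (16,10,5) LRC, with all local
# parity coefficients equal to 1. The RS parities are random stand-in blocks,
# so only the local-group repair mechanics (not the full RS code) are shown.

import os

BLOCK = 64  # toy block size in bytes

def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks."""
    out = bytearray(BLOCK)
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

X = [os.urandom(BLOCK) for _ in range(10)]  # data blocks X1..X10
P = [os.urandom(BLOCK) for _ in range(4)]   # stand-ins for RS parities P1..P4

S1 = xor_blocks(X[0:5])    # local parity over X1..X5
S2 = xor_blocks(X[5:10])   # local parity over X6..X10
S3 = xor_blocks(P)         # local parity over P1..P4 (implied in the real code)

def repair(lost, group):
    """Rebuild a lost block by XORing the rest of its local group (incl. parity)."""
    return xor_blocks([b for b in group if b is not lost])

# losing X3 requires reading only its 5 group mates, not 10 blocks
assert repair(X[2], X[0:5] + [S1]) == X[2]
# a lost RS parity is likewise rebuilt inside its local group
assert repair(P[1], P + [S3]) == P[1]
```

The two assertions mirror Eqs. (5.1) and (5.2) in the c_i = 1 case, where subtraction and addition both reduce to XOR over F_{2^m}.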
It is further proved that this code has the largest possible distance (d = 5) for the given locality r = 5 and blocklength n = 16.

5.3 Xorbas: System Description

HDFS-RAID is an open source module that implements RS encoding and decoding over Apache Hadoop [3]. It provides a Distributed Raid File System (DRFS) that runs above HDFS. Files stored in DRFS are divided into stripes, i.e., groups of several blocks. For each stripe, a number of parity blocks are calculated and stored as a separate parity file corresponding to the original file. HDFS-RAID is implemented in Java (approximately 12,000 lines of code) and is currently used in production by several organizations, including Facebook.

The module consists of several components, among which the RaidNode and BlockFixer are the most relevant here:

• The RaidNode is a daemon responsible for the creation and maintenance of parity files for all data files stored in the DRFS. One node in the cluster is generally designated to run the RaidNode. The daemon periodically scans the HDFS file system and decides whether a file is to be RAIDed or not, based on its size and age. In large clusters, RAIDing is done in a distributed manner by assigning MapReduce jobs to nodes across the cluster. After encoding, the RaidNode lowers the replication level of RAIDed files to one.

• The BlockFixer is a separate process that runs at the RaidNode and periodically checks for lost or corrupted blocks among the RAIDed files. When blocks are tagged as lost or corrupted, the BlockFixer rebuilds them using the surviving blocks of the stripe, again by dispatching repair MapReduce (MR) jobs. Note that these are not typical MR jobs. Implemented under the MR framework, repair-jobs exploit
In Facebook's HDFS RAID, an RS (10, 4) erasure code is implemented through ErasureCode (4 parity blocks are created for every 10 data blocks). 5.3.1 HDFS-Xorbas Our system, HDFS-Xorbas (or simply Xorbas), is a modification of HDFS-RAID that incorporates Locally Repairable Codes (LRC). To distinguish it from the HDFS RAID implementing RS codes, we refer to the latter as HDFS-RS. In Xorbas, the ErasureCode class has been extended to implement LRC on top of traditional RS codes. The RaidNode and BlockFixer classes were also subject to modifications in order to take advantage of the new coding scheme. HDFS-Xorbas is designed for deployment in a large-scale Hadoop data warehouse, such as Facebook's clusters. For that reason, our system provides backwards com pati bility: Xorbas understands both LRC and RS codes and can incrementally modify RS encoded files into LRCs by adding only local XOR parities. To provide this integration with HDFS-RS, the specific LRCs we use are designed as extension codes of the (10,4) Reed-Solomon codes used at Facebook. First, a file is coded using RS code and then a small number of additional local parity blocks are created to provide local repairs. 117 5.3.1.1 Encoding Once the RaidNode detects a file which is suitable for RAIDing (according to param eters set in a configuration file) it launches the encoder for the file. The encoder initially divides the file into stripes of 10 blocks and calculates 4 RS parity blocks. Depending on the size of the file, the last stripe may contain fewer than 10 blocks. Incomplete stripes are considered as "zero-padded" full-stripes as far as the parity calculation is concerned HDFS-Xorbas computes two extra parities for a total of 16 blocks per stripe (10 data blocks, 4 RS parities and 2 Local XOR parities), as shown in Fig. 5.1. Similar to the calculation of the RS parities, Xorbas calculates all parity blocks in a distributed manner through MapReduce encoder jobs. 
All blocks are spread across the cluster according to Hadoop's configured block placement policy. The default policy places blocks at random DataNodes, avoiding collocating blocks of the same stripe.

5.3.1.2 Decoding & Repair

The RaidNode starts a decoding process when corrupt files are detected. Xorbas uses two decoders: the light-decoder, aimed at single block failures per stripe, and the heavy decoder, employed when the light-decoder fails.

When the BlockFixer detects a missing (or corrupted) block, it determines the 5 blocks required for the reconstruction according to the structure of the LRC. A special MapReduce job is dispatched to attempt light-decoding: a single map task opens parallel streams to the nodes containing the required blocks, downloads them, and performs a simple XOR. In the presence of multiple failures, the 5 required blocks may not be available. In that case the light-decoder fails and the heavy decoder is initiated. The heavy decoder operates in the same way as in Reed-Solomon: streams to all the blocks of the stripe are opened, and decoding is equivalent to solving a system of linear equations. The RS linear system has a Vandermonde structure [107], which allows for small CPU utilization. The recovered block is finally sent and stored to a DataNode according to the cluster's block placement policy.

In the currently deployed HDFS-RS implementation, even when a single block is corrupt, the BlockFixer opens streams to all 13 other blocks of the stripe (which could be reduced to 10 with a more efficient implementation). The benefit of Xorbas should therefore be clear: for all single block failures, and also for many double block failures (as long as the two missing blocks belong to different local XORs), the network and disk I/O overheads will be significantly smaller.
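The light-decoder's block selection step can be sketched as a lookup from a lost block's stripe position to the 5 surviving blocks that must be read. The index convention below (0-9 for X_1..X_10, 10-13 for P_1..P_4, 14-15 for the stored local parities S_1, S_2) is an assumption of this sketch, not the actual HDFS-Xorbas stripe layout.

```python
# Sketch of light-decoder block selection for the (16,10,5) stripe described
# in the text. Indices are an assumed convention: 0-9 data blocks X1..X10,
# 10-13 RS parities P1..P4, 14-15 stored local parities S1, S2 (S3 implied).

def light_repair_set(lost):
    """Return the sorted stripe indices a single-failure repair must read."""
    if 0 <= lost <= 4:                    # first local group: X1..X5 plus S1
        return sorted(({0, 1, 2, 3, 4} - {lost}) | {14})
    if 5 <= lost <= 9:                    # second local group: X6..X10 plus S2
        return sorted(({5, 6, 7, 8, 9} - {lost}) | {15})
    if 10 <= lost <= 13:                  # RS parity: surviving P's plus S1, S2
        return sorted(({10, 11, 12, 13} - {lost}) | {14, 15})
    if lost == 14:                        # S1 is just the XOR of X1..X5
        return [0, 1, 2, 3, 4]
    if lost == 15:                        # S2 is the XOR of X6..X10
        return [5, 6, 7, 8, 9]
    raise ValueError("index outside stripe")

# a lost P2 (index 11) is rebuilt from P1, P3, P4, S1 and S2, as in Eq. (5.2)
assert light_repair_set(11) == [10, 12, 13, 14, 15]
```

Every case returns exactly 5 indices, versus the 13 streams the deployed HDFS-RS BlockFixer opens for the same single failure.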
5.4 Evaluation

In this section, we provide details on a series of experiments we performed to evaluate the performance of HDFS-Xorbas in two environments: Amazon's Elastic Compute Cloud (EC2) [1] and a test cluster at Facebook.

5.4.1 Evaluation Metrics

We rely primarily on the following metrics to evaluate HDFS-Xorbas against HDFS-RS: HDFS Bytes Read, Network Traffic, and Repair Duration. HDFS Bytes Read corresponds to the total amount of data read by the jobs initiated for repair. It is obtained by aggregating partial measurements collected from the statistics reports of the jobs spawned following a failure event.

Figure 5.2: The metrics measured during the 200-file experiment: (a) HDFS Bytes Read per failure event; (b) Network Out Traffic per failure event; (c) Repair duration per failure event. Network-in is similar to Network-out and so is not displayed here. During the course of the experiment, we simulated eight failure events; the x-axis gives the number of DataNodes terminated during each failure event, with the number of blocks lost displayed in parentheses.

Network Traffic represents the total amount of data communicated from nodes in the cluster (measured in GB). Since the cluster does not handle any external traffic, Network Traffic is equal to the amount of data moving into nodes. It is measured using Amazon's AWS CloudWatch monitoring tools. Repair Duration is simply calculated as the time interval between the starting time of the first repair job and the ending time of the last repair job.

5.4.2 Amazon EC2

On EC2, we created two Hadoop clusters, one running HDFS-RS and the other HDFS-Xorbas. Each cluster consisted of 51 instances of type m1.small, which corresponds to a 32-bit machine with 1.7 GB memory, 1 compute unit and 160 GB of storage, running Ubuntu/Linux 2.6.32.
One instance in each cluster served as a master, hosting Hadoop's NameNode, JobTracker and RaidNode daemons, while the remaining 50 instances served as slaves for HDFS and MapReduce, each hosting a DataNode and a TaskTracker daemon, thereby forming a Hadoop cluster of total capacity roughly equal to 7.4 TB. Unfortunately, no information is provided by EC2 on the topology of the cluster.

The clusters were initially loaded with the same amount of logical data. Then a common pattern of failure events was triggered manually in both clusters to study the dynamics of data recovery. The objective was to measure key properties such as the number of HDFS Bytes Read and the real Network Traffic generated by the repairs.

We used a block size of 64 MB, and all our files were of size 640 MB. Therefore, each file yields a single stripe with 14 and 16 full-size blocks in HDFS-RS and HDFS-Xorbas respectively. This choice is representative of the majority of stripes in a production Hadoop cluster: extremely large files are split into many stripes, so in total only a small fraction of the stripes will have a smaller size. In addition, it allows us to better predict the total amount of data that needs to be read in order to reconstruct missing blocks, and hence to interpret our experimental results. Finally, since block repair depends only on blocks of the same stripe, using larger files that would yield more than one stripe would not affect our results. An experiment involving arbitrary file sizes is discussed in Section 5.4.3.

During the course of a single experiment, once all files were RAIDed, a total of eight failure events were triggered in each cluster. A failure event consists of the termination of one or more DataNodes.
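The stripe-size figures above are easy to sanity-check. The sketch below simply reproduces the arithmetic: 640 MB files with 64 MB blocks give 10 data blocks per stripe, stored as 14 blocks under RS(10,4) and 16 under the LRC, so a random failure touches about 14.3% more Xorbas blocks.

```python
# Back-of-envelope check of the stripe sizes and the 14.3% extra-block-loss
# figure used in the comparison between HDFS-RS and HDFS-Xorbas.

data_blocks = 640 // 64            # 640 MB file, 64 MB blocks -> 10 data blocks
rs_stripe   = data_blocks + 4      # plus 4 RS parity blocks
lrc_stripe  = rs_stripe + 2        # plus 2 stored local XOR parities

assert (rs_stripe, lrc_stripe) == (14, 16)

extra_loss = lrc_stripe / rs_stripe - 1   # 16/14 - 1, approx 0.143
assert round(extra_loss * 100, 1) == 14.3
```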
In our failure pattern, the first four failure events consisted of single-DataNode terminations, the next two were terminations of triplets of DataNodes, and the final two were terminations of pairs of DataNodes. Upon a failure event, MapReduce repair jobs are spawned by the RaidNode to restore missing blocks. Sufficient time was provided for both clusters to complete the repair process, allowing measurements corresponding to distinct events to be isolated. For example, events are distinct in Fig. 5.2.

Note that the DataNodes selected for termination stored roughly the same number of blocks in both clusters. The objective was to compare the two systems on the repair cost per block lost. However, since Xorbas has an additional storage overhead, a random failure event would, in expectation, lead to the loss of 14.3% more blocks in Xorbas compared to RS. In any case, the results can be adjusted to take this into account without significantly affecting the gains observed in our experiments.

In total, three experiments were performed on the above setup, successively increasing the number of files stored (50, 100, and 200 files), in order to understand the impact of the amount of data stored on system performance. Fig. 5.2 depicts the measurements from the last case; the other two produce similar results. The measurements of all the experiments are combined in Fig. 5.4, plotting HDFS Bytes Read, Network Traffic and Repair Duration versus the number of blocks lost, for all three experiments carried out on EC2. We also plot the linear least-squares fitting curve for these measurements.

5.4.2.1 HDFS Bytes Read

Fig. 5.2a depicts the total number of HDFS bytes read by the BlockFixer jobs initiated during each failure event. The bar plots show that HDFS-Xorbas reads 41%-52% of the amount of data that RS reads to reconstruct the same number of lost blocks.
These measurements are consistent with the theoretically expected values, given that more than one block per stripe is occasionally lost (note that 5/12.14 = 41%). Fig. 5.4a shows that the number of HDFS bytes read is linearly dependent on the number of blocks lost, as expected. The slopes give us the average number of HDFS bytes read per block for HDFS-RS and Xorbas: the average numbers of blocks read per lost block are estimated to be 11.5 and 5.8 respectively, showing the 2x benefit of HDFS-Xorbas.

5.4.2.2 Network Traffic

Fig. 5.2b depicts the network traffic produced by BlockFixer jobs during the entire repair procedure. In particular, it shows the outgoing network traffic produced in the cluster, aggregated across instances. Incoming network traffic is similar since the cluster
Similar behavior is observed in the Facebook production cluster at large-scale repairs. This is because hundreds of machines can share a single top-level switch which becomes saturated. Therefore, since LRC transfers significantly less data, we expect network saturation to further delay RS repairs in larger scale and hence give higher recovery time gains of LRC over RS. From the CPU Utilization plots we conclude that HDFS RS and Xorbas have very similar CPU requirements and this does not seem to influence the repair times. 124 (a) Cluster network traffic. 25,---,----,---,----,---,---.----,---,~ (020 S2. ""0 ffi 15 0:: f.!) Q) >.,10 11) .::.:. f.!) - HDFS-RS - HDFS-Xorba$ o 5r ii i i · · IH+ (b) Cluster Disk Bytes Read. (c) Cluster average CPU utilization. Figure 5.3: Measurements in time from the two EC2 clusters during the sequence of failing events. 125 ""0 m 100 0::: (/) 80 Q) +-' >- [() 60 (f) ~ - LL .. 40 i- ,-~.~ 0 ..4: I " 20 ..,..,-_:: .,.,.~f, . .al"i: ,..- 00 50 100 150 200 250 Number of blocks lost 350 ~300 [() Q () 250 ~ ro ~ 200 +-' :::::l 0 150 ~ .._ ~ 100 Q) z 50 00 100 150 200 250 Number of blocks lost (j) 250~··························· i ................................. L .................................. L ..... c••'~ i ....... ............................ ~ :::::l .~45~··························· •······························ •·····················•o· ' ······················ •···································~ ~ ~40~··························· ; .................................. ; .. ~ ,.< .......... • .............................. ; .................................... ~ c 235~··························· · ···························~'' ···················································· • c- ·'~·······························~ ~ :::J30~···· ·· ····· ; ;c· .. ........ , ~ .. ............. •,, . - . ....... . .. .... ; ·· ········· ·· ········ ·····~ 0 .!:: 25~ ... . ...... , ro Q_ Q) 20~ ... -; .. 
[Figure 5.4: Measurement points of failure events versus the total number of blocks lost in the corresponding events. Measurements are from all three experiments. Each panel plots a measurement against the number of blocks lost; panel (c) shows repair duration versus blocks lost.]

                                  Total Bytes Read    Avg Job Ex. Time
All blocks available              30 GB               83 min
~20% of blocks missing, Xorbas    43.88 GB            92 min
~20% of blocks missing, RS        74.06 GB            106 min

Table 5.1: Repair impact on workload.

5.4.2.4 Repair under Workload

To demonstrate the impact of repair performance on the cluster's workload, we simulate block losses in a cluster executing other tasks. We created two clusters, with 15 slave nodes each. The submitted artificial workload consists of word-count jobs running on five identical 3 GB text files. Each job comprises several tasks, enough to occupy all computational slots, while Hadoop's FairScheduler allocates tasks to TaskTrackers so that computational time is fairly shared among jobs. Fig. 5.5 depicts the execution time of each job under two scenarios: i) all blocks are available upon request, and ii) almost 20% of the required blocks are missing. Unavailable blocks must be reconstructed to be accessed, incurring a delay in job completion which is much smaller in the case of HDFS-Xorbas. In the conducted experiments, the additional delay due to missing blocks is more than doubled (from 9 minutes for LRC to 23 minutes for RS). We note that the benefits depend critically on how the Hadoop FairScheduler is configured. If concurrent jobs are blocked but the scheduler still allocates slots to them, delays can significantly increase. Further, jobs that need to read blocks may fail if repair times exceed a threshold. In these experiments we set the scheduling configuration options in the way most favorable to RS. Finally, as previously discussed, we expect that LRCs will be even faster than RS in larger-scale experiments due to network saturation.
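The delay figures quoted above come straight from Table 5.1; a quick check of the arithmetic, using the job execution times from the table:

```python
# Job execution times from Table 5.1 (minutes).
baseline = 83            # all blocks available
xorbas   = 92            # ~20% of blocks missing, HDFS-Xorbas (LRC)
rs       = 106           # ~20% of blocks missing, HDFS-RS

delay_lrc = xorbas - baseline
delay_rs  = rs - baseline
assert delay_lrc == 9 and delay_rs == 23
# "More than doubled": the RS delay is over 2.5x the LRC delay.
assert delay_rs / delay_lrc > 2
print(f"extra delay: LRC {delay_lrc} min, RS {delay_rs} min")
```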
[Figure 5.5: Completion times of 10 WordCount jobs: encountering no missing blocks, and ~20% of blocks missing, on the two clusters. Dotted lines depict average job completion times. Curves: all blocks available; 20% missing, Xorbas; 20% missing, RS.]

5.4.3 Facebook's cluster

In addition to the series of controlled experiments performed over EC2, we performed one more experiment on Facebook's test cluster. This test cluster consisted of 35 nodes configured with a total capacity of 370 TB. Instead of placing files of pre-determined sizes as we did in EC2, we utilized the existing set of files in the cluster: 3,262 files, totaling approximately 2.7 TB of logical data. The block size used was 256 MB (same as in Facebook's production clusters). Roughly 94% of the files consisted of 3 blocks and the remaining of 10 blocks, leading to an average of 3.4 blocks per file.

         Blocks Lost   HDFS GB Read (Total)   HDFS GB Read (/block)   Repair Duration
RS       369           486.6                  1.318                   26 min
Xorbas   563           330.8                  0.58                    19 min

Table 5.2: Experiment on Facebook's cluster: results.

For our experiment, HDFS-RS was deployed on the cluster and, upon completion of data RAIDing, a random DataNode was terminated. HDFS bytes read and repair duration measurements were collected. Unfortunately, we did not have access to network traffic measurements. The experiment was repeated, deploying HDFS-Xorbas on the same set-up. Results are shown in Table 5.2. Note that in this experiment, HDFS-Xorbas stored 27% more than HDFS-RS (ideally, the overhead should be 13%), due to the small size of the majority of the files stored in the cluster. As noted before, files typically stored in HDFS are large (and small files are typically archived into large HAR files).
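The per-block column in Table 5.2 is simply the total GB read normalized by the number of blocks lost; a quick check of the table's arithmetic, using the values from the table:

```python
# Values from Table 5.2: verify GB read per lost block = total GB / blocks lost.
rows = {
    "RS":     {"blocks_lost": 369, "gb_total": 486.6, "gb_per_block": 1.318},
    "Xorbas": {"blocks_lost": 563, "gb_total": 330.8, "gb_per_block": 0.58},
}

for name, r in rows.items():
    per_block = r["gb_total"] / r["blocks_lost"]
    # The tabulated values are rounded; allow a small tolerance.
    assert abs(per_block - r["gb_per_block"]) < 0.01
    print(f"{name}: {per_block:.3f} GB read per lost block")
```

Normalizing this way is what makes the comparison meaningful here, since Xorbas lost more blocks (563 vs. 369) in its run.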
Further, it should be emphasized that the particular dataset used for this experiment is by no means representative of the dataset stored in Facebook's production clusters. In this experiment, the number of blocks lost in the second run exceeds that of the first run by more than the storage overhead introduced by HDFS-Xorbas. However, we still observe benefits in the amount of data read and repair duration, and the gains are even clearer when normalizing by the number of blocks lost.

5.5 Chapter Summary

We have designed and implemented HDFS-Xorbas, a module that replaces Reed-Solomon codes with Locally Repairable Codes in HDFS-RAID. Locally Repairable Codes (LRCs) are efficiently repairable both in terms of network bandwidth and disk I/O. We evaluated HDFS-Xorbas using experiments on Amazon EC2 and a cluster in Facebook. While the disadvantage of the new code is that it requires 14% more storage compared to RS, our experiments showed that Xorbas enables approximately a 2x reduction in disk I/O and repair network traffic compared to the Reed-Solomon code currently used in production.

Chapter 6
Optimizing Placement by Maximizing Mean Time To Data Loss (MTTDL)

In the previous chapter, we saw how erasure coding can be useful to reduce the storage overhead in data centers. In this chapter, we will investigate an important issue that is often overlooked: the placement of data across a data center. When placing replicas in a data center, the general trend is to place them on separate racks. While this can help reduce the probability of permanent data loss, it creates a lot of cross-rack traffic when repairing failed nodes. In this work, we identify this opposing tradeoff between fault-tolerance and repair traffic, and capture both through the widely used reliability metric Mean Time to Data Loss (MTTDL).
We then formalize the placement of replicas and erasure coded blocks by a family of placement schemes, and use the metric MTTDL to determine the best placement scheme for a given set of data center parameters. We have also implemented a Monte Carlo data center simulator to simulate the failure and repair cycles that are common in data centers and to measure the MTTDL. Using results from the simulator, we present usable insights that can help guide data center operators in determining placement schemes that can improve the MTTDL. (This work is being carried out in collaboration with Simon Woo, Prof. Alex Dimakis and Prof. Minlan Yu.)

6.1 Motivation

In the Google File System (GFS) [34, 36], a rack-aware placement policy is used to make sure that no two chunks of the same stripe get stored in the same rack [34]. Similarly, replicas are also placed on different racks. In Azure, fault-domains are identified (generally a rack), and only one block per stripe is placed in any one fault-domain [48]. Note that a fault-domain is a group of nodes that could fail together. For example, if a rack switch fails, then the nodes connected to that rack switch could get disconnected from the rest of the network, equivalent to a group of nodes failing. Power supply could also dictate fault-domains: if a power supply fails, resulting in a group of nodes going down, then this group of nodes could constitute a fault-domain. In general, there can be many types of fault-domains, and multiple of them in a data center. For the rest of the work, we consider racks to be the only fault-domains in a data center. The network in a data center is generally organized as a tree topology. This causes the bandwidth between any two nodes in different racks to be much smaller than the bandwidth between any two nodes in the same rack. Let us consider what happens when stripes are stored as above, by placing one block per stripe in a different rack. Consider a single stripe.
If one block is lost due to a node failure, to recover it, a new node will have to download the remaining blocks that are on different racks. This will take a longer duration than if the remaining blocks were all in the same rack. Therefore, we see that the repair time can be reduced by placing all the blocks in the same rack. But if all the blocks were to be placed in the same rack, then clearly, if the rack fails, all the blocks of the stripe will fail, causing permanent data loss. Therefore we see that fault tolerance and repair time are two opposing factors. Spreading the blocks across multiple racks increases the fault tolerance but also increases the repair time, while putting them all together decreases the fault tolerance and decreases the repair time. Therefore, it is not clear which scheme should be used. Now, both the repair time and fault-tolerance can be captured using the metric mean time to data loss, or MTTDL (see Section 2.1.4). So we use it as the metric for optimizing over placement schemes. In Section 2.1.4, we described the analytical approach to determine the MTTDL. However, there are several reasons why this method of deriving the MTTDL does not hold for our case:

• The standard Markov models rely on the node failure rates as well as the repair rates being exponentially distributed. It is particularly problematic to assume the repair rates to be exponential: this implies that the repair is memoryless, so that when there is a second failure, all progress is reset to zero and the repairs are considered to start afresh. This is not the case in practice.

• In the Markov model of reliability, it is considered that the Markov chain for each stripe moves from one state to another independently of the Markov chains of other stripes. But when a node fails, blocks from multiple stripes are lost together, and so the state transitions on all the corresponding Markov chains are not independent.
The widely used formula of computing the MTTDL of all the data, by computing the MTTDL of a single stripe and then dividing it by the number of stripes, relies on the independence property and therefore does not hold.

• It is also assumed that there is a fixed bandwidth available for repair, and that each node can read the required data for repair at this speed. But in reality, the speed is determined by how many blocks are to be read locally within the rack and how many from outside the rack, the inter- and intra-rack bandwidths, and so on.

• One of the differentiating factors of data centers compared to other systems is parallel repair. In RAID systems, when a disk fails, the lost blocks are recovered one by one by the RAID controller or the computer. In a data center, by contrast, when a node fails, several nodes can participate in the rebuild. The main reason parallel repair can be achieved is the random distribution of the blocks.

For this reason, we design and implement a Monte Carlo data center simulator to estimate the MTTDL of various storage schemes. Further details about the simulator are presented in Section 6.2.5.

6.2 Models and Assumptions

In this section, we describe the models used for the various components. For the data center, we assume a tree network topology. We model the placement by counting the number of non-zero blocks in each rack for any stripe. For the failures, we assume an exponential model. More details are presented next. Each switch is connected to N/R nodes and a core switch, so it contains N/R + 1 ports. The core switch is connected to R other rack-level switches, and therefore it has R ports.

6.2.2 Storage Model

We use (n, k) coding, where each stripe has n blocks. We consider that there are m blocks in the system, thus giving a total of m/n stripes.
In general, there can be exponentially many ways of distributing these blocks, many of which will clearly not be preferred (for example, putting all blocks of a stripe in the same rack when rack failures exist). In general, it is not possible to search through all the possible ways of storing data to determine the optimal storage. Even if such a scheme is determined, it is not possible to maintain such a storage layout in the face of failures. We rule out the case when multiple blocks from the same stripe are stored in the same node. A placement scheme for a stripe defines the number of blocks stored per rack. Since there can be anywhere from thousands to millions of stripes, it is intractable to consider a separate placement policy for each stripe. Therefore, one of the first simplifications we make is to use the same placement scheme for all the stripes. Given n blocks of a stripe, we represent a placement scheme by a set C = {c1, c2, ...} such that ci > 0 and the sum of all ci is n. For example, when (8, 5) coding is used, the following are some of the possible placement schemes: {3, 3, 2}, {7, 1}, {1, 1, 1, 1, 1, 1, 1, 1}. Consider a placement scheme C = {c1, c2, ...}. For each stripe, first a random rack is selected which stores c1 blocks in distinct nodes, then a different rack is chosen to store the next c2 blocks, and so on. Since all the blocks of a stripe have an identical role in repair, blocks can be arbitrarily enumerated to be distributed. We define a few terms:

• MAXSPREADING: This is when we distribute all the blocks across different racks. Therefore this corresponds to a placement scheme of {1, 1, ..., 1}.

• MINSPREADING: Here we try to use as few racks as possible, while making sure that a single rack failure does not cause data loss. For example, for 3x replication, {2, 1}, and for (8, 5) coding, {3, 3, 2} will correspond to MINSPREADING. Note that when 3 blocks are stored in a rack, we still choose distinct nodes for these three blocks.
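Since a placement scheme is just a partition of n into per-rack counts, the schemes that survive a single rack failure are exactly the partitions whose largest part is at most n − k. A minimal sketch enumerating them (pure Python; the function names are ours, not the dissertation's simulator):

```python
def partitions(n, max_part=None):
    """Yield all partitions of n as non-increasing tuples of positive ints."""
    if max_part is None:
        max_part = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, max_part), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest

def valid_schemes(n, k):
    """Placement schemes for an (n, k) code that tolerate one rack failure:
    no rack may hold more than n - k blocks of a stripe."""
    return [p for p in partitions(n) if p[0] <= n - k]

schemes = valid_schemes(8, 5)     # (8, 5) code: at most 3 blocks per rack
print(schemes)
# {1, ..., 1} is MAXSPREADING; {3, 3, 2} uses the fewest racks (MINSPREADING).
assert (1,) * 8 in schemes
assert (3, 3, 2) in schemes
assert (4, 2, 2) not in schemes   # a single rack failure would lose 4 blocks
```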
6.2.3 Failure Model

We assume that node lifetimes are exponentially distributed. Nodes fail at the end of their lifetime and are replaced immediately (and therefore participate in repairs and other activities immediately). Let the rate of the exponential random variable governing node failures be λn. All nodes have the same failure rate, and the failure of a node is independent of other nodes. When a node fails, the data it lost must be reconstructed elsewhere, and therefore repair jobs will be initialized. The details of the way repair takes place are given in the next section. Correlated failures are also common in data centers, where a group of nodes fail together. This could happen for a number of reasons, such as the loss of power for a group of nodes, or a non-responsive or failed rack switch. In this work, we consider rack failures: groups of all nodes in a rack fail together. We assume these failures are also exponential, and occur at a rate λr. When a rack fails, all the data in all the nodes within the rack is treated as lost and must therefore be reconstructed by starting new repair jobs.

6.2.4 Repair Model

Recall that each stripe consists of n blocks and that any k blocks are sufficient to recover the file. If a single block of a stripe is lost, then it can be recovered by using any k of the remaining blocks. Consider that a block is lost due to a node/rack failure. Now suppose a node is selected to repair this lost block. Note that this node can be different from, or even the same as, the lost node, since it is assumed that the nodes are replaced quickly. This node will download k blocks of the stripe from the other nodes that store these blocks and will reconstruct the lost block. If a second failure occurs at the repairing node or at any of the nodes containing the k blocks, repair will either be delayed or will fail. Now, if more than n − k blocks are lost, then fewer than k blocks remain, and therefore it is not possible to repair the stripe; the stripe is considered to be lost. If any one stripe is lost, we consider it to be a permanent data loss event. As an example, consider n = 7, k = 3 and a placement scheme {4, 2, 1}. If a block from the first group is lost, then the repair is initiated in a node from within the same rack, to exploit fast downloading of the three remaining blocks from within the rack. But suppose instead that a block from the second group is lost. Instead of starting the repair in that rack,
Now if more than n- k+ 1 blocks are lost, then less then k blocks remain and therefore it is not possible to repair the stripe and the stripe is considered to be lost. If any one stripe is lost, we consider it to be a permanent data loss event. As an example, consider n = 7, k = 3 and a placement scheme {4, 2, 1}. If a block from the first group is lost, then the repair is initiated in a node from within the same rack to exploit fast downloading of three remaining blocks from within the rack. But if say a block from the second group is lost. Instead of starting the repair in that rack, 138 Notation N R (n, k) {core Meaning number of nodes in the Data center Number of racks in the Data center The coding scheme used Number of blocks The placement scheme Node failure rate Rack failure rate Effective bandwidth of the rack switch Effective bandwidth of the core switch Table 6.1: List of notations used. we still start the repair in the same rack where first group is stored and then send the recovered block to the rack containing the second group. We summarize the notation used in this chapter in Table 6.1. 6.2.5 Block Storage Simulator We built a custom block storage simulator which mimics a data center (in a very restricted way). The architecture of the simulator is shown in Fig 6.2. The Simulator module reads the configuration from a file and instantitates all other modules based on the configuration parameters (which include the coding and placement scheme, failure rates etc.). The simulator is event-driven and so there is an event queue that keeps track of all the events in a time-ordered fashion. Each event is timestamped and stores which module should perform which action. The Namenode module initiates the storage and knows which block is in which node at all times. The Node module mimics a node and handles failure and repair and knows the list of blocks it stores. 
The Rack module is not shown in the figure for simplicity; it behaves very similarly to the Node module, except that it knows which nodes are associated with it.

[Figure 6.2: The simulator architecture and datacenter topology used for the simulations: the Simulator module with its configuration and event queue, rack-level (ToR) switches connected to the nodes, and a core switch connecting the racks.]

Each switch knows its speed γ (γr for the rack-level switches and γcore for the core switch). Each switch is modeled by a set of input and output queues. Every time slot, a switch moves γ blocks from each input queue to the appropriate output queue, and again γ blocks from each output queue to the appropriate destination. Failures are simulated, and they trigger repairs, which involve moving blocks. The simulator is run until a data loss event is detected. We perform multiple runs to determine the average time to data loss. Details about the parameters used are given in Section 6.4. We now give an overview of the simulator, starting with some design decisions that we had to make.

Design Decisions

The main reason for developing the block storage simulator is to understand the performance of various placement schemes in relation to one another, especially in terms of the MTTDL. Data is represented as blocks, and it does not matter which files the blocks come from. When nodes fail, we initiate the repair on various nodes. Since we prefer not to allocate a fixed repair throughput across all nodes (as is done in most other existing works), and since we want the repair throughput to depend on whether the blocks are read within the same rack or outside it, we also incorporated switches. One of the requirements is to be able to simulate thousands of failures, during which the blocks in the system will move around millions of times (during repairs). Therefore, we chose not to use ns2 [5] or any other kind of packet-level network simulator.
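The event-driven core described above can be sketched with a priority queue of timestamped events. The sketch below is ours, not the dissertation's simulator code; it shows only the failure/replacement loop with exponential node lifetimes (rate λn), assuming instant replacement as in the failure model:

```python
import heapq
import random

def simulate_failures(num_nodes, lam_n, horizon, seed=0):
    """Minimal event-driven loop: schedule exponential node lifetimes,
    pop events in time order, and reschedule the replacement node.
    Returns the list of (time, node) failure events up to `horizon`."""
    rng = random.Random(seed)
    events = []  # min-heap of (timestamp, node_id)
    for node in range(num_nodes):
        heapq.heappush(events, (rng.expovariate(lam_n), node))

    failures = []
    while events:
        t, node = heapq.heappop(events)
        if t > horizon:
            break
        failures.append((t, node))
        # In the full simulator this is where repair jobs for the node's
        # blocks would be enqueued; the replacement node starts a new life.
        heapq.heappush(events, (t + rng.expovariate(lam_n), node))
    return failures

fails = simulate_failures(num_nodes=100, lam_n=5e-5, horizon=10_000.0)
print(len(fails), "failures in the horizon")
```

With 100 nodes and rate 5E-5 per unit time, roughly num_nodes · λn · horizon ≈ 50 failures are expected over this horizon; the real simulator additionally moves blocks through the switch queues and checks each stripe for data loss after every event.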
Scope of the Simulator

While the simulator is comprehensive, it does not consider many other aspects and network dynamics in data centers, such as congestion, packet dropping, varying workloads, and heterogeneity of machines and switches. Since our goal is to estimate the MTTDL when employing various placement schemes, we believe that implementing these other features might hamper us from doing so.

6.3 Load Balancing

In this section, we illustrate the necessity for load balancing when choosing nodes to repair lost blocks. Consider the following simple failure and repair model. There are N nodes indexed i = 1, 2, ..., N, each storing m blocks to begin with. The load li of node i is defined to be the number of blocks it has. Therefore the load distribution is uniform to begin with (li = m for i = 1, 2, ..., N). At each time step, a randomly selected node fails, losing all the blocks it has. This lost node is immediately replaced by a new node, so that the number of nodes in the system is still N. Repair is initiated so that the lost blocks can be recovered. There are many ways of selecting which nodes fix the lost blocks. Consider a scheme where a node is selected at random for each lost block. Note that we are not interested in how the blocks are recovered; we can just assume that there is sufficient redundancy in the system so that the nodes can recover the lost blocks. We consider such a setup since we are mainly interested in understanding how the load changes with time. The repair is completed within the next time step and the cycle repeats: a new node is chosen randomly to fail, repair is initiated, and so on. Consider for example the following setup: there are N = 100 nodes, each storing m = 10 blocks at time t = 0. Suppose node 10 fails; then 10 blocks are lost and we will pick random nodes ten times.
If, say, node 1 was picked once and node 10 was never picked, then at the end of the repair, node 1 will have 11 blocks, whereas node 10 will have 0 blocks. In the next step, if node 1 fails, then 11 blocks need to be recovered, and so on.

[Figure 6.3: 100 nodes start with 10 blocks each, followed by a number of failure-repair cycles (10, 100, and 500 failures shown). The nodes are sorted by the descending order of their loads at the end of the failure-repair cycles; the x-axis is the node index (ordered by the descending load on the node), and the y-axis is the percentage deviation of the load. More than 500 failures follow a trend similar to 500 failures.]

6.3.1 Motivation

If the load distribution becomes skewed, then when a node with high load is lost, it creates a flurry of repair, opening up a longer window of vulnerability. This could lead to a reduction of the MTTDL. On the other hand, if the repair process is tuned so that a replacement for a lost node restores all the data, the repair could take much longer, thereby lengthening the window of vulnerability.

6.3.2 Proposed Solution

If the loads of the nodes are l1, l2, ..., lN, then we propose the following simple algorithm to handle this issue. Let β ≥ 0 be an exponent. The target nodes will be picked according to a probability distribution pi, such that pi ∝ 1/(1 + li)^β. We use 1 + li instead of li to handle cases when li = 0. We can see that if β = 0, then pi = 1/N. If β = ∞, then pi = 1 for the node i which has the minimum load and pi = 0 for the other nodes. Note that another way is to do the quickest repair and then migrate the data to maintain a good load distribution, but that creates additional network traffic. We do not know what the impact of such a strategy would be.

6.4 Evaluation

In this section, we describe the results obtained from the various simulations.
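Before the results, the repair-target selection rule of Section 6.3.2 (pi ∝ 1/(1 + li)^β) can be sketched and exercised on the simple failure-repair model of Section 6.3. This is our illustrative code, not the dissertation's simulator, and it compares β = 0 (uniform selection) against β = 2:

```python
import random

def pick_target(loads, beta, rng):
    """Pick a repair target with probability proportional to 1/(1+load)^beta."""
    weights = [1.0 / (1.0 + l) ** beta for l in loads]
    return rng.choices(range(len(loads)), weights=weights)[0]

def run(num_nodes=100, blocks=10, steps=500, beta=0.0, seed=1):
    """Failure-repair cycles from Section 6.3: a random node fails each step,
    and each of its lost blocks is re-created on a weighted-random target."""
    rng = random.Random(seed)
    loads = [blocks] * num_nodes
    for _ in range(steps):
        failed = rng.randrange(num_nodes)
        lost = loads[failed]
        loads[failed] = 0          # the replacement node starts empty
        for _ in range(lost):
            loads[pick_target(loads, beta, rng)] += 1
    return loads

loads_uniform = run(beta=0.0)
loads_weighted = run(beta=2.0)
print("load spread, beta=0:", max(loads_uniform) - min(loads_uniform),
      " beta=2:", max(loads_weighted) - min(loads_weighted))
```

For β = 0 this should reproduce the kind of skew shown in Figure 6.3, while β = 2 (the default used in Section 6.4) should keep the loads close to the initial 10 blocks per node.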
Note that we used a default of 100 nodes spread across 10 racks, storing 10,000 blocks. The default value of β used is 2.

6.4.1 Effect of Node Failure Rate

Here we fix the rack failure rate to 0 and vary the node failure rates, to understand the effect on the MTTDL and to find out which placement performs well. If there are no node failures in the system (λn = 0), then placement does not matter; in fact, we can just keep one replica of all the blocks. Unfortunately, this is never the case in practice. If a node fails, the repair will be fastest if all the remaining blocks are in the same rack (since intra-rack transfers are faster than inter-rack transfers), giving the maximum MTTDL. Indeed, this is what we observe from the simulations. We notice that MINSPREADING, which in the case of (8, 5) coding is {8}, has the maximum MTTDL. But since rack failures need to be accounted for in practice, we do not consider placement schemes that might cause data loss when a rack goes down, albeit temporarily. In the case of (8, 5) coding, if 4 or more blocks are lost, then data is lost, and so schemes such as {4, 2, 2}, {4, 1, 1, 1, 1}, etc. will not be used. Intuitively, it appears that we want to group as much as possible to maximize the MTTDL. So one might think that {3, 3, 2} should have the maximum MTTDL among the legal schemes, but that is not the case. In fact, {3, 2, 1, 1, 1} and {3, 1, 1, 1, 1, 1} have higher MTTDL values. The reason is as follows: assume {3, 3, 2} is used. When a block is lost, multiple blocks have to come from the same rack. This is slower than if the remaining blocks can be obtained in parallel from multiple racks.

[Figure 6.4: Effect of node failure rate for various placement schemes. (a) Rack failure rate = 0. (b) Rack failure rate = 2E-6.]
[Figure 6.5: Effect of rack failure rate for various placement schemes. (a) Node failure rate = 0. (b) Node failure rate = 5E-5.]

In Fig. 6.4b, we set the rack failure rate to 2E-6 and vary the node failure rate. When the node failure rate is very small, the trend displayed is similar to Fig. 6.5a, where the node failure rate is zero. It turns out that placing 3 blocks in a rack, as in the placement scheme {3, 1, 1, 1, 1, 1}, doesn't help anymore. This is because losing 3 blocks due to a rack failure makes the system much more vulnerable to permanent data loss than losing one block (when using MAXSPREADING). However, something interesting happens as the node failure rate increases: after a certain threshold, MAXSPREADING is no longer the best, and {3, 2, 1, 1, 1} achieves higher MTTDL.

6.4.2 Effect of Rack Failures

While most rack failures are transient in nature, it is useful to understand what happens to the MTTDL if data stored in entire racks needs to be reconstructed when racks go down. In Fig. 6.5a, we show a case where there are no node failures in the system, but only rack failures (even though this scenario is impractical, it is useful for intuition). It can be seen that the MTTDL values for the various schemes differ by a constant magnitude across the rates, and that MAXSPREADING performs the best. We also see a very surprising trend: the MTTDL of {3, 3, 2} is higher than that of {3, 1, 1, 1, 1, 1} or {2, 2, 2, 2}. Fig. 6.5b shows the dependence on the rack failure rate when the node failure rate is 5E-5. When the rack failure rate is very low, {3, 1, 1, 1, 1, 1} performs the best, as was the case discussed earlier for Fig. 6.4a. But as the rack failure rate increases, spreading the blocks across racks starts to make more sense, and MAXSPREADING performs better.
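One way to see why spreading wins as rack failures dominate is to count the fewest simultaneous rack failures that destroy a stripe under each scheme. A minimal sketch (our helper, assuming an (8, 5) code as in the text, so data is lost once more than n − k = 3 blocks are gone):

```python
def racks_to_lose_data(scheme, n, k):
    """Smallest number of simultaneous rack failures that destroys a stripe:
    the adversary takes the fullest racks first; the stripe is lost once
    more than n - k of its blocks are gone."""
    lost, racks = 0, 0
    for c in sorted(scheme, reverse=True):
        lost += c
        racks += 1
        if lost > n - k:
            return racks
    return None  # cannot be destroyed by rack failures alone

for scheme in [(1,) * 8, (2, 2, 2, 2), (3, 3, 2), (3, 1, 1, 1, 1, 1)]:
    print(scheme, "->", racks_to_lose_data(scheme, 8, 5))
# MAXSPREADING survives 3 simultaneous rack failures, while every legal
# grouped scheme here is destroyed by 2.
```

This matches the trend in Fig. 6.5b: under heavy rack failures, MAXSPREADING is the only scheme in this set that needs more than two concurrent rack losses to lose data.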
6.4.3 Effect of the Load Exponent (β)

[Figure 6.6: The improvement in MTTDL, in percentage, as the load exponent parameter is varied, with curves for {3, 1, 1, 1, 1, 1} and {3, 3, 2}. In this case, we used a node failure rate of 1E-4.]

The load exponent β helps regulate the load on the nodes so that it does not become overly uneven (high variance). As mentioned earlier, lower β leads to faster repair but a more uneven load, and higher β leads to slower repair (because only a subset of the nodes is chosen for repair with high probability). In Section 6.3, we motivated the importance of load balancing by considering a simple setup. But since the data center simulations are more involved (we need to consider racks as well as nodes, and the repair does not follow the simple model assumed in the example), it remains to be seen whether the MTTDL is affected by varying β. We note that there are two places where β is applicable: one is in picking a rack, and the other is in picking a node within the selected rack. Note that often the placement might dictate which rack to pick. For example, when repairing a block of a rack, if many of the remaining blocks of the stripe are in a particular rack, it might be better to pick that rack. We therefore vary β to understand its effect on the MTTDL. In Fig. 6.6 we measure the improvement in MTTDL as the load exponent β is increased, with the baseline being β = 0. From the figure, we can see that as β is increased, the MTTDL increases, by as much as 80% for the {3, 1, 1, 1, 1, 1} placement scheme.
The improvement is not as significant for {3, 3, 2}.

6.4.4 Effect of Number of Blocks Stored

Fig. 6.7 shows the effect of the number of blocks stored on the MTTDL for a few placement schemes. As the number of blocks increases, the MTTDL of all the schemes decreases. This is because, as the number of blocks increases, with each failure the repair time correspondingly increases. Since the MTTDL is inversely related to the repair time, the MTTDL decreases. We note that the scheme {3, 1, 1, 1, 1, 1} performs the best. The explanation is similar to that given in Section 6.4.1. It can also be seen that the difference between the best scheme and the MAXSPREADING scheme grows wider as the number of blocks decreases.

[Figure 6.7: Effect of the number of blocks stored in the system for various placement schemes ({1, 1, 1, 1, 1, 1, 1, 1}, {3, 3, 2}, {3, 1, 1, 1, 1, 1}, {2, 2, 2, 2}). Here an (8, 5) code is used. The performance of {3, 2, 1, 1, 1} is similar to that of {3, 1, 1, 1, 1, 1}. The node failure rate is 5E-5 and the rack failure rate is 0.]

6.5 Chapter Summary

We have illustrated the importance of the placement of data across a data center in this chapter. We considered an (n, k) coded representation of data, which captures both replication (k = 1) and erasure coding (k > 1), and identified that fault-tolerance and repair speed are two opposing factors when considering placement. For example, placing all blocks of a stripe in the same rack can speed up repairs but increase the rate of failures. We have captured these two opposing factors in one widely used metric for reliability: the Mean Time To Data Loss (MTTDL). We have also shown that incorrect placement schemes can reduce the MTTDL by increasing the cross-rack traffic, or by causing delays in reading data for recovery. In this work, we show a few cases where non-intuitive placement schemes offer better reliability.
In general, the best placement scheme depends on a number of factors, including the number of nodes, the number of racks, the number of total blocks to be stored, the available network bandwidth within and across racks, and the node and rack failure rates. While it is hard to come up with a simple method or algorithm to determine the best scheme given all these factors, we have outlined the reasoning behind why certain non-intuitive placements do better than others, which could guide data center operators in determining good placement schemes. Another approach, also used in this work, is to perform Monte Carlo simulation of failures and repairs to determine the MTTDL of placement schemes. Much work remains to be done in the future. One direction is to extend this work to non-MDS codes such as Locally Repairable Codes. Also, in this chapter, we have mainly considered the (8, 5) coding scheme. It will therefore be interesting to vary the coding parameters (n, k) to determine good coding schemes that maximize the MTTDL, in conjunction with good placement schemes. A longer summary of future work is described in Chapter 7.

Part III
Epilogue

Chapter 7
Conclusions and Future Work

We have considered two key challenges each in two large-scale cloud environments, the vehicular cloud and the data center cloud, and proposed solutions to address these challenges. Part I contains the details of the challenges and the solutions for the vehicular cloud, and similarly Part II contains the details for the data center cloud.

7.1 Summary

In Chapter 1, we described the methodology used to guide the research done in this thesis. The general philosophy is to identify a bottleneck that hinders applications or systems from scaling, propose a possible way to overcome the bottleneck, and identify and address any challenges associated with the solution to the bottleneck.
7.1.1 Vehicular Cloud

When considering content access from vehicles, we identify that using the cellular network could be a potential bottleneck due to the high cost of access and the scalability issues faced by cellular network operators. Therefore, we propose the concept of the Vehicular Cloud in Chapter 1 to address this bottleneck. The idea here is that, instead of making use of cellular networks to download content to vehicles, we can store the content in the vehicles themselves. A vehicle that wants a particular content can opportunistically download it from other vehicles when it encounters them. We could also let other vehicles act as relays or helper nodes to deliver the content from the node that has the content (seed) to the requesting node (demand). We identify two challenges, namely the high latencies of content access and the non-triviality of helper node allocation. These challenges are tackled in Chapters 3 and 4, included in Part I.

The first challenge is addressed in Chapter 3, where we propose the use of distributed storage codes to reduce the latency of content access from the vehicular cloud. We prove mathematically that coded storage is never worse than uncoded storage in terms of latency, especially for large files and severe bandwidth limitations. We also used a realistic trace-based simulator to validate that coded storage can indeed reduce the latencies. The second challenge is addressed in Chapter 4, where we mathematically formulate the problem of helper node allocation and optimize utilities for two models.

7.1.2 Data Center Cloud

Due to the sheer number of components involved in data centers, failures are common and must be taken into consideration when designing storage solutions. 3-replication is widely used to overcome data loss due to failures, but we identify that 3x replication can quickly get very expensive. Therefore, we propose the use of erasure codes to address this bottleneck.
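The cost gap that motivates erasure coding here is easy to quantify. The snippet below is an illustrative sketch (not code from the thesis) of the physical storage consumed per logical byte under replication versus an (n, k) MDS code:

```python
def storage_overhead(n: int, k: int) -> float:
    """Physical bytes stored per logical byte for an (n, k) MDS code.
    3-way replication is the special case (n, k) = (3, 1)."""
    return n / k

# 3x replication stores 3.0 bytes per logical byte (33.3% efficient).
# An (8, 5) code stores 1.6 bytes per logical byte (62.5% efficient),
# while still tolerating any n - k = 3 lost blocks per stripe.
print(storage_overhead(3, 1), storage_overhead(8, 5))
```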
Next, we identify two challenges associated with this proposal. One is the repair problem: the use of erasure codes can cause a lot of network traffic and disk I/O during repairs (which are common). This challenge is addressed in Chapter 5. The second challenge is the placement problem: the placement of blocks across the data center affects the fault-tolerance as well as the rebuild rates. This is addressed in Chapter 6.

7.2 Future Directions

There are a number of directions that we aspire to pursue in the future. We categorize them into future work in the vehicular cloud and in the data center cloud.

7.3 Vehicular Cloud

One of the limitations in Chapter 4 is that the nodes have storage space for only a single file. While this assumption was critical for the current analysis, we would like to extend it to support bigger storage capacities.

Some recent works have explored heuristically the problem of allocating storage in the context of statistically structured heterogeneous ICMNs, including via social network analysis [35, 46]. To shed some theoretical light on such problems, the formulation presented in Chapter 4 will need to be extended in future work to handle more realistic heterogeneous mobility patterns. This may be mathematically and computationally challenging, as not only the number of helper nodes but also their identity starts to matter, resulting in a combinatorial explosion of states. Nevertheless, approaches leveraging approximation algorithms may prove fruitful.

Similarly, in Chapter 3, the contact model assumes that the node requesting the content (sink node) meets a uniformly random node every encounter. Extending this model to a slightly more realistic one where the sink node meets other nodes according to a probability distribution is already a hard problem to solve, to the best of our knowledge. An assumption made in this chapter is that all the files are equally popular.
If the goal is to reduce the expected latency, where the expectation is over the various files, it intuitively makes sense that files with higher popularities must have higher redundancy than files with lower popularities. A square-root replication scheme [26] could be made use of to determine the redundancy values αi for the files. However, it is unclear whether this would be the optimal method in our setting.

7.4 Data Center Cloud

In Chapter 6, we considered the placement of blocks across a data center. One of the limitations was that we considered the placement for only MDS codes. The Locally Repairable Codes implemented in Chapter 5 are non-MDS, and we believe that their placement needs to be investigated. There are at least two reasons why studying the placement issues of LRCs can be interesting. First, for (n, k) MDS codes, each lost block requires k other blocks to repair, but in LRCs, fewer than k blocks might be sufficient to repair one lost block, and k blocks or less might be required to repair two or more lost blocks. Second, for repair considerations, all blocks can be treated as identical for MDS codes, but that does not hold for LRCs.

Maximizing the amount of logical data stored: Given a data center configuration (number of nodes, number of racks, switch speeds, node failure rates, rack failure rates) and given a maximum storage capacity, we want to determine how much logical data we can store. To recall, the term logical data is used to indicate the raw amount of data before coding or replication. For example, if data occupies 3 TB after 3-replication, the logical equivalent is only 1 TB. Let us consider an example to motivate this problem. Given a 100 node cluster (with other parameters such as 10 racks etc.) and given that it can store a maximum amount of 1 PB, we want to determine the maximum amount of logical data that can be stored.
Since failures are common, we cannot use all of the 1 PB to store logical data, since, in the event of a node or rack failure, we will lose data. The de facto method used is 3x replication, and the amount of logical data would be 333.33 TB. But is it possible to do better? A good approach to this problem is to use erasure codes. While 3x replication has 33.33% efficiency, a code like (8, 5) will have 62.5% efficiency. So a data center vendor could encode, say, about 10% of the data with a code like (8, 5) and store the rest with 3x replication, in which case the effective logical data stored will be 349 TB.

An intuitive metric to use to decide whether to use this storage scheme is the MTTDL. If the MTTDL of encoding 10% of the data using the (8, 5) code is lower than that of pure replication, it intuitively suggests that we should not be using coding. The benefit of the MTTDL is that it takes into account the available repair bandwidth. If, for example, the available repair bandwidth is low, then as more and more data is encoded, the repair bandwidth will get saturated and eventually the repairs could start to take very long, thereby affecting the MTTDL negatively. Another issue to consider here is to determine the actual coding scheme to be used. We are interested in determining the optimal set of parameters n and k which minimizes n/k, such that the (n, k) code can be used. Another extension would be to use LRCs.

Determining a good group size: If replicas or coded blocks are distributed randomly across a lot of nodes, then the failure of a random fraction of the nodes can cause data loss with high probability. Consider the following example to motivate this problem. Let there be ten nodes, and suppose a million blocks need to be stored on these ten nodes with 3-replication (therefore, there are a million stripes and each stripe contains three blocks). Suppose for each stripe three nodes are randomly chosen for storage.
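The capacity arithmetic above can be checked with a short computation. The sketch below (illustrative names, assuming the 10% refers to the fraction of logical data that is coded) reproduces the 333.33 TB and ~349 TB figures:

```python
def max_logical_tb(raw_tb, coded_frac, n=8, k=5, repl=3):
    """Logical capacity of `raw_tb` of raw storage when a fraction
    `coded_frac` of the logical data is stored with an (n, k) code and
    the rest with `repl`-way replication. A coded logical byte costs
    n/k physical bytes; a replicated one costs `repl`."""
    cost_per_logical = coded_frac * (n / k) + (1 - coded_frac) * repl
    return raw_tb / cost_per_logical

print(round(max_logical_tb(1000, 0.0), 2))  # 333.33 TB: pure 3x replication
print(round(max_logical_tb(1000, 0.1), 2))  # 349.65 TB: 10% (8, 5)-coded
print(round(max_logical_tb(1000, 1.0), 2))  # 625.0 TB: fully (8, 5)-coded
```

The last line shows the headroom that makes the MTTDL question worth asking: coding everything would nearly double the logical capacity, if the reliability cost were acceptable.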
Due to the sheer number of stripes, it is clear that a random failure of any three nodes will cause data loss with high probability. But now consider that the nodes are divided into two groups, each of five nodes. Half a million stripes are placed in the first group of five nodes and the other half in the remaining five nodes. If again three nodes fail, say one node from the first group and two from the second group, then we have not lost data. The nodes can initiate a repair to recover the lost blocks.

In general, consider that there are N nodes, and that data is stored using (n, k) coding. We propose that the N nodes be split into groups of size n ≤ g ≤ N each, and that each group stores an equal number of stripes independent of the other groups. The blocks of each stripe will be placed randomly across the g nodes of the group it belongs to. We are interested in determining a good value for g. We will shortly discuss a good metric to optimize for.

Let us consider that a fraction p of the nodes fail. When nodes fail, blocks are lost. For all stripes, if k or more blocks remain, then the lost blocks can be recovered by initiating a repair process, but if any stripe contains fewer than k blocks, then we consider that data has been lost. It can be shown analytically that when g = n the probability of data loss is minimum, and that the probability of data loss increases with the group size. But we argue that the probability of data loss is not the right metric. This is because g = n is not practical, since it will lead to very high repair durations. For each lost node there are only n - 1 nodes that can participate in the repair. Whereas when g = N, for each lost node, it is very likely that the remaining blocks are distributed over all N - 1 nodes, and therefore all N - 1 nodes can participate in the repair, leading to very fast repairs. Therefore, we can see that there is a tradeoff between the data loss probability and repair speeds.
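The first half of this tradeoff, the data loss probability, can be illustrated with a small Monte Carlo sketch. The parameters below are hypothetical, and the sketch models only block survival, not repair speed:

```python
import random

def data_loss_prob(N=16, n=8, k=5, g=8, stripes=200,
                   p=0.25, trials=400, seed=1):
    """Probability that failing a random fraction p of the N nodes
    destroys some stripe, when the nodes are split into groups of size
    g and each stripe's n blocks land on n distinct nodes of one group.
    A stripe is lost when fewer than k of its n blocks survive."""
    rng = random.Random(seed)
    groups = [list(range(i, i + g)) for i in range(0, N, g)]
    placement = [rng.sample(rng.choice(groups), n) for _ in range(stripes)]
    losses = 0
    for _ in range(trials):
        failed = set(rng.sample(range(N), int(p * N)))
        # Lost if any stripe has more than n - k blocks on failed nodes.
        if any(sum(b in failed for b in s) > n - k for s in placement):
            losses += 1
    return losses / trials

# Smaller groups confine each stripe's risk: with g = n a stripe dies
# only if enough of its own n nodes fail together, while with g = N
# the many-stripes argument above makes some loss almost certain.
print(data_loss_prob(g=8), data_loss_prob(g=16))
```

Such a simulation captures only the loss-probability side; weighing it against repair speed is exactly what the MTTDL metric proposed next is for.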
As g increases from n to N, the data loss probability increases, but the repair speeds also get faster. We believe that the MTTDL can be a good metric to optimize g over, and that in some cases a group size of n < g < N could maximize the MTTDL.

References

[1] Amazon EC2. http://aws.amazon.com/ec2/. Accessed: Nov 2013.
[2] Amdahl's Law. http://en.wikipedia.org/wiki/Amdahl's_law. Accessed: Nov 2013.
[3] HDFS-RAID Wiki. http://wiki.apache.org/hadoop/HDFS-RAID. Accessed: Nov 2013.
[4] Intel's In-Vehicle Infotainment (IVI). http://bit.ly/vehinfotainment. Accessed: Nov 2013.
[5] The Network Simulator (ns2). http://www.isi.edu/nsnam/ns/. Accessed: Nov 2013.
[6] Cisco visual networking index: Global mobile data traffic forecast update, 2011-2016. Cisco White Paper, 2012.
[7] S. Agarwal, J. Dunagan, N. Jain, S. Saroiu, A. Wolman, and H. Bhogan. Volley: Automated data placement for geo-distributed cloud services. In Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation, 2010.
[8] S. Ahmed and S. Kanhere. VANETCODE: Network coding to enhance cooperative downloading in vehicular ad-hoc networks. In Proc. of ACM International Conference on Wireless Communications and Mobile Computing, 2006.
[9] J. Ahn, M. Sathiamoorthy, B. Krishnamachari, F. Bai, and L. Zhang. Optimizing content dissemination in vehicular networks with radio heterogeneity. In IEEE Transactions on Mobile Computing, 2013.
[10] M. Al-Fares, A. Loukissas, and A. Vahdat. A scalable, commodity data center network architecture. In ACM SIGCOMM Computer Communication Review. ACM, 2008.
[11] M. Alresaini, M. Sathiamoorthy, B. Krishnamachari, and M. Neely. Backpressure with adaptive redundancy (BWAR). In Proc. of IEEE International Conference on Computer Communications (INFOCOM), 2012.
[12] E. Altman, P. Nain, and J. Bermond. Distributed storage management of evolving files in delay tolerant ad hoc networks. In Proc.
of International Conference on Computer Communications (INFOCOM), 2009.
[13] G. M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Conference (Spring). ACM, 1967.
[14] L. Armstrong. Classes of Applications. Presentation, http://tinyurl.com/vanetapps. Accessed: Dec 2010.
[15] F. Bai, D. Stancil, and H. Krishnan. Toward understanding characteristics of dedicated short range communications (DSRC) from a perspective of vehicular network engineers. In Proc. of ACM Mobile Computing and Networking (MobiCom), 2010.
[16] A. Balasubramanian, B. Levine, and A. Venkataramani. DTN routing as a resource allocation problem. In Proc. of ACM Special Interest Group on Data Communications (SIGCOMM), 2007.
[17] P. Bodik, I. Menache, M. Chowdhury, P. Mani, D. Maltz, and I. Stoica. Surviving failures in bandwidth-constrained datacenters. In Proceedings of the ACM SIGCOMM 2012 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication. ACM, 2012.
[18] D. Borthakur. The Hadoop distributed file system: Architecture and design. Hadoop Project Website, http://hadoop.apache.org/docs/r0.18.3/hdfs_design.html, retrieved 2013.
[19] D. Borthakur, J. Gray, J. S. Sarma, K. Muthukkaruppan, N. Spiegelberg, H. Kuang, K. Ranganathan, D. Molkov, A. Menon, S. Rash, et al. Apache Hadoop goes realtime at Facebook. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. ACM, 2011.
[20] W. A. Burkhard and J. Menon. Disk array storage system reliability. In The Twenty-Third International Symposium on Fault-Tolerant Computing (FTCS-23). IEEE, 1993.
[21] J. Byers, M. Luby, M. Mitzenmacher, and A. Rege. A digital fountain approach to reliable distribution of bulk data. In Proc. of ACM Special Interest Group on Data Communications (SIGCOMM), 1998.
[22] V. Cadambe, S. Jafar, H. Maleki, K. Ramchandran, and C. Suh.
Asymptotic interference alignment for optimal repair of MDS codes in distributed storage. Submitted to IEEE Transactions on Information Theory, Sep. 2011 (consolidated paper of arXiv:1004.4299 and arXiv:1004.4663).
[23] B. Calder, J. Wang, A. Ogus, N. Nilakantan, A. Skjolsvold, S. McKelvie, Y. Xu, S. Srivastav, J. Wu, H. Simitci, et al. Windows Azure Storage: A highly available cloud storage service with strong consistency. In Proc. of ACM Symposium on Operating Systems Principles (SOSP), 2011.
[24] M. Chowdhury, M. Zaharia, J. Ma, M. I. Jordan, and I. Stoica. Managing data transfers in computer clusters with Orchestra. SIGCOMM Computer Communication Review, 2011.
[25] A. Cidon, S. Rumble, R. Stutsman, S. Katti, J. Ousterhout, and M. Rosenblum. Copysets: Reducing the frequency of data loss in cloud storage. In USENIX Annual Technical Conference (ATC), 2013.
[26] E. Cohen and S. Shenker. Replication strategies in unstructured peer-to-peer networks. In Proc. of the 2002 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications. ACM, 2002.
[27] A. Dimakis, P. Godfrey, Y. Wu, M. Wainwright, and K. Ramchandran. Network coding for distributed storage systems. Information Theory, IEEE Transactions on, 2010.
[28] A. Dimakis, V. Prabhakaran, and K. Ramchandran. Ubiquitous access to distributed data in large-scale sensor networks through decentralized erasure codes. In Proc. of ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN), 2005.
[29] A. Dimakis, K. Ramchandran, Y. Wu, and C. Suh. A survey on network codes for distributed storage. In Proceedings of the IEEE. IEEE, 2011.
[30] J. R. Douceur and R. P. Wattenhofer. Competitive hill-climbing strategies for replica placement in a distributed file system. In Distributed Computing. Springer, 2001.
[31] J. Eriksson, L. Girod, B. Hull, R. Newton, S. Madden, and H. Balakrishnan.
The pothole patrol: Using a mobile sensor network for road surface monitoring. In Proc. of ACM International Conference on Mobile Systems, Applications, and Services (MobiSys), 2008.
[32] B. Fan, W. Tantisiriroj, L. Xiao, and G. Gibson. DiskReduce: RAID for data-intensive scalable computing. In Proceedings of the 4th Annual Workshop on Petascale Data Storage. ACM, 2009.
[33] K. Fitchard. Verizon: In the game of 4G, spectrum trumps technology. http://gigaom.com/2012/03/06/verizon-in-the-game-of-capacity-spectrum-trumps-technology/
[34] D. Ford, F. Labelle, F. Popovici, M. Stokely, V. Truong, L. Barroso, C. Grimes, and S. Quinlan. Availability in globally distributed storage systems. In Proc. of USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2010.
[35] W. Gao, Q. Li, B. Zhao, and G. Cao. Social-aware multicast in disruption-tolerant networks. IEEE/ACM Transactions on Networking (TON), 2012.
[36] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In ACM SIGOPS Operating Systems Review. ACM, 2003.
[37] C. Gkantsidis, J. Miller, and P. Rodriguez. Comprehensive view of a live network coding P2P system. In Proc. of ACM SIGCOMM Conference on Internet Measurement Conference (IMC), 2006.
[38] C. Gkantsidis and P. Rodriguez. Network coding for large scale content distribution. In Proc. of IEEE International Conference on Computer Communications (INFOCOM), 2005.
[39] P. Gopalan, C. Huang, H. Simitci, and S. Yekhanin. On the locality of codeword symbols. CoRR, abs/1106.3625, 2011.
[40] K. M. Greenan. Reliability and power-efficiency in erasure-coded storage systems. PhD thesis, Citeseer, 2009.
[41] A. Greenberg, J. Hamilton, D. A. Maltz, and P. Patel. The cost of a cloud: Research problems in data center networks. Computer Communications Review (CCR), 2009.
[42] A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta. VL2: A scalable and flexible data center network.
SIGCOMM Computer Communication Review, 2009.
[43] G. Grimmett and D. Stirzaker. Probability and Random Processes. Oxford University Press, USA, 2001.
[44] R. Groenevelt, P. Nain, and G. Koole. The message delay in mobile ad hoc networks. Performance Evaluation, 2005.
[45] C. Guo, H. Wu, K. Tan, L. Shi, Y. Zhang, and S. Lu. DCell: A scalable and fault-tolerant network structure for data centers. SIGCOMM Computer Communication Review, 2008.
[46] Z. Guo, B. Wang, and J. Cui. Prediction assisted single-copy routing in underwater delay tolerant networks. In Proc. of IEEE Global Communications Conference, Exhibition and Industry Forum (GLOBECOM), 2010.
[47] C. Huang, M. Chen, and J. Li. Pyramid codes: Flexible schemes to trade space for access efficiency in reliable data storage systems. In Sixth IEEE International Symposium on Network Computing and Applications (NCA). IEEE, 2007.
[48] C. Huang, H. Simitci, Y. Xu, A. Ogus, B. Calder, P. Gopalan, J. Li, S. Yekhanin, et al. Erasure coding in Windows Azure Storage. In USENIX Annual Technical Conference (ATC), 2012.
[49] S. Ioannidis, A. Chaintreau, and L. Massoulie. Optimal and scalable distribution of content updates over a mobile social network. In Proc. of IEEE International Conference on Computer Communications (INFOCOM). IEEE, 2009.
[50] S. Ioannidis, L. Massoulie, and A. Chaintreau. Distributed caching over heterogeneous mobile networks. Queueing Systems, 2010.
[51] S. Jain, M. Demmer, R. Patra, and K. Fall. Using redundancy to cope with failures in a delay tolerant network. ACM SIGCOMM Computer Communication Review, 2005.
[52] D. Jiang and L. Delgrossi. IEEE 802.11p: Towards an international standard for wireless access in vehicular environments. In IEEE Vehicular Technology Conference (VTC) Spring 2008.
[53] A. Kamra, J. Feldman, V. Misra, and D. Rubenstein. Growth codes: Maximizing sensor network data persistence. Proc. of ACM Special Interest Group on Data Communications (SIGCOMM), 2006.
[54] S.
Kapadia, B. Krishnamachari, and S. Ghandeharizadeh. Static replication strategies for content availability in vehicular ad-hoc networks. Mobile Networks and Applications, 2009.
[55] T. Karagiannis, J. Le Boudec, and M. Vojnovic. Power law and exponential decay of intercontact times between mobile devices. Mobile Computing, IEEE Transactions on, 2010.
[56] O. Khan, R. Burns, J. Plank, W. Pierce, and C. Huang. Rethinking erasure codes for cloud file systems: Minimizing I/O for recovery and degraded reads. In Proc. of USENIX Conference on File and Storage Technologies (FAST), 2012.
[57] O. Khan, R. Burns, J. S. Plank, and C. Huang. In search of I/O-optimal recovery from disk failures. In HotStorage '11: 3rd Workshop on Hot Topics in Storage and File Systems. USENIX, 2011.
[58] A. Krifa, C. Barakat, and T. Spyropoulos. Message drop and scheduling in DTNs: Theory and practice. Mobile Computing, IEEE Transactions on, 2012.
[59] K. Lee, S. Lee, R. Cheung, U. Lee, and M. Gerla. First experience with CarTorrent in a real vehicular ad hoc network testbed. In IEEE MOVE, May 2007.
[60] U. Lee, J. Lee, J. Park, and M. Gerla. FleaNet: A virtual market place on vehicular networks. Vehicular Technology, IEEE Transactions on, 2010.
[61] U. Lee, J. Park, J. Yeh, G. Pau, and M. Gerla. CodeTorrent: Content distribution using network coding in VANET. In Proc. of ACM International Workshop on Decentralized Resource Sharing in Mobile Computing and Networking (MobiShare), 2006.
[62] D. Leong, A. Dimakis, and T. Ho. Distributed storage allocation problems. In Network Coding, Theory, and Applications, 2009. NetCod '09. Workshop on. IEEE, 2009.
[63] M. Li, Z. Yang, and W. Lou. CodeOn: Cooperative popular content distribution for vehicular networks using symbol level network coding. Selected Areas in Communications, IEEE Journal on, 2011.
[64] Q. Li, S. Zhu, and G. Cao. Routing in socially selfish delay tolerant networks. In Proc.
of IEEE International Conference on Computer Communications (INFOCOM), 2010.
[65] Q. Lian, W. Chen, and Z. Zhang. On the impact of replica placement to the reliability of distributed brick storage systems. In Proc. of 25th IEEE International Conference on Distributed Computing Systems (ICDCS). IEEE, 2005.
[66] A. Lindgren, A. Doria, and O. Schelen. Probabilistic routing in intermittently connected networks. ACM SIGMOBILE Mobile Computing and Communications Review, 2003.
[67] M. Luby. LT codes. Proc. of IEEE Foundations of Computer Science, 2002.
[68] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner. OpenFlow: Enabling innovation in campus networks. ACM SIGCOMM Computer Communication Review, 2008.
[69] M. Mitzenmacher and E. Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge Univ Press, 2005.
[70] A. Nandan, S. Tewari, S. Das, M. Gerla, and L. Kleinrock. AdTorrent: Delivering location cognizant advertisements to car networks. In Proc. of IEEE/IFIP Annual Conference on Wireless On-Demand Network Systems and Services (WONS), 2006.
[71] M. J. Neely and E. Modiano. Capacity and delay tradeoffs for ad hoc mobile networks. Information Theory, IEEE Transactions on, 2005.
[72] J. R. Norris. Markov Chains. Number 2008. Cambridge University Press, 1998.
[73] F. Oggier and A. Datta. Self-repairing homomorphic codes for distributed storage systems. In Proc. of IEEE International Conference on Computer Communications (INFOCOM), 2011.
[74] D. Papailiopoulos, J. Luo, A. Dimakis, C. Huang, and J. Li. Simple regenerating codes: Network coding for cloud storage. Arxiv preprint arXiv:1109.0264, 2011.
[75] D. S. Papailiopoulos and A. G. Dimakis. Locally repairable codes. CoRR, abs/1206.3804, 2012.
[76] D. A. Patterson, G. Gibson, and R. H. Katz. A case for redundant arrays of inexpensive disks (RAID). In Proc. of the 1988 ACM SIGMOD International Conference on Management of Data, 1988.
[77] D. Pogue. Excited about the cloud? Get ready for capped data plans. http://pogue.blogs.nytimes.com/2011/06/16/excited-about-the-cloud-get-ready-for-capped-data-plans/. Accessed: Nov 2013.
[78] K. Rashmi, N. Shah, and P. Kumar. Optimal exact-regenerating codes for distributed storage at the MSR and MBR points via a product-matrix construction. Information Theory, IEEE Transactions on, 2011.
[79] I. Reed and G. Solomon. Polynomial codes over certain finite fields. In Journal of the SIAM, 1960.
[80] J. Reich and A. Chaintreau. The age of impatience: Optimal replication schemes for opportunistic networks. In Proc. of ACM International Conference on emerging Networking EXperiments and Technologies (CoNEXT), 2009.
[81] P. Reny. Non-cooperative games: Equilibrium existence. The New Palgrave Dictionary of Economics, Second Edition, 2005.
[82] T. Richardson and R. Urbanke. Modern Coding Theory. Cambridge University Press, 2008.
[83] R. Rodrigues and B. Liskov. High availability in DHTs: Erasure coding vs. replication. Peer-to-Peer Systems IV, 2005.
[84] R. Rodrigues and B. Liskov. High availability in DHTs: Erasure coding vs. replication. In Proc. of International Workshop on Peer-To-Peer Systems (IPTPS), 2005.
[85] M. Sathiamoorthy, M. Asteris, D. Papailiopoulos, A. G. Dimakis, R. Vadali, S. Chen, and D. Borthakur. XORing elephants: Novel erasure codes for big data. Proceedings of the VLDB Endowment, 2013.
[86] M. Sathiamoorthy, A. Dimakis, B. Krishnamachari, and F. Bai. Distributed storage codes reduce latency in vehicular networks. Transactions on Mobile Computing, 2013.
[87] M. Sathiamoorthy, A. G. Dimakis, B. Krishnamachari, and F. Bai. Distributed storage codes reduce latency in vehicular networks. In Proc. of IEEE International Conference on Computer Communications (INFOCOM-Mini). IEEE, 2012.
[88] M. Sathiamoorthy, W. Gao, B. Krishnamachari, and G. Cao. Minimum latency data diffusion in intermittently connected mobile networks.
In Vehicular Technology Conference (VTC Spring), 2012 IEEE 75th, 2012.
[89] N. Shah, K. Rashmi, P. Kumar, and K. Ramchandran. Interference alignment in regenerating codes for distributed storage: Necessity and code constructions. Information Theory, IEEE Transactions on, 2012.
[90] U. Shevade, Y.-C. Chen, L. Qiu, Y. Zhang, V. Chandar, M. K. Han, H. H. Song, and Y. Seung. Enabling high-bandwidth vehicular content distribution. In Proc. of ACM International Conference on emerging Networking EXperiments and Technologies (CoNEXT), 2010.
[91] U. Shevade, H. Song, L. Qiu, and Y. Zhang. Incentive-aware routing in DTNs. In Proc. of the IEEE International Conference on Network Protocols (ICNP), 2008.
[92] A. Shokrollahi. Raptor codes. IEEE Trans. on Information Theory, June 2006.
[93] T. Small and Z. Haas. Resource and performance tradeoffs in delay-tolerant wireless networks. In Proc. of ACM Special Interest Group on Data Communications (SIGCOMM), 2005.
[94] T. Spyropoulos, K. Psounis, and C. Raghavendra. Spray and wait: An efficient routing scheme for intermittently connected mobile networks. In Proc. of the 2005 ACM SIGCOMM Workshop on Delay-Tolerant Networking. ACM, 2005.
[95] T. Spyropoulos, K. Psounis, and C. Raghavendra. Performance analysis of mobility assisted routing. In Proc. of ACM Mobile Computing and Networking (MobiCom), 2006.
[96] T. Spyropoulos, K. Psounis, and C. Raghavendra. Spray and focus: Efficient mobility-assisted routing for heterogeneous and correlated mobility. In International Conference on Pervasive Computation and Communications (PerCom) Workshops. IEEE, 2007.
[97] T. Spyropoulos, K. Psounis, and C. S. Raghavendra. Spray and wait: An efficient routing scheme for intermittently connected mobile networks. In Proceedings of the 2005 ACM SIGCOMM Workshop on Delay-Tolerant Networking (WDTN). ACM, 2005.
[98] I. Tamo, Z. Wang, and J. Bruck. MDS array codes with optimal rebuilding. CoRR, abs/1103.3737, 2011.
[99] A. Vahdat and D. Becker.
Epidemic routing for partially connected ad hoc networks. Technical report, CS-2000-06, Duke University, 2000.
[100] R. Van Der Hofstad. Random graphs and complex networks. Available on http://www.win.tue.nl/~rhofstad/NotesRGCN.pdf, 2009.
[101] V. Venkatesan, I. Iliadis, and R. Haas. Reliability of data storage systems under network rebuild bandwidth constraints. In Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), 2012 IEEE 20th International Symposium on. IEEE, 2012.
[102] G. Wang, A. R. Butt, P. Pandey, and K. Gupta. A simulation approach to evaluating design decisions in MapReduce setups. In IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS). IEEE, 2009.
[103] Y. Wang, S. Jain, M. Martonosi, and K. Fall. Erasure-coding based routing for opportunistic networks. In Proceedings of the 2005 ACM SIGCOMM Workshop on Delay-Tolerant Networking (WDTN), 2005.
[104] Y. Wang, B. Krishnamachari, and T. Valente. Findings from an empirical study of fine-grained human social contacts. In Proc. of IEEE/IFIP Annual Conference on Wireless On-Demand Network Systems and Services (WONS), 2009.
[105] H. Weatherspoon and J. D. Kubiatowicz. Erasure coding vs. replication: A quantitative comparison. In Proc. of International Workshop on Peer-To-Peer Systems (IPTPS), 2002.
[106] L. Wei, Z. Cao, and H. Zhu. MobiGame: A user-centric reputation based incentive protocol for delay/disruption tolerant networks. In Proc. of IEEE Global Communications Conference, Exhibition and Industry Forum (GLOBECOM), 2011.
[107] S. B. Wicker and V. K. Bhargava. Reed-Solomon codes and their applications. In IEEE Press, 1994.
[108] Q. Xin, E. L. Miller, T. Schwarz, D. D. Long, S. A. Brandt, and W. Litwin. Reliability mechanisms for very large storage systems. In Proceedings of 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies (MSST). IEEE, 2003.
[109] X. Zhang, G.
Neglia, J. Kurose, and D. Towsley. Performance modeling of epidemic routing. Computer Networks, 2007.
Asset Metadata
Creator: Sathiamoorthy, Maheswaran (author)
Core Title: Optimizing distributed storage in cloud environments
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Publication Date: 11/26/2013
Defense Date: 08/14/2013
Publisher: University of Southern California
Language: English
Advisor: Krishnamachari, Bhaskar (committee chair), Bai, Fan (committee member), Dimakis, Alexandros G. (committee member), Neely, Michael J. (committee member), Yu, Minlan (committee member)
Tags: cloud environment, data centers, distributed storage, erasure codes, optimization, vehicular networking
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c3-351890
Identifier: etd-Sathiamoor-2181.pdf
Document Type: Dissertation
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA