QOS BASED RESOURCE MANAGEMENT FOR INTERNET APPLICATIONS

by

Yan Yang

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

December 2010

Copyright 2010 Yan Yang

Dedication

To my parents, Zheng and Max...

Acknowledgments

I would like to extend my heartfelt thanks to my advisor, Leana Golubchik, for her continuous support and inspiring guidance during my graduate career. Her encouragement greatly helped me bolster my confidence through the inevitable ups and downs of my PhD study. Without her, I would not have accomplished what I have.

I am very grateful to the other members of my thesis committee, Ellis Horowitz, Jay C.-C. Kuo, Cyrus Shahabi, and Aiichiro Nakano, for their valuable comments and constructive suggestions on my thesis proposal and draft. I also greatly appreciate the audience at my defense presentation for their helpful feedback.

During my graduate study at USC, I was very fortunate to be a member of the Internet Multimedia Lab (IML) and to have the chance to work with a great mentor and many talented colleagues. I would like to thank William Cheng, who supervised me on the Bistro project and helped me solve many problems during the development of the Bistro system. Many thanks to Alix Chow, who has been my main student collaborator on various research projects during the past four years. I am also greatly indebted to the whole IML group, including Leslie Cheung, Yuan Yao, Kai Song, Viktor Chen, Adam W.-J. Lee, Bassem Abdouni, Bo-jun Wang, and Sung-Han Lin, for their helpful feedback on my research work and presentations. There are also other fellow colleagues here at USC I would like to thank for their support and help during my graduate study: Feng Pan, Jun Yang, Donghui Feng, Feili Hou, and Hua Liu.

This thesis is dedicated to my family. I am so grateful to my parents for their constant encouragement, enormous support, and endless love throughout my life. My deepest love and gratitude go to my lovely wife Zheng and our newborn little Max, for always being there for me through the good times and the bad, and for making my life so colorful and joyful!

Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 MultiTorrent
  1.2 P2P VoD
  1.3 Incentives
  1.4 Bistro
Chapter 2: Related Work
  2.1 MultiTorrent
  2.2 P2P VoD
  2.3 Incentives
  2.4 Bistro
Chapter 3: MultiTorrent
  3.1 Background and Motivation
  3.2 Proposed Approach
  3.3 Simulation Study
    3.3.1 Design Space Exploration
    3.3.2 Multi-Torrent Applications
  3.4 Further Discussion
Chapter 4: P2P VoD
  4.1 Performance Metrics and Experimental Setup
  4.2 Peer Request Problem
  4.3 Service Scheduling Problem
  4.4 Heterogeneous Environment
Chapter 5: Incentives
  5.1 Background and Motivation
  5.2 Performance Metrics and Experimental Setup
  5.3 Layered Coding Incentives
    5.3.1 Peer Requesting
    5.3.2 Peer Serving
    5.3.3 Realistic Settings
Chapter 6: Bistro
  6.1 Problem Formulation
    6.1.1 Notation
    6.1.2 Definitions
    6.1.3 Optimization Problem
  6.2 Genetic Algorithm Heuristic
    6.2.1 Genetic Algorithms
    6.2.2 GA in Our Problem
  6.3 Validation and Evaluation
Chapter 7: Summary
  7.1 MultiTorrent
  7.2 P2P VoD
  7.3 Incentives
  7.4 Bistro
Bibliography

List of Tables

3.1 Simulation Settings
3.2 2-Class Bandwidth Distribution
3.3 Average Seed-to-Leecher Ratio
4.1 Simulation Settings
4.2 Different Load Balancing Schemes
4.3 Different Load Balancing Schemes with DAS
4.4 Message Overhead under DAS and Mixed Piece Selection
4.5 Heterogeneous Settings
5.1 Simulation Settings
5.2 Class Description
5.3 System with sufficient capacity
5.4 System with insufficient capacity
5.5 Video Layer Distribution
5.6 Class Description
6.1 Initial Population for Test Cases 1-4
6.2 Bistro Failure Probability Settings for Each Test Case

List of Figures

1.1 Upload vs. Download
1.2 Request Deadline Distribution
1.3 PC (600kbps)
1.4 PC (300kbps Base Layer)
1.5 CI (300kbps Enhancement Layer)
1.6 Traditional Uploads and Uploads with Bistro
3.1 Motivating Example
3.2 Number of Nodes in the System
3.3 Staying Around Improvement
3.4 Inter-Arrival Time
3.5 Staying Fraction
3.6 Game Patch System
3.7 Different CTFT Weight
3.8 Different Seed-to-Leecher Ratio (R)
3.9 492MB File with Bursty Arrival
3.10 80MB File with Bursty Arrival
3.11 NetFlix Top 10 Movies as of 3/18/08
3.12 Adobe SW on MiniNova as of 3/18/08
3.13 Online Movie Rental (Fast Nodes)
3.14 Online Movie Rental (Slow Nodes)
3.15 Software Installer (Fast Nodes)
3.16 Software Installer (Slow Nodes)
4.1 CI (Random, LLP)
4.2 Different N
4.3 CI (LLP-S, LLP-P)
4.4 LLP Update Overhead
4.5 Piece Service Example
4.6 Service Rejection Example
4.7 Overhead (LLP-P+DAS)
4.8 Peer Set Size (CI)
4.9 Peer Set Size (Overhead)
4.10 CI (Mixed Selection)
4.11 LLP (Heterogeneous)
4.12 YNP (Heterogeneous)
5.1 PC (200kbps Base Layer)
5.2 CI (400kbps Enhancement Layer)
5.3 PC (300kbps Base Layer)
5.4 CI (300kbps Enhancement Layer)
5.5 Uplink Usage
5.6 PC (Base Layer)
5.7 CI (Enhancement 1)
5.8 CI (Enhancement 2)
6.1 Graphical Representation of Notation
6.2 Test Case 1: A Scenario with Reliable Conditions
6.3 Test Case 2: A Scenario with Reliable Conditions
6.4 Test Case 3: A Scenario with Error-prone Conditions
6.5 Test Case 4: A Scenario with Error-prone Conditions
6.6 Number of Generations to Converge with Random Initial Population

Abstract

In recent years, many Internet-based applications have arisen to take advantage of widespread inexpensive broadband connections. However, congestion on the Internet is still significant. Therefore, efficient management of Internet resources that leads to improvements in Quality of Service (QoS) for Internet-based applications remains an important problem. In this dissertation, we focus on this problem in the context of several important applications. Specifically, we consider three classes of Internet-based applications, as follows.

File downloading is an important class of applications due to its high bandwidth usage, driven largely by the success of applications such as BitTorrent (BT). While current BT systems use a single-torrent-based approach, in practice torrents are related to each other, and empirical evidence suggests that most nodes participate in multiple torrents. Consequently, in this part of the dissertation, we propose a multi-torrent BT system and illustrate that our approach improves overall system performance and provides incentives for nodes to act as seeds.

The use of P2P-based designs in providing large-scale video streaming services has become more popular due to their effective use of Internet and server resources. Live streaming and Video-on-Demand (VoD) streaming are typical examples which can take advantage of a P2P-based approach. In this part of the dissertation, we focus on a number of fundamental open questions in designing P2P-based VoD systems. We explore practical solutions to these questions and show that they result in better QoS.

Although P2P-based design of VoD systems has received much attention in recent years, most existing VoD systems do not have built-in incentives. In this part of the dissertation, we consider a BT-like VoD system and study the following questions: (1) why an incentive mechanism is needed, and (2) what are appropriate incentives for a BT-like VoD system. We propose a layered-coding-based incentives approach and show that: (a) our approach does provide better incentives than BT's current Tit-For-Tat (TFT) mechanism, and (b) our approach improves system performance as well as uses system resources efficiently.

Uploads correspond to another important class of Internet-based applications. In this part of the dissertation, we consider the following problem in the context of Bistro, a system that focuses on collection of data over the Internet. We focus on a data assignment problem within the Bistro fault tolerance protocol. We formulate this problem as a non-linear optimization problem and develop a genetic algorithm based heuristic as an approximation. We show that our approach is more accurate than several simple heuristics used for comparison, and efficient, as compared to a brute-force approach.

Chapter 1

Introduction

With the proliferation of inexpensive broadband Internet connections, many applications have arisen to take advantage of this widespread bandwidth. Cisco VNI [36] predicts that global IP traffic in 2012 will be 75 times greater than that in 2002. NBC reports that during the Beijing 2008 Olympic Games, the number of clicks on their online video program was 1100 times that of the entire year of 2007. All of this indicates that the Internet is becoming more and more important to our daily life.
However, although computer speed and network bandwidth have increased significantly over the past years, a significant amount of congestion still exists on the Internet. Moreover, many applications require large data transfers. Thus, efficient management of Internet bandwidth resources and improvement of the resulting Quality of Service (QoS) for Internet-based applications remains an important problem.

In this thesis, we look at how to efficiently manage resources for three classes of Internet applications: (1) a Peer-to-Peer (P2P) file sharing system, (2) a P2P Video-on-Demand (VoD) streaming system, and (3) a wide-area upload system. Here, a resource refers to anything on the Internet required by these applications, e.g., network bandwidth for a file sharing system, or server hardware for a wide-area upload system. With efficient management of these resources, our goal is to improve the performance of these applications and provide better Quality-of-Service (QoS) to end users.

Among different Internet applications, file sharing has become one of the most resource-consuming applications. To reduce server cost and improve system scalability, peer-to-peer (P2P) technology has been widely used in recent years. One representative example of such technology is BitTorrent (BT), which utilizes the upload bandwidth resources of downloading peers to efficiently transfer data among peers. According to CacheLogic [5], BT was responsible for 35% of all Internet traffic in 2005. Because of its unique characteristics and its huge success, BT has attracted a lot of interest from the research community. While most of the existing literature is focused on a single-torrent system, many published torrents are related in practice. In Chapter 3, we propose and study a MultiTorrent system which encourages peers to stay around while downloading in multiple torrents. Nodes staying around provide seeds for finished torrents, and such behavior improves system performance (refer to Chapter 1.1). Compared to the current BT system, which is based on a single torrent, resources in the multi-torrent system are managed more efficiently, resulting in performance improvements.

In addition to file sharing, real-time online streaming is also becoming more and more popular. Cisco VNI [36] predicts that by 2013, global online video will correspond to 60% of consumer Internet traffic (up from 32% in 2009). Video content sharing web sites such as YouTube have also gained significant popularity - it is reported that YouTube has more than 100 million downloads per day. Compared to file sharing, online streaming has strict real-time requirements, which translates into more stringent constraints on network resources. However, traditional approaches would require a streaming content provider to maintain a large server farm, which is a costly solution. For example, it was recently reported that YouTube spent ≈ 460M USD on server costs in 2008. Therefore, P2P technology has become an important trend in building scalable, low-cost online streaming systems. In Chapter 4, we identify several fundamental questions in utilizing P2P technology for online Video-on-Demand (VoD) streaming. We propose practical schemes aimed at addressing these questions and conduct a performance study.

Although P2P-based design of VoD systems has received a lot of interest in recent years, little exists to provide incentives for such systems.
In particular, most existing VoD systems either do not have incentives or directly use the incentive mechanism from P2P file sharing applications, e.g., BT. In Chapter 5, we consider a BitTorrent (BT)-like P2P VoD system and focus on the following questions: (1) why an incentive mechanism is important, (2) how the lack of an incentive mechanism results in a lack of QoS differentiation in a P2P-based VoD system and affects system performance, and (3) what are practical approaches to providing incentives in a P2P VoD system. We first illustrate why single-layered coding, which is used by most P2P VoD systems, makes it hard to provide incentives. Motivated by this, we propose to use layered coding and propose Layered Coding Incentives (LCI) as an incentive mechanism for a BT-like VoD system. Our simulation study illustrates that our approach provides: (a) a basic service for all users; (b) a good differentiated service; and (c) an efficient use of system resources.

High demand for services or data creates hot spots, which is a major hurdle to achieving scalability in Internet-based applications. In many cases, hot spots are due to real-life deadlines associated with some events, such as submissions of papers to conferences and electronic submissions of income tax forms. Wide-area uploads correspond to an important class of Internet applications. In traditional upload applications, every client sets up a one-to-one communication channel with the destination server and transfers data through that channel. This is simple yet not scalable, and it creates hot spots. Bistro is a wide-area upload system built at the application layer, and it avoids creating hot spots by breaking an upload process into three steps (refer to Chapter 1.4). In Chapter 6, we investigate a data assignment problem in a fault tolerance protocol of Bistro, a wide-area upload framework. In the data assignment problem, we need to determine how much data goes to each bistro and how this affects the probability that the final destination can successfully receive the data from intermediate bistros or recover it. Our goal is to make data uploads in Bistro more reliable, and specifically to maximize the probability that the destination receives or is able to recover the original data.

The remainder of this document is organized as follows. We go over related work in Chapter 2. We give an overview of our results on multi-torrent schemes in Chapter 3, on P2P VoD schemes in Chapter 4, on incentives for P2P VoD streaming in Chapter 5, and on data assignment in upload systems in Chapter 6. Our contributions for this thesis are summarized in Chapter 7.

1.1 MultiTorrent

Most of the current studies ([55, 2, 18, 54]) on BT (with a notable exception of [25]) are focused on a single-torrent system – they assume that a node joins a single torrent in a system, where all nodes in that system download from the same torrent and eventually leave the system. In practice, many published torrents are related (e.g., different episodes of a TV show or movies/music from the same genre), which results in many peers downloading data from multiple torrents. Specifically, from the trace analysis given in [25], more than 85% of peers participate in multiple torrents. However, in the current BT system, there is no mechanism to "relate" the multiple downloads of the torrents in which a particular node is participating - we term this a multi-torrent approach. There are a number of applications which could benefit from such an approach.
For instance, fast delivery of patches in popular multi-player online games (e.g., [4]) is an important problem. Such games have millions of subscribers who (periodically) need to download patches, resulting in substantial Internet traffic. Such downloads need to be completed fairly quickly, as users cannot continue to play the game until all patches are downloaded. Since each user (potentially) downloads multiple patches, this indicates that not only is P2P technology useful for this application, but also that there is an opportunity for a multi-torrent approach. An interesting application in this context would be a software installer, e.g., such as the Google Pack installer or the Microsoft Live installer, where users would typically download some subset of the available software modules simultaneously, providing another opportunity for exploring a multi-torrent approach. Similar applications include software update installers, such as Microsoft Windows updates, Mac OS X updates, Adobe updates, and so on. Multimedia online applications, such as the iTunes Movie Store, the NetFlix Video-on-Demand service, and YouTube, can benefit from a multi-torrent approach as well. In these applications, users normally have sufficient storage space to keep previously viewed content. This makes the use of multi-torrents a natural approach in such applications. (A more detailed description of such applications is given in Chapter 3.3.)

Intuitively, a multi-torrent approach might have the following advantages. Typically, it is desirable to have "seeds" in a torrent (i.e., nodes which have completed their download but continue to contribute their uploading capacity to the system). Briefly, seeds can contribute to (a) helping newly joined nodes ramp up so that they can become contributing peers faster, (b) helping nodes nearing the end of their downloads find the last few file pieces/chunks faster, and (c) keeping a torrent "alive" (i.e., making sure that all pieces of a file are available in the system). (A more detailed discussion of these benefits is given in Chapter 3.1.) Although current studies, e.g., as in [25, 39], suggest that seeds do exist in BT, there are no incentives in the current BT system for nodes to provide seeding capacity (e.g., by staying around as seeds after their download in a particular torrent is complete or by sharing files downloaded earlier). However, if such incentives were provided – for instance, if a node was to receive better service in torrent A because it acts as a seed in torrent B – then, intuitively, seeding behavior of nodes in BT might increase. Exploring an approach to providing such an incentive and evaluating the resulting BT performance characteristics is the main topic of this work.

Surprisingly, little exists in the literature which considers relating multiple torrents. To the best of our knowledge, [25] is the only work which studies this topic. Specifically, in [25] the authors develop an analytical model of multiple torrents where it is assumed that a node participating as a "leecher" (a node which has not yet completed its download) in a particular torrent is willing to serve as a seed in torrents in which it has participated some time earlier in its lifetime. This model is then used to illustrate that this multiple torrent approach can extend the lifetime of a torrent. Here, the lifetime of a torrent is the time until some piece/chunk is lost from a torrent due to the departure of one of its nodes.
(A more detailed description of [25] is given in Chapter 2.1.) While the work in [25] provides intriguing ideas and results, it leaves a number of open questions. Two such questions are: (1) how to provide appropriate incentives for nodes to contribute resources as seeds, preferably in a distributed manner and through simple/local changes, and (2) what are the resulting performance consequences of such behavior, both for the nodes which are willing to be seeds and for the overall system.

In our work in Chapter 3, we focus on the above stated questions using our proposed multi-torrent approach. We first illustrate how nodes staying around in the system as seeds after finishing their downloads helps improve the overall system performance. We then show why the current BT unchoking algorithm lacks incentives for nodes to stay around as seeds in a multiple torrent environment. Motivated by that, we propose a cross-torrent-based method to encourage nodes to stay around as seeds and perform an extensive performance study of our techniques. Thus, the contributions of this work are as follows:

• We propose a multi-torrent BT system which can be easily implemented through fairly small modifications of the current BT protocol (refer to Chapter 3.1), unlike the approach proposed in [25], which requires modifications to the BT tracker. Thus, we believe that our approach is scalable and easily deployable.

• We propose a "cross-torrent-based" tit-for-tat (CTFT) strategy motivated by providing incentives for nodes to act as seeds (refer to Chapter 3.2; a minimal illustrative sketch follows this list). Unlike the exchange-based incentives suggested in [1], which require nodes to maintain and search through a request tree before transmitting file chunks, CTFT only uses peer transmission rate information and does not require additional modifications to the current BT system. We believe it is a more efficient, scalable, and easily deployable approach.

• We perform an extensive simulation-based study which illustrates that (a) our approach does improve the overall performance of the system and (b) our approach does provide incentives for nodes to act as seeds by providing better performance for such nodes (refer to Chapter 3.3).
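Since CTFT ranks peers by transmission rates alone, the unchoke decision can be expressed compactly. The following Python sketch shows one way such a cross-torrent unchoke step could look; the data layout, function names, and the fixed cross-torrent weight w are illustrative assumptions for this example, not the exact algorithm of Chapter 3.2.

    # Minimal sketch of a cross-torrent tit-for-tat (CTFT) unchoke step.
    # Regular TFT ranks peers only by the rate they recently uploaded to
    # us in *this* torrent; CTFT also credits the rates they provide in
    # other torrents (e.g., while seeding there). The weight `w` is a
    # hypothetical knob, not a value taken from Chapter 3.
    def ctft_unchoke(rates, this_torrent, num_slots=4, w=0.5):
        """rates: {peer_id: {torrent_id: recent upload rate to us (B/s)}}
        Returns the peer ids selected for the unchoke slots."""
        def score(peer_id):
            by_torrent = rates[peer_id]
            local = by_torrent.get(this_torrent, 0.0)
            cross = sum(r for t, r in by_torrent.items() if t != this_torrent)
            return local + w * cross  # cross-torrent contribution counts too

        ranked = sorted(rates, key=score, reverse=True)
        return ranked[:num_slots]

    # Peer "b" uploads nothing in this torrent but seeds torrent T2 to us,
    # so it can still win a slot over a weak local uploader.
    print(ctft_unchoke({"a": {"T1": 50_000}, "b": {"T2": 200_000},
                        "c": {"T1": 10_000}}, this_torrent="T1",
                       num_slots=2))  # ['b', 'a']

This captures the incentive: seeding torrent B earns reciprocation in torrent A, which plain per-torrent TFT can never provide.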
1.2 P2P VoD

In recent years peer-to-peer (P2P) systems have become an effective approach to video content distribution. In particular, many P2P live streaming systems, such as PPLive [34] and CoolStreaming [64], have been deployed and gained wide popularity. Because of their scalability characteristics, significant research effort (e.g., [70], [61], [67], [49], to name a few) has been focused on the design of P2P systems for live streaming. These and other works provide insight into performance, scalability, dynamics, and other characteristics of P2P live streaming systems (see Chapter 2.2 for a more detailed overview and the relationship to our work).

More recently, interest in the P2P community has also turned toward another important and technically challenging application, specifically Video-on-Demand (VoD) service for high-quality, full-length movies. Examples of such systems include Joost, Hulu, Netflix, the iTunes Store, and so on. Naturally, P2P-based designs have been considered in the context of VoD applications as well. For instance, the study in [33] shows that the MSN video server load can be reduced by ≈ 95% through the use of a P2P-based approach. A P2P-based design of VoD applications is also the focus of our work.

Before describing our contributions, we first note that there are a number of fundamental differences between VoD and live streaming. VoD applications have a greater data diversity than live streaming applications. In live streaming systems, nodes typically request data around a particular playback point - that is, users are watching the stream at around the same time. In contrast, in VoD systems nodes request videos at different times, and thus their playback points differ greatly. One implication of this is that nodes in live streaming applications only need a playback buffer of several minutes (as peers are clustered around the same playback point), while nodes in VoD applications may need to hold the entire movie (as each node tends to have a different playback point). Another implication is that playback deadlines of file pieces in VoD have a larger variance than those in live streaming. This is partly due to (a) the playback points of nodes being dependent on their arrival time (which is diverse) and (b) nodes in VoD systems potentially requesting data relatively far into the future (from their playback point), as VoD data is pre-recorded (in contrast to data being generated live).

In this thesis, we consider a BitTorrent (BT)-like P2P design, as BT is one of the more popular P2P systems. BT is also a highly efficient system, e.g., the study in [2] illustrates that it makes efficient use of peers' upload bandwidth.

Because of its success, a number of previous efforts, including [52], [9], [61], [35], have focused on adapting BT (originally designed for file downloads) to VoD applications. To adapt BT for streaming, a number of fundamental issues need to be addressed. To illustrate the difference between using BT for file downloads and for VoD streaming, we briefly summarize two such issues, which are studied, e.g., in [52], [9], [61], [35]. Firstly, the default piece selection strategy used in BT is not well suited for VoD applications, as BT uses a rarest-first strategy to determine which piece a peer should request next. As such a strategy does not consider the real-time playback constraints of video, it is unlikely to lead to good quality of service (QoS), as shown in [35], [70]. Secondly, BT's built-in incentive mechanism, TFT (as described above), is not well suited for VoD systems due to the asymmetric peer relationship in VoD applications, i.e., young nodes can download from old nodes but not vice versa, as shown in [52]. Approaches to address these and other issues in the context of both live streaming and VoD systems are discussed in a number of works, including [33], [28], [69], and [58]. However, several open fundamental questions still remain, which are particular to P2P-based VoD systems, and (to the best of our knowledge) previous efforts either do not address them (at least not for BT-like systems) or do not propose solutions that lead to practical implementations. We describe two such fundamental issues next, and focus on exploring practical solutions to the corresponding problems in Chapter 4.

One such question is to which peer a node should send a request for a data piece, among all neighbors which have that piece; we will refer to this as the "peer request problem". For instance, simply picking such a neighbor at random has the disadvantage that older peers (nodes which arrived earlier and have a larger fraction of the content) receive more requests from many younger peers (nodes which arrived later).
Figure 1.1 11 0−9 20−29 40−49 60−69 80−89 100 0 5 10 15 20 25 30 35 Download Progress (%) Upload Progress (%) Random LLP Random LLP Figure 1.1: Upload vs. Download depicts percentage of upload performed by a node as a function of percentage of down- load completed in a typical P2P V oD system, obtained by simulation (refer to Chap- ter 4.1). Compared to an “ideal” load balancing scheme (referred to as “LLP” and explained in Chapter 4.2), randomly picking the neighbor to which a request is sent results in unbalanced workload over the age of a node, as≈ 50% of uploads occur after 80% of the download is completed. Consequently, the data request load is unevenly dis- tributed among the peers, (intuitively) leading to losses in QoS because: (1) requests sent to overloaded older peers suffer from long waiting times, and (2) the wasted bandwidth of young peers reduces the overall system capacity and further slows down the stream- ing process. Such a load balancing problem is briefly discussed in [52], with possible solutions based on an approach proposed in [21] but in the context of file downloading 12 0 200 400 600 800 1000 0 2 4 6 8 10 x 10 5 Time (Sec) #Request Figure 1.2: Request Deadline Distribution rather than V oD streaming (see Chapter 2.2 for details). We study the peer request prob- lem in Chapter 4.2 (where we also compare our approaches to other approaches, e.g., those suggested in [52, 45, 24]). Another fundamental question is which request for data in its queue should a peer serve next, among all the requests for data made to that peer; we will refer to this as the “service scheduling problem”. Typically, in a BT-like V oD system, as in [52], each node can serve up to a certain number of requests concurrently, with the remaining requests waiting in a queue until a service slot becomes available. As noted earlier, the play- back deadlines of the requested data pieces in a V oD system are quite diverse, some having urgent deadlines and others being more relaxed. This diversity can be observed in Figure 1.2, which depicts a request deadline distribution obtained from simulations 13 (refer to Chapter 4.1). We can see that although many requests have a very short dead- line, e.g.,≈ 33% of requests have deadlines within50 seconds, there are some requests that have very long deadlines, where the longest deadline for a request is≈ 1000 sec- onds. Intuitively, a service discipline that would reduce the probability of a data piece missing its playback deadline should result in better QoS. In [52], FCFS scheduling is suggested. We study this scheme in Chapter 4.3 and show that better solutions (which take advantage of deadline diversity) are possible. Another question within the service scheduling context is whether all requests for data pieces should be served, and if not, which ones should be rejected. Existing V oD designs, including [34], [9], [52], [35], accept and serve every request. A natural ques- tion is – if a request will (likely) miss its deadline, why waste resources by serving it? The disadvantages of not identifying and rejecting such requests are that: (a) serving such a request wastes upload resources (as this request will likely miss its deadline and not be played) and (b) the request in question might be able to receive timely service from another (perhaps less loaded) peer. Intuitively, in either case, it might be best to reject the request in question which should result in better QoS overall. We study this service rejection problem in Chapter 4.3. 
In summary, in Chapter 4 we focus on the two questions stated above. We first motivate these questions through examples. We then propose practical approaches to address them. We also show that solely addressing one of these questions is not sufficient for achieving high QoS, and thus a good P2P VoD streaming system should consider both. For ease of exposition, we first present our schemes in a homogeneous environment (i.e., where nodes have equal upload capacities). We then show how our schemes can be simply adapted to heterogeneous environments (refer to Chapter 4). Thus, the contributions of this work are as follows:

• We explore practical solutions to the "peer request problem" (described above) that can be easily implemented through fairly small modifications to the current BT protocol - these approaches result in better QoS in the VoD system and at the same time are scalable, efficient, and easily deployable today (see Chapter 4.2).

• We propose the use of Deadline-Aware Scheduling (DAS), which includes an earliest deadline first (EDF) scheduling approach and an early drop (EDP) based approach, to address the "service scheduling problem". We show that DAS results in better QoS in a VoD system. To the best of our knowledge, this work is the first to explore the use of earliest deadline first and early drop approaches in the context of BT-like VoD systems (see Chapter 4.3).

• We show that addressing the "peer request problem" or "service scheduling problem" alone is not sufficient to achieve high QoS, i.e., that an appropriate combination of good solutions to each question is needed in a P2P VoD streaming system to provide high QoS with low overhead. To support this claim, we present an extensive evaluation study on the use of these approaches under a variety of environments (see Chapter 4.3).

[Figure 1.3: PC (600kbps)]

1.3 Incentives

As we showed in Section 1.2, peer-to-peer (P2P) systems have become an effective approach to video content distribution, and in recent years attention in P2P-based design has turned toward another important and technically challenging application, specifically Video-on-Demand (VoD) service for high-quality, full-length movies. Examples of such systems include Joost, Hulu, Netflix, the iTunes Store, and so on.

One of many interesting aspects of BT is its incentive mechanism. BT uses TFT to encourage nodes to contribute to the system, and it is one of the major factors contributing to BT's success. BT's success shows not only that a P2P system can be highly efficient but also that a good incentive mechanism is essential. However, surprisingly little work exists on providing incentives for P2P VoD systems. In fact, most popular P2P VoD systems, e.g., PPLive [34], have no built-in incentives.

[Figure 1.4: PC (300kbps Base Layer)]

We believe the following reasons contribute to this. First, many VoD systems are proprietary, and users are forced to share their upload bandwidth; not having an incentive mechanism simplifies system design for these systems. For example, users in PPLive [34] are forced to share a portion of their upload bandwidth. Second, P2P streaming, including VoD streaming, has real-time playback constraints, and download rate is no longer a good way to differentiate node contribution: downloading a data chunk too slowly or too quickly is useless for the actual video playback.
Hence, download rate is not an appropriate incentive in P2P VoD systems. Third, compared to P2P live streaming, where all users play at a similar point, playback in VoD streaming is more diverse, which results in more data diversity. Due to this data diversity, bi-directional data exchange cannot always happen, i.e., nodes who stay in the system longer cannot get data chunks from newly joined peers [52]. This reduces the effectiveness of bi-directional incentive mechanisms, such as TFT.

[Figure 1.5: CI (300kbps Enhancement Layer)]

Intuitively, the lack of an incentive mechanism in P2P VoD systems is bad: (a) the system lacks service differentiation, i.e., nodes receive similar service regardless of their contribution; and (b) the system largely depends on nodes with higher upload bandwidth, but the lack of service differentiation makes those nodes unwilling to stay in the system, which in turn hurts performance. To the best of our knowledge, previous works on P2P VoD systems either do not consider incentives or directly apply existing incentive mechanisms from P2P file sharing systems, e.g., [35]. These approaches are not sufficient for good system performance (refer to Section 5.1). Therefore, providing an incentive mechanism for P2P VoD systems is still an open problem, and we believe it is important for the performance of a P2P VoD system. In this work, we consider a BitTorrent (BT)-like P2P VoD design similar to Chapter 4 and focus on providing incentives to improve system performance.

More specifically, we propose Layered Coding Incentives (LCI), an incentive mechanism for BT-like VoD systems using layered coding. With layered coding, a video file is coded into multiple layers with nested dependency: a higher layer refines the video generated by lower layers (see Section 5.1). A user's video quality thus depends on the number of layers it receives. Layered coding is used in other works to provide incentives for P2P live streaming systems, e.g., [47, 46]. However, due to the differences between live streaming and VoD, we cannot directly use the existing approaches (refer to Section 2.3). Thus, the contributions of this work are as follows:

• We study the questions: (1) why an incentive mechanism is needed; and (2) what are appropriate incentives for a P2P VoD system. We first illustrate why single-layered coding, which is used by most P2P VoD systems, makes it hard to provide incentives. Motivated by that, we propose to use layered coding as incentives for a BT-like VoD system, and we show that the resulting system has good performance with built-in incentives.

• We propose an approach for nodes to download data chunks from different layers adaptively. This approach decides which layers to download based on the current downloading progress (see Section 5.1; a small illustrative sketch follows this list). It is easy to implement and makes efficient use of node capacity (see Section 5.3).

• We propose approaches for nodes to serve peers aiming to ensure that: (1) a basic service is provided to all nodes in the system; (2) differentiated service is received among heterogeneous nodes; and (3) node capacity is efficiently used. To the best of our knowledge, this is the first work proposing to provide a basic service to all users in BT-like VoD systems. We show that our approaches achieve these objectives. Our approaches are completely decentralized and thus are very easily deployable (see Section 5.3).

• We explore the effects of our proposed solutions through an extensive simulation-based study which illustrates that our approaches do result in a better incentive mechanism in BT-like VoD systems than TFT (see Section 5.3).
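As a rough illustration of the adaptive layer selection idea (the actual policy is specified in Chapter 5), the Python sketch below always requests the base layer and adds an enhancement layer only while every lower layer is buffered comfortably ahead of the playback point. The margin threshold and the function interface are assumptions made for this example.

    # Sketch of adaptive layer selection under layered coding. Because of
    # the nested dependency, an enhancement layer is only useful if all
    # layers below it arrive in time, so we request it only when the
    # lower layers are safely ahead. `margin` is a hypothetical knob.
    def layers_to_request(buffered_secs, num_layers, margin=10.0):
        """buffered_secs[i]: contiguous seconds of layer i already
        downloaded beyond the playback point."""
        wanted = {0}  # the base layer is always requested
        for layer in range(1, num_layers):
            if all(buffered_secs[l] >= margin for l in range(layer)):
                wanted.add(layer)
            else:
                break  # higher layers are useless without this one
        return wanted

    print(layers_to_request([25.0, 6.0, 3.0], num_layers=3))
    # {0, 1}: layer 2 is skipped until layer 1 is further ahead

A node whose capacity cannot keep even the base layer ahead of playback simply stops requesting enhancement layers, which is what makes the use of node capacity efficient.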
1.4 Bistro

High demand for some services or data creates hot spots, which is a major hurdle to achieving scalability in Internet-based applications. In many cases, hot spots are associated with real-life events. There are also real-life deadlines associated with some events, such as submissions of papers to conferences, electronic submissions of income tax forms, submissions of proposals to granting agencies, homework submissions in distance education, and online shopping for limited-time bargain products. The demand on applications with deadlines is potentially highest as the deadlines approach.

Previous work attempted to relieve hot spots in one-to-one communications, such as email and instant messaging, one-to-many communications, such as web downloads, and many-to-many communications, such as chat rooms and video conferencing. A number of techniques have been introduced, including service replication (e.g., replication of DNS servers), data replication (e.g., web caching and Akamai), and data replacement (e.g., different streaming rates for audio and video streaming). To the best of our knowledge, however, there are no research attempts to relieve hot spots in many-to-one applications, except for Bistro [3].

Many-to-one applications, or upload applications, correspond to an important class of applications, and particularly digital government applications. Specifically, government at all levels is a major collector and provider of data, and there are clear benefits to disseminating and collecting data over the Internet, given its existing large-scale infrastructure and wide-spread reach in commercial, private, and government domains. Broadly, this work is in the area of collection of data over the Internet, where the focus is on the scalability issues which arise in the context of Internet-based massive data collection applications. By data collection, we mean applications such as Internal Revenue Service (IRS) applications with respect to electronic submission of income tax forms. Briefly, other such applications are as follows. The Integrated Justice Information Technology Initiative facilitates information sharing among state, local, and tribal justice components. An integrated (global) information sharing system involves collection, analysis, and dissemination of criminal data. Clearly, in order to facilitate such a system one must provide a scalable infrastructure for collection of data. Furthermore, a number of government agencies (e.g., NSF, NIH) support research activities, where the funds are awarded through a grant proposal process, with deadlines imposed on submission dates. The entire process involves not only submission of proposals, which can involve fairly large data sizes, but also a review process, a reporting process (after the grant is awarded), and possibly a results dissemination process. All these processes involve a data collection step. Lastly, digital democracy applications, such as online voting during federal, state, or local elections, constitute another set of massive upload applications.
Of course, there are numerous other examples of digital government applications with large-scale data collection needs. In all these upload applications, hot spots are due to a demand for a popular service. The real-life event which causes hot spots often imposes a hard deadline on the data transfer service.

Bistro is a wide-area upload architecture built at the application layer, and previous work [7] has shown that it is scalable and secure. Figure 1.6 compares the upload data flow in traditional applications and in the Bistro framework.

[Figure 1.6: Traditional Uploads and Uploads with Bistro]

In traditional upload applications, every client sets up a one-to-one communication channel with the destination server and transfers data through that channel. This is simple yet not scalable, and it creates hot spots. In Bistro, an upload process is broken down into three steps [3]. First, in the timestamp step, clients send hashes of their files to the server and obtain timestamps. These timestamps clock the clients' submission times. In the data transfer step, clients send their data to intermediaries called bistros. In the last step, called the data collection step, the server coordinates the bistros to transfer the clients' data to itself. The server then matches the hashes of the received files against the hashes it received directly from the clients. The server accepts files that pass this test, and asks the clients to resubmit otherwise. This completes the upload procedure in the original Bistro architecture. (See [3, 7] for full details of the protocol.)
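The commit-then-verify idea behind the timestamp and data collection steps can be sketched in a few lines of Python. The snippet below illustrates only the hashing logic; the function names and the in-memory stores are hypothetical, and the real protocol additionally involves encryption and other security mechanisms [3, 7].

    import hashlib
    import time

    def client_commit(data: bytes) -> bytes:
        # 20-byte SHA1 checksum: what the client sends at timestamp time,
        # instead of the full (much larger) file.
        return hashlib.sha1(data).digest()

    def server_timestamp(commitments, client_id, checksum):
        # The server stores the checksum and clocks the submission time.
        commitments[client_id] = (checksum, time.time())

    def server_verify(commitments, client_id, data: bytes) -> bool:
        # Data collection step: the file later pulled from an untrusted
        # bistro must hash to the checksum committed earlier.
        checksum, _ = commitments[client_id]
        return hashlib.sha1(data).digest() == checksum

    commitments = {}
    form = b"...tax form contents..."
    server_timestamp(commitments, "client42", client_commit(form))
    print(server_verify(commitments, "client42", form))         # True
    print(server_verify(commitments, "client42", form + b"x"))  # False -> resubmit

Because finding a different file with the same checksum is practically impossible, the client is committed to the file's contents from the moment the timestamp is issued, even though the bytes themselves travel later.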
A checksum is typically much smaller than the size of a tax form (20 bytes for a SHA1 checksum versus 100KB for a tax form), so the problem mentioned above is alleviated. After the client receives a timestamp, s/he cannot change the content of his/her tax form without IRS detecting it. This is because it is practically impossible to modify a tax form in a meaningful way and have it result in the same checksum. At this point, since we are essentially guaranteed that the client cannot change the form, we can transfer the form at a later time. The client then sends the entire tax form to one of the bistros. Since we assume that anyone can install a bistro server, bistros are not trusted; hence a secure protocol was developed to prevent bistros from accessing content of uploaded files [7]. 24 The cost of maintaining a Bistro system is much lower, because we can use the same set of intermediate bistros in many other upload events, e.g., online voting in federal, state, or local elections. Finally, the IRS server will contact the bistros to transfer the form from the chosen bistros to itself. It is possible that when the destination server tries to pull clients’ data, some interme- diaries are not available due to, for example, power failure or network problems. Then all data on the unavailable bistros is not accessible and can be considered lost. In addi- tion, since one of the design goals in [3, 7] is to deploy Bistro in a public infrastructure, such as the Internet, intermediaries are not trusted. As a result, clients encrypt their data to ensure data privacy. Yet, even though malicious bistros cannot read the data, they can modify it. The destination server can detect this, but it has no way of recovering the original data, which results in data losses. In case of data losses, the original Bistro protocol requires the destination server to request client resubmissions, which can be a slow process. Hence, unavailable bistros and malicious behavior can result in degraded system performance. To improve system performance in the face of these failures and malicious behavior, we developed a fault tolerance protocol in [8]. In this protocol, we use forward error correction techniques to recover corrupted or lost data. We add redundancy to clients’ data and modify the data transfer in the original protocol to allow clients to stripe their data to multiple intermediate bistros. In particular, we use erasure codes to recover the lost data. An (n,k) erasure code encodes a FEC (forward error correction) group ofk 25 packets inton packets. As long as we receive anyk packets from a FEC group during transmission, the receiver can recover the file. In addition to forward error correction techniques, we employ striping techniques to further improve system reliability. In our fault tolerance protocol, we stripe clients’ data across a number of bistros, in contrast to putting all data from a client on the same bistro in the original protocol. This is done to reduce the risk of losing all clients’ data and hence, improve system reliability and performance. In this thesis (refer to Chapter 6), we are interested in a data assignment problem within the context of the Bistro fault tolerance protocol. In the data assignment prob- lem, we need to determine how much data should go to each bistro and how this affects the probability that the final destination can successfully receive the data from interme- diate bistros or recover it, where our goal is to maximize that probability. 
We believe that data assignment problem is an important problem in the Bistro fault tolerance pro- tocol: (1) since intermediate bistros are not trusted, and a well designed data assignment strategy can make the system more reliable; and (2) since every intermediate bistro has different reliability characteristics, different assignments affect system reliability. In this problem, a client needs to decide how much data it should stripe to each bistro. If the destination server receives at leastk out ofn packets from a FEC group, it can recover the original data. Let us go back to the IRS example. Suppose we have 3 intermediate bistros,b1,b2, andb3. Assume that their failure probabilities are 0.1, 0.2, and 0.3, respectively. Also 26 assume that the client uses a (3,2) erasure code to encode his/her tax form, and the size of the encoded tax form is 3 packets. That is, the IRS server needs to collect at least 2 out of 3 packets. Two possible (out of several) ways to assign these packets to the intermediate bistros are: (1) assign all of them to b1, the most reliable bistro, or (2) spread them among all bistros. We are interested in the probability that the IRS server receives the tax form, i.e., receives at least 2 out of 3 packets successfully. In the first scheme, the probability is 0.9. In the second scheme, the probability is 0.902. From this example, we can see that the assignment of packets affects the reliability of the Bistro system. Our task is to find an optimal assignment such that the destination has a maximal probability of collecting enough packets of the file in order to be able to reconstruct it. In this thesis, we use a genetic algorithm heuristic to approximate an optimal solution of this data assignment problem. The detailed problem formulation can be found in Chapter 6.1. The contributions of this work are as follows. We formulate the data assignment problem in the Bistro fault tolerance protocol. We then propose a genetic algorithm heuristic to approximate an optimal solution to this problem which is more accurate than several simple heuristics used for comparison, as well as efficient, as compared to a brute-force approach. 27 Chapter 2 Related Work In this chapter, we give an overview of literature related to this dissertation proposal. Firstly, we describe related work in the context of the multi-torrent work. Then we review previous works on P2P V oD systems. We then present related work in the con- text of the data assignment problem. Lastly, we discuss related literature on providing incentives for P2P V oD streaming systems, which inspires our future work. 2.1 MultiTorrent BT (described by [12]) has been the topic of a large number of research efforts. Thus, here we give a brief overview of those works most related to ours. Although measure- ment studies, e.g., as in [25, 39], indicate that seeds exist in real BT systems, there are no incentives for nodes to stay around as seeds. In our work, we focus on providing incentives for nodes to stay around as seeds in the context of a multi-torrent system. As noted, most BT studies focus on a single torrent while measurements in [25] suggest that85% of users participate in multiple torrents. To the best of our knowledge, [25] is the only other effort which studies multiple torrents. 
In this thesis, we use a genetic algorithm heuristic to approximate an optimal solution to this data assignment problem. The detailed problem formulation can be found in Chapter 6.1.

The contributions of this work are as follows. We formulate the data assignment problem in the Bistro fault tolerance protocol. We then propose a genetic algorithm heuristic to approximate an optimal solution to this problem, which is more accurate than several simple heuristics used for comparison, as well as efficient, as compared to a brute-force approach.

Chapter 2

Related Work

In this chapter, we give an overview of the literature related to this dissertation. We first describe related work in the context of the multi-torrent work. We then review previous works on P2P VoD systems, followed by related literature on providing incentives for P2P VoD streaming systems. Lastly, we present related work in the context of the data assignment problem.

2.1 MultiTorrent

BT (described by [12]) has been the topic of a large number of research efforts. Thus, here we give a brief overview of those works most related to ours. Although measurement studies, e.g., as in [25, 39], indicate that seeds exist in real BT systems, there are no incentives for nodes to stay around as seeds. In our work, we focus on providing incentives for nodes to stay around as seeds in the context of a multi-torrent system.

As noted, most BT studies focus on a single torrent, while measurements in [25] suggest that 85% of users participate in multiple torrents. To the best of our knowledge, [25] is the only other effort which studies multiple torrents. Specifically, the authors of [25] develop an analytical model of multiple torrents where it is assumed that a node participating as a leecher in a particular torrent is willing to serve as a seed in torrents in which it has participated some time earlier in its lifetime. This model is then used to illustrate that their multiple torrent approach can extend the lifetime of a torrent, i.e., the time until some chunk is lost from a torrent due to the departure of one of its nodes. The work in [25] also proposes a multi-torrent design based on a tracker overlay system, used to facilitate information exchange among peers (e.g., which peer can participate as a seed in which torrent).

The work in [25] opens an interesting research problem and provides useful solutions. However, it does leave a number of open questions (as discussed in Chapter 1.1). Specifically, it does not explore how to provide incentives for nodes to act as seeds in torrents they have completed earlier (it only briefly suggests the use of exchange-based incentives (see [1]) without providing details of how these would be used). In contrast, our work focuses on addressing this question. Moreover, [25] does not provide a quantitative evaluation of the performance consequences for the nodes willing to be seeds as well as for the overall system – the focus of [25] is primarily on extending a torrent's lifetime (as described above). In contrast, our work illustrates a number of performance tradeoffs to be considered in a multi-torrent system. Our work differs from [25] in several other aspects as well. In our experiments, nodes can join their torrents of interest simultaneously and behave as seeds while they have not completed all their downloads, whereas [25] only considers what we referred to as "re-seeding". In addition, the proposed multi-torrent design in [25] requires modifications to the BT tracker, and specifically requires a tracker overlay system. In contrast, our approach makes only local modifications, while maintaining the original spirit of BT where nodes exhibit local selfish behavior (as discussed earlier).

A number of works, e.g., [2, 18, 54, 42], studied various incentive-related issues in the context of single-torrent systems. For instance, the authors of [19] use a market-based approach to incentivize sharing, where efficient network resource allocation is achieved by relating file values (in the market) to their relative demand. At a high level, our dynamic seeding approach can be thought of as also attempting to match seeding capacity demand with supply; however, it does that in a distributed manner and by exploiting local information only. Provision of incentives in P2P systems is also studied in the context of virtual currency or micropayment based approaches, e.g., as in [68], [60]. The main limitation of such an approach is that currency management is fairly complex and a centralized administrative authority is often needed. In contrast, CTFT requires only local knowledge and is more practical for real-world deployment. Lastly, Piatek et al. (see [54]) is an example of a work which develops a strategic BT client (as an illustration of what can be achieved with malicious behavior) which attempts to get the most reciprocation through better assignment and scheduling of upload capacity to peers. They also briefly mention how such a client would work in multiple swarms (i.e., multiple torrents). However, no details or evaluation are given.
The primary focus of that work is to exploit the BT system and show that its performance can be degraded by clients which try to maximize the download capability of a malicious node, i.e., such a node would not act as a seed. In our work, we try to encourage nodes to stay as seeds by providing better service to the seeding nodes, which results in an overall system performance gain. Because our work provides incentives (through better performance) for nodes to stay around as seeds, we conjecture that our CTFT mechanism would hurt the client proposed in [54], as it is not expected to be a seed.

2.2 P2P VoD

The design of VoD systems has received attention from the systems, networking, and signal processing communities for a number of years. Early efforts mainly used traditional client-server architectures and focused on how to efficiently disseminate video content from the server to clients; this includes efforts such as patching [22], periodic broadcasts [32], stream merging [15], and so on. Recently, much of the research focus in this area has shifted to P2P-based designs, e.g., [33, 34, 58, 69]. In [33] the authors conduct measurements and a simulation study, using data traces from the MSN video service, and show that a P2P-based approach can greatly reduce server cost. In [34] the authors discuss challenges and design issues of PPLive, a popular P2P streaming system with millions of real-world users, and [58] studies a P2P VoD system consisting of set-top boxes in a homogeneous DSL network. The work in [69] proposes a tree-based P2P VoD system which organizes nodes by arrival time and tries to combine the advantages of tree-based and mesh-based overlays. While these works provide interesting insight into P2P-based VoD design, to the best of our knowledge, only a few of them consider the fundamental questions posed in this work, which we discuss below. Parvez et al. in [52] briefly suggest the use of stratification in VoD systems, proposed in [21] in the context of downloads, which they suggest achieving by: (1) sending multiple copies of the same request to several peers, so that eventually all requests are served by a subset of faster peers; (2) sending requests to the peers with faster response times, based on historical information, so that after some time a node only sends requests to a subset of peers which provide fast responses; and (3) having nodes maintain a limited buffer around the playback point (rather than the entire video content downloaded thus far), so that nodes only serve a subset of peers which have "nearby" playback points. These are not evaluated in [52], and, as noted above, our evaluation of some of these schemes did not lead to performance as good as that of the schemes studied here. The authors in [24] propose a DHT-based design to balance request load. They use a scoring function to select the peer with the lowest cost to serve a request, which takes into consideration a peer's information such as bandwidth, current load, and online time. At a high level, this is similar to our LLP-based schemes; however, there is not sufficient information provided (e.g., about how to weigh the different parameters of the scoring function) for us to be able to make quantitative comparisons to this approach. Also, they focus only on the level of load balancing as their performance metric, while we focus on the resulting QoS. Moreover, as we showed in Chapter 4.3, a good load balancing scheme is not sufficient for high QoS; hence our proposed DAS improvement.
Liang et al. in [45] use tracker assistance to improve performance. We evaluated this scheme in Chapter 4.2 and showed that our proposed schemes have better performance with lower overhead. The piece selection problem in BT-like systems is studied, e.g., in [9], [61], [35], [70]. While piece selection is not the focus of our work, we evaluated how our proposed approaches are affected by different piece selection strategies. Another category of works related to ours is those focused on P2P live streaming, e.g., [64]. Although theoretical analyses and measurement studies of P2P live streaming systems provide insight into the design of VoD systems as well, as noted in Chapter 1.2, there are fundamental differences between these applications, which give rise to some of the fundamental questions studied here.

2.3 Incentives

As we discussed in Section 2.2, VoD systems have received attention from the research community for a number of years, and recent studies, e.g., [34, 33], show that P2P-based approaches can greatly reduce server cost compared to traditional client-server approaches. An important element of P2P system design is the incentive mechanism. We show in Section 1.3 that an incentive mechanism is essential for P2P systems to achieve good performance. BT [12] uses TFT as its incentive mechanism, so that nodes that contribute more obtain faster download rates. We show that LCI has better performance than BT TFT in a BT-like VoD system with layered coding. In [26], Habib et al. propose a rank-based peer selection mechanism for a P2P media streaming system, where the contribution of a user is represented by a score; a peer with a higher score has more flexibility in peer selection. A payment-based incentive mechanism is proposed in [59], where peers earn points by forwarding data to the next hop and compete with each other for good parents in an auction-like procedure. The approaches in [26, 59] both require a centralized agency to administer the system and thus are not as scalable as our distributed approach. Unlike in P2P file sharing systems, download rate is not an appropriate incentive for a P2P streaming system, due to the reasons described in Section 1.3. Instead, playback quality should be used to differentiate performance among nodes with different capacities. However, most popular P2P VoD systems, such as PPLive [34], use a single-layer coding approach, and it is hard for users in such systems to receive different video qualities. Liu et al. in [46] propose to use BT TFT and layered coding, so that a node with more upload contribution has a higher chance of receiving more video layers and thus a better video quality. In their recent work [47], substream trading is proposed for open P2P live streaming systems; substream trading is TFT-like but accommodates different video coding schemes, including layered coding. With substream trading, a node that uploads more can potentially exchange data with more peers, resulting in receiving more video layers. Our work uses layered coding, but the major distinctions between our work and [46, 47] are: (1) we show that TFT doesn't work well for P2P VoD systems with layered coding; and (2) we provide a minimal service through the base layer. To the best of our knowledge, this is the first work to propose a basic service for BT-like VoD systems. As we described in Section 1.3, very few works have considered providing incentives for P2P VoD systems.
In fact, many existing P2P VoD systems lack incentives, and nodes are forced to share their upload bandwidth, e.g., PPLive [34]. Hwang et al. in [35] consider a BT-like VoD system and use TFT directly. We show in Section 1.3 why we need an incentive mechanism for a P2P VoD system and what the appropriate incentive is. Our proposed LCI has better performance than TFT in a BT-like VoD system with layered coding. Streaming using layered coding has emerged as a promising way to handle network heterogeneity, and many works have proposed applying layered coding to P2P systems due to the inherent heterogeneity among nodes. [56] proposes PALS, a framework for P2P adaptive layered streaming. [14] targets the challenge of finding an optimal routing structure that maximizes receiver throughput and achieves intra-layer and inter-layer fairness. [63] proposes a scheduling approach for streaming using layered coding, consisting of (1) four objectives that should be achieved by data scheduling and (2) a 3-stage scheduling mechanism. While these works provide insights about improving the performance of P2P streaming systems using different data scheduling techniques, their approaches are tied to various proprietary systems and thus cannot be used directly in our setting. In contrast, the VoD system we consider in this work is based on the BT protocol and is more open. A previous study [2] shows that the BT system is highly efficient, and our approaches can be useful for BT-like VoD systems in general.

2.4 Bistro

This section briefly discusses related work in the context of data assignment in distributed systems. RAID [53] is commonly used in distributed storage systems to provide better fault tolerance and performance. RAID spreads information across several disks, using techniques such as disk striping, disk mirroring, and erasure codes to achieve redundancy, lower latency and/or higher bandwidth for reading and/or writing, and recoverability from hard-disk crashes. In the Bistro fault tolerance protocol, we also stripe data with erasure codes. We can think of RAID as employing the "even" assignment strategy described in Chapter 6.3; we have shown that this is not a good approximation for our application. An important issue in data striping over the Internet is the data placement problem, that is, which data to place where. With the deployment of large content distribution networks (Inktomi [29], Exodus [17], Digital Island [38]) which provide hosting services to multiple content providers, data placement issues become more and more important. A number of problem formulations are used to characterize and improve different objectives. In [43], the k-median formulation is used to address the problem of data placement in order to reduce network bandwidth consumption. In our problem, by increasing the probability that the destination server reconstructs the original file, we reduce the chance of retransmitting the file and improve reliability and performance. The k-median problem [50] is a well-known NP-hard problem. The k-median formulation is used to address the problem of distributing a single replica over a fixed number of hosts. Some works use a k-median formulation to address a performance metric in replica placement [40, 41]; the solution is usually network-topology dependent. [51] proposes a bin packing formulation to achieve load balancing in distributing documents over a cluster of web servers.
[51] also proposes an algorithm for the initial distribution, and network flow formulations for cases where either access patterns change or there is a server failure. This is similar to our problem in that intermediate bistros can fail and we need to maximize the availability of the file. [11] studies the formulation of a file allocation problem, which is proved to be NP-complete in [16]. The file allocation problem is similar to our problem in that it is also a data placement problem: it needs to store N files on M servers in order to optimize a performance parameter, with respect to the storage capacity available at each server. However, it is difficult to apply the above formulations directly to our problem. Our problem differs from those in the following aspects: (1) in the Bistro system, intermediate bistros are not trusted, while in replication the replicas are usually placed on trusted servers; (2) in the Bistro framework, there is no difference between intermediate bistros, while in replication there is usually a primary replication server which is more important than the other servers; (3) in file allocation problems there are normally storage capacity constraints, but in our model we do not focus on this constraint; and (4) our problem is modeled at the application layer and is network-topology independent. Genetic algorithms have been used to solve various optimization problems, including graph partitioning [6], multiprocessor document allocation [20], and file allocation [48]. We took advantage of their ability to explore the solution space of a problem quickly and efficiently in order to design our heuristic for the data assignment problem. To the best of our knowledge, we are the first to apply genetic algorithms to the data assignment problem in many-to-one applications. The genetic operations in our GA heuristic are novel and can explore the search space quickly.

Chapter 3

MultiTorrent

BitTorrent (BT) has become an extremely popular and successful peer-to-peer file sharing system. Although empirical evidence suggests that most nodes participate in multiple torrents, surprisingly little research exists on this topic. In Chapter 1.1, we already provided an overview of our work on multi-torrent systems; a detailed discussion is presented in this chapter. We first provide background on BT in Chapter 3.1. Our proposed approaches are discussed in Chapter 3.2. An extensive performance study validating our proposed approaches is given in Chapter 3.3.

3.1 Background and Motivation

As background, and to establish terminology, we briefly describe how BitTorrent (BT) currently works. BT System: In BT, nodes join the system (after receiving "start-up" information from the tracker) and begin requesting chunks of data from their neighbors. The tracker maintains a list of nodes which are currently participating in the corresponding torrent. It is responsible for assisting in peer discovery and is not involved in any data transfer or data scheduling. Nodes which do not have a complete copy of the file are termed "leechers" and those which do are termed "seeds". Each leecher i picks a number (typically 5) of nodes to whose requests it will respond with an upload of an appropriate chunk, i.e., these nodes are "unchoked". A subset of these nodes (typically 4) is picked based on the tit-for-tat (TFT) mechanism, i.e., those neighbors which have provided the best service (in terms of download rate) to node i recently, and a subset (typically 1) is picked randomly, i.e., these nodes are "optimistically unchoked" (to explore for better neighbors). The TFT mechanism in BT is basically a "local selfish behavior" where leechers try to upload to the most reciprocative neighbors. Seeds also pick a subset of neighbors (typically 5) and upload data to them. In past versions of BT, seeds chose the neighbors with the highest download rates; in a more recent protocol [44], the seeding capacity is distributed more uniformly among the neighboring peers. All these choices are re-evaluated periodically.
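For reference, here is a minimal sketch (with our own naming and simplifications) of the leecher unchoking behavior just described: the top peers by recent download rate receive the regular (TFT) unchokes, and one additional peer is unchoked at random. Real clients re-evaluate these choices on fixed timers and track rates with moving averages, which we omit here.

```python
import random

def select_unchoked(interested_peers, recent_rate, n_regular=4, n_optimistic=1):
    """interested_peers: peers currently interested in downloading from us.
    recent_rate: maps each peer to the download rate it recently provided us.
    Returns the peers to unchoke: n_regular via tit-for-tat plus
    n_optimistic picked at random (the optimistic unchoke)."""
    by_rate = sorted(interested_peers,
                     key=lambda p: recent_rate.get(p, 0.0), reverse=True)
    regular = by_rate[:n_regular]
    rest = [p for p in interested_peers if p not in regular]
    optimistic = random.sample(rest, min(n_optimistic, len(rest)))
    return regular + optimistic
```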
Multiple Torrents: As noted in [25], most BT peers (> 85%) participate in multiple torrents; however, in the current BT these downloads are performed without an attempt to "relate" the multiple torrents. We also note that (a) users in BT typically obtain start-up information (the .torrent file) from a source (e.g., a forum or a web site), and (b) there is some correlation in content of interest (as noted in Chapter 1.1). Thus, we expect there to be a reasonably high probability that users obtaining start-up information from the same source would download from multiple common torrents simultaneously. Motivated by this, we first focus on the opportunity to have nodes act as seeds in the torrents they have already completed while they are downloading as leechers in the torrents they have not completed. (Footnote 1: In Chapter 3.3, we also explore applications where nodes participate as seeds in some torrents (using previously downloaded files) while they act as leechers in torrents of current interest to them, i.e., similarly to the scenarios explored in [25].) The potential benefits of this fall into the following three categories.

(1) Seeds helping newly joined nodes ramp up faster: It usually takes a newly joined node some time to ramp up its download rate, due to a lack of data chunks which it can offer/upload to other nodes – this prevents it from being unchoked via TFT, e.g., as discussed in [10]. Therefore, at the beginning it receives its chunks mostly through optimistic unchoking by other leechers, as well as from seeds. Thus, having a larger number of seeds in the system would allow a newly joined node to ramp up faster. This faster ramp-up would also allow new nodes to contribute their uploading resources to the system earlier, thus improving not only their own performance but the overall system capacity.

(2) Seeds helping improve "end-game" behavior: When a node nears the end of its download process in BT, it has a lower probability of finding peers which carry the few remaining data chunks it needs to complete its file download – this results in a decrease in its download rate, e.g., as discussed in [10]. Having more seeds in the system would increase the probability of a leecher finding peers with those last few chunks (as seeds have the complete file), and hence would improve the download rate. In the current BT system, this is addressed by an "end-game" mode in which nodes, toward the end of their download process, attempt to download the same chunks from multiple peers, in order to improve their probability of getting those last few chunks from someone. This, however, has the potential of wasting system resources, as a node might receive duplicate chunks. (Our approach would be orthogonal to the current end-game mode.)

(3) Seeds keeping a torrent alive: Having nodes participate longer in the system, and particularly as seeds, also extends the lifetime of a torrent.
That is, it reduces the probability of a node departure resulting in some data chunks disappearing from the torrent (i.e., if the departing node had the last copy of a particular chunk and left before another node downloaded it) – this helps maintain data availability of less popular content. As this benefit is well studied in [25], we do not explore it further in this work. Although these potential benefits to the overall system performance are significant, in the current BT system a node does not have an incentive to contribute its resources as a seed in some torrent while it is downloading from another torrent. Moreover, doing so might hurt that node's performance – it would use part of its upload capacity to help peers from which it does not need any data, while reducing its ability to "compete" for TFT unchoking from peers which could provide data it does need. That is, a selfish peer might be better off using all its uploading capacity in the torrents where it is a leecher. In what follows we give a motivating example which illustrates a potential for incentivizing peers to remain as seeds in some torrents while they are completing their downloads as leechers in other torrents.

[Figure 3.1: Motivating Example]

3.2 Proposed Approach

We begin with a simple (simulation-based) example which motivates the use of multiple torrents and the need for incentivizing peers to remain as seeds in some torrents while they are completing their downloads as leechers in other torrents. We then suggest an approach for providing such incentives. Motivating Example: here, we have slow and fast nodes (with node capacities and mix given in Table 3.2), where each node arriving to the system joins 2 (randomly chosen) torrents out of 10 available ones. (The details of the simulator used to generate the results of the experiments described below are given in Chapter 3.3, with the simulation settings in Table 3.1.) When a node completes a download in one of its torrents, depending on the experiment, it does or does not "stay around" as a seed in that torrent while it completes its download in the remaining one; specifically, we perform the following experiments: (1) original BT where none of the nodes stay around, (2) original BT where all nodes stay around, (3) original BT where each node stays around with probability 1/2, and (4) BT with our simple modification to the current TFT mechanism, which we term cross-torrent TFT (CTFT), where each node stays around with probability 1/2 (the details of our approach are given below).

[Figure 3.2: Number of Nodes in the System]

The resulting average download times for these experiments are depicted in Figure 3.1 – here, the first group corresponds to experiments (1) and (2), the second group corresponds to experiment (3), where we depict the download times of nodes that did and did not stay around, and the third group corresponds to experiment (4), where again we depict the download times of nodes that did and did not stay around. Comparing the results of experiments (1) and (2), we can note that nodes "staying around" does improve the average download time, in this case by ≈ 17% for the slow nodes and by ≈ 21% for the fast nodes.
However, as noted, the "stay around" in this simulation is "forced", and the current BT system has no incentives for nodes to stay around after they finish their downloads. To illustrate this further, note that in Figure 3.1, when half of the nodes stay around, i.e., experiment (3), nodes that do and do not stay around experience similar improvements in download times (as compared to experiment (1)). That is, there is no incentive for nodes to stay around voluntarily, and a mechanism for encouraging "staying around" behavior is needed. To this end, we propose a simple modification to the current TFT mechanism (as detailed in Chapter 3.2) and employ it in experiment (4). As depicted in Figure 3.1, nodes which do stay around experience better average download times (for both the fast and slow classes) than those which do not stay around – this is in contrast to experiment (3), which has the same settings except for our proposed CTFT mechanism. Specifically, in this example, when using CTFT, staying nodes download ≈ 10% faster than their non-staying counterparts. This motivates the need for a mechanism to provide incentives for nodes to stay around voluntarily. We describe our proposal for such a mechanism next. In the current system, a node which does stay around may end up experiencing poorer performance than one which does not stay around (refer to Chapter 3.1). In that case, nodes are not likely to stay around unless proper incentives can be provided. Therefore, the question we consider here is how to encourage nodes to stay around in order to reap the potential benefits of multiple torrents. There are a number of approaches one could take to try to exploit the potential benefits of multiple torrents. In this chapter we present one such simple approach and explore its benefits through a simulation-based performance study in Chapter 3.3. (A discussion of other possible directions is given in Chapter 3.4.) In devising this approach we are motivated by the following. Firstly, we would like to provide incentives for a node to stay around as a seed in torrents it has finished while it is downloading from other torrents as a leecher; specifically, we would like such nodes to experience shorter download times than nodes which do not stay around. Secondly, we would like to achieve this while maintaining the original spirit of BT, where nodes exhibit "local selfish behavior" (refer to Chapter 3.1). Lastly, we would like our scheme to be easily implementable and deployable. To this end we focus on local modifications (not centralized approaches), without the need for assistance from (or modifications to) the tracker(s) as in [25]. Given this, our scheme modifies the TFT unchoking mechanism of leechers by considering multiple torrents and favoring nodes which stay around as seeds. We term this approach Cross-Torrent Tit-for-Tat (CTFT). Cross-Torrent TFT (CTFT): CTFT modifies the unchoking part of leechers' behavior as follows. When a leecher is choosing peers for TFT unchokes, instead of choosing the peers with the fastest downloading rates in a particular torrent (as is currently done), we look at the peers' aggregate downloading rate across all torrents in which they participate. Moreover, this aggregation is done in a weighted manner, where higher weight is given to the downloading rate from torrent(s) where a peer is a seed, i.e., in order to favor nodes which stay around as seeds.
That is, the total contribution of a peer N_y with respect to a peer N_x, which we use to rank peers for TFT unchoking, is $\sum_{i=1}^{\#Torrents} w_i(y) \times D_i(x,y)$, where $D_i(x,y)$ is the downloading rate of node N_x from node N_y in torrent i, and $w_i(y)$ is the weight we assign to that downloading rate. If N_y is not a seed in torrent i, then $w_i(y) = 1$; otherwise, $w_i(y)$ can be set to a value larger than one. (The effect of different weights is studied in Chapter 3.3.) For example, let us consider two nodes N_x and N_y which are both participating in torrents T_a and T_b. Node N_y has finished downloading the file in torrent T_a and is staying as a seed in that torrent. Suppose that $D_a(x,y)$ and $D_b(x,y)$ are the downloading rates of node N_x from node N_y in torrents T_a and T_b, respectively. When node N_x is selecting peers to unchoke via TFT in torrent T_b, node N_x ranks the peers in torrent T_b (which are interested in downloading from N_x) based on their weighted total contribution. As node N_y is a seed in torrent T_a and a leecher in torrent T_b, its total contribution to node N_x will be $W \times D_a(x,y) + D_b(x,y)$, where $W > 1$ is the weight assigned to seeds. This, in a sense, gives "credit" to node N_y in torrent T_b for both the contributions it is making as a leecher in T_b and the contributions it is making as a seed in T_a. Note that this "credit" is given locally – that is, node N_x is only concerned with the benefit it gets from node N_y in both torrents, not with the benefit other nodes might receive from N_y. This is in line with the original spirit of BT, as described in Chapter 3.1. Through an extensive performance study (in Chapter 3.3) we illustrate the benefits of the proposed approach. Although these benefits can be significant, we also show that they are not uniform when there is sufficient heterogeneity in the file sizes of the different torrents. Specifically, the decrease in download times of the smaller files is achieved at the cost of an increase in download times of the larger files (e.g., as illustrated in Figure 3.6). Intuitively, this is due to the fact that the distribution of seeding capacity does not take into consideration where such capacity is needed more. There are a number of approaches that can be taken to mitigate this problem. One approach (which we evaluate in Chapter 3.3) is to estimate the "need" for seeding capacity in the different torrents and then only participate as a seed in those with "greater need". Thus, we modify the above CTFT approach as follows. Each node does a local estimate (i.e., based on its neighbors only) of the ratio of seeds to leechers, and then only participates as a seed in those torrents where the seed-to-leecher ratio is below R, where R is a system parameter. This local estimate, and hence the choice of which torrents to seed in, is re-evaluated every S time units, where S is another system parameter. We term this adaptation of the CTFT approach CTFT with dynamic seeding (CTFT-DS). We also note that another way to "shift" seeding capacity to larger files is to give higher weights to torrents with larger files in the CTFT approach described above; this can also be combined with CTFT-DS. (Evaluating the combined effects of these approaches is part of our future efforts.)
(Footnote 2: The benefit of doing such estimates locally is ease of implementation and low protocol overhead.)
(Footnote 3: Specific parameter settings for R and S are discussed in Chapter 3.3.)
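The two mechanisms can be summarized by the following sketch, which assumes hypothetical data structures of ours (per-torrent rates for each peer, an is_seed predicate, and per-torrent local peer lists). ctft_contribution is the ranking key that replaces the per-torrent download rate in the TFT unchoke step, and torrents_to_seed is the CTFT-DS decision a node would re-run every S time units.

```python
def ctft_contribution(peer, torrents, rate, is_seed, W=4.0):
    """Weighted cross-torrent contribution of `peer` to the local node:
    sum over torrents i of w_i * D_i, with w_i = W (> 1) if `peer` is a
    seed in torrent i and w_i = 1 otherwise. rate[t] is the download rate
    recently obtained from `peer` in torrent t; the simulations in
    Chapter 3.3 use W = 4 as the default."""
    return sum((W if is_seed(peer, t) else 1.0) * rate.get(t, 0.0)
               for t in torrents)

def torrents_to_seed(completed_torrents, local_peers, R=1.0):
    """CTFT-DS seeding choice: keep seeding only those completed torrents
    whose locally estimated seed-to-leecher ratio (computed over this
    node's neighbors only) is below the system parameter R."""
    chosen = []
    for t in completed_torrents:
        seeds = sum(1 for p in local_peers[t] if p.is_seed)
        leechers = len(local_peers[t]) - seeds
        ratio = seeds / leechers if leechers > 0 else float("inf")
        if ratio < R:
            chosen.append(t)
    return chosen
```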
We note that we explored a number of possible approaches to controlling seeding in torrents, e.g., based on information about the number of seeds and leechers obtained from the tracker, based on comparing the total amount of upload bandwidth to the total amount of download bandwidth of leechers, and so on. The results using these approaches are similar to those of the approach presented here; however, these approaches require tracker-side assistance and are less scalable. We omit these results here due to space limitations.

3.3 Simulation Study

In the following performance study we use the BT simulator provided by [2] (this simulator is also used by other groups for BT research). This is an event-based simulator which simulates the chunk exchange mechanism in the BT protocol. (Footnote 4: It does not simulate TCP behavior and simply shares the upload capacity of a peer evenly between its uploading sessions. The end-game mode and sub-chunk details are also not represented. This simulator is used by other groups as well, and we believe that these simplifications do not affect our results qualitatively.) To explore our proposed approach, we modify the simulator in [2] as follows: (1) We extend the simulator to support multiple torrents. Each torrent has one initial seed when the simulation begins. An arriving node chooses which torrents it wants to join, either as a leecher or a seed (as described in detail below). (2) We allow nodes to stay around in torrents in which they have finished downloads (based on the techniques described above). Nodes leave the system when they finish downloads in all joined torrents. (3) We add support for our CTFT and CTFT-DS mechanisms (refer to Chapter 3.2). (4) We allow node arrivals; in what follows we use a Poisson arrival process with rate λ. (5) We update the seed uploading algorithm to match the current BT protocol, i.e., more uniform, rather than uploading to the fastest peers as in older versions of BT. Unless specified otherwise, the following results correspond to the simulation settings in Table 3.1. The system starts with one origin seed per torrent, each with 1000 kbps upload capacity and each staying in the system for the duration of the simulation. Arriving nodes are assigned to a particular class according to a given distribution. The classes differ in their upload and download capacities. The default classes and the corresponding distribution are given in Table 3.2. (Footnote 5: We experimented with a broad range of distributions, and the results were qualitatively similar; due to lack of space we only present representative results, with other results given in [66].) The fast class is used when simulating a homogeneous system. For tractability of simulations, our experiments include up to 10 torrents, where each node joins a subset of these torrents, depending on the experiment (as described below). To obtain a fair comparison between approaches, we use the same node arrival sequence for each simulation with a given arrival rate and class distribution. In experiments where nodes randomly select which torrents to join (as leechers or as seeds), we also use the same torrent selection sequence. In the following simulations, we look at the steady-state behavior of the system. Each simulation run corresponds to 15 hours, and we only compute our results over the last 12 hours. Figure 3.2 depicts the number of nodes in the system as a function of simulation time; this indicates that the system has passed the initial "ramp up" stage during the first 3 hours.
Table 3.1: Simulation Settings
  File size:                          200 MB (800 chunks, 256 KB each)
  Simulation time:                    12 hours (+ 3 hours warmup)
  Avg. node inter-arrival time (1/λ): 45 sec
  Peer set size:                      40
  Leecher unchokings:                 4 regular + 1 optimistic
  Seed unchokings:                    5

Table 3.2: Two-Class Bandwidth Distribution
  Class   Fraction   Download Capacity   Upload Capacity
  Slow    40%        1500 kbps           128 kbps
  Fast    60%        5000 kbps           512 kbps

In what follows, unless otherwise stated, we focus on two metrics: (1) the average download time over all torrents, and (2) the average download time for the last torrent (i.e., the average amount of time it takes a node to complete all its downloads). These metrics are computed either over all nodes or on a per-class basis, depending on the experiment.

3.3.1 Design Space Exploration

To illustrate the performance consequences of our approach, we first explore the effects of different parameters using simple scenarios, where seeding capacity is due only to nodes staying around (as described above), except for the original source. Below we use the following notation for the various schemes being simulated: (a) "No Stay" refers to all nodes leaving each torrent as soon as the download in that torrent is complete; (b) "Stay" refers to (some fraction of) the nodes (depending on the experiment) staying around as seeds in each of their torrents until they complete the last of their downloads, at which point they leave the system; (c) "Original" refers to the use of the original BT TFT mechanism; and (d) "CTFT" refers to the use of our proposed CTFT mechanism (with W = 4 as the default).

[Figure 3.3: Staying Around Improvement]
[Figure 3.4: Inter-Arrival Time]

Different Number of Torrents: In this experiment we study how a multi-torrent system performs as a function of the number of torrents each node joins, where the performance improvements are due to staying around only, i.e., here we use the original BT TFT mechanism. For clarity of presentation, we consider a homogeneous system with only the fast nodes in Table 3.2. Figure 3.3 depicts the percentage improvement in download time, as compared to the "No Stay" case, for two experiments: (1) when the choice of torrents to join is fixed, i.e., all nodes join the same X torrents, and (2) when this choice is random, i.e., each node uniformly selects X torrents (out of 10) to join. The value of X is depicted on the x-axis of Figure 3.3. From these results we observe the following. (1) Having nodes stay around as seeds improves the average and last download times for both the fixed and random selection experiments. (2) In most cases, performance improvements increase with X, due to larger variances in download times (as a function of X), which result in longer stay-around times. For example, when the choice of torrents to join is fixed, the standard deviation of average download times increases from ≈ 11 for 2 torrents to ≈ 45 for 6 torrents. Although there is not a monotonic trend in the percentage improvement as a function of the number of torrents, the actual improvements are approximately linear.
We note that when the number of torrents each node joins is relatively small, the average download time using random torrent selection is a bit faster than that using fixed torrent selection; e.g., the difference is ≈ 10% when nodes join 2 torrents. Their respective download times are closer when a node joins a relatively large number of torrents, e.g., only a ≈ 1% difference when nodes join 5 torrents. This can be explained as follows: the network size in the random selection experiment becomes more similar to the network size in the fixed selection experiment as the number of torrents joined grows (as also discussed below).

[Figure 3.5: Staying Fraction]
[Figure 3.6: Game Patch System]

Different Node Arrival Rates: In this experiment we observe the effects due to the arrival rate. For clarity of presentation, we consider a homogeneous system with only the fast nodes given in Table 3.2 and each node joining the same 3 torrents. Figure 3.4 depicts the percentage improvement in average download time and the percentage increase in network size, both due to nodes staying around (as compared to the "No Stay" case), as a function of node inter-arrival time. From these results, we observe that staying around results in larger improvements when the inter-arrival time is longer (lower arrival rate). This is due to staying around being more helpful in torrents with fewer nodes. When the arrival rate is lower, we have a smaller network size, and staying around results in greater relative increases in network size (as shown in the figure). This also increases the peer set size for each node, and, in general, larger peer set sizes result in better BT performance (see [57]). (Footnote 6: We believe that this type of effect on the network and peer set sizes also contributes to the differences in performance improvements observed in the previous experiment; i.e., the above settings with random selection would have smaller network/peer set sizes than those with fixed selection.)

Different Fraction of Nodes Staying Around: Here, we study how CTFT performs with different fractions of nodes staying around. We consider the heterogeneous system in Table 3.2, where each node randomly selects 3 torrents to join out of 10. Figure 3.5 presents the average download time improvement of the staying nodes as compared to the non-staying nodes when we vary the fraction of staying nodes from 20% to 80%. Results for both the fast nodes and the slow nodes are shown for the "Original" and "CTFT" schemes. Negative percentages indicate cases where staying nodes download slower than non-staying ones. We make the following observations. (1) In "Original", fast and slow staying nodes have about the same performance as the non-staying nodes (staying nodes do a bit better in some cases and in most cases a bit worse). This indicates that the original TFT does not provide appropriate incentives for nodes to stay around as seeds. (2) In all cases of "CTFT", staying nodes download faster than non-staying nodes, because CTFT favors nodes which stay around as seeds; this illustrates that CTFT has the desired effect of providing incentives for nodes to stay around as seeds.
3.3.2 Multi-Torrent Applications

Above, we focused on the potential performance benefits of our multi-torrent approach, where for clarity of exposition we made simplifying assumptions about user and file characteristics. However, in the real world, file sizes can vary significantly and, depending on the application, users can exhibit different behavior. Thus, in this chapter we explore our approach in the context of the following potential multi-torrent applications, in which users have natural common interests and which we believe can benefit from a multi-torrent approach: (1) game patches, (2) online movie rentals, and (3) software installers. Focusing on specific applications allows us to consider realistic file sizes as well as user behavior; e.g., we allow each node to download from a different number of torrents, and we also allow "re-seeding". (Footnote 7: We still consider the case where nodes start all their downloads simultaneously. In the real world, nodes may start new downloads while other downloads are in progress. This would increase the stay-around time and should result in a greater benefit from our approach. Studying this would require more detailed modeling of user behavior and is outside the scope of this work.) By "re-seeding" we mean that nodes can join the system and immediately become seeds in some torrents by sharing files which they downloaded at some earlier time. The ability and willingness to share files downloaded earlier may be appropriate for some applications, as we explore below. This is similar to the multi-torrent characteristics considered in [25], with the main difference being that in our approach we do this using local (to a node) information and in a distributed manner, i.e., without the aid of trackers or additional protocol modifications as in [25]. Each simulation run below corresponds to 30 hours, with the initial 10 hours being warmup time; we increase our simulation time due to the increase in file sizes as well as the greater heterogeneity, e.g., in file sizes and torrent popularity.

1. Game Patches: The first application we consider is an online game patch system. With the popularity of online games, fast delivery of game patches is difficult because: (1) the online game community is huge (e.g., World of Warcraft (WoW) reached 10M subscribers in March 2008 (see [4])); (2) game patches such as bug fixes, feature enhancements, and map updates are released fairly frequently – while minor patches are on the order of 10MB, major patches can be several hundred MBs; and (3) when players leave the game, they need to install all available patches before being able to play again – since purchased game software installation discs are not (typically) patched up-to-date, new installations require downloads of all current patches as well. All this indicates that game patch distribution results in a huge amount of Internet traffic and thus can benefit from the use of BT-like systems. In fact, some game software companies have already started delivering game patch software in a P2P fashion (see [4]). We collect 3 game patch file sizes from WoW's major patch versions 2.0, 2.1, and 2.2; these are 492MB, 256MB, and 80MB, respectively. For clarity of presentation, we use a homogeneous system with the fast node characteristics given in Table 3.2, and we assume that each arriving node requires all three patches. (Footnote 8: Future work includes investigation of the effects of patch release dates and other characteristics of game systems.) When CTFT-DS is used, we set the seed-to-leecher ratio (R) to 1 and the re-evaluation interval (S) to 3 minutes. We explore the effects of R and S in later experiments.
Figure 3.6 depicts the resulting average download times using the following schemes: (1) "Original"; (2) "Original, Stay"; (3) "CTFT"; and (4) "CTFT-DS". We make the following observations:
• Nodes staying around results in a significant decrease in the average download time of the smaller files at the expense of the large file. This is due to the following – when nodes finish downloading smaller files, they continue to seed in those torrents for a long time while they complete their large file downloads. Thus, more of the overall system capacity ends up being used by the smaller file downloads (rather than the larger file downloads) in the "Stay" case as compared to the "No Stay" case. This motivates the CTFT-DS scheme (as described in Chapter 3.2).
• The average large file download times of "CTFT-DS" are as good as those of "Original" and are much better than those of "CTFT" or "Original, Stay"; at the same time, smaller files are downloaded faster under "CTFT-DS" than under "Original" and only slightly slower than under "CTFT". That is, "CTFT-DS" (as compared to "Original") does not hurt large file downloads while significantly improving smaller file downloads. This is due to its dynamic seeding approach, which balances the number of seeds and leechers and hence shifts some of the upload capacity towards the larger files. The statistics of the average seed-to-leecher ratio, for each file size, can be found in Table 3.3. We observe that for the 80MB and 256MB files, the seed-to-leecher ratio is quite high without dynamic seeding and is close to 1 with dynamic seeding, which is due to dynamic seeding's re-allocations.
• We also note that: (1) large file torrents lack sufficient seeds in all cases (this can be seen in Table 3.3, where the seed-to-leecher ratio is still quite low even with dynamic seeding), and hence further improvements are possible there; and (2) although in Figure 3.6 "CTFT" has similar performance to "Original, Stay", it does provide incentives for nodes to stay around for seeding. (Footnote 9: This was illustrated in other experiments, where we depicted a breakdown of download times between staying and non-staying nodes. Due to lack of space, we do not include such a breakdown for this experiment.)

Table 3.3: Average Seed-to-Leecher Ratio
  Scheme           80MB     256MB    492MB
  Original, Stay   60.553   22.001   0.002
  CTFT-DS          0.995    0.922    0.004

1.1 Effect of CTFT Weight: To study the impact of the CTFT weight on dynamic seeding, we use a heterogeneous system with the fast and slow nodes given in Table 3.2. We use the same file settings and assume each arriving node requires all three patches. To compare the performance of staying nodes and non-staying nodes, we let each node stay around with probability 1/2. We experiment with CTFT weights from 1 to 100, where a weight of 1 corresponds to giving no special consideration to seeding. In Figure 3.7, we depict the resulting percentage improvement in download times experienced by staying nodes, as compared to the non-staying ones. Note that negative improvements correspond to cases where the staying nodes download slower than the non-staying ones. We observe:
• For both node classes, staying nodes under the large and medium files have better performance than non-staying nodes when CTFT uses weights larger than 1 (i.e., when we do give "credit" to nodes for seeding), which illustrates that CTFT provides proper incentives for nodes to stay around.
• For both node classes, staying nodes and non-staying nodes have similar performance under the small file, regardless of the CTFT weight. This is due to: (1) the download time for the small file being too short; and (2) the plentiful seeds for the small file also making the CTFT effect less obvious.
• Slow nodes benefit more from staying around (percentage-wise) than fast nodes, which is also true for nodes downloading the large file. This is expected, as the download times in these cases are longer.
• Staying nodes' improvement due to CTFT is quite sensitive under small weights, but is reasonably insensitive when higher weights are used. We also note that a very high weight setting could hurt the clustering characteristics typically exhibited by BT systems. Such clustering of nodes with similar bandwidth capabilities serves as an incentive to contribute capacity in the original BT protocol and is studied, e.g., in [44]. Specifically, when the CTFT weight is high, slow nodes (which stay around) would have a higher probability of "clustering" with fast nodes after they finish some downloads. This can be observed for fast nodes under the 492MB and 256MB files when the CTFT weight is greater than 40. Therefore, we believe that very high weights should not be used, especially since similar performance improvements can be achieved with lower weights.

[Figure 3.7: Different CTFT Weight]

In a real implementation, there are multiple approaches for nodes to determine the actual weight. One such approach is to use tracker-side assistance: the tracker collects nodes' average download times and determines the actual weight based on them, gradually increasing the weight if average download times decrease and gradually decreasing it otherwise. Another possible approach is to let nodes determine the weight locally by estimating peers' downloading rates; peers' download rates can be estimated from the rates of their "have" messages. We note that an accurate weight is not required, as the system performance is not very sensitive to the weight setting, as shown in our experiments; a sketch of the tracker-side adaptation follows.
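A minimal sketch of the tracker-side adaptation just described; the step size and bounds are assumptions of ours (the cap reflects the clustering observation above), since the text does not prescribe specific values.

```python
def adapt_weight(weight, avg_time, prev_avg_time, step=1.0,
                 w_min=1.0, w_max=40.0):
    """One adaptation round: raise the CTFT weight while average download
    times keep improving, and back off once they stop improving. The bounds
    keep the weight away from the very high settings that can hurt BT's
    bandwidth-based clustering."""
    if avg_time < prev_avg_time:
        return min(weight + step, w_max)
    return max(weight - step, w_min)
```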
1.2 Effect of Seed-to-Leecher Ratio (R): To study the impact of the choice of R on dynamic seeding, we use a heterogeneous system with the fast and slow nodes given in Table 3.2. We use the same file settings and assume that each arriving node requires all three patches. We let all nodes stay around and perform dynamic seeding. We experiment with R from 0.2 to 80, where a value of 1 corresponds to an equal number of seeding nodes and leechers. In Figure 3.8, we depict the resulting improvement in download times (in minutes) as compared to CTFT without dynamic seeding. Note that a negative improvement corresponds to cases where nodes download slower. We observe:

[Figure 3.8: Different Seed-to-Leecher Ratio (R)]

• For both node classes, dynamic seeding improves the performance of the large file download significantly. For example, the improvements for fast and slow nodes are ≈ 688 minutes and ≈ 162 minutes when R = 1, respectively. This is because dynamic seeding shifts seeding capacity from the small and medium files to the large file. As expected, it also results in the small and medium files downloading more slowly than without dynamic seeding.
• For both node classes, increasing R results in faster download times for the small and medium size files. This is due to more nodes seeding as a result of a larger R. However, increasing R hurts the performance of the large file, because more seeding capacity is used for the small and medium files and hence less is shifted towards the large file.
In a real implementation, nodes can determine the value of R using approaches similar to those for determining the CTFT weight (as discussed earlier).

1.3 Effect of Re-evaluation Interval (S): The purpose of having nodes re-evaluate the seed-to-leecher ratio is to let them adapt to possible changes in the network, e.g., a sudden burst of arrivals. With a sudden network change, the seed-to-leecher ratio among a node's neighboring peers can change dramatically. Seeding without re-evaluation can allocate system resources inefficiently, resulting in a degradation of system performance. To study the impact of the re-evaluation interval (S) on dynamic seeding, we use a homogeneous system with the fast nodes given in Table 3.2. For a clearer presentation, we let each node download the small (80MB) and the large (492MB) files only. We insert a burst of arrivals into the simulation, between the 900th and the 960th minutes. The arrival rate during the bursty period is 5 times the normal arrival rate. We compare the average download rates using 10, 180, and 360 second re-evaluation intervals with those of a system without re-evaluation. We depict the average download rates for the 492MB and the 80MB files in Figures 3.9 and 3.10, respectively. We only depict the results starting with the 840th minute to focus on the burst's effect. We observe:

[Figure 3.9: 492MB File with Bursty Arrival]
[Figure 3.10: 80MB File with Bursty Arrival]

• For the 492MB file, with re-evaluation, the download rate recovers faster from the performance drop between the 900th and 950th minutes and maintains a higher rate after that. This is because with more arrivals, nodes become seeds for the 80MB file, and thus the seeding capacity for the 80MB file experiences a surplus. With re-evaluation, the system is able to re-allocate resources from the 80MB file to the 492MB file faster.
• For the 80MB file, with re-evaluation, the system is able to maintain a higher download rate when the burst occurs. There is a smaller performance drop between the 900th and 950th minutes than in the system without re-evaluation. This is because more system resources are allocated to the 80MB file when the burst occurs. We also observe that without re-evaluation, the average download rate is much higher than with re-evaluation. This is because with bursty arrivals, more nodes become seeds for the 80MB file, and the seeding capacity for the 80MB file becomes significantly larger than needed; for example, with re-evaluation, the 80MB file downloads only ≈ 1 minute slower than without re-evaluation. This indicates that system resources are not used as efficiently without re-evaluation.
• Without re-evaluation, the system experiences some oscillations after the arrival burst.
This can be observed in the performance drops for the 80MB file around the 1200th and 1500th minutes. The reason is that there are more node departures than arrivals in these periods, due to the burst. This also shows that the system's behavior is more stable with re-evaluation.
The above observations demonstrate that the system adapts to bursty arrivals better with re-evaluation. We also note that the performance is, in general, not very sensitive to S, and therefore an accurate value of S is not required in a real implementation.

2. Online Movie Rentals: Next, we consider an online movie rental system, e.g., such as Apple's iTunes Store or Netflix. The large file sizes typical of this application (e.g., a 720p movie on iTunes is ≈ 1.2GB) result in long download times and high costs to providers for hosting the movies. Thus, this is another application where a BT-like system would be useful. For our experiments, we collected the lengths and ratings of the Top 10 rented movies from Netflix (a popular online movie rental site in the US). For tractability of experiments, we assume a movie's bitrate to be 450 kbps. Due to the difficulty of obtaining real statistics of movie popularity, we use the number of times each movie is rated by customers as its popularity indicator; we normalize this number by the total number of customer ratings to obtain the resulting popularity distribution. The collected movie file sizes and popularity distribution are given in Figure 3.11.

Figure 3.11: Netflix Top 10 Movies as of 3/18/08
  Size (MB)   Length (min)   # Ratings   Popularity
  333         101            1920102     10%
  363         110            1677314     9%
  369         112            2394935     13%
  372         113            2283232     12%
  386         117            1480414     8%
  396         120            1344481     7%
  402         122            1751443     9%
  405         123            2535077     13%
  448         136            1943487     10%
  498         151            1790922     9%

Figure 3.12: Adobe Software on MiniNova as of 3/18/08
  Size (MB)   Name            # Downloads   Popularity
  268         Acrobat         748           14%
  286         DreamWeaver     121           2%
  328         ColdFusion      79            2%
  414         Flex            42            1%
  450         Illustrator     42            1%
  461         Flash           425           8%
  463         Photoshop       2763          54%
  533         After Effects   658           13%
  582         Premiere        178           3%
  749         InDesign        90            2%

[Figure 3.13: Online Movie Rental (Fast Nodes)]

We consider the heterogeneous nodes in Table 3.2, where each arriving node randomly (uniformly) selects Y torrents (movies) to join, where Y is between 1 and M. It participates as a leecher in one of them (representing the movie it wants to download) and as a seed (i.e., as described earlier, it "re-seeds") in the remaining Y−1 of them (representing movies it downloaded previously and is willing to share). The specific Y torrents (movies) are chosen based on movie popularity (in this case using the distribution given in Figure 3.11), and the choice of which one (out of Y) is the movie of interest (i.e., the one to be downloaded) is made randomly (uniformly).
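The workload just described can be generated as in the following sketch (function name and structure are ours): Y distinct movies are drawn according to the popularity distribution of Figure 3.11, one of them is picked uniformly as the movie to download, and the rest are re-seeded.

```python
import random

def pick_movie_torrents(movies, popularity, Y):
    """movies: list of movie identifiers; popularity: matching relative
    weights (e.g., normalized rating counts). Returns (leech, reseed):
    the movie to download as a leecher and the Y-1 movies to re-seed."""
    pool, weights, chosen = list(movies), list(popularity), []
    for _ in range(Y):  # Y distinct, popularity-weighted draws
        i = random.choices(range(len(pool)), weights=weights)[0]
        chosen.append(pool.pop(i))
        weights.pop(i)
    leech = random.choice(chosen)  # the movie of current interest
    reseed = [m for m in chosen if m != leech]
    return leech, reseed
```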
[Figure 3.14: Online Movie Rental (Slow Nodes)]

The results of our experiments are depicted in Figures 3.13 and 3.14, where M = 4. Here, we compare the average download times of the following schemes: (1) "Original, No Re-Seed", (2) "Original, Re-Seed", and (3) "CTFT-DS". We observe the following. (1) Under "Original, No Re-Seed", nodes download much slower than under "Original, Re-Seed" and "CTFT-DS", i.e., re-seeding helps. (2) "CTFT-DS" has better performance than "Original, Re-Seed"; we also observe a significant reduction in download times for large files under "CTFT-DS". Moreover, as in the case of "CTFT", "CTFT-DS" provides incentives for nodes to stay around, while "Original, Re-Seed" lacks such incentives.

3. Software Installers: The last application we consider is a software installer, e.g., such as the Google Pack installer or the Microsoft Live installer. In such applications, software companies put together software bundles for user downloads; this can be software from the same company or popular software packages. We consider this application as we believe it can also benefit from a BT-like system. In our experiments, we collect file sizes and average numbers of software downloads from http://www.mininova.org, a major BT site in the US; these are listed in Figure 3.12. Due to the difficulty of obtaining real statistics about software popularity, we estimate it using our collected average number of downloads for each module (normalized by the total number of downloads).

[Figure 3.15: Software Installer (Fast Nodes)]
[Figure 3.16: Software Installer (Slow Nodes)]

We consider the heterogeneous nodes in Table 3.2, where each arriving node randomly (uniformly) selects Y torrents (software modules) to join, where Y is between 1 and M. It then randomly (uniformly) chooses Z out of Y in which it will participate as a leecher (representing software modules it wants to download); it participates as a seed in the remaining Y−Z (representing software modules it downloaded previously and is willing to share). The specific Y torrents (software modules) are chosen based on software popularity (in this case using the distribution given in Figure 3.12). In the experiments presented in Figures 3.15 and 3.16, M = 4. This is quite similar to the movie rental application, with the exception of nodes joining as leechers in multiple torrents and with significantly larger variance in file sizes. In these experiments we compare the resulting average download times of the following schemes: (1) "Original, No Stay, No Re-Seed", (2) "Original, Stay, Re-Seed", and (3) "CTFT-DS". Note that in (1) there is no re-seeding and no staying around (i.e., this essentially corresponds to the current BT). We observe the following. (1) Dynamic seeding helps large file downloads, which can be observed from the difference in download times between "CTFT-DS" and "Original, Stay, Re-Seed" for the 533MB, 582MB, and 749MB files. (2) "Original, No Stay, No Re-Seed" has the worst overall performance; for fast and slow nodes it downloads significantly slower than "CTFT-DS". (3) Small file downloads do best with "Original, Stay, Re-Seed", as is the case for the 268MB, 286MB, and 328MB files. As before, this improvement comes at the expense of slowing down large file downloads, which is remedied by our CTFT-DS scheme. As also illustrated before, the "Original" approach, unlike "CTFT-DS", does not provide incentives for nodes to stay around.
3.4 Further Discussion

In this section we briefly discuss ongoing and future directions for further improvements, as well as considerations that should be addressed when designing and developing a multi-torrent system.

Performance in the Wild: While our simulation-based study of nodes acting as seeds demonstrates the benefits of multi-torrents (refer to Chapter 3.3), we expect the improvements to be even greater in the real world. One reason is that new nodes in the real world have a much longer initial ramp-up period because of various delays (e.g., for acquiring the peer list, connection establishment, and clustering) which do not exist in the simulator. Thus, having seeds in the system could significantly shorten the ramp-up period of new nodes. In addition, a new node that joins the system while re-seeding in other torrents can boost its own initial ramp-up even more in the real world when our proposed CTFT algorithm is used. Moreover, in the real world, the download times of different torrents would have a larger variance than in the simulator. This would result in longer stay-around times, which would in turn result in more significant improvements in overall system performance. Lastly, as mentioned in Chapter 3.1, nodes staying around and re-seeding help keep torrents alive. This effect is not illustrated in our simulation study, as we use constant node arrival rates; it is, however, predicted by the analytical model in [25].

Requirements and Further Improvements: Although nodes with common torrents may exist in the same network at the same time, they may not be peers in all their common torrents, which could diminish the effects of our approach10. One possible enhancement would then be to have trackers assist nodes in finding peers with common torrents.

10 Our scheme degenerates into the original TFT, without additional overheads, when this commonality is not present.

System Parameters: A number of interesting system characteristics can be studied by exploring various system and scheme parameters, e.g., different approaches to weight settings in CTFT, as briefly discussed in Chapter 3.2. A number of other system parameters could also affect the performance of multi-torrents, e.g., the peer set size and the number of peers chosen for unchoking. Studying the effects of such parameters on multi-torrent performance is the topic of our ongoing efforts.

Practical Implementation: Our CTFT approach is completely decentralized and does not require any modifications to the BT communication protocol. That is, we only need to modify the BT client locally (e.g., by adding a bit more state information at each peer and changing the TFT unchoking algorithm to the CTFT algorithm given in Chapter 3.2). Thus, our approach is easily deployable in the current BT system.
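As an illustration of how small this local change can be, below is a minimal sketch of a CTFT-style unchoking step in the spirit of Chapter 3.2, assuming the client already tracks, per torrent, the download rate received from each peer; the seed weighting w_seed and all data-structure names are illustrative assumptions, not the exact weight settings of the algorithm.

    def ctft_unchoke(peers, num_unchoke=4, w_seed=2.0):
        """Rank peers by cumulative (cross-torrent) contribution.

        peers: list of dicts with the per-torrent download rates observed
        locally, e.g. {"id": "p1", "rates": {"t1": 50.0, "t2": 20.0},
                       "seed_in": {"t2"}}  # torrents where the peer seeds
        Rates earned while the peer acts as a seed get a larger weight,
        mirroring CTFT's bias toward peers who re-seed.
        """
        def score(p):
            return sum(rate * (w_seed if t in p["seed_in"] else 1.0)
                       for t, rate in p["rates"].items())
        ranked = sorted(peers, key=score, reverse=True)
        return [p["id"] for p in ranked[:num_unchoke]]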
Malicious Behavior: When using our CTFT mechanism, we give a greater weight to the download rates of peers who are seeds in some torrents. A malicious node could take advantage of this by pretending to be a seed in some torrent. Of course, this would lead to the malicious node not being unchoked in that torrent (by the peers to which it lies), as the malicious node would have to claim to have all of the data. However, doing this may improve the malicious node's probability of being unchoked in some other torrents. Our future efforts include a study of how malicious behavior affects system performance, in the context of multi-torrents, and what countermeasures are possible. One direction for detecting a lying node is to observe that a true seed is able to serve any requested data chunk, while a lying node may not have all the chunks needed to serve requests. We also note that malicious behavior requires code modification, which is not accessible to everyone.

Other Multi-torrent Possibilities: Other possibilities in designing multi-torrents include exploration of better bandwidth allocation. Instead of evenly splitting upload resources among all unchoked peers (as is done in the current BT system), assigning different bandwidth to different peers is a possibility; this has been proposed in [54]. Similar ideas can be used in the context of multi-torrents. Moreover, scheduling among multiple downloads may be another interesting issue in multi-torrents, e.g., when a node has a number of files to download, should it download all of them in parallel or a few at a time, and, if the latter, which ones and how many at a time. Answering such open questions is the topic of future work.

Chapter 4

P2P VoD

In recent years a number of research efforts have focused on the effective use of P2P-based systems in providing large-scale video streaming services. In particular, live streaming and Video-on-Demand (VoD) systems have attracted much interest. While previous efforts mainly focused on the common challenges faced by both types of applications, there are still a number of fundamental open questions in designing P2P-based VoD systems, which are the focus of our effort in this chapter. In Chapter 1.2, we gave an overview of these open questions. In this chapter, we consider a BitTorrent (BT)-like P2P VoD system and first illustrate how current approaches fall short of meeting streaming Quality of Service (QoS) requirements with respect to these questions. Motivated by this, in Chapter 4.2 we propose practical schemes for the Peer Request Problem, and in Chapter 4.3 we propose Deadline-Aware Scheduling for the Peer Scheduling Problem. We also show that addressing any one of these problems alone is not sufficient to achieve high QoS. To support this claim, we present an extensive evaluation study of these approaches under a variety of environments.

4.1 Performance Metrics and Experimental Setup

We explore and evaluate solutions to the questions stated in Chapter 1.2 through simulations, using the BT simulator provided by [2] (also used by other groups for BT-related research). This is an event-based simulator, originally developed to simulate the piece exchange mechanism of the BT protocol. To explore our proposed approaches for BT-like VoD systems, we modify the simulator in [2] as follows:

• We remove BT's default piece exchange mechanism (to adapt it to VoD streaming) and implement the data piece request and service mechanisms of Chapters 4.2 and 4.3.

• We let each node send up to D requests to peers concurrently1; each peer serves U of the incoming requests, with the remainder placed in a queue.

• Nodes start their playback after a startup delay, s. After that, playback proceeds at the rate of r without interruption. If a piece is not received before its playback time, it is marked as missing2.

1 Increasing D allows a node to request data pieces further into the future at the cost of causing longer queues at peers, thereby increasing waiting time; a detailed exploration of this parameter is outside the scope of this paper.
2 A small s results in lower startup delay but also in poorer video quality; a large s improves continuity but also increases startup delay (and both are aspects of QoS). A detailed exploration of s is outside the scope of this paper.

• Each node serves requests until it finishes playback. Once playback finishes, the remaining requests in its queue are discarded and need to be reissued by the requesting nodes. This emulates a user quitting the system in the real world.

• We allow node arrivals; in what follows we use a Poisson arrival process with rate λ.

• There is one initial server in the system, and it stays in the system for the duration of the simulation. Each node can request a data piece from this server if that piece cannot be found among its peers.

Unless otherwise stated, the results that follow correspond to the simulation settings given in Table 4.1. All experiments simulate a BT-like VoD system for 30 hours. To isolate the effects of our proposed approaches, we first consider the in-order piece selection strategy, i.e., nodes requesting pieces according to their playback order. We then study our proposed approaches under mixed selection (i.e., nodes requesting some pieces according to playback order and some based on their rarity) in Chapter 4.3. By default, each node serves its incoming request queue using the FCFS policy. For a fair comparison between approaches, we use the same node arrival sequence for each simulation with a given arrival rate. In experiments where nodes randomly select the peer to which to send a request, we also use the same selection sequence.

Table 4.1: Simulation Settings
Simulation Time                     30 hours
Avg node inter-arrival time (1/λ)   60 sec
Movie Encoding Rate                 500 Kbps
Startup Delay (s)                   10 sec
Piece Size                          256 KB
File Size                           400 MB (1600 pieces)
Peer Set Size                       40
Node Max #Upload Connections (U)    5
Node Max #Concurrent Requests (D)   10
Node Capacity (Down / Up)           5000 Kbps / 512 Kbps
Server Max #Upload Connections      5
Server Upload Capacity              5000 Kbps

In what follows, unless otherwise stated, we focus on the continuity index (CI), defined in [67], as our main metric for video viewing quality, where

CI = (#total pieces − #total missing pieces) / #total pieces.

A higher CI implies better video playback quality.

4.2 Peer Request Problem

We begin with a simple example which illustrates the poor viewing quality that can occur in an unbalanced system (i.e., where requests are not evenly distributed among nodes and only a subset of the nodes serve most requests). Motivated by this, we explore approaches to balancing the request load, in order to improve playback quality.

Motivating Example: For ease of exposition, we perform an experiment using a homogeneous set of nodes, with the following peer request policies.

Random (Rand): Each node sends a request to a randomly chosen neighbor which has the needed data piece. This is a typical approach used in other works, e.g., [52], and we use it as a default/baseline case.

Least Loaded Peer (LLP): Each node sends a request to the neighbor with the shortest queue, among all those that have the needed data piece, randomly breaking ties. We use this as an ideal case, to mimic perfect load balancing.
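A minimal sketch of the two policies, assuming each node can see which pieces a neighbor holds and, for LLP only (as an idealization), its instantaneous queue length; the data layout is an illustrative assumption.

    import random

    def rand_policy(neighbors, piece):
        """Random: pick uniformly among neighbors that have the piece."""
        holders = [n for n in neighbors if piece in n["pieces"]]
        return random.choice(holders) if holders else None

    def llp_policy(neighbors, piece):
        """LLP (idealized): shortest instantaneous queue, ties broken at random."""
        holders = [n for n in neighbors if piece in n["pieces"]]
        if not holders:
            return None
        shortest = min(len(n["queue"]) for n in holders)
        return random.choice([n for n in holders
                              if len(n["queue"]) == shortest])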
The resulting CDF of the corresponding CIs is depicted in Figure 4.1, where we observe that the viewing quality (as indicated by CI) under LLP is significantly better than that under Random. Specifically, the average CI under LLP is ≈ 0.97, while it is only ≈ 0.68 under Random. The standard deviation for LLP and Random is ≈ 0.07 and ≈ 0.12, respectively, which is also an indication that peers are likely to get more "stable" QoS under LLP than under Random. This is due to the more balanced distribution of requests under LLP. Intuitively, in an unbalanced system, some nodes will have a long incoming request queue. Requests sent to these nodes will experience long waiting times, which increases the probability that (1) pieces miss their playback deadlines, (2) waiting for service of delayed pieces prevents timely requesting of other pieces, and (3) the upload bandwidth of lightly loaded nodes is wasted, thereby reducing the overall system capacity. We sample nodes in our experiments and observe that under LLP the node queue size is significantly more stable than under Random and tends to be smaller (e.g., the queue size of node #900 under LLP always stays below 10 requests, while that of node #900 under Random goes beyond 35 requests).

Figure 4.1: CI (Random, LLP) — CDF of CI under Random and LLP.

To understand why the load is unbalanced under Random, we observe the percentage of upload performed as a function of the percentage of download completed3. The results are given in Figure 1.1, where we observe:

• Under Random, most of the uploads occur in the later stages of the download process, e.g., ≈ 50% of uploads occur after 80% of the download is completed. This is due to old nodes having more pieces, resulting in a higher probability of older nodes receiving requests. This also verifies a point made in [52], namely that older nodes are often overloaded.

• Under LLP, the uploads are more evenly distributed over the downloading process, e.g., ≈ 10% of the uploads occur between 20% and 30% of download completion (in contrast to only ≈ 3% in the Random case). This is because nodes always send requests to a neighbor with the shortest queue (under LLP), which helps spread the load among peers more evenly.

3 Note that, with 100% of the download completed, upload continues if a node has not completed the video playback.

Motivated by this, we now focus on load balancing techniques. Conceptually, LLP would be a simple approach to load balancing; however, it is difficult to implement, as it requires exact knowledge of instantaneous node queue lengths. We could approximate it by obtaining information about peers' queue sizes; however, that results in a tradeoff between message overhead (for updating such information) and resulting system performance. We explore this tradeoff in detail below. But first, we experiment with a straightforward approach which approximates LLP without the need for updates, to understand whether high QoS can be achieved without overhead. Specifically, we use:

Least Requested Peer (LRP): For each neighbor, each node counts how many requests it has sent to that peer and picks the one with the smallest count, randomly breaking ties.

As noted in the above example, it is the older nodes that tend to get overloaded. An approach to load balancing based on the notion of peer age, but in the context of file downloads rather than streaming, is proposed in [21] using "stratification". Conceptually, stratification attempts load balancing by ensuring that peers of age t only download from peers of age t+Δ. Adapting the notion of stratification to VoD systems is suggested in [52]4, and a similar idea to stratification is also proposed in [45], using tracker support. (Details of how [45] and [52] differ from our work can be found in Chapter 2.2.)

4 In [52], the suggestion of using stratification-type approaches is only made at a high level, without evaluation or sufficient scheme details for implementation. We tried experimenting with their suggested schemes, using a reasonably straightforward implementation, and their performance was not as good as that of the schemes explored here. Due to lack of space, we do not present detailed results here.
For comparison purposes, we include in our experiments below an approach similar to that in [45], which can be described as follows.

Tracker Assistant (Tracker): The tracker sorts peers according to their arrival times. Whenever a node requests the list of available nodes from the tracker (e.g., upon arrival or when lacking peers due to peer departures), it receives a list of the nodes whose arrival times are closest to its own.

To explore the notion of load balancing by not overloading older peers, we also propose the following schemes.

Youngest-N Peers (YNP): Each node sorts its neighbors according to their age, where a peer's age can be determined from its join time (available at the tracker). YNP then randomly picks a peer among the N > 1 youngest peers which have the piece of interest and requests that piece from that neighbor. This approach tries to send requests to younger peers, as they are less likely to be overloaded. We choose randomly among a subset of the youngest peers, rather than the youngest one, as choosing the youngest one may lead to many nodes sending their requests to the same youngest peer (thus potentially overloading it).

Closest-N Peers (CNP): This approach more closely emulates the stratification behavior described above. It is similar to YNP but instead sorts the neighbors based on how close they are to the node's own age, and then randomly picks from the N closest-age peers that have the needed piece.
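To make the age-based selection concrete, below is a minimal sketch of YNP and CNP, assuming each neighbor entry carries the join time obtained from the tracker; the names and structures are illustrative, not the client implementation.

    import random

    def ynp_policy(neighbors, piece, n=15):
        """YNP: request from a random peer among the N youngest holders."""
        holders = [p for p in neighbors if piece in p["pieces"]]
        if not holders:
            return None
        # Youngest first: a larger join_time means a more recent arrival.
        holders.sort(key=lambda p: p["join_time"], reverse=True)
        return random.choice(holders[:n])

    def cnp_policy(neighbors, piece, my_join_time, n=15):
        """CNP: same idea, but rank holders by closeness to our own age."""
        holders = [p for p in neighbors if piece in p["pieces"]]
        if not holders:
            return None
        holders.sort(key=lambda p: abs(p["join_time"] - my_join_time))
        return random.choice(holders[:n])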
Table 4.2: Different Load Balancing Schemes
                     Rand    YNP     CNP     LRP     Tracker   LLP
Average CI           0.678   0.785   0.786   0.716   0.785     0.966
Std. Deviation       0.104   0.155   0.157   0.143   0.097     0.066
CI Improvement (%)   -       15.82   15.91   5.59    15.77     42.56

Table 4.2 reports the average CI, standard deviation (STD), and the improvement in the average CI as compared to Random for all the approaches discussed above. Figure 4.2 depicts the average CI as a function of N for YNP (the results for CNP are very similar and are thus omitted).

Figure 4.2: Different N — average CI of YNP and YNP+DAS as a function of N.

We make the following observations:

• YNP and CNP give significant performance improvements as compared to Random, for good choices of N. Similar performance is also achieved by Tracker. (Although Tracker does not require picking a good N, it performs worse than YNP and CNP once we apply our scheduling improvements of Chapter 4.3.)

• YNP (and CNP) can be quite sensitive to the choice of N, as seen in Figure 4.2. If N is too small, YNP and CNP risk overloading a few peers, but with very large values of N (approaching a node's neighbor set size), YNP and CNP degenerate to Random (as also observed in our experiments). In our later experiments, we fix N = 15 when we use YNP/CNP. We reduce the sensitivity to the choice of N through the approaches presented in Chapter 4.3.

• LRP does not perform as well as the other schemes, which indicates that this straightforward approach to approximating LLP is not sufficient. Adding randomization (as in YNP and CNP) might help, but is outside the scope of this paper.

• LLP gives the best performance among all schemes. However, since it is difficult to implement in practice (as noted above), we next consider implementable approximations that perform better than LRP (but at the cost of update overhead).

Figure 4.3: CI (LLP-S, LLP-P) — average CI of LLP-S, LLP-P, and LLP-P+DAS as a function of the update interval.

LLP with Stale Information (LLP-S): One possible implementation of LLP is to let each node report its queue length to its neighbors periodically, which we term LLP-S5. Not surprisingly6, this results in a tradeoff between information freshness and update overhead. Figure 4.3 shows LLP-S's performance, and Figure 4.4 shows the corresponding message overhead, per data piece, plotted on a log scale. With a small update interval (e.g., 5 seconds), LLP-S performs well but at the cost of high message overhead (e.g., ≈ 38 messages per data piece for a 5 second update interval). Under longer update intervals, LLP-S's performance drops quickly, e.g., it performs similarly to Random when the update interval is increased to 90 seconds (with a corresponding message overhead of ≈ 3 messages per data piece).

5 Studying malicious behavior in queue length reporting is outside the scope of this paper.
6 This is also noted in, e.g., [24], with the differences between [24] and our work explained in Chapter 2.2.

Figure 4.4: LLP Update Overhead — messages per data piece (log scale) for LLP-S, LLP-P, and LLP-P+DAS, as a function of the update interval.

LLP Piggyback (LLP-P): Because of the relatively high overhead of LLP-S (even with larger update intervals), we propose LLP Piggyback (LLP-P), which is suitable for BT-like VoD systems. In a BT-like system, when a node receives a data piece, it sends a Have message to all its neighbors. We piggyback our LLP update messages on these Have messages, thus reducing the additional message overhead. Since it is possible that no Have message is sent by a node for a long period of time (e.g., a node experiencing a slow download or one that has downloaded all pieces), we still include explicit update messages in LLP-P, sent whenever no update message has gone out (either explicitly or through piggybacking) for T_l time units. Due to lack of space, we give a formal description of LLP-P in [65]. Figures 4.3 and 4.4 depict LLP-P's performance and corresponding message overhead, respectively, where we observe:

• With piggybacking, the update message overhead is significantly reduced; e.g., the message overhead corresponding to a 30 second update interval is only ≈ 0.6 messages per data piece, as compared to ≈ 7.6 without piggybacking.

• With piggybacking, the average CI is less sensitive to the update interval: it drops only ≈ 4% when going from a 5 second to a 90 second interval. This indicates that we can use a larger update interval without significant performance degradation, which is due to the already frequent updates achieved through piggybacking on Have messages (these were measured to be sent, on average, every ≈ 4 seconds in the simulation).

Rather than relying solely on the Have messages, one could also piggyback updates on other messages, e.g., piece requests, streaming data, etc. We expect this can further reduce the overhead while improving CI.
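A minimal sketch of the sender-side LLP-P logic just described, assuming a simple send callback and wall-clock timing; the message names and layout are illustrative assumptions, not the formal description in [65].

    import time

    class LLPUpdater:
        """Attach queue-length updates to Have messages; fall back to an
        explicit update if nothing has been sent for T_l seconds."""

        def __init__(self, t_l, send_message):
            self.t_l = t_l              # explicit-update threshold (sec)
            self.send = send_message   # callback: (type, payload) -> None
            self.last_update = time.time()

        def on_piece_received(self, piece_id, queue_len):
            # Normal case: the Have message carries the current queue length.
            self.send("HAVE", {"piece": piece_id, "queue_len": queue_len})
            self.last_update = time.time()

        def tick(self, queue_len):
            # Called periodically; covers nodes that send no Have for a while.
            if time.time() - self.last_update >= self.t_l:
                self.send("UPDATE", {"queue_len": queue_len})
                self.last_update = time.time()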
Although, as demonstrated in this chapter, a good load balancing scheme is important, significant room for CI improvement remains. Next, we study how service scheduling affects system performance and propose approaches to further improve CI.

4.3 Service Scheduling Problem

In [52], the authors show, using their model, how to bound delay in a BT-like VoD system under the FCFS queuing policy. This is related to the second question we posed in Chapter 1.2, i.e., what service scheduling policies are better suited for VoD systems, and perhaps under what environments. It includes two sub-problems: (i) in what order should requests be served, and (ii) whether some requests should be rejected. To address these, we propose Deadline-Aware Scheduling (DAS), which considers the requests' deadline constraints.

As argued in Chapter 1.2, the requested pieces' deadlines (in a VoD system) are quite diverse (refer to Figure 1.2). Hence, a node's request queue contains a mix of requests, those with urgent deadlines and those with less urgent ones. In such situations a FCFS policy may not work well, as illustrated by the following example.

Peer Service Example: We are given two requests, A and B, which arrive in that order (see Figure 4.5), where each request takes 1 time unit to serve. A's deadline expires in t+2 time units and B's deadline in t+1 time units. Using FCFS, A is served first, and B is served 1 time unit later. Unfortunately, B then misses its deadline, as by the time A's service completes, B's deadline has passed. However, if the service policy is deadline aware, B can receive service before A, and A still makes its deadline. As a result, both A and B make their deadlines.

Figure 4.5: Piece Service Example — under FCFS, A makes its deadline but B misses it; under deadline-aware service, both A and B make their deadlines.

Figure 4.6: Service Rejection Example — in Case (i), A and B make their deadlines but C misses its deadline; in Case (ii), D and F make their deadlines while E misses its deadline.

Earliest Deadline First (EDF): The above simple example illustrates the benefits of a deadline-aware policy. Motivated by this, we use the earliest deadline first (EDF) policy. Under EDF, each node maintains a queue sorted by request deadline and picks the request with the most urgent deadline to serve first. In our experiments, each node stamps a request with the remaining time until the playback point of that piece7. Upon receipt of the request, the serving peer extracts this information and uses it as the request deadline.

7 Malicious nodes can attempt to forge deadlines, thus making their requests more urgent. Studying malicious behaviors and corresponding prevention schemes is part of future efforts. We note that simple detection can reduce the effectiveness of such exploits, e.g., a node can estimate deadlines for peers' requests based on their request history or join time.

Service Rejection Example: We return to the above example and extend it with two additional cases, illustrated in Figure 4.6. In Case (i), we have another request, C, which has a deadline at t+2.5 and arrives after B is served. Since it is less urgent than A, according to EDF, C will be served after A, i.e., at t+3, resulting in C missing its deadline.
In this case, we should not serve C, so that it can try other peers. In Case (ii), at time t, there are two requests, D and E, in the queue, with respective deadlines of t+1 and t+2.5. Then request F arrives, with deadline t+2. Since D is the most urgent, according to EDF it is served first, after which (at time t+1) F is served, since it is more urgent than E. After F is served, at time t+2, there is no way for E to make its deadline (as its service takes one time unit). In this situation, we should have accepted service of F but dropped E, as E is less urgent and thus has more time to look for service from other peers.

Early Drop (EDP): The question before us now is how to avoid wasting a request's time waiting in the queue if (given the current load on that peer) it cannot be served on time. In particular, (1) if a request cannot be served on time, inserting it into the queue wastes resources/time that could be used by other requests, and (2) inserting a new request into the queue may change the waiting time of existing requests (when EDF is used), suggesting that we should re-evaluate existing requests to see if they can still be served on time. To address these issues, we propose the Early Drop (EDP) policy, which works as follows. We first estimate the waiting time of a newly arrived request, using the currently available bandwidth and the request load already in the queue that can affect the newly arrived request (i.e., based on the request's deadline and the service policy used, e.g., FCFS vs. EDF). If it is determined that the newly arrived request can make its deadline, it is inserted into the queue (according to the service policy). At that point, we estimate (in a similar manner) the waiting time of all requests that were already in the queue before the new arrival and ended up being queued behind it. If some of these requests will now miss their deadlines, they are dropped from the queue, and the peers that made the original requests can try to obtain the corresponding pieces from other peers. Thus, our approach tries to drop requests from the queue as early as possible, i.e., as soon as it is determined that they will miss their deadline. We give a more formal description of EDP in [65].

Deadline-Aware Scheduling (DAS): Given our deadline considerations, it is (intuitively) useful to combine EDF and EDP, and we term the combined scheme Deadline-Aware Scheduling (DAS).
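To make the combined policy concrete, below is a minimal sketch of the DAS queue operation, under the simplifying assumptions that every request takes one service-time unit and that the peer's bandwidth is folded into that unit; this is an illustrative reading of EDF plus early drop, not the formal description in [65].

    import bisect

    def das_insert(queue, new_req, now, service_time=1.0):
        """EDF insert with early drop. queue: list of (deadline, req_id)
        kept sorted by deadline. Returns (queue, dropped), where dropped
        requests can no longer make their deadlines here and should be
        reissued to other peers."""
        deadline, rid = new_req
        pos = bisect.bisect_left(queue, (deadline, rid))
        # Admission test: the new request finishes after everything ahead
        # of it plus its own service time.
        if now + (pos + 1) * service_time > deadline:
            return queue, [new_req]      # reject: let it try another peer
        queue.insert(pos, (deadline, rid))
        # Re-evaluate requests queued behind the new arrival; drop any
        # that would now miss their deadlines.
        kept, dropped = queue[:pos + 1], []
        for d, r in queue[pos + 1:]:
            if now + (len(kept) + 1) * service_time <= d:
                kept.append((d, r))
            else:
                dropped.append((d, r))
        return kept, dropped

On Case (ii) above (a queue holding D with deadline t+1 and E with deadline t+2.5, and F arriving with deadline t+2), this test admits F and drops E, which is exactly the behavior argued for.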
Here, we study the effect of DAS under the different load balancing schemes discussed in Chapter 4.2 8. In Table 4.3 we report the average CI, standard deviation, and the improvement in the average CI as compared to the case without DAS. We set N = 15 in YNP; LLP-P's update threshold is set to 40 seconds (based on our earlier experiments). We omit CNP results, as they are quite similar to those of YNP.

8 The performance of EDF or EDP acting alone can be found in [65].

Table 4.3: Different Load Balancing Schemes with DAS
                     Rand    YNP     LRP     Tracker   LLP-P   LLP
Average CI           0.929   0.982   0.887   0.975     0.996   0.998
Std. Deviation       0.104   0.155   0.143   0.097     0.005   0.066
CI Improvement (%)   37.02   25.10   23.88   24.20     5.29    3.31
Msg. Overhead        1.959   0.549   6.904   1.830     0.430   0.053

Figure 4.7: Overhead (LLP-P+DAS) — DAS overhead and total message overhead as functions of the update interval.

We observe the following:

• All load balancing schemes show significant improvement when DAS is applied. When DAS is used with LLP or LLP-P, we can achieve a CI of nearly 1, which indicates the importance of including DAS, even when good load balancing schemes are used.

• Figure 4.2 depicts the average CI of YNP as a function of N. We find that YNP is less sensitive to N when DAS is used, which is a highly desirable property. The reason is that, even if peers send requests to a small subset of neighbors, under DAS these requests still have a high chance of either being served before their deadlines or being reissued to a "better" peer. Even as N grows large, we no longer see the big drop in CI that occurs when DAS is not used. This makes the policy more practical to implement than YNP alone (e.g., we can safely use reasonably small values of N).

• Figure 4.3 depicts the average CI of LLP-P as a function of the update interval threshold. We find that LLP-P with DAS is less sensitive to the update interval threshold and can achieve a CI close to 1 even with a large interval. The update message overhead of LLP-P with DAS is similar to that of LLP-P without DAS, as shown in Figure 4.4.

DAS Overhead: Use of EDP introduces additional message overhead, i.e., piece requests have to be reissued when requests are dropped by a peer. Such request dropping may cause a chain effect in request reissuing, as a reissued request may cause additional drops when it arrives at another peer, and so on. To evaluate the overhead due to DAS, in our experiments we measure the number of requests sent per data piece under the different load balancing schemes and report9 them in Table 4.3, where we observe:

• In general, the better load balanced the scheme, the lower the DAS overhead, since a more balanced scheme can "direct" requests to "better" peers, which cuts down on reissuing of requests.

• Under LLP-P, increasing the update interval threshold increases the DAS overhead, as shown in Figure 4.7; this is due to the reduced effectiveness of LLP-P under larger update intervals. However, the total message overhead of LLP-P is still dominated by LLP update messages. Thus, the total message overhead increases when we use a smaller update threshold, as depicted in Figure 4.7. Our measurements indicate that the total message overhead of LLP-P with DAS is similar to that without DAS.

9 For LLP-P, we show the total message overhead, which includes both LLP update messages and DAS request reissue messages.

Is LLP-P Always Preferred?: Above, LLP-P shows good performance with small message overhead; thus, a natural question is whether LLP-P should be the scheme of choice. In BT, each node has a list of neighboring peers, and the overhead of LLP-P depends on (1) how often it sends explicit update messages and (2) to how many neighbors it needs to send these messages (i.e., its peer set size). To study how peer set size affects LLP-P overhead, we vary each node's peer set size from 10 to 100 and compare LLP-P to YNP.
Figure 4.8 depicts the resulting average CI, and Figure 4.9 shows the corresponding message overhead, where we observe:

Figure 4.8: Peer Set Size (CI) — average CI of LLP-P and YNP+DAS as a function of peer set size.

Figure 4.9: Peer Set Size (Overhead) — message overhead of LLP-P and YNP+DAS as a function of peer set size.

• With a larger peer set size, the average CI of both LLP-P and YNP improves, as a larger number of peers increases piece availability among neighbors, which helps the load balancing schemes.

• With a larger peer set size, the message overhead of LLP-P increases. This is due to LLP update messages being sent to more peers. By contrast, the message overhead of YNP decreases, because better load balancing through a larger peer set helps reduce reissuing of requests. Therefore, under larger peer set sizes, YNP can outperform LLP-P in terms of message overhead, while both YNP and LLP-P achieve similar CI.

Mixed Piece Selection: For ease of exposition, above we evaluated our approaches using in-order piece selection for determining which data piece to request. Mixed piece selection is studied in the literature, e.g., [35], [70], [9], where most techniques can be summarized as a combination of rarest-first selection (mainly for piece diversity) when deadlines are not urgent and in-order selection (mainly for making deadlines) when they become urgent. It has been shown that mixed piece selection improves performance when done properly. Thus, we also evaluated our load balancing approaches under mixed piece selection; since the results were qualitatively similar to those presented earlier, we omit them here. Instead, we focus on combining mixed piece selection with our DAS approaches.

Under EDP, a node keeps searching for a peer to serve a request and eventually obtains the piece on time, unless no peer has that piece or those peers that do are too overloaded to make the deadline. Using a mixed strategy can help reduce the probability of not being able to obtain a piece on time. However, a naive implementation of mixed selection under DAS does not work well. Specifically, the rarest-first part of mixed piece selection conflicts with EDF, as rarest-first selection typically requests pieces far from the current playback point, which have more slack time than normal in-order requests. Consequently, such requests end up at the back of the queue and wait a long time to be served.

To make DAS work well with mixed piece selection, we use two request queues (per node), one for in-order requests and one for rarest-first requests. In-order requests are served using EDF, and rarest-first requests are served using FCFS10. When a service slot becomes available, we consider the first request in the rarest-first queue: if serving that request does not result in a missed deadline for a request in the in-order queue, we serve it; otherwise, we pick a request from the in-order queue (using EDF). We give the details of adapting DAS for mixed piece selection in [65].

10 This is done to respect the motivation of requesting pieces that are rare at request time. For the same reason we limit the length of the rarest-first queue.
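A minimal sketch of this service-slot decision, reusing the unit-service-time simplification of the DAS sketch above; the queue representations are illustrative assumptions.

    def next_request(in_order, rarest, now, service_time=1.0):
        """Pick the next request to serve. in_order: EDF-sorted list of
        (deadline, id); rarest: FCFS list. Serve the head of the
        rarest-first queue only if doing so cannot make any in-order
        request miss its deadline."""
        if rarest:
            # If every in-order request still meets its deadline after
            # being pushed back by one slot, the rare piece can go first.
            safe = all(now + (i + 2) * service_time <= d
                       for i, (d, _) in enumerate(in_order))
            if safe:
                return rarest.pop(0)
        if in_order:
            return in_order.pop(0)   # EDF: most urgent deadline first
        return None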
Let p be the probability of selecting a rarest piece and 1 − p the probability of doing in-order selection in the mixed strategy. To evaluate our approach, we perform the following experiments: (1) Random using mixed selection with p = 0.111; (2) Random using mixed selection with DAS and p = 0.1, using the naive (one queue per node) implementation; (3) Random using in-order selection with DAS; and (4) Random using mixed selection with DAS, with p = 0.1 and the two-queues-per-node adaptation described above.

11 We pick p = 0.1 as it is a typical value evaluated in the literature. Exploring other values of p or other mixed selection schemes, e.g., as in [35], is outside the scope of this paper.

Figure 4.10: CI (Mixed Selection) — CDF of CI for Random(0.1), Random+DAS(0.1, Naive), Random+DAS(0), and Random+DAS(0.1).

Figure 4.10 depicts the corresponding results, where we observe:

• Without the separate request queues, the performance of DAS is quite poor under mixed piece selection: the average CI is only ≈ 0.33, as compared to ≈ 0.68 without DAS. In contrast, the proposed two-queues-per-node adaptation gives significant improvements: the average CI is ≈ 0.99, as compared to ≈ 0.68 without DAS.

• DAS performs better under mixed selection than under in-order selection: the average CI is ≈ 0.99 as compared to ≈ 0.93. This is due to better piece diversity under mixed selection, with later pieces having higher availability under mixed selection than under in-order piece selection.

We also experimented with the other load balancing schemes using DAS with mixed selection. All schemes showed significant improvements, with the average CI of LRP being ≈ 0.97 and that of the other schemes being ≈ 0.99. Thus, even schemes with relatively poor performance before, such as Random and LRP, can achieve a CI similar to the more load balanced schemes when using mixed selection with DAS. However, this comes at the cost of higher message overhead (as compared to in-order selection), as depicted in Table 4.4. Under mixed piece selection, part of the system resources is shifted to serving rare pieces, which reduces the service rate of in-order pieces and increases the corresponding queue lengths at nodes. This, in turn, increases the chance of a request reissue.

Table 4.4: Message Overhead under DAS and Mixed Piece Selection
                    Rand    YNP     LRP     Tracker   LLP-P
Msg. Overhead       3.669   0.837   7.125   3.600     0.758
Overhead Inc. (%)   87.29   52.46   3.20    96.72     76.28

4.4 Heterogeneous Environment

So far, we have focused on a homogeneous environment, which enabled a simpler exposition and clearer evaluation of our schemes. However, nodes in the real world have different capabilities (e.g., upload capacity). The different upload capacities affect load balancing characteristics, e.g., faster nodes can finish servicing the requests in their queues before slower nodes do. Thus, the load balancing schemes need to be adjusted to account for the heterogeneity in node capacities. We show how to adjust the LLP and YNP schemes; other load balancing schemes can be modified similarly12.

12 Studying how to provide upload incentives in a heterogeneous environment is outside the scope of this paper.

LLP-HLB: LLP-related schemes consider nodes' queue lengths as a way to balance load and thus reduce response times. In a heterogeneous environment, the response times of nodes also depend on their upload bandwidth. Thus, a natural way to adapt LLP is to consider the amount of time it would take a node to respond to all requests in its queue (rather than just its queue length), which is proportional to (node queue length) / (upload bandwidth); we term this scheme LLP-HLB.
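A minimal sketch of the LLP-HLB choice, assuming the node has (or estimates) each holder's queue length and upload bandwidth, and a fixed piece size; the structures are illustrative.

    import random

    PIECE_BITS = 256 * 8 * 1024   # 256 KB piece, in bits

    def llp_hlb_policy(neighbors, piece):
        """LLP-HLB: pick the holder with the smallest estimated response
        time, i.e., queued work divided by upload bandwidth."""
        holders = [n for n in neighbors if piece in n["pieces"]]
        if not holders:
            return None
        def response_time(n):
            return len(n["queue"]) * PIECE_BITS / n["upload_bps"]
        best = min(response_time(n) for n in holders)
        return random.choice([n for n in holders
                              if response_time(n) == best])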
We evaluate LLP-HLB under the variety of heterogeneous settings given in Table 4.5, with the arrival probabilities of fast and slow nodes being the same (hence the average system capacity remains roughly the same across the settings).

Table 4.5: Heterogeneous Settings
Slow Node Upload BW (Kbps)   32    64    128   256
Fast Node Upload BW (Kbps)   992   960   896   768

Figure 4.11: LLP (Heterogeneous) — average CI of LLP and LLP-HLB vs. slow node upload bandwidth.

Figure 4.11 shows the average CI comparison between LLP and LLP-HLB: CI drops when nodes have a larger disparity in upload bandwidth, and LLP-HLB shows improvements over LLP. However, these improvements are not large, which indicates that LLP adapts to heterogeneous environments fairly well; this is because fast nodes clear their request queues faster (i.e., have shorter queues), which results in more requests being directed to them. We also observed similar results for LLP-S and LLP-P, with details found in [65].

YNP-HLB: In the case of YNP, instead of randomly choosing among the N youngest peers, we adapt it to make this choice based on weighted probabilities, where the weights are proportional to the corresponding nodes' upload capacities. We term this YNP-HLB. Figure 4.12 demonstrates the improved performance of YNP-HLB over YNP in various heterogeneous settings. As expected, the HLB adaptation has a significantly greater effect on YNP, as YNP has no information about the loads at different peers and hence does not naturally account for heterogeneity the way LLP does.

Figure 4.12: YNP (Heterogeneous) — average CI of YNP and YNP-HLB vs. slow node upload bandwidth.
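The adaptation changes only the final draw of YNP; a minimal sketch, reusing the neighbor representation of the YNP sketch above and assuming an (estimated) upload capacity per neighbor:

    import random

    def ynp_hlb_policy(neighbors, piece, n=15):
        """YNP-HLB: among the N youngest holders, draw one with
        probability proportional to its upload capacity."""
        holders = [p for p in neighbors if piece in p["pieces"]]
        if not holders:
            return None
        holders.sort(key=lambda p: p["join_time"], reverse=True)  # youngest first
        youngest = holders[:n]
        weights = [p["upload_bps"] for p in youngest]
        return random.choices(youngest, weights=weights, k=1)[0]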
Since our HLB schemes depend on knowledge of neighbors' upload capacities, the above experiments assume that nodes have perfect knowledge of their peers' upload bandwidth. In a real system, errors can occur in bandwidth estimation. We examined the sensitivity of our schemes by introducing 20% and 40% errors into peers' bandwidth information. In both cases, we found that the impact on system performance was negligible (refer to [65]).

Chapter 5

Incentives

In recent years a number of research efforts have focused on the effective use of P2P-based systems in providing large-scale video streaming services. Although the P2P-based design of VoD systems has received much attention in recent years, most existing VoD systems do not have built-in incentives. In this part of the dissertation, we consider a BT-like VoD system and study the following questions: (1) why is an incentive mechanism needed, and (2) what are appropriate incentives for a BT-like VoD system. Motivated by this, in Chapter 5.3 we propose a layered coding based incentive mechanism and show that (a) our approach provides better incentives than BT's TFT, and (b) our approach not only improves system performance but also uses system resources efficiently.

5.1 Background and Motivation

Video coding

Single-layer coding is currently used by most VoD applications, including most of the popular P2P streaming applications, e.g., PPLive [34]. Single-layer coding is widely adopted because of its high coding efficiency and simple design. However, with single-layer coding it is hard for users in a P2P VoD system to have different video playback quality, in terms of video frame rate, video bit rate, and so on.

To address the inflexibility of single-layer coding, Multiple Description Coding (MDC) and layered coding (LC) have been proposed. With MDC, a prerecorded video file is encoded into multiple layers (copies), where each layer has a different video codec description for a different video quality. Layers in MDC are independent of each other, and the video provider can send different layers to the receiver based on the link capacity. The efficiency of MDC depends on the trade-off among the achievable qualities with different numbers of descriptions [47]. MDC is inherently inefficient when a large number of descriptions are created, and this inefficiency largely prevents its usage in practical P2P streaming systems [47]. For this reason, we do not further consider MDC in this work, but the approaches proposed here still apply to a system using MDC.

The basic idea of LC is to encode a video file into multiple layers with nested dependence. The base layer contains the basic data representing the most important features of the video. Additional layers, called enhancement layers, contain data that progressively refine the reconstructed video quality. An enhancement layer can be decoded only if all the lower layers are available. In recent years, significant advances have been made in layered coding: H.264/SVC (layered coding) achieves rate-distortion performance comparable to H.264/AVC (single-layer coding), with the same visual reproduction quality typically achieved at only a ≈ 10% bit rate overhead [62]. Therefore, layered coding is a reasonable candidate for use in P2P streaming systems. In particular, layered coding is useful for addressing the bandwidth heterogeneity of receivers through differentiated services. Because of these benefits, we focus on using layered coding to provide incentives for BT-like VoD systems.

Although layered coding gives participating peers the flexibility to receive different video quality, the unequal importance of the different layers poses challenges for incentive mechanisms in P2P streaming. One challenge is how to request data chunks of different layers. Another challenge is how to allocate resources to different layers; in a BT-like VoD system, this means how to serve peers with data from different video layers. We study both questions in Chapter 5.3.

Motivating Example

We motivate the need for incentives and the use of layered video coding through simple examples. We use simulations to illustrate the benefits of layered coding, with the details of the simulator described in Chapter 5.2 and the settings described in Table 5.1. To illustrate the need for incentives, we conduct the following experiments. We use a single-layer encoded video file with a 600kbps video rate. Nodes use a mixed piece selection probability of 0.2 (a detailed study of different mixed ratios can be found in Chapter 5.3). With single-layer coding, each node is required to download all chunks for playback. When playback reaches a piece that has not yet been received, the video is paused. This behavior emulates the real-world VoD playback experience, e.g., on YouTube, when the streaming buffer does not have enough data to play. We use playback continuity (PC) to measure video playback continuity (refer to Chapter 5.2); video playback is smoother with a lower PC, and PC = 1 means no interruption. The node class distribution is specified in Table 5.2.

Figure 1.3 depicts the resulting PC. The first group represents the average PC where the system contains only faster nodes and uses no incentive mechanism. The second group represents a system with a small portion (20%) of slower nodes and no incentives. The third group represents a system with 20% slower nodes where TFT is used as the incentive mechanism. We have the following observations:

• Without incentives, faster nodes have performance similar to slower nodes, and both perform worse than in a faster-nodes-only system: PC for faster nodes increases from ≈ 1.15 to ≈ 1.44.
This is due to (1) the existence of slower nodes, which reduce the system capacity, and (2) slower and faster nodes having similar chances of being served by peers when there is no incentive mechanism.

• With TFT used as the incentive mechanism, faster nodes have much better performance than slower nodes: PC for faster and slower nodes is ≈ 1.18 and ≈ 1.82, respectively. This is because faster nodes contribute more and thus have a better chance of being served by peers under TFT.

The above experiments give us the following insights. (1) An incentive mechanism is needed for P2P VoD systems, not only for fairness but, more importantly, for performance. In our experiments, faster nodes have better performance when TFT is used, which shows that TFT provides differentiated service between faster and slower nodes. This motivates the need for incentives in P2P VoD systems. (2) Although TFT does provide better performance for faster nodes, it hurts slower nodes' performance too much: slow nodes' performance dropped ≈ 26% when TFT was used, with a resulting PC of ≈ 1.82, meaning that waiting time makes up ≈ 41% of the total playback time. This represents a very bad user experience. In a real-world system, such a bad user experience will drive users away; a more desirable approach is therefore to provide some basic service to all users while letting high-capacity users have better video quality. This motivates us to use layered coding instead of single-layer coding.

To illustrate the performance of TFT with layered coding, we conduct the following experiments. We split the original video into two layers, one base layer and one enhancement layer, each with a video rate of 300kbps. Since the base layer contains the most important video features, every node is required to download all of its data chunks. In contrast, a node can skip an enhancement layer chunk that misses its playback deadline, as this does not affect playback continuity. We use PC and CI to measure base layer and enhancement layer performance, respectively (refer to Chapter 5.2).

Figures 1.4 and 1.5 depict the resulting PC and CI, respectively. The first group represents the results where there are no incentives in the system, and the second group represents the results where TFT is used. The third group represents the results where LCI is used as the incentive mechanism (refer to Chapter 5.3). We have the following observations.

• Faster and slower nodes have similar performance in both PC and CI without incentives. This is further evidence that incentives are needed.

• With TFT, faster nodes have better base layer performance than slower nodes: PC for faster and slower nodes is ≈ 1.07 and ≈ 1.38, respectively. This shows that TFT cannot provide a basic service with layered coding. Meanwhile, faster nodes do not obtain better video quality than slower nodes under TFT: CI for faster and slower nodes is ≈ 0.93 and ≈ 0.89, respectively, which means TFT does not provide good service differentiation either. These two observations clearly indicate that TFT is not suitable for layered coding.

• In contrast, the proposed LCI scheme shows good performance with layered coding. More specifically, faster and slower nodes have similar base layer performance with LCI, which indicates that a basic service is provided to all nodes: PC of faster and slower nodes is both ≈ 1.02. Faster nodes have a significantly higher CI than slower nodes on the enhancement layer: the resulting CI of faster and slower nodes is ≈ 0.98 and ≈ 0.01, respectively.
This shows that LCI is able to provide better video quality to faster nodes. Thus, what we need is an incentive mechanism that can provide (1) a basic service regardless of capacity and (2) differentiated service on the enhancement layers. This motivates us to use LCI as the incentive mechanism with layered coding. We describe LCI in detail in Chapter 5.3.

5.2 Performance Metrics and Experimental Setup

Experimental Setup

We explore and evaluate our proposed schemes through simulations, using the BT simulator provided by [2] (also used by other groups for BT-related research). This is an event-based simulator, originally developed to simulate the chunk exchange mechanism of the BT protocol. To explore our proposed approaches for BT-like VoD systems, we modify the simulator in [2] as follows:

• We extend the BT simulator to support multiple video layers, where each layer can have a different video rate.

• Nodes start their playback after a startup delay, S_min. After that, playback proceeds at the base layer rate. If a base layer data chunk is not received by its playback time, playback is paused. An enhancement layer data chunk is skipped if it is not received by its playback time, and it is marked as missing.

• Each node serves requests until it finishes playback; once playback finishes, the node leaves the system. This emulates a user quitting the system in the real world.

• We allow node arrivals; in what follows we use a Poisson arrival process with rate λ.

• There is one initial server in the system, and it stays in the system for the duration of the simulation. Each node can request a data chunk from this server if that chunk cannot be found among its peers.

Table 5.1: Simulation Settings
Simulation Time                     10 hours
Avg node inter-arrival time (1/λ)   10 sec
Chunk Size                          256 KB
Total Video Rate                    600 kbps
Video Length                        30 mins
Peer Set Size                       40
Max #Upload Connections (U)         5
Server Upload Capacity              800 kbps

Table 5.2: Class Description
Class   Fraction   Download BW   Upload BW
Slow    20%        1500kbps      128kbps
Fast    80%        5000kbps      800kbps

Unless otherwise stated, the results that follow correspond to the simulation settings given in Table 5.1; similar settings are used in [35]. All experiments simulate a BT-like VoD system for 10 hours. For a fair comparison between approaches, we use the same node arrival sequence for each simulation with a given arrival rate. In experiments where nodes randomly select the peer to which to send a request, we also use the same selection sequence.

Performance Metrics: For playback using layered video coding, we need to measure performance in terms of both continuity and quality. Continuity is determined by how fast a node downloads base layer data pieces, and we use the playback continuity (PC) defined in [35]. Given the original length of the video content, L', PC is defined as

PC = (T − S_min) / L'.

T denotes the actual playback time taken to complete the entire playback; we start measuring T as soon as a peer joins the network and requests data. S_min denotes the minimum start-up time required for a peer to be able to start playback after joining the network, defined as the minimum time it takes to download the first piece of the movie file. For a homogeneous set of nodes with bandwidth 800kbps, S_min = 2048 kbits / 800 kbps = 2.56 sec. Thus, PC measures the ratio of the total video playback time, including waiting, to the actual video file length. For the enhancement layers, we want to measure what percentage of a given layer's data is received.
Therefore, we use the continuity index (CI), defined in [67], as our metric:

CI = (#total chunks − #total missing chunks) / #total chunks.

CI measures what percentage of the data is received on time, and a higher CI implies better video quality.

5.3 Layered Coding Incentives

We now present Layered Coding Incentives (LCI), describing it in terms of its peer requesting and peer serving policies.

5.3.1 Peer Requesting

With layered coding, each node can download data chunks from multiple layers. The number of layers a node can download depends on its own capacity as well as the capacity available from its peers. Therefore, a natural question for requesting with layered coding is: how does a node decide which layers to download? Intuitively, two straightforward approaches would be: (1) every node downloads all available layers; and (2) every node downloads only layers whose total rate is less than its upload capacity (in the limit, only the base layer). Both approaches clearly have limitations. Downloading all layers spreads a node's upload bandwidth across multiple layers and may degrade overall performance. Downloading only the base layer improves base layer performance, but it also means a node receives only the minimal video quality.

This motivates us to propose LCI-Request. LCI-Request lets a node monitor its download progress for the layers it has subscribed to, and download the next layer only if it is making good progress. The detailed description of LCI-Request is as follows:

• Step 1: Each node monitors a moving window starting from the current playback point and covering t seconds worth of data chunks. This window contains the data chunks to be played in the near future.

• Step 2: If the fraction of window data chunks already downloaded is higher than W_high, start downloading data chunks of the next layer. This means there is enough data in the window for playback at the current number of layers.

• Step 3: If the fraction of window data chunks already downloaded is lower than W_low, stop downloading data chunks of that next layer. This means there is not enough data in the window for playback.

• Step 4: Among all missing data chunks of the layers being downloaded, perform mixed piece selection.
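A minimal sketch of the Step 1-3 window logic, assuming the node can count the downloaded chunks inside the moving window across its currently subscribed layers; the hysteresis thresholds follow W_high and W_low above, and the names are illustrative.

    def update_layers(downloaded, window_size, cur_layers, max_layers,
                      w_high=0.75, w_low=0.5):
        """Adjust the number of subscribed layers from window progress.

        downloaded: chunks already received inside the moving window;
        window_size: total chunks in the window for the subscribed layers.
        Returns the new number of layers to download.
        """
        progress = downloaded / window_size if window_size else 0.0
        if progress > w_high and cur_layers < max_layers:
            return cur_layers + 1   # healthy buffer: try one more layer
        if progress < w_low and cur_layers > 1:
            return cur_layers - 1   # falling behind: shed the top layer
        return cur_layers           # inside the hysteresis band: no change

Keeping W_high above W_low leaves a hysteresis band, so a node does not oscillate between subscribing to and dropping a layer on small fluctuations in progress.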
To evaluate this approach, we divide the video file of Table 5.1 into a 200kbps base layer and a 400kbps enhancement layer. For a clearer presentation, we use a system with (1) a homogeneous class of nodes with upload bandwidth 256kbps, (2) no incentive mechanism, and (3) W_high and W_low set to 0.75 and 0.5, respectively. We conduct the following experiments: (1) each node downloads both the base and the enhancement layer; (2) each node downloads only the base layer; and (3) each node chooses layers using LCI-Request.

Figure 5.1: PC (200kbps Base Layer) — PC vs. mixed piece selection probability for the All, One, and LCI schemes.

Figure 5.2: CI (400kbps Enhancement Layer) — CI vs. mixed piece selection probability for the All, One, and LCI schemes.

Figures 5.1 and 5.2 show the resulting PC and CI as functions of the mixed ratio. We have the following observations:

• Downloading both layers makes PC much worse compared to the other schemes. This is because the total system capacity is not enough to support both layers: requesting enhancement layer chunks leaves less bandwidth available for base layer chunks. This clearly indicates that blindly downloading all video layers degrades performance, and therefore nodes should not do so.

• Downloading only the base layer shows the best PC, but nodes receive no enhancement layer data.

• The PC of LCI-Request is very close to that of downloading only the base layer, which means base layer download progress remains good with the moving window. Meanwhile, nodes can still receive some portion of the enhancement layer data, which means LCI-Request uses node capacity more efficiently than downloading the base layer only.

5.3.2 Peer Serving

Serving peers with LCI has the following objectives: (1) a basic service regardless of bandwidth; (2) service differentiation; and (3) efficient utilization of node capacity.

Basic Service: A basic service means smooth video playback at a minimal video quality. With single-layer coding, a minimal video quality is hard to provide; with layered coding, we take the minimal video quality to mean video playback with the base layer1.

1 A natural question here is how to determine the base layer rate needed for the minimal quality. However, this question is related to video coding and is thus out of the scope of this work.

One approach to providing a basic service in a BT-like VoD system is to make the base layer free to all users. More specifically, each node uploads base layer data to randomly chosen peers. This policy is easy to implement, and it shows good performance in our experiments if every node follows the rule. Meanwhile, in order for slower nodes to receive a basic service similar to that of faster nodes, it is very important that faster nodes help them. However, the question is how we can force faster nodes to do so. From the faster nodes' perspective, they are more willing to upload to faster peers for enhancement layer data chunks, so that they can get more enhancement layers in return. Moreover, if there is no mechanism encouraging faster nodes to upload to slower nodes, there is no reason for them to do so, as they neither gain anything by uploading to slower nodes nor lose anything by not uploading to them. Therefore, a mechanism is needed to make faster nodes upload base layer data to slower peers.

In order to make faster nodes upload base layer data chunks to slower nodes, we propose the following rule, which we term the Contribution Requirement: each node needs to upload at least X kbps on the base layer to get enhancement layers. In our simulations, we use X = 0.5 ∗ r, where r denotes the video rate of the base layer. For example, if the base layer rate is 300kbps, then in order to download the next enhancement layer, a node needs to upload at least 150kbps on the base layer. With the contribution requirement, faster nodes now have to upload base layer data chunks; otherwise they lose the chance to get enhancement layer data from peers following this policy.

Service Differentiation: Another objective for serving is differentiated service. In P2P systems, the total system capacity is determined by the total upload rate contributed by peers. For example, a BT-like VoD system where all nodes have upload bandwidth 800kbps can only support a video rate of up to 800kbps. Based on this observation, we propose a policy whereby a node serves enhancement layer i, with video rate r_i, only to peers that can upload at that rate. We term this the Rate Requirement.
To determine whether a peer meets the rate requirement, a node needs to know that peer's upload bandwidth. The "Perfect" curves in Figures 5.3 and 5.4 show an ideal case in which we assume each node knows its peers' actual bandwidth. The results show that with this knowledge, LCI achieves good service differentiation. This also indicates that if a node can reliably determine whether a peer meets the rate requirement, good service differentiation follows.

Figure 5.3: PC (300kbps Base Layer)
Figure 5.4: CI (300kbps Enhancement Layer)

However, such perfect knowledge is very hard to obtain in a real P2P VoD system, which suggests that an approximation algorithm can be useful here. There are existing technologies for estimating peer bandwidth. For example, end-to-end bandwidth detection is often used by client-server applications to determine the effective bandwidth. For a BT-like VoD system, end-to-end bandwidth detection is of limited use because: (1) it requires sending probing packets, which can be very expensive in a mesh-based P2P network like BT; and (2) it measures the raw network bandwidth rather than the bandwidth a peer is actually contributing, and is thus easy for malicious users to exploit. BT also measures a peer's upload rate, but this is not sufficient for our needs either, because it only measures the upload rate when there is data exchange between the two peers. In VoD systems, data exchange can be one-directional, which makes such measurements even less useful.

Motivated by the need to estimate peer upload bandwidth and by the drawbacks of existing approaches, we propose an approximation algorithm that estimates peer upload bandwidth from reported upload rates.

• Step 1: Once node a receives a piece from node b, node a reports node b's upload rate, BW_b^new, to all of its neighboring peers.

• Step 2: Assume node c is a common neighbor of nodes a and b; node c then aggregates the bandwidth reports for node b using a moving average. Node c's estimated upload bandwidth for peer b is calculated as $BW_b = BW_b \cdot w_1 + BW_b^{new} \cdot w_2$.

• Step 3: Repeat Step 2 whenever a new rate report is received.

To reduce the message overhead in Step 1, the peer upload rate report can be piggybacked on existing system messages. In BT, a "Have" message is sent to all neighboring peers to announce the receipt of a complete piece; the reported upload rate can thus be easily piggybacked on the BT "Have" message. We use w_1 = 0.9 and w_2 = 0.1, which are common weights used by many network algorithms to calculate a moving average; one example is the calculation of the average round trip time in the TFRC protocol [27].

Figure 5.3 and Figure 5.4 show the resulting performance for LCI with the contribution and rate requirements as a function of the mixed piece selection probability. To illustrate the performance of the proposed bandwidth estimation heuristic, we compare it to the performance obtained with perfect peer knowledge. We have the following observations:

• Faster and slower nodes have similar performance on the base layer. PC is low when the mixed ratio is small, and it is insensitive to the mixed ratio when the ratio is below 0.6. For example, PC for faster and slower nodes is ≈1.02 and ≈1.01, respectively, when the mixed ratio is 0.2.

• Faster and slower nodes see good differentiated service on the enhancement layer, and this is also insensitive to the choice of mixed ratio. For example, CI for faster and slower nodes is ≈0.98 and 0.01, respectively. The insensitivity of both PC and CI to the mixed ratio means that, in a real BT-like VoD system, the mixed ratio can be picked from a wide range of values.

• The bandwidth estimation works well. Both PC and CI using bandwidth estimation are very close to the results with perfect knowledge, which means the heuristic estimates peer upload bandwidth accurately.
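A minimal sketch of the report aggregation in Step 2 follows. The class name and the piggybacking hook are illustrative; we assume rate reports arrive as (peer id, reported kbps) pairs carried on "Have" messages.

    # Moving-average aggregation of peer upload-rate reports (Step 2), as a sketch.
    W1, W2 = 0.9, 0.1    # weights, as in the TFRC-style moving average [27]

    class BandwidthEstimator:
        def __init__(self):
            self.estimates = {}    # peer_id -> estimated upload bandwidth (kbps)

        def on_rate_report(self, peer_id, reported_kbps):
            """Called when a neighbor's report (piggybacked on a BT 'Have'
            message) announces peer_id's most recent upload rate."""
            old = self.estimates.get(peer_id)
            if old is None:
                self.estimates[peer_id] = reported_kbps           # first report
            else:
                self.estimates[peer_id] = old * W1 + reported_kbps * W2

        def meets_rate_requirement(self, peer_id, layer_rate_kbps):
            return self.estimates.get(peer_id, 0.0) >= layer_rate_kbps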
Upload Utilization: We have already shown that our proposed serving mechanism provides: (1) a basic service regardless of bandwidth; and (2) good differentiated service on the enhancement layers. In addition, we want to make sure the mechanism utilizes node upload link capacity efficiently. Intuitively, the proposed rate requirement might reduce the upload bandwidth usage of faster nodes, since they can only exchange enhancement layer data chunks with other fast peers.

Figure 5.5: Uplink Usage

Figure 5.5 shows the CDF of node upload bandwidth usage ratios. We have the following observations:

• Slower nodes utilize their upload bandwidth efficiently: their average upload link usage is ≈0.98. This is because slower nodes can serve both slower and faster nodes with base layer chunks, so their upload bandwidth usage is high.

• Faster nodes do not utilize their upload bandwidth efficiently: their average upload link usage is only ≈0.77. This is because the proposed rate requirement limits faster nodes to exchanging enhancement layer chunks with other fast nodes only, and bandwidth is left over even after serving all the fast peers.

The above example clearly shows that the system has extra capacity to supply enhancement layer data chunks to more nodes than just the faster ones. Therefore, if we let faster nodes upload to more peers, we can improve the upload link utilization of faster nodes. Motivated by these observations, we propose the following scheme to relax the rate requirement (a sketch is given after the list):

• Step 1: Each node monitors its upload link usage while serving other peers.

• Step 2: If the link usage is below a threshold T_up, the node relaxes the rate requirement by one layer: for the highest layer i it serves, which normally requires a peer upload rate u ≥ r_i, it now also serves peers with upload capacity u ≥ r_{i−1}.

• Step 3: Repeat Steps 1 and 2 until the upload bandwidth usage is above T_up.
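A minimal sketch of this relaxation, as we read Steps 1-3 (names are illustrative, and the threshold value below is an assumed placeholder, not taken from the text):

    # Sketch of the relaxed rate requirement (Steps 1-3 above; illustrative names).
    T_UP = 0.9    # upload-usage threshold; an assumed value for illustration

    def allowed_min_rate(layer_rates, layer_index, upload_usage):
        """Minimum peer upload rate required to be served layer layer_index,
        relaxed by one layer while our own uplink is underused."""
        required = layer_rates[layer_index]           # normally u >= r_i
        if upload_usage < T_UP and layer_index > 0:
            required = layer_rates[layer_index - 1]   # relaxed: u >= r_{i-1}
        return required

    def serve_layer(peer, layer_index, layer_rates, upload_usage):
        return peer.estimated_upload_kbps >= allowed_min_rate(
            layer_rates, layer_index, upload_usage)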
Table 5.3 shows the resulting PC, CI, and upload bandwidth usage with the proposed scheme. We can see that the system has enough capacity for both slower and faster nodes to achieve good performance on both the base and enhancement layers. With the proposed scheme, the average upload link usage of faster nodes improves from ≈0.77 to ≈0.86.

Table 5.3: System with sufficient capacity (left pair: rate requirement only; right pair: with relaxation)

Upload BW (kbps)   128 (20%)   800 (80%)   128 (20%)   800 (80%)
PC                 1.0219      1.0139      1.0153      1.0143
CI                 0.0091      0.9724      0.9504      0.9725
Upload Usage       0.9826      0.7743      0.9722      0.8630

To verify that the system indeed has the capacity to support both classes of nodes, we reduce the faster nodes' upload bandwidth from 800kbps to 650kbps. Table 5.4 shows the resulting PC, CI, and upload bandwidth usage. Now CI for the slower nodes improves only from ≈0.01 to ≈0.13, while PC for both slower and faster nodes is similar to that without relaxing the rate requirement. This shows that: (1) the system has insufficient capacity to supply enhancement layer data to both classes of nodes; and (2) our proposed scheme utilizes the system capacity efficiently without hurting the performance of the faster nodes.

Table 5.4: System with insufficient capacity (left pair: rate requirement only; right pair: with relaxation)

Upload BW (kbps)   128 (20%)   650 (80%)   128 (20%)   650 (80%)
PC                 1.0394      1.0256      1.0467      1.0251
CI                 0.0063      0.9587      0.1299      0.9588
Upload Usage       0.9801      0.9021      0.9795      0.9407

The complete LCI-Serve scheme is described in Algorithm 1.

Algorithm 1 LCI-Serve
Input: Connection list C, #Layers M, #Uploads U, upload usage threshold T_up.
Output: Peer list to serve PS.
Sort C by upload bandwidth
for i = 1 to |C| do
    Peer n_i ← peer on the ith connection in C.
    for j = 1 to M do
        if Layer j is the base layer then
            Allow n_i to download layer j.
        else
            if n_i satisfies the Contribution Requirement and the Rate Requirement then
                Allow n_i to download layer j.
            else if Upload usage r_up < T_up then
                Allow n_i to download layer j.
            end if
        end if
    end for
    PS ← PS ∪ {n_i}
    U ← U − 1.
    if U ≤ 0 then
        break
    end if
end for
Return PS.

5.3.3 Realistic Settings

We have explored the design space of LCI using relatively simple settings, chosen for clarity of presentation. In this section, we validate LCI's performance using the video layer distribution (see Table 5.5) used in [62] and three classes of nodes with the bandwidths listed in Table 5.6.

Table 5.5: Video Layer Distribution

Layer            Rate       Spatial          Temporal
Base             96kbps     176x144 (QCIF)   15fps
Enhancement 1    384kbps    352x288 (CIF)    30fps
Enhancement 2    1536kbps   704x576 (4CIF)   60fps

Table 5.6: Class Description

Class              Fraction   Download BW   Upload BW
Mobile             30%        1500kbps      128kbps
DSL                30%        5000kbps      512kbps
Cable/University   40%        12000kbps     2048kbps

We use a video length of 30 minutes in these simulations; the numbers of chunks for the base layer, enhancement layer 1, and enhancement layer 2 are 84, 337, and 1350, respectively. The base layer carries the video at the lowest spatial (resolution) and temporal (frame rate) quality, emulating the typical video quality a user experiences on a mobile device. The first enhancement layer represents standard-definition video quality, and the second enhancement layer upgrades the quality to high definition. As for the node class distribution, the first class of nodes has very limited upload capacity and represents mobile users on a cellular data network. The second class of nodes has medium upload capacity, similar to typical home DSL users. The third class of nodes has very high upload capacity, corresponding to cable or university users.

Figure 5.6: PC (Base Layer)
Figure 5.7: CI (Enhancement 1)
Figure 5.8: CI (Enhancement 2)

Figures 5.6, 5.7, and 5.8 show the resulting performance with LCI, and we have the following observations:

• For the base layer, all three classes of nodes have similar PC. Figure 5.6 depicts the resulting PC for each node class as a function of the mixed probability. For example, when the mixed probability is 0.2, the resulting PC for mobile, DSL, and cable/university users is ≈1.03, ≈1.02, and ≈1.01, respectively. This shows that LCI provides a basic service to all classes of nodes regardless of their bandwidth.

• Figure 5.7 depicts the resulting CI for enhancement layer 1; DSL and cable/university users have high CI compared to mobile users. For example, CI for mobile, DSL, and cable/university users is ≈0.06, ≈0.93, and ≈0.98, respectively, when the mixed probability is 0.2. This shows that LCI provides differentiated service on the first enhancement layer.

• Similarly, Figure 5.8 depicts the resulting CI for the second enhancement layer. For example, CI for mobile, DSL, and cable/university users is ≈0.03, ≈0.05, and ≈0.96, respectively, when the mixed probability is 0.2. This shows that only the cable/university users have the capacity to watch HD content, and again verifies that LCI provides service differentiation on the second enhancement layer.

• For all layers, LCI performance is insensitive to the mixed probability, which means a wide range of choices is available for the mixed probability in a real BT-like VoD system.
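These outcomes line up with a quick check of the rate requirement against the class upload bandwidths in Tables 5.5 and 5.6. The sketch below reproduces that check only; it is illustrative and ignores the simulation dynamics (contribution requirement, chunk scheduling, and so on).

    # Quick sanity check: which node classes can meet the rate requirement
    # for each layer in Table 5.5? (Illustrative; ignores scheduling dynamics.)
    layer_rates = {"Base": 96, "Enhancement 1": 384, "Enhancement 2": 1536}  # kbps
    upload_bw = {"Mobile": 128, "DSL": 512, "Cable/University": 2048}        # kbps

    for cls, u in upload_bw.items():
        eligible = [name for name, r in layer_rates.items() if u >= r]
        print(cls, "->", eligible)
    # Mobile -> ['Base']; DSL -> ['Base', 'Enhancement 1'];
    # Cable/University -> all three layers, matching Figures 5.6-5.8.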
Chapter 6

Bistro

We provided an overview of the Bistro framework and its fault tolerance protocol in Section 1.4. In this chapter, we give a detailed discussion of a data assignment problem arising in the Bistro fault tolerance protocol. In Section 6.1, we formulate this problem as a non-linear optimization problem; a genetic algorithm heuristic that approximates its solution is developed in Section 6.2. In Section 6.3, we evaluate our approach using simulations and compare the results of our heuristic with other simple heuristics, as well as with an optimal solution obtained by a brute-force approach.

6.1 Problem Formulation

In this section, we formulate an optimization problem corresponding to the assignment problem described earlier.

6.1.1 Notation

Our notation is based on Figure 6.1. The original file is divided into K FEC groups, where each group contains k packets. After encoding with an erasure code, each FEC group is expanded to n packets. Each FEC group is further divided into G checksum groups. The total number of striping units is then KG. The number of intermediate bistros to which we can stripe is B.

Figure 6.1: Graphical Representation of Notation

In this work, we use a simple model that assumes all bistros fail independently. Let the probability that bistro i fails be p_i (i = 1, 2, ..., B). In practice, we can estimate p_i by monitoring bistro i for a period of time. Investigating more sophisticated reliability models is part of our future work; we believe that using more sophisticated models would not change our framework significantly.

6.1.2 Definitions

Let x_i (i = 1, 2, ..., B) denote the number of checksum groups we transfer to intermediate bistro B_i. This data assignment forms a set X, where

$X \triangleq \{x_1, x_2, \cdots, x_B\}.$

Let S_X denote the maximal-cardinality subset of the power set of X such that every element S_i ∈ S_X satisfies the condition that the sum of all elements in S_i is no more than the maximal number of checksum groups the destination server can lose while still being able to reconstruct the entire file. Intuitively, S_X denotes all the intermediate bistro failure patterns, under data assignment X, for which the final destination can still successfully reconstruct the entire file. Therefore, we have

$S_X \subseteq 2^X \quad \text{and} \quad S_i \in S_X \quad \text{for } i = 1, 2, \cdots, |S_X|. \qquad (6.1)$

$\sum_{x_j \in S_i} x_j \le \left\lfloor \frac{n-k}{n} KG \right\rfloor \quad \text{for } i = 1, 2, \cdots, |S_X|. \qquad (6.2)$

A special case of S_X is the empty set ∅; we can easily verify from Equations (6.1) and (6.2) that ∅ ∈ S_X holds for every X. Here, the empty set ∅ corresponds to no intermediate bistro failures.
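As a concrete reading of Equation (6.2), the following sketch enumerates the recoverable failure patterns S_X for a given assignment by brute force. The function name is ours; the enumeration is exponential in B and is meant only to mirror the definitions (for B = 6, as used later, this is only 64 patterns).

    # Brute-force enumeration of S_X per Equations (6.1)-(6.2); a sketch.
    from itertools import combinations
    from math import floor

    def recoverable_patterns(x, n, k, K, G):
        """x[i] = checksum groups assigned to bistro i. Returns the failure
        patterns (as index tuples) from which the file is still recoverable."""
        loss_budget = floor((n - k) / n * K * G)   # right-hand side of Eq. (6.2)
        patterns = []
        for size in range(len(x) + 1):             # includes the empty pattern
            for failed in combinations(range(len(x)), size):
                if sum(x[i] for i in failed) <= loss_budget:
                    patterns.append(failed)
        return patterns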
Let P_i denote the probability of observing the failure pattern S_i under assignment X; then we have

$P_i \triangleq \prod_{x_j \in S_i} p_j \cdot \prod_{x_m \notin S_i} (1 - p_m). \qquad (6.3)$

We can use Equation (6.3) to calculate P_i for each S_i. Thus, the probability that the destination server can collect enough checksum groups to reconstruct the original file under assignment X, denoted by P_X, is given by

$P_X = \sum_{i=1}^{|S_X|} P_i. \qquad (6.4)$

6.1.3 Optimization Problem

We would like to find an assignment {x_1, x_2, ..., x_B}, where each x_i (i = 1, 2, ..., B) is a non-negative integer, such that the probability P_X is maximized. Thus, our assignment problem can be written formally as:

$\max \sum_{i=1}^{|S_X|} \Big( \prod_{x_j \in S_i} p_j \cdot \prod_{x_m \notin S_i} (1 - p_m) \Big)$

subject to

$S_X \subseteq 2^X;$
$S_i \in S_X, \quad i = 1, 2, \cdots, |S_X|;$
$\sum_{x_j \in S_i} x_j \le \left\lfloor \frac{n-k}{n} KG \right\rfloor, \quad i = 1, 2, \cdots, |S_X|;$
$\sum_{i=1}^{B} x_i = KG;$
$x_i \in \mathbb{N}.$

A straightforward approach is to divide the optimization into two steps: (1) computing S_X, and (2) calculating P_X. In the first step, to compute S_X, we need to determine each S_i that satisfies Equation (6.2). This is a subset-sum problem and is NP-Complete [23]. Once we have S_X, calculating P_X takes linear time. Therefore, the total time complexity of the formulation above is determined by the time complexity of the first step.

6.2 Genetic Algorithm Heuristic

In this section, we develop an approximation to an optimal solution of the optimization problem formulated in the previous section. We have already shown that the time complexity of our optimization problem is determined by its first step, which is NP-Complete. Therefore, we are interested in developing a heuristic; in particular, a genetic algorithm based heuristic. We first give a brief introduction to genetic algorithms, and then show how to apply a genetic algorithm approach to our problem.

6.2.1 Genetic Algorithms

Genetic algorithms (GAs) [30] are stochastic search techniques guided by the principles of evolution and natural genetics. They are modeled loosely on the principles of evolution via natural selection, employing a population of individuals that undergo selection in the presence of variation-inducing operators such as mutation and recombination (cross-over). A fitness function is used to evaluate individuals, and reproductive success varies with fitness. A GA generally proceeds as follows [13]:

• Step 1. Randomly generate an initial population;
• Step 2. Compute and save the fitness of each individual in the current population;
• Step 3. Define selection criteria for each individual such that good genes are likely to be inherited;
• Step 4. Generate a new generation by propagating good genes via the genetic operators;
• Step 5. Repeat Steps 2 to 4 until a satisfactory solution is obtained.

A typical genetic algorithm uses three genetic operators (selection, cross-over, and mutation) to direct the population, over a series of time steps or generations, toward convergence at the global optimum. Selection applies pressure on the population in a manner similar to natural selection in biological systems: poorer performing individuals (as evaluated by the fitness function) are weeded out, while better performing, or fitter, individuals have a greater than average chance of promoting the information they contain to the next generation. Crossover allows solutions to exchange information in a way similar to that used by a natural organism undergoing reproduction. Mutation is used to randomly change (flip) the value of single bits within individual strings; it maintains the diversity of a population and helps a genetic algorithm escape from a local optimum. It is typically used sparingly [13].
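In our problem, the natural fitness function is P_X from Equation (6.4). Building on the pattern enumeration sketched earlier, a minimal fitness evaluation might look as follows (again brute force and illustrative; it relies on the hypothetical recoverable_patterns helper from the previous snippet).

    # Fitness of an assignment = P_X from Equation (6.4); brute-force sketch.
    def fitness(x, p, n, k, K, G):
        """x[i] = checksum groups on bistro i; p[i] = its failure probability."""
        p_x = 0.0
        for failed in recoverable_patterns(x, n, k, K, G):
            prob = 1.0
            for i in range(len(x)):               # Equation (6.3) for pattern S_i
                prob *= p[i] if i in failed else (1.0 - p[i])
            p_x += prob
        return p_x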
An effective GA representation, a meaningful fitness evaluation, and well-designed genetic operations are key to a successful GA application. The appeal of GAs comes from their simplicity as robust search algorithms, as well as from their power to rapidly discover good solutions to difficult high-dimensional problems. GAs are useful and efficient when: (1) the search space is large, complex, or poorly understood; (2) domain knowledge is scarce or expert knowledge is difficult to encode; (3) no mathematical analysis is available; and (4) traditional search methods fail [30, 31]. One advantage of the GA approach is the ease with which it can handle arbitrary types of constraints and objectives [13].

6.2.2 GA in Our Problem

As mentioned above, a good design of the cross-over and mutation operators makes a GA powerful. We now develop a GA solution to the problem formulated in Section 6.1. In our formulation, we encode a data assignment as a vector {x_1, x_2, ..., x_B}, where each element of the vector is the number of checksum groups to be transferred to bistro i.

Cross-over

Cross-over always happens between two vectors. In our GA design, we use a two-point cross-over mechanism: two cross-over positions are randomly selected, and the elements between them are swapped between the two vectors, producing two new vectors. This cross-over scheme is similar to the one used in [13] to solve the traveling salesman problem; the difference lies in the way we maintain the feasibility of our solutions. Our selection always picks the best half of a generation and allows those individuals to cross over to generate the next generation.

For example, let B = 6. Given two assignments, {10,10,10,10,10,10} and {20,0,20,0,10,10}, we randomly choose two positions, say 2 and 5; all the elements between these two indices are then swapped, yielding two children: {10,0,20,0,10,10} and {20,10,10,10,10,10}. For each child, we calculate its fitness using Equation (6.4) from Section 6.1.

One side effect of our cross-over operation is that a new assignment may violate one of our constraints, namely $\sum_{i=1}^{B} x_i = KG$. That is, the total number of checksum groups in a new assignment may no longer equal the total number of checksum groups we want to stripe. In the example above, the total number of checksum groups to stripe is KG = 60, but after cross-over, the sums of checksum groups in the newly generated assignments are 50 and 70, respectively, so these assignments are not feasible. To solve this problem, we add or deduct checksum groups at randomly chosen positions of the assignment until the total number of checksum groups satisfies the constraint again. For example, in the assignment {20,10,10,10,10,10}, we need to deduct 10 checksum groups to satisfy the constraint. To do this, we generate 10 random numbers, each representing a chosen position in the assignment; for each of these randomly chosen positions, we deduct 1 checksum group if the number of checksum groups at that position is greater than zero. Otherwise, we randomly choose another position from which to deduct 1 checksum group. Similarly, if the total number of checksum groups in a newly generated assignment is less than the constraint requires, we randomly add checksum groups until the constraint is satisfied; these positions are chosen randomly as well.
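A compact sketch of this two-point cross-over together with the feasibility repair step (hypothetical helpers; randomness is left unseeded):

    # Two-point cross-over with feasibility repair (sketch of Section 6.2.2).
    import random

    def crossover(a, b, kg):
        """a, b: parent assignments (lists of length B); kg: required total."""
        i, j = sorted(random.sample(range(len(a)), 2))
        c1 = a[:i] + b[i:j] + a[j:]        # swap the segment between the points
        c2 = b[:i] + a[i:j] + b[j:]
        return repair(c1, kg), repair(c2, kg)

    def repair(x, kg):
        """Randomly add/deduct checksum groups until sum(x) == kg."""
        x = list(x)
        while sum(x) != kg:
            pos = random.randrange(len(x))
            if sum(x) > kg:
                if x[pos] > 0:
                    x[pos] -= 1            # deduct only where groups remain
            else:
                x[pos] += 1
        return x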
Mutation

Another important element in our GA design is mutation. Although it is used infrequently, it maintains the diversity of the population and can help our GA heuristic escape from a local optimum. In our scheme, mutation is implemented by introducing a random assignment after a certain number of generations if we find that our GA heuristic is stuck at a local optimum. In our simulations, we invoke a mutation if we observe that the average fitness of the population does not change for 5 generations. We choose 5 generations because, in our test cases, the GA heuristic usually converges in fewer than 10 generations; we therefore consider an average fitness that is unchanged for 5 generations a good indication of either a local or a global optimum. In our earlier example, the assignment {20,10,10,10,10,10}, after being adjusted to satisfy the assignment constraint, may give a valid assignment such as {17,9,8,10,10,6}. If it happens to keep crossing over with assignments that contain no 0 elements, we may end up with a population consisting entirely of non-zero values. From such a population, we are very unlikely to obtain an assignment containing a 0, since there is no 0 gene in the population; this will prevent us from reaching a global optimum if that optimum contains 0 elements. In this case, a mutation is needed to intentionally introduce some assignments with 0 elements.

Once we have completed the above operations, we are done with one generation. We run this process iteratively; the average fitness keeps improving unless our GA heuristic is stuck at a local optimum or has reached a global optimum. The process stops after T_long generations, by which point a satisfactory assignment has been found. We discuss how to set T_long in Section 6.3. From the above description, a sketch of our GA heuristic is as follows:

• Step 1. Generate an initial population by encoding randomly generated assignments;
• Step 2. Calculate the fitness of each individual assignment using Equation (6.4);
• Step 3. Pick the best half of the population and perform cross-over;
• Step 4. Adjust the newly generated assignments to satisfy $\sum_{i=1}^{B} x_i = KG$;
• Step 5. Repeat Steps 2 to 4 until a satisfactory assignment is found.

6.3 Validation and Evaluation

In this section, we present a small set of simulation results to illustrate the potential of our GA heuristic. These simulations were run on an Intel Celeron 733MHz PC with 256MB of memory. In them, we create a scenario with KG = 24 and B = 6, and assign a failure probability to each server. For comparison purposes, we use a brute-force search program that traverses the whole search space to compute a globally optimal assignment. We use 6 bistros because we found it feasible for the brute-force approach to compute a global optimum in this case; for more than 6 bistros, the brute-force approach takes an extremely long time. We simulated a number of other test cases with B > 6, but omit them here due to lack of space. For each test case, we stop the iterations after T_long generations. We set T_long = 50 in the simulations, as we found that in most cases our GA heuristic converges to a global optimum in fewer than 10 generations; we believe this is a reasonable choice, since 50 generations is considered a long convergence time in many GA applications [13].
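Putting the pieces together, the five-step sketch above corresponds to a driver loop along the following lines. This is illustrative only: it reuses the hypothetical fitness, crossover, and repair helpers from the earlier snippets, and the population size of 8 and T_long = 50 mirror the settings used in our simulations.

    # Driver loop for the GA heuristic (Steps 1-5), as an illustrative sketch.
    import random

    def ga_heuristic(B, kg, p, n, k, K, G, pop_size=8, t_long=50):
        # Step 1: random initial population of feasible assignments.
        population = [repair([0] * B, kg) for _ in range(pop_size)]
        for _ in range(t_long):
            # Step 2: evaluate fitness (P_X) of every individual.
            population.sort(key=lambda x: fitness(x, p, n, k, K, G), reverse=True)
            # Step 3: keep the best half and cross it over in pairs.
            parents = population[: pop_size // 2]
            children = []
            while len(children) < pop_size - len(parents):
                a, b = random.sample(parents, 2)
                c1, c2 = crossover(a, b, kg)   # Step 4: repair happens inside
                children.extend([c1, c2])
            population = parents + children[: pop_size - len(parents)]
        return max(population, key=lambda x: fitness(x, p, n, k, K, G))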
To demonstrate the strength of our GA heuristic, we also compare it to three simple heuristics (sketched in code further below). The first, which we term the "all-in-one assignment" strategy, puts all the checksum groups on the most reliable bistro. The second spreads the checksum groups evenly among the bistros; we term it the "even assignment" strategy. The third spreads the checksum groups among bistros in proportion to their reliability: the more reliable a bistro is, the more checksum groups it receives. For example, if the failure probability of bistro a is half that of bistro b, then bistro b receives half as many checksum groups as bistro a. We term this the "proportional assignment" strategy.

For test cases 1-4 (described below), we use the initial population given in Table 6.1; Table 6.2 gives the bistro failure probability settings for each test case.

Table 6.1: Initial Population for Test Cases 1-4 (each row is one of the 8 initial assignments across the six bistros)

 4  4  4  4  4  4
 7  6  5  3  2  1
 6  2  6  2  6  1
 1  7  1  7  1  7
 7  1  7  1  7  1
24  0  0  0  0  0
 0 12  0  0 12  0
 8  0  8  0  8  0

Table 6.2: Bistro Failure Probability Settings for Each Test Case

Test case   b_1     b_2     b_3     b_4     b_5     b_6
1           0.025   0.030   0.035   0.040   0.045   0.050
2           0.020   0.050   0.080   0.110   0.140   0.170
3           0.150   0.250   0.350   0.450   0.550   0.650
4           0.250   0.250   0.250   0.250   0.250   0.250

We present our simulation results in Figures 6.2-6.5. Each figure plots the recovery probability as a function of the error capacity, defined as the maximum number of checksum groups that the destination server can lose and still be able to reconstruct the original file. In each figure, we compare the results of our GA heuristic, the "all-in-one assignment", the "even assignment", and the "proportional assignment" with an optimal solution obtained by the brute-force approach.

Figure 6.2: Test Case 1: A Scenario with Reliable Conditions (KG=24, P=(0.025, 0.030, 0.035, 0.040, 0.045, 0.050); panels (a)-(d): Optimal vs GA Heuristic, All-in-One, Even, and Proportional Assignments)

In test cases 1 and 2 (Figures 6.2 and 6.3, respectively), we assign each server a small failure probability, simulating reliable conditions. In test case 1 the differences between the bistros' failure probabilities are small, while in test case 2 they are large. In our simulations, we found that our GA heuristic reaches a global optimum in 7 out of 12 cases in test case 1 (Figure 6.2a) and in 10 out of 12 cases in test case 2 (Figure 6.3a). In those cases where our GA heuristic is stuck at a local optimum, it approximates the global optimum to within at least 99.5%. We have also found that in these two test cases, the three simple heuristics perform worse than the GA based heuristic.
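For concreteness, the three baseline strategies can be sketched as follows. The helper names are ours, and the proportional variant allocates in proportion to reliability weights 1/p_i and then fixes rounding with the earlier repair helper, which is one reasonable reading of the description above.

    # Sketches of the three baseline assignment strategies (illustrative).
    def all_in_one(p, kg):
        x = [0] * len(p)
        x[min(range(len(p)), key=lambda i: p[i])] = kg   # most reliable bistro
        return x

    def even(p, kg):
        base, extra = divmod(kg, len(p))
        return [base + (1 if i < extra else 0) for i in range(len(p))]

    def proportional(p, kg):
        # Allocate in proportion to reliability weights 1/p_i, then fix rounding.
        w = [1.0 / pi for pi in p]
        x = [round(kg * wi / sum(w)) for wi in w]
        return repair(x, kg)    # reuse the earlier repair helper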
The "all-in-one" strategy reaches a global optimum in 3 out of 12 cases in test case 1 (Figure 6.2b) and in 7 out of 12 cases in test case 2 (Figure 6.3b). The "even" strategy reaches a global optimum in 5 out of 12 cases in test case 1 (Figure 6.2c) and in 4 out of 12 cases in test case 2 (Figure 6.3c). The "proportional" strategy reaches a global optimum in 4 out of 12 cases in test case 1 (Figure 6.2d) and in 3 out of 12 cases in test case 2 (Figure 6.3d). In addition, the "even" and "proportional" assignment strategies perform quite poorly when the error capacity is less than 4 checksum groups: their best approximation to the optimal solution is only about 80% (refer to Figures 6.2 and 6.3).

Figure 6.3: Test Case 2: A Scenario with Reliable Conditions (KG=24, P=(0.02, 0.05, 0.08, 0.11, 0.14, 0.17); panels (a)-(d): Optimal vs GA Heuristic, All-in-One, Even, and Proportional Assignments)

In test cases 3 and 4 (Figures 6.4 and 6.5, respectively), we assign each server a high failure probability, simulating error-prone conditions in which bistros are unreliable. In test case 3 the differences in the bistros' failure probabilities are large, while in test case 4 they are identical. In these simulations, we found that our GA heuristic reaches a global optimum in most cases: 10 out of 12 in both test cases (Figures 6.4a and 6.5a). In those cases where our GA heuristic is stuck at a local optimum, it approximates the global optimum to within at least 94.1%.

Figure 6.4: Test Case 3: A Scenario with Error-prone Conditions (KG=24, P=(0.15, 0.25, 0.35, 0.45, 0.55, 0.65); panels (a)-(d): Optimal vs GA Heuristic, All-in-One, Even, and Proportional Assignments)
Figure 6.5: Test Case 4: A Scenario with Error-prone Conditions (KG=24, P=(0.25, 0.25, 0.25, 0.25, 0.25, 0.25); panels (a)-(d): Optimal vs GA Heuristic, All-in-One, Even, and Proportional Assignments)

We have also found that in these test cases, the "all-in-one" strategy obtains fairly good performance, while the other two simple heuristics perform poorly. The "all-in-one" strategy reaches a global optimum in 8 out of 12 cases in test case 3 (Figure 6.4b) and in 7 out of 12 cases in test case 4 (Figure 6.5b), which is close to our GA heuristic. However, the "even" and "proportional" strategies reach a global optimum in only 1 out of 12 cases in both test cases (Figures 6.4c-6.4d and Figures 6.5c-6.5d), and in both test cases their performance is far below that of our GA heuristic. For example, the "even" and "proportional" assignment strategies perform quite poorly in both test cases when the error capacity is less than 4 checksum groups: their approximations to the global optimum are at most 30%. All these results clearly indicate that the "even" and "proportional" assignment strategies may not be very useful in an error-prone environment. Our experiments also indicate that the "all-in-one" assignment strategy is not as good as our GA heuristic: it reaches a global optimum fewer times, and in a number of cases its approximation is not as good.

We also note that when the error capacity is small, an optimal assignment tends to choose the most reliable bistro; as the error capacity increases, more bistros become involved in the assignment. Intuitively, when the error capacity is small, the gain from placing checksum groups on multiple bistros is outweighed by the failure risk of the bistros that are less reliable than the most reliable one. As the error capacity increases, this risk is "reimbursed" by the error capacity and yields a higher overall probability of reconstructing the original file.

We now consider the convergence characteristics of our GA heuristic. To this end, we generated a number of initial populations by randomly selecting 8 assignments from the entire search space. In each test case, we continue the computation until the GA heuristic reaches the same result as it did in the previous test cases. We record the average number of generations over 100 randomly generated initial populations and present it in Figure 6.6.
Considering that the whole search space is 24 6 , our GA heuristic only needs to check at most 8×9 = 72 assignments. According to [13], by introducing good genes into the initial population, we can further speed up convergence of a GA. In the future, we can apply a heuristic in selecting the initial population. In fact, as an initial attempt, we introduced our “all-in-one” and “even” heuristic assignments into our initial population in tests presented in Figures 6.2 - 6.5 and we found that our GA heuristic can converge in less than 5 generations in all cases. From the above results, we can see that our GA heuristic reduces the running time of an NP-hard problem efficiently while still achieving a good approximation to a optimal solution. Although we only presented the results of test cases for 6 servers in this paper, 145 we observed similar results in test cases for more servers. Thus, we believe this genetic algorithms approach is feasible in realistic settings. 146 Chapter 7 Summary We present a summary of our corresponding contributions, as detailed in Chapters 3, 4, 5 and 6. 7.1 MultiTorrent In Chapter 3, we focused on a multi-torrent system, and specifically on the questions of what incentives could be provided for nodes to contribute resources as seeds in a multi-torrent environment, and what are the resulting performance consequences of such behavior, both on the nodes which are willing to be seeds and on the overall system. The contributions of this work were as follows: • We proposed a multi-torrent BT system which can be easily implemented through fairly small modifications of the current BT protocol. Thus, we believe that our approach is scalable and easily deployable. • We proposed a “cross-torrent-based” tit-for-tat (CTFT) strategy motivated by pro- viding incentives for nodes to act as seeds. We believe it is a more efficient, scalable, and easily deployable approach. 147 • We performed an extensive simulation-based study which illustrated that (a) our approach does improve the overall performance of the system and (b) our approach does provide incentives for nodes to act as seeds by providing better performance for such nodes. Our work illustrated that performance gains are possible through consideration of multiple torrents. 7.2 P2P VoD In Chapter 4, we studied several fundamental questions in the context of BT-like V oD systems and proposed practical approaches to addressing them. The contributions of that work were as follows: • We explored practical solutions to the “peer request problem”, that can be easily implemented through fairly small modifications to the current BT protocol - these approaches resulted in better QoS in the V oD system and at the same time were scalable, efficient, and easily deployable today. • We proposed the use of Deadline-Aware Scheduling (DAS) which included an earliest deadline first (EDF) scheduling approach and an early drop (EDP) based approach to address the “service scheduling problem”. We showed that DAS results in better QoS in a V oD system. To the best of our knowledge, this work is 148 the first to explore the use of earliest deadline first and early drop approaches in the context of BT-like V oD systems. • We showed that addressing the “peer request problem” or “service scheduling problem” alone is not sufficient to achieve high QoS, i.e., that an appropriate combination of good solutions to each question is needed in a P2P V oD stream- ing system to provide high QoS with low overhead. 
To support this claim, we presented an extensive evaluation study of these approaches under a variety of environments.

Our extensive simulation-based study showed that our approaches can provide significant improvements in QoS in BT-like VoD systems.

7.3 Incentives

In Chapter 5, we proposed Layered Coding Incentives (LCI) to provide incentives in the context of a BT-like VoD system. The contributions of that work were as follows:

• We studied the questions of (1) why an incentive mechanism is needed, and (2) what appropriate incentives are for a BT-like VoD system. We first illustrated why single-layer coding, which is used by most P2P VoD systems, does not lend itself to providing incentives. Motivated by this, we proposed the use of layered coding to facilitate incentives in a BT-like VoD system, and we showed that the resulting system has good performance with built-in incentives (see Section 5.1).

• We proposed an approach for nodes to adaptively download data chunks from different layers. This approach decides which layers to download based on the current download progress; it is easy to implement and makes efficient use of node capacity (see Section 5.3).

• We proposed approaches for nodes to serve peers aiming at: (1) a basic service provided to all nodes in the system; (2) differentiated service among heterogeneous nodes; and (3) efficient use of node capacity. To the best of our knowledge, this is the first work proposing to provide a basic service to all users in BT-like VoD systems. We showed that our approaches achieve these objectives and are completely decentralized (see Section 5.3).

• We explored the effects of our proposed solutions through an extensive simulation-based study, which illustrates that our approaches result in a better incentive mechanism for BT-like VoD systems than TFT (see Section 5.3).

7.4 Bistro

In Chapter 6, we studied a data assignment problem in the context of the Bistro system. The contributions of that work were as follows:

• We formulated the data assignment problem in the Bistro fault tolerance protocol as a non-linear optimization problem.

• We proposed a genetic algorithm based heuristic to approximate an optimal solution to this problem; it was more accurate than several simple heuristics used for comparison, as well as efficient compared to a brute-force approach.

Our results indicated that the proposed genetic algorithm based heuristic is efficient and provides a good approximation. We believe that the proposed approach is feasible and can result in a better fault tolerance framework.

Bibliography

[1] K. Anagnostakis and M. Greenwald. Exchange-based incentive mechanisms for peer-to-peer file sharing. In ICDCS, 2004.

[2] A. Bharambe, C. Herley, and V. Padmanabhan. Analyzing and improving bittorrent performance. In INFOCOM, 2006.

[3] S. Bhattacharjee, W. Cheng, C. Chou, L. Golubchik, and S. Khuller. Bistro: a framework for building scalable wide-area upload applications. Performance Evaluation Review, 28(2):29-35, September 2000.

[4] Blizzard. World of Warcraft (http://www.worldofwarcraft.com), 2008.

[5] CacheLogic. Peer-to-peer in 2005, 2005.

[6] R. Chandrasekharam, S. Subhramanian, and S. Chaudhury. Genetic algorithm for node partitioning problem and applications in VLSI design. IEEE Proceedings, 140(5):255-260, September 1993.

[7] W. Cheng, C. Chou, L. Golubchik, and S. Khuller. A secure and scalable wide-area upload service. In ICOMP, 2001.
[8] L. Cheung, C. Chou, L. Golubchik, and Y. Yang. A fault tolerance protocol for uploads: Design and evaluation. In ISPA, 2004.

[9] Y. R. Choe, D. L. Schuff, J. M. Dyaberi, and V. S. Pai. Improving VoD server efficiency with BitTorrent. In Multimedia, 2007.

[10] A. L. H. Chow, L. Golubchik, and V. Misra. Improving BitTorrent: A simple approach. In IPTPS, 2008.

[11] W. Chu. Optimal file allocation in a multiple computer system. IEEE Trans. on Computers, C-18(10), 1969.

[12] B. Cohen. Incentives build robustness in BitTorrent. In P2PECON, 2003.

[13] D. A. Coley. An Introduction to Genetic Algorithms for Scientists and Engineers. World Scientific Publishing Company, 1997.

[14] L. Dai, Y. Cui, and Y. Xue. Maximizing throughput in layered peer-to-peer streaming. In ICC, 2006.

[15] D. Eager, M. Vernon, and J. Zahorjan. Bandwidth skimming: a technique for cost-effective video-on-demand. In MMCN, 2000.

[16] K. P. Eswaran. Placement of records in a file and file allocation in a computer network. Information Processing, pages 304-307, 1974.

[17] Exodus. http://www.exdous.com.

[18] B. Fan, D.-M. Chiu, and J. C. S. Lui. The delicate tradeoffs in BitTorrent-like file sharing protocol design. In ICNP, 2006.

[19] M. J. Freedman, C. Aperjis, and R. Johari. Prices are right: Managing resources and incentives in peer-assisted content distribution. In IPTPS, 2008.

[20] O. Frieder and H. T. Siegelmann. Multiprocessor document allocation: A genetic algorithm approach. IEEE Trans. on Knowledge and Data Engineering, 9(4), July/August 1997.

[21] A. Gai, F. Mathieu, F. D. Montgolfier, and J. Reynier. Stratification in p2p networks: Application to BitTorrent. In ICDCS, 2007.

[22] L. Gao, D. Towsley, and J. Kurose. Efficient schemes for broadcasting popular videos. In NOSSDAV, 1998.

[23] M. Garey and D. Johnson. Computers and Intractability. W. H. Freeman, 1979.

[24] K. Graffi, S. Kaune, K. Pussep, A. Kovacevic, and R. Steinmetz. Load balancing for multimedia streaming in heterogeneous peer-to-peer systems. In NOSSDAV, 2008.

[25] L. Guo, S. Chen, Z. Xiao, E. Tan, X. Ding, and X. Zhang. Measurements, analysis and modeling of BitTorrent-like systems. In IMC, 2005.

[26] A. Habib and J. Chuang. Incentive mechanisms for peer-to-peer media streaming. In IWQoS, 2004.

[27] M. Handley, S. Floyd, J. Padhye, and J. Widmer. TCP friendly rate control (TFRC). RFC 3448, 2003.

[28] M. H. Hefeeda, B. K. Bhargava, and D. K. Y. Yau. A hybrid architecture for cost-effective on-demand media streaming. Elsevier Computer Networks, 44, 2004.

[29] J. Heidemann and V. Visweswaraiah. Automatic selection of nearby web servers. In WISP, 1998.

[30] J. H. Holland. Robust algorithms for adaptation set in a general formal framework. In Proc. IEEE Symposium on Adaptive Processes-Decision and Control, pages 5.1-5.5, 1970.

[31] J. H. Holland. Adaptation in Natural and Artificial Systems. University of Michigan Press, 1975.

[32] A. Hu. Video-on-demand broadcasting protocols: a comprehensive study. In INFOCOM, 2001.

[33] C. Huang, J. Li, and K. Ross. Can internet video-on-demand be profitable? In SIGCOMM, 2007.

[34] Y. Huang, T. Z. J. Fu, D.-M. Chiu, J. C. S. Lui, and C. Huang. Challenges, design and analysis of a large-scale p2p-vod system. In SIGCOMM, 2008.

[35] K. Hwang, V. Misra, and D. Rubenstein. Stored media streaming in BitTorrent-like p2p networks. Tech Report cucs-024-08, Columbia University, NY, 2008.

[36] Cisco Inc. Cisco visual network index, 2009.

[37] IRS. Fill-in Forms. http://www.irs.gov/formspubs/lists/0,,id=97817,00.html, 2005.
[38] Digital Island. http://www.digitalisland.com.

[39] M. Izal, G. Urvoy-Keller, E. W. Biersack, P. A. Felber, A. Al Hamra, and L. Garcés-Erice. Dissecting BitTorrent: Five months in a torrent's lifetime. In Proc. of PAM, 2004.

[40] S. Jamin, C. Jin, Y. Jin, D. Riaz, Y. Shavitt, and L. Zhang. On the placement of Internet instrumentation. In INFOCOM, March 2000.

[41] S. Jamin, C. Jin, T. Kurc, D. Riaz, and Y. Shavitt. Constrained mirror placement on the Internet. In INFOCOM, April 2001.

[42] S. Jun and M. Ahamad. Incentives in BitTorrent induce free riding. In P2PECON, 2005.

[43] L. Qiu, V. Padmanabhan, and G. Voelker. On the placement of web server replicas. In INFOCOM, April 2001.

[44] A. Legout, N. Liogkas, E. Kohler, and L. Zhang. Clustering and sharing incentives in BitTorrent systems. In SIGMETRICS, 2007.

[45] C. Liang, Z. Fu, Y. Liu, and C. W. Wu. iPASS: Incentivized peer-assisted system for asynchronous streaming. In INFOCOM Mini Conference, 2009.

[46] Z. Liu, Y. Shen, S. S. Panwar, K. W. Ross, and Y. Wang. Using layered video to provide incentives in p2p live streaming. In P2P-TV, 2007.

[47] Z. Liu, Y. Shen, K. W. Ross, S. S. Panwar, and Y. Wang. Substream trading: Towards an open p2p live streaming system. In ICNP, 2008.

[48] T. Loukopoulos and I. Ahmad. Static and adaptive data replication algorithms for fast information access in large distributed systems. In ICDCS, 2000.

[49] N. Magharei, R. Rejaie, and Y. Guo. Mesh or multiple-tree: A comparative study of live p2p streaming approaches. In INFOCOM, 2007.

[50] P. Mirchandani and R. Francis. Discrete Location Theory. John Wiley and Sons, 1990.

[51] B. Narendran, S. Rangarajan, and S. Yajnik. Data distribution algorithms for load balanced fault-tolerant web access. In SRDS, 1997.

[52] N. Parvez, C. Williamson, A. Mahanti, and N. Carlsson. Analysis of BitTorrent-like protocols for on-demand stored media streaming. In SIGMETRICS, 2008.

[53] D. A. Patterson, G. Gibson, and R. H. Katz. A case for redundant arrays of inexpensive disks (RAID). In SIGMOD, 1988.

[54] M. Piatek, T. Isdal, T. Anderson, A. Krishnamurthy, and A. Venkataramani. Do incentives build robustness in BitTorrent? In NSDI, 2007.

[55] D. Qiu and R. Srikant. Modeling and performance analysis of BitTorrent-like peer-to-peer networks. In SIGCOMM, 2004.

[56] R. Rejaie and A. Ortega. PALS: Peer-to-peer adaptive layered streaming. In NOSSDAV, 2003.

[57] M. Sirivianos, J. H. Park, R. Chen, and X. Yang. Free-riding in BitTorrent networks with the large view exploit. In IPTPS, 2007.

[58] K. Suh, C. Diot, J. Kurose, L. Massoulie, C. Neumann, D. Towsley, and M. Varvello. Push-to-peer video-on-demand system: design and evaluation. IEEE JSAC, 25(9), 2007.

[59] G. Tan and S. Jarvis. A payment-based incentive and service differentiation mechanism for peer-to-peer streaming broadcast. In IWQoS, 2005.

[60] V. Vishnumurthy, S. Chandrakumar, and E. G. Sirer. KARMA: A secure economic framework for p2p resource sharing. In P2PECON, 2003.

[61] A. Vlavianos, M. Iliofotou, and M. Faloutsos. BiToS: Enhancing BitTorrent for supporting streaming applications. In INFOCOM Workshop, 2006.

[62] M. Wien, R. Cazoulat, A. Graffunder, A. Hutter, and P. Amon. Real-time system for adaptive video streaming based on SVC. In TCSVT, 2007.

[63] X. Xiao, Y. Shi, Y. Gao, and Q. Zhang. LayerP2P: A new data scheduling approach for layered streaming in heterogeneous networks. In INFOCOM, 2009.

[64] S. Xie, B. Li, G. Y. Keung, and X. Zhang. CoolStreaming: Design, theory, and practice. IEEE Trans. on Multimedia, 9(8), December 2007.
[65] Y. Yang, A. Chow, L. Golubchik, and D. Bragg. Improving QoS in BitTorrent-like VoD systems. Technical report, http://vista.usc.edu/pub/vod-tech.pdf, 2009.

[66] Y. Yang, A. L. H. Chow, and L. Golubchik. Multi-torrent: a performance study. Technical report, http://vista.usc.edu/pub/multibt-tech.pdf, 2007.

[67] X. Zhang, J. Liu, B. Li, and T. S. P. Yum. CoolStreaming/DONet: A data-driven overlay network for efficient live media streaming. In INFOCOM, 2005.

[68] S. Zhong, J. Chen, and Y. R. Yang. Sprite: A simple, cheat-proof, credit-based system for mobile ad-hoc networks. In INFOCOM, 2003.

[69] M. Zhou and J. Liu. Tree-assisted gossiping for overlay video distribution. ACM Multimedia Tools and Applications, 29(3), 2006.

[70] Y. Zhou, D. M. Chiu, and J. C. S. Lui. A simple model for analyzing p2p streaming protocols. In ICNP, 2007.
Abstract
In recent years, many Internet-based applications have arisen to take advantage of widespread, inexpensive broadband connections. However, congestion on the Internet is still significant. Therefore, efficient management of Internet resources, leading to improvements in Quality of Service (QoS) for Internet-based applications, remains an important problem. In this dissertation, we focus on this problem in the context of several important applications.