DISTRIBUTED INDEXING AND AGGREGATION TECHNIQUES FOR PEER-TO-PEER AND GRID COMPUTING

by Min Cai

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

December 2006

Copyright 2006 Min Cai

Dedication

To my family, for their love and support.

Acknowledgements

I have been very lucky to work with many wonderful people at USC during the past five years of my graduate study. I would like to thank all of them for their help and support. First and foremost, I am deeply indebted to my advisor, Professor Kai Hwang, for his invaluable guidance during the course of this thesis development. He has significantly influenced the way I think about research. Without his support, this dissertation would not have been possible. I would like to thank Professors C.-C. Jay Kuo, Ramesh Govindan, Clifford Neuman and Christos Papadopoulos for serving on my qualifying exam and dissertation committees. Their constructive feedback and suggestions greatly improved this dissertation. I am grateful to my previous advisor, Professor Martin Frank, for guiding my entrance to the research world and for his valuable discussions on MAAN and RDFPeers. My thanks also go to Professors Ann Chervenak, Pedro Szekely and Robert Neches, who have given me much help and advice during my stay at ISI. I am grateful to Dr. Jianping Pan for his insights and suggestions on WormShield. I would like to express my thanks and my deepest love to my wife, who has been a source of great caring and support. I also would like to thank my fellow graduate students, Baoshi Yan, Shanshan Song, Li Zhou, Yu Chen, Ying Chen and Xiaosong Lou for their help and suggestions. They have made my study at USC enjoyable and fruitful. Finally, I gratefully acknowledge the financial support provided by NSF through the GridSec project under Grant No. ACI-0325409.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

1 Introduction
1.1 Motivation
1.1.1 Distributed Resource Indexing
1.1.2 Distributed Information Aggregation
1.2 Thesis Statement
1.3 Key Contributions
1.4 Dissertation Organization

2 Multi-Attribute Addressable Network
2.1 Introduction
2.2 Chord Distributed Hash Table
2.3 Range Queries Resolution
2.4 Uniform Locality Preserving Hashing
2.5 Multi-Attribute Query Resolution
2.5.1 Iterative Query Resolution
2.5.2 Single-Attribute Dominated Query Resolution
2.6 Performance Evaluation
2.7 Related Work
2.8 Summary
3 Distributed Aggregation Tree
3.1 Introduction
3.2 DAT System Model
3.2.1 Distributed Aggregation Problem
3.2.2 Basic Formulation of the DAT Model
3.3 DAT Construction Algorithms
3.3.1 Basic DAT Construction
3.3.2 Analysis of Basic DAT Properties
3.3.3 Balanced DAT Construction
3.3.4 Analysis of Balanced DAT Properties
3.4 DAT Prototype Implementation
3.4.1 Implementation Architecture
3.4.2 Aggregation Synchronization
3.5 Performance Evaluation
3.5.1 Experiment Setup
3.5.2 Measured DAT Tree Properties
3.5.3 Measured Message Overheads
3.5.4 Effects of Load Balancing
3.6 Related Work
3.7 Summary

4 Distributed Cardinality Counting
4.1 Introduction
4.2 LogLog Cardinality Counting
4.3 Adaptive Cardinality Counting
4.4 Distributed Cardinality Counting
4.5 Performance Evaluation
4.5.1 Scalability of Adaptive Counting
4.5.2 Performance of Cardinality Digesting
4.6 Summary

5 P2P Replica Location Service
5.1 Introduction
5.2 Globus Replica Location Service
5.3 Design of P2P Replica Location Service
5.3.1 Adaptive Replication
5.3.2 Load Balancing
5.3.2.1 Storage Load Balancing
5.3.2.2 Query Load Balancing
5.4 P-RLS Implementation
5.5 Performance Evaluation
5.5.1 Scalability Measurements
5.5.2 Analytical Model for Stabilization Traffic
5.5.3 Simulations for Adaptive Replication
5.6 Related Work
5.7 Summary

6 Distributed RDF Repository
6.1 Introduction
6.2 RDFPeers Architecture
6.3 Storing RDF Triples
6.4 Native Queries in RDFPeers
6.4.1 Atomic Triple Queries
6.4.2 Disjunctive and Range Queries
6.4.3 Conjunctive Multi-Predicate Queries
6.5 Resolving RDQL Queries
6.6 RDF Subscription and Notification
6.6.1 Subscribing to Atomic Queries
6.6.2 Subscribing to Disjunctive and Range Queries
6.6.3 Subscribing to Conjunctive Multi-Predicate Queries
6.6.4 Unsupported Subscription Types
6.6.5 Extension to Support Highly Skewed Subscription Patterns
6.7 Implementation and Evaluation
6.7.1 Routing Hops to Resolve Native Queries
6.7.2 Throughput of Triple Storing and Querying
6.7.3 Message Traffic of Subscription and Notification
6.7.4 Dealing with Overly Popular URIs and Literals
6.7.5 Load Balancing via Successor Probing
6.8 Example Application: Shared-HiKE
6.9 Related Work
6.10 Summary

7 Distributed Worm Signature Generation
7.1 Introduction
7.2 The WormShield System Model
7.2.1 Modeling Worm Spreading Properties
7.2.2 Distributed Worm Signature Generation
7.3 Fingerprint Sampling
7.3.1 Window Sampling
7.3.2 Comparison with COPP and Value Sampling
7.4 Distributed Fingerprint Filtering
7.4.1 Distribution of Fingerprint Repetition
7.4.2 Distribution of Address Dispersion
7.4.3 Efficiency of Fingerprint Filtering
7.4.4 Aggregation Traffic of a Single Monitor
7.5 Performance Evaluation
7.5.1 Simulation Setup
7.5.2 Spreading Patterns of Simulated Worms
7.5.3 Speed of Signature Generation
7.5.4 Effects of False Signatures
7.5.5 Deployment Scalability
7.6 Extensions and Limitations
7.6.1 Privacy-Preserving Signature Generation
7.6.2 Security in WormShield Deployment
7.6.3 Polymorphic Worms
7.6.4 Other Practical Limitations
7.7 Related Work
7.8 Summary

8 Conclusions
8.1 Summary of Contributions
8.2 Future Directions

Bibliography
Appendix A: Related Publications
Appendix B: Solution to Equation 3.1

List of Tables

2.1 An Example of Attribute Schema for Grid Resources in MAAN
3.1 Example Applications of Distributed Information Aggregation
5.1 Measured Message Sizes for the P-RLS Implementation
5.2 Stabilization Traffic in Bytes per Second Predicted by Our Analytical Model
5.3 Mean Number of Mappings per Node for a Given Network Size and Replication Factor
6.1 The Eight Possible Atomic Triple Queries for Exact Matches
6.2 The URIs and Literals Occurring More Than 1,000 Times in the RDF File (kt-structure.rdf.u8.gz)
7.1 Summary of Eight Internet Packet Traces Used in Our Experiments
7.2 Parameters of Four Simulated Worms
7.3 Comparison of Six Worm Signature Generation Systems

List of Figures

1.1 Overview of the research topics investigated in this dissertation
1.2 Relationship among the eight chapters in this dissertation
2.1 A 6-bit Chord network consisting of 8 nodes and 4 object keys
2.2 An example for the single-attribute-dominated query resolution algorithm
2.3 The number of neighbors as a logarithmic function of network size
2.4 The routing hops increase on a log scale with the network size for 5-attribute range queries at a fixed range selectivity
2.5 The routing hops increase linearly with the network size for 2-attribute range queries with 10% range selectivity
2.6 The routing hops as a linear function of the query's range selectivity (64 nodes, 1 attribute)
2.7 The expected number of routing hops as a function of the number of attributes (64 nodes, 10% range selectivity)
3.1 Example of aggregating the global value through a DAT tree of 7 nodes
3.2 Basic DAT construction using Chord finger routes to N0 in a 16-node network
3.3 Illustration of the parent fingers of nodes in different identifier spaces
3.4 Subset of fingers used in balanced routing
3.5 Parent fingers of nodes in 4 intervals using balanced routing
3.6 Building balanced DAT trees using the balanced routing scheme
3.7 DAT implementation architecture
3.8 Comparison of tree properties for different DAT schemes
3.9 Aggregation overhead of centralized, basic and balanced DAT schemes with network size varying from 16 to 8192
3.10 Comparison of load balance for centralized, basic and balanced DAT schemes
4.1 The bias of the LogLog algorithm for small cardinalities with load factor from 1 to 5
4.2 The standard error of the LogLog algorithm for small cardinalities with load factor less than 5
4.3 Distributed estimation of global cardinality through a DAT tree with 8 nodes
4.4 Comparison of linear counting, LogLog counting, multi-resolution bitmap, and adaptive counting algorithms with n scaling from 0.1K to 128M, m = 4K
4.5 Performance of digesting large sets
5.1 Example of a hierarchical RLI index configuration supported by the RLS implementation available in the Globus Toolkit Version 3
5.2 Example of the mapping placement of 3 mappings in the P-RLS network with 8 nodes
5.3 P-RLI queries for logical name popular-object traverse the predecessors of the root node Ni
5.4 The implementation architecture of the P2P replica location service
5.5 Update latency in milliseconds for performing an update operation in the P-RLS network
5.6 Query latency in milliseconds for performing a query operation in the P-RLI network
5.7 Number of RPC calls performed for a fixed number of update operations as the size of the P-RLS network increases
5.8 Rate of increase in pointers to neighbor nodes maintained by each P-RLI node as the network size increases, where the replication factor k equals two
5.9 Stabilization traffic for a P-RLS network of up to 16 nodes with stabilization intervals of 5 and 10 seconds
5.10 Comparison of measured and predicted values for stabilization traffic
5.11 The distribution of mappings among nodes in a P-RLS network of 100 nodes with 500,000 unique mappings, with the replication factor varying from 0 to 12
5.12 Cumulative distribution of mappings per node as the replication factor increases in a P-RLS network of 100 nodes with 500,000 unique mappings
5.13 Ratios of the P-RLI nodes with the greatest and smallest number of mappings compared to the average number of mappings per node
5.14 The number of queries resolved on the root node N and its predecessors becomes more evenly distributed when the number of replicas per mapping increases
6.1 The architecture of RDFPeers
6.2 An example of storing 3 RDF triples into an RDFPeers network of 8 nodes in a 4-bit identifier space
6.3 The number of routing hops to resolve atomic triple patterns Q2 through Q8
6.4 The number of routing hops to resolve disjunctive exact-match queries in a network with 1000 nodes
6.5 The number of routing hops to resolve disjunctive range queries (0.1% selectivity) in a network with 1000 nodes
6.6 Aggregated throughput of triple storing increases with the number of concurrent clients in a 12-node network
6.7 Aggregated query throughput increases with the number of concurrent clients in a 12-node network with 10,000 and 100,000 triples preloaded respectively
6.8 For a constant number of triple subscriptions and insertions, the cost of our subscription scheme in messages grows no more than logarithmically with network size (128 topics, 1024 subscriptions and 16384 triples)
6.9 For a constant network size and load, registration and notification traffic grows linearly with the subscription rate (128 topics, 64 nodes, and 8192 triples)
6.10 The frequency count distribution of URIs and literals in the ODP Kids and Teens catalog
6.11 The number of triples per node as a function of the threshold of popular triples (100 physical nodes with 6 virtual nodes per physical node)
6.12 The number of triples per node as a function of the number of successor nodes probed (100 physical nodes, popular threshold = 1000)
6.13 Shared-HiKE, a P2P knowledge editor built on top of Subscribable RDFPeers
7.1 An example of aggregating fingerprint repetition in a WormShield network of 8 nodes
7.2 The functional design of a WormShield monitor
7.3 Comparison of three different sampling methods
7.4 The Zipf-like distribution of fingerprint repetition in various Internet traces
7.5 The distribution of address dispersion in a 3D histogram
7.6 The filtering ratio of packet traces decreases with increasing local thresholds
7.7 Aggregation traffic data rates with different local thresholds
7.8 The spreading patterns of Slammer worms with uniform and subnet-preferred scanning
7.9 Comparison of WormShield and isolated monitors in the speed of signature generation for CodeRed worms with varying global thresholds
7.10 Number of hosts infected by CodeRedI-v2 with different probing rates under three monitor configurations
7.11 Number of false signatures drops sharply with increasing global thresholds
7.12 ROC curves showing the number of infected hosts against the number of false signatures
7.13 Improvement factors of three signature generation schemes using an increasing number of monitors from 200 to 3,000

Abstract

Peer-to-Peer (P2P) systems and Grids are emerging as two novel paradigms of distributed computing for wide-area resource sharing on the Internet. In these two paradigms, it is essential to discover resources by their attributes and to acquire global information in a fully decentralized fashion. This dissertation proposes a multi-attribute addressable network (MAAN) for resource indexing, a distributed aggregation tree (DAT) for information aggregation, and a distributed counting scheme for estimating global cardinalities. MAAN indexes resources by their attribute values on Chord via uniform locality preserving hashing. It resolves multi-attribute range queries with a single-attribute dominated algorithm that scales to both the network size and the number of attributes. The DAT scheme implicitly builds an aggregation tree from Chord routing paths without membership maintenance. It also improves the Chord routing algorithm to build a balanced DAT tree when nodes are evenly distributed in the identifier space. Furthermore, the distributed counting scheme digests large sets into small cardinality summaries and merges them through a DAT tree. The global cardinality is estimated by an adaptive counting algorithm that scales not only to large cardinalities but also to small ones.

Based on these techniques, this dissertation has developed three applications: a P2P replica location service (P-RLS), a distributed RDF repository (RDFPeers) and a worm signature generation system (WormShield). P-RLS extends the Globus RLS system with properties of self-organization, fault-tolerance and improved scalability. It also reduces query hotspots by using a predecessor replication scheme. RDFPeers stores, searches and subscribes to RDF triples in a MAAN network. In RDFPeers, the routing hops for triple insertion, for most query resolution and for triple subscription are logarithmic in the network size. WormShield automatically generates worm signatures by using distributed fingerprint filtering and aggregation at multiple edge networks. Due to the Zipf-like fingerprint distribution, the filtering scheme reduces the amount of aggregation traffic by several orders of magnitude. The global fingerprint statistics are computed through DAT trees in a scalable and load-balanced fashion. The experimental results demonstrate that the indexing and aggregation schemes perform well in different P2P and Grid applications.

Chapter 1
Introduction

With the pervasive use of computers and high-speed networks, Peer-to-Peer (P2P) systems [78] and Grids [12, 43] are emerging as two novel approaches to wide-area resource sharing on the Internet. In contrast to the traditional client-server architecture, the P2P paradigm employs a flat and symmetric structure to distribute computation, storage and communication among nodes for better scalability, load balance, and failure tolerance.
For example, P2P file-sharing systems, e.g. Gnutella, KazaA and eDonkey, construct an unstructured overlay network of millions of nodes to store and retrieve files in a fully decentralized fashion. Without any centralized authorities, P2P systems often adopt a bottom-up approach in which each peer only connects to a small fraction of other peers.

On the other hand, Grid technologies enable flexible, secure, and coordinated resource sharing among dynamic collections of individuals, institutions, and resources [44]. The development and deployment of Grids were mainly motivated by the requirements of scientific communities in sharing remote resources, such as computational resources, large data-sets, and expensive instruments. These resources are typically owned by different administrative organizations and are shared under locally defined policies within the scientific community. Thus, Grids typically use a top-down approach and focus more on the deployment of a persistent, standards-based service infrastructure for secure resource sharing. The nature of scientific computing requires Grids to integrate substantial resources for delivering nontrivial qualities of service.

1.1 Motivation

Distributed resource indexing and information aggregation are two essential building blocks for large-scale distributed systems in general, and P2P and Grid systems in particular. For example, in a Grid environment, users or applications need to look up the real-time status of resources and to discover the appropriate ones that are of interest to them. In addition, administrators need to continuously monitor some global system properties for capacity planning or system diagnostics.

1.1.1 Distributed Resource Indexing

Traditional approaches in Grids index resource information by using a centralized server or a set of hierarchically organized servers. For example, Globus [44] uses an LDAP-based directory service named MDS [41] for resource monitoring and discovery. However, the centralized server(s) can become a registration bottleneck in a dynamic environment where many resources join, leave, and change characteristics (e.g. CPU load) at any time. Thus, this scheme does not scale well to a large number of Grid nodes across autonomous organizations.

Also, centralized approaches have the inherent drawback of a single point of failure. Hierarchical approaches provide better scalability and failure tolerance by introducing a set of hierarchically organized servers and partitioning resource information over different servers, similar to the DNS system. Typically, the partitioning scheme is predefined and cannot adapt to the dynamic change of virtual organizations. It might also take a long time for resource information to be updated from the leaf nodes to the root.

In terms of resource indexing, P2P systems have evolved from centralized indexing to flooding-based schemes and then to Distributed Hash Tables (DHTs) [95, 118, 104]. The centralized approach, such as Napster, maintains a central index server, and each peer looks up a resource by querying the server. Similar to the centralized approach in Grids, this approach has inherent drawbacks of scalability and a single point of failure. The flooding-based approach used in Gnutella constructs an unstructured overlay network among nodes, and queries are flooded to the whole network. This approach is truly decentralized and improves failure resilience by eliminating the single point of failure.
However, query flooding introduces extra message traffic and processing overhead in the network. To reduce the flooding overhead, DHT systems construct a structured overlay network among nodes and use message routing instead of flooding to look up a resource. Each node in these systems typically maintains O(log n) pointers to its neighbors, and each insertion and lookup is guaranteed to finish in O(log n) hops for a network with n nodes. In addition to efficient insertion and lookup, DHT systems have other advantages, such as self-organization for low maintenance costs and self-healing for failure resilience. For example, the construction of the DHT overlay network is automatic, and all nodes self-organize into a given topology (e.g. ring or hypercube) without any extra configuration. When a node leaves a DHT network, another node automatically detects its departure and takes over its responsibility for indexing resources and answering queries. This self-healing behavior is very useful for providing redundancy and fault tolerance in large-scale distributed systems.

While DHTs have several desirable properties, they can only look up a resource that exactly matches a given key. They often assume their applications already know the key of the target resource. For example, file systems such as CFS use a DHT to index each file block and use the unique block identifier as the key to store and retrieve the block. However, this kind of hash table functionality is not enough for resource discovery in P2P and Grid systems. In these systems, resources have multiple attributes and should be registered with a list of attribute-value pairs. Also, resource requesters need to search for resources that meet multiple attribute requirements. Therefore, it is critical to extend DHTs to index resources by attribute values and to support multi-attribute range queries for resource discovery.

1.1.2 Distributed Information Aggregation

In addition to distributed resource indexing, large-scale distributed systems often need to acquire global information about the whole system. Examples of global system properties include the network size and total free storage in P2P file-sharing networks, and the total CPU usage in Grids. Moreover, distributed intrusion detection systems have to continuously monitor certain global measurement metrics from several sites. For example, a worm signature can be detected by counting its global fingerprint repetition and distinct IP addresses [115, 17].

Distributed information aggregation is essential to computing such global information in large-scale distributed systems. An aggregate function, e.g. min, max, count and sum, takes a set of input values and calculates a single output value that summarizes the inputs. In a distributed environment, there are two alternatives, centralized and distributed, for yielding a global value by aggregating local values from all nodes. The former collects all local values at a single node and aggregates them directly using an aggregate function. The latter recursively applies the aggregation function on subsets of local values until the global aggregate value is generated.

Aggregating local information into global information poses a major challenge to P2P and Grid systems due to their large-scale and decentralized nature. Centralized aggregation does not scale well to a large number of nodes and has the drawback of a single point of failure. For this very reason, P2P systems deliberately eliminate any centralized server that interacts with all nodes.
Besides, distributed aggregation requires coordination among nodes so that the aggregation can be done gradually in the network. An aggregation tree is necessary for aggregating local values recursively. In an aggregation tree, each node applies the same aggregate function on the local values of its children, and the root node yields the global value aggregated from all nodes. However, it is challenging to maintain a distributed tree structure in dynamic P2P systems.

Any effective scheme for distributed aggregation in large-scale distributed systems has to meet the requirements of scalability, adaptiveness, and load balancing. First, scalability has three criteria. To scale to a large number of nodes, each aggregation should only introduce a limited number of messages with respect to the network size. To scale to a large number of aggregation trees, the scheme should have low construction and maintenance overhead for each tree. To scale to a large number of data records, efficient probabilistic counting schemes should be used to reduce the aggregation cost. Second, the aggregation scheme has to adapt to the dynamics of node arrival and departure. Node insertion and deletion in the aggregation tree should have minimal impact on the aggregation process. Third, the aggregation workload should be distributed evenly among all nodes without any performance bottleneck. Load balancing is thus essential for both workload fairness and system scalability.

1.2 Thesis Statement

This dissertation investigates three distributed resource indexing and information aggregation techniques for various real-world applications, including a P2P replica location service, a distributed RDF repository, and a collaborative worm signature generation system. Fig. 1.1 illustrates the roadmap of the research topics studied in this dissertation.

[Figure 1.1: Overview of the research topics investigated in this dissertation]

For distributed resource indexing, we proposed a new structured P2P system called the multi-attribute addressable network (MAAN) that extends the Chord DHT system to support multi-attribute and range queries. In MAAN, a resource could be any entity with multiple attribute values, such as a computational or storage resource in Grids, or an RDF triple in the Semantic Web. For example, a computer resource is represented by its attribute-value pairs as follows: (<cpu-speed, 2.8GHz>, <memory-size, 1GB>, <cpu-usage, 95%>, ...). MAAN indexes each resource on the Chord successor nodes of its attribute values. By using a uniform locality preserving hash function, numerically close values for the same attribute are stored on nearby nodes. To resolve a range query, MAAN routes the query to the first node within the range. The query is then forwarded to other nodes sequentially until it reaches the last node in the range. MAAN also resolves multi-attribute range queries by using a single-attribute dominated approach that only does a single iteration around the Chord identifier space.

Additionally, we developed two techniques for distributed information aggregation: an efficient scheme for building distributed aggregation trees (DAT) and a probabilistic counting scheme for estimating the global cardinality. In DAT, all nodes aggregate towards the global information with regard to a given object key called the rendezvous key. A rendezvous key is the Chord identifier of a given aggregate index determined by DAT applications. For example, in Grid resource monitoring systems, the global resource status is indexed by different attribute names. By leveraging the topology and routing mechanisms of Chord, the DAT scheme constructs an aggregation tree implicitly from Chord routing paths without any membership maintenance. To balance the DAT trees, we introduced a balanced routing algorithm on Chord that dynamically selects the parent of a node from its finger nodes according to its distance to the root. The theoretical analysis in this dissertation proves that this algorithm builds an almost completely balanced DAT tree when nodes are evenly distributed in the Chord identifier space.
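The implicit-tree idea can be made concrete with a small sketch. Here every node's parent is taken to be its next Chord finger hop toward the rendezvous key, which is one straightforward reading of "an aggregation tree built from Chord routing paths"; the 6-bit ring, the node identifiers and the local values are assumed example data, and the balanced routing variant of Chapter 3 is not modeled.

    BITS = 6
    RING = [4, 8, 20, 24, 40, 48, 56, 60]        # assumed node identifiers

    def successor(key):
        # First node whose identifier equals or follows key on the circle.
        return min((n for n in RING if n >= key), default=min(RING))

    def parent(node, root):
        # Next Chord hop toward the root: the farthest finger that does not
        # pass the root; this is also the node's parent in the implicit tree.
        best = successor((node + 1) % 2 ** BITS)
        for j in range(BITS):
            f = successor((node + 2 ** j) % 2 ** BITS)
            step = (f - node) % 2 ** BITS
            if step <= (root - node) % 2 ** BITS and step > (best - node) % 2 ** BITS:
                best = f
        return best

    def dat_aggregate(local, rendezvous_key, combine=sum):
        # Children push partial aggregates to their parents; the successor of
        # the rendezvous key ends up holding the global value.
        root = successor(rendezvous_key)
        partial = dict(local)
        for n in sorted(RING, key=lambda x: (root - x) % 2 ** BITS, reverse=True):
            if n != root:
                p = parent(n, root)
                partial[p] = combine([partial[p], partial.pop(n)])
        return partial[root]

    free_gb = {n: 10 for n in RING}              # each node's local value
    print(dat_aggregate(free_gb, 52))            # -> 80, collected at successor(52)

Because the parent relation is derived from the routing state each node already keeps, no extra tree membership has to be maintained when the ring changes.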
For distributed cardinality counting, we designed an adaptive counting algorithm that intelligently adapts itself to the cardinality of the set it counts, using a common data structure. Instead of collecting the full set of data records, this algorithm digests them into small summaries and then estimates the important properties of the set, e.g. the number of distinct elements, from the summaries. The global cardinality can thus be computed efficiently by aggregating the cardinality summaries through a DAT tree at its root node. For example, 144 counters with 5 bits each (90 bytes in total) can estimate up to 1.5 × 2^32 distinct IP addresses with a relative error of less than 10.8%.
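The sketch below illustrates the digest-and-merge idea behind this kind of cardinality summary. It follows the generic LogLog-plus-linear-counting recipe rather than the exact estimator developed in Chapter 4; the register count, the bias constant and the switch-over threshold are assumed values chosen for illustration.

    import hashlib, math

    M = 256                      # number of registers (assumed, a power of two)
    ALPHA = 0.39701              # asymptotic LogLog bias-correction constant

    def digest(items, m=M):
        # Summarize a local multiset into m small "max rank" registers.
        regs = [0] * m
        for it in items:
            h = int(hashlib.sha1(str(it).encode()).hexdigest(), 16)
            j = h % m                          # register index
            w = h // m
            rank = 1
            while w & 1 == 0 and rank < 64:    # position of the lowest set bit
                rank += 1
                w >>= 1
            regs[j] = max(regs[j], rank)
        return regs

    def merge(a, b):
        # Register-wise max: merging digests equals digesting the union.
        return [max(x, y) for x, y in zip(a, b)]

    def estimate(regs):
        m = len(regs)
        empty = regs.count(0)
        if empty > 0.05 * m:                   # assumed switch-over threshold
            return m * math.log(m / empty)     # linear counting for small sets
        return ALPHA * m * 2 ** (sum(regs) / m)   # LogLog for large sets

Each monitor would call digest() on its local observations, interior DAT nodes would merge() the summaries of their children, and only the root would call estimate(). Because merge() is a register-wise max, elements observed at several monitors are not double counted along the tree.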
We validated and demonstrated the indexing and aggregation techniques with three applications on Grid replica location service, distributed metadata management and worm signature generation. First, we developed a P2P replica location service (P-RLS) that allows registration and discovery of data replicas in Grids. P-RLS extends the RLS implementation in Globus Toolkit 3.0 with properties of self-organization, fault-tolerance and improved scalability. P-RLS uses the Chord algorithm to self-organize P-RLS servers and exploits the Chord overlay network to replicate P-RLS mappings adaptively. The P-RLS performance measurements demonstrate that update and query latencies increase at a logarithmic rate with the size of the P-RLS network, while the overhead of maintaining the P-RLS network is reasonable. The simulation results for adaptive replication demonstrate that as the number of replicas per mapping increases, the mappings are more evenly distributed among P-RLS nodes. Furthermore, we introduced a predecessor replication scheme that reduces query hotspots of popular mappings by distributing queries among nodes.

Second, we designed a scalable P2P RDF repository (RDFPeers) that stores each RDF triple in a MAAN network by applying globally known hash functions. RDF (Resource Description Framework) is the W3C specification for modeling metadata on the Semantic Web, in P2P systems [83] and in Grids [103]. RDFPeers routes queries efficiently to the nodes storing matched triples. Users can also selectively subscribe to and be notified of new RDF triples. In RDFPeers, both the number of neighbors per node and the routing hops for triple insertion, triple subscription and most query resolution are logarithmic in the network size. By building DAT trees on top of RDFPeers, we extended the RDQL language to support aggregate queries, such as count, sum and average. The experiments with real-world RDF data demonstrate that the triple-storing load in RDFPeers differs by less than an order of magnitude across nodes.

Finally, based on the aggregation techniques, we developed a collaborative worm signature generation system called WormShield. Fast and accurate generation of worm signatures is essential to contain zero-day worms at the Internet scale. However, at the early stage of a worm outbreak, individual edge networks are often short of enough worm exploits for generating accurate signatures. In WormShield, distributed monitors in multiple edge networks collaboratively analyze the global repetition of worm substrings (i.e., fingerprints) and their address dispersion via fingerprint filtering and aggregation.

Due to the Zipf-like fingerprint distribution, distributed filtering reduces the amount of aggregation traffic significantly. The global fingerprint statistics are computed using DAT trees in a scalable and load-balanced fashion. WormShield monitors also estimate the global IP address dispersion of worm signatures with the adaptive counting scheme. The simulation results show that 256 collaborative monitors can generate the signature of CodeRedI-v2 about 135 times faster on average than the same number of isolated monitors. At the same time, each monitor only generates about 0.6 KB/s of aggregation traffic, which is 0.003% of the 18 MB/s of link traffic sniffed.

1.3 Key Contributions

The dissertation extends previous methods for distributed indexing and aggregation in P2P and Grid systems. Summarized below are six major contributions:

(1) A multi-attribute addressable network for indexing resources and resolving multi-attribute range queries. For a network of n nodes, the single-attribute dominated algorithm resolves a multi-attribute range query in O(log n + n·s_min) hops, where s_min is the minimum range selectivity over all attributes.

(2) A distributed aggregation tree scheme for global information aggregation. By using the balanced Chord routing algorithm, the DAT trees are proved to be balanced if nodes are evenly distributed in the Chord identifier space.

(3) An adaptive counting scheme for estimating the global cardinality of large data sets with moderate processing and communication costs. This scheme scales not only to large cardinalities but also to small ones.

(4) A P2P replica location service with properties of self-organization, fault-tolerance and improved scalability. Its predecessor replication scheme significantly reduces query hotspots of popular mappings.

(5) A scalable P2P RDF repository for distributed metadata management. It avoids flooding queries to the network and only uses logarithmic routing hops for triple insertion and most query and subscription operations.

(6) A collaborative worm signature generation system that aggregates the global fingerprint repetition and address dispersion from multiple edge networks. It not only generates a worm signature faster than the same number of isolated monitors, but also yields a very low false positive rate.

1.4 Dissertation Organization

This dissertation consists of eight chapters. Fig. 1.2 illustrates the organization and relationship among these chapters. Chapter 1 introduces the problem of distributed indexing and aggregation, briefly describes our solution approaches, and provides an overview of the dissertation. Chapter 2 presents the algorithms, implementation and performance evaluation of the MAAN system for indexing Grid resources by their attributes. Chapter 3 models the problem of global information aggregation on a structured P2P network and presents the algorithm design and theoretical analysis of a balanced DAT scheme. We also describe the DAT prototype implementation and measure its performance on tree properties, message overheads and load balance.
In Chapter 4, we tackle a special information aggregation problem in the context of global cardinality counting. An adaptive counting algorithm is proposed to estimate the global cardinality by digesting large data sets into small cardinality summaries.

[Figure 1.2: Relationship among the eight chapters in this dissertation]

Based on the indexing and aggregation techniques discussed in Chapters 2, 3 and 4, we investigate three real-world applications, on Grid replica location service, distributed RDF repository, and collaborative worm signature generation, in Chapters 5, 6 and 7 respectively. Chapter 5 presents the design, implementation and evaluation of a P2P replica location service. Chapter 6 illustrates the architecture of RDFPeers and its detailed mechanisms for RDF storing, query resolution and subscription. In Chapter 7, we present the WormShield system based on the DAT and adaptive counting schemes, and evaluate the performance of WormShield in large-scale worm simulations. Finally, in Chapter 8 we conclude the dissertation and suggest future research work.

Chapter 2
Multi-Attribute Addressable Network

2.1 Introduction

P2P and Grid computing on a large scale requires scalable and efficient resource indexing and discovery. Most existing systems maintain a centralized server [7, 39] or a set of hierarchically organized servers [41] to index resource information. The centralized server might become both a bottleneck and a single point of failure in a planet-scale environment. Zhang et al. [139] show that the GIIS in MDS2 [29] and the Manager in Hawkeye [93] can only manage up to 100 GRIS or Agent servers. At this scale, both GIIS and GRIS need to enable data caching with large time-to-live (TTL) values, which is not suitable for real-time status such as CPU load. In addition, the partitioning scheme in hierarchical systems is often predefined and cannot adapt to the dynamic change of Grid environments. For example, if an upper-level GIIS fails, the lower-level GRIS needs to be manually redirected.

To overcome the above shortcomings of centralized approaches, Adriana Iamnitchi et al. [53] proposed a P2P approach that organizes the MDS directories in a flat, dynamic P2P network. Every virtual organization in the Grid dedicates a certain amount of its resources as peers that host information services, and those peers constitute a P2P network among organizations. Resource requesters search for desired resources through query forwarding, similar to unstructured P2P systems. However, this approach does not scale well because of the large volume of flooding messages [100, 111]. To avoid flooding the whole network, the number of hops on the forwarding path is typically bounded by the TTL field. Thus, the search results are not deterministic, and this approach cannot guarantee to find the desired resource even if it exists.

In contrast, recent structured P2P systems use message routing instead of flooding by leveraging a structured overlay network among peers. These systems typically support distributed hash table (DHT) functionality, and the basic operation they offer is lookup(key), which returns the identity of the node storing the object with that key [96]. However, this kind of hash table functionality is not enough for indexing Grid resources, because they typically have multiple attributes and thus need to be registered with a list of attribute-value pairs.
For example, a resource provider would want to register its multiple attributes like this:

    register name=pioneer & url=gram://pioneer.isi.edu:8000 & os-type=linux & cpu-speed=1000MHz & memory-size=512M

Consequently, resource requesters want to be able to search for resources that meet multiple attribute requirements (as demonstrated by, e.g., the Resource Specification Language (RSL) [28] in Globus), using a query like:

    search os-type=linux & 800MHz <= cpu-speed <= 1000MHz & memory-size >= 512MB

The attributes in the above example have two different types: string and numerical. The attributes name, url and os-type are string-based and only have a limited number of values, while the attributes cpu-speed and memory-size have continuous numerical values. For numerical attributes, being able to query with attribute ranges instead of exact values is a critical requirement. However, current DHT systems can handle neither multi-attribute queries nor range queries.

In this chapter, we propose a new structured P2P system called the Multi-Attribute Addressable Network (MAAN) for distributed resource indexing. In MAAN, resources are registered with a set of attribute-value pairs and can be searched via multi-attribute range queries.

2.2 Chord Distributed Hash Table

In this section, we briefly describe the Chord system proposed by Ion Stoica et al. [118]. Like all other DHT systems, Chord supports scalable registration and lookup of <key, object> pairs. Let ID(x) be the unique identifier of node x in a b-bit identifier space, where ID(x) ∈ [0, 2^b). In Chord, the identifier space is structured as a circle of size 2^b, and the distance between two identifiers i_1 and i_2 is DIST(i_1, i_2) = (i_1 + 2^b − i_2) mod 2^b. Similar to [118], we use the term node to refer to both the node and its identifier. The node identifier can be chosen locally by hashing the node's IP address and port number with SHA1.

Each object in Chord is also assigned a unique b-bit identifier, called the object key. Chord assigns objects to nodes using a consistent hashing scheme. For an object stored in Chord, let k be its key in the same identifier space as the nodes, i.e. k ∈ [0, 2^b). Key k is assigned to the first node whose identifier is equal to or follows k in the circular space. This node is called the successor node of key k, denoted by successor(k). Each object is registered on the successor node of its object key. Fig. 2.1 shows an 8-node Chord network with a 6-bit identifier space. Node N20 has the identifier 20 and stores the objects with keys 10 and 15.

[Figure 2.1: A 6-bit Chord network consisting of 8 nodes (N4, N8, N20, N24, N40, N48, N56, N60) and 4 object keys (K10, K15, K32, K52). N4's finger table: N4+1, N4+2, N4+4 => N8; N4+8, N4+16 => N20; N4+32 => N40. A lookup(K52) issued by N4 is forwarded via N40 and N48 to N56.]

All Chord nodes organize themselves into a ring topology according to their identifiers in the circular space. For node x, let PRED(x) denote its immediate predecessor and SUCC(x) denote its immediate successor. Each Chord node also maintains a set of finger nodes that are spaced exponentially in the identifier space. The j-th finger of node x, denoted by FINGER(x, j), is the first node that succeeds x by at least 2^(j−1) in the identifier space, where 0 ≤ j < b. Therefore, the finger table contains more nearby nodes than faraway nodes, at a doubling distance. Thus each node only needs to maintain state for O(log n) neighbors in a Chord network with n nodes.
For example, the fingers of N4 in Fig. 2.1 are N8, N20 and N40, respectively.

Chord uses the finger routing scheme to forward a lookup message. When a node x wants to look up a key k that is far away from itself, it forwards the lookup message to the finger node whose identifier most immediately precedes the successor node of k. By repeating this process, the message gets closer and closer to, and eventually reaches, the successor node. For example, if N4 in Fig. 2.1 issues a lookup request for K52, it sends the request to its finger N40, which is the closest one to K52 in the identifier space. N40 then forwards the request to N48, which forwards it to N56. Since N56 is the successor node of K52, it looks up the object corresponding to K52 locally and returns the result to N4. Because the fingers in a node's finger table are spaced exponentially around the identifier space, each hop from node x to the next node covers at least half the identifier space (clockwise) between x and k. So the average number of hops for a lookup is O(log n), where n is the number of nodes in the network.

Like many other DHT systems, Chord offers an efficient and scalable single-key registration and lookup service for decentralized resources. However, it does not support range queries and multi-attribute lookup. Our MAAN approach addresses this problem by extending Chord with locality preserving hashing and a multi-dimensional query resolution mechanism.
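As a concrete illustration of the finger routing just described, the following sketch builds a finger table for each node of the ring in Fig. 2.1 and traces the hops a lookup takes. It is a toy, single-process model under assumed parameters (6-bit identifiers, the node set of the figure), not MAAN's or Chord's actual implementation.

    BITS = 6
    RING = [4, 8, 20, 24, 40, 48, 56, 60]          # node identifiers from Fig. 2.1

    def successor(key):
        # First node whose identifier is equal to or follows key on the circle.
        return min((n for n in RING if n >= key), default=RING[0])

    def finger_table(x):
        # Finger offsets 1, 2, 4, ..., 2^(BITS-1), as in Fig. 2.1 (N4+1, N4+2, ...).
        return [successor((x + 2 ** j) % 2 ** BITS) for j in range(BITS)]

    def lookup(start, key):
        # Forward to the finger that most immediately precedes successor(key);
        # fall back to the immediate successor for the final step.
        target, node, hops = successor(key), start, [start]
        while node != target:
            dist = (target - node) % 2 ** BITS
            nxt = successor((node + 1) % 2 ** BITS)
            for f in finger_table(node):
                step = (f - node) % 2 ** BITS
                if 0 < step < dist and step > (nxt - node) % 2 ** BITS:
                    nxt = f
            node = nxt
            hops.append(node)
        return hops

    print(finger_table(4))     # -> [8, 8, 8, 20, 20, 40], as in Fig. 2.1
    print(lookup(4, 52))       # -> [4, 40, 48, 56], matching the K52 example above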
2.3 Range Queries Resolution

Chord assigns each node and key a b-bit identifier using a base hash function such as SHA1, and uses consistent hashing to map keys to nodes. This approach balances the keys among nodes, because SHA1 generates randomly distributed identifiers regardless of the distribution of the actual node addresses and keys. However, SHA1 hashing destroys the locality of keys, which is essential to support range queries over numerical attribute values. MAAN uses SHA1 hashing to assign a b-bit identifier to each node and to each attribute value of string type. For attributes with numerical values, however, MAAN uses locality preserving hash functions to assign each attribute value an identifier in the b-bit space.

Definition 1. A hash function H is a locality preserving hash function if it has the following property: H(v_i) < H(v_j) iff v_i < v_j, and if an interval [v_i, v_j] is split into [v_i, v_k] and [v_k, v_j], the corresponding interval [H(v_i), H(v_j)] must be split into [H(v_i), H(v_k)] and [H(v_k), H(v_j)].

Suppose we have an attribute a with numerical values in the range [v_min, v_max]. A simple locality preserving hash function could be H(v) = (v − v_min) × (2^b − 1)/(v_max − v_min), where v ∈ [v_min, v_max]. Each attribute value v then has a corresponding identifier H(v) in the [0, 2^b − 1] identifier space. MAAN also uses the same consistent hashing as Chord and assigns attribute value v to the successor node of its identifier, i.e. successor(H(v)).

Theorem 1. If we use a locality preserving hash function H to map attribute value v to the b-bit circular space [0, 2^b − 1], then for a range query [l, u], where l and u are the lower bound and upper bound respectively, nodes that contain an attribute value v in [l, u] must have an identifier equal to or larger than successor(H(l)) and equal to or less than successor(H(u)).

Proof. Attribute value v is assigned to successor(H(v)), and successor(H(v)) is the first node whose identifier is equal to or follows the identifier H(v) in the identifier circle. Since l ≤ v ≤ u, it follows from Definition 1 that attribute value v can only be assigned to a node x with successor(H(l)) ≤ x ≤ successor(H(u)).

Thus we can use the following algorithm to resolve range queries on numerical attribute values. Suppose node x wants to search for resources with attribute value v between l and u for attribute a, i.e. l ≤ v ≤ u, where l and u are the lower bound and upper bound respectively. Node x composes a search request and uses the Chord routing algorithm to route it to node x_l, the successor of H(l). The request is represented as SEARCH_REQUEST(k, R, S), where k is the key used for Chord routing, initially k = H(l); R is the desired attribute value range [l, u]; and S is a list of resources discovered in the range, initially empty.

When node x_l receives the request, it searches its local resource entries and appends those resources that satisfy the range query to S in the request. It then checks whether it is the successor of H(u) as well. If so, it sends the search response back to node x with the search result in S. Otherwise, it forwards the request to its immediate successor x_i. Node x_i also searches its local resource entries, appends matched resources to S, and forwards the request to its immediate successor, until the request reaches node x_u, the successor of H(u). According to Theorem 1, the resources that have attribute values in the range [l, u] must be registered on the nodes between x_l and x_u (clockwise) in the Chord ring, so the above search algorithm is complete.

Routing the search request to node x_l using the Chord routing algorithm takes O(log n) hops for n nodes. The subsequent sequential forwarding from x_l to x_u takes O(p) hops, where p is the number of nodes between x_l and x_u. In total, O(log n + p) routing hops are needed to resolve a range query on a single attribute. Since there are p nodes that might contain resources matching the range query, we have to visit all of them to guarantee a correct search result. In this sense, O(log n + p) routing hops is optimal for range queries on Chord.
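The following sketch walks through this single-attribute range search on a toy ring: a linear locality preserving hash maps values into the identifier space, the query starts at successor(H(l)), and it is handed from immediate successor to immediate successor until successor(H(u)) is reached. The node layout, attribute bounds and stored entries are assumed example data, and Chord routing to the first node is abstracted away.

    B = 16                                   # identifier bits (assumed)
    V_MIN, V_MAX = 0.0, 5.0                  # cpu-speed range in GHz (assumed)

    def h(v):
        # Simple locality preserving hash: the order of values is preserved.
        return int((v - V_MIN) * (2 ** B - 1) / (V_MAX - V_MIN))

    def successor(ring, key):
        return min((n for n in ring if n >= key), default=min(ring))

    def range_search(ring, index, low, high):
        # index: node id -> list of (value, resource) pairs stored at that node.
        matches, node, last = [], successor(ring, h(low)), successor(ring, h(high))
        while True:
            matches += [r for v, r in index.get(node, []) if low <= v <= high]
            if node == last:                  # reached successor(H(u)): done
                return matches
            node = successor(ring, node + 1)  # hand over to the immediate successor

    # Example: three nodes, five registered cpu-speed values.
    ring = [100, 30000, 60000]
    index = {}
    for value, res in [(0.8, "B1"), (2.0, "C1"), (2.4, "C2"), (4.2, "F1"), (4.8, "B2")]:
        index.setdefault(successor(ring, h(value)), []).append((value, res))
    print(range_search(ring, index, 2.0, 4.5))   # -> ['C1', 'C2', 'F1']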
2.4 Uniform Locality Preserving Hashing

Though our simple locality preserving hash function keeps the locality of attribute values, it does not produce a uniform distribution of hash values if the distribution of attribute values is not uniform. Consequently, the load balance of resource entries across the nodes can be poor. To address this problem, we propose a uniform locality preserving hash function that always produces a uniform distribution of hash values if the distribution function of the input attribute values is continuous, monotonically increasing, and known in advance. This condition is satisfied for many common distributions, such as the Gaussian, Pareto, and exponential distributions. Suppose attribute value v conforms to a distribution with continuous and monotonically increasing distribution function D(v) and probability density function P(v) = dD(v)/dv, where v ∈ [v_min, v_max]. We can design a uniform locality preserving hash function H(v) as follows: H(v) = D(v) × (2^b − 1).

Theorem 2. The hash function H(v) is a locality preserving hash function.

Proof. Since D(v) is monotonically increasing, H(v) is monotonically increasing too. Hence H(v) is a locality preserving hash function according to Definition 1.

Theorem 3. Suppose attribute value v ∈ [v_min, v_max] and v has distribution function D(v). Let the hash value be y = H(v); then y conforms to a uniform distribution in the range [H(v_min), H(v_max)].

Proof. The probability distribution of y, denoted P(y)dy, is determined by the fundamental transformation law of probabilities, which is |P(y)dy| = |P(v)dv|, or P(y) = P(v) × (dv/dy). Since y = H(v) = D(v) × (2^b − 1), we have dy/dv = (dD(v)/dv) × (2^b − 1), i.e. dy/dv = P(v) × (2^b − 1). Therefore,

    P(y) = 1/(2^b − 1).    (2.1)

Since attribute value v ∈ [v_min, v_max] and its probability density P(v) is normalized by definition, we have ∫_{v_min}^{v_max} P(v) dv = 1, i.e. D(v_max) − D(v_min) = 1. Also, since ∫_{−∞}^{v_min} P(v) dv = 0, we have D(v_min) = 0 and D(v_max) = 1. Therefore, H(v_min) = D(v_min) × (2^b − 1) = 0 and H(v_max) = D(v_max) × (2^b − 1) = 2^b − 1, so that

    ∫_{H(v_min)}^{H(v_max)} P(y) dy = ∫_{0}^{2^b − 1} 1/(2^b − 1) dy = 1.    (2.2)

From Eq. (2.1) and Eq. (2.2), we can see that the hash value y conforms to a uniform distribution in the range [H(v_min), H(v_max)].

Thus, with this uniform locality preserving hash function, resources are uniformly distributed over all nodes if the nodes uniformly cover the b-bit identifier space. We know that the latter holds when each node hosts O(log n) virtual nodes with independent identifiers [118].
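As an empirical illustration of Theorem 3, the sketch below builds H(v) = D(v) × (2^b − 1) from an exponential value distribution (one of the distributions named above) and checks that the resulting identifiers spread almost uniformly over the identifier space while preserving value order. The rate parameter, bit width and sample size are assumptions for the demo, not values used in the dissertation's experiments.

    import math, random

    B = 16
    LAMBDA = 1.0                                   # rate of the assumed exponential distribution

    def D(v):
        # Distribution function (CDF) of the assumed attribute value distribution.
        return 1.0 - math.exp(-LAMBDA * v)

    def uniform_lph(v):
        # Uniform locality preserving hash: H(v) = D(v) * (2^b - 1).
        return int(D(v) * (2 ** B - 1))

    random.seed(1)
    values = [random.expovariate(LAMBDA) for _ in range(100000)]
    ids = [uniform_lph(v) for v in values]

    # Locality: larger values always map to equal-or-larger identifiers.
    assert all(uniform_lph(a) <= uniform_lph(b)
               for a, b in zip(sorted(values), sorted(values)[1:]))

    # Near-uniform load: count identifiers falling into 8 equal slices of the space.
    slices = [0] * 8
    for i in ids:
        slices[i * 8 // 2 ** B] += 1
    print(slices)   # each slice holds roughly 1/8 of the identifiers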
2.5 Multi-Attribute Query Resolution
In addition to single-attribute queries, MAAN extends the above range-query routing algorithm to support multi-attribute queries. In this multi-attribute setting, we assume each resource has m attributes a_1, a_2, ..., a_m and corresponding attribute-value pairs <a_i, v_i>, where 1 ≤ i ≤ m. For each attribute a_i, its value v_i lies in the range [v_imin, v_imax] and conforms to a distribution with distribution function D_i(v). Thus, we can apply a uniform locality preserving hashing function H_i(v) = D_i(v) × (2^b − 1) for each attribute a_i. With these hashing functions we map all attribute values to the same b-bit identifier space in Chord.
A resource registers its information at node x_i = successor(H(v_i)) for each attribute value v_i, where 1 ≤ i ≤ m. The registration request for attribute value v_i is routed to its successor node using the Chord routing algorithm with key identifier H(v_i). Each node categorizes its index of <attribute-value, resource> pairs by attribute. When a node receives a registration request from resource r with attribute value a_i = v_ir, it adds the <v_ir, r> pair to the corresponding list for attribute a_i.
When a node searches for resources of interest, it composes a multi-attribute range query that is the conjunction of sub-queries on each attribute dimension, i.e. v_il ≤ a_i ≤ v_iu for 1 ≤ i ≤ m, where v_il and v_iu are the lower and upper bounds of the query range respectively. MAAN supports two approaches to resolving multi-attribute range queries: iterative and single-attribute dominated query resolution.
2.5.1 Iterative Query Resolution
The iterative query resolution scheme is straightforward. If node x wants to search for resources with a query of m sub-queries on different attributes, it iteratively resolves each sub-query on one attribute dimension and intersects the search results at the query originator. We reuse the search algorithm proposed for single-attribute lookup in Sec. 2.3. The only modification is to carry an attribute field in each search request to indicate which attribute we are interested in. The request is formatted as SEARCH_REQUEST(k, a, R, S), where a is the name of the attribute of interest and k, R and S are the same as in a single-attribute query. When a node within the query range receives the request, it only searches the index matching the attribute name in the request.
Though this approach is simple and easy to implement, it is not very efficient. For m-attribute queries, it takes O(Σ_{i=1}^{m} (log n + p_i)) routing hops to resolve the query, where p_i is the number of nodes that intersect the query range on attribute a_i. We define the selectivity s_i as the ratio of the query range width in the identifier space to the size of the whole identifier space, i.e. s_i = (H(v_iu) − H(v_il)) / 2^b. Suppose attribute values are uniformly distributed over all n nodes; then p_i = s_i × n and the number of routing hops is O(Σ_{i=1}^{m} (log n + n × s_i)). Thus, the routing hops for searching increase linearly with the number of attributes in the query.
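The following Java sketch illustrates the iterative resolution just described. It assumes a hypothetical rangeQuery helper that wraps the single-attribute range search of Sec. 2.3; none of these names come from the actual MAAN prototype.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch of iterative multi-attribute resolution (hypothetical helper, not the MAAN API).
public class IterativeResolver {
    /** Assumed to wrap the single-attribute range search of Sec. 2.3. */
    public interface RangeSearch {
        Set<String> rangeQuery(String attribute, double lower, double upper);
    }

    /** One sub-query v_il <= a_i <= v_iu on a single attribute dimension. */
    public record SubQuery(String attribute, double lower, double upper) {}

    /** Resolves each sub-query separately and intersects the result sets at the query originator. */
    public static Set<String> resolve(RangeSearch search, List<SubQuery> subQueries) {
        Set<String> result = null;
        for (SubQuery q : subQueries) {
            Set<String> matches = search.rangeQuery(q.attribute(), q.lower(), q.upper());
            if (result == null) {
                result = new HashSet<>(matches);   // the first sub-query seeds the candidate set
            } else {
                result.retainAll(matches);         // every further sub-query shrinks it by intersection
            }
            if (result.isEmpty()) {
                break;                             // no resource can satisfy all sub-queries
            }
        }
        return result == null ? Set.of() : result;
    }
}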
2.5.2 Single-Attribute Dominated Query Resolution
Obviously, the result of a multi-attribute query must satisfy all the sub-queries on each attribute dimension; it is the intersection of the resources that satisfy each individual sub-query. Suppose S is the set of resources satisfying all sub-queries, and S_i is the set of resources satisfying the sub-query on attribute a_i, where 1 ≤ i ≤ m. We have S = ∩ S_i, and each S_i is a superset of S. The iterative query resolution approach computes all the S_i in m iterations and then calculates their intersection. However, since we register the resource information on each attribute dimension, the resources in S_i also carry the information of their other attribute-value pairs.
The single-attribute dominated approach utilizes this extra information and only needs to compute a set of candidate resources S_d that satisfies the sub-query on a single attribute a_d. It applies the sub-queries on the other attributes to these candidate resources and thereby computes the set S that satisfies all sub-queries. We call attribute a_d the dominated attribute. There are two possible ways to apply the remaining sub-queries. One is to apply them at the query originator after it receives all candidate resources in S_d. Since S_d is typically much larger than S, search requests and responses might then contain many candidate resources that do not satisfy the other sub-queries, which introduces unnecessarily large search messages and increases communication overhead. The other is to carry these sub-queries in the search request and apply them locally at the nodes that hold candidate resources in S_d. This approach is more efficient because search requests and responses only carry the resources satisfying all sub-queries.
The search request in the latter approach is SEARCH_REQUEST(k, a, R, O, S), where k, a and R are the same as in the iterative approach, O is a list of sub-queries on all attributes other than a, and S is a list of discovered resources satisfying all sub-queries. When node x issues a search request with R = [l, u], it first routes the request to node x_l = successor(H(l)). Node x_l searches its local index for attribute a for the resources whose attribute value lies in [l, u] and whose other attributes satisfy the sub-queries in O, and appends them to S. It then checks whether it is also the successor of H(u). If so, it sends a search response back to node x with the resources in S. Otherwise, it forwards the request to its immediate successor x_i, which repeats this process until the request reaches node x_u = successor(H(u)).
Since this approach only needs one iteration for the dominated attribute a_d, it takes O(log n + n × s_d) routing hops to resolve the query. We can further minimize the routing hops by choosing the attribute with the minimum selectivity as the dominated attribute; the routing hops are then O(log n + n × s_min), where s_min is the minimum selectivity over all attributes.
Figure 2.2 shows an example of the single-attribute dominated algorithm in an 8-node MAAN network storing 11 resources. This MAAN network has the identifier space [0, 64). Each resource has two attributes, cpu-speed and memory-size; the attribute ranges and the corresponding locality preserving hash functions are shown in the attribute settings table of the figure. Each node has one or more resources. For example, node B has two resources: B_1 with a 0.8 GHz CPU and 128 MB of memory, and B_2 with a 4.8 GHz CPU and 256 MB of memory. Each resource is registered under both cpu-speed and memory-size. For instance, resource B_1 is registered at node C, the successor node for its cpu-speed, and also at node B for its memory-size.
Figure 2.2: An example for the single-attribute-dominated query resolution algorithm
When node A looks for resources with cpu-speed in the range (4.0 GHz, 5.0 GHz) and memory-size in the range [768 MB, 1024 MB], it first applies the locality preserving hashing to each sub-query and computes the sub-query ranges in the Chord identifier space. It chooses the attribute with the minimum selectivity as the dominated attribute, which is cpu-speed in this example. Node A then composes a search request with the hash value of the lower bound as the key and routes it to the corresponding successor node G using Chord's routing algorithm. The initial search request (1) in this example is SEARCH_REQUEST(50.4, cpu-speed ∈ (4.0 GHz, 5.0 GHz), memory-size ∈ [768 MB, 1024 MB], EMPTY).
When node G receives search request (1), it finds the matched resource F_1 for both sub-queries and appends it to the set S. Since node G is not the successor node of the upper bound, it forwards the request to its immediate successor, node H. Search request (2) is SEARCH_REQUEST(57, cpu-speed ∈ (4.0 GHz, 5.0 GHz), memory-size ∈ [768 MB, 1024 MB], {F_1}). Since no resource is registered at node H, it simply forwards the request to node A, and search request (3) is the same as search request (2) except that k is set to 61. Node A has no matched resource for the memory-size sub-query and it is already the successor node of the upper bound, so it returns the resource F_1 in set S as the search result to the query originator, which happens to be itself.
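The following Java sketch illustrates the two key steps of the single-attribute-dominated approach: choosing the dominated attribute by minimum selectivity, and filtering local index entries against all sub-queries at each visited node. The types and helper names are hypothetical and are not taken from the MAAN prototype.

import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.RoundingMode;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the single-attribute-dominated scheme (hypothetical types, not the MAAN API).
public class DominatedResolver {
    public record SubQuery(String attribute, double lower, double upper) {}

    /** Assumed wrapper around the per-attribute uniform locality preserving hash H_i(v). */
    public interface AttributeHash {
        BigInteger hash(String attribute, double value);
    }

    /** Selectivity s_i = (H(v_iu) - H(v_il)) / 2^b; the dominated attribute minimizes s_i. */
    public static SubQuery chooseDominated(List<SubQuery> subQueries, AttributeHash h, int bits) {
        BigDecimal space = new BigDecimal(BigInteger.ONE.shiftLeft(bits));
        SubQuery dominated = null;
        double smallest = Double.MAX_VALUE;
        for (SubQuery q : subQueries) {
            BigInteger width = h.hash(q.attribute(), q.upper()).subtract(h.hash(q.attribute(), q.lower()));
            double selectivity = new BigDecimal(width).divide(space, 10, RoundingMode.HALF_UP).doubleValue();
            if (selectivity < smallest) {
                smallest = selectivity;
                dominated = q;
            }
        }
        return dominated;
    }

    /** Local filtering at a visited node: a candidate joins S only if it satisfies every sub-query. */
    public static boolean satisfiesAll(Map<String, Double> resourceAttributes, List<SubQuery> subQueries) {
        for (SubQuery q : subQueries) {
            Double v = resourceAttributes.get(q.attribute());
            if (v == null || v < q.lower() || v > q.upper()) {
                return false;
            }
        }
        return true;
    }
}

The request would then be routed to successor(H(l_d)) of the dominated sub-query and walked along the successor ring exactly as in Sec. 2.3, applying a check like satisfiesAll at each visited node.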
In this approach, the number of routing hops is independent of the number of attributes in the query, and thus scales well as the number of attributes grows. On the other hand, it incurs the memory cost of registering all attributes of a resource whenever any of its attributes is registered, and it incurs more update overhead when attribute values change. However, the good query performance of the single-attribute dominated approach typically outweighs the greater update cost in a Grid environment, since node registration operations (e.g. for os-type, cpu-speed, memory-size, cpu-count, etc.) are typically far less frequent than the query operations used to find suitable machines.
2.6 Performance Evaluation
We verified our theoretical MAAN results by measuring the performance of a prototype implementation in Java. It can easily be configured to support different attribute schemas, such as the example schema for grid nodes shown in Table 2.1. Our implementation runs each distributed node in its own Java virtual machine as a separate process. The implementation uses sockets to communicate between the peers and supports the register and search commands described in Sec. 2.1. New nodes can join the network by contacting any existing peer at its IP address and port number.
Table 2.1: An Example of Attribute Schema for Grid Resources in MAAN
Attribute Name | Type | Min | Max | Unit
Name | String | / | / | /
URL | String | / | / | /
OS-Type | String | / | / | /
CPU-Speed | Numerical | 1 | 10^5 | MHz
Memory-Size | Numerical | 1 | 10^6 | MBytes
Disk-Size | Numerical | 1 | 10^6 | GBytes
Bandwidth | Numerical | 10^3 | 10^4 | MBps
CPU-Count | Numerical | 1 | 10^4 | CPU
To collect performance data from the distributed nodes, we implemented a status message that is flooded to all nodes. It is used for experimental measurement purposes only; the message causes every node to dump its neighborhood state to a log file. We also instrumented MAAN messages with additional fields, such as the number of hops taken. Our experiment environment consists of 2 dual-Xeon workstations with 1 GB of memory, 4 P4 desktops with 1 GB of memory and 8 dual-PIII workstations with 512 MB of memory. The operating systems installed on these machines include Redhat 9.0, FreeBSD 4.9 and Windows XP Professional. To set up a large MAAN network with 512 nodes, we ran up to 64 nodes each on the 2 dual-Xeon workstations and up to 32 nodes each on the other machines. Since we use routing hops as the performance metric in this experiment, hosting multiple nodes on each machine does not affect the correctness of the results.
The number of neighbors per node in Chord increases logarithmically with the network size. MAAN uses the Chord algorithm to maintain the overlay network among nodes, and thus has the same neighborhood-state property as Chord. To validate our Java implementation of MAAN, we measured the number of neighbors per node against the network size. In this experiment, we set the successor replication factor to 4, i.e. each node maintains 4 successors instead of only its immediate successor. These redundant successors are used to recover the ring topology when nodes fail and also to replicate resources. The result shown in Fig. 2.3 confirms that, similar to Chord, the neighborhood state in MAAN scales well to a large number of nodes. Note that the x-axis is in log scale.
Figure 2.3: The number of neighbors as a logarithmic function of network size
Another important performance metric is the number of routing hops a search request takes to resolve a query. From Sec. 2.5, we know that the number of routing hops is O(log n + n × s_min), where n is the total number of nodes in the network and s_min is the minimum range selectivity over all attributes. Therefore, if we search for resources with at least one exact-matching sub-query, i.e. s_min = ε%, the number of routing hops is O(log n), which is logarithmic in the network size. Fig. 2.4 shows our measurement results for 5-attribute queries with ε% range selectivity on a network with up to 512 nodes. The measured average routing hops roughly match our theoretical analysis, shown as the dotted line (log_2(n)/2) in Fig. 2.4.
Figure 2.4: The routing hops increase on a log scale with the network size for 5-attribute range queries with ε% range selectivity
However, for normal range queries with selectivity s_i > ε%, the number of routing hops increases linearly with the network size. This is because a fraction s_i of the total n nodes have to be visited by the search query if we want to balance the load over all nodes. Fig. 2.5 shows this linear relationship between the number of routing hops and the number of nodes for 2-attribute range queries with 10% range selectivity. For the same reason, the number of routing hops also increases linearly with the range selectivity of search queries in a 64-node network, as shown in Fig. 2.6. Theoretically, the average number of routing hops for range queries is log_2(n)/2 + n × s_min. Our measurement results match the analysis quite well, as shown by the dotted lines in Fig. 2.5 and Fig. 2.6. We can also see that range queries with large range selectivity are very costly; they essentially flood the whole network.
Figure 2.5: The routing hops increase linearly with the network size for 2-attribute range queries with 10% range selectivity
Figure 2.6: The routing hops as a linear function of the query's range selectivity (64 nodes, 1 attribute)
We also compared the two multi-attribute query resolution algorithms proposed in Sec. 2.5, i.e. iterative vs. single-attribute dominated. Fig. 2.7 shows the comparison of these two approaches. The number of routing hops used by the iterative algorithm increases almost linearly with the number of attributes in the query. In contrast, the single-attribute dominated algorithm uses a constant number of routing hops as the number of attributes increases from 1 to 8. This result is consistent with our theoretical analysis.
Figure 2.7: The expected number of routing hops as a function of the number of attributes (64 nodes, 10% range selectivity)
2.7 Related Work
Besides Chord, many other structured peer-to-peer networks have been proposed in recent years, such as Tapestry [141], Pastry [104], CAN [95], Koorde [45], TerraDir [113], Skip Graphs [5] and SkipNet [52].
The routing algorithms used in Tapestry and Pastry are both inspired by Plaxton [88]. The idea of the Plaxton algorithm is to find a neighboring node that shares the longest prefix with the key in the lookup message and to repeat this operation until a destination node is found that shares the longest possible prefix with the key. Each node has neighboring nodes that match each prefix of its own identifier but differ in the next digit. For a system with n nodes, each node has O(log n) neighbors, and the routing path takes at most O(log n) hops. Tapestry uses a variant of the Plaxton algorithm and focuses on supporting a more dynamic environment, with nodes joining and leaving the system. To adapt to environment changes, Tapestry dynamically selects neighbors based on the latency between the local node and its neighbors. Pastry also employs the locality information in its neighborhood set to achieve topology-aware routing, i.e. to route messages to the nearest node among the numerically closest nodes [104].
CAN [95] maps its keys to a d-dimensional Cartesian coordinate space. The coordinate space is partitioned into n zones for a network of n nodes. Each node in a CAN network has O(d) neighbors, and the routing path length is O(d × n^(1/d)) hops. If d is chosen to be O(log n), CAN has O(log n) neighbors and O(log n) routing hops like the above algorithms. CAN trades off neighborhood state for routing efficiency by adjusting the number of dimensions.
The above DHT algorithms are quite scalable because of their logarithmic neighborhood state and routing hops. However, these bounds are close to optimal but not optimal. Kaashoek et al. [45] proved that for any constant neighborhood state k, Ω(log n) routing hops is optimal; but in order to provide a high degree of fault tolerance, a node must maintain O(log n) neighbors, in which case O(log n / log log n) optimal routing hops can be achieved. Koorde is a neighborhood-state-optimal DHT based on Chord and de Bruijn graphs. It embeds a de Bruijn graph on the identifier circle of Chord for forwarding lookup requests. Each node maintains two neighbors: its successor and the first node that precedes its first de Bruijn node.
TerraDir [113] is a tree-based structured P2P system. It organizes nodes in a hierarchical fashion according to the underlying data hierarchy. Each query request is forwarded upwards repeatedly until it reaches the node with the longest matching prefix of the query; the query is then forwarded down the tree to the destination. In TerraDir, each node maintains a constant number of neighbors and the routing hops are bounded by O(h), where h is the height of the tree.
Recently, two novel structured P2P systems based on skip lists [91] were proposed: Skip Graphs [5] and SkipNet [52]. These systems are designed for searching P2P networks and provide the ability to perform queries based on key ordering, rather than just looking up a key. Thus, Skip Graphs and SkipNet maintain data locality, unlike DHTs. Each node in a Skip Graphs or SkipNet system maintains O(log n) neighbors in its routing table. A neighbor that is 2^h nodes away from a particular node is said to be at level h with respect to that node. A search for a key in Skip Graphs or SkipNet begins at the top-most level of the node seeking the key. It proceeds along the same level without overshooting the key, continuing at a lower level if required, until it reaches level 0. The number of routing hops required to search for a key is O(log n).
The above structured P2P systems provide scalable distributed lookup for unique keys. However, they cannot support efficient search operations such as keyword search and multi-dimensional range queries. Patrick Reynolds and Amin Vahdat [99] proposed an efficient distributed keyword search system, which distributes an inverted index over a distributed hash table such as Chord or Pastry. To minimize the bandwidth consumed by multi-keyword conjunctive searches, they use Bloom filters to compress the document ID sets by about one order of magnitude and use caching to exploit temporal locality in the query workload.
pSearch [120] is another peer-to-peer keyword search system; it distributes document indices into a CAN network based on the document semantics generated by Latent Semantic Indexing (LSI). In pSearch, a rolling index scheme is used to map the high-dimensional semantic space to the low-dimensional CAN space. It also uses content-aware node bootstrapping to force the distribution of nodes in the CAN to follow the distribution of indices.
Artur Andrzejak et al. [4] extend CAN to handle range queries on single attributes by mapping the one-dimensional space to CAN's multi-dimensional space using a Hilbert space-filling curve as the hash function. For a range query [l, u], they first route to a node whose zone includes the middle point (l + u)/2. That node then recursively propagates the request to its neighbors until all the nodes that intersect the query have been visited (a flooding strategy). They also proposed and compared three different flooding strategies: brute force, controlled flooding and directed controlled flooding. However, this work did not address multi-attribute range queries.
In contrast to Andrzejak's system, Cristina Schmidt and Manish Parashar [110] proposed a dimension-reducing indexing scheme that efficiently maps the multi-dimensional information space into the one-dimensional Chord identifier space using a Hilbert space-filling curve. This system can support complex queries containing partial keywords, wildcards, and range queries. They address the load balance problem by probing multiple successors at node join and migrating virtual nodes at runtime. This system therefore does not need to know the distribution of the attribute values, but it introduces some extra joining and migration overhead.
2.8 Summary
In this chapter, we have presented a multi-attribute addressable network (MAAN) for indexing distributed resources in Grids. MAAN routes search queries to the nodes where the target resources are registered, and avoids flooding queries to all other irrelevant nodes. It resolves range queries efficiently by mapping attribute values to the Chord identifier space via uniform locality preserving hashing. We also introduced a single-attribute dominated algorithm to resolve multi-attribute queries. We have implemented the MAAN prototype system in Java and evaluated its performance with up to 512 nodes. Our experimental results show that the number of routing hops to resolve a query in MAAN is O(log n + n × s_min), where s_min is the minimum range selectivity over all attributes. Thus it scales well in the number of attributes. Moreover, when s_min = ε, the number of routing hops is logarithmic in the number of nodes.
Chapter 3
Distributed Aggregation Tree
3.1 Introduction
In addition to resource discovery, it is important to track global system properties in large-scale distributed systems.
For example, in Grids, administrators need to continuously monitor certain performance metrics of the whole system for capacity planning or system diagnostics. However, resource monitoring in P2P systems and Grids is quite challenging due to their increasing scale. For example, the current PlanetLab consists of 706 machines at 340 sites [9], and the planet-scale Grid will have 100,000 CPUs in 2008 [124]. P2P Grids such as SETI@Home [64] achieve massively distributed computing by aggregating CPU cycles from millions of contributing computers. Furthermore, the flat structure of P2P systems poses a major challenge to global information aggregation due to its large scale and decentralized nature.
Distributed information aggregation has broad applications in various distributed systems, such as Grid resource monitoring [29, 7] and P2P reputation aggregation [56, 143]. Table 3.1 summarizes six example applications of distributed information aggregation in different contexts. These applications are compared in terms of four important aspects: the aggregated information, the aggregation index, the aggregate functions, and the aggregation mode. For example, the aggregated information of a Grid resource monitoring system consists of the global properties of Grid resources, such as the total free CPU cycles. The aggregated information is indexed by an aggregation index, which is similar to the "group by" clause in the SQL language. Different aggregate functions are used to calculate the global information in two aggregation modes, i.e. on-demand and continuous. The former calculates the global information once upon an aggregation request, while the latter performs the aggregation continuously, once every time period.
Table 3.1: Example Applications of Distributed Information Aggregation
Application | Aggregated information | Aggregation index | Aggregate functions | Aggregation mode | Example systems
Grid resource monitoring | resource properties | property name | sum, average | continuous | Globus MDS [44], MAAN [20]
P2P reputation aggregation | top-k reputable peers | peer name | sum, top-k | continuous | PowerTrust [143], EigenTrust [56]
Distributed RDF repository | RDF object values | predicate | sum, average, count, etc. | on-demand | Edutella [83], RDFPeers [19]
Distributed worm signature detection | fingerprint statistics | fingerprint | sum, distinct-count | continuous | Autograph [63], WormShield [16]
Wide-area network monitoring | IP flow statistics | flow attribute | sum, average | on-demand | Mind [69]
Sensor network monitoring | sensor readings | monitored attribute | sum, average | continuous | TAG [73]
This chapter presents a distributed aggregation scheme that provides a common interface for aggregating global information. In this scheme, a DAT tree is constructed among nodes by leveraging a structured P2P network, namely Chord [118]. In a DAT, all nodes use a balanced routing scheme to build a balanced DAT tree towards the root node. We have implemented a prototype DAT system that runs on top of an RPC protocol or on a discrete event simulation engine, and we have evaluated the performance of the DAT system with up to 8192 nodes.
3.2 DAT System Model
In this section, we formulate the distributed aggregation problem in a structured P2P network. We then describe the basic formulation of the DAT approach based on Chord.
3.2.1 Distributed Aggregation Problem
The distributed aggregation problem can be formulated as follows. Consider a P2P network with n nodes, where each node i holds a local value x_i(t) ∈ X in time slot t, 1 ≤ i ≤ n. For a given aggregate function f: X^+ → X, the goal is to compute the aggregated value g(t) of all local values over time, i.e. g(t) = f(x_1(t), x_2(t), ..., x_n(t)), in a decentralized fashion.
We are interested in any generic aggregate function known as a reduction function. An aggregate function f is said to be reducible if f(X_1 ∪ X_2) = f(f(X_1), f(X_2)), where X_1, X_2 ⊆ X and X_1 ∩ X_2 = ∅. As shown in [73], most aggregate functions are reducible, such as count, min, max, sum, average, and distinct count. Reducible aggregate functions have two important properties. First, the aggregate either returns a single representative value from the set of all values (e.g. min and max) or calculates some property of all values (e.g. count and sum). In both cases, the output value is much smaller than the set of input values. Second, a reducible aggregate function can be applied to a large set of input values recursively, i.e. to a subset of the input values at a time.
3.2.2 Basic Formulation of the DAT Model
To solve the above aggregation problem, we propose a distributed aggregation tree (DAT) approach that builds a tree structure implicitly from the native routing paths of Chord. In a DAT, each node applies the given aggregate function f to the values of its child nodes and sends the aggregated value to its parent node. By recursively aggregating the values through the tree in a bottom-up fashion, the root node can calculate the global aggregated value very efficiently, since it only needs to collect the values from its direct child nodes. Fig. 3.1 shows an example of aggregating the global value through a DAT tree of seven nodes, where each node N_i has a local value x_i, i = 1, 2, ..., 7. After applying the aggregate function three times at nodes N_5, N_6, and N_7, the root node N_7 obtains the global aggregated value of all local values.
Figure 3.1: Example of aggregating the global value through a DAT tree of 7 nodes.
In a dynamic P2P network with nodes joining and leaving, it is quite challenging to build aggregation trees explicitly by maintaining parent-child membership [67]. First, explicit tree construction has limited scalability to a large number of aggregation trees, since the parent-child maintenance overhead increases linearly with the number of trees. Second, the membership overhead is further exaggerated when nodes dynamically join or leave the network. Since a node is often part of every aggregation tree, all related trees have to be adjusted upon its departure.
As shown in Sec. 2.2, structured P2P networks such as Chord have a unique network topology and routing mechanism that provide an elegant infrastructure for building aggregation trees implicitly. Instead of maintaining explicit parent-child membership, DAT nodes use the existing neighboring information of Chord to organize themselves into a tree structure in a bottom-up fashion. When a node joins or leaves the network, the Chord protocol updates its neighbors automatically using the finger stabilization algorithm [118]. Therefore, the DAT scheme does not have to repair the parent-child membership, which significantly reduces the tree maintenance overhead. We describe the detailed DAT construction algorithms in the following section.
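Before turning to the construction algorithms, the following Java sketch makes the reducibility property concrete: an average is carried as a (sum, count) partial aggregate, so a DAT node can merge its children's partial results and forward one small value to its parent. This is an illustrative sketch only; the prototype described later in this chapter is implemented in C.

// Illustrative sketch of a reducible aggregate: f(X1 ∪ X2) = f(f(X1), f(X2)).
public class PartialAverage {
    private double sum;
    private long count;

    /** Leaf nodes fold their local values into the partial aggregate. */
    public void addLocal(double value) {
        sum += value;
        count += 1;
    }

    /** Interior DAT nodes merge the partial aggregates reported by their children. */
    public void merge(PartialAverage child) {
        sum += child.sum;
        count += child.count;
    }

    /** The root extracts the global aggregate after all children have reported. */
    public double average() {
        return count == 0 ? 0.0 : sum / count;
    }

    public static void main(String[] args) {
        PartialAverage left = new PartialAverage();   // sub-tree rooted at one child
        left.addLocal(2.0); left.addLocal(4.0);
        PartialAverage right = new PartialAverage();  // sub-tree rooted at another child
        right.addLocal(6.0);
        PartialAverage root = new PartialAverage();
        root.merge(left); root.merge(right);
        System.out.println(root.average());           // 4.0, the same as averaging all values directly
    }
}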
3.3 DAT Construction Algorithms
This section presents the design and analysis of two DAT construction algorithms based on different Chord routing schemes. The basic Algorithm 1 builds a DAT tree from the finger routes of all Chord nodes to a given root node. To further balance the aggregation load among nodes, a new balanced routing scheme is proposed in Algorithm 2 to build more balanced DAT trees.
3.3.1 Basic DAT Construction
The basic construction scheme builds a DAT tree on Chord in a bottom-up fashion. We assume that all nodes aggregate the global information with regard to a given object key called the rendezvous key. A rendezvous key is the Chord identifier of a given aggregation index, which is similar to the "group by" clause in the SQL language. The rendezvous key is determined by the DAT application. For example, in a Grid resource monitoring system, the aggregated global resource properties are indexed by different property names, e.g. CPU usage; in this case, the rendezvous key is the corresponding SHA-1 hash value of the attribute name.
Let k be the rendezvous key of an aggregation, and r the root node of the DAT tree for this aggregation. The successor node of k is automatically selected as the root node via the same consistent hashing scheme as Chord, i.e. r = successor(k). Since consistent hashing has the advantage of mapping keys to nodes uniformly, this root selection scheme is capable of building multiple DAT trees in a load-balanced fashion. For example, in WormShield, a DAT tree has to be built for each fingerprint; with this scheme, each monitor is responsible for aggregating the information of roughly the same number of fingerprints. Besides automatic root selection, applications still have the flexibility to designate a given Chord node as the root by using its node identifier as the rendezvous key.
Consider a Chord network of n nodes, and let f_{u,v} be the finger routing path (i.e. the finger route) from u to v. Suppose f_{u,v} is of the form <w_0, w_1, ..., w_{q−1}, w_q>. Then (1) w_0 = u and w_q = v; and (2) for any 0 ≤ i < q, w_{i+1} = FINGER(w_i, j) such that w_{i+1} ∈ (w_i, k] and DIST(w_{i+1}, k) = min{DIST(FINGER(w_i, j), k), 0 < j ≤ b}. Let F be the set of finger routes from all nodes to a given root node r, i.e. F = {f_{v,r} | 1 ≤ v ≤ n}, where f_{v,r} is the finger route from node v to r.
To build a DAT tree rooted at node r, each node uses the next hop of its finger route towards key k as its parent node. Intuitively, all finger routes destined to k implicitly build a DAT tree, called the basic DAT. The following two lemmas prove this conjecture, since each finger route is loop-free and each node except r has a unique parent node.
Lemma 1. For any Chord finger route f_{v,r} = <w_0, w_1, ..., w_q> from node v to r, we have w_i ≠ w_j for i ≠ j and 0 ≤ i, j ≤ q.
Proof. According to the Chord finger routing algorithm discussed in Sec. 2.2, the next hop of a node towards key k is its finger node that most immediately precedes successor(k). In our context, r = successor(k). Hence, DIST(w_i, r) < DIST(w_j, r) for i > j with 0 ≤ i, j ≤ q, and 0 < DIST(w_i, r) < DIST(w_0, r) for 0 < i < q. Since the last hop of route f_{v,r} is r, i.e. w_q = r, node r is guaranteed to be reached within one circle of the ring topology. This implies that a node appears at most once in any given finger route to node r.
Lemma 1 shows that each finger route destined to the root node r is loop-free. Next, we prove that a node has the same next hop in every finger route that contains it. Let p(v, i) be the next hop of node v in the finger route f_{i,r} from i to r, where p(v, i) is empty if v is not in f_{i,r} or v is the last hop of f_{i,r}.
Lemma 2. For any node v ≠ r, the next hop of v in any finger route f_{i,r} is the same, where v ∈ f_{i,r} and 1 ≤ i ≤ n.
Proof. For a given key k, the next hop of v in f_{i,r} is p(v, i) = FINGER(v, j) such that p(v, i) ∈ (v, k] and DIST(p(v, i), k) = min{DIST(FINGER(v, j), k), 0 < j ≤ b}. Therefore, if v ≠ r, at least one finger of v is the next hop towards k. Moreover, the next hop is determined only by v and k, independent of the previous hops along the route. Hence, for any v ≠ r, there is one and only one next hop of v in any finger route f_{i,r}, where v ∈ f_{i,r} and 1 ≤ i ≤ n.
Since each node v has the same next hop p(v, i) towards r regardless of the finger route f_{i,r}, we can simply use p(v, i) as the parent node of v to build a basic DAT tree T(r), as specified in Algorithm 1. From Lemmas 1 and 2, it is clear that Algorithm 1 constructs a DAT tree rooted at r, since each finger route is loop-free and each node except r has a unique parent node.
Algorithm 1 Basic DAT Construction Algorithm
1: INPUT: rendezvous key k, finger table FINGER(i, j) of each node i, where i = 1, 2, ..., n and j = 0, 1, ..., b − 1
2: OUTPUT: a basic DAT tree T rooted at node r = successor(k)
3: for i ← 1 to n do
4:   if DIST(k, i) < DIST(PRED(i), i) then
5:     ROOT(T) ← i
6:   end if
7:   for j ← b − 1 downto 0 do
8:     if DIST(i, FINGER(i, j)) ≤ DIST(i, k) then
9:       PARENT(i) ← FINGER(i, j)
10:      break
11:    end if
12:  end for
13: end for
Figures 3.2(a) and 3.2(b) illustrate an example of constructing a basic DAT rooted at node N_0 in a Chord network of 16 nodes with 4-bit identifiers. In Fig. 3.2(a), the label on each link, denoted FINGER(N_i, j), indicates that the j-th finger node of N_i is selected as the next hop of N_i towards the root node N_0. In this example, each finger route towards N_0 from a Chord node N_i corresponds to the path from N_i to the root in the basic DAT. For example, the finger route from N_1 to N_0 is <N_1, N_9, N_13, N_15, N_0> in Fig. 3.2(a), and the basic DAT has the same path from N_1 to N_0, as shown in Fig. 3.2(b). Since N_0 is the next hop of N_8, N_12, N_14, and N_15, it has four child nodes.
Figure 3.2: Basic DAT construction using Chord finger routes to N_0 in a 16-node network. (a) Finger routing paths to N_0 in Chord. (b) Constructed basic DAT tree rooted at N_0.
This basic DAT construction algorithm can easily be extended to a distributed setting. In fact, distributed nodes do not need to build DAT trees explicitly. Instead, every node knows its parent directly from the Chord finger routing: the next hop in the forwarding route is the parent. Since Chord has a stabilization algorithm that updates the fingers during node arrival and departure, the resulting DAT tree adapts to node dynamics accordingly. Next, we analyze two important properties of the basic DAT, namely the tree height and the branching factor.
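Before that analysis, the following Java sketch illustrates the parent selection rule of Algorithm 1 as a node would apply it locally from its own finger table. It is an illustrative sketch with hypothetical types, not the libdat prototype, which is written in C; fingers[j] is assumed to point roughly 2^j away, as in Chord.

import java.math.BigInteger;

// Illustrative sketch of the parent selection of Algorithm 1 (hypothetical types; libdat itself is in C).
public class BasicDatParent {
    /** Clockwise distance from a to b in a 2^bits identifier space. */
    static BigInteger dist(BigInteger a, BigInteger b, int bits) {
        return b.subtract(a).mod(BigInteger.ONE.shiftLeft(bits));
    }

    /**
     * Returns the parent of this node for rendezvous key k: the finger that most immediately
     * precedes successor(k). Returns null if no finger precedes k, i.e. this node is the root.
     */
    static BigInteger parent(BigInteger self, BigInteger[] fingers, BigInteger key, int bits) {
        BigInteger toKey = dist(self, key, bits);
        for (int j = bits - 1; j >= 0; j--) {                        // scan from the farthest finger down
            if (fingers[j] == null) continue;
            if (dist(self, fingers[j], bits).compareTo(toKey) <= 0) {
                return fingers[j];                                   // first hit: closest preceding finger
            }
        }
        return null;                                                 // this node is successor(k), the root
    }
}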
3.3.2 Analysis of Basic DAT Properties
The height and branching factor of a DAT tree are important for the scalability and load balance of distributed aggregation. The tree height determines the maximal number of nodes an aggregation message must traverse before reaching the root; the aggregation latency increases as the DAT tree becomes deeper. The branching factor of a node is its number of children. Since each node in a basic DAT is responsible for aggregating the information from its children, its branching factor indicates the aggregation load of the node. To avoid hotspots and balance the load among nodes, it is desirable for each node to have the same branching factor. We formally analyze these two tree properties of the basic DAT in the following lemma and theorem.
Lemma 3. The tree height of a basic DAT is O(log n) for a network of n nodes.
Proof. The basic DAT tree height is equal to the length of the longest Chord finger route, which is O(log n) hops in a network of n nodes.
From the example in Fig. 3.2(b), we know that the branching factor of a node is related to the distance between the node and the root. Let FINGER^+(i, j) denote the j-th outbound finger of node i; we have FINGER^+(i, j) = (i + 2^(j−1)) mod 2^b, where j = 1, 2, ..., b. Symmetrically, if v = FINGER^+(i, j), we define i as the j-th inbound finger of v, denoted FINGER^−(v, j); therefore, FINGER^−(v, j) = (v − 2^(j−1)) mod 2^b. In the following proof, we assume that all arithmetic operations on Chord node identifiers are modulo 2^b.
For a given node i, let PARENT(i) be the outbound finger of i that most closely precedes r. The children of i must be a subset of its inbound fingers; not all inbound fingers of i choose i as their parent, since they may have other outbound fingers that are closer to r. Suppose node r is the root node, and B(i, n) is the branching factor of node i in a basic DAT with n nodes. We first consider n = 2^b with node index i = 0, 1, 2, ..., 2^b − 1. As shown in Fig. 3.3(a), the identifier space is divided into four disjoint intervals: (i) (r, i − 2^j], (ii) (i − 2^j, r − 2^j], (iii) (r − 2^j, i), and (iv) [i, r], where j = ⌈log_2(d + 1)⌉. Fig. 3.3(b) identifies the parents of the nodes in intervals (i), (ii), and (iii).
Figure 3.3: Illustration of the parent fingers of nodes in different identifier spaces. (a) Four disjoint intervals of the Chord identifier space. (b) Parent fingers of nodes in intervals (i), (ii), and (iii).
Theorem 4. Consider a basic DAT tree in which n nodes are evenly distributed in the identifier space. The branching factor of node i is B(i, n) = log_2(n) − ⌈log_2(d/d_0 + 1)⌉, where d = DIST(i, r) and d_0 is the distance between any two adjacent nodes.
Proof. We prove the theorem in two cases: (1) 1 < d < 2^(b−1), and (2) 2^(b−1) ≤ d < 2^b. For case (2), it is obvious that B(i, n) = log_2(n) − ⌈log_2(d + 1)⌉ = 0, since (a) for all v ∈ (i, r], PARENT(v) ∈ (i, r], and (b) for all v ∈ (r, i], PARENT(v) = FINGER^+(v, b) ∈ (i, r].
For case (1), the children of i are its inbound fingers in (r, i − 2^j], where j = ⌈log_2(d + 1)⌉. Since d < 2^(b−1), we have i − 2^j > r. We split the proof over the four disjoint intervals shown in Fig. 3.3(a).
(i) For every v = FINGER^−(i, k) ∈ (r, i − 2^j] with k ≥ j + 1, we have PARENT(v) = FINGER^+(v, k) = i, since FINGER^+(v, k+1) = i − 2^(k−1) + 2^k ≥ i + 2^j = i + 2^⌈log_2(d+1)⌉ ≥ i + d + 1 = r + 1.
(ii) For every v ∈ (i − 2^j, r − 2^j], we have PARENT(v) = FINGER^+(v, j+1) ∈ (i, r].
(iii) For every v ∈ (r − 2^j, i), FINGER^+(v, j) = v + 2^(⌈log_2(d+1)⌉−1) < v + d + 1 ≤ i − 1 + d + 1 = r, and FINGER^+(v, j) = v + 2^(j−1) ≥ r + 1 − 2^j + 2^(j−1) = r + 1 − 2^(⌈log_2(d+1)⌉−1) > r + 1 − (d + 1) = i. Thus, PARENT(v) = FINGER^+(v, j) ∈ (i, r).
(iv) For every v ∈ [i, r], we have PARENT(v) ∈ (i, r].
As Fig. 3.3(b) shows, when 1 < d < 2^(b−1), node i has j inbound fingers in intervals (ii) and (iii), log_2(n) − j inbound fingers in interval (i), and no inbound fingers in interval (iv). Thus, node i has B(i, n) = log_2(n) − j = log_2(n) − ⌈log_2(d + 1)⌉ children in the DAT. When n < 2^b, we shrink the identifier space by the distance d_0 between adjacent nodes, which proves B(i, n) = log_2(n) − ⌈log_2(d/d_0 + 1)⌉.
Theorem 4 shows that the branching factor of a basic DAT is not the same for all nodes. For example, the root node has the maximal branching factor of log_2(n).
However, the minimal branching factor of a non-leaf node is 1, attained by the nodes whose distance to the root is between n·d_0/4 and n·d_0/2. Thus, the basic DAT is not balanced, and some nodes need to aggregate information from many more child nodes than others. This prompts us to build more balanced DAT trees.
3.3.3 Balanced DAT Construction
The imbalance of basic DATs is due to the greedy strategy applied in the Chord finger routing algorithm: a Chord node always forwards a message to the closest preceding node in its finger table. For example, node N_8 in Fig. 3.2(a) forwards its update to node N_0 directly, using the finger 2^3 away from itself in the identifier space. To build a balanced DAT with a constant number of branches, we propose a balanced routing scheme to construct the routing paths from all nodes to a given root node.
Instead of selecting a parent finger from the entire finger table, node i only considers the subset of fingers that are at most 2^g(x) away from i, where g(x) is a function of the clockwise distance x between i and the root r in the identifier space. We call g(x) the finger limiting function of node i. In Fig. 3.4, the solid arrows represent the fingers that may be used as a parent finger of i in the balanced routing scheme, while the dashed arrow represents the parent finger that would otherwise be used by the ordinary finger routing scheme.
Figure 3.4: Subset of fingers used in balanced routing
Next, we derive a function g(x) such that all balanced routing paths (i.e. balanced routes) to r build a balanced DAT tree with a constant branching factor, given that nodes are evenly distributed in the identifier space. Intuitively, any given node i should have at most two contiguous inbound fingers that use i as their parent finger towards r. For ease of exposition, we again assume that n = 2^b and i = 0, 1, 2, ..., 2^b − 1. Let d = DIST(i, r) and j = ⌈log_2(d + 2)⌉, and suppose nodes u and v are the j-th and (j+1)-th inbound fingers of i, respectively. The whole identifier space can be divided into four disjoint intervals: (i) (r, i − 2^j), (ii) [i − 2^j, i − 2^(j−1)], (iii) (i − 2^(j−1), i], and (iv) (i, r], as shown in Fig. 3.5.
Figure 3.5: Parent fingers of nodes in four intervals using balanced routing
To obtain a constant branching factor for each node, we let u and v be the only two child nodes of a given node i. Therefore, the inbound fingers of i in intervals (i) and (iii) must not use i as their next hop to r. For node v, we have
x = r − (i − 2^j) = d + 2^⌈log_2(d+2)⌉ and g(x) = j = ⌈log_2(d + 2)⌉.    (3.1)
Solving the above equations, we obtain g(x) = ⌈log_2((x + 2)/3)⌉. For the detailed mathematical deduction, readers are referred to Appendix B. In Sec. 3.3.4, we prove that each node then has at most two children, namely its j-th and (j+1)-th inbound fingers. When n < 2^b, we shrink the identifier space by the distance d_0 between any two adjacent nodes, since nodes are evenly distributed in the identifier space; thus g(x) = ⌈log_2((x + 2d_0)/3)⌉. Algorithm 2 specifies the construction of a balanced DAT.
Algorithm 2 Balanced DAT Construction Algorithm
1: INPUT: rendezvous key k, finger table FINGER(i, j) of each node i, where i = 1, 2, ..., n and j = 0, 1, ..., b − 1
2: OUTPUT: a balanced DAT tree T rooted at node r = successor(k)
3: d_0 ← average distance between two adjacent nodes
4: for i ← 1 to n do
5:   if DIST(k, i) < DIST(PRED(i), i) then
6:     ROOT(T) ← i
7:   end if
8:   x ← DIST(i, k)
9:   max ← ⌈log_2((x + 2d_0)/3)⌉
10:  for j ← max downto 0 do
11:    if DIST(i, FINGER(i, j)) ≤ DIST(i, k) then
12:      PARENT(i) ← FINGER(i, j)
13:      break
14:    end if
15:  end for
16: end for
Figures 3.6(a) and 3.6(b) demonstrate the balanced routing scheme and the corresponding balanced DAT tree. In Fig. 3.6(a), node N_8 only selects the closest preceding finger among the fingers that are at most 2^2 away from itself, since x = (0 − 8) mod 2^4 = 8 and g(x) = ⌈log_2((8 + 2)/3)⌉ = 2. Therefore, node N_12 is now the next hop of N_8, whereas node N_0 was its next hop in Fig. 3.2(a) under the ordinary finger routing algorithm. The routing of all other nodes remains unchanged, and the tree is balanced with a maximum branching factor of 2, as shown in Fig. 3.6(b).
Figure 3.6: Building a balanced DAT tree using the balanced routing scheme. (a) Balanced routing paths to N_0. (b) Balanced DAT tree rooted at N_0.
3.3.4 Analysis of Balanced DAT Properties
We now analyze the branching factor and tree height of the balanced DAT. When all nodes are evenly distributed in the identifier space, the following theorems prove that the DAT resulting from balanced routing is indeed a well-balanced tree with a maximum branching factor of 2.
Theorem 5. Consider a balanced DAT tree with evenly distributed node identifiers. Its tree height is at most log_2(n) for n nodes.
Proof. As shown in Fig. 3.5, node u is the closest child to i and DIST(u, i) = 2^(j−1). We prove DIST(u, i) ≥ d in the following two cases: (a) d = 2^k, and (b) d = 2^k − 1, where k = 0, 1, ..., b − 1. When d = 2^k, DIST(u, i) = 2^(⌈log_2(d+2)⌉−1) = 2^k = d. When d = 2^k − 1, DIST(u, i) = 2^(⌈log_2(d+2)⌉−1) = 2^k = d + 1 > d. Since the distance between i and its child is at least the distance between i and r, the length of any balanced routing path is at most log_2(n) in a network of n nodes. Therefore, the tree height of the balanced DAT is at most log_2(n) as well.
Theorem 6. Given that nodes are evenly distributed in the identifier space, the DAT tree built from balanced routes is a balanced tree with a branching factor of at most 2.
Proof. For any given node i, only its j-th and (j+1)-th inbound fingers in (r, i) are the children of i in a balanced DAT. We split the discussion into four cases: (1) For every w ∈ (r, i − 2^j), we have i ≠ PARENT(w), since w + 2^j < i. (2) For every w ∈ (i − 2^(j−1), i), we have i ≠ PARENT(w), since w + 2^(j−1) ∈ (i, r). (3) For w = i − 2^(j−1), we have i = PARENT(w), since g(d + 2^(j−1)) = ⌈log_2((d + 2^(⌈log_2(d+2)⌉−1) + 2)/3)⌉ = ⌈log_2(d + 2)⌉ − 1 = j − 1. (4) For w = i − 2^j, we have i = PARENT(w), since g(d + 2^j) = ⌈log_2((d + 2^⌈log_2(d+2)⌉ + 2)/3)⌉ = ⌈log_2(d + 2)⌉ = j.
In addition, a DAT must be a balanced tree if its tree height is log_2(n) and the branching factor is at most 2. For any given node i, its left sub-tree has at most one more node than its right sub-tree, and vice versa; otherwise, the overall tree height would exceed log_2(n) for a tree of n nodes, since the branching factor is at most 2.
Theorem 6 proves that if the ranges between immediately adjacent nodes are all the same, the balanced routing scheme leads to a balanced tree. However, if the interval of a randomly selected node is split as in Chord, the ranges are not uniformly distributed [2]; the ratio of the maximal and minimal ranges is O(log n), where n is the network size. To ensure that the ranges among nodes are distributed uniformly, Adler et al. [2] proposed an identifier probing approach in which each joining node probes O(log n) neighbors of a randomly selected node and splits the one with the maximal interval. The ratio of the maximal and minimal ranges in this approach is bounded by a constant factor. Our simulation results in Sec. 3.5.2 show that with node identifier probing, the maximal branching factor in a balanced DAT is a constant as well.
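The following Java sketch shows how a node would apply the balanced routing rule of Algorithm 2 locally, restricting its candidate fingers to those at most 2^g(x) away. As before, the types are hypothetical and the sketch is illustrative rather than the C implementation used in libdat; fingers[j] is assumed to point roughly 2^j away, as in Chord.

import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.RoundingMode;

// Illustrative sketch of the balanced routing rule of Algorithm 2 (hypothetical types; libdat is in C).
public class BalancedDatParent {
    static BigInteger dist(BigInteger a, BigInteger b, int bits) {
        return b.subtract(a).mod(BigInteger.ONE.shiftLeft(bits));
    }

    /** Finger limiting function g(x) = ceil(log2((x + 2*d0) / 3)) for average inter-node distance d0. */
    static int fingerLimit(BigInteger x, BigInteger d0) {
        double ratio = new BigDecimal(x.add(d0.multiply(BigInteger.TWO)))
                .divide(BigDecimal.valueOf(3), 20, RoundingMode.HALF_UP)
                .doubleValue();
        return (int) Math.ceil(Math.log(ratio) / Math.log(2.0));
    }

    /** Like the basic scheme, but only fingers at most 2^g(x) away may be chosen as the parent. */
    static BigInteger parent(BigInteger self, BigInteger[] fingers, BigInteger key, int bits, BigInteger d0) {
        BigInteger toKey = dist(self, key, bits);
        int maxFinger = Math.min(bits - 1, fingerLimit(toKey, d0));
        for (int j = maxFinger; j >= 0; j--) {
            if (fingers[j] == null) continue;
            if (dist(self, fingers[j], bits).compareTo(toKey) <= 0) {
                return fingers[j];                                   // closest preceding finger within the limit
            }
        }
        return null;                                                 // successor(k): this node is the root
    }
}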
3.4 DAT Prototype Implementation
Based on the above DAT construction algorithms, we implemented a prototype DAT system, called libdat, in the C language on both Linux and FreeBSD. Next, we describe the architecture of our DAT implementation and the detailed mechanisms for identifier probing and aggregation synchronization.
3.4.1 Implementation Architecture
Figure 3.7 shows the implementation architecture of our DAT prototype. In this implementation, each DAT node consists of three layers, i.e. the RPC, Chord and DAT layers. The RPC layer implements the low-level mechanisms of remote procedure call for communication among distributed nodes. It provides four routines to its upper layers, i.e. rpc_call, rpc_dispatch, set_timer, and activate_timer. Both Chord and DAT messages are transmitted over the UDP protocol in external data representation (XDR) format; the XDR encoding and decoding routines are automatically generated by the rpcgen compiler. An RPC manager module is implemented at the socket level to send and receive UDP packets. By using a select() call waiting on the UDP socket, the timer management module is able to activate the next timer without being blocked.
Figure 3.7: DAT Implementation Architecture
To simplify the testing and evaluation of our DAT prototype, we also implemented a discrete event simulation engine that provides the same interface to the Chord and DAT layers. A heap-based event queue is used to insert and fire events in chronological order. Without modifying the upper layers, the simulator can be used to evaluate the performance of libdat with a large number of nodes, as we show in Sec. 3.5.
The Chord layer extends the original Chord protocol with extensions for identifier probing and for maintaining extra information about fingers. It consists of three components, i.e. the Chord procedures, the finger table and finger stabilization. Each node keeps not only the information of its direct fingers, but also the information of its fingers of finger (FOF). When a node joins the network, it first sends a join request with a random identifier to a well-known node. The request is then forwarded to the successor of the random identifier. The successor splits the maximal interval among its fingers and returns the designated node identifier to the joining node. Finally, the node uses the same node join operation as in Chord [118] to join the network.
The DAT layer implements both on-demand and continuous aggregation modes for different aggregate functions. It leverages three underlying Chord routines, i.e. route, broadcast and upcall. To support multiple DAT trees simultaneously, each DAT node also maintains an aggregation table that keeps track of the currently active DAT trees, as shown in Fig. 3.7. When a node initializes an aggregate for a given rendezvous key, it adds a new entry to the aggregation table for this aggregate and computes its child nodes based on the information in the Chord finger table. Next, we describe the detailed mechanisms of on-demand and continuous aggregation.
3.4.2 Aggregation Synchronization
During the course of an aggregation, each intermediate node in the DAT tree must synchronize with its child nodes to compute the aggregated value of its sub-tree. The DAT system uses two different schemes for on-demand and continuous aggregation. For on-demand aggregation, each node keeps the aggregate entry only until it receives all values from its child nodes. With the information of fingers of finger, each DAT node can easily identify which inbound finger is its child for a given key k. During an aggregation, the root node first adds an entry to its aggregation table and sends a request to each child. After receiving the reported values from all children, it removes the aggregate entry, aggregates all values, and returns the aggregated value to the client. This process is carried out recursively in a top-down fashion at all nodes in the DAT tree.
In contrast, for continuous aggregation, the DAT system uses a bottom-up approach to aggregate the global value periodically. The root node first broadcasts an aggregate request to all nodes in the network. Each node then computes its children information and adds an entry to its aggregation table. In every aggregation period, each leaf node sends its local value to its parent node. Once it has received all updates from its child nodes or the time period expires, the parent node calculates its own aggregated value and sends it to its parent node. The same process is repeated recursively in a bottom-up fashion. Thus, the root node can aggregate the global value continuously in each time period.
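The following Java sketch illustrates the per-period synchronization logic for continuous aggregation described above, using sum as the example aggregate. The class and method names are hypothetical; the libdat prototype implements this logic in C on top of its RPC and timer layers.

import java.util.HashMap;
import java.util.Map;
import java.util.OptionalDouble;
import java.util.Set;

// Illustrative sketch of per-period synchronization for continuous aggregation (hypothetical class).
public class PeriodSynchronizer {
    private final Set<String> expectedChildren;       // child node identifiers for this DAT tree
    private final Map<String, Double> reported = new HashMap<>();
    private final double localValue;

    public PeriodSynchronizer(Set<String> children, double localValue) {
        this.expectedChildren = children;
        this.localValue = localValue;
    }

    /** Called when a child's partial aggregate arrives; returns the value to forward if all have reported. */
    public OptionalDouble onChildReport(String child, double partial) {
        reported.put(child, partial);
        return reported.keySet().containsAll(expectedChildren)
                ? OptionalDouble.of(flush())
                : OptionalDouble.empty();
    }

    /** Called when the aggregation period expires; forwards whatever has been collected so far. */
    public double onPeriodExpired() {
        return flush();
    }

    private double flush() {
        double sum = localValue;                      // sum is used as the example aggregate function
        for (double v : reported.values()) sum += v;
        reported.clear();
        return sum;
    }
}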
3.5 Performance Evaluation
In this section, we measure the performance and scalability of our DAT prototype system with three metrics: tree properties, message overhead, and the effects of load balancing.
3.5.1 Experiment Setup
To faithfully evaluate the DAT system at different scales, we have implemented a UDP-based RPC module as well as a discrete event simulator. We deployed the DAT system in an 8-node cluster at the USC Internet and Grid Computing Lab. The cluster nodes are dual Xeon 3.0 GHz processors with 2 gigabytes of memory, running Linux kernel 2.6.9 and connected via a 1-Gigabit Ethernet switch. We ran up to 64 DAT instances on each machine to create a network of 512 nodes. For larger networks of up to 8192 nodes, we ran the DAT prototype in the event-driven simulator. Note that both the RPC-based and simulator-based setups use the same Chord and DAT layers, and they indeed produce consistent results for the metrics measured in this section.
3.5.2 Measured DAT Tree Properties
We examine the DAT tree properties with network sizes varying from 16 to 8192. We studied three different properties of DAT trees, i.e. the maximum branching factor, the average branching factor and the tree height. Fig. 3.8(a) shows the cumulative distribution of DAT node branching factors in a network of 4096 nodes. When no probing scheme is used, both finger routing and balanced routing construct DAT trees with skewed distributions of branching factors. When both balanced routing and identifier probing are used, 5.5%, 86%, and 5.6% of the nodes have branching factors of 1, 2, and 3, respectively, and only about 3% of the remaining nodes have the maximal branching factor of 4. This implies that balanced routing paths with identifier probing construct much more balanced DAT trees than the other schemes.
Figure 3.8(b) plots the maximal branching factor as a function of network size for both basic and balanced DATs. The maximal branching factor of the basic DAT increases on a log scale with the number of nodes. Note that the network size is in log scale. When probing is used to balance node identifiers, the maximal branching factor decreases significantly, e.g. 16 vs. 43 for 8192 nodes. However, it still increases on a log scale with the network size. In contrast, the maximal branching factor of the balanced DAT is almost a constant of 4 when node identifiers are uniformly distributed by probing O(log n) neighbors. Without identifier probing, however, balanced DAT trees still have a maximal branching factor that increases on a log scale. This is because the ratio of the maximal and minimal ranges between adjacent nodes is O(log n) when node identifiers are randomly chosen.
Figure 3.8: Comparison of tree properties for different DAT schemes. (a) Cumulative distribution of branching factors in a network of 4096 nodes. (b) Maximum branching factor vs. network size. (c) Average branching factor vs. network size. (d) Tree height vs. network size.
Figure 3.8(c) shows that the average branching factors of balanced DATs remain constant as the network size increases. When identifier probing is used, the two DAT trees have almost the same constant average branching factor of 2. Without identifier probing, they increase to 3 and 3.2 respectively, although they still remain constant as the network size increases. As shown in Fig. 3.8(d), the tree height is always bounded by O(log n), since the routing hops are at most O(log n) in a network of n nodes.
3.5.3 Measured Message Overheads
For distributed information aggregation systems, message overhead is an important metric of scalability. There are two types of messages introduced by the DAT system, i.e. aggregation and stabilization messages. The former are transmitted among DAT nodes to aggregate the global information. Since the aggregation messages often vary with the workload, we measure their overhead by the number of messages used per node in each aggregation round. The stabilization messages, on the other hand, are used by the Chord layer to keep the finger information up to date. Their overhead is measured by the stabilization traffic rate per node.
Figure 3.9(a) compares the aggregation message overhead under various network sizes for three aggregation schemes, i.e. centralized without DAT, basic DAT and balanced DAT. In the centralized scheme, each node sends its local value update to the root node directly using the Chord finger routing. Therefore, the number of aggregation messages per node increases on a log scale with the network size, since each update message is routed in O(log n) hops. For example, in a network of 1024 nodes, the centralized scheme uses 4.8 aggregation messages on average. In contrast, both basic and balanced DATs use one aggregation message per node, since the local values of a sub-tree are aggregated and sent to its parent node only once.
Next, we show the amount of stabilization overhead required to maintain the Chord overlay network. In the DAT system, each node periodically sends ping messages to its fingers to retrieve the information of its fingers of finger. It also sends Chord stabilization messages to its immediate successors; these messages ask nodes to identify their predecessor nodes. Finally, additional messages are sent periodically to maintain an updated finger table, in which each DAT node maintains pointers to nodes that are logarithmically distributed around the Chord identifier space. We refer collectively to these three types of messages for Chord membership maintenance as stabilization traffic.
Figure 3.9: Aggregation overhead of centralized, basic and balanced DAT schemes with network size varying from 16 to 8192. (a) Number of aggregation messages for the count function vs. network size. (b) Stabilization traffic for membership maintenance vs. network size.
Fig. 3.9(b) shows the measured overhead per node, in bytes per second of stabilization traffic, as the network size increases. The three lines show different periods (5 s, 10 s and 20 s) at which the stabilization messages are sent. For all three update intervals, the stabilization traffic is quite low (less than 3 KB/s for 8192 nodes). The stabilization traffic per node increases at a rate of O(log^2 n) for a network of size n. This is because each node has O(log n) fingers and each ping message contains the information of O(log n) fingers of finger.
3.5.4 Effects of Load Balancing
Besides the average message overhead per node, the distribution of aggregation messages among nodes is another important metric for evaluating the performance of the DAT system. Apparently, the more evenly the messages are distributed among nodes, the better the aggregation process is load balanced. Fig. 3.10(a) plots the distributions of aggregation messages in a network of 512 nodes for the three different schemes. In this figure, the DAT nodes are sorted in descending order of the number of aggregation messages, and we define the node rank as the position of a node in this sorted node list. As shown in Fig. 3.10(a), the message distribution of the centralized scheme without DAT is quite skewed. Note that the y-axis is in log scale. For example, the root node is the most loaded one with 511 aggregation messages, which is almost the same as the total number of nodes in the network. This is because every node in the network except the root node itself must send its local value to the root node directly. In addition, the closer a node precedes the root node in the Chord identifier space, the more aggregation messages it has to forward for other nodes, due to the nature of the Chord finger routing algorithm.
Figure 3.10: Comparison of load balance for centralized, basic and balanced DAT schemes. (a) Distribution of aggregation messages among nodes. (b) Imbalance of aggregation messages vs. network size.
In contrast, distributed aggregation in the network with DAT trees significantly reduces the imbalanced load at the root monitor. Each intermediate node in the DAT tree only processes the aggregation messages from its direct children instead of every node in the sub-tree. For example, the most loaded nodes in basic and balanced DATs have only 24 and 4 messages respectively. Since basic DAT is not a balanced aggregation tree, the root has more children than other nodes. Therefore, the distribution of message overhead in basic DAT is still more skewed than that in balanced DAT.

We define the imbalance factor of message overhead as the ratio between the maximum and average number of aggregation messages on each node. The aggregation is well balanced if the imbalance factor is close to 1. Fig. 3.10(b) shows the imbalance factor as a function of the network size varying from 100 to 1000 for the three different aggregation schemes. The imbalance factor of the centralized scheme increases almost linearly with the network size since the root node has to process O(n) aggregation messages. The imbalance factor of the basic DAT only increases on a log scale with the network size. For example, the imbalance factors are 4.2 and 8.5 for the networks of 100 and 1000 nodes respectively. The balanced DAT has an almost constant imbalance factor under different network sizes, e.g. 1.9 and 2.0 for 100 and 1000 nodes respectively. This further validates our theoretical analysis of the DAT tree properties in Sec. 3.3.2 and Sec. 3.3.4.

3.6 Related Work

The DAT system is related to several previous works on aggregating global information in distributed systems [49, 98]. Astrolabe [98] provides a DNS-like distributed management service by grouping nodes into non-overlapping zones and specifying a tree structure of zones. In Astrolabe, a representative node is elected for each zone to propagate information across zones using a gossip protocol. Bawa et al. [10] compared the tree-based and propagation-based schemes for estimating aggregates in P2P networks. They showed that aggregation trees are often prone to node failures in unstructured P2P networks.
The default Plaxton routing algorithm is also modied to provide administrative isolation. As we show in Sec. 3.3, the aggregation trees in SDIMS is similar to the DATs built from Chord nger routes. Our work focus more on the construction algorithms of more balanced aggregation trees. Instead of building aggregation trees, gossip-based protocols [54, 105, 62] estimate the aggre- gates by exchanging information among nodes in an epidemic manner. Kempe et al [62] show that these protocols converge exponentially fast to the true aggregates. In addition to P2P networks, in-network information aggregation has been investigated in the context of sensor networks with more restrict energy and security constraints [73, 142, 90, 127, 76]. 58 3.7 Summary This chapter has presented the algorithms, prototype implementation, and performance evaluation of our DAT system built on top of the Chord overlay network. We rst formulated the distributed aggregation problem and our DAT system model. We then introduced the basic DAT construction algorithm and analyzed its properties in terms of tree height and branching factor. A balanced Chord routing algorithm were developed to enable the construction of balanced DATs, when nodes are evenly distributed in the identier space. After describing the details of our DAT prototype implementation, we evaluated the performance of the DAT system with three metrics, including tree properties, message overheads, and effects of load balancing. Our experimental results demonstrates that the balanced DAT scheme scales well to a large number of nodes and corresponding aggregation trees. 59 Chapter 4 Distributed Cardinality Counting 4.1 Introduction The DAT scheme works well for most aggregation functions that yield a single output value from a set of input values. However, aggregation through a DAT tree is not enough for counting the global cardinality of large sets that are distributed among nodes. The cardinality of a set is the number of its distinct elements, while its size counts all elements. To count the cardinality of multiple sets, an aggregation function has to output the set of distinct elements, instead of only the number of distinct elements. Therefore, each DAT node must send all distinct elements of its subtree to its parent node. For large cardinalities, the communication cost of transferring the set of distinct elements between DAT nodes could be very high. For example, during a worm outbreak, the number of distinct IP addresses of a worm signature is exceptionally larger than those of normal trafc patterns. The worm signature can be automatically generated by tracking the global cardinality of IP address sets observed at multiple edge networks. However, the counting of global address dispersion must scale up to large sets of IP addresses with moderate communication cost. In this chapter, we propose a distributed cardinality counting scheme with an adaptive count- ing algorithm. Instead of collecting the full sets of distinct elements, our approach digest these 60 sets into small cardinality summaries that are merged through a DAT tree to estimate the global cardinalities. Existing probabilistic counting algorithms are typically designed for a predeter- mined range of set size and cardinality. They cannot scale either up to large cardinalities [132], or down to small ones [34]. However, the number of distinct IP addresses can change dramatically in the order of magnitude during a worm outbreak. 
Our adaptive counting algorithm, on the other hand, can intelligently adapt itself to the cardinality of the set it counts with different algorithm variants and a common data structure, which is well-suited for anomaly detection.

4.2 LogLog Cardinality Counting

Counting large cardinalities is a very important problem in many areas such as database query optimization and network traffic analysis. Several probabilistic algorithms have been proposed since the mid-1980s, e.g. probabilistic counting [42], linear counting [132], multi-resolution bitmap [38] and LogLog counting [34]. These algorithms estimate the cardinality of a considerably large set by digesting the set into a small memory referred to as a cardinality summary.

The LogLog algorithm designed by Durand and Flajolet counts sets of large cardinality very efficiently in terms of space complexity and accuracy [34]. Similar to other approaches, the LogLog algorithm first applies a uniform hash function to all elements in set S to eliminate duplicates and to generate hash values that closely resemble a random uniform distribution. Suppose a hash value x is in binary string format, i.e. x ∈ {0, 1}^L, where L is the hash length. Let ρ(x) denote the position of its first (most significant) 1-bit, e.g. ρ(1···) = 1 and ρ(001···) = 3. Durand and Flajolet showed that the largest ρ over a set of randomly distributed hash values can provide a reasonable indication of log₂ n, where n is the set cardinality. To obtain a better estimate, the LogLog algorithm separates hash values into m groups, or buckets, by using the least significant k bits of x as the bucket index, where m = 2^k. Suppose counter C_j records the largest ρ for hash values in bucket j; then the arithmetic mean (1/m) Σ_{j=1}^{m} C_j can be expected to approximate log₂(n/m) with an additive bias. Therefore, the cardinality n can be estimated by

n̂ = α_m · m · 2^{(1/m) Σ_{j=1}^{m} C_j},    (4.1)

where α_m is a correction factor approximated by α_∞ = e^{−γ} √2/2 ≈ 0.39701 (γ is Euler's constant) when m ≥ 64. The estimator n̂ is asymptotically unbiased as n → ∞. The bias (B) of n̂/n is derived from its expectation (E), or

B(n̂/n) = E(n̂/n) − 1 = θ_{1,n} + o(1),    (4.2)

where |θ_{1,n}| < 10^{−6} as n → ∞ [34]. The standard error (SE) of n̂/n is derived from the variance (Var), or

SE(n̂/n) = √Var(n̂) / n ≈ 1.30/√m.    (4.3)
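The per-element update and the estimator in (4.1) are simple to implement. The following stand-alone C sketch is for illustration only: it uses a generic 64-bit mixing hash in place of the UMAC-based uniform hash used in our prototype, hard-codes k = 10 bucket bits, and applies the asymptotic correction factor α∞ ≈ 0.39701.

    #include <stdio.h>
    #include <math.h>
    #include <stdint.h>

    #define K 10                 /* bucket-index bits            */
    #define M (1 << K)           /* m = 2^k buckets              */
    #define ALPHA 0.39701        /* asymptotic correction factor */

    /* Stand-in 64-bit mixing hash (the prototype uses the UMAC hash). */
    static uint64_t mix64(uint64_t x) {
        x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
        x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
        x ^= x >> 33; return x;
    }

    /* rho(x): position of the first (most significant) 1-bit of an
     * nbits-wide value; all-zero input gives nbits + 1.             */
    static int rho(uint64_t x, int nbits) {
        for (int i = nbits - 1; i >= 0; i--)
            if (x & (1ULL << i)) return nbits - i;
        return nbits + 1;
    }

    int main(void) {
        static unsigned char C[M];            /* cardinality summary, all zero */
        uint64_t n = 1000000;                 /* true number of distinct items */
        for (uint64_t s = 1; s <= n; s++) {
            uint64_t h = mix64(s);
            uint64_t j = h & (M - 1);         /* least significant k bits      */
            int r = rho(h >> K, 64 - K);      /* rank of the remaining bits    */
            if (r > C[j]) C[j] = (unsigned char)r;
        }
        double sum = 0.0;
        for (int j = 0; j < M; j++) sum += C[j];
        double est = ALPHA * M * pow(2.0, sum / M);        /* Eq. (4.1) */
        printf("n = %llu, estimate = %.0f\n", (unsigned long long)n, est);
        return 0;
    }

Digesting an element touches a single counter, which is why the per-element cost reported later in Sec. 4.5.2 is essentially one hash plus one memory read and one write.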
4.3 Adaptive Cardinality Counting

The LogLog algorithm has very low space complexity; it only requires m counters in memory with log₂(log₂(n/m) + 3) bits for each counter. We refer to these m counters as the cardinality summary in our context. For example, with 1 million 5-bit counters (640 KB cardinality summary), we can estimate cardinalities up to 10^14 with a standard error around 0.13%.

Although the LogLog algorithm can scale up to very large cardinalities, it is asymptotically unbiased only when the cardinality n is much greater than the number of buckets m. Fig. 4.1 plots the empirically measured bias, on a log scale, of the ratio n̂/n by varying the load factor t = n/m. The number of buckets m is configured as 64K, 256K and 1024K, respectively. The experiment shows that the bias increases exponentially when the load factor t is less than 3.

[Figure 4.1: The bias of the LogLog algorithm for small cardinalities with load factor from 1 to 5.]

From (4.3), we know m should be large enough to ensure accuracy: e.g., we need at least 1 million buckets for an estimator of 0.13% error, and cardinalities less than 3 million cannot be accurately estimated due to the large bias. The large bias of the LogLog algorithm for comparatively small cardinalities is mainly due to sampling error, since many buckets are empty when t is relatively small. The probability that an element is hashed into a given bucket is p = 1/m. For a set of n distinct elements, the probability that a bucket is empty is p_e = (1 − p)^n ≈ (1/e)^{n/m} = e^{−t}; i.e., the number of empty buckets will increase exponentially when t decreases. For example, when t = 1, there are 36% empty buckets.

Although the LogLog algorithm is not suitable for estimating cardinalities when the load factor is small, the expected number of empty buckets can give us a much better estimation of n. Suppose the number of empty buckets is b_e. The expected number of empty buckets after digesting n distinct elements is E[b_e] = m·p_e ≈ m·e^{−n/m}, and n can be estimated as

n̂ = −m ln(b_e/m).    (4.4)

Whang et al. also show that this is the maximum likelihood estimator of n [132]. In their linear counting algorithm, they count b_e using a bitmap of m bits instead of m buckets. The bias and standard error of n̂/n of their scheme are

B(n̂/n) = (e^t − t − 1)/(2n),    (4.5)
SE(n̂/n) = √(e^t − t − 1) / (t√m).    (4.6)

In order to achieve a specified accuracy, the load factor t in linear counting cannot be too large. Fig. 4.2 shows the predicted standard errors of linear counting and LogLog counting, as well as the measured standard error of LogLog counting, as t varies from 1 to 5. The standard error of LogLog counting is almost constant as predicted by (4.3), while that of linear counting increases significantly when t changes from 1 to 5.

[Figure 4.2: The standard error of the LogLog algorithm for small cardinalities with load factor less than 5.]

However, there is a certain load factor, for different numbers of buckets, at which the standard error of linear counting becomes greater than that of LogLog counting. We refer to this particular load factor as the switching load factor, denoted by t_s. From (4.3) and (4.6), we have

SE(n̂/n) = √(e^{t_s} − t_s − 1) / (t_s √m) = 1.30/√m.    (4.7)

Solving (4.7), we have t_s ≈ 2.89, which is independent of the number of buckets m. As shown in Fig. 4.1, the maximum bias of LogLog counting is below 0.17% when t > t_s, which is sufficient for most cardinality estimation applications.

Therefore, to better estimate both small and large cardinalities, we propose an adaptive counting algorithm that estimates small cardinalities using linear counting when t ≤ t_s and estimates large ones using LogLog counting when t > t_s. Our adaptive counting algorithm uses the same data structure as LogLog counting, i.e. an array of m counters C_j that record the largest ρ for hash values in each bucket. Since ρ(x) is always greater than zero for all x (ρ(x) = L + 1 for x = 0, where L is the bit length of x), a bucket j is empty iff C_j = 0. Suppose the ratio of empty buckets is β; we have β_s = b_e/m = e^{−t_s} ≈ 0.051 from (4.4), where β_s is the ratio of empty buckets when t = t_s.
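The value of t_s can be double-checked numerically. A minimal stand-alone routine (illustrative only; the 1.30 constant comes from (4.3)) that solves (4.7) by bisection:

    #include <stdio.h>
    #include <math.h>

    /* Difference of the squared standard errors in (4.7), scaled by m:
     * (e^t - t - 1)/t^2 comes from linear counting, 1.30^2 from LogLog.
     * The common 1/m factor cancels, so the root does not depend on m. */
    static double f(double t) {
        return (exp(t) - t - 1.0) / (t * t) - 1.30 * 1.30;
    }

    int main(void) {
        double lo = 1.0, hi = 10.0;          /* f(lo) < 0 < f(hi) */
        for (int i = 0; i < 60; i++) {       /* plain bisection   */
            double mid = 0.5 * (lo + hi);
            if (f(mid) > 0.0) hi = mid; else lo = mid;
        }
        printf("switching load factor t_s = %.3f\n", 0.5 * (lo + hi));
        return 0;
    }

Because both standard errors share the common 1/√m factor, the root is independent of m, and the routine converges to t_s ≈ 2.89.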
Therefore, the adaptive counting algorithm estimates n as follows:

n̂ = α_m · m · 2^{(1/m) Σ_{j=1}^{m} C_j}   if 0 ≤ β < 0.051,
n̂ = −m ln(β)                              if 0.051 ≤ β ≤ 1.    (4.8)

Our adaptive counting algorithm uses only one data structure for the two estimation algorithms, i.e. it supports both linear counting and LogLog counting without introducing any extra processing or storage overhead compared with the ordinary LogLog algorithm. Also, as supported by our results in Sec. 4.5, the algorithm can scale down to small cardinalities and scale up to large ones simultaneously, which cannot be accomplished by either LogLog counting or linear counting alone. The dual scalability of our adaptive counting algorithm makes it well-suited for highly dynamic applications, such as worm signature generation systems.

4.4 Distributed Cardinality Counting

The distributed cardinality counting problem can be formulated as follows. Consider a network of n nodes, where S_i is the set of elements at node i, i = 1, 2, ..., n. In different contexts, S_i may contain different types of elements. For example, in a worm signature generation system, S_i could be the set of IP addresses observed by monitor i. Let S_g be the global set that is the union of the sets at all nodes, i.e. S_g = ∪_{i=1}^{n} S_i. The distributed cardinality counting problem is then to compute the cardinality of S_g from the distributed sets S_i, where i = 1, 2, ..., n.

Clearly, collecting all distinct elements over a wide-area network at a root node in the DAT will not scale up to large sets. Instead of performing the union operation on all sets directly, we only need to estimate the cardinality of the union of all sets. Since the sets may vary from empty to very large, we need to use a very small memory to estimate cardinalities varying over several orders of magnitude. Let Ŝ_i be the cardinality summary of S_i at node i. To estimate the global cardinality |S_g|, we only merge the cardinality summaries Ŝ_i instead of the actual sets S_i from all nodes. Then |S_g| can be estimated from the merged cardinality summaries using the adaptive counting algorithm.

Recall that a cardinality summary Ŝ consists of an array of m counters, i.e. Ŝ[j] = C_j, where 1 ≤ j ≤ m and C_j is the j-th counter. We define the max-merge operator (denoted by ⊕) as follows:

Ŝ = Ŝ_1 ⊕ Ŝ_2   iff   ∀j ∈ [1, m], Ŝ[j] = max{Ŝ_1[j], Ŝ_2[j]},    (4.9)

where Ŝ is the max-merged cardinality summary of Ŝ_1 and Ŝ_2. This operator can be repeatedly applied to n cardinality summaries, denoted by Ŝ_1 ⊕ Ŝ_2 ⊕ ··· ⊕ Ŝ_n. As we discussed in Sec. 4.3, the duplicate elements of a set are eliminated after applying the uniform hash function. Also, the cardinality summary of a set is independent of the processing sequence of its elements. Suppose S_g = ∪_{i=1}^{n} S_i; then we have

Ŝ_g = Ŝ_1 ⊕ Ŝ_2 ⊕ ··· ⊕ Ŝ_n.    (4.10)

Therefore, |S_g| can be estimated from the max-merged cardinality summary Ŝ_g. Algorithm 3 specifies the detailed process of estimating the global cardinality.

Algorithm 3 Distributed Cardinality Counting Algorithm
1: INPUT: a list of n distributed sets S_i, where n is the number of nodes and i = 1, 2, ..., n
2: OUTPUT: the global cardinality of all n sets
3: for all nodes i in 1, 2, ..., n do
4:   initialize an array of m counters C_j(i) = 0, where j = 1, ..., m
5:   for all elements s in S_i do
6:     hash s into a binary string B
7:     j ← B mod m; C_j(i) ← max(C_j(i), ρ(B))
8:   end for
9:   send the array of m counters C_j(i) to the root node r
10: end for
{The following operations are performed at the root node}
11: initialize an array of m counters G_j = 0, where j = 1, ..., m
12: for i = 1 to n do
13:   G ← max-merge(G, C(i))
14: end for
15: β ← the ratio of empty counters in G
16: if β < 0.051 then
17:   N̂ ← α_m · m · 2^{(1/m) Σ_{j=1}^{m} G_j}
18: else
19:   N̂ ← −m ln(β)
20: end if
21: output N̂ as the global cardinality
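Algorithm 3 is short enough to sketch directly in C. The code below is a simplified, stand-alone illustration rather than the prototype implementation: it again uses a generic mixing hash instead of UMAC, fixes m = 1024 buckets, and merges the summaries of two hypothetical monitors that observe overlapping element ranges, so that duplicates across monitors are counted only once.

    #include <stdio.h>
    #include <math.h>
    #include <stdint.h>

    #define K 10
    #define M (1 << K)
    #define ALPHA 0.39701

    static uint64_t mix64(uint64_t x) {            /* stand-in uniform hash */
        x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
        x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
        x ^= x >> 33; return x;
    }

    static int rho(uint64_t x, int nbits) {        /* rho(0) = nbits + 1 */
        for (int i = nbits - 1; i >= 0; i--)
            if (x & (1ULL << i)) return nbits - i;
        return nbits + 1;
    }

    static void digest(unsigned char *S, uint64_t elem) {  /* lines 6-7 */
        uint64_t h = mix64(elem);
        uint64_t j = h & (M - 1);
        int r = rho(h >> K, 64 - K);
        if (r > S[j]) S[j] = (unsigned char)r;
    }

    static void max_merge(unsigned char *G, const unsigned char *S) {  /* Eq. (4.9) */
        for (int j = 0; j < M; j++) if (S[j] > G[j]) G[j] = S[j];
    }

    static double adaptive_estimate(const unsigned char *G) {          /* Eq. (4.8) */
        int empty = 0; double sum = 0.0;
        for (int j = 0; j < M; j++) { if (G[j] == 0) empty++; sum += G[j]; }
        double beta = (double)empty / M;
        if (beta < 0.051) return ALPHA * M * pow(2.0, sum / M);  /* LogLog branch */
        return -M * log(beta);                           /* linear counting branch */
    }

    int main(void) {
        static unsigned char S1[M], S2[M], G[M];
        /* two hypothetical monitors observing overlapping element ranges */
        for (uint64_t e = 1;     e <= 60000;  e++) digest(S1, e);
        for (uint64_t e = 40001; e <= 100000; e++) digest(S2, e);
        max_merge(G, S1);                      /* root node, lines 11-14 */
        max_merge(G, S2);
        printf("estimated global cardinality: %.0f (true value 100000)\n",
               adaptive_estimate(G));
        return 0;
    }

The same digest, max_merge and adaptive_estimate routines apply unchanged whether the summaries are merged at a single root node, as in Algorithm 3, or pairwise along a tree.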
In a distributed environment, the max-merge operation is performed gradually through a DAT tree. Instead of collecting all cardinality summaries at a root node, each node max-merges the summaries from its children and only sends the merged one to its parent in the DAT tree. Fig. 4.3 illustrates the process of estimating the global cardinality in a DAT tree of 8 nodes. Since cardinality summaries have O(log log N) space complexity for N distinct elements, this scheme significantly reduces the message overhead for estimating the global cardinality of distributed sets.

[Figure 4.3: Distributed estimation of global cardinality through a DAT tree with 8 nodes.]

4.5 Performance Evaluation

In this section, we compared our adaptive counting algorithm with linear counting, LogLog counting and multi-resolution bitmaps by scaling the cardinality over six orders of magnitude. We also benchmarked the performance of digesting a large set in our scheme.

4.5.1 Scalability of Adaptive Counting

The efficacy of our distributed cardinality counting scheme largely depends on the space complexity and accuracy of the adaptive counting algorithm. Fig. 4.4 compares the adaptive counting algorithm with linear counting [132], LogLog counting [34] and the multi-resolution bitmap [38] by scaling cardinalities over about six orders of magnitude. The accuracy is measured by the Root Mean Squared Error (RMSE) of the ratio n̂/n, where n is the cardinality and n̂ is its estimator. The RMSE reflects both the bias and the standard error in a single metric, since RMSE(n̂/n) = √((B(n̂/n))² + (SE(n̂/n))²). We randomly generated a set of n distinct elements, and varied n from 0.1K to 128M to test the scalability of the counting algorithms. RMSE(n̂/n) is calculated from 100 runs. All four algorithms use the same 20 Kbit memory, and the multi-resolution bitmap is configured with 2% relative error for up to 10M.

Fig. 4.4 shows that linear counting estimates small cardinalities (less than 100K) more accurately than the other three algorithms, but cannot estimate cardinalities larger than 200K. Its RMSE increases dramatically from 0.022 to 0.24 when n only increases from 100K to 200K. After n is greater than 200K, the bitmap becomes full and cannot estimate n correctly. In contrast, the LogLog algorithm performs very well when n is larger than 10K, which corresponds to the switching load factor, 2.98, of our adaptive counting algorithm. LogLog counting scales to very large cardinalities such as 128M with an almost constant RMSE of 0.02. However, its RMSE increases significantly from 0.02 to 0.44 when n scales from 10K down to 2K, and further increases to 12.25 when n is 0.1K. The multi-resolution bitmap and adaptive counting have almost the same RMSE when n is less than 16M. But when n is greater than 16M, the multi-resolution bitmap becomes full very quickly and cannot estimate n correctly.
The adaptive counting algorithm has the same scale-up property as LogLog counting, while it can scale down to small cardinalities as well. Its RMSE decreases slightly from 0.017 to 0.012 when n scales from 10K down to 0.1K, which is much better than the LogLog counting algorithm. Therefore, it is capable of scaling both up to very large cardinalities (more than 128M) and down to small ones (less than 0.1K).

[Figure 4.4: Comparison of linear counting, LogLog counting, multi-resolution bitmap, and adaptive counting algorithms with n scaling from 0.1K to 128M, m = 4K.]

4.5.2 Performance of Cardinality Digesting

The performance of digesting a set into a cardinality summary is very critical for our counting scheme to support large sets. We did a benchmark test on our prototype implementation in C on a DELL server with a 3 GHz Xeon CPU and 1 GB of memory. We use the hash function in UMAC [13] as our uniform hash function. Figure 4.5 shows the cumulative distribution function of the CPU cycles used to digest each element into the cardinality summary with 256K or 1024K buckets. Each element uses 1040 cycles on average for 256K buckets and 1041 cycles on average for 1024K buckets. Therefore, the software implementation on a server with a 3 GHz CPU can process more than 2.8 million elements per second. Moreover, a current state-of-the-art hardware implementation of the UMAC hash can yield a throughput of 79 Gbps [135]. Since our scheme digests each element with one memory read and one write, it should be able to process 50 million elements per second with 10 ns SRAM and hardware acceleration.

[Figure 4.5: Performance of digesting large sets.]

4.6 Summary

In this chapter, we have presented a distributed cardinality counting scheme that digests large sets into cardinality summaries and estimates the global cardinality in a fast and accurate manner. To overcome the shortcomings of existing cardinality counting algorithms, we also introduced an adaptive counting algorithm that not only scales up to very large cardinalities, but also scales down to very small ones. As shown in Chapter 7, our approach is well-suited for worm signature generation systems to track the global IP address dispersion of worm signatures.

Chapter 5 P2P Replica Location Service

5.1 Introduction

In Grid environments, replication of remote data is important for data-intensive applications. Replication is used for fault tolerance as well as to provide load balancing by allowing access to multiple replicas of data. One component of a scalable and reliable replication management system is a Replica Location Service (RLS) that allows the registration and discovery of replicas. Given a logical identifier of a data object, the Replica Location Service must provide the physical locations of the replicas for the object. Chervenak et al.
[22] proposed a parameterized RLS framework that allows users to deploy a range of replica location services that make tradeoffs with respect to consistency, space overhead, reliability, update costs, and query costs by varying six system design parameters. A Replica Location Service implementation based on this framework is available as part of the Globus Toolkit Version 3 [23]. The Replica Location Service design consists of two components. Local Replica Catalogs (LRCs) maintain consistent information about logical-to-physical mappings on a site or storage system, and Replica Location Indices (RLIs) aggregate information about mappings contained in one or more LRCs. The RLS achieves reliability and load balancing by deploying multiple 72 LRC LRC LRC LRC RLI RLI RLI Replica Location Index Nodes Local Replica Catalogs RLI RLI Figure 5.1: Example of a hierarchical RLI Index conguration supported by the RLS implemen- tation available in the Globus Toolkit Version 3. and possibly redundant RLIs in a hierarchical, distributed index. An example RLS deployment is shown in Fig. 5.1. The RLS framework also envisions a membership management service that keeps track of LRCs and RLIs as they enter and leave the system and adapts the distributed RLI index according to the current server membership. However, the current RLS implementation does not contain a membership service; instead, it uses a static conguration of LRCs and RLIs that must be known to servers and clients. The current RLS implementation is being used successfully in production mode for several scientic projects, including as the Earth System Grid (www.earthsystemgrid.org) and the Laser Interferometer Gravitational Wave Observatory(www.ligo.caltech.edu). However, there are sev- eral features of the existing RLS that could be improved. First, because a membership service has not been implemented for the RLS, each deployment is statically congured, and the system does not automatically react to changes in membership (i.e., servers joining or leaving the sys- tem). Conguration les at each RLS server specify parameters including authorization policies; whether a particular server will act as an LRC, an RLI or both; how state updates are propagated 73 from LRCs to RLIs; etc. When new servers are added to or removed from the distributed RLS system, affected conguration les are typically updated via command-line administration tools to reect these changes. While this static conguration scheme has proved adequate for the scale of current deploy- ments, which typically contain fewer than ten RLS servers, more automated and exible member- ship management is desirable for larger deployments. Second, although the current RLS provides some fault tolerance by allowing LRCs to send state updates to more than one RLI index node, the overall RLI deployment is statically congured and does not automatically recover from RLI failures. We have no ability to specify, for example, that we want the system to maintain at least 3 copies of every mapping in the RLI index space or that after an RLI server failure, the distributed RLS should automatically recongure its remaining servers to maintain the required level of redundancy. In this chapter, we present a Peer-to-Peer Replica Location Service (P-RLS) that provides a distributed RLI index with properties of self-organization, greater fault-tolerance and improved scalability. It uses the overlay network of the Chord system [118] to self-organize P-RLS servers. 
A P-RLS server consists of an unchanged Local Replica Catalog (LRC) and a peer-to-peer Replica Location Index node called a P-RLI. The network of P-RLIs uses the Chord routing algorithm to store mappings from logical names to LRC sites. A P-RLI responds to queries re- garding the mappings it contains and routes queries for other mappings to the P-RLI nodes that contain those mappings. The P-RLS system also exploits the structured overlay network among P-RLI nodes to replicate mappings adaptively among the nodes; this replication of mappings provides a high level of reliability and availability in the P-RLS network. 74 We implemented a prototype of the P-RLS system by extending the RLS implementation in Globus Toolkit Version 3.0 with Chord protocols. We evaluated the performance and scalability of a P-RLS network with up to 16 nodes containing 100,000 or 1 million total mappings. We also simulated the distribution of mappings and queries in P-RLS systems ranging in size from 10 to 10,000 nodes that contain a total of 500,000 replica mappings. In this chapter, we describe the P-RLS design, its implementation and our performance results. 5.2 Globus Replica Location Service The RLS included in the Globus Toolkit Version 3 provides a distributed registry that maintains information about physical locations of copies and allows discovery of replicas. The RLS frame- work [22] consists of ve components: the LRC, the RLI, a soft state maintenance mechanism, optional compression of state updates, and a membership service. The LRCs maintain mappings between logical names of data items and target physical names. The RLIs aggregate state in- formation contained in one or more LRCs and build a hierarchical, distributed index to support discovery of replicas at multiple sites, as shown in Fig. 5.1. LRCs send summaries of their state to RLIs using soft state update protocols. Information in RLIs times out and must be periodically refreshed. To reduce the network trafc of soft state updates and RLI storage overheads, the RLS also implements an optional Bloom lter compression scheme [14]. In this scheme, each LRC only sends a bit map that summarizes its mappings to the RLIs. The bit map is constructed by per- forming a series of hash functions on logical names that are registered in an LRC and setting the corresponding bits in the bit map. Bloom lter compression greatly reduces the overhead of soft state updates. However, the bloom lter is a lossy compression scheme. Using bloom lters, 75 the RLIs lose information about specic logical names registered in the LRCs. There is also a small probability that the Bloom lter will provide a false positive, an incorrect indication that a mapping exists in the corresponding LRC when it does not. A membership service is intended to keep track of participating LRCs and RLIs as well as which servers send and receive soft state updates from one another. The current implementation does not include a membership service but rather maintains a static conguration for the RLS. The RLS is implemented in C and uses the globus io socket layer from the Globus Toolkit. The server consists of a multi-threaded front end server and a back-end relational database, such as MySQL or PostgreSQL. The front end server can be congured to act as an LRC server and/or an RLI server. Clients access the server via a simple string-based RPC protocol. The client APIs support C, Java and Python. 
The implementation supports two types of soft state updates from LRCs to RLIs: (1) a complete list of logical names registered in the LRC and (2) Bloom lter summaries of the contents of an LRC. The implementation also supports partitioning of the soft state updates based on pattern matching of logical names. The distributed RLI index can provide redundancy and/or partitioning of the index space among RLI index nodes. LRCs can be congured to send soft state updates summarizing their contents to one or more RLIs. When these updates are sent to multiple RLIs, we avoid having performance bottlenecks or single points of failure in the index space. In the framework design as well as the Globus Toolkit 3 implementation, RLS also supports the capability of limiting the size of soft state updates based on a partitioning of the logical namespace. With partitioning, we perform pattern matching of logical names and send only matching updates to a specied RLI index. The concept of partitioning was considered important to reduce the network and 76 memory requirements for sending soft state updates. In practice, however, the use of Bloom lter compression is so efcient at reducing the size of updates that partitioning is rarely used. While the current implementation of the RLS is being used effectively in several Grid pro- duction deployments and systems [31, 32], we are interested in applying peer-to-peer ideas to the distributed RLI index to produce an index that is self-congurable, highly fault tolerant and scalable. 5.3 Design of P2P Replica Location Service Next, we describe the design of our peer-to-peer Replica Location Service (P-RLS). This design replaces the hierarchical RLI index from the Globus Toolkit Version 3 RLS implementation with a self-organizing, peer-to-peer network of P-RLS nodes. In the P-RLS system, the Local Replica Catalogs (LRCs) are unchanged. Each LRC has a lo- cal P-RLI server associated with it, and each P-RLI node is assigned a unique m-bit Chord identi- er. The P-RLI nodes self-organize into a ring topology based on the Chord overlay construction algorithm discussed in Sec. 2.2. The P-RLI nodes maintain connections to a small number of other P-RLI nodes that are their successor nodes and nger nodes. When P-RLI nodes join or leave, the network topology is repaired by running the Chord stabilization algorithm. Thus, the Chord overlay network provides membership maintenance for the P-RLS system. Updates to the Replica Location Service begin at the Local Replica Catalog (LRC), where a user registers or unregisters replica mappings from logical names to physical locations. LRCs periodically send soft state updates summarizing their state into the peer-to-peer P-RLS network. The soft state update implementation in P-RLS is based on the uncompressed soft state updates of the original RLS implementation. Just as in that implementation, our updates containflogical 77 LRC/PRLI - A (4) LRC/PRLI - B (8) N4+1, N4+2, N4+4 N4+32 N4+8, N4+16 <lfn1000, lrc1000> lookup(K52 lookup(K52 lookup(K52 Finger Table N4+1 => N8 N4+2 => N8 N4+4 => N8 N4+8 => N20 N4+16 => N20 N4+32 => N40 <lfn1001, lrc1001> <lfn1002, lrc1002> SHA1(lfn1000)=18 SHA1(lfn1001)=52 SHA1(lfn1002)=31 LRC/PRLI - C (20) LRC/PRLI - D (24) LRC/PRLI - E (40) LRC/PRLI - F (48) LRC/PRLI - G (56) LRC/PRLI - H (60) Figure 5.2: Example of the mapping placement of 3 mappings in the P-RLS network with 8 nodes. name, LRCg mappings. 
To perform a soft state update in P-RLS, the system rst generates the Chord key identier for each logical name in the soft state update by applying an SHA1 hash function to the logical names. Then the system identies the P-RLI successor node of the Chord key of each logical name and stores the correspondingflogical name, LRCg mapping on that node. We call this successor node the root node of the mapping. Fig. 5.2 shows how three mapping are placed in a P-RLS network with 8 nodes. To locate an object in the P-RLS system, clients can submit queries to any P-RLS node. When a P-RLS node receives a query for a particular logical name, it generates the Chord key for that name and checks whether it is the successor node for that key. If so, then this node contains the desiredflogical name, LRCg; the node searches its local RLI database and returns the query result to the client. Otherwise, the node will determine the successor node for the object using 78 the Chord successor routing algorithm and will forward the client's query to the successor node, which returns zero or more logical name, LRC mappings to the client. Once the client receives these P-RLS query results, the client makes a separate query to one or more LRCs to retrieve mappings from the logical name to one or more physical locations of replicas. Finally, the client can access the physical replica. Next, we describe several aspects of our P-RLS design, including adaptive replication of P-RLI mappings and load balancing. 5.3.1 Adaptive Replication The P-RLI nodes in the P-RLS network can join and leave at any time, and also the network connection between any two nodes can be broken. In order to resolve queries forflogical name, LRCg mappings continuously despite node failures, we need to replicate the mappings on dif- ferent P-RLI nodes. In the P-RLS network, the Chord membership maintenance protocol can maintain the ring topology among P-RLI nodes even when a number of nodes join and leave concurrently. Thus, it is quite intuitive to replicate our mappings in the P-RLS network based on the membership information provided by the Chord protocol. Based on the above P-RLS design, we know that each mapping will be stored on the root node of the mapping. The root node maintains the connections to its k successor nodes in the Chord ring for successor routing reliability, where k is the replication factor and is typically O(log n) for a P-RLS network with n nodes. Thus, the total number of copies of each mapping is k + 1. A simple replication approach is to replicate the mappings stored on the root node to its k successors. This scheme, called successor replication, is adaptive when nodes join or leave the system. When a node joins the P-RLS network, it will take over some of the mappings and 79 replicas from its successor node. When a node leaves the system, no explicit handover procedure is required, and the node does not need to notify its neighbors; the Chord protocol running on the node's predecessor will detect its departure, make another node the new successor, and replicate mappings on the new successor node adaptively. If, because of membership changes in the P-RLS network, a particular node is no longer a successor of a root node, then the mappings from that root node need to be removed from the former successor node. We solve this problem by leveraging the soft state replication and the periodic probing mes- sages of the Chord protocol. 
Each mapping has an expiration time, and whenever a node receives a probe message from its predecessor, it will extend the expiration time of the mappings belonging to that predecessor, because the node knows that it is still the successor node of that predecessor. Expired mappings are timed out to avoid unnecessary replication of mappings. When a mapping on a root node is updated by an LRC, the root node updates its successors immediately to main- tain the consistency of replicated mappings. Since the successor replication scheme adapts to nodes joining and leaving the system, the mappings stored in the P-RLS network will not be lost unless all k successors of a particular root node fail simultaneously. 5.3.2 Load Balancing Load balancing is another important problem for a distributed replication index system, such as P- RLS. Here, we consider two aspects of the load balancing problem: evenly distributing mappings among nodes and query load balancing for extremely popular mappings. 5.3.2.1 Storage Load Balancing The Chord algorithm we discussed in Sec. 2.2 uses consistent hashing and virtual nodes to balance the number of keys stored on each node. However, the virtual nodes approach introduces some 80 extra costs, such as maintaining more neighbors per node and increasing the number of hops per lookup. We adaptively replicate mappings on multiple P-RLS nodes for fault tolerance purpose. At the same time, mapping replication can improve the distribution of mappings among nodes without using virtual nodes. In P-RLS, the number offlogical name, LRCg mappings stored on each P-RLI node is determined by the distance of the node to its immediate predecessor in the circular space, i.e. the owned region of the P-RLI node. In Chord [118], the distribution of the owned region of each node is tightly approximated by an exponential distribution with mean , where m is the number of bits of the Chord identier space and N is the number of nodes in the network. With adaptive replication using replication factor k, each P-RLI node not only stores the mappings belonging to its owned region, but also replicates the mappings belonging to its k predecessors. Therefore, the number of mappings stored on each P-RLI node is determined by the sum of k + 1 continuous owned regions before the node. Since the node identiers are generated randomly, there is no dependency among those continuous owned regions. Intuitively, when the replication factor k increases, the sum of k + 1 continuous owned regions will be more normally distributed. Therefore, we can achieve a better balance of mappings per node when we replicate more copies of each mapping. This hypothesis is veried by the simulation results in Sec. 5.5.3. Moreover, we can still use virtual nodes to distribute mappings among heterogeneous nodes with different capacities. 5.3.2.2 Query Load Balancing Although successor replication can achieve better distribution of the mappings stored on P-RLI nodes, it does not solve the hotspot problem for extremely popular mappings. Consider a mapping 81 i N x N 1 + i N 2 + i N 1 - i N 2 - i N y N rli_get_lrc(“popular-object”) rli_get_lrc(“popular-object”) {“popular-object”, rlsn://pioneer.isi.edu:8000} Figure 5.3: P-RLI queries for logical name popular-object traverse the predecessors of the root node N i fpopular-object, rlsn://pioneer.isi.edu:8000g that is queried 10,000 times from different P-RLI nodes. 
All the queries will be routed to the root node of the mapping, say node N i , and it will be a query hotspot in the P-RLS network. The successor replication scheme does not solve the problem because all replicas of the mapping are placed on successor nodes that are after the root node (clockwise) in the circular space. The virtual nodes scheme does not solve this problem either because the physical node that hosts the virtual root node will be a hotspot. However, recall that in the Chord successor routing algorithm, each hop from one node to the next node covers at least half of the identier space (clockwise) between that node and the destination successor node, i.e. the root node of the mapping. When the query is closer to the root node, there are fewer nodes in the circular space being skipped for each hop. Therefore, before the query is routed to its root node, it will traverse one of the predecessors of the root node with very high probability, as shown in Fig. 5.3. Therefore, we can improve our adaptive replication scheme and balance the query load for popular mappings by replicating mappings in the predecessor nodes of the root node. When a 82 predecessor node of the root node receives a query to that root node, it will resolve it locally by looking up the replicated mappings and then return the query results directly without forwarding the query to the root node. We call this approach predecessor replication. The predecessor replication scheme does not introduce extra overhead for Chord membership maintenance because each P-RLI node has information about its predecessors, since it receives probe messages from its predecessors. Also, this scheme has the same effect of evenly distributing mappings as the successor replication scheme because now each node stores its own mappings and those of its k successors. 5.4 P-RLS Implementation We implemented a prototype of the P-RLS system by extending the RLS implementation in Globus Toolkit 3.0 with Chord protocols. Fig. 5.4 shows the architecture of our P-RLS imple- mentation. In this implementation, each P-RLS node consists of a LRC server and a P-RLI server. The LRC server implements the same LRC protocol in original RLS, but uses Chord protocol to update logical name, LRC mappings. The P-RLI server implements both original RLI protocol and the Chord protocol. Messages in the Chord protocol include SUCCESSOR, JOIN, UPDATE, QUERY, PROBING, and STABILIZATION messages. The SUCCESSOR message is routed to the successor node of the key in the message, and the node identier and address of the successor node are returned to the message originator. When a P-RLI node joins the P-RLS network, it rst nds its immediate successor node by sending a SUCCESSOR message, and then it sends the JOIN message directly to the successor node to join the network. The UPDATE message is used to add or delete a mapping, and the QUERY message is used to lookup matched mappings for a 83 SUCCESSOR, JOIN, UPDATE, QUERY, PROBING STABILIZATION Chord Network RRPC Layer LRC Protocol Chord Protocol P-RLS RLI Protocol LRC Server P-RLI Server RLS Client API RRPC Layer LRC Protocol Chord Protocol P-RLS RLI Protocol RLS Client API P-RLI Server LRC Server Figure 5.4: The implementation architecture of P2P replica location service logical name. The P-RLI nodes also periodically send PROBING and STABILIZATION messages to detect node failures and repair the network topology. We implemented the Chord successor lookup algorithm using the recursive mode rather than the iterative mode. 
In iterative mode, when a node receives a successor request for an object key, it will send information about the next hop to the request originator if it is not the successor node of the key. The originator then sends the request to the next node directly. By contrast, in recursive mode, after a node nds the next hop, it will forward the request to that node on behalf of the request originator. There are two approaches for the successor node to send the reply to the request originator. The rst approach is simply to send the reply to the request originator. This approach might introduce a large number of TCP connections on the request originator from many different repliers. The second approach is to send the reply to its upstream node (the node where this node receives the successor request) and let the upstream node route the reply back to the request originator. In our P-RLS implementation, we implemented the second approach to avoid too many open TCP connections. All LRC, RLI and Chord protocols are implemented on top of the RLS RPC layer called RRPC. 84 5.5 Performance Evaluation In this section, we present performance measurements for a P-RLS system deployed in a 16-node cluster as well as analytical and simulation results for a P-RLS system ranging in size from 10 to 10,000 nodes with 500,000flogical name, LRCg mappings. 5.5.1 Scalability Measurements First, we present performance measurements for update operations (add or delete) and query op- erations in a P-RLS network running on our 16-node cluster. The cluster nodes are dual Pentium III 547MHz processors with 1.5 gigabytes of memory running Redhat Linux 9 and connected via a 1-Gigabit Ethernet switch. Fig. 5.5 shows that update latency increases O(log n) with respect to the network size n. This result is expected, since in the Chord overlay network, each update message will be routed through at most O(log n) nodes. The error bar in the graph shows the standard deviation of the update latency. These results are measured for a P-RLS network that contains no mappings at the beginning of the test. Our test performs 1000 updates on each node, and the mean update latency and standard deviation are calculated. The maximum number of mappings in the P-RLS network during this test is 1000, with subsequent updates overwriting earlier ones for the same logical names. Figure 5.6 shows that the query latency also increases on a log scale with the number of nodes in the system. These results are measured for two P-RLS networks that preload 100,000 and 1 million mappings respectively at the beginning of the test. Our test performs 1000 queries on each node, and the mean query latency and standard deviation are calculated. The results show that there is only slight latency increase when we increase the number of mappings in the P-RLS 85 0 1 2 3 4 5 6 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Number of nodes Update latency (ms) Figure 5.5: Update latency in milliseconds for performing an update operation in the P-RLS network. 0 1 2 3 4 5 6 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Number of nodes Query latency (ms) 100,000 preloaded mappings 1,000,000 preloaded mappings Figure 5.6: Query latency in milliseconds for performing a query operation in the P-RLI network network from 100,000 to 1 million. This is because all the mappings on each node are stored in a hash table and the local lookup cost is nearly constant with respect to the number of mappings. Next, Fig. 
5.7 shows the number of RPC calls that are required to perform a xed number of updates as the size of the network increases. This test uses 15 clients with 10 requesting threads per client, where each thread performs 1000 update operations. For each conguration, the clients are distributed among the available P-RLI nodes. For a P-RLS network consisting of a single node, the number of RPC calls is zero. The number of RPC calls increases on a log scale with the number of nodes in the system. Since the queries are routed to the root node, the 86 0 0.5 1 1.5 2 2.5 3 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Number of nodes Number of RRPC requests Figure 5.7: Number of RPC Calls performed for a xed number of update operations as the size of the P-RLS network increases number of RPC calls for each query should increase logarithmically with the number of nodes in the system. In the Chord overlay network, each P-RLI node must maintain pointers to its successors and to its nger nodes. The number of successors maintained by each node is determined by the replication factor k that is part of the P-RLI conguration. Fig. 5.8 shows the rate at which the number of pointers to neighbors maintained by a P-RLI node increases. In this experiment, we set the replication factor to be two, i.e. each P-RLI node maintains the pointers to two successors. The number of neighbor pointers maintained by a node increases logarithmically with the size of the network. The error bars shows the minimum and maximum number of neighbor pointers maintained by each P-RLI node. Next, we show the amount of overhead required to maintain the Chord overlay network. To maintain the Chord ring topology, P-RLS nodes periodically send probe messages to one another to determine that nodes are still active in the network. P-RLI nodes also send Chord stabilization messages to their immediate successors; these messages ask nodes to identify their predecessor nodes. If the node's predecessor has changed because of the addition of new P-RLI nodes, this 87 0 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Network size Number of neighbors Figure 5.8: Rate of increase in pointers to neighbor nodes maintained by each P-RLI node as the network size increases, where the replication factor k equals to two 0 200 400 600 800 1000 1200 1400 1600 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Number of nodes Membership maintenance traffic (Bps) 5 seconds stabilizing interval 10 seconds stabilizing interval Figure 5.9: Stabilization Trafc for a P-RLS network of up to 16 nodes with stabilization intervals of 5 and 10 seconds allows the ring network to adjust to those membership changes. Finally, additional messages are sent periodically to maintain an updated nger table, in which each P-RLI node maintains pointers to nodes that are logarithmically distributed around the Chord identier space. We refer collectively to these three types of messages for P-RLI membership maintenance as stabilization trafc. Figure 5.9 shows the measured overhead in bytes per second for stabilization trafc as the number of nodes in the P-RLS network increases. The two lines show different periods (5 seconds 88 and 10 seconds) at which the stabilization messages are sent. For both update intervals, the stabilization trafc is quite low (less than 1.5 Kbytes/second for 16 nodes). The stabilization trafc increases at a rate of O(n log n) for a network of size n. The graph shows the tradeoff between frequent updates of the Chord ring topology and stabilization trafc. 
If the stabilization operations occur more frequently, then the Chord overlay network will react more quickly to node additions or failures. In turn, this will result in better performance, since the nger tables for routing will be more accurate. The disadvantage of more frequent stabilization operations is the increase in network trafc for these messages. 5.5.2 Analytical Model for Stabilization Trafc Next, we developed an analytical model for stabilization trafc in a P-RLS network to estimate the trafc for larger networks than we could measure directly. Suppose we have a P-RLS network of n nodes with stabilization interval I and replication factor k. The average sizes of messages sent by a node to probe its neighbors, stabilize its im- mediate successor, and update its ngers are S p , S s , and S f respectively. In our implementation, over the course of three stabilization intervals, each P-RLS node sends messages of these three types. Thus, the total membership maintenance trafc T for a stable network is: T = (log(n) + k) S p + log(n) S f + S s 3I n Table 5.1 shows the average message sizes measured in our P-RLS implementation. Table 5.1: Measured Message Sizes for the P-RLS Implementation Message Type Message Size (bytes) S p 96.00 S f 164.73 S s 255.78 89 Table 5.2: Stabilization Trafc in Bytes per Second Predicted by Our Analytical Model Network Size Stabilization Interval 5 seconds 10 seconds 10 876 Bps 438 Bps 100 14.53 KBps 7.27 KBps 1000 203.07 KBps 101.53 KBps 10,000 2608.19 KBps 1304.09 KBps Based on this analytical model, we computed the membership trafc for networks ranging from 10 to 10000 nodes, where the replication factor is 2. These values are shown in Table 5.2. To validate our analytical model, we compared the calculated stabilization trafc with the trafc we measured in our 16-node cluster (shown in Fig. 5.9 of the previous section). Fig. 5.10 shows that the analytical model does a good job in predicting the stabilization trafc for a network of up to 16 P-RLI nodes. 0 200 400 600 800 1000 1200 1400 1600 1800 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Network size Membership maintenance traffic(Bps) Measured traffic (I=5 sec) Measured traffic (I=10 sec) Analytical model (I=5 sec) Analytical model (I=10 sec) Figure 5.10: Comparison of measured and predicted values for stabilization trafc 5.5.3 Simulations for Adaptive Replication In this section, we present simulation results for a larger network of P-RLI nodes. We simulate P-RLS networks ranging in size from 10 to 10,000 nodes with 500,000 mappings in the system. 90 We picked 500,000 unique mappings as a representative number for a medium size RLS system. RLS deployments to date have ranged from a few thousand to tens of millions of mappings. We used different random seeds and ran the simulation 20 times. In these simulations, we are interested in evaluating the distribution of mappings in a P-RLS network. A fairly even distribu- tion of mappings in the P-RLS network should result in better load-balancing of queries to the system if those queries are uniformly distributed on the mappings. We also evaluate the distribu- tion of queries for extremely popular mappings when we replicate mappings on the predecessors. The simulator used in this section is written in Java. It is not a complete simulation of the P-RLS system, but rather, it focuses on how keys are mapped to the P-RLI nodes and how queries for mappings are resolved in the network. 
First, we simulate the effect of increasing the number of replicas for each mapping in the P-RLS network, where k is the replication factor and there are a total of k + 1 replicas of each mapping. As we increase the replication factor, we must obviously store a proportionally increasing number of mappings in the P-RLS network. Table 5.3 shows the mean number of mappings per node for P-RLS networks ranging in size from 10 to 10,000 nodes when the replication factor k ranges from 0 to 12. We simulate a P-RLS network with a total of 500,000 unique mappings. The table shows that as the P-RLS network size increases, the average number of mappings per node decreases proportionally. As the replication factor increases, the average number of mappings per node increases proportionally. The mean numbers of mappings shown in Table 5.3 would increase proportionally with the total number of unique mappings in the P-RLS system.

Table 5.3: Mean Number of Mappings Per Node for a Given Network Size and Replication Factor
  Network Size    Replication Factor (Total Replicas)
                  0 (1)      1 (2)      4 (5)      12 (13)
  10              50,000     100,000    250,000    N/A
  100             5,000      10,000     25,000     65,000
  1000            500        1,000      2,500      6,500
  10,000          50         100        250        650

While Table 5.3 shows that the mean number of mappings per node is proportional to the replication factor and inversely proportional to the network size, the actual distribution of mappings among the nodes is not uniform. However, we observe in Fig. 5.11 that as the replication factor increases, the mappings tend to become more evenly distributed. Fig. 5.11 shows the distribution of mappings over a network of 100 P-RLI nodes. On the horizontal axis, we show the number of nodes ordered from the largest to the smallest number of mappings per node. On the vertical axis, we show the cumulative percentage of the total mappings that are stored on some percentage of the P-RLI nodes. For a replication factor of zero (i.e., a single copy of each mapping), the 20% of the nodes with the most mappings contain approximately 50% of the total mappings. By contrast, with a replication factor of 12 (or 13 total replicas), the 20% of nodes with the most mappings contain only about 30% of the total mappings. Similarly, in the case of a single replica, the 50% of nodes with the most mappings contain approximately 85% of the total mappings, while for 13 total replicas, 50% of the nodes contain only about 60% of the total mappings.

[Figure 5.11: The distribution of mappings among nodes in a P-RLS network of 100 nodes with 500,000 unique mappings. The replication factor varies from 0 to 12.]

Figure 5.12 also provides evidence that as we increase the number of replicas for each mapping, the mappings are more evenly distributed among the P-RLI nodes. The vertical axis shows the cumulative density function of the number of mappings stored per P-RLI node versus the number of mappings per node. The replication factor for P-RLI mappings ranges from 0 to 12, and the P-RLS network size is 100 nodes. The left-most line shows the case where P-RLI mappings are not replicated at all. This line shows a skewed distribution, in which most nodes store few mappings but a small percentage of nodes store thousands of mappings.
Figure 5.12 also provides evidence that as we increase the number of replicas for each mapping, the mappings are more evenly distributed among the P-RLI nodes. The figure plots the cumulative distribution function of the number of mappings stored per P-RLI node, for replication factors ranging from 0 to 12 in a P-RLS network of 100 nodes. The left-most line shows the case where P-RLI mappings are not replicated at all. This line shows a skewed distribution, in which most nodes store few mappings but a small percentage of nodes store thousands of mappings. By contrast, the line representing a replication factor of 12 is less skewed and resembles a normal distribution. The ratio between the nodes with the least and greatest number of mappings is approximately 3, with most nodes containing 40,000 to 100,000 mappings.

Figure 5.12: Cumulative distribution of mappings per node as the replication factor increases in a P-RLS network of 100 nodes with 500,000 unique mappings

Next, we show in Fig. 5.13 that the ratio between the greatest number of mappings per node and the average number per node converges as we increase the replication factor. The top line in Fig. 5.13 shows this ratio, which decreases from approximately 10 to 2 as we increase the total number of replicas per mapping from 1 to 13. The graph shows the average ratio over 20 simulation runs. The bottom line in the graph shows that the ratio between the P-RLI node with the smallest number of mappings and the average number of mappings increases very slightly from a value of 0 as the replication factor increases.

Figure 5.13: Ratios of the P-RLI nodes with the greatest and smallest number of mappings compared to the average number of mappings per node

Finally, we present simulation results for our predecessor replication scheme, which is designed to reduce query hotspots for popular mappings. Fig. 5.14 shows simulation results for a P-RLS network with 10,000 nodes. We randomly choose 100 popular mappings. For each of these popular mappings, we issue 10,000 queries, each from a randomly selected P-RLI node, and simulate the average number of queries that are resolved on the root node and its predecessors. In Fig. 5.14, node N represents the root node and node N-i is the i-th predecessor of node N in the Chord overlay network. Thus, node N-1 is the immediate predecessor of N, and node N-12 is the 12th predecessor of node N. The results show that if there is no predecessor replication, all 10,000 queries are resolved by the root node N. However, as we increase the number of replicas of popular mappings on node N's predecessors, the query load is more evenly distributed among node N and its predecessors.

Figure 5.14: The number of queries resolved on the root node N and its predecessors becomes more evenly distributed when the number of replicas per mapping increases

5.6 Related Work

There are several other replica management systems in Grids. The Storage Resource Broker [8] and GridFarm [121] projects register and discover replicas using a metadata service. The European DataGrid Project [50] has implemented a Web Service-based Replica Location Service that is based on the RLS Framework [22]. Ripeanu et al. [101] constructed a P2P overlay network of replica location services; their work focused on the use of Bloom filter compression for soft state updates for efficient distribution of location information.
In contrast to P-RLS, which inserts mappings and routes queries in a structured P2P network, Ripeanu's approach distributes compressed location information to each node participating in an unstructured P2P network. Each node can answer queries locally without forwarding requests, and thus query latencies are reduced compared to P-RLS. However, since updates have to be propagated to the whole network, the cost of creating and deleting a replica mapping in Ripeanu's unstructured scheme is higher than for P-RLS when the network scales to large sizes. 95 Data replication has also been studied extensively in the literature of distributed le sys- tems and distributed databases [15, 48, 87, 123, 133, 112]. A primary focus of much of that work is the tradeoff between the consistency and availability of replicated data when the network is partitioned. In distributed le systems, the specic problem of replica location is known as the replicated volume location problem, i.e. locating a replica of a volume in the name hierar- chy [85]. NFS [106] solves this problem using informal coordination and out-of-band communi- cation among system administrators, who manually set mount points referring to remote volume locations. Locus [89] identies volume locations by replicating a global mounting table among all sites. ASF [137] and Coda [108] employ a V olume Location Data Base (VLDB) for each local site and replicate it on the backbone servers of all sites. Ficus [85] places the location information in the mounted-on leaf, called a graft point. The graph points need location information since they must locate a volume replica in the distributed le system. The graft points may be replicated at any site where the referring volume is also replicated. 5.7 Summary We have described the design, implementation, and evaluation of a P2P Replica Location Service. The goal of our design is to provide a distributed RLI index with properties of self-organization, fault-tolerance and improved scalability. Our performance measurements on a 16-node cluster demonstrated that update and query latencies increase at a logarithmic rate with the size of the P-RLS network. We showed that the overhead of maintaining the P-RLS network is reasonable, with the amount of trafc to stabilize the network topology increasing at a rate of O(nlog(n)). 96 Furthermore, we presented simulation results for adaptive replication of P-RLS mappings for network sizes ranging from 10 to 10,000 P-RLS nodes. We demonstrated that as the replication factor of these mappings increases, the mappings are more evenly distributed among the P-RLI nodes when using the adaptive replication scheme. Also, we showed that the predecessor repli- cation scheme can more evenly distribute the queries for extremely popular mappings, thereby reducing the hotspot effect on a root node. 97 Chapter 6 Distributed RDF Repository 6.1 Introduction Metadata is the foundation for the Semantic Web, and is also critical for Grid [33, 114] and Peer-to-Peer systems [83]. RDF (Resource Description Framework) is the W3C specication for modeling metadata on the Semantic Web. It makes exible statements about resources that are uniquely identied by URIs. RDF statements are machine-processable, and statements about the same resource can be distributed on the Web and made by different users. RDF schemata are extensible and evolvable over time by using a new base URI every time the schema is revised. The possibility to distribute RDF statements provides great exibility for annotating resources. 
However, distributed RDF documents on the Web are hard to discover. Putting an RDF document on a Web site does not mean that others can nd it, much less issue structured queries against it. One approach is to crawl all possible Web pages and index all RDF documents in centralized search engines, RDF Google if you wish, but this approach makes it difcult to keep the indexed metadata up to date. For example, it currently takes Google many days to index a newly created Web page. Further, this approach has a large infrastructure footprint for the organization providing the querying service, and is a centralized approach on top of technologies (RDF, the Internet itself) that were intentionally designed for decentralized operations. 98 Also centralized RDF repositories are not well-suited for some semantic web applications in which data is not owned by any participant and each participant is responsible for support- ing the community. When the community scales up, the data and query load will be so large that no participant will be able to afford the hosting cost. One example is Shared-HiKE, a col- laborative hierarchical knowledge editor that lets users create, organize and share RDF data. In Shared-HiKE, each participant has her local hierarchical knowledge and also shares the external knowledge from other participants. Moreover, participants in the community want to quickly be notied of specic new content, that is, they have persistent queries expressing interest in certain people, products, or topics that are constantly serviced. In centralized RDF repositories without subscription support, this could only be accomplished by constantly issuing the queries of interest every few minutes, which will generate much unnecessary query load on the server. Also, it is difcult for centralized subscrip- tion schemes to scale up to a large number of subscribers. Thus, we argue that a distributed RDF infrastructure that can scale to Internet size and support RDF metadata subscription is useful or even necessary for many interesting semantic web applications, such as Shared-HiKE. One choice for non-centralized RDF repositories is Edutella [83] that provides an RDF-based metadata infrastructure for P2P applications. It uses a Gnutella-like [100] unstructured P2P net- work that has no centralized index or predictable location for RDF triples. Instead, RDF queries are ooded to the whole network and each node processes every query. Measurement studies [107] [111] show that Gnutella-like unstructured P2P networks do not scale well to a large num- ber of nodes. This is because their ooding mechanism generates a large amount of unnecessary trafc and processing overhead on each node, unless a hop-count limit is set for queries but then the queries cannot guarantee to nd results, even if these results exist in the network. An 99 Edutella successor [82] provides better scalability by introducing super-peers and schema-based routing; however, it requires up-front denition of schemas and designation of super peers. This chapter presents a scalable P2P RDF repository named RDFPeers that allows each node to store, query and subscribe to RDF statements. The nodes in RDFPeers self-organize into a cooperative structured P2P network based on randomly chosen node identiers. When an RDF triple is inserted into the network, it is stored at three places by applying a globally-known hash function to its subject, predicate, and object values. 
Both exact-match and range queries are efficiently routed to those nodes where the matching triples are stored. Subscriptions for RDF statements are also routed to and stored on those nodes. Therefore, the subscribers will be notified when matching triples are inserted into the network. We implemented a prototype of RDFPeers in Java and evaluated its performance on a 16-node cluster. We also measured the load balancing of real-world RDF data from the Open Directory Project by inserting it into a simulated RDFPeers network, and found that the load differs by less than an order of magnitude between the most and least loaded nodes when a successor probing scheme is used.

6.2 RDFPeers Architecture

Our distributed RDF repository consists of many individual nodes called RDFPeers that self-organize into a multi-attribute addressable network (MAAN) as discussed in Chapter 2. MAAN extends Chord [118] to efficiently answer multi-attribute and range queries. However, MAAN only supported predetermined attribute schemata with a fixed number of attributes. RDFPeers exploits MAAN as the underlying network layer and extends it with RDF-specific storage, retrieval, subscription and load balancing techniques. Fig. 6.1 shows the architecture of RDFPeers. Each node in RDFPeers consists of six components: the MAAN network layer, the RDF triple loader, the RDF subscriber API, the local RDF triple and subscription storage, the native query resolver, and the RDQL-to-native-query translator.

Figure 6.1: The Architecture of RDFPeers

The underlying MAAN protocol contains four classes of messages for (a) topology maintenance, (b) storage, (c) query, and (d) subscription. The topology maintenance messages are used to keep the neighbor pointers and routing tables correct; they include JOIN, KEEPALIVE and other Chord stabilization messages. The STORE message inserts triples into the network and the REMOVE message deletes triples from the network. The QUERY message visits the nodes where the triples in question are known to be stored, and returns the matched triples to the requesting node. The RDF triple loader reads an RDF document, parses it into RDF triples, and uses MAAN's STORE message to store the triples in the RDFPeers network. In RDFPeers, when a node receives a STORE message, it stores the triples in its local storage component, such as a file or a relational database. The native query resolver parses native RDFPeers queries and uses MAAN's QUERY message to resolve them. There can be a multitude of higher-level query modules on top of the native query resolver that map higher-level user queries into RDFPeers' native queries, such as an RDQL-to-native-query translator. Applications built on top of RDFPeers can also subscribe to RDF triples by calling the RDF subscriber API with a subscription handler. The RDFPeers node then sends a SUBSCRIBE message to the nodes that are responsible for storing matching triples.
When RDF triples that match the subscription are inserted into the network, the subscribing node will receive a NOTIFY message and notify the application to handle the triples with the subscription handler.

6.3 Storing RDF Triples

RDF documents are composed of a set of RDF triples. Each triple is in the form of <subject, predicate, object>. The subject is the resource about which the statement was made. The predicate is a resource representing the specific property in the statement. The object is the property value of the predicate in the statement. The object is either a resource or a literal; a resource is identified by a URI; literals are either plain or typed and have the lexical form of a Unicode string. Plain literals have a lexical form and optionally a language tag, while typed literals have a lexical form and a datatype URI. The following triples show the three different types of objects: resource, plain literal, and typed literal, respectively.

@prefix info: <http://www.isi.edu/2003/11/info#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<info:rdfpeers> <dc:creator> <info:mincai> .
<info:mincai> <foaf:name> "Min Cai" .
<info:mincai> <foaf:age> "28"^^xsd:integer .

In order to support efficient queries on distributed RDF triples, we exploit the overlay structure of MAAN to build a distributed index for these triples. In the STORE message, one of the three attribute values is designated as the destination of the routing, and we store each triple three times, once each based on its subject, predicate, and object. Each triple is stored at the successor node of the hash key of the value in the routing attribute-value pair. Since the values of the subject and predicate attributes must be URIs, which are strings, we apply the SHA1 hash function to map the subject value and predicate value to the m-bit identifier space in MAAN. However, the values of the object attribute can be URIs, plain literals or typed literals. Both URIs and plain literals are strings, and we apply SHA1 hashing to them. A typed literal can have either a string type or a numeric type, such as an enumeration type or a positive integer, respectively. As discussed above, we apply SHA1 hashing to string-typed literals and locality preserving hashing to numeric literals. For example, to store the first triple above by subject, RDFPeers would send the following message, in which the first attribute-value pair (subject, info:rdfpeers) is the routing key pair and key is the SHA1 hash value of the subject value.

STORE {key, {("subject", <info:rdfpeers>),
             ("predicate", <dc:creator>),
             ("object", <info:mincai>)}}
where key = SHA1Hash("<info:rdfpeers>")

This triple will be stored at the node that is the successor node of key. Fig. 6.2 shows how the three triples above are stored in an example RDFPeers network. It also shows the finger tables of example nodes N6 and N14 for illustration.

Figure 6.2: An example of storing 3 RDF triples into an RDFPeers network of 8 nodes in a 4-bit identifier space

Most semantic web applications prefer high availability to strong consistency in the face of network partitions. RDFPeers provides relaxed consistency by leveraging soft state updates [25] [18]. Each triple has an expiration time, and the node that inserts the triple needs to renew the triple before it expires. If the nodes that store the triple do not receive any renewals, the triple will be removed from their local storage. With soft state updates, RDFPeers provides best-effort consistency for triples indexed at three places.
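The following Java sketch illustrates this triple-indexing rule (the Maan and Node interfaces here are hypothetical stand-ins for the MAAN layer, not the actual RDFPeers classes): each triple is routed and stored three times, keyed by the SHA-1 hash of its subject, predicate, and object; a numeric object literal would instead be keyed with MAAN's locality preserving hash.

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class TripleIndexer {
    interface Maan { Node findSuccessor(BigInteger key); }   // assumed routing entry point
    interface Node { void storeLocally(String s, String p, String o); }

    // Map a URI or string literal onto the 160-bit identifier space.
    static BigInteger sha1Key(String value) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-1")
                .digest(value.getBytes(StandardCharsets.UTF_8));
        return new BigInteger(1, digest);
    }

    // One STORE per position, each routed to the successor of the corresponding key.
    static void store(Maan maan, String subject, String predicate, String object) throws Exception {
        maan.findSuccessor(sha1Key(subject)).storeLocally(subject, predicate, object);
        maan.findSuccessor(sha1Key(predicate)).storeLocally(subject, predicate, object);
        maan.findSuccessor(sha1Key(object)).storeLocally(subject, predicate, object);
    }

    public static void main(String[] args) throws Exception {
        // Print the subject key for the first example triple above.
        System.out.println(sha1Key("<info:rdfpeers>").toString(16));
    }
}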
To apply locality preserving hashing to numeric literals of RDF triples, we need to know their minimal and maximal values. We can leverage the datatype information of the predicates provided by the RDF Schema definition. For example, the following schema defines that the object values of triples whose predicate is <foaf:age> are instances of <info:AgeType>.

<foaf:age> <rdfs:range> <info:AgeType> .

<xsd:simpleType name="info:AgeType">
  <xsd:restriction base="xsd:integer">
    <xsd:minInclusive value="0"/>
    <xsd:maxInclusive value="250"/>
  </xsd:restriction>
</xsd:simpleType>

Thus the numeric literals have the minimal value 0 and maximal value 250 as defined by the XML schema of <info:AgeType>. However, this approach assumes that the datatypes of predicates are fixed and will not change in the future. We solve this problem by only allowing a schema to evolve to a new version with a different namespace rather than changing the old version; applications are responsible for updating the RDF statements with new schemas, which is always necessary anyway because of our soft state scheme.

Since nodes might fail and network connections might break, triples are replicated on the neighbors of their successors by setting the replica factor parameter in MAAN. Whenever a node receives a triple storing request, it not only stores the triple locally but also replicates it to as many of its immediate successors as that parameter indicates. If any node fails or its connection breaks, its immediate successor and predecessor will detect it by checking the KEEPALIVE messages. If the node does not respond after a time-out period, other nodes will repair the ring structure using the Chord stabilization algorithm. After stabilization, the immediate successor node of the failed node will restore its replicas to its new predecessor if necessary.

6.4 Native Queries in RDFPeers

Based on the above triple-storing scheme, we define a set of native queries that can be efficiently resolved via MAAN's multi-attribute range queries. These native queries include atomic triple queries, disjunctive and range queries, and conjunctive multi-predicate queries.

6.4.1 Atomic Triple Queries

An atomic query is a triple pattern in which the subject, predicate, or object can each either be a variable or an exact value. The eight resulting possible queries are shown in Table 6.1. Q1 is the most general and most expensive query, as it matches all triples. Since there is no restriction whatsoever on this triple pattern, we have to propagate this query to all nodes, which takes O(n) routing hops for a network with n nodes. We can use MAAN's routing algorithm to resolve queries Q2 through Q8 since we store each triple three times based on its subject, predicate, and object hash values.
Table 6.1: The Eight Possible Atomic Triple Queries for Exact Matches
No.   Query Pattern       Routing Hops   Query Semantics
Q1    (?s, ?p, ?o)        O(n)           find all possible triples
Q2    (?s, ?p, o_i)       log n          given object o_i of any predicate, find the subjects and predicates of matching triples
Q3    (?s, p_i, ?o)       log n          given predicate p_i, find the subjects and objects of the triples having this predicate
Q4    (?s, p_i, o_i)      log n          given object o_i of predicate p_i, find the subjects of matching triples
Q5    (s_i, ?p, ?o)       log n          given subject s_i, find all predicates and objects of the resource identified by s_i
Q6    (s_i, ?p, o_i)      log n          given subject s_i, find its predicates that have object o_i
Q7    (s_i, p_i, ?o)      log n          given subject s_i, find its objects of predicate p_i
Q8    (s_i, p_i, o_i)     log n          return this triple if it exists, otherwise return nothing

In these seven query patterns, there is always at least one value that is a constant, and we resolve the query by routing it to the node responsible for storing that constant; that node then matches its local triples against the pattern and returns them to the requesting node. For example, in Fig. 6.2, if node N6 issues the native query (<info:mincai>, <foaf:name>, ?name), we hash <info:mincai> and get the hash value 1. N6 then routes the query to the corresponding node N1 (via N14). N1 filters its triples locally using this pattern and sends the matched triple <info:mincai> <foaf:name> "Min Cai" back to N6 (via N5). Note that we assume the value is not overly popular, in which case we would have to use O(n) messages as discussed in Sec. 6.7.4.

6.4.2 Disjunctive and Range Queries

RDFPeers' native queries support constraints on variables in the triple patterns. Q9 extends the above atomic triple queries with a constraint list that limits the domain of variables.

Q9 ::= TriplePattern 'AND' ConstraintList
TriplePattern ::= Q1|Q2|Q3|Q4|Q5|Q6|Q7
ConstraintList ::= OrExpression ('&&' OrExpression)*
OrExpression ::= Expression ('||' Expression)*
Expression ::= Variable (NumericExpression | StringExpression)+
NumericExpression ::= ('>'|'<'|'='|'!='|'<='|'>=') NumericLiteral
StringExpression ::= ('='|'!=') Literal
Literal ::= PlainLiteral|URI|NumericLiteral

Variables can be either string-valued or numeric. Constraints can limit the domain of string values by enumerating a set of either allowed or forbidden constants. Numeric variables can additionally be limited to a set of disjunctive ranges.

(a) (?s, dc:creator, ?c) AND ?c="Tom" || ?c="John"
(b) (?s, foaf:age, ?age) AND ?age > 10 && ?age < 20

As discussed in Chapter 2, MAAN can efficiently resolve range queries by using locality preserving hashing. Note that this is the one case where RDFPeers would benefit from up-front RDF Schema information: if, say, an integer-valued object of some triples in reality only ever has values 1 through 10, RDFPeers can use a hash function that yields better load balancing for these triples. In addition to specifying a single range, Q9 can also specify a set of disjunctive ranges for attribute values. For example, a user can submit a range query for variable ?x with ?x in the union of ranges [l_1, u_1] ∪ ... ∪ [l_d, u_d]. Obviously, this kind of disjunctive range query could simply be resolved by issuing one query for each contiguous range and then computing the union of the results. For a query with d disjunctive ranges, this takes d × O(log n + n × s) hops, where s is the aggregate selectivity of the d ranges. So the number of hops in the worst case increases linearly with d and is not bounded by n.
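The locality preserving hash that makes these range queries possible is easy to state. The following is a minimal sketch in its simple linear form, assuming a numeric attribute with known bounds [vmin, vmax] (for example foaf:age in [0, 250] from the schema in Sec. 6.3) and an m-bit identifier space; MAAN's uniform variant additionally uses the value distribution to even out load, which is omitted here.

public class LocalityPreservingHash {
    // H(v) maps [vmin, vmax] linearly onto the m-bit identifier space, so that
    // v1 <= v2 implies H(v1) <= H(v2) and a value range maps to a contiguous arc.
    static long hash(double v, double vmin, double vmax, int m) {
        double normalized = (v - vmin) / (vmax - vmin);
        long maxId = (1L << m) - 1;
        return Math.round(normalized * maxId);
    }

    public static void main(String[] args) {
        int m = 16;                                  // small identifier space for illustration
        System.out.println(hash(10, 0, 250, m));     // lower bound of the range ?age in [10, 20]
        System.out.println(hash(20, 0, 250, m));     // upper bound: all matches lie between the two
    }
}

A single range [l, u] is thus resolved by routing to successor(H(l)) and walking successors until successor(H(u)). Resolving d disjunctive ranges naively issues d such traversals independently; the range ordering optimization described next chains them.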
We optimize this by using a range ordering algorithm that sorts these disjunctive query ranges in ascending order. Given a list of disjunctive ranges in ascending order, [l_i, u_i], 1 ≤ i ≤ d, where l_i ≤ l_j and u_i ≤ u_j iff i ≤ j, the query request is first routed to node n_l1, the successor node of H(l_1), the key corresponding to the lower bound of the first range. Node n_l1 then sequentially forwards the query to the successor node of the upper bound H(u_1), if it is not itself the successor node of H(u_1). Then node n_u1 uses successor routing to forward the query to node n_l2, the successor node corresponding to the lower bound of the next range [l_2, u_2], which in turn forwards the query to the successor node of H(u_2). This process is repeated until the query reaches the successor node of H(u_d). This optimized algorithm exploits the locality of numeric MAAN data on the Chord ring and the ascending order of the ranges; it reduces the number of routing hops, especially for cases where d is large, and bounds the routing hops to n. Disjunctive exact-match queries such as ?c in {"Tom", "John"} present a special case of the above disjunctive range queries, where both the lower bound and upper bound of each range are equal to the exact-match value, and we use the same algorithm to resolve them.

6.4.3 Conjunctive Multi-Predicate Queries

In addition to atomic triple queries and disjunctive range queries, RDFPeers handles conjunctive multi-predicate queries that describe a non-leaf node in the RDF graph by specifying a list of edges for this node. They are expressed as a conjunction of atomic triple queries or disjunctive range queries for the same subject variable. Q10 consists of a conjunction of sub-queries in which all subject variables must be the same.

Q10 ::= TriplePatterns 'AND' ConstraintList
TriplePatterns ::= (Q3|Q4|Q9)+

In Q10, we restrict the sub-query Q9 to be the Q3-style triple pattern with constraints on the object variable. Thus Q10 describes a subject variable with a list of restricting (predicate, object) or (predicate, object-range) pairs.

(?x, <rdf:type>, <foaf:Person>)
(?x, <foaf:name>, "John")
(?x, <foaf:age>, ?age) AND ?age > 35

To efficiently resolve these conjunctive multi-predicate queries, we use a recursive query resolution algorithm that searches candidate subjects on each predicate recursively and intersects the candidate subjects inside the network, before returning the query results to the query originator. The query request takes the parameters q, R, C, and I, where q is the currently active sub-query, R is a list of remaining sub-queries, C is a set of candidate subjects matching the current active sub-query, and I is a set of intersected subjects matching all resolved sub-queries. Initially, q is the first sub-query in the multi-predicate query, R contains all sub-queries except q, C is empty and I is the whole set. Suppose the sub-query q for predicate p_i is v_li ≤ o_i ≤ v_ui, where v_li and v_ui are the lower bound and upper bound of the query range for the object variable o_i, respectively. When a node issues a query, it first routes the request to node n_li = successor(H(v_li)). Node n_li receives the request, searches its local triples corresponding to predicate p_i, appends the subjects matching sub-query q to C, and forwards the request to its immediate successor n_si unless it is already successor(H(v_ui)). Node n_si repeats this process until the query request reaches node n_ui = successor(H(v_ui)).
When node n_ui receives the request, it also searches locally for the subjects matching sub-query q and appends them to C. It then intersects set I with set C, and pops the first sub-query from R into q. If R or I is empty, it sends the query response back with the subjects in I as the result; otherwise, it resolves sub-query q. This process is repeated until no sub-queries remain or I is empty. This recursive algorithm takes O(Σ_{i=1..k} (log n + n × s_i)) routing hops in the worst case, where k is the number of sub-queries and s_i is the selectivity of the sub-query on predicate p_i. However, it intersects the query results on different predicates inside the network and will terminate the query process before resolving the query on all predicates if there are no matches left, i.e. if I is empty. Thus, we can further reduce the average number of expected routing hops by sorting the sub-queries in ascending order of selectivity, presuming the selectivity can be estimated in advance. For example, in the above three-predicate query, the sub-query on rdf:type might match many subjects, while foaf:age matches far fewer and foaf:name matches only a handful. After sorting the sub-queries, we resolve foaf:name first, then foaf:age, and finally rdf:type.

6.5 Resolving RDQL Queries

RDQL [77] is a query language for RDF proposed by the developers of the popular Jena Java RDF toolkit [75]. RDQL operates at the RDF triple level, without taking RDF Schema information into account (as RQL [60] does) and without providing inferencing capabilities. As such, it is the type of low-level RDF query language that we want RDFPeers to support well. It is our intuition that it is possible to translate all RDQL queries into combinations of the native RDFPeers queries above; however, we have not yet written such a translator, and it may be inefficient for some queries, especially for joins. This section informally describes how the example RDQL queries from the Jena tutorial (http://www.hpl.hp.com/semweb/doc/tutorial/RDQL) would be resolved.

(1) SELECT ?x WHERE (?x, <vcard:FN>, "John Smith")
(2) SELECT ?x, ?fname WHERE (?x, <vcard:FN>, ?fname)
(3) SELECT ?givenName WHERE (?y, <vcard:Family>, "Smith"), (?y, <vcard:Given>, ?givenName)
(4) SELECT ?resource WHERE (?resource, <inf:age>, ?age) AND ?age>=24
(5) SELECT ?resource, ?givenName WHERE (?resource, <vcard:N>, ?z), (?z, <vcard:Given>, ?givenName)
(6) SELECT ?resource, ?familyName WHERE (?resource, <inf:age>, ?age), (?resource, <vcard:N>, ?y), (?y, <vcard:Family>, ?familyName) AND ?age>=24

Query (1) translates directly into Q4, so it can be resolved in log n routing hops in a network of n nodes. Similarly, query (2) translates directly into Q3, taking log n hops. To resolve query (3), we first issue a Q4-style query and then use its query result as a constraint to issue a Q9-style disjunctive query with Q3-style triple patterns. Since all predicate values in the two triple patterns are known, these two native queries can be resolved in 2 log n hops. Query (4) is a typical Q9-style range query with a constraint on the object value. Since its predicate value is known, we can route the query to the node that stores the triples with predicate inf:age in log n hops. Our native queries do not include join operations, so we decompose join queries into multiple native queries. Query (5) can be resolved via two Q3-style queries, then joining the first triple set's objects with the second triple set's subjects, taking 2 log n routing hops.
(However, note that these two Q3-style queries might generate large-size messages if the predicates vcard:N or vcard:Given are popular.). Query (6) can be resolved by rst issuing the same query as for the previous RDQL example for the rst triple pattern. Then we use the query result as a constraint for variable ?resource and resolve the second triple pattern as a Q9-style disjunctive range query. Finally, we use the second query result as a constraint for variable ?y and again resolve the third triple as a Q9-style query, which in the aggregate takes 3 log n hops. 6.6 RDF Subscription and Notication 6.6.1 Subscribing to Atomic Queries RDFPeers also already implements subscriptions for atomic queries in which at least one of the triple's values is restricted to a constant. Our basic scheme for subscriptions to these queries is that the subscription request is routed to the same node that is responsible for storing the triple 111 with that value in that position. Thus, the subscription request for (?person, ?predicate, Min Cai) would be routed to node N 10 in the example of Fig. 6.2. If there are multiple constants in the triple pattern, absent a-priori knowledge of the frequency distribution, we heuristically bias to rst use the subject, then the object, then the predicate to route subscriptions, based on our experience for which positions are most likely to have overly popular values (see Sec. 6.7.4). Thus, the subscription request for (?person, <foaf:age >, 28) would be routed to node N 2 in our running example. Each node keeps a local list of subscriptions, which consist of (1) a triple pattern, (2) a re- quested notication frequency, (3) the requested expiration date of the subscription, and (4) the node identier of the subscriber. Each node internally maintains hash-based access into this sub- scription list where the key is a position-constant pair. When a node stores or removes a triple, it will also locally evaluate the matching subscription queries and (immediately or after collect- ing several such matches) notify the subscribing node of the matched triples. How often such notication messages are sent is dictated by the larger duration of (a) the requested notication frequency, and (b) a minimum interval between updates that the subscription-hosting node may impose. Given that we want both data and subscriptions to survive the sudden death of any node, we replicate the subscription list to the next replication factor nodes in the identier space, just as we do for the triples themselves. The repair protocol for subscription data is identical to the repair protocol for triple data described in Sec. 6.3. Conversely, it is possible that subscriptions persist after the subscribing node has quit the network. We deal with this issue via a maximum subscription duration parameter. Each node will periodically purge its subscriptions list from 112 older entries, unless a subscribing node re-issued the subscription request more recently, in which case it will reset the age of the subscription to the latest request date. 6.6.2 Subscribing to Disjunctive and Range Queries In disjunctive and range queries, the object is restricted to be within one or more disjunctive enumeration domains or numeric ranges. The basic subscription scheme for disjunctive and range queries is similar to the one for constant queries, but the subscription request is stored by all nodes that fall within the hashed identiers of the minimum and maximum range value. 
The routing of the subscription request is identical to that for range queries described in Sec. 6.4.2, taking O(log n + n s) routing hops where s is the selectivity of the range query. For performance reasons, large-selectivity range query subscriptions are undesirable because a large number of matches would be sent. In practice, a query for common integer values independent of a target predicate such as (a) likely makes little sense. However, subscriptions for a narrow date range as in (b) or historical date ranges (the 12th century) independent of predicate seems to be of practical value. (a) ( ?, ?, ?age) AND ?age > 0 (b) (?, ?, 2004-04-13T00:00:00Z<=?<2004-04-16T00:00:00Z) Conceivably, the network could reject range subscription requests that span more than max- imum range subscription selectivity nodes (say, 20). For example, this could be done by having the node to which the minimum value hashes to - and which thus is the rst node in the series of nodes that would add it to its subscription list - compute the estimated number of nodes that would be involved (selectivity in percentage of the identier space times estimated number of participating nodes, the latter estimated from the size of the node's nger table), and reject it if it exceeds that threshold. This technique would not prevent a node from inserting a range request 113 that did not exceed the threshold at insertion time but does exceed it later because of network growth, but it would prevent the subscription from being re-inserted as-is into the larger network once the original subscription expires. At present, we have not implemented such a rejection mechanism for overly broad range subscriptions. 6.6.3 Subscribing to Conjunctive Multi-Predicate Queries These conjunctive multi-predicate queries look for subjects that match multiple constant (or con- stant range) predicate-object pairs. We have not implemented these subscriptions at time of writ- ing. A possible scheme could initially route the subscription to the node corresponding to the rst clause, which would remove and store just this rst clause, and would also store the hash value of the next clause (not the identier of the node that it currently maps to, given that the clause will move as nodes appear and disappear). Another node will then store the next clause, and route the remaining clauses towards the hash value of the next closest clause, and so on. The node storing the last clause will store the identier of the node that issued the original subscription request. Then, whenever the rst clause matches a new triple, the matching triples will be forwarded to the second node in the chain. The second and subsequent nodes only fur- ther forward those triples matching their local ltering criterion. There is a complication to this scheme if range queries are involved. In these cases, in the subscription registration phase one would always propagate the subscription request to the next nearest node involved. 6.6.4 Unsupported Subscription Types We do not support subscriptions to the following types of queries. 1. (?, ?, ?) 2. () AND (() OR ()) 3a. (?x, <a>, ?) AND (?x, <b>, ?) 3b. (?, ?, ?x) AND (?x, ?, ?) 114 The rst type is inherently not scalable to large networks, and we do not intend to ever support it. The second type consists of a combination of disjunctive and conjunctive sub-queries. 
It is our intuition that this type of subscription could be supported by chaining the techniques explained above within the network, with the leaves of the query parse tree as the starting points. The third type consists of joins. It is our intuition that those joins could be supported for which (a) one of the conjunctive clauses has a constant value and for which (b) this clause by itself matches only a moderate number of triples (example 3a). With the design of RDFPeers as described, it may not be possible to efciently resolve joins for which each clause by itself leads to an overwhelming number of triples while the join between them leads to few (example 3b). However, the following slight variation may allow RDFPeers to handle those as well. Instead of prexing URIs and values with subject:, predicate:, and object: before applying SHA-1 hashing, as we do now for a slight load balancing gain, we could instead not add that prex before hashing. In that case, the same URI hashes to the same node regardless of position. At a high cost of O(n) routing hops for rst broadcasting the subscription to all nodes, each node can then locally search for a match when it stores a new triple and notify the subscriber. 6.6.5 Extension To Support Highly Skewed Subscription Patterns In a real-world P2P application, e.g. of scientists subscribing to each others' Weblog-like com- munications, it is likely that the vast majority of scientists will have few if any subscribers while a handful will attract nearly everybody. In the latter case, analogous to IP multicasting for In- ternet video streaming, we want to avoid having to originate a number of messages proportional to the number of subscribers from the subscription-handling node, but rather want to construct 115 something more akin to a real-life phone tree. We propose the following possible extension: If a node ends up with multiple identical subscription queries by different subscribers, such as (<mailto:famous@ivy-league.edu>, ?, ?), it will internally combine them into a single entry in its subscription list, with multiple addresses to be notied. The node will then designate the node half-way across from it in identier space as a replicat- ing node for the subscriptions if the number of subscribers exceeds a certain threshold, then one a quarter-way across if an-other threshold is exceeded, and so on, adding up to log n repeater nodes analogous in structure to its nger table. In this case, it will send the repeater nodes those subscriber identiers that fall into their responsibility. The repeater nodes can then themselves set up repeater nodes, and so on. In the aggregate, this will take O(n) notication messages if all n nodes subscribe (the natural lower bound), and additionally leads to the busiest nodes having to send no more than O(log n) notication messages rather than the O(n) that need to be sent from the subscription-handling node in the naive approach. 6.7 Implementation and Evaluation We implemented a prototype of RDFPeers in Java that extends our previous MAAN implementa- tion. 
RDFPeers is implemented as a Java library and exposes the following API to applications:

interface RDFPeersJavaApi {
    public void store(Triple t);
    public void remove(Triple t);
    public Iterator query(TriplePattern p);
    public void subscribe(TriplePattern p,
                          long durationOfSubscriptionInMs,
                          long minimumTimeBetweenNotificationsInMs,
                          SubscriptionHandler h);
}

interface SubscriptionHandler {
    public void notify(Iterator added, Iterator deleted);
}

The first two methods store and remove RDF triples from the P2P network. The query method lets you retrieve triples from the network by specifying a triple pattern that can restrict values to be constants or numeric ranges. The subscribe method lets you watch for RDF content changes by passing a triple pattern to watch for, how long you would like the subscription to last (in milliseconds), how frequently you want to be notified, and a call-back object; you will then be called back periodically with lists of added and deleted triples that matched. All four methods throw a variety of exceptions not further described here, such as ones for broken connections and response time-outs.

6.7.1 Routing Hops to Resolve Native Queries

The number of routing hops taken to resolve a query is the dominant performance metric for P2P systems. Fig. 6.3 shows our simulation results for atomic triple patterns from 1 node to 8192 nodes on a logarithmic scale, which match our theoretical analysis.

Figure 6.3: The number of routing hops to resolve atomic triple patterns Q2 through Q8.

We also compared the two disjunctive range query resolution algorithms: the simple algorithm vs. the range ordering algorithm. Fig. 6.4 shows the simulation results for up to 1000 disjunctive exact-match values (s_i = 0%) in a network with 1000 nodes. Fig. 6.5 shows the results for up to 1000 disjunctive ranges with 0.1% selectivity each in the same network. From these two experiments, we can see that the range ordering algorithm takes fewer routing hops to resolve a range query than the simple algorithm, and that its routing hops are indeed bounded by n.

Figure 6.4: The number of routing hops to resolve disjunctive exact-match queries in a network with 1000 nodes.

Figure 6.5: The number of routing hops to resolve disjunctive range queries (0.1% selectivity) in a network with 1000 nodes.

6.7.2 Throughput of Triple Storing and Querying

In this section, we present throughput measurements for triple storing and query operations in an RDFPeers network deployed on a 16-node cluster. The nodes in the cluster are all dual Pentium III 547 MHz workstations with 1.5 Gigabytes of memory, connected with a 1-Gigabit switch. We first measured the aggregated throughput of triple storing in an RDFPeers network of 12 nodes. We increase the number of clients that concurrently store triples into the network from 1 to 25. Fig. 6.6 shows that the total number of triples stored per second across all clients increases sub-linearly with the number of clients. When there is only one client, it stores 10.74 triples per second.
However, 25 clients can concurrently store 94.00 triples per second. When there are more than 25 clients, the throughput does not increase significantly because of the computation and network limitations of our fixed number of nodes.

Figure 6.6: Aggregated throughput of triple storing increases with the number of concurrent clients in a 12-node network

We then measured the aggregated query throughput for the same RDFPeers network, preloaded with 10,000 and 100,000 triples respectively at the beginning of the test. Each client performed 200 queries on one RDFPeers node simultaneously, and the total query rate was calculated. Fig. 6.7 shows that the query rates of the two configurations both increase sub-linearly with the number of clients. When there is only one client, the query rates are 11.08 and 11.47 queries per second for 10,000 and 100,000 preloaded triples, respectively. For 150 clients, the query rates are 214.38 and 233.20 queries per second for 10,000 and 100,000 triples, respectively. Similarly to the storing operations, the query throughput also stops increasing significantly when there are more than 120 clients. These results also show that the query rate only drops slightly when the number of preloaded triples increases from 10,000 to 100,000.

Figure 6.7: Aggregated query throughput increases with the number of concurrent clients in a 12-node network with 10,000 and 100,000 preloaded triples, respectively

These throughput results for storing and query operations are still preliminary and not sufficient for applications that require high throughput for storing and querying triples. We will further tune the performance of our RDFPeers implementation, for example by using asynchronous sockets, customized message marshaling and unmarshaling, and batched triple insertion.

6.7.3 Message Traffic of Subscription and Notification

We performed two experiments, running RDFPeers on the same cluster with up to ten nodes per machine. In the first experiment, we set up a network of n nodes, then inserted 1,024 subscription requests into the network (1,024/n subscriptions per node), followed by inserting 16,348 triples into the network (each node inserts 16,348/n triples). Each triple matches 8 subscriptions. Fig. 6.8 shows that the number of subscription messages needed grows logarithmically with the size of the network, 1,024 × log(n), while the number of notification messages needed approaches a constant, 16,348 triples × 8 subscriptions each. The message traffic is less than that constant for small networks because some subscriptions can be resolved within a single node. The latter is bounded by that constant assuming that the subscription-handling node can store the network address of the subscriber and open a direct connection to notify it, as our implementation does; otherwise, if Chord successor routing is used, the latter number would grow logarithmically with network size as well. Finally, as expected, the cost of inserting triples grows logarithmically with network size, 16,348 triples × 3 (the number of times each triple is indexed) × log n.
Figure 6.8: For a constant number of triple subscriptions and insertions, the cost of our subscription scheme in messages grows no more than logarithmically with network size (128 topics, 1024 subscriptions and 16384 triples)

In the second experiment, we kept the number of nodes in the network constant, but varied the percentage of topics that each node subscribes to. As expected, Fig. 6.9 shows that the number of messages needed grows linearly with the subscription rate, both for the subscription traffic and for the notification traffic.

Figure 6.9: For a constant network size and load, registration and notification traffic grows linearly with the subscription rate (128 topics, 64 nodes, and 8192 triples)

6.7.4 Dealing with Overly Popular URIs and Literals

Even today's cheapest PCs have a surprising storage capacity: each can store well over ten million RDF triples by dedicating 10 Gigabytes of its typical 80-120 GB disk. Nevertheless, some RDF triples, such as those with the predicate rdf:type, may occur so frequently that it becomes impossible for any single node in the network to store all of them. That is, in practice, triples may not hash around the Chord identifier circle uniformly, due to the non-uniform frequency count distribution of URIs and literals. Fig. 6.10 shows the frequency count distribution of the URIs and literals in the RDF dump of the Kids and Teens catalog of the Open Directory Project (http://rdf.dmoz.org). There are two RDF files for this catalog: kt-structure.rdf.u8.gz and kt-content.rdf.u8.gz. The former describes the tree structure of this catalog and contains 19,550 triples. The latter describes all the sites in this catalog and contains 123,222 triples. Fig. 6.10 shows that only 10 to 20 URIs and literals (less than 0.1%) occur more than a thousand times. Table 6.2 lists the URIs and literals that occur more than 1,000 times in kt-structure.rdf.u8.gz.

Figure 6.10: The frequency count distribution of URIs and literals in the ODP Kids and Teens catalog.

Table 6.2: The URIs and Literals Occurring More Than 1,000 Times in the RDF File kt-structure.rdf.u8.gz
Frequency   URI or literal                      Type
3158        rdf:type                            predicate
3158        dc:Title                            object
2612        http://dmoz.org/rdf/Topic           object
2612        http://dmoz.org/rdf/catid           predicate
2574        http://dmoz.org/rdf/lastUpdate      predicate
2540        http://dmoz.org/rdf/narrow          predicate
1782        http://dmoz.org/rdf/altlang         predicate
1717        dc:Description                      object

Since each URI used as a predicate value is stored at only one node, that node has global knowledge about the frequency count of this predicate value. We deal with overly popular predicate values by simply no longer indexing triples on them. Each node defines a Popular_Threshold parameter based on its local capacity and willingness (subject to some minimum community expectation). Each node keeps counting the frequency of each predicate value. If a predicate value occurs more than Popular_Threshold times, the node will refuse to store it and internally make a note of that.
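A minimal sketch of the per-node bookkeeping this implies (the class and method names here are ours for illustration, not the RDFPeers implementation):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class PopularityFilter {
    private final long popularThreshold;                    // e.g. 1,000
    private final Map<String, Long> frequency = new HashMap<>();
    private final Set<String> refused = new HashSet<>();    // values no longer indexed here

    PopularityFilter(long popularThreshold) { this.popularThreshold = popularThreshold; }

    // Returns true if the triple should be stored under this routing value,
    // false if the value has become overly popular and is refused.
    boolean admit(String routingValue) {
        if (refused.contains(routingValue)) return false;
        long count = frequency.merge(routingValue, 1L, Long::sum);
        if (count > popularThreshold) {
            refused.add(routingValue);                       // remember the refusal
            return false;
        }
        return true;
    }

    boolean isRefused(String routingValue) { return refused.contains(routingValue); }
}

A query that hits a refused value can then be answered with a refusal message, as described next.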
If the node receives a search request with the overly popular value as the predicate, it sends a refusal message back to the requesting node, which must then find an alternative way of resolving the query by navigating to the target triples through either the subject or the object values. This approach adds O(log n) hops to that node's total query cost. We limit subject and object values in the same way. We are aware that this still makes the node with popular URIs a hotspot for query messages; this can be addressed by having querying nodes cache which queries were refused in the past. In essence, this means that you cannot ask, for example, which resources in the world are instances of some class. However, such queries are so general and would return so many triples that we suspect they would rarely be of use in practice anyway (in analogy to the English language, where the words "a" and "the" occur frequently but provide little value as search terms). For the above query, you could alternatively gather the class URIs whose instances you want to look for, then traverse to the instances via that set of URIs by issuing a Q4-style query.

Figure 6.11 shows the minimum, average, and maximum number of triples per node with Popular_Threshold values from 500 to 32,000. In this experiment, we store both kt-structure.rdf.u8.gz and kt-content.rdf.u8.gz (142,772 triples in total) into a network of 100 physical nodes (with the standard Chord log(100) ≈ 6 virtual nodes per physical node, trading off load balancing against routing hops). When Popular_Threshold = 32,000, no overly popular URIs or literals are removed and there is an average of 4,303 triples per node.

Figure 6.11: The number of triples per node as a function of the threshold of popular triples (100 physical nodes with 6 virtual nodes per physical node).

However, the load is unevenly balanced: the minimum number of triples per node is 700 while the maximum is 36,871. When Popular_Threshold is set to 500, 20 overly popular URIs and literals are removed from indexing and there is an average of 2,352 triples per node. The minimum number of triples per node is 688 while the maximum is reduced to 4,900, which, at less than an order of magnitude of difference, we believe is acceptable load balancing.

6.7.5 Load Balancing via Successor Probing

Although limiting overly popular URIs and literals greatly reduces the difference between the maximum and minimum number of triples per node, the triples are still not uniformly distributed across all nodes. This is because the frequency count distribution of non-popular URIs and literals remains non-uniform even after removing overly popular values. We propose a preliminary successor probing scheme, inspired by the probe-based node insertion techniques of [47], to further achieve a more balanced triple storage load on each node. In Chord, the distribution of node identifiers is uniform and independent of the data distribution. In this successor probing scheme, we use a sampling technique to generate a node identifier distribution adaptive to the data distribution. When a node joins the network, it uses SHA1 hashing to generate Probing_Factor candidate identifiers. Then it uses Chord's successor routing algorithm to find the successors corresponding to these identifiers.
All these successors return the number of triples that would be migrated to the new node if it joined there, and the new node chooses the identifier that gives it the heaviest load. The cost of this technique is that it increases the insertion time of a node from log n to Probing_Factor × log n. It is our intuition that log n is a good setting for the probing factor.

Fig. 6.12 shows the minimum, average and maximum number of triples per node with Probing_Factor from 1 to 9 in a network with 100 physical nodes. Popular_Threshold is set to 1,000 in this experiment. If there is no successor probing, the most loaded node has 7.2 times more triples than the least loaded node. If each node probes 9 nodes when it joins, the node with the heaviest load has only 2.6 times more triples than the node with the lightest load, which further reduces the load imbalance to much less than an order of magnitude. We can further improve load balancing with a background virtual node migration scheme proposed in [94], subject to the limitation that it cannot distribute the load for a single overly popular value.

Figure 6.12: The number of triples per node as a function of the number of successor nodes probed (100 physical nodes, Popular_Threshold = 1000).

6.8 Example Application: Shared-HiKE

HiKE (Hierarchical Knowledge Editor) lets users create and organize RDF data. Unlike other metadata creation tools, HiKE lets users enter instance data first, and add an ontology for the data later, explicitly or by inference when users create more data, refine their data, or align their data. Various external hierarchical structures can be imported whole into HiKE, such as a file hierarchy or a tree hierarchy of XML instance data. The whole HiKE hierarchy is based on RDF. Each node in HiKE is an RDF resource with the shown text being its label. Each node can have attributes that correspond to RDF triples with this node as the subject. The node hierarchy is also represented by a list of RDF triples connecting various resources.

We use HiKE constantly as a semantic desktop. Being RDF-driven, HiKE seems a natural application of RDFPeers. We have interfaced HiKE to the RDFPeers network and termed the combined application Shared-HiKE, forming what we believe is the first instance of a new class of Internet application driven by RDF transmitted over a structured P2P network. Shared-HiKE is written in Java and sits on top of the API provided by RDFPeers. Fig. 6.13 shows a screenshot of Shared-HiKE as currently implemented. The left pane (My HiKE) contains all of the RDF data, both this user's and others'. The My Shared and the Active Users items are specially recognized, and cannot be moved, deleted, or renamed. Content under My Shared is inserted into the P2P network.

Figure 6.13: Shared-HiKE, a P2P knowledge editor built on top of Subscribable RDFPeers

Conversely, opening nodes under Active Users will fetch and display content shared by others via queries to the underlying P2P network. Anybody can add to others' shared content. The tabs called My Shared, Bob, MinCai, Baoshi, Sameer, and Martin do not represent new content but are rather convenience shortcuts into the My HiKE hierarchy. The middle pane shows attribute data of the current selection (that is, outgoing arcs of that RDF node).
At the time of writing, we are working on (a) color-coding content based on who contributed it, (b) using RDFPeers' new subscription mechanism to monitor the network for users going on- and off-line (rather than requiring the user to explicitly refresh the Active Users node), and (c) letting the user monitor specified URIs and keywords via RDFPeers subscriptions.

6.9 Related Work

Many centralized RDF repositories have been implemented to support storing, indexing and querying RDF documents, such as Inkling [77], RDFStore [97] and Jena [75]. These centralized RDF repositories typically use in-memory or database-supported processing, with files or a relational database as the back-end RDF triple store. RDFDB supports a SQL-like query language, while Inkling, RDFStore and Jena all support SquishQL-style RDF query languages. Centralized RDF repositories are very fast and can scale up to many millions of triples. However, they have the same limitations as other centralized approaches, such as a single processing bottleneck and a single point of failure.

To support integrated querying of distributed RDF repositories, Stuckenschmidt et al. [119] extend the Sesame system to a distributed architecture that introduces an RDF API implementation (Mediator SAIL) on top of the distributed repositories. Their work focuses on the index structure as well as query optimization in the Mediator SAIL implementation. This mediator approach can support arbitrarily complex queries and works well for a small number of data sources. However, it is difficult for this approach to scale up to an Internet-scale number of data sources.

Edutella [83] and its successor, the super-peer based RDF P2P network [82], were discussed in Sec. 6.1. Super-peers are often desirable in order to place the load unevenly among heterogeneous nodes, but our scheme can achieve the same effect more flexibly by having nodes host more or fewer Chord virtual nodes according to their capacity. REMINDIN [122] developed a lazy learning approach for the SWAP platform [35] to efficiently route semantic queries based on social metaphors. However, it only learns how to forward simple queries and still lacks efficient algorithms for complex queries.

Much work in the Semantic Web and information integration literature has emphasized solving the semantic interoperability problem among data sources with heterogeneous ontologies. ChattyWeb [1] enables the participating data sources to incrementally develop global agreement in an evolutionary and completely decentralized bottom-up process, by learning the graph of local mappings among schemas through gossiping. Piazza [51] also eliminates the need for a global mediated schema by describing the mappings between sets of XML and RDF source nodes and evaluating those schema mappings transitively to answer queries. These two systems forward queries to peers based on schema similarities, which is complementary to RDFPeers, which indexes instances of RDF statements. It might be interesting to develop hybrid systems that leverage schema mapping on top of RDFPeers.

Besides RDFPeers, there are several other distributed RDF metadata management systems that provide publish and subscribe mechanisms. MDV [61] is a distributed RDF metadata management system based on a 3-tier architecture that supports caching and replication in the middle tier. It implements a filter algorithm based on relational database technology that efficiently computes all subscribers for created, updated and deleted RDF data.
Chirita et al. [24] proposed a peer-to-peer RDF publish/subscribe system based on a super-peer RDF peer-to-peer network. In contrast to RDFPeers, subscriptions in their approach are selectively broadcast to other super-peers based on their advertisements, while subscriptions in RDFPeers are routed to and stored on a particular node that is also responsible for storing matching RDF statements.

6.10 Summary

In summary, RDFPeers provides efficient distributed RDF metadata storage, query and subscription in a structured P2P network. It avoids flooding queries to the whole network and guarantees that query results will be found if they exist. RDFPeers also balances the triple-storing load between the most and least loaded nodes by using the successor probing scheme. Its state cost in neighborhood connections is logarithmic in the number of nodes in the network, and so is its processing cost in routing hops for all insertion and most query and subscription operations. RDFPeers offers subscriptions that, assuming a fixed number of subscriptions per node, scale to networks of many nodes. RDFPeers also preserves subscriptions as well as the original data by replicating content to a fixed number of nearby nodes, so that the network can repair itself without data loss when a node suddenly dies. RDFPeers thus enables fault-tolerant distributed RDF repositories with truly large numbers of participants.

Chapter 7 Distributed Worm Signature Generation

7.1 Introduction

Large-scale worm outbreaks are considered a major security threat to today's Internet [6, 79, 80]. Network worms exploit vulnerabilities in widely deployed homogeneous software to self-propagate quickly on the Internet [130]. Recent advances in port-scan detection demonstrate that victims infected by scanning worms can be detected and quarantined quickly at individual edge networks [131]. On the other hand, to contain worms over the entire Internet, simulation studies [81] show that signature-based filtering is about 10 times faster than address blacklisting.

Automatic signature generation is essential for signature-based filtering to contain zero-day worms [63, 115]. It often takes hours or even days for security experts to extract worm signatures manually [115]. However, the reaction time required for effective worm containment can be less than a few hours or even minutes [81]. For example, the CodeRed worm infected more than 359,000 computers on the Internet in less than 14 hours [80]. The Slammer worm probed all 4 billion IPv4 Internet addresses for potential victims in less than 10 minutes [79].

During the spreading of a mono-morphic worm, the invariant content substrings shared by worm exploits are repeated frequently, and their associated source or destination IP addresses are dispersed widely. Recent work on Autograph [63] and Earlybird [115] shows that the procedure of signature generation can be automated by analyzing the repetition of content substrings (i.e., fingerprints) and their address dispersion. These systems generally distinguish a worm signature from legitimate traffic patterns by some detection thresholds. However, the infection attempts of a worm are often scattered around the entire Internet with a low density at its early outbreak stage. Consequently, individual edge networks may not be able to accumulate enough worm samples for fast and accurate signature generation.
Apparently, the overall infection attempts observed by multiple edge networks are more distinguishable from legitimate traffic than those observed by a single edge network. The more worm traffic we observe and aggregate, the better the chance that we can generate accurate worm signatures sooner. Autograph [63] uses this heuristic to speed up signature generation by sharing the IP addresses of port scanners among distributed monitors. However, Autograph does not share the repetition count of a substring among the monitors. The maximum repetition of a worm signature observed by an Autograph monitor is determined by the number of worm infection attempts targeting the monitored edge network. Therefore, each Autograph monitor only observes the local rather than the global signature repetition.

Aggregating the global repetition and address dispersion of content substrings is challenging due to the requirements of scalability, fault tolerance and load balance. The scalability requirement is threefold. First, with increasing link speed, the number of payload substrings processed by each monitor will be tremendous even in a short time period, while the communication cost of global aggregation has to remain moderate. Second, the aggregation has to scale up to a large number of monitors at multiple edge networks, e.g., several thousand if 10% of all edge networks are monitored. Third, the total number of distinct addresses of a suspicious substring observed by distributed monitors will be very large during a worm outbreak. Thus, the counting of global address dispersion must scale up to large sets of IP addresses with moderate communication cost.

Moreover, the aggregation has to be done in a fault-tolerant and load-balanced manner. Obviously, collecting all information at a central site would introduce a single point of failure as well as communication and processing bottlenecks. Besides these technical challenges, global aggregation must preserve the privacy of individual organizations, since both content and address information are privacy sensitive. Preserving privacy is also necessary to encourage different organizations to collaborate in distributed worm signature generation.

To meet these challenges, we propose a distributed worm signature generation system called WormShield. In WormShield, distributed monitors collaboratively generate the signatures of mono-morphic worms through distributed fingerprint filtering and aggregation at multiple edge networks. This chapter presents the system model, fingerprint sampling and filtering schemes, and simulation results of the WormShield system.

7.2 The WormShield System Model

In this section, we first analyze three important properties of a worm during its spreading. Then we briefly describe the basic scheme of WormShield and its building blocks.

7.2.1 Modeling Worm Spreading Properties

Worm outbreaks often exhibit unique properties that deviate significantly from legitimate traffic. We assume that all infection sessions of a worm share at least one invariant substring, i.e., the worm signature. This assumption is valid for most existing mono-morphic worms in the wild; we discuss the limitation posed by polymorphic worms in Sec. 7.6.3. In the following, we analytically model three global worm spreading properties during a worm outbreak, i.e., fingerprint repetition and the dispersions of source and destination addresses. For simplicity of analysis, we assume a random scanning worm whose attack vector is always consistent.
Also, we only consider the occurrences of fingerprints caused by a given worm rather than by other benign traffic patterns.

A fingerprint is the Rabin-Karp hash value of a content substring. In this chapter, we use fingerprint and substring interchangeably, since hash collisions are negligible when 64-bit fingerprint values are used. A fingerprint repetition is the repetition count of a given fingerprint. Let I(t) be the number of infected hosts at time t, η be the worm probing rate, M be the vulnerable population, and I_0 be the initially infected population, where 1 ≤ I_0 < M. The susceptible-infected (SI) model [81, 144] gives

\[ I(t) = \frac{M e^{\beta(t-T)}}{1 + e^{\beta(t-T)}}, \]

where β = ηM / 2^{32} for the 32-bit IPv4 address space, and T is an integration constant determined by I_0, i.e., T = ln(M/I_0 − 1)/β. Since I(T) = M/2, T represents the time at which half of the vulnerable population is infected.

Let r(t) be the global fingerprint repetition of a worm at time t. Since the worm fingerprint appears at least once in each worm exploit, we have

\[ r(t) = \int_0^t I(x)\,dx = \int_0^t \frac{M e^{\beta(x-T)}}{1 + e^{\beta(x-T)}}\,dx = \frac{2^{32}}{\eta}\Big(\ln\big(e^{\beta(t-T)} + 1\big) - \ln\big(e^{-\beta T} + 1\big)\Big) \quad (7.1) \]

The source address dispersion of a fingerprint is defined as the number of distinct source addresses of the IP packets that contain the fingerprint. Let s(t) be the global source address dispersion of a worm fingerprint at time t. Since each source address of the worm fingerprint represents an infected host, we have

\[ s(t) = I(t) = \frac{M e^{\beta(t-T)}}{1 + e^{\beta(t-T)}} \quad (7.2) \]

Similarly, the destination address dispersion is defined as the number of distinct destination addresses of the IP packets that contain the fingerprint. Let d(t) be the expected global destination address dispersion of a worm fingerprint at time t, and f(r) be the expected number of distinct destination IP addresses when a fingerprint repeats r times. Suppose a worm probes the 32-bit IPv4 space uniformly, and we have f(r) distinct destination addresses after r probes. For the (r+1)-th probe, we have f(r) + 1 distinct addresses with probability (2^{32} − f(r))/2^{32}, or f(r) distinct addresses with probability f(r)/2^{32}. Therefore,

\[ f(r+1) = \big(f(r) + 1\big)\,\frac{2^{32} - f(r)}{2^{32}} + f(r)\,\frac{f(r)}{2^{32}} = f(r)\Big(1 - \frac{1}{2^{32}}\Big) + 1. \]

Since f(1) = 1, unrolling this recurrence gives f(r) = 2^{32}(1 − (1 − 1/2^{32})^r). Hence,

\[ d(t) = f(r(t)) = 2^{32}\Big(1 - \big(1 - 1/2^{32}\big)^{r(t)}\Big) \quad (7.3) \]

According to Eqs. (7.1), (7.2) and (7.3), both the fingerprint repetition and the address dispersion of a worm increase exponentially at the early stage of its outbreak. When r(t) ≪ 2^{32}, d(t) increases almost linearly as a function of r(t). These analytical models are further confirmed by the worm simulation results in Sec. 7.5.2. These three worm spreading properties can be used to distinguish a worm signature from legitimate traffic patterns [63, 115].

7.2.2 Distributed Worm Signature Generation

Individual edge networks can only observe a small proportion of worm fingerprints. To gain a better global view of worm activities, we propose a distributed worm signature generation system called WormShield. It consists of a set of geographically distributed monitors deployed at multiple edge networks or sites. All monitors organize themselves into a Chord overlay network [118]. Each monitor sniffs both inbound and outbound traffic on its access link to the backbone.

Each monitor i in a network of n monitors first uses the Rabin-Karp algorithm [92] to compute the fingerprint of each sliding window in sniffed packets. The fingerprints are then sampled using a window sampling algorithm [109].
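To make this per-packet fingerprinting step concrete, the sketch below computes a rolling hash over every k-byte window of a payload, in the spirit of the Rabin-Karp algorithm. It is a simplified polynomial rolling hash with illustrative constants, not WormShield's actual 64-bit Rabin fingerprint implementation; the subsequent window sampling step is detailed in Sec. 7.3.1.

def rolling_fingerprints(payload: bytes, k: int = 40,
                         base: int = 257, mod: int = (1 << 61) - 1):
    """Return a rolling hash for every k-byte substring (k-gram) of the payload."""
    n = len(payload)
    if n < k:
        return []
    h = 0
    for b in payload[:k]:            # hash of the first k-gram
        h = (h * base + b) % mod
    top = pow(base, k - 1, mod)      # weight of the byte that slides out of the window
    fps = [h]
    for i in range(k, n):            # slide the window one byte at a time
        h = ((h - payload[i - k] * top) * base + payload[i]) % mod
        fps.append(h)
    return fps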
Within a given window of fingerprints, the window sampling algorithm selects the one with the minimal value. For each sampled fingerprint j, each monitor i updates its local fingerprint repetition r_i(j), as well as the source and destination address sets S_i(j) and D_i(j), where i = 1, 2, ..., n. Let s_i(j) and d_i(j) be the source and destination address dispersion of j, respectively; we compute s_i(j) = |S_i(j)| and d_i(j) = |D_i(j)|. Once r_i(j), s_i(j) and d_i(j) all exceed their local thresholds, denoted by L_r, L_s and L_d, fingerprint j becomes locally suspicious and is then subject to global aggregation.

Similar to Earlybird, all fingerprints are actually clustered by destination port and protocol, since worms typically target a particular service [115]. The repetition and address dispersion are calculated for each fingerprint j with a unique destination port and protocol. This does not reduce the ability to track worm traffic, but it effectively excludes a large amount of repetitive substrings in non-worm traffic. In the remainder of this chapter, we always assume a fingerprint is associated with a unique pair of destination port and protocol.

To aggregate the information globally, WormShield automatically selects a root monitor for each fingerprint j, denoted by root(j), using Chord consistent hashing as discussed in Chapter 3. The root monitor of j is responsible for calculating the global repetition and the global source and destination address dispersions of j, denoted by r_g(j), s_g(j) and d_g(j), respectively. By aggregating the updates from all monitors, we obtain the following expressions for the global setting:

\[ r_g(j) = \sum_{i=1,\, r_i(j) \ge L_r}^{n} r_i(j), \qquad s_g(j) = \Big|\bigcup_{i=1,\, s_i(j) \ge L_s}^{n} S_i(j)\Big|, \qquad d_g(j) = \Big|\bigcup_{i=1,\, d_i(j) \ge L_d}^{n} D_i(j)\Big| \quad (7.4) \]

Since the root monitor aggregates the information for the same fingerprint j from all monitors, it has a global view of the fingerprint repetition and address dispersion of j across all edge networks. If r_g(j), s_g(j) and d_g(j) all exceed their global thresholds, denoted by G_r, G_s and G_d, the corresponding substring of j is identified as a potential worm signature. Algorithm 4 sketches the basic process of distributed signature generation in WormShield.

Algorithm 4 Distributed Signature Generation Algorithm
INPUT: local thresholds (L_r, L_s and L_d) and global thresholds (G_r, G_s and G_d)
OUTPUT: generated worm signatures
for all monitors i in 1, 2, ..., n do
  for all local fingerprints j in packets do
    r_i(j) ← r_i(j) + 1,  S_i(j) ← S_i(j) ∪ {SrcIP(j)},  D_i(j) ← D_i(j) ∪ {DestIP(j)}
    if r_i(j) ≥ L_r AND |S_i(j)| ≥ L_s AND |D_i(j)| ≥ L_d then
      mark j as a global fingerprint
      r_g(j) ← r_g(j) + r_i(j),  S_g(j) ← S_g(j) ∪ S_i(j),  D_g(j) ← D_g(j) ∪ D_i(j)
    end if
  end for
end for
for all global fingerprints j at the root monitor do
  if r_g(j) ≥ G_r AND |S_g(j)| ≥ G_s AND |D_g(j)| ≥ G_d then
    output the substring of j as a worm signature
  end if
end for

When monitors are uniformly distributed in the Chord identifier space, the number of fingerprints mapped to each monitor is almost uniformly distributed. However, if every monitor updated its local information to the root monitor directly using the Chord routing algorithm, the root monitor of a potential worm signature would be overwhelmed by a large number of updates during a worm outbreak. Instead, WormShield constructs a DAT tree for each root monitor to aggregate information gradually among all monitors rather than only at the root monitor.
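Before looking at how these updates travel through the DAT tree, the sketch below shows the bookkeeping that Eq. (7.4) implies at a root monitor: locally suspicious updates are summed (repetitions) or unioned (address sets), and a substring is flagged once all three global thresholds are exceeded. For clarity it keeps exact address sets, whereas WormShield itself exchanges fixed-size cardinality summaries; the threshold values are illustrative.

from collections import defaultdict

G_R, G_S, G_D = 10_000, 30, 3_000   # illustrative global thresholds

class RootState:
    """Global state kept by the root monitor root(j) for one fingerprint j."""
    def __init__(self):
        self.r_g = 0
        self.src = set()   # stand-in for the source-address cardinality summary
        self.dst = set()   # stand-in for the destination-address summary

    def update(self, r_i, src_i, dst_i):
        # Eq. (7.4): sum local repetitions, union local address sets.
        self.r_g += r_i
        self.src |= src_i
        self.dst |= dst_i

    def is_potential_signature(self):
        return self.r_g >= G_R and len(self.src) >= G_S and len(self.dst) >= G_D

root_state = defaultdict(RootState)   # fingerprint j -> aggregated global state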
Each monitor in the DAT tree receives updates from its child monitors and sends a single aggregated value (i.e., the sum) to its parent monitor. Fig. 7.1 illustrates the process of aggregating fingerprint repetition in WormShield.

Figure 7.1: An example of aggregating fingerprint repetition in a WormShield network of 8 nodes.

Once a potential worm signature is identified, its root monitor constructs a multicast tree on top of the Chord overlay network and disseminates the signature to all other monitors participating in the WormShield system. Other monitors can automatically deploy the received worm signatures in their local signature-based intrusion detection systems, such as Snort [102] or Bro [86]. Monitors can also define their own policies for importing the signatures generated by others, such as notifying local network security administrators before activating signature filtering or rate limiting.

The main building blocks of a WormShield monitor are shown in Fig. 7.2. Each monitor implements a Chord protocol stack on top of which the DAT and broadcasting modules are built. The DATs aggregate the global information of suspicious substrings that pass the fingerprint filtering at each monitor. The global address dispersions are then estimated in a distributed manner using an adaptive counting algorithm. The signature database stores the signatures generated by a root monitor or disseminated by others, which can be further imported into a local IDS/IPS.

Figure 7.2: The functional design of a WormShield monitor.

7.3 Fingerprint Sampling

Many network-based signature detection systems, e.g., Earlybird and Autograph, calculate a hash value, i.e., a fingerprint, for each k-gram in the packet payload using the Rabin fingerprint algorithm [92]. A k-gram is a contiguous substring of k bytes, which is slid byte by byte over the whole packet payload. For a packet with n bytes of payload, there are (n − k + 1) k-grams, which is almost as many as the total number of bytes in the payload. To reduce the overhead of processing all fingerprints, two different approaches are used by Autograph and Earlybird. The COntent-based Payload Partitioning (COPP) approach in Autograph computes small k-grams and uses a predefined breakmark to partition the payload into non-overlapping chunks. In contrast, Earlybird uses a value sampling approach that selects only those fingerprints matching a predefined sampling value.

For worm signature detection, it is critical to prevent a worm from evading detection by changing only some portions of its payload. Suppose x is the length of the worm signature, i.e., the maximal common substring in the worm payload. We denote by P(x) the probability of selecting at least one fingerprint for a signature of length x. Obviously, P(x) should be as large as possible even when x is small. However, when x is relatively small, e.g., less than 200 bytes, neither COPP nor value sampling can guarantee to select at least one fingerprint.

WormShield uses a window sampling method [109] to select a small fraction of all k-gram fingerprints. In the following, we first describe the basic idea of window sampling and then compare it with COPP and value sampling.

7.3.1 Window Sampling

For a given sequence of fingerprints f_1 ... f_n and a window size w, each position 1 ≤ i ≤ n − w + 1 in this sequence defines a window of fingerprints f_i ... f_{i+w−1}. The window sampling approach simply selects the minimal fingerprint value in each window.
If there is more than one minimal value and the value was already selected for the previous window, no fingerprint is selected for the current window. Otherwise, the rightmost occurrence of the minimal fingerprint is selected for the current window. The sampling factor of a sampling algorithm is the expected fraction of fingerprints selected from all fingerprints under random input. In window sampling, the sampling factor f is determined by the window size, i.e., f = 2/(w + 1), as proved by Schleimer et al. [109]. Therefore, for a given sampling factor f, we should set the window size to w = 2/f − 1. Window sampling selects at least one fingerprint from each window; thus it is guaranteed to generate a fingerprint for signatures longer than the window size, i.e., P(x) = 1 if x ≥ w.

7.3.2 Comparison with COPP and Value Sampling

We compare window sampling with COPP and value sampling by analyzing the probability of selecting a fingerprint of a signature under the same sampling factor. We denote by P_c(x), P_v(x) and P_w(x) the probability of selecting an x-byte signature by COPP, value sampling, and window sampling, respectively, under sampling factor f. We assume that the input is randomly distributed, and we will discuss the effects of repetitive, low-entropy strings later in this section.

COPP uses an idea similar to value sampling: it matches the fingerprint of a short k-gram against that of the breakmark. To achieve sampling factor f, the average chunk size a in COPP needs to equal 1/f. Since the input is random, the probability that a k-gram matches the breakmark is f = 1/a. For a signature of x bytes, there are n = x − k + 1 k-gram fingerprints. Since the maximal chunk size is often very large, e.g., 1024 bytes in Autograph, P_c(x) is the probability that at least 2 out of n fingerprints match the breakmark. Therefore, we have

\[ P_c(x) = 1 - (1-f)^n - nf(1-f)^{n-1} \approx 1 - e^{-f(x-k)}\big((x-k)f + 1\big) \quad (7.5) \]

when x ≥ k; otherwise, P_c(x) = 0.

In value sampling, there are also n = x − k + 1 k-grams, but they are often much longer than those in COPP, e.g., 40 bytes in Earlybird versus 4 bytes in Autograph. Therefore, P_v(x) is the probability that at least 1 of the n fingerprints matches a given sampling value. Since the probability that each fingerprint matches the sampling value is f, we have

\[ P_v(x) = 1 - (1-f)^n \approx 1 - e^{-f(x-k+1)} \quad (7.6) \]

when x ≥ k; otherwise, P_v(x) = 0.

For window sampling, the minimal value is uniformly distributed in the window given uniformly distributed fingerprints. When there are not enough fingerprints for one window, i.e., n = x − k + 1 < w, P_w is n/w, where w = 2/f − 1. Therefore, we have

\[ P_w(x) = \begin{cases} 0 & \text{if } 0 \le x < k \\ f(x-k+1)/(2-f) & \text{if } k \le x < 2/f + k - 2 \\ 1 & \text{if } x \ge 2/f + k - 2 \end{cases} \quad (7.7) \]

Figure 7.3 compares the three sampling methods as the signature length x varies from 0 to 500 bytes with f = 1/64. In this plot, COPP uses 4-byte k-grams, while both value sampling and window sampling use 40-byte k-grams.

Figure 7.3: Comparison of three different sampling methods (probability of being sampled vs. signature length in bytes).

As shown in Fig. 7.3, when a worm signature is longer than 166 bytes, window sampling is guaranteed to select at least one fingerprint for it. In contrast, COPP and value sampling have probabilities of only 71.9% and 86.2%, respectively, of detecting the signature.
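As a companion to this analysis, the sketch below implements the window sampling rule of Sec. 7.3.1: each window of w consecutive fingerprints contributes its rightmost minimum, and a window contributes nothing if that minimum was already selected for the previous window. It is a simplified rendering of the algorithm in [109], not WormShield's implementation.

def window_sample(fingerprints, w):
    """Select the rightmost minimal fingerprint of each length-w window,
    skipping windows whose minimum was already selected."""
    selected, last_pos = [], -1
    for i in range(len(fingerprints) - w + 1):
        window = fingerprints[i:i + w]
        m = min(window)
        pos = i + (w - 1 - window[::-1].index(m))   # rightmost occurrence of the minimum
        if pos != last_pos:
            selected.append((pos, m))
            last_pos = pos
    return selected

# The expected sampling factor is about 2/(w+1), so a target factor f = 1/64
# corresponds to a window size of w = 2/f - 1 = 127.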
Returning to Fig. 7.3, for very short signatures (less than 140 bytes), value sampling has a higher probability than window sampling of detecting them. However, such very short signatures are normally not specific enough to filter out worm traffic.

7.4 Distributed Fingerprint Filtering

As discussed in Sec. 7.2, WormShield uses fingerprint filtering at distributed monitors to filter out most content substrings in legitimate background traffic. Only the suspicious ones above the local thresholds are subject to global aggregation. The problem of fingerprint filtering is similar to that of generating worm signatures at a single monitor in Earlybird [115]. However, the local thresholds should be low enough for fast global signature generation.

The fingerprint filtering scheme has two phases, i.e., repetition filtering and address dispersion filtering. In the first phase, a multi-stage filter [37] is used to select frequent fingerprints that have repeated at least L_r times. In the second phase, the address dispersion of each frequent fingerprint is tracked using the adaptive counting algorithm presented in Sec. 4.4. A fingerprint is dispersed if it has at least L_s distinct source addresses and L_d distinct destination addresses. The fingerprint filter only selects suspicious fingerprints that are both frequent and dispersed, as specified in Algorithm 5.

Algorithm 5 The Fingerprint Filtering Algorithm
INPUT: a set of fingerprints F, local thresholds L_r, L_s and L_d
OUTPUT: locally-suspicious fingerprints that are both frequent and dispersed
initialize a multistage filter mf and an address dispersion table ad
for all fingerprints f in F do
  if ad[f] != NULL then
    add f.src_ip and f.dst_ip into the sets ad[f].src_ip and ad[f].dst_ip, respectively
    if |ad[f].src_ip| ≥ L_s AND |ad[f].dst_ip| ≥ L_d then
      output f as a locally-suspicious fingerprint
    end if
  else
    frequent ← true
    for all stages s in mf do
      s[f].count ← s[f].count + 1
      if s[f].count < L_r then
        frequent ← false
      end if
    end for
    if frequent then
      add an entry ad[f] for f in table ad
    end if
  end if
end for

To aggregate as much information as possible, the local thresholds need to be low enough. On the other hand, the higher the local thresholds are, the less aggregation traffic is introduced by each monitor. The effectiveness of fingerprint filtering is determined by the fingerprint characteristics of the background traffic, which are essential for estimating the aggregation overhead for a given link speed. We examined two kinds of fingerprint distributions, i.e., the distribution of fingerprint repetition X(r) and the distribution of address dispersion Y(s, d). The former is the number of fingerprints that repeat r times, and the latter is the number of fingerprints that have s distinct source IP addresses and d distinct destination IP addresses.

We analyzed eight traces from two OC-24 access links of a class B network, collected in August 2005 and June 2006. The traces include both inbound and outbound traffic over 30-second, 1-minute and 10-minute time intervals. Table 7.1 summarizes the characteristics of these eight traces.
Table 7.1: Summary of Eight Internet Packet Traces Used in Our Experiments

Trace          Date        Duration  Direction  Packets  Bytes      Fingerprints
2005-IN-30s    08/18/2005  30 sec    inbound    0.3M     184.8M     169.3M
2005-OUT-30s   08/18/2005  30 sec    outbound   0.8M     713.0M     669.1M
2005-IN-1m     08/18/2005  1 min     inbound    0.7M     547.2M     509.4M
2005-OUT-1m    08/18/2005  1 min     outbound   1.5M     1191.6M    1110.1M
2005-IN-10m    08/18/2005  10 min    inbound    7.6M     5969.8M    5445.9M
2005-OUT-10m   08/18/2005  10 min    outbound   15.3M    12859.1M   12022.1M
2006-IN-10m    06/22/2006  10 min    inbound    9.3M     4731.1M    4250.9M
2006-OUT-10m   06/22/2006  10 min    outbound   11.5M    6737.0M    6138.4M

7.4.1 Distribution of Fingerprint Repetition

It is challenging to estimate the exact repetition distribution of fingerprints due to their large quantity; e.g., the 10-minute inbound and outbound traces collected in 2005 have more than 76M and 157M fingerprints after 1/64 sampling, respectively. Indeed, estimating the repetition distribution is similar to estimating the flow size distribution [66]. Kumar et al. [66] proposed a probabilistic algorithm that uses Expectation Maximization (EM) to estimate the flow size distribution. In their scheme, each fingerprint is first hashed into an index into an array of counters, and the counter at this index is incremented by 1. Hash collisions might cause two or more fingerprints to increment the same index. An iterative EM algorithm is then used to estimate the actual distribution from the counter values. The EM algorithm can estimate large-scale distributions accurately with only moderate-sized memory, e.g., 512 MB of memory for 256M fingerprints.

To verify the accuracy of the EM estimation for our purpose, we implemented the EM algorithm as well as a hash table approach. The latter calculates the exact repetition distribution by indexing a counter for each fingerprint, but it is only able to analyze small traces due to memory limitations. We define the rank of a fingerprint as its position in descending order of repetition counts. Fig. 7.4(a) compares the EM algorithm with the hash table approach by plotting the fingerprint repetition against its rank in trace 2005-IN-30s. The fingerprints are computed on 40-byte k-grams, and 1/64 window sampling is used to ease the memory requirement of the hash table implementation. The rank distributions of fingerprint repetition for both approaches are plotted in log-log scale. The close fit of these two curves demonstrates that the EM algorithm is very accurate in estimating the repetition distribution. Indeed, the weighted mean relative difference (WMRD) [66] between the EM-estimated distribution and the actual distribution is less than 0.01%.

Figure 7.4(a) also shows that window sampling retains an unbiased distribution for both low- and high-repetition fingerprints. The most frequent fingerprint in trace 2005-IN-30s repeats 3,199,055 times when no sampling is used. With 1/64 window sampling, it is estimated to repeat 50,403 times, which is almost 1/64 of the actual repetition. The log-log plots in Fig. 7.4(a) show a linear relationship, which suggests that fingerprint repetition exhibits a Zipf-like distribution in small traces. For the large traces collected in August 2005 and June 2006, Figures
7.4(b) and 7.4(c) show a similar linear relationship between fingerprint repetition and its rank, which confirms that fingerprint repetition follows a Zipf-like distribution over both short and long time periods.

Figure 7.4: The Zipf-like distribution of fingerprint repetition in various Internet traces. (a) Comparison of the hash table and EM algorithm: inbound 30-second trace (2005-IN-30s). (b) Inbound and outbound 10-minute traces on August 18, 2005 (2005-IN-10m and 2005-OUT-10m). (c) Inbound and outbound 10-minute traces on June 22, 2006 (2006-IN-10m and 2006-OUT-10m). (d) Top three protocols with the highest traffic volume in trace 2005-IN-10m: ftp-data (42.8%), nntp (23.3%) and http (16.6%).

Zipf-like distributions are commonly observed in many kinds of phenomena, although their exact cause is still unclear [68, 30]. The word frequency in randomly generated texts has a Zipf-like distribution, as observed in many natural languages such as English [68]. Also, fingerprints (i.e., k-grams) in a broad class of unbiased binary texts exhibit Zipf-like distributions due to the presence of long-range correlation [30]. For example, fingerprints in Web documents follow a Zipf distribution [109]. Since a large portion of network traffic consists of Web documents, emails, or other content, we believe that fingerprints in network traffic should exhibit a Zipf-like distribution as well.

In addition to the overall inbound and outbound traffic, we also analyzed the individual fingerprint distributions of the top three protocols with the highest traffic volumes. Fig. 7.4(d) plots the distributions of these top three protocols (ftp-data, nntp, and http), whose volumes account for 82.7% of the total traffic in 2005-IN-10m. We observe that the fingerprints of each individual protocol also exhibit a Zipf-like distribution. Therefore, a Zipf-like distribution should be reliably seen in overall traffic mixed from different protocols. The same results were also observed on the other traces (2006-IN-10m and 2006-OUT-10m) collected almost one year later at a different access link. This confirms that a Zipf-like fingerprint distribution is a natural property of Internet traffic. We have made our traffic analysis tools publicly available at http://gridsec.usc.edu/wormshield, and readers are welcome to verify the fingerprint distribution in their own edge networks.

7.4.2 Distribution of Address Dispersion

Besides the local repetition threshold, each WormShield monitor also applies local thresholds to the address dispersion of fingerprints. We study the distribution of address dispersion Y(s, d) in the background traffic, where s and d are the source and destination address dispersions, respectively. We first apply a multistage filter to exclude most fingerprints that repeat less than 10 times in our experiments. Then we estimate the address dispersion of fingerprints using the adaptive counting algorithm discussed in Sec. 4.4. Fig. 7.5 shows 3D histograms of the address dispersion in traces 2005-IN-10m and 2005-OUT-10m.
Since the number of fingerprints is plotted in log scale, most fingerprints have only a few distinct source and destination addresses.

Figure 7.5: The distribution of address dispersion as 3D histograms. (a) Inbound 10 minutes (2005-IN-10m). (b) Outbound 10 minutes (2005-OUT-10m).

7.4.3 Efficiency of Fingerprint Filtering

The Zipf nature of fingerprint repetition is critical for reducing the global aggregation overhead with fingerprint filtering in WormShield. We define the filtering ratio F(t) as the fraction of fingerprints that exceed a given threshold t. Obviously, F(t) needs to be as small as possible even when t is quite low. Fig. 7.6(a) plots the filtering ratio as a function of the repetition threshold in traces 2005-IN-10m and 2005-OUT-10m. It shows that the filtering ratio decreases dramatically as the threshold increases from 1 to 10. For example, the filtering ratio of 2005-IN-10m decreases from 0.01 to 0.0009 when the threshold increases from 1 to 5, and further decreases to 0.0003 when t is 10. Similarly, that of 2005-OUT-10m decreases from 0.015 to 0.0007 when the threshold increases from 1 to 5. Therefore, only about 0.1% of fingerprints are subject to global aggregation when the local repetition threshold is only 5, which hardly affects the efficacy of the global fingerprint repetition.

Figure 7.6: The filtering ratio of packet traces decreases with increasing local thresholds. (a) Effects of the fingerprint repetition threshold. (b) Effects of the address dispersion threshold.

Similar to the case of fingerprint repetition, we also examine the filtering ratio F(t) of address dispersion, where t is a threshold for both source and destination addresses. Fig. 7.6(b) shows that this filtering ratio also decreases dramatically when t increases from 1 to 10. For example, when t is applied to both source and destination addresses in trace 2005-IN-10m, F(t) decreases from 0.04 to 0.001 as t increases from 1 to 10. Indeed, the distribution of address dispersion also follows a Zipf-like distribution, since the filtering ratio is actually a complementary cumulative function of the address dispersion distribution. Therefore, even when considerably low thresholds are used, fingerprint filtering is able to reduce the aggregation traffic from a single monitor by five to six orders of magnitude. Note that the overall filtering ratio is the product of the two filtering ratios of the repetition and address dispersion filtering phases. For each fingerprint, we use 20 bytes for its SHA-1 hash value, 4 bytes for its repetition value, and 90 bytes each for summarizing its source and destination addresses. When all local thresholds are 10, we observe roughly 0.6 KB/s of aggregation traffic from each monitor, which is about 0.003% of the 18 MB/s link traffic.

7.4.4 Aggregation Traffic of a Single Monitor

Reducing the global aggregation traffic among monitors is one of the most important issues in distributed worm signature generation systems.
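As a brief aside before quantifying a single monitor's aggregation traffic: the filtering ratio F(t) of the previous subsection is simply the complementary cumulative fraction of the measured distribution. A minimal sketch, using purely hypothetical per-fingerprint counts:

from collections import Counter

def filtering_ratio(counts, t):
    """Fraction of distinct fingerprints whose value (repetition count or
    address dispersion) reaches the threshold t."""
    total = len(counts)
    return sum(1 for c in counts.values() if c >= t) / total if total else 0.0

# Hypothetical example: repetition counts of five fingerprints.
reps = Counter({"fp1": 1, "fp2": 3, "fp3": 12, "fp4": 1, "fp5": 250})
print(filtering_ratio(reps, 10))   # 0.4 in this toy example

Multiplying the two per-phase ratios by the sampled fingerprint rate and by the per-fingerprint update size gives the kind of per-monitor aggregation bandwidth estimate quoted above.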
Our measurement results in the previous subsections show that we can leverage the Zipf distribution of substring repetition and address dispersion to effectively reduce the number of fingerprints that are subject to global aggregation. Moreover, for each fingerprint, we need only a few hundred bytes to summarize its repetition and address information. In our WormShield prototype, we use 20 bytes for the SHA-1 hash value of a fingerprint, 4 bytes for the repetition value, and 160 bytes each for the source and destination address summaries.

We plot the data rate of the aggregation traffic from a single monitor and the original link traffic in Fig. 7.7. We use the 10-minute inbound trace 2005-IN-10m and omit the first 200 seconds, since there are no continuous traces before this interval. We calculate the aggregation traffic and link traffic over intervals of 10 seconds. We used a repetition threshold L_r of 10, and set the local address dispersion thresholds L_s and L_d to either 5 or 10. Although the peak link traffic is about 18 MB/s, the aggregation traffic from a monitor is under 10 KB/s and 18 KB/s when the address dispersion thresholds are 5 and 10, respectively.

The measured aggregation traffic above consists only of the fingerprints originating from a single monitor. During distributed aggregation, each monitor also needs to forward fingerprints from other monitors. If each monitor updated the information of its fingerprints with the root monitor directly, there would be O(n log n) total messages for an overlay network of n monitors. The following section presents an optimization strategy to minimize the communication cost among monitors.

Figure 7.7: Aggregation traffic data rates with different local thresholds (link traffic; aggregation traffic with L_r = 10, L_s = L_d = 10; and aggregation traffic with L_r = 10, L_s = L_d = 5).

Since a monitor has O(log n) neighbors in a Chord network of n nodes, it will also have at most O(log n) child monitors, although it will be part of n DAT trees rooted at different monitors. Therefore, for a network of thousands of monitors, the aggregation traffic originated and forwarded by each monitor will be only a few hundred kilobytes per second.

7.5 Performance Evaluation

In this section, we simulate a spectrum of scanning worms, including CodeRed and Slammer, on realistic Internet configurations. Although collaborative monitors are expected to perform better than isolated ones, we are interested in a quantitative comparison of WormShield and an equal number of isolated monitors using various performance metrics, including signature generation speed, false positives and deployment scalability.

7.5.1 Simulation Setup

To faithfully simulate worm propagation and signature generation events, we implemented a packet-level discrete-event simulator. It simulates three types of network objects, i.e., vulnerable hosts, edge networks that advertise network address prefixes, and Autonomous Systems (ASes). To realistically partition the IP address space among edge networks, the simulator uses BGP snapshots imported from RouteViews (www.routeviews.org) to assign the IP address blocks of the simulated edge networks and ASes [81, 63].
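For intuition about the propagation dynamics the simulator must reproduce (validated against Eqs. (7.1)-(7.3) in Sec. 7.5.2), the toy loop below integrates the mean-field SI dynamics dI/dt = ηI(M − I)/2^32 for a uniform random-scanning worm. It is a drastically simplified stand-in for the packet-level C simulator: there are no edge networks, monitors, or BGP prefixes, and the default parameters merely mirror the Slammer entries of Table 7.2.

def simulate_uniform_worm(M=75_000, eta=4_000, I0=25, dt=0.01, t_max=300.0,
                          address_space=2 ** 32):
    """Euler integration of dI/dt = eta * I * (M - I) / 2^32 (uniform scanning)."""
    infected, t, samples = float(I0), 0.0, []
    while t < t_max and infected < M - 1:
        infected += eta * infected * (M - infected) / address_space * dt
        t += dt
        samples.append((t, infected))
    return samples

# Half of the vulnerable population is infected around t = ln(M/I0 - 1) / beta,
# with beta = eta * M / 2^32, matching the analytical T of Sec. 7.2.1.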
For each vulnerable edge network, i.e., one containing at least one vulnerable host, a WormShield monitor is simulated to aggregate the fingerprint repetition and address dispersion of sniffed packets. The simulator was written in about 7,000 lines of C code and is available for download at http://gridsec.usc.edu/wormshield.

We simulated two variants each of the CodeRed and Slammer worms, with both uniform and subnet-preferred scanning schemes. The CodeRedI-v2 and Slammer-I worms uniformly probe the entire IP address space except for 224.0.0.0/4 (multicast) and 127.0.0.0/8 (loopback). The CodeRedII and Slammer-II worms probe a completely random IP address 1/8 of the time, an address in the same class A network (/8 network) half of the time, and an address in the same class B network 3/8 of the time. All worms have 25 hosts infected at the beginning of the simulation. Table 7.2 summarizes the parameters of these four simulated worms.

Table 7.2: Parameters of Four Simulated Worms

Simulated worm  Date of BGP snapshot  Total edge networks  Vulnerable population  Vulnerable edge networks  Probing rate (probes/sec)  Probing method
CodeRedI-v2     July 19, 2001         97,461               359,000                59,004                    10                         uniform
CodeRedII       Aug. 4, 2001          97,876               359,000                59,343                    10                         subnet-preferred
Slammer-I       Jan. 25, 2003         112,065              75,000                 40,136                    4000                       uniform
Slammer-II      Jan. 25, 2003         112,065              75,000                 40,136                    4000                       subnet-preferred

The CodeRed and Slammer worms are used as two representative samples from a spectrum of TCP and UDP scanning worms. Note that these worms are simulated as unknown worms in our experiments; neither WormShield nor the isolated monitors take any advantage of their exploit code or associated vulnerabilities. Our simulation results are independent of the actual worm types used as long as they have similar propagation parameters, e.g., vulnerable population, probing rate and so on. We did not simulate the bandwidth limitation of access links, which may cause the Slammer worms to spread somewhat faster in our simulation than in a real network environment.

7.5.2 Spreading Patterns of Simulated Worms

To validate our simulator and the analytical model in Sec. 7.2.1, we simulated the spread of the two Slammer worms and plotted their global fingerprint repetition and address dispersion over time in Fig. 7.8. The fingerprint repetition of Slammer-I with uniform scanning is well predicted by Eq. (7.1), as shown in Fig. 7.8(a). Since Slammer-II uses subnet-preferred scanning, its fingerprint repetition increases much faster than that of Slammer-I. We observe in Fig. 7.8(b) that the source address dispersions of both Slammer-I and Slammer-II are very close to the number of infected hosts, which is consistent with our analysis in Eq. (7.2). For uniform scanning worms (e.g., Slammer-I), the destination address dispersion predicted by Eq. (7.3) also matches the simulation results well, as shown in Fig. 7.8(c).

Figures 7.8(a), 7.8(b) and 7.8(c) show that both the fingerprint repetition and the address dispersion increase exponentially when the worm just starts spreading, which matches our theoretical analysis in Sec. 7.2.1. Our simulations of the two CodeRed worms show similar results as well. Fig. 7.8(d) shows that when the destination address dispersion is far smaller than the IPv4 address space, it increases almost linearly with the fingerprint repetition, as illustrated by Eq. (7.3). Once the destination address dispersion of a worm signature exceeds a given threshold, its fingerprint repetition must be greater than that threshold as well.
Therefore, signature generation systems only need to consider the global source and destination address dispersion as criteria to identify a worm signature.

Figure 7.8: The spreading patterns of Slammer worms with uniform and subnet-preferred scanning. (a) Fingerprint repetition over time. (b) Source address dispersion over time. (c) Destination address dispersion over time. (d) Destination address dispersion vs. fingerprint repetition.

7.5.3 Speed of Signature Generation

Next, we compare collaborative WormShield monitors with isolated ones in terms of how timely they generate worm signatures. With different worm probing rates, the absolute time duration may not be a good metric for the timeliness of signature generation. Instead, we characterize the speed of signature generation inversely, by the number of hosts infected when a new worm signature is generated. We are particularly interested in evaluating the speed of signature generation under various thresholds.

We first simulate the signature generation of CodeRedI-v2 and CodeRedII with the default 10 probes per second. For a fair comparison, we deploy an equal number of isolated and WormShield monitors (i.e., 256 monitors) in randomly selected edge networks. The isolated monitors do not share fingerprint statistics among themselves and generate a worm signature based only on their local information. Fig. 7.9 shows the number of infected hosts at signature generation time as a function of different global thresholds for CodeRedI-v2 and CodeRedII. When only the source address dispersion is considered for CodeRedI-v2 in Fig. 7.9(a), the average case of isolated monitors does not generate the signature until 343.4K out of 359K vulnerable hosts are infected, when the global threshold is 100. The 99th-percentile case generates the signature almost as slowly as the average case, i.e., at 332.6K infected hosts. Since WormShield aggregates the global source address dispersion from all monitors, it generates the signature before 7,200 hosts are infected, roughly 1/48 as many as with isolated monitors.

When only the destination address dispersion is considered in Fig. 7.9(b), the 99th percentile generates the signature much faster than the average case, e.g., 29.5K versus 307.4K infected hosts when the global threshold is 3,000. This is because some isolated monitors are able to observe more outbound infection attempts if their edge networks already have some infected hosts. WormShield monitors further reduce the number of infected hosts down to 2,880 by aggregating the global destination addresses. Note that the 99th-percentile performance cannot always be achieved in practice, since there might not be an isolated monitor deployed in the corresponding edge network.
Figures 7.9(c) and 7.9(d) show that WormShield monitors also generate the signature of the subnet-preferred worm (i.e., CodeRedII) much faster than isolated monitors, in both the average and 99th-percentile cases.

Figure 7.9: Comparison of WormShield and isolated monitors in the speed of signature generation for CodeRed worms with varying global thresholds. (a) Source address dispersion (CodeRedI-v2). (b) Destination address dispersion (CodeRedI-v2). (c) Source address dispersion (CodeRedII). (d) Destination address dispersion (CodeRedII).

In addition, neither isolated nor WormShield monitors are able to generate its signature as fast as that of CodeRedI-v2. This is because CodeRedII scans IP addresses in its own subnet more frequently, and other addresses less frequently, than CodeRedI-v2, while both isolated and WormShield monitors only sniff inbound and outbound traffic on the access links of edge networks; they are not aware of infection attempts inside the same edge network.

In Fig. 7.9, we observe that the global threshold for destination addresses can be set much higher than that for source addresses. Since the probing rate of a worm can easily be adjusted, we also study the effect of the probing rate on the speed of signature generation. Fig. 7.10 plots the number of infected hosts for CodeRedI-v2 as the probing rate varies from 10 to 1000. The global thresholds for source and destination addresses are fixed at 30 and 3,000, respectively. Fig. 7.10 shows that the number of infected hosts decreases considerably when the probing rate increases from 10 to 200 for all three cases of isolated and WormShield monitors. As the probing rate increases further from 200 to 1000, there is no significant decrease in the number of infected hosts.
Figure 7.10: Number of hosts infected by CodeRedI-v2 with different probing rates under three monitor configurations (average of isolated monitors, 99th percentile of isolated monitors, and WormShield monitors).

Figure 7.11: The number of false signatures drops sharply with increasing global thresholds (source and destination address dispersion).

7.5.4 Effects of False Signatures

Signature accuracy is often quantified by two metrics, i.e., false negatives and false positives. The evaluation of false negatives requires time-synchronized worm attack traces at different edge networks. However, no such trace data is publicly available to the research community. Instead, we simulated a worm attack synthesized with background traffic. In this case, WormShield and the isolated monitors either succeed or fail to generate the signature of the simulated worm, so false negatives cannot be meaningfully evaluated in our simulations.

We evaluate false positives by the number of false signatures, i.e., generated signatures that are not real worm signatures. To understand the trade-off between signature generation speed and false signatures, we must conduct the worm simulation with realistic background traffic. We split the 10-minute inbound and outbound traces (2005-IN-10m and 2005-OUT-10m) of a class-B network into 256 traces of class-C networks. The partitioned traces capture most of the traffic at their access links and only miss the traffic among those class-C networks. Since the traces last only 10 minutes, we have to adjust the probing rate to fit the worm propagation into this time scale. The probing rate of the two CodeRed worms is 200 probes per second in this trace-driven simulation.

Fig. 7.11 shows that the number of false signatures drops sharply with increasing thresholds, e.g., fewer than 10 false signatures when the global threshold is greater than 1,000. These false signatures are generated out of 5.9 GB of inbound and 12.8 GB of outbound background traffic; the false positive rate would be very low if we counted the total number of fingerprints as the denominator. We note that there might be some worm noise in the background traffic, although there was no known worm outbreak during the trace collection time. Therefore, the number of actual false signatures might be even lower than that shown in Fig. 7.11.

Figure 7.12 plots the number of infected hosts at signature generation time as a function of the number of false signatures, obtained by varying the global thresholds; this gives essentially the receiver operating characteristic (ROC) curves of signature generation speed against false alarms. We observe that, for both uniform and subnet-preferred scanning worms, WormShield monitors generate the signature faster and more accurately than isolated monitors. Note that the number of infected hosts in Fig. 7.12 is in log scale. When no false signature is allowed, the average and 99th-percentile cases of isolated monitors generate the signature of CodeRedI-v2 before 231.1K and 11.5K hosts are infected, respectively. In contrast, only 1,710 hosts are infected before WormShield monitors generate the signature, which is about 135 times faster than the average case of isolated monitors.
Figure 7.12: ROC curves showing the number of infected hosts against the number of false signatures. (a) CodeRedI-v2, uniform scanning. (b) CodeRedII, subnet-preferred scanning.

As more false signatures are tolerated, all three cases generate the signature with fewer hosts infected. Fig. 7.12(a) shows that when 50 false signatures are tolerated, the number of infected hosts is reduced to 27.3K, 3.6K, and 280 for the average of isolated monitors, the 99th percentile of isolated monitors, and WormShield monitors, respectively. Fig. 7.12(b) shows a similar result for the CodeRedII worm with subnet-preferred scanning. It is important to note that the background traffic varies from one edge network to another. Therefore, we do not emphasize the particular number of false signatures, but rather show that WormShield monitors are capable of achieving a better trade-off between signature generation speed and false positives than isolated ones.

7.5.5 Deployment Scalability

In this section, we investigate the speed gain of signature generation as more monitors are deployed. We use 200 monitors as our baseline configuration and scale the system size from 200 to 3,000 monitors. We define the improvement factor γ(n) of signature generation using n monitors over the baseline configuration as γ(n) = I_b / I_n, where I_b and I_n are the numbers of infected hosts at signature generation time for the baseline and n-monitor configurations, respectively. The global thresholds are set to 30 and 3,000 for source and destination addresses, respectively, which eliminates any false signature in the background traffic.

Figure 7.13 plots the improvement factor as a function of the number of monitors for the CodeRed worms. Note that we always compare the improvement factors of equal numbers of isolated and WormShield monitors. For the isolated monitors, the signature generation speed does not improve with an increasing number of monitors, as demonstrated by the flat lines in Fig. 7.13. For example, when the number of isolated monitors increases from 200 to 300 for CodeRedI-v2, the number of infected hosts only decreases from 308.8K to 308.1K for the average case and from 42.4K to 22.3K for the 99th-percentile case. In contrast, the improvement factor of WormShield increases almost linearly with the deployment scale. The number of infected hosts decreases from 3,610 to 150 when the number of monitors increases from 200 to 3,000, which is roughly a 24-fold improvement in signature generation speed, as shown in Fig. 7.13(a). The 3,000 WormShield monitors also gain an improvement factor of 19 over the 256-monitor configuration discussed in Sec. 7.5.3.
Figure 7.13: Improvement factors of three signature generation schemes using an increasing number of monitors from 200 to 3,000: (a) CodeRedI-v2 with uniform scanning; (b) CodeRedII with subnet-preferred scanning.

Since the worm fingerprints are not evenly distributed over the edge networks, the improvement factor of WormShield monitors for uniform-scanning worms can be greater than the increase in the number of monitors. This super-linear speedup situation is seen in Fig. 7.13(a). We observe an improvement factor of 24 in signature generation speed when the number of WormShield monitors is 15 times that of the baseline configuration. We also observe in Fig. 7.13 that the improvement factor of WormShield for uniform-scanning worms is better than that for subnet-preferred ones. This is because uniform-scanning worms generate more inbound and outbound infection attempts that are observable by WormShield monitors.

Thus far, we have compared the performance of WormShield with that of isolated monitors for two variants of CodeRed worms. Our simulations on Slammer worms also show similar results. The Earlybird system [115] contributes several clever algorithmic designs for the scalable wire-speed implementation of a single worm monitor. However, without sharing fingerprint statistics among monitors, we anticipate that the Earlybird monitors would have performance similar to isolated monitors in our simulation.

The Autograph system [63] has better support for distributed deployment through application-level multicast. Autograph monitors share the IP addresses of port scanners to speed up the classification of suspicious worm flows. However, as in Earlybird, they do not share fingerprint information either, and each Autograph monitor will only accumulate as much worm payload as an isolated one in our simulation. Therefore, with the same global thresholds, we believe that Autograph should have a signature generation speed similar to the best case of isolated monitors. On the other hand, the Autograph monitors use flow-level heuristics to filter out most legitimate traffic, and should have better accuracy than isolated monitors or even better than the collaborative WormShield monitors.

7.6 Extensions and Limitations

Although our simulation results show that collaborative worm signature generation is quite promising, there are several other practical issues that need to be addressed in a real-world deployment.

7.6.1 Privacy-Preserving Signature Generation

It is critical to preserve the privacy of organizations that participate in collaborative worm signature generation. WormShield addresses this issue by anonymizing payload substrings with SHA-1 hash values and by anonymizing IP addresses with cardinality summaries. Each monitor only discloses to others the SHA-1 hash value of a substring that exceeds all local thresholds, instead of the substring itself. Only the content of a worm signature is revealed by the root monitor to others. In WormShield, each substring is 40 bytes, while its SHA-1 hash value is only 20 bytes. Thus, it will be very difficult, if not impossible, for an attacker to recover the substring from a hash value due to the cryptographic properties of the SHA-1 hash function.
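To make the disclosure rule concrete, the sketch below shows how a monitor could report only the SHA-1 digest of a substring whose local statistics exceed all local thresholds. The threshold values and function name are illustrative and do not reflect the actual WormShield configuration.

    # Illustrative sketch: a monitor discloses only the 20-byte SHA-1 digest of a 40-byte
    # substring whose local repetition and address-dispersion counts exceed the local
    # thresholds; the raw substring never leaves the organization. Thresholds are made up.
    import hashlib

    LOCAL_REPETITION_THRESHOLD = 100
    LOCAL_SRC_DISPERSION_THRESHOLD = 10
    LOCAL_DST_DISPERSION_THRESHOLD = 10

    def disclose(substring: bytes, repetition: int, src_disp: int, dst_disp: int):
        """Return the SHA-1 digest to report, or None if any local threshold is not met."""
        assert len(substring) == 40, "WormShield fingerprints 40-byte substrings"
        if (repetition >= LOCAL_REPETITION_THRESHOLD
                and src_disp >= LOCAL_SRC_DISPERSION_THRESHOLD
                and dst_disp >= LOCAL_DST_DISPERSION_THRESHOLD):
            return hashlib.sha1(substring).digest()
        return None

    print(disclose(b"A" * 40, repetition=500, src_disp=40, dst_disp=60).hex())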
A malicious member may build a reverse mapping from hash values to substrings, which is very expensive (2^160 entries). However, this mapping does not make the recovery of a 40-byte substring much easier, since each hash value will correspond to 2^160 possible substrings.

In addition, a 40-byte substring carries very limited information that may not be useful for attackers. Since only substrings that exceed local thresholds are subject to global aggregation, it should be even more difficult to recover the full packet payload from the disclosed SHA-1 hash values. For IP addresses, only cardinality summaries are disclosed to other monitors. It is impossible to recover actual IP addresses from a cardinality summary, which is simply an array of counters. Therefore, WormShield monitors generate worm signatures collaboratively without requiring any organization to disclose its private information.

7.6.2 Security in WormShield Deployment

WormShield relies on a group of distributed monitors working collaboratively to generate worm signatures, and every monitor plays an equally important role in the system. This may attract attackers to target the monitors or even attempt to compromise the entire system. To protect the confidentiality, integrity and authenticity of the communication among the monitors, information exchange can be encrypted and authenticated by existing means such as IPSec or SSL/TLS. In addition to leveraging other research works [116, 21, 40] to make Chord more secure and robust, we can introduce a family of consistent hashing functions in WormShield. Each substring will then have multiple independent root monitors. Even if the trust relationship cannot be fully established when WormShield is initially deployed, certain measures such as majority voting can be used to resolve possible disputes and fight against collusion among some of the compromised monitors.

The topological information of a WormShield network may also be exploited by P2P worms to propagate very fast. This is a fundamental threat to any DHT-based system. We anticipate that WormShield monitors will not only be built as standalone network appliances, such as firewalls, but also be administrated by security experts. Thus, we hope to minimize the risk of a WormShield monitor being compromised. In our future work, we will investigate the possibility of improving DHT routing schemes to prevent the fast spreading of P2P worms. For example, as suggested by Nicholas Weaver, we could predefine some secret constraints to limit the connectivity among nodes. The infected nodes could then be detected and isolated since P2P worms do not know these constraints.

7.6.3 Polymorphic Worms

Similar to other network-based worm signature generation systems [63, 115, 129], WormShield could be challenged by polymorphic worms [84, 27] with very short or no invariant substrings. Recent studies [84, 27] show that it is quite possible for worm authors to implement a polymorphic worm, although such a worm has not yet appeared in the wild [65, 70]. For semi-polymorphic worms that have a relatively short invariant substring, window sampling [109] has a higher probability of generating short signatures than value sampling [115] and Content-based Payload Partitioning (COPP) [63].
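The following sketch illustrates window-based sampling in the spirit of winnowing [109]: every k-byte substring is hashed, and the minimum hash in each window of w consecutive positions is selected, so at least one fingerprint is sampled from every window. The substring length, window size, and the use of Python's built-in hash in place of Rabin fingerprints are simplifications for illustration.

    # Illustrative window-based fingerprint sampling (winnowing-style). Selecting the
    # minimum hash per window guarantees one sampled fingerprint in every window,
    # which is why even a relatively short invariant substring is likely to be sampled.
    def window_sample(payload: bytes, k: int = 40, w: int = 16):
        hashes = [hash(payload[i:i + k]) for i in range(len(payload) - k + 1)]
        sampled = set()
        for start in range(len(hashes) - w + 1):
            window = hashes[start:start + w]
            offset = min(range(w), key=lambda j: window[j])  # position of the minimum hash
            sampled.add((window[offset], start + offset))
        return sampled  # set of (fingerprint, payload offset) pairs

    # Toy payload loosely resembling a CodeRed-style HTTP request.
    payload = b"GET /default.ida?" + b"N" * 200 + b"%u9090%u6858..."
    print(len(window_sample(payload)), "fingerprints sampled from", len(payload), "bytes")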
The problem of polymorphic worms is orthogonal to the problem of collaborative worm signature generation studied in this chapter. Many recently proposed systems for polymorphic worms are either host-based [27, 26] or single-monitor based [84, 65, 70]. As with monomorphic worms, collaborative monitors can speed up the detection of polymorphic worms by observing their global activities. However, instances of a polymorphic worm may not share a common invariant substring of sufficient length for ordinary WormShield monitors. Instead of fingerprinting a worm with its content substrings, Kruegel et al. [65] proposed to fingerprint a worm by the control flow graph (CFG) of the worm executable. The CFG fingerprint characterizes the structural similarities between variations of a polymorphic worm. Therefore, the collaboration techniques in WormShield can be extended to detect polymorphic worms by aggregating the global repetition and address dispersion of CFG fingerprints from distributed monitors.

7.6.4 Other Practical Limitations

NAT boxes are often deployed in networks that have limited IPv4 addresses (e.g. in Europe and Asia). Several internal hosts behind a NAT box share only one global IPv4 address. From a global point of view, all infected internal hosts can be modeled as a single infected host with a higher probing rate. Therefore, the deployment of NAT boxes will not affect the global fingerprint repetition and destination address dispersion, but will reduce the global source address dispersion.

In addition, our study in this chapter is based on the IPv4 addressing scheme. The propagation of scanning worms under IPv6 will be extremely slow since the address space is too sparse, assuming that the total number of hosts on the Internet does not increase suddenly [11]. Instead, IPv6 worms have to leverage other information sources for target discovery. It is relatively straightforward to extend WormShield to IPv6 worms as long as they exhibit exceptionally high fingerprint repetition and address dispersion.

The efficiency of fingerprint filtering in WormShield relies on the Zipf-like fingerprint distribution. Our discussion in Sec. 7.4.1 demonstrates that this distribution occurs naturally in Internet traffic. However, the filtering scheme will become less efficient if the distribution becomes less skewed. In our future work, we will evaluate the efficiency of fingerprint filtering by deploying WormShield in a real network environment. Moreover, the Zipf-like distribution is a property of the background traffic rather than the worm attack traffic. Without injecting a significantly large volume of traffic, it would be difficult for a worm attacker to flatten the skewed Zipf-like distribution so as to evade WormShield. On the other hand, injecting too much traffic will saturate the access link, which reduces the speed of worm propagation and is easily detected.

7.7 Related Work

Recently, there have been many research efforts on network worms [130], such as worm modeling and simulation [144, 36, 81, 71], measurement [80, 79, 6] and defense [131, 115, 63, 129]. Most scanning worms find a vulnerable target by randomly looking through the IP address space. There are several proposals to detect worm attacks by analyzing their scanning activities, e.g. threshold random walks [131, 55] at individual edge networks and trend detection [144] on the global Internet. Another worm detection approach is to passively monitor scanning traffic sent to unused address space, e.g. network telescopes [80, 79], honeypots [117] and active-sinks in DOMINO [136].
Network-based worm containment techniques can be classified into two major categories, i.e. address blacklisting and signature-based filtering. The former quarantines infected hosts that exhibit abnormal port-scan activities [131], which is efficient for protecting individual edge networks. However, to contain worms over the entire Internet, Moore et al. [81] show that signature-based filtering [102, 86] is more efficient than address blacklisting. Besides network-based techniques, Vigilante [26] employs collaboration among end-hosts to contain worms using self-certified alerts. Shield [128] installs host-based network filters that are vulnerability-specific and exploit-generic once a vulnerability is discovered and before a patch is applied.

Our work was inspired by previous efforts on automatic signature generation for unknown worms, e.g. Autograph [63] and Earlybird [115]. These two systems both employ the heuristic that the invariant byte string in worm payload will repeat very frequently during a worm outbreak. Autograph uses the flow-level heuristics of a worm, e.g. port scanning, to classify suspicious flows before generating signatures. Distributed monitors in Autograph share the address information of port scanners to speed up the signature generation process. In contrast, Earlybird does not rely on any classified traffic. It uses the address dispersion of a byte string as another heuristic to distinguish a worm signature from legitimate traffic patterns. Several scalable algorithms, e.g. multistage filters and scaled bitmaps, are used in Earlybird for the wire-speed implementation of a single monitor.

Although the basic concept of sharing information among monitors is similar in Autograph and WormShield, the shared information and underlying communication mechanisms are quite different in these two systems. Autograph uses application-level multicast over DHTs to share source IP addresses of port scanners among monitors, while WormShield uses distributed aggregation trees to compute the global fingerprint repetition and address dispersion. Fingerprint aggregation in WormShield has a considerably higher scalability requirement than the sharing of port-scanner IP addresses in Autograph. Application-level multicast would impose too much overhead for fingerprint aggregation, e.g. O(N^2) messages per aggregation for N monitors. In contrast, DAT only uses O(N) messages per aggregation. Furthermore, WormShield leverages distributed fingerprint filtering to reduce aggregation traffic significantly due to the Zipf-like fingerprint distribution. In summary, our work on WormShield is complementary to Autograph and Earlybird, since distributed fingerprint filtering and aggregation can be used to improve them as well.

PAYL [129] uses the Z-string of packet payload to generate worm signatures automatically. The payload alerts from different sites are correlated to increase accuracy and reduce false alarms. The privacy of individual sites is preserved by exchanging only unrecoverable Z-strings. Instead of generating worm signatures using single substrings that could be evaded by polymorphic worms, Polygraph [84] generates the signatures of polymorphic worms with multiple disjoint string tokens that are shorter than the single substrings used in Autograph and Earlybird.

Karamcheti et al. [58] use the inverse distribution of packet contents to detect worm traffic. However, they observed a binomial distribution of repetitions in background traffic. Since they only use 16 bits for each fingerprint, different content substrings will likely have the same hash value. Instead, we use 64-bit fingerprints, which are sufficient to avoid significant collisions.
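A quick birthday-bound calculation makes this difference concrete; the traffic volume below is only an assumed example.

    # Back-of-the-envelope sketch: the expected number of colliding pairs among n distinct
    # substrings is roughly n*(n-1)/2 / 2^b for b-bit fingerprints (birthday bound).
    def expected_collisions(n: int, bits: int) -> float:
        return n * (n - 1) / 2 / 2 ** bits

    n = 10_000_000  # e.g., ten million distinct sampled substrings (illustrative)
    print(f"16-bit: {expected_collisions(n, 16):.3e} expected colliding pairs")  # ~7.6e+08
    print(f"64-bit: {expected_collisions(n, 64):.3e} expected colliding pairs")  # ~2.7e-06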
Collaborative intrusion detection in general and worm containment in particular have been studied in previous work. DOMINO [136] builds an overlay network among active-sink nodes to distribute alert information by hashing the source IP addresses. Worminator [72] summarizes port-scan alerts in Bloom filters and disseminates them among collaborating peers. Kannan et al. [57] analyze the efficacy of cooperation among firewalls in containing worms. In Vigilante [26], worms are collaboratively contained by distributing self-certified alerts among end-hosts that do not necessarily trust each other. Our work is more focused on collaboratively generating worm signatures by using distributed fingerprint filtering and aggregation in multiple edge networks.

Table 7.3 compares the six worm signature generation systems according to seven criteria. WormShield differs from the other systems in both the method of information sharing and the heuristics of signature generation. In WormShield, distributed monitors share fingerprint statistics with distributed aggregation trees, and worm signatures are generated by monitoring the global fingerprint repetition and address dispersion. Besides the conceptual comparison in Table 7.3, Sec. 7.5 also measures several performance metrics (e.g. signature generation speed, false positives and deployment scalability) that can only be quantitatively evaluated via simulations.

Table 7.3: Comparison of Six Worm Signature Generation Systems

Criterion | Autograph | Earlybird | Polygraph | PAYL | WormShield | Vigilante
Deployment location | Network | Network | Network | Network | Network | Host
Information shared among monitors | Port-scan alerts | None | None | Z-strings | Fingerprint statistics | Self-certified alerts
Information sharing method | Multicast | None | None | Centralized servers | Aggregation trees | Broadcast
Signature generation heuristics | Local prevalence of substrings | Local fingerprint repetition & address dispersion | Local prevalence of short string tokens | Mahalanobis distance of 1-grams | Global fingerprint repetition & address dispersion | Non-executable pages & dynamic dataflow analysis
Signature structure | Single substring | Single substring | Disjoint string tokens | Multiple substrings | Single substring | Self-certified alerts
Flow classification | Required | No | Required | Required | No | No
Polymorphic worms | No | No | Yes | No | No | Yes

7.8 Summary

This chapter has presented the WormShield system that collaboratively generates worm signatures by filtering and aggregating fingerprints at distributed monitors in multiple edge networks. First, we modeled in Eqs. (7.1)-(7.3) the exponential growth of global fingerprint statistics at the early stage of a worm outbreak. This heuristic can be applied not only in WormShield but also in other worm detection systems to distinguish worm signatures from legitimate traffic patterns. Second, WormShield uses a window-based fingerprint sampling algorithm that has a higher probability of sampling a worm signature than the value sampling approach in Earlybird and the COPP approach in Autograph. Third, we observed the Zipf-like distributions of fingerprint repetition and address dispersion in real-life Internet traffic. This property enables efficient fingerprint filtering that reduces the global aggregation traffic by several orders of magnitude. Finally, we performed large-scale worm simulations and demonstrated the effectiveness and scalability of WormShield in fast worm signature generation.
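As a back-of-the-envelope illustration of the Zipf-based filtering summarized above, the sketch below assigns rank-based Zipf repetition counts to a population of fingerprints and measures what fraction would pass a local repetition threshold. The parameters are invented for illustration and are not measurements from our traces.

    # Illustrative sketch: under a Zipf-like repetition distribution, only a small fraction
    # of fingerprints exceed a local threshold, so most never generate aggregation traffic.
    def zipf_filter_fraction(num_fingerprints: int, alpha: float, local_threshold: int) -> float:
        # Rank-based Zipf model: the fingerprint of rank r repeats about C / r^alpha times,
        # normalized so that the most frequent fingerprint repeats num_fingerprints times.
        passed = sum(1 for rank in range(1, num_fingerprints + 1)
                     if num_fingerprints / rank ** alpha >= local_threshold)
        return passed / num_fingerprints

    fraction = zipf_filter_fraction(num_fingerprints=1_000_000, alpha=1.0, local_threshold=100)
    print(f"{fraction:.4%} of fingerprints would be forwarded")  # ~1% in this toy setting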
Chapter 8

Conclusions

Distributed resource indexing and information aggregation are two fundamental issues in large-scale distributed systems, such as P2P systems and Grids. In this dissertation, we have presented three techniques, i.e. a multi-attribute addressable network, a distributed aggregation tree, and a distributed cardinality counting scheme, to tackle these two problems in a fully decentralized fashion. We also demonstrated the performance and scalability of these techniques via three real-world applications: a Grid replica location service, distributed metadata management, and collaborative worm signature generation. This final chapter concludes the dissertation with a summary of our contributions and suggests future research directions.

8.1 Summary of Contributions

The work in this dissertation has extended several previous research efforts on distributed resource indexing and information aggregation. Summarized below are the major contributions and research findings.

First, we proposed a multi-attribute addressable network (MAAN) that indexes resources with a set of (attribute, value) pairs and searches the resources of interest using multi-attribute based range queries. MAAN supports efficient range queries by mapping attribute values to the Chord identifier space via uniform locality preserving hashing. It not only preserves the locality of resources but also distributes resources among all nodes uniformly. In a MAAN network of n nodes, the single-attribute dominated algorithm only uses O(log n + n × s_min) routing hops to resolve a query, where s_min is the minimum range selectivity over all attributes. Thus, it scales well in the number of attributes. Also, when s_min is negligibly small, the number of routing hops is logarithmic in the number of nodes.

Second, we introduced an efficient scheme for building DAT trees on a Chord overlay network. This scheme constructs an aggregation tree implicitly from Chord routing paths without maintaining any parent-child membership. To further balance DAT trees, a balanced routing algorithm was proposed for Chord to select the parent of a node from its finger nodes dynamically according to its distance to the root. The theoretical analysis proved that this algorithm constructs a balanced tree when nodes are evenly distributed in the Chord identifier space. The experimental results showed that the DAT scheme scales to a large number of nodes and corresponding aggregation trees. Without maintaining explicit parent-child membership, it has little overhead during node arrival and departure.

Third, we proposed an adaptive counting algorithm that integrates the advantages of both the LogLog and linear counting algorithms. As shown in the experiments, this algorithm scales well from very small cardinalities to large ones across six orders of magnitude. By max-merging cardinality summaries from distributed nodes, this algorithm only needs O(log log n) aggregation traffic for estimating the global cardinality of large sets with n distinct elements.

Fourth, we designed a P2P replica location service (P-RLS) that improves the Globus RLS implementation via a distributed replica indexing network with properties of self-organization, greater fault-tolerance and scalability. P-RLS introduces an adaptive replication scheme to evenly distribute the mappings among nodes. This system also uses a predecessor replication scheme to reduce the query hotspots of popular mappings. The performance measurements demonstrated that the update and query latencies increase at a logarithmic rate with the size of the P-RLS network, while the overhead of maintaining its network topology is reasonable.
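Returning to the adaptive counting scheme summarized in the third contribution above, the max-merge property of cardinality summaries can be illustrated with a simplified LogLog-style register array. This sketch is not the adaptive counting algorithm itself, which also switches to linear counting for small cardinalities and applies a bias-corrected estimator; the bucket count and hash choice are illustrative.

    # Simplified sketch of max-mergeable cardinality summaries (LogLog-style registers).
    import hashlib

    BUCKET_BITS = 8
    NUM_BUCKETS = 1 << BUCKET_BITS  # 256 buckets (illustrative)

    def _hash64(item: str) -> int:
        return int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")

    def summarize(items):
        """Per bucket, record the maximum rank (number of leading zero bits plus one)."""
        registers = [0] * NUM_BUCKETS
        for item in items:
            h = _hash64(item)
            bucket = h >> (64 - BUCKET_BITS)
            rest = h & ((1 << (64 - BUCKET_BITS)) - 1)
            rank = (64 - BUCKET_BITS) - rest.bit_length() + 1
            registers[bucket] = max(registers[bucket], rank)
        return registers

    def merge(a, b):
        """Max-merge: the summary of the union is the element-wise maximum."""
        return [max(x, y) for x, y in zip(a, b)]

    # Two monitors summarize different sets of source IPs; merging their registers gives
    # exactly the summary of the union, without exchanging any actual addresses.
    set1 = [f"10.0.{i // 256}.{i % 256}" for i in range(50_000)]
    set2 = [f"10.1.{i // 256}.{i % 256}" for i in range(30_000)]
    assert merge(summarize(set1), summarize(set2)) == summarize(set1 + set2)
    # A LogLog/linear-counting estimator would turn the merged registers into a cardinality
    # estimate; it is omitted here.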
Fifth, we developed a scalable P2P RDF repository (RDFPeers) that stores each triple at three places in a MAAN network by applying globally known hash functions to its subject, predicate, and object. In RDFPeers, queries are efficiently routed to the nodes storing the matched triples, and users can selectively subscribe to and be notified of new RDF triples. This system has no single point of failure and does not require the prior definition of RDF schemas. Both the number of neighbors per node and the number of routing hops for inserting RDF triples and for resolving most queries are logarithmic in the number of nodes in the network. The experiments on real-world RDF data showed that the triple-storing load in RDFPeers differs by less than an order of magnitude between the most and the least loaded nodes.

Finally, we modeled the exponential growth of global fingerprint statistics at the early stage of a worm outbreak. Based on this heuristic, we introduced a collaborative worm signature generation system (WormShield) that applies the DAT and cardinality-based counting schemes to aggregate global fingerprint statistics from multiple edge networks. This is the first time that the Zipf-like fingerprint distributions have been observed in real-life Internet traffic. Thus, distributed fingerprint filtering reduces the global aggregation traffic by several orders of magnitude. The adaptive counting algorithm enables the effective estimation of all IPv4 addresses in a worm outbreak with only a few hundred bytes. The large-scale worm simulations demonstrated that collaborative WormShield monitors offer distinct advantages in speed, accuracy and overhead for fast worm signature generation.

8.2 Future Directions

While MAAN supports multi-attribute range queries quite well, it does have some limitations. First, the attribute schema of resources has to be fixed and known in advance. We believe that supporting attribute schemas that evolve during the course of using the MAAN network is an important future research direction. Second, when the range selectivity of queries is very large, flooding the query to the whole network can actually be more efficient than routing it to nodes one by one as MAAN does. It would be interesting to analyze the threshold of range selectivity at which flooding becomes more efficient than routing, and to have MAAN use different query resolution algorithms for different kinds of queries.

The current MAAN and DAT implementations use an application-specific and non-standard protocol on top of TCP to communicate between nodes. However, the Grid community has recently moved to a Web Services based infrastructure, such as OGSA [125]. To be used in a real Grid environment, it is important to design and implement the MAAN and DAT systems based on standard Grid services. One approach is to implement the whole P2P network as a distributed Grid service, which exposes a generic resource indexing and aggregation interface to other Grid services. In the network, each node would still use a specific protocol to communicate with another, although the communication could be based on the SOAP protocol.
This is similar to the OpenHash system [59] that provides a service-oriented DHT network, instead of libraries, to other applications.

In this dissertation, the experiments on MAAN and DAT were performed on a cluster of nodes connected via a local area network. It would be of interest to conduct similar studies in the wide area. One important issue regarding the wide area is the potentially long latencies for sending messages among nodes in the P2P network. One concern regarding the use of Chord in a wide-area deployment is that each hop in the Chord overlay might correspond to multiple hops in the underlying IP network. Zhang et al. [138] proposed a lookup-parasitic random sampling (LPRS) algorithm that reduces the IP-layer lookup latency of Chord. LPRS-Chord is proven to result in lookup latencies proportional to the average unicast latency of the network, provided the underlying physical topology has power-law latency expansion. It would be interesting future work to incorporate this algorithm into MAAN and DAT to reduce network latencies in a wide-area environment.

It is quite challenging to enforce security and trust in open P2P systems and Grids [116]. Castro et al. [21] combined secure node identifier assignment, secure routing table maintenance, and secure message forwarding to tolerate up to 25% malicious nodes in a P2P network. In Grids, security policies tend to be heterogeneous across different institutions and virtual organizations. Mechanisms for node authentication and access control on resource indexing are still missing.

In addition, the proposed resource indexing scheme assumes that resource providers from different organizations will register accurate and trustworthy resource information. This assumption is reasonable if all providers are from the same virtual organization and they are cooperative instead of competitive. However, we envision that resource providers from different virtual organizations will participate in a unified resource sharing environment. The commercial deployment of Grids could rely purely on monetary transactions. In such an environment, it cannot be assumed that all resource providers will report accurate resource information. Thus, it would be meaningful to investigate techniques that enable trusted resource indexing and discovery.

Bibliography

[1] K. Aberer, P. Cudre-Mauroux, and M. Hauswirth, The chatty web: Emergent semantics through gossiping, in Proc. of the 13th World Wide Web Conference (WWW2003), May 2003.

[2] M. Adler, E. Halperin, R. M. Karp, and V. V. Vazirani, A stochastic process on the hypercube with applications to peer-to-peer networks, in Proc. of the 35th Annual ACM Symposium on Theory of Computing (STOC'03), June 2003.

[3] K. Albrecht, R. Arnold, M. Gahwiler, and R. Wattenhofer, Aggregating information in peer-to-peer systems for improved join and leave, in Proc. of the 4th International Conference on Peer-to-Peer Computing (P2P'04), 2004, pp. 227-234.

[4] A. Andrzejak and Z. Xu, Scalable, efficient range queries for Grid information services, in Proc. of the Second IEEE International Conference on Peer-to-Peer Computing (P2P2002), September 2002.

[5] J. Aspnes and G. Shah, Skip graphs, in Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, Baltimore, MD, USA, 12-14 Jan. 2003.

[6] M. Bailey, E. Cooke, F. Jahanian, and D. Watson, The Blaster worm then and now, in IEEE Security and Privacy Magazine, vol. 3, no. 4, July/August 2005, pp. 26-31.

[7] M. Baker and G.
Smith, GridRM: An extensible resource monitoring system, in Proc. of the IEEE International Cluster Computing Conference, December 2003, pp. 207214. [8] C. Baru, R. Moore, A. Rajasekar, and M. Wan, The SDSC storage resource broker, in Proc. of the CASCON'98 Conference, Toronto, Canada, November 1998. [9] A. Bavier, M. Bowman, B. Chun, D. Culler, S. Karlin, S. Muir, L. Peterson, T. Roscoe, T. Spalink, and M. Wawrzoniak, Operating system support for planetary-scale network services, in Proc. of the 1st Symposium on Networked Systems Design and Implementation (NSDI), 2004. [10] M. Bawa, H. Garcia-Molina, A. Gionis, and R. Motwani, Estimating aggregates on a peer-to-peer network, Computer Science Department, Stanford University, Tech. Rep., 2003. [11] S. M. Bellovin, A. Keromytis, , and B. Cheswick, Worm propagation strategies in an IPv6 Internet, ;login:, pp. 7076, February 2006. [12] F. Berman, G. Fox, and A. J. Hey, Grid Computing: Making the Global Infrastructure a Reality. John Wiley & Sons, April 2003. 175 [13] J. Black, S. Halevi, H. Krawczyk, T. Krovetz, and P. Rogaway, UMAC: Fast and secure message authentication, Proc. of CRYPTO '99, 1999. [14] B. Bloom, Space/time trade-offs in hash coding with allowable errors, Communications of the ACM, vol. 13, no. 7, pp. 422426, 1970. [15] Y . Breitbart and H. Korth, Replication and consistency: Being lazy helps sometimes, in Proc. of the 16th ACM SIGACT/SIGMOD Symposium on the Principles of Database Systems, Tucson, AZ, 1997. [16] M. Cai, K. Hwang, Y .-K. Kwok, S. Song, and Y . Chen, Collaborative Internet worm containment, in IEEE Security and Privacy Magazine, vol. 3, no. 3, May/June 2005, pp. 2533. [17] M. Cai, K. Hwang, J. Pan, and C. Papadopoulos, WormShield: Fast worm signature generation with distributed ngerprint aggregation, submitted to IEEE Transaction on Dependable and Secure Computing (TDSC), 2006. [18] M. Cai, A. Chervenak, and M. Frank, A peer-to-peer replica location service based on a distributed hash table, in Proc. of the 2004 ACM/IEEE conference on Supercomputing, 2004, p. 56. [19] M. Cai and M. Frank, RDFPeers: a scalable distributed RDF repository based on a struc- tured peer-to-peer network, in Proc. of the 13th International Conference on World Wide Web, 2004, pp. 650657. [20] M. Cai, M. Frank, J. Chen, and P. Szekely, MAAN: A mulit-attribute addressable network for Grid information services, Journal of Grid Computing, no. 1, pp. 314, 2004. [21] M. Castro, P. Druschel, A. Ganesh, A. Rowstron, and D. S. Wallach, Secure routing for structured peer-to-peer overlay networks, in Proc. of the 5th Symposium on Operating Systems Design and Implementation (OSDI '02), 2002, pp. 299314. [22] A. Chervenak, E. Deelman, I. Foster, L. Guy, W. Hoschek, A. Iamnitchi, C. Kesselman, P. Kunst, M. Ripeanu, S. B, S. H, K. Stockinger, and B. Tierney, Giggle: A framework for constructing sclable replica location services, in Proc. of the 2004 ACM/IEEE conference on Supercomputing, Baltimore, MD, November 2002. [23] A. L. Chervenak, N. Palavalli, S. Bharathi, C. Kesselman, and R. Schwartzkopf, Perfor- mance and scalability of a replica location service, in Proc. of High Performance Dis- tributed Computing Conference (HPDC-13), Honolulu, HI, June 2004. [24] P. A. Chirita, S. Idreos, M. Koubarakis, and W. Nejdl, Publish/subscribe for RDF-based p2p networks, in the 1st European Semantic Web Symposium (ESWS 2004), May 2004. [25] D. D. Clark, The design philosophy of the DARPA Internet protocols, in Proc. of SIG- COMM. Stanford, CA: ACM, Aug. 
1988, pp. 106114. [26] M. Costa, J. Crowcroft, M. Castro, A. Rowstron, L. Zhou, L. Zhang, and P. Barham, Vig- ilante: End-to-end containment of Internet worms, in Proc. of the 20th ACM Symposium on Operating Systems Principles (SOSP), October 2005. 176 [27] J. R. Crandall, Z. Su, S. F. Wu, and F. T. Chong, On deriving unknown vulnerabilities from zero-day polymorphic and metamorphic worm exploits, in Prof. of the 12th ACM Conference on Computer and Communications Security, November 2005. [28] K. Czajkowski, I. Foster, N. Karonis, C. Kesselman, S. Martin, W. Smith, and S. Tuecke, A resource management architecture for metacomputing systems, Lecture Notes in Com- puter Science, vol. 1459, 1998. [29] S. Czajkowski, K. Fitzgerald, I. Foster, and C. Kesselman, Grid information services for distributed resource sharing, in Proc. of High Performance Distributed Computing Conference, 2001. [30] A. Czirok, H. E. Stanley, , and T. Vicsek, Possible origin of power-law behavior in n-tuple Zipf analysis, Physical Review E, vol. 53, no. 6, June 1996. [31] E. Deelman, J. Blythe, Y . Gil, and C. Kesselman, Pegasus: Planning for execution in Grids, GriPhyN Project Technical Report, Tech. Rep., 2002. [32] E. Deelman, J. Blythe, Y . Gil, C. Kesselman, G. Mehta, K. Vahi, K. Blackburn, A. Laz- zarini, A. Arbree, R. Cavanaugh, and S. Koranda, Mapping abstract complex workows onto Grid environments, Journal of Grid Computing, vol. 1, no. 1, pp. 2539, 2003. [33] E. Deelman, G. Singh, M. P. Atkinson, A. Chervenak, N. P. C. Hong, C. Kesselman, S. Patil, L. Pearlman, and M. shi Su, Grid-based metadata services, in 16th International Conference on Scientic and Statistical Database Management (SSDBM04), June 2002. [34] M. Durand and P. Flajolet, Loglog counting of large cardinalities, in 11th Annual Euro- pean Symposium on Algorithms, Sept. 2003. [35] M. Ehrig, P. Haase, S. Staab, and C. Tempich, Swap: A semantics-based peer-to-peer system, in JXTA Workshop, November 2003. [36] D. Ellis, Worm anatomy and model, in WORM '03: Proc. of the 2003 ACM workshop on Rapid Malcode, Washington, DC, USA, 2003, pp. 4250. [37] C. Estan and G. Varghese, New directions in trafc measurement and accounting: Focus- ing on the elephants, ignoring the mice, ACM Transactions on Computer Systems, 2003. [38] C. Estan, G. Varghese, and M. Fisk, Bitmap algorithms for counting active ows on high speed links, in Proc. of the 2003 Internet Measurement Conference, Oct. 2003. [39] A. W. C. et al, The relational Grid monitoring architecture: Mediating information about the Grid, Journal of Grid Computing, vol. 2, no. 4, December 2004. [40] A. Fiat, J. Saia, and M. Young, Making Chord robust to Byzantine attacks, in Proc. of European Symposium on Algorithms(ESA), 2005. [41] S. Fitzgerald, I. Foster, C. Kesselman, G. von Laszewski, W. Smith, and S. Tuecke, A directory service for conguring high-performance distributed computations, in Proc. of the 6th IEEE Symposium on High-Performance Distributed Computing, 5-8 Aug. 1997, pp. 365375. 177 [42] P. Flajolet and G. N. Martin, Probabilistic counting algorithms for data base applications, J. of Comp. and Syst. Sci., vol. 31, no. 2, pp. 182209, Oct. 1985. [43] I. Foster and C. Kesselman, The Grid: Blueprint of a New Computing. Morgan Kaufmann, 1999. [44] I. Foster and C. Kesselman, Globus: A metacomputing infrastructure toolkit, The In- ternational Journal of Supercomputer Applications and High Performance Computing, vol. 11, no. 2, pp. 115128, 1997. [45] D. R. K. 
Frans Kaashoek, Koorde: A simple degree-optimal hash table, in 2nd Interna- tional Workshop on Peer-to-Peer Systems (IPTPS '03), February 2003. [46] L. Galanis and D. J. DeWitt., Scalable distributed aggregate computations through collab- oration, in Proc. of the 16th International Conference on Database and Expert Systems Applications (DEXA), 2005. [47] S. Ghandeharizadeh, A. Daskos, and X. An, PePeR: A distributed range addressing space for P2P systems, in Int'l Workshop on Databases, Information Systems, and Peer-to-Peer Computing (at VLDB), 2003. [48] J. Gray, P. Helland, P. O'Neil, and D. Shasha, The dangers of replication and a solution, in Proc. of the ACM SIGMOD Conference, 1996. [49] I. Gupta, R. van Renesse, and K. Birman, Scalable fault-tolerant aggregation in large process groups, in Proc. Conf. on Dependable Systems and Networks, 2001. [50] L. Guy, P. Kunszt, E. Laure, H. Stockinger, and K. Stockinger, Replica management in data Grids, GGF5 Working Draft, Tech. Rep., July 2002. [51] A. Halevy, Z. Ives, I. Tatarinov, and P. Mork, Piazza: Data management infrastructure for semantic-web applications, in Proc. of the 13th World Wide Web Conference (WWW2003), May 2003. [52] N. J. A. Harvey, M. B. Jone, S. Saroiu, M. Theimer, and A. Wolman, Skipnet: A scal- able overlay network with practical locality properties, in Proc. of the Fourth USENIX Symposium on Internet Technologies and Systems (USITS '03), Seattle, WA, USA, March 2003. [53] A. Iamnitchi, I. Foster, and D. Nurmi, A peer-to-peer approach to resource discovery in Grid environments, in Proc. of the 11th Symposium on High Performance Distributed Computing, Edinburgh, UK, August 2002. [54] M. Jelasity, A. Montresor, and O. Babaoglu, Gossip-based aggregation in large dynamic networks, ACM Transaction on Computer Systems, vol. 23, no. 3, pp. 219252, 2005. [55] J. Jung, V . Paxson, A. Berger, and H. Balakrishnan, Fast portscan detection using sequen- tial hypothesis testing, in Proc. of the IEEE Symposium on Security and Privacy, May 2004. 178 [56] S. D. Kamvar, M. T. Schlosser, and H. Garcia-Molina, The EigenTrust algorithm for reputation management in P2P networks, in Proc. of the 12th International Conference on World Wide Web, 2003, pp. 640651. [57] J. Kannan, L. Subramanian, I. Stoica, and R. H. Katz, Analyzing cooperative containment of fast scanning worms, in Proc. of USENIX SRUTI 2005 Workshop, July 2005. [58] V . Karamcheti, D. Geiger, Z. Kedem, and S. Muthukrishnan, Detecting malicious net- work trafc using inverse distributions of packet contents, in Proc. of the SIGCOMM MineNet'05 Workshop, August 2005. [59] B. Karp, S. Ratnasamy, S. Rhea, and S. Shenker, Spurring the adoption of DHTs with OpenHash, in Proc. of the 3rd International Workshop on Peer-to-Peer Systems (IPTPS), 2004. [60] G. Karvounarakis, S. Alexaki, V . Christophides, D. Plexousakis, and M. Scholl, RQL: A declarative query language for RDF, in 11th World Wide Web Conference, 2002. [61] M. Keidl, A. Kreutz, and A. Kemper, A publish and subscribe architecture for dis- tributed metadata management, in 18th International Conference on Data Engineering (ICDE'02), February 2002. [62] D. Kempe, A. Dobra, and J. Gehrke, Gossip-based computation of aggregate informa- tion, in FOCS '03: Proc. of the 44th Annual IEEE Symposium on Foundations of Com- puter Science, 2003, p. 482. [63] H.-A. Kim and B. Karp, Autograph: Toward automated, distributed worm signature de- tection, in USENIX Security, 2004. [64] E. Korpela, D. Werthimer, D. Anderson, J. Cobb, and M. 
Lebofsky, SETI@Home - mas- sively distributed computing for SETI, Computing in Science & Engineering, January 2001. [65] C. Kruegel, E. Kirda, D. Mutz, W. Robertson, and G. Vigna, Polymorphic worm detection using structural information of executables, in Proc. of the 8th Symposium on Recent Advances in Intrusion Detection (RAID), September 2005. [66] A. Kumar, M. Sung, J. Xu, and J. Wang, Data streaming algorithms for efcient and accu- rate estimation of ow size distribution. in ACM SIGMETRICS Performance Evaluation Review, 2004, pp. 177188. [67] J. Li, K. Sollins, and D.-Y . Lim, Implementing aggregation and broadcast over distributed hash tables, SIGCOMM Computer and Communication Review, vol. 35, no. 1, pp. 8192, 2005. [68] W. Li, Random texts exhibit Zipf's-law-like word frequency distribution, IEEE Transac- tion on Information Theory, vol. 38, no. 6, November 1992. [69] X. Li, F. Bian, H. Zhang, C. Diot, R. Govindan, W. Hong, and G. Iannaccone, MIND: A distributed multi-dimensional indexing system for network diagnosis, in Proc. of INFO- COM, 2006. 179 [70] Z. Li, M. Sanghi, B. Chavez, Y . Chen, and M.-Y . Kao, Hamsa: Fast signature genera- tion for zero-day polymorphic worms with provable attack resilience, in Proc. of IEEE Symposium on Security and Privacy, May 2006. [71] M. Liljenstam, D. M. Nicol, V . H. Berk, and R. S. Gray, Simulating realistic network worm trafc for worm warning system design and testing, in Proc. of the 2003 ACM workshop on Rapid Malcode, Washington, DC, USA, 2003, pp. 2433. [72] M. E. Locasto, J. Parekh, A. D. Keromytis, and S. Stolfo, Towards collaborative security and P2P intrusion detection, in Proc. of the 6th Annual IEEE SMC Information Assurance Workshop (IAW), June 2005, pp. 333 339. [73] S. Madden, M. J. Franklin, J. Hellerstein, and W. Hong, TAG: a tiny aggregation service for ad-hoc sensor networks, in ACM SIGOPS Operating Systems Review, vol. 36, 2002. [74] P. Maymounkov and D. Mazieres, Kademlia: A peer-to-peer information system based on the XOR metric, in Proc. of the International Workshop on Peer-to-Peer Systems (IPTPS '02), 2002. [75] B. McBride, Jena: Implementing the RDF model and syntax specication, in 2nd Int'l Semantic Web Workshop, 2001. [76] X. Meng, T. Nandagopal, L. Li, and S. Lu, Contour maps: Monitoring and diagnosis in sensor networks, Computer Networks Journal, 2006. [77] L. Miller, A. Seaborne, and A. Reggiori, Three implementations of SquishQL, a simple RDF query language, in First Int'l Semantic Web Conference, 2002. [78] D. S. Milojicic, V . Kalogeraki, R. Lukose, K. Nagaraja1, J. Pruyne, B. Richard, S. Rollins, and Z. Xu, Peer-to-Peer computing, HP Laboratories, Palo Alto, Tech. Rep., July 2003. [79] D. Moore, V . Paxson, S. Savage, C. Shannon, S. Staniford, and N. Weaver, Inside the Slammer worm, IEEE Magazine of Security and Privacy, August 2003. [80] D. Moore, C. Shannon, and k claffy, Code-Red: a case study on the spread and victims of an Internet worm, in Proc. of the 2nd ACM SIGCOMM Workshop on Internet measurment, 2002, pp. 273284. [81] D. Moore, C. Shannon, G. M. V oelker, and S. Savage, Internet quarantine: Requirements for containing self-propagating code, in Proc. of the 22th Conference on Computer Com- munications, 2003. [82] W. Nejdl, M. Wolpers, W. Siberski, C. Schmitz, M. Schlosser, I. Brunkhorst, and A. Lser, Super-peer-based routing and clustering strategies for RDF-based peer-to-peer networks, in 12th World Wide Web Conference, May 2003. [83] W. Nejdl, B. Wolf, C. Qu, S. Decker, M. Sintek, A. Naeve, M. 
Nilsson, M. Palmer, and T. Risch, Edutella: A P2P networking infrastructure based on RDF, in Proc. of the World Wide Web Conference (WWW2002), Hawaii, May 2002, pp. 711. 180 [84] J. Newsome, B. Karp, and D. Song, Polygraph: Automatic signature generation for poly- morphic worms, in Proc. of IEEE Security and Privacy Symposium, May 2005. [85] J. T. W. Page, R. G. Guy, G. J. Popek, J. S. Heidemann, W. Mak, , and D. Rothmeier, Management of replicated volume location data in the cus replicated le system, in Proc. of the USENIX Conference, 1991. [86] V . Paxson, Bro: a system for detecting network intruders in real-time. Computer Net- works, vol. 31, no. 23-24, pp. 24352463, 1999. [87] K. Petersen, M. J. Spreitzer, D. B. Terry, M. M. Theimer, and A. J. Demers, Flexible update propagation for weakly consistent replication, in Proc. of the 16th ACM symposium on Operating systems principles, 1997, pp. 288301. [88] C. G. Plaxton, R. Rajaraman, and A. W. Richa, Accessing nearby copies of replicated objects in a distributed environment, in ACM Symposium on Parallel Algorithms and Ar- chitectures, 1997, pp. 311320. [89] G. Popek, The Locus Distributed System Architecture. The MIT Press, 1986. [90] B. Przydatek, D. Song, and A. Perrig, SIA: secure information aggregation in sensor net- works, in SenSys '03: Proc. of the 1st International Conference on Embedded Networked Sensor Systems, 2003, pp. 255265. [91] W. Pugh, Skip lists: A probabilistic alternative to balanced trees, in Proc. of the Work- shop on Algorithms and Data Structures, 1989. [92] M. O. Rabin, Fingerprinting by random polynomials, Center for Research in Computing Technology, Harvard University, Tech. Rep., 1981. [93] R. Raman, M. Livny, and M. Solomon, Matchmaking: Distributed resource management for high throughput, in Proc. of the Seventh IEEE International Symposium on High Per- formance Distributed Computing (HPDC7), July 1998. [94] A. Rao, K. Lakshminarayanan, S. Surana, R. Karp, and I. Stoica, Load balancing in structured P2P systems, in 2nd Int'l Workshop on P2P Systems, 2003. [95] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker, A scalable content ad- dressable network, in Proc. of the 2001 Conference on Applications, Technologies, Archi- tectures, and Protocols for Computer Communications (SIGCOMM), 2001. [96] S. Ratnasamy, S. Shenker, and I. Stoica, Routing algorithms for DHTs: Some open ques- tions, in 2nd International Workshop on Peer-to-Peer Systems (IPTPS '03), February 2003. [97] A. Reggiori, D.-W. van Gulik, and Z. Bjelogrlic, Indexing and retrieving semantic web resources: the RDFStore model, in Proc. of the SWAD-Europe Workshop on Semantic Web Storage and Retrieval, 2003. 181 [98] R. V . Renesse, K. P. Birman, and W. V ogels, Astrolabe: A robust and scalable technology for distributed system monitoring, management, and data mining, ACM Trans. Comput. Syst., vol. 21, no. 2, pp. 164206, 2003. [99] P. Reynolds and A. Vahdat, Efcient peer-to-peer keyword searching, in Proc. of the ACM/IFIP/USENIX International Middleware Conference(Middleware 2003), 2003. [100] M. Ripeanu, I. Foster, and A. Iamnitchi, Mapping the Gnutella network: Properties of large-scale peer-to-peer systems and implications for system design, IEEE Internet Com- puting Journal, vol. 6, no. 1, 2002. [101] M. Ripeanu and I. Foster, A decentralized, adaptive, replica location mechanism, in Proc. of the 11th IEEE Internationa Symposium on High Performance Distributed Computing (HPDC-11), Edinburgh, Scotland, June 2002. [102] M. 
Roesch, Snort - lightweight intrusion detection for networks, in Proc. of USENIX 13th Systems Administration Conf. (LISA '99), Berkeley, CA, 1999, pp. 229 238. [103] D. D. Roure, N. R. Jennings, and N. R. Shadbolt, The Semantic Grid: Past, present, and future, Proc. of the IEEE, vol. 93, no. 3, pp. 669681, March 2005. [104] A. Rowstron and P. Druschel, Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems, Lecture Notes in Computer Science, vol. 2218, 2001. [105] A. G. S. Boyd, B. Prabhakar, and D. Shah, Randomized gossip algorithms, the joint special issue of IEEE Transactions on Information Theory and ACM/IEEE Transactions on Networking, June 2006. [106] R. Sandberg, D. Goldberg, S. Kleiman, D. Walsh, and B. Lyon, Design and implementa- tion of the Sun network le system, in Proc. of the USENIX Conference, 1985. [107] S. Saroiu, P. K. Gummadi, and S. D. Gribble, A measurement study of peer-to-peer le sharing systems, in Proc. of Multimedia Computing and Networking 2002 (MMCN '02), San Jose, CA, USA, January 2002. [108] M. Satyanarayanan, J. J. Kistler, P. Kumar, M. E. Okasaki, E. H. Siegel, and D. C. Steere, Coda: A highly available system for a distributed workstation environment, IEEE Trans- actions on Computers, vol. 39, no. 4, pp. 447459, April 1990. [109] S. Schleimer, D. S. Wilkerson, and A. Aiken, Winnowing: local algorithms for document ngerprinting, in Proc. of the 2003 ACM SIGMOD International Conference on Manage- ment of Data (SIGMOD), San Diego, California, 2003, pp. 7685. [110] C. Schmidt and M. Parashar, Flexible information discovery in decentralized distributed systems, in 12th IEEE International Symposium on High Performance Distributed Com- puting (HPDC'03), 2003. [111] S. Sen and J. Wong, Analyzing peer-to-peer trafc across large networks, in Proc. of ACM SIGCOMM Workshop on Internet Measurment Workshop, San Jose, CA, USA, November 2002. 182 [112] J. Sidell, P. Aoki, A. Sah, C. Staelin, M. Stonebraker, and A. Yu, Data replication in Mari- posa, in Proc. of the 12th International Conference on Data Engineering, New Orleans, LA, 1996. [113] B. Silaghi, B. Bhattacharjee, and P. Keleher, Query routing in the TerraDir distributed directory, in SPIE ITCOM'02, August 2002. [114] G. Singh, S. Bharathi, A. Chervenak, E. Deelman, C. Kesselman, M. Manohar, S. Pail, and L. Pearlman, A metadata catalog service for data intensive applications, in Proc. of the Supercomputing Conference (SC2003), November 2003. [115] S. Singh, C. Estan, G. Varghese, and S. Savage, Automated worm ngerprinting, in Proc. of the ACM/USENIX Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004. [116] E. Sit and R. Morris, Security considerations for peer-to-peer distributed hash tables, in Proc. of the 1st International Workshop on Peer-to-Peer Systems. Springer-Verlag, 2002, pp. 261269. [117] L. Spitzner, Honeypots: Tracking Hackers. Addison-Wesley, 2002. [118] I. Stoica, R. Morris, D. Karger, F. Kaashoek, and H. Balakrishnan, Chord: A scalable peer-to-peer lookup service for Internet applications, in Proc. of ACM SIGCOMM, 2001. [119] H. Stuckenschmidt, R. Vdovjak, G.-J. Houben, and J. Broekstra, Index structures and algorithms for querying distributed RDF repositories, in the 14th World Wide Web Con- ference (WWW2004), May 2004. [120] C. Tang, Z. Xu, and S. Dwarkadas, Peer-to-peer information retrieval using self- organizing semantic overlay networks, in Proc. of ACM SIGCOMM, 2003. [121] O. Tatebe, S. Sekiguchi, Y . 
Morita, S. Matsuoka, and N. Soda, Worldwide fast le repli- cation on Grid datafarm, in Proc. of the 2003 Computing in High Energy an d Nuclear Physics (CHEP03), March 2003. [122] C. Tempich, S. Staab, and A. Wranik, Remindin: Semantic query routing in peer- to-peer networks based on social metaphors, in the 14th World Wide Web Conference (WWW2004), May 2004. [123] D. B. Terry, K. Petersen, M. J. Spreitzer, , and M. M. Theimer, The case for non- transparent replication: Examples from bayou, in Proc. of the 14th International Con- ference on Data Engineering, 1998. [124] P. Thibodeau, Planet-Scale Grid, ComputerWorld, October 10 2005. [125] S. Tuecke, K. Czajkowski, I. Foster, J. Frey, S. Graham, C. Kesselman, T. Maguire, T. Sandholm, P. Vanderbilt, and D. Snelling, The physiology of the Grid: An open Grid services architecture for distributed systems integration, Global Grid Forum Draft Rec- ommendation, 2003. 183 [126] R. van Renesse and A. Bozdog, Willow: DHT, aggregation, and publish/subscribe in one protocol, in Proc. of the International Workshop on Peer-to-Peer Systems (IPTPS '04), February 2004. [127] D. Wagner, Resilient aggregation in sensor networks, in Proc. of the 2nd ACM Workshop on Security of Ad-hoc and Sensor Networks, 2004, pp. 7887. [128] H. J. Wang, C. Guo, D. R. Simon, and A. Zugenmaier, Shield: vulnerability-driven net- work lters for preventing known vulnerability exploits, in Proc. of the 2004 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communica- tions (SIGCOMM), Portland, Oregon, USA, 2004, pp. 193204. [129] K. Wang and S. S. G. Cretu, Anomalous payload-based worm detection and signature generation, in Proc. of the 8th International Symposium on Recent Advances in Intrusion Detection, September 2005. [130] N. Weaver, V . Paxson, S. Staniford, and R. Cunningham, A taxonomy of computer worms, in Proc. of the 2003 ACM workshop on Rapid Malcode, 2003, pp. 1118. [131] N. Weaver, S. Staniford, and V . Paxson, Very fast containment of scanning worms. in Proc. of the USENIX Security Symposium, 2004, pp. 2944. [132] K.-Y . Whang, B. T. Vander-Zanden, and H. M. Taylor, A linear-time probabilistic count- ing algorithm for database applications, ACM Transaction on Database Systems, vol. 15, no. 2, pp. 208229, 1990. [133] M. Wiesmann, F. Pedone, A. Schiper, B. Kemme, and G. Alonso, Database replication techniques: A three paramater classication, in Proc. of the 19th IEEE Symposium on Reliable Distributed Systems, Nuernberg, Germany, 2000. [134] P. Yalagandula and M. Dahlin, A scalable distributed information management system, in Proc. of the 2004 conference on Applications, technologies, architectures, and protocols for computer communications, New York, NY , USA, 2004, pp. 379390. [135] B. Yang, R. Karri, and D. A. Mcgrew, An 80Gbps FPGA implementation of a universal hash function based message authentication code, Third Place Winner, 2004 DAC/ISSCC Student Design Contest, June 2004. [136] V . Yegneswaran, P. Barford, and S. Jha, Global intrusion detection in the DOMINO over- lay system, in Proc. of Network and Distributed System Security Symposium (NDSS), 2004. [137] E. R. Zayas and C. F. Everhart, Design and specication of the cellular andrew environ- ment, Carnegie-Mellon Universi ty, Tech. Rep. CMU-ITC-070, August 1988. [138] H. Zhang, A. Goel, and R. Govindan, Incrementally improving lookup latency in dis- tributed hash table systems, in Proc. 
of the 2003 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, San Diego, CA, USA, 2003, pp. 114-125.

[139] X. Zhang, J. Freschl, and J. M. Schopf, A performance study of monitoring and information services for distributed systems, in Proc. of HPDC, 2003.

[140] Z. Zhang, S.-M. Shi, and J. Zhu, SOMO: Self-organized metadata overlay for resource management in P2P DHT, in Proc. of the International Workshop on Peer-to-Peer Systems (IPTPS '03), February 2003.

[141] B. Zhao, J. Kubiatowicz, and A. Joseph, Tapestry: a fault-tolerant wide-area application infrastructure, in ACM Computer Communication Review, vol. 32, no. 1, 2002, p. 81.

[142] J. Zhao, R. Govindan, and D. Estrin, Computing aggregates for monitoring wireless sensor networks, in The First IEEE International Workshop on Sensor Network Protocols and Applications (SNPA'03), Anchorage, AK, USA, May 2003.

[143] R. Zhou and K. Hwang, PowerTrust: Robust and scalable reputation aggregation for trusted P2P computing, IEEE Transactions on Parallel and Distributed Systems (TPDS), accepted to appear, 2006.

[144] C. C. Zou, L. Gao, W. Gong, and D. F. Towsley, Monitoring and early warning for Internet worms, in Proc. of the ACM Conference on Computer and Communications Security, 2003, pp. 190-199.

Appendix A

Related Publications

[1] M. Cai and K. Hwang, Distributed Aggregation Schemes for Scalable Peer-to-Peer and Grid Computing, submitted to IEEE Trans. on Parallel and Distributed Systems, Sept. 2006.

[2] M. Cai, K. Hwang, J. Pan, and C. Papadopoulos, WormShield: Fast Worm Signature Generation with Distributed Fingerprint Aggregation, revision submitted to IEEE Trans. on Dependable and Secure Computing, June 2006.

[3] K. Hwang, M. Cai, Y.-K. Kwok, S. Song, Yu Chen, and Ying Chen, DHT-based Security Infrastructure for Trusted Internet and Grid Computing, Int'l Journal of Critical Infrastructures, 2(4), 2006.

[4] A. L. Chervenak and M. Cai, Applying Peer-to-Peer Techniques to Grid Replica Location Services, Journal of Grid Computing, 4(1), 2006, pp. 49-69.

[5] M. Cai, K. Hwang, Y.-K. Kwok, S. Song, and Y. Chen, Collaborative Internet Worm Containment, IEEE Security and Privacy, May/June issue, 2005.

[6] M. Cai, M. Frank, B. Yan, and R. MacGregor, A Subscribable Peer-to-Peer RDF Repository for Distributed Metadata Management, Journal of Web Semantics, 2(2), 2005.

[7] M. Cai, M. Frank, J. Chen, and P. Szekely, MAAN: A Multi-Attribute Addressable Network for Grid Information Services, Journal of Grid Computing, 2(1), 2004, pp. 3-14.

[8] M. Cai, A. Chervenak, and M. Frank, A Peer-to-Peer Replica Location Service Based on a Distributed Hash Table, in Proc. of the 2004 ACM/IEEE Conference on Supercomputing (SC2004), Pittsburgh, Nov. 2004.

[9] M. Cai and M. Frank, RDFPeers: A Scalable Distributed RDF Repository based on a Structured Peer-to-Peer Network, in Proc. of the 13th Int'l World Wide Web Conference (WWW2004), New York, May 2004.

Appendix B

Solution to Equation 3.1

We solve this equation in two cases: (a) d = 2^k - 2, and (b) d = 2^k - 2 + δ, where k = 1, 2, ..., b-1 and 1 ≤ δ < 2^k. For case (a), we have x = 2^k - 2 + 2^k. Therefore, g(x) = k = log_2((x+2)/2). For case (b), since x = 2^k - 2 + δ + 2^(k+1) = 3·2^k - 2 + δ, then g(x) = ⌈log_2(d + 2)⌉ = ⌈log_2(2^k + δ)⌉ = ⌈log_2((x+2)/3 + 2δ/3)⌉.
Also, since 1 ≤ δ < 2^k, we have

⌈log_2((x+2)/3)⌉ ≤ ⌈log_2((x+2)/3 + 2δ/3)⌉ = ⌈log_2((3·2^k + δ)/3 + 2δ/3)⌉ = ⌈log_2(2^k + δ)⌉ ≤ ⌈log_2(2^k + 2^k)⌉ = k + 1 < ⌈log_2(2^k + δ/3)⌉ + 1 = ⌈log_2((x+2)/3)⌉ + 1.

Thus, the solution for case (b) is g(x) = ⌈log_2((x+2)/3)⌉. To normalize g(x) for these two cases, we prove that log_2((x+2)/2) = ⌈log_2((x+2)/3)⌉ when x = 2^(k+1) - 2, as follows:

⌈log_2((x+2)/3)⌉ = ⌈log_2 2^(k+1) - log_2 3⌉ = ⌈k + 1 - log_2 3⌉ = k = log_2((x+2)/2).

Therefore, we can derive g(x) = ⌈log_2((x+2)/3)⌉ from Eq. (3.1) for both cases.
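The case analysis above can also be checked numerically. The following sketch verifies, for small k and δ, the identities used in the derivation; it does not restate Eq. (3.1) itself, which is defined earlier in the dissertation.

    # Numeric check of the identities above: for x = 2^(k+1) - 2 (case a),
    # log2((x+2)/2) = ceil(log2((x+2)/3)) = k; for x = 3*2^k - 2 + delta with
    # 1 <= delta < 2^k (case b), ceil(log2(2^k + delta)) = ceil(log2((x+2)/3)) = k + 1.
    import math

    for k in range(1, 12):
        x = 2 ** (k + 1) - 2                   # case (a)
        assert math.log2((x + 2) / 2) == k
        assert math.ceil(math.log2((x + 2) / 3)) == k
        for delta in range(1, 2 ** k):         # case (b)
            x = 3 * 2 ** k - 2 + delta
            assert math.ceil(math.log2(2 ** k + delta)) == k + 1
            assert math.ceil(math.log2((x + 2) / 3)) == k + 1
    print("case analysis verified for k = 1..11")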
Abstract
Peer-to-Peer (P2P) systems and Grids are emerging as two novel paradigms of distributed computing for wide-area resource sharing on the Internet. In these two paradigms, it is essential to discover resources by their attributes and to acquire global information in a fully decentralized fashion. This dissertation proposes a multi-attribute addressable network (MAAN) for resource indexing, a distributed aggregation tree (DAT) for information aggregation, and a distributed counting scheme for estimating global cardinalities.