USC Computer Science Technical Reports, no. 828 (2004)
Epidemic Sampling for Search in Unstructured
Peer-to-Peer Networks
Farnoush Banaei-Kashani, Cyrus Shahabi, Muhammad Sahimi
University of Southern California, Los Angeles, CA 90089, USA
[banaeika,shahabi,moe]@usc.edu
Abstract—Unstructured peer-to-peer (P2P) networks are
self-organizing and dynamic. Therefore, they are unindex-
able and without indexing, efficient search is only possi-
ble by efficient and intelligent dissemination of the query
to scan the network nodes/objects. Existing search mech-
anisms are rare, inefficient, and naive without theoretical
foundation. In this paper, first we define a query model
that formalizes and generalizes the typical P2P exact-match
search queries to partial selection queries, i.e., selection
queries that can be satisfied by a partial result-set (rather
than the entire result-set). As compared to exact-match
queries, partial selection queries are not only more general,
but also more cost-efficient and practical. Even though the
existing search mechanisms can be applied to answer such
queries, none of them are designed to answer these queries
intelligently and efficiently. Subsequently, we introduce
our simple but elegant epidemic-based search mechanism,
termed the SIR sampling mechanism, which is specially de-
signed to answer partial selection queries efficiently. Based
on a rigorous percolation analysis, we can tune our SIR
sampling mechanism on-the-fly and per-query to take a just
sufficiently large sample of the network nodes/objects to sat-
isfy the partial selection query. Our empirical study shows
that SIR sampling can strike a balance between communi-
cation cost and response time of the query. For the com-
mon case of the P2P partial selection queries, SIR sampling
outperforms flooding by up to two orders of magnitude in
communication cost while maintaining a tolerable response-
time. Also, SIR sampling outperforms a 32-random-walker
in both response time and communication cost.
Index Terms—System design, graph theory
I. INTRODUCTION
An unstructured peer-to-peer (P2P) network is a feder-
ation of peer nodes that assemble to share their resources.
Search to access the resources/objects is the most funda-
mental operation in such networks. Traditionally, index
structures are used for efficient search in large-scale dis-
tributed object repositories such as distributed databases.
With indexing, the repository is organized/structured into
a distributed data structure that allows real-time search
with minimum cost and short response-time. However,
by definition unstructured P2P networks are both self-
organizing and dynamic. The combination of these two
characteristics renders unstructured P2P networks
unindexable: constructing the distributed index structure by
structuring the network (e.g., imposing particular network
topology, object placement, etc.) is in conflict with au-
tonomy of the nodes in self-organizing the P2P network.
Even if the structure is imposed on the network, the over-
head of maintaining the data structure with dynamic node-
set and object-set exceeds its benefit. Without indexing,
efficient search is only possible by efficient scanning of
the network nodes.
Distributed index structures such as DHTs [1], [2], [3],
[4] are search mechanisms that only apply to structured
P2P networks such as P2P file storage systems [5]. Struc-
tured P2P networks are inherently indexable by applica-
tion. For search in unstructured P2P networks, there are
two main proposals: flooding [6] and random walk [7] (footnote 1).
With both of these search mechanisms, the query is disseminated
throughout the network by recursive forwarding from node to node
in order to scan the network nodes and locate the query objects.
With flooding, each node that receives the query forwards it to
all of its neighbors, whereas with random walk the query is
forwarded to only one randomly selected neighbor. Both of these
approaches are
inefficient, because they fail to strike a balance between
the two metrics of efficiency for query dissemination, i.e.,
the communication cost and the response time. Flooding
is most efficient in query time but incurs too high a
communication cost to be practical, whereas a random
walker is potentially more efficient in communication cost
but is intolerably slow in scanning the network. In [7], us-
ing k random walkers in parallel is proposed as a way to
balance the communication cost and the response time of
the query. However, this proposal does not provide any
theoretical basis for selecting the value of k for optimal
performance. In general, all of the mechanisms proposed
for search in unstructured P2P networks are studied only
empirically and are not supported by any firm theoretical
basis.
Footnote 1: Here, we focus on the unstructured P2P networks that are
completely unindexable. There are a number of search mechanisms
proposed for unstructured P2P networks that assume imposing some kind
of organization on the network [8], [9], [10], [11]. These approaches
do not apply to unindexable P2P networks.
We propose an epidemic-based (or gossip-based) query
dissemination mechanism for search in unstructured P2P
networks. The process of disease dissemination in a social
network has been previously used as a model to design
information dissemination techniques for various applica-
tions (e.g., for maintenance of replicated databases [12]).
Here, we discuss benefits of adopting this process as a
model for query dissemination in P2P networks. With
an epidemic query dissemination mechanism, query for-
warding is probabilistic, i.e., a node forwards a query to
each neighbor with forwarding probability $p$ ($0 \le p \le 1$).
Therefore, a node may forward the query to zero or more
neighbors at each time. Such a query forwarding algo-
rithm is obviously more flexible and subsumes both flood-
ing and random walk. Potentially, an epidemic query dis-
semination mechanism can be tuned to the best operation
point to strike a balance between the communication cost
and the response time of the query. The tuning can be
done rigorously using mathematical models that support
epidemic-based algorithms (footnote 2). Moreover, epidemic query
dissemination mechanisms are simple to implement, and
since they are randomized (and not deterministic), they
are more reliable to use in dynamic systems such as un-
structured P2P networks.
In this paper, first we define a query model that general-
izes the typical P2P exact-match search queries to partial
selection queries, i.e., selection queries that can be satis-
fied by a partial result-set (rather than the entire result-
set). As compared to exact-match queries, partial selec-
tion queries are not only more general, but also more cost-
efficient. With this query model, in order to avoid any
extra/useless communication cost and query processing
time, users define the fractional size $\epsilon$ ($0 \le \epsilon \le 1$) of
the result-set that is sufficient to satisfy their query. An
exact-match query is a specific case of partial selection
query with $\epsilon = 1$, and a selection predicate that can only
be a conjunction of equality conditions. It is important
to note that all of the existing search mechanisms can be
used to answer partial selection queries but none of them
are designed to answer these queries intelligently and ef-
ficiently.
Subsequently, we introduce our epidemic-based search
mechanism, termed the SIR sampling mechanism, which
is modelled on the common SIR epidemic model [13] and
is specially designed to answer partial selection queries
efficiently. For each partial selection query with a specific
$\epsilon$, our SIR sampling mechanism is tuned on-the-fly and per
query to disseminate the query with the minimum query forwarding
probability $p$ (and hence, minimum communication cost) such that
only a subset (or sample) of the network nodes/objects is covered,
one that is just sufficiently large to satisfy the query.
Consequently, SIR sampling can adjust its communication cost and
response time according to the expected $\epsilon$ value per partial
selection query and avoid redundant communication. We show that due
to a phase transition phenomenon associated with the SIR epidemic
model, for the common case of the P2P partial selection queries
avoiding the redundant communication can result in up to two orders
of magnitude improvement in the efficiency of the sampling mechanism
as compared to flooding (which is a specific case of the SIR query
dissemination with fixed $p = 1$).
Footnote 2: These models were originally developed to explain disease
epidemics in human populations [13].
The main challenge with SIR sampling is how to tune
p to its best operation point per query. To address this
challenge, we use a percolation model to derive the rela-
tion between p and the sample-size analytically. Instead
of the traditional mathematical models developed to ana-
lyze the SIR epidemic model, for this analysis we use the
percolation theory to be able to consider the effect of the
particular P2P network topology (in this case, power-law
topology) in the analysis. The traditional SIR mathemati-
cal models assume a fully connected topology for the net-
work to simplify the analysis. However, it is shown that
considering the structure of the network in the analysis
extensively affects the results [14].
Finally, we perform an empirical study via simulation
to both verify our theoretical results, and compare the
efficiency of our SIR sampling mechanism versus other
search mechanisms. The regular flooding (mentioned
above) is not designed to answer partial selection queries
and regardless of $\epsilon$, always covers the entire network.
Instead, some P2P networks use scope-limited flooding
(i.e., flooding with limited TTL) to limit the coverage of
the query. However, with scope-limited flooding one can
never ensure whether a sufficiently large fraction of the
network is covered to satisfy a particular partial selection,
unless the size of the network is known globally. The
same limitation applies to query dissemination by random
walk. Therefore, these search mechanisms are also inca-
pable of answering partial selection queries efficiently. On
the other hand, with the SIR query dissemination the frac-
tional coverage is naturally enabled by probabilistic flood-
ing and no global information is required. In our exper-
iments, we compared the efficiency of our SIR sampling
mechanism with that of the scope-limited flooding and
$k$-random-walkers (with various $k$) by artificially informing
them about the coverage required to satisfy each partial se-
lection query. Our results show that even under such arti-
ficial conditions, SIR sampling outperforms scope-limited
flooding in communication cost while maintaining a rea-
sonable response time. Also, to our surprise, SIR sam-
pling not only has a much better response time as com-
pared to that of the random-walk but also outperforms a
32-random-walker (the optimal case as suggested in [7])
even in communication cost.
The remainder of this paper is organized as follows. In
Section II, we formally define the search problem by char-
acterizing the partial selection query model, explaining
our efficiency metrics, and discussing our assumed P2P
network model. Section III elaborately describes our SIR
sampling mechanism. In Section IV, we provide our the-
oretical analysis that enables the tuning of the SIR sam-
pling. Section V presents the results of our empirical
study. Finally, Section VI concludes the paper and discusses
the future directions of this research. We discuss related
work in context throughout the paper.
II. FORMAL DEFINITION OF THE PROBLEM
A. Data and Query Model
Consider a collective set of objects (e.g., files) shared
by the nodes of a P2P network; we abstract out the semantics
of each object $u$ by a tuple with $d$ attributes:
$u = \langle a_1, a_2, \ldots, a_d \rangle$. Hence, in relational
data model terms, the object content of the P2P network comprises
a relation $r$ (optionally, containing replicated objects; see
footnote 3) with $d$ attributes, which is horizontally fragmented
and distributed
among the nodes of the network. For instance, these at-
tributes can be the characteristic attributes of music files
(e.g., artist-name, album-name, album-year) shared in a
P2P audio-file-sharing network such as Piolet [15], or
they can be the amount of computing resources (e.g.,
CPU, memory, storage, bandwidth) collectively shared as
a computing unit/object in a P2P distributed computing
system such as SETI@home [16], or they can be inter-
face attributes (e.g., service-input, service-output, service-
address) of a web/grid service shared in a web/grid ser-
vices infrastructure such as NEESgrid [17]. We assume
that the objects may be replicated, and nodes can share
any number of objects, with possibly non-uniform distri-
bution of the object population across the nodes.
With this data model, the generalized form of the typ-
ical P2P exact-match search queries is a basic selection
query:
SELECT a_1, a_2, ..., a_n
FROM r
WHERE P(u)
where $P$ is the selection predicate (footnote 4). For example, in a
distributed computing system, to schedule a 512KB task on a
computing unit with a fast CPU, one can use the following
search query to locate possible choices:
SELECT unit-address
FROM r
WHERE (CPU = 2.8MHz) AND (memory > 512KB)
Footnote 3: Note that here we relax the object-uniqueness integrity
constraint for relations. Thus, a relation is a multiset that may
contain several replicas of the same object $u$.
Footnote 4: We emphasize that the result-set of a selection query is
also a relation/multiset and may include several replicas of the same
tuple. It is important that the replicated tuples are preserved in the
result-set of the query. For instance, they may refer to different
replicas of the same object satisfying the selection predicate. If the
object replicas pertain to different QoS properties, or in those cases
where the querier needs several replicas of the object for collective
service (e.g., for the resilient P2P swarm-streaming [18]), the
replicated tuples should not be eliminated from the result-set.
However, since P2P networks and, therefore, the relation $r$ tend
to grow large in scale, inspecting the entire network/relation to
process a query is often too expensive (in communication cost and
response time). On the other hand, depending on the application
some subset $X'$ of the complete result-set $X$ of the selection
query, i.e., only a partial result-set, may be sufficient for the
purpose of the application (e.g., see the above example query).
Thus, we further generalize the search query to partial selection
query, i.e., a selection query that can be satisfied by a partial
result-set. A partial selection query is formalized as a selection
query with an additional user-defined parameter $\epsilon$ to specify
the required completeness of the result-set. We term such a query an
$\epsilon$-query for short. To measure the relative completeness of
the result-set, we define the Recall Ratio (RR) for the partial
result-set $X'$ as $RR(X') = \frac{|X'|}{|X|}$. An $\epsilon$-query is
satisfied by a result-set $X'$ ($X' \subseteq X$) if and only if
$RR(X') \ge \epsilon$. A selection query is a specific case of
$\epsilon$-query with $\epsilon = 1$ (footnote 5).
Footnote 5: In an open computing environment such as a P2P network,
there should be some kind of incentive for the users not to use
$\epsilon$ values larger than what they need, because queries with
larger $\epsilon$ values probably cost the system more to resolve
uselessly. One can think of an auditing mechanism where users collect
credit by collaborating with other peers in resolving search queries,
and use the credit to issue search queries, with higher charge for
queries with larger $\epsilon$ values.
We can show that the recall ratio of the partial result-set of a
query is proportional to the number of the objects visited/inspected
during processing of the query and, therefore, one can deduce the
extent of the object inspection needed to satisfy an $\epsilon$-query
for a given $\epsilon$. As we discussed in Section I, with
unstructured P2P networks the relation $r$ is unindexable and,
analogous to the brute-force scanning of unindexable relations in
databases, a P2P query is processed by undirected dissemination of
the query throughout the network to take a sample from the object-set
$r$. During sampling, the query visits a subset of the network nodes
and inspects their object content against the selection predicate. To
illustrate, one can consider the P2P network with its object content
as a bag of total $n$ ($n = |r|$) marbles where $m$ ($m = |X|$) of
them are gray (the objects that satisfy the selection predicate of
the query) and the rest are white (see Figure 1-a). It is easy to see
that the expected recall ratio $E[RR(X')]$ of the query is equal to
the relative size of the object sample $S$, i.e.,
$E[RR(X')] = E[\frac{l}{m}] = \frac{s}{n}$, where $l$ and $s$ are the
sizes of the partial result-set and the sample, respectively
(footnote 6). Thus, to satisfy an $\epsilon$-query one can take a
sample of any size $s \ge \epsilon n$ from the P2P bag and expect
that the sample is large enough to satisfy the $\epsilon$-query.
More strongly, one can show that in the limit of large network
sizes, the recall ratio of individual $\epsilon$-queries (i.e.,
$RR(X')$) seldom differs much from the expected recall ratio
$E[RR(X')]$ and, therefore, with any sample size $s \ge \epsilon n$
every $\epsilon$-query is satisfied with high probability (whp)
(footnote 7). The probability $p_{sat}$ of satisfying an
$\epsilon$-query with $\epsilon = \frac{l}{m}$ by taking a sample of
size $s$ from a network with $n$ objects is computed as follows:

$$p_{sat} = \frac{\sum_{i=l}^{m} \binom{m}{i} \binom{n-m}{s-i}}{\binom{n}{s}}$$

in which the numerator is the number of desired cases where the
sample includes at least the fraction $\epsilon = \frac{l}{m}$ of the
gray marbles (the desired objects) and the denominator is the total
number of possible cases. Figure 1-b depicts $p_{sat}$ as a function
of the tolerable recall ratio of the query ($\epsilon$) and the
relative sample size ($s/n$), for a small network with $n = 1000$
objects and a query with selectivity $m/n = 1\%$. It is evident that
for an $\epsilon$-query with a fixed $\epsilon = \epsilon_0$, roughly
with any relative sample size $s/n < \epsilon_0$ the query
satisfaction probability takes near-zero values while the probability
makes a sudden transition to near-one values with any sample size
$s/n > \epsilon_0$. In the limit of large network sizes, this
transition is sharp and with any sample of size $s/n \ge \epsilon_0$
every $\epsilon_0$-query is satisfied whp.
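The query satisfaction probability above is a hypergeometric tail sum, so it can be evaluated directly with exact integer binomials. The following is a minimal sketch, not the authors' code; the function name `p_sat` and the example values of the sample size are illustrative assumptions, while n = 1000 and a 1% selectivity follow the setting described for Figure 1-b.

```python
from math import comb


def p_sat(n: int, m: int, s: int, l: int) -> float:
    """Probability that a uniform sample of s objects, drawn without
    replacement from n objects of which m match the predicate,
    contains at least l matching objects."""
    if not (0 <= m <= n and 0 <= s <= n and 0 <= l <= m):
        raise ValueError("need 0 <= m <= n, 0 <= s <= n, 0 <= l <= m")
    total = comb(n, s)  # all possible samples of size s
    # A sample cannot hold more matches than min(m, s).
    hits = sum(comb(m, i) * comb(n - m, s - i)
               for i in range(l, min(m, s) + 1))
    return hits / total


if __name__ == "__main__":
    n, m = 1000, 10          # 1% selectivity
    l = 5                    # epsilon = l / m = 0.5
    for s in (250, 500, 750):
        print(f"s/n = {s / n:.2f}  p_sat = {p_sat(n, m, s, l):.4f}")
```

Sweeping `s` for a fixed `l` traces one slice of the surface in Figure 1-b. With a small `m` the transition around s/n = epsilon is gradual; it sharpens as `n` and `m` grow.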
Sampling from the object-set of a P2P network (to answer an
$\epsilon$-query as discussed above) translates to sampling from the
node-set of the network via some query dissemination mechanism such
as flooding. We call such a mechanism a (node/object) sampling
mechanism. Considering the distribution of the object population
among the network nodes (in Section II-D we consider uniform
distribution), the required size of the node-sample to answer an
$\epsilon$-query can be determined based on the required size of the
object-sample, which in turn is proportional to $\epsilon$ as
discussed above (see Section III-A.2 for more details on the
relationship between these three parameters). In this paper, we focus
on developing efficient sampling mechanisms to answer
$\epsilon$-queries. Next, we define the notion of communication graph
to model and visualize the dynamic process of query dissemination as
implemented by a generic sampling mechanism, and subsequently discuss
the two measures of efficiency for sampling mechanisms.
Fig. 1. Query recall ratio versus sample size. (a) Sampling from the
P2P bag ($r$ [n], $X$ [m], $X'$ [l], $S$ [s]). (b) Query satisfaction
probability ($p_{sat}$) versus relative sample size ($s/n$) and
expected recall ratio ($\epsilon$).
Footnote 6: Here, we assume the sample $S$ is a uniform random sample
from the object-set. With random assignment of the objects to the
nodes of the network, this assumption is valid even with non-uniform
sampling from the node-set.
Footnote 7: By the term “with high probability (whp)”, we mean with a
probability greater than $p = 1 - \frac{1}{n^{O(1)}}$.
B. Generic Model for Query Dissemination
We model a P2P overlay network with the node-set $N$ and the
link/edge-set $E$ as an undirected graph $G(N, E)$. For a query
initiated at time $t = t_0$, the communication graph of the query at
time $t \ge t_0$ is a subgraph $G_t(N_t, E_t)$ of $G$, where
$E_t \subseteq E$ is the set of links traversed by at least one query
replica during the time interval $[t_0, t]$, and $N_t \subseteq N$ is
the set of nodes visited by at least one query replica during the
same time interval. Associated with any link $e \in E_t$ is a weight
$w_e$ that is the number of times the link $e$ is traversed during
the time interval. We assume a discrete time model. Thus, the dynamic
process of disseminating a query is completely represented by the set
of communication graphs
$\{G_{t_0}, G_{t_0+1}, G_{t_0+2}, \ldots, G_{t_0+T}\}$, where at time
$t = t_0 + T$ query dissemination terminates (hence, for all
$t \ge t_0 + T$, $G_t = G_{t_0+T}$). The communication graph is a
generic model to visualize the query dissemination process employed
by any sampling mechanism. For example, Figure 2 depicts the first
six communication graphs of a query that is initiated at node A and
disseminated based on the random walk sampling mechanism.
C. Efficiency Measures for Sampling
Based on the notion of the communication graph, we
define two main metrics to measure the efficiency of sam-
pling as follows:
1) Communication cost (C): Assuming uniform link cost and uniform
query size, we model the communication cost of taking a sample as
the total number of query replicas communicated between the nodes
during the query dissemination process. In communication-graph
terminology:

$$C = \sum_{e \in E_{t_0+T}} w_e$$

2) Sampling time (T): Assuming uniform link latency, the
sampling/query time is the total time it takes to disseminate the
query. In communication-graph terminology, the sampling time equals
the number of time steps $T$ after which dissemination terminates
(at $t = t_0 + T$):

$$T = T$$

Fig. 2. Communication graph (six panels, $t = t_0$ through
$t = t_0 + 5$, for a query initiated at node A).
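To make the two metrics concrete, the sketch below builds the communication graph of a single random walker and reads the communication cost and sampling time off it. It is an illustration under stated assumptions rather than the paper's simulator: the adjacency-dictionary representation, the function names, and the three-node example overlay are all hypothetical.

```python
import random
from collections import Counter


def random_walk_communication_graph(adj, start, steps, rng):
    """Walk `steps` hops from `start` over the undirected overlay `adj`.

    Returns (visited, weights): the visited node-set N_t and a Counter
    mapping each traversed undirected link to its weight w_e."""
    visited = {start}
    weights = Counter()
    node = start
    for _ in range(steps):
        nxt = rng.choice(adj[node])            # one random neighbor
        weights[frozenset((node, nxt))] += 1   # link traversed once more
        visited.add(nxt)
        node = nxt
    return visited, weights


def communication_cost(weights):
    """C = sum of the link weights w_e of the final communication graph."""
    return sum(weights.values())


if __name__ == "__main__":
    # Hypothetical overlay, not the topology of Figure 2.
    adj = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}
    T = 5  # sampling time: number of discrete steps until termination
    visited, weights = random_walk_communication_graph(
        adj, "A", T, random.Random(1))
    print("N_t:", sorted(visited))
    print("C:", communication_cost(weights), "T:", T)
```

For a single walker each step sends exactly one query replica, so C equals T here. Mechanisms that forward to several neighbors per step make C grow faster than T.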
D. Peer-to-Peer Network Model
Up to this point, our discussion was generic to unstruc-
tured P2P networks. Here, we specify our model for P2P
networks. We model the characteristics of the network on
those of the real P2P networks (e.g., Gnutella [19] and
Kazaa [20]), as captured by numerous empirical studies
such as [21], [22], [23], [24]:
• Network topology: The topology of the P2P network is a power-law
  (or scale-free) random graph, i.e., a random graph (footnote 8)
  with power-law probability distribution for node degrees. The
  degree of a node is the number of edges connected to the node.
  Intuitively, in a power-law random graph most of the nodes are of
  low degree while there are still a few nodes with very high
  connectivity. We define the power-law probability density function
  (pdf) for the node degree $k$ as follows:

  $$p_k = C k^{-\gamma} e^{-k/\nu}$$

  where $\gamma$, $\nu$, and $C$ are constants. $\gamma$ is the skew
  factor of the power-law distribution, often in the range
  $2 < \gamma < 3.75$ for real networks. For example, a case study
  reports $\gamma = 2.3$ for Gnutella [21]. The smaller the skew
  factor, the heavier the tail of the power-law distribution, which
  translates to a larger number of highly connected nodes. A pure
  power-law distribution does not include the exponential cutoff
  factor ($e^{-k/\nu}$), allowing for nodes with infinite degree,
  which is unrealistic for real P2P networks. The cutoff factor with
  index $\nu$ shortens the heavy tail of the power-law distribution
  such that the maximum node degree for the nodes of the graph is in
  the same order of magnitude as $\nu$. Finally, $C$ is the
  normalization factor that is computed as
  $C = [\mathrm{Li}_{\gamma}(e^{-1/\nu})]^{-1}$, where
  $\mathrm{Li}_{\gamma}(x) = \sum_{k=1}^{\infty} \frac{x^k}{k^{\gamma}}$
  is the $\gamma$-th polylogarithm function of $x$.
  Footnote 8: A random graph is a graph in which the interconnection
  between nodes is defined probabilistically rather than
  deterministically. See [25] for background information about random
  graphs.
• Object popularity: In [24], it is shown that the popularity of
  objects in P2P networks follows a bi-modal Zipf distribution. Here,
  to simplify the model we assume a regular Zipf distribution. With
  the Zipf distribution, if the objects are sequentially ranked
  according to the descending order of access frequency (i.e., the
  number of times each object is requested in unit of time), then the
  access frequency for the $i$-th object $u$ is given by
  $f_u = C i^{-\theta}$, where $\theta$ is the skew factor of the
  distribution and $C$ is the normalization constant. For P2P
  networks, we assume a Zipf distribution with skew factor
  $\theta = 1$ (see [24]).
• Object replication: We assume that in the object-set $r$ of the P2P
  network, object $u$ is replicated $c_u$ times. With a common object
  caching policy where nodes store the objects that they request, the
  replication factor $c_u$ of the object is proportional to its
  access frequency $f_u$. Therefore, assuming the total population of
  the objects in $r$ (which includes all object replicas) is
  $|r| = n$, the replication factor of the object $u$ is defined such
  that:

  $$\begin{cases} \sum_{u \in r} c_u = n \\ c_u \propto f_u \end{cases}$$

• Object distribution: We assume objects are distributed uniformly
  among the network nodes.
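The degree distribution with exponential cutoff can be tabulated numerically by truncating the polylogarithm series that defines the normalization factor. The sketch below is illustrative only: the function names and the truncation length are assumptions, the skew factor 2.3 is the Gnutella value quoted above, and the cutoff index of 100 is an arbitrary example value that does not come from the paper.

```python
import math


def polylog(gamma: float, x: float, terms: int = 10_000) -> float:
    """Truncated series for Li_gamma(x) = sum_{k>=1} x^k / k^gamma.

    For 0 < x < 1 the terms decay geometrically, so a finite number of
    terms is an adequate approximation (an assumption of this sketch)."""
    return sum(x ** k / k ** gamma for k in range(1, terms + 1))


def degree_pmf(gamma: float, cutoff: float, k_max: int = 10_000):
    """p_k = C * k^(-gamma) * exp(-k / cutoff) for k = 1..k_max."""
    c = 1.0 / polylog(gamma, math.exp(-1.0 / cutoff), terms=k_max)
    return [c * k ** -gamma * math.exp(-k / cutoff)
            for k in range(1, k_max + 1)]


if __name__ == "__main__":
    pmf = degree_pmf(gamma=2.3, cutoff=100.0)
    mean_degree = sum(k * p for k, p in enumerate(pmf, start=1))
    print(f"sum of p_k = {sum(pmf):.6f}, mean degree = {mean_degree:.3f}")
```

Most of the probability mass sits on low degrees, while the cutoff keeps the largest plausible degree on the order of the cutoff index.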
As we explain in Section IV, among the network char-
acteristics mentioned above the performance of our pro-
posed sampling mechanism is only dependent on our as-
sumptions about the network topology and the object dis-
tribution. Other assumptions, i.e., object popularity and
object replication, are only relevant to our experimental
methodology for empirical analysis (see Section V).
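As a small illustration of the popularity and replication assumptions, the sketch below derives replication factors that are proportional to Zipf access frequencies and sum to the total object population. The function name is hypothetical, and the factors are left fractional; how they would be rounded to integer replica counts in an experiment is not specified in the text and is left open here.

```python
def replication_factors(num_objects: int, n: float, theta: float = 1.0):
    """Return c_u for objects ranked 1..num_objects, with c_u
    proportional to the Zipf frequency i^(-theta) and sum(c_u) = n."""
    freqs = [i ** -theta for i in range(1, num_objects + 1)]
    total = sum(freqs)
    return [n * f / total for f in freqs]


if __name__ == "__main__":
    c = replication_factors(num_objects=100, n=10_000)
    print("top three replication factors:", [round(x, 1) for x in c[:3]])
    print("total population:", round(sum(c), 6))
```

With a skew factor of 1, the most popular object receives twice as many replicas as the second-ranked one and three times as many as the third.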
III. EPIDEMIC SAMPLING
Our proposed sampling mechanism is inspired by epi-
demic spreading/dissemination of contagious diseases in a
social network (i.e., a network that models a society, with
nodes as people and links as social links between people
who are in regular contact). A contagious disease first ap-
pears at one node (the originator of the disease), and then
through the social links disseminates throughout the net-
work in a flood-like fashion, from the infected person to
its direct neighbors, from the infected neighbors to their
neighbors, and so on. However, the success in transmission
of the disease through a social link is probabilistic,
i.e., the disease is transmitted from an infected node to
its susceptible neighbor with probability $p$ ($0 \le p \le 1$)
and fails to transmit with probability $1 - p$. The value of $p$ is
determined based on the infectiousness of the particular
disease as well as some other environmental parameters,
but the value is generic to all links of the network. When
the spreading terminates, the disease has covered/reached
a sample $S$ of the total node population $M$ ($S \subseteq M$),
where the size of $S$ increases with increasing value of $p$
(although not necessarily linearly).
By analogy, we take the dissemination of a query in a
P2P network as the spreading of a disease in a social net-
work. We model our sampling mechanism on a particu-
lar (and popular) model for disease spreading, namely the
SIR (Susceptible-Infected-Removed) model, and use the
parameter $p$ as a control parameter to determine the size
of the node-sample such that it satisfies an $\epsilon$-query with a
given $\epsilon$. Below, we first define this sampling mechanism,
the SIR sampling mechanism, as the basic technique to
answer $\epsilon$-queries. Thereafter, we introduce a number of
variants for the basic SIR sampling mechanism in order
to model additional properties of some specific P2P net-
works.
A. Basic Epidemic Sampling Mechanism: SIR Sampling
We begin by describing the SIR disease spreading
model. With this model, we also discuss the relation be-
tween the size of the sample node-set covered by the dis-
ease and the infection probability p. Subsequently, we
introduce our SIR sampling mechanism that is motivated
by the SIR epidemic model.
1) SIR Epidemic Model and Phase Transition Phenomenon: With the
SIR epidemic model, at any particular time during dissemination of a
disease each node of the social network is in one of the three
states: susceptible (S), infected (I), or removed (R). A
“susceptible” node is capable of being infected but is not infected
yet; an “infected” node is currently infected; and a “removed” node
has recovered from the infection and is both immune to further
infection and unable to infect other nodes. Now, consider the dynamic
process of disease dissemination in a social network according to the
SIR model. Initially, at time $t_0$ all nodes of the network are
susceptible except the originator of the disease, which is infected.
As the disease propagates throughout the network, if at time
$t \ge t_0$ a susceptible node $n$ has an infected neighbor $m$, at
time $t + 1$ with probability $p$ node $n$ contracts the disease from
$m$ and becomes infected (see Figure 3-a for the state transition
diagram of a node). An infected node remains in the infectious state
for a period of time $\tau$ (during which it is able to infect its
neighbors), and then finally becomes removed. We assume $\tau = 1$
without loss of generality. The disease dissemination terminates at
time $t_0 + T$ ($T \ge 1$) when the infection ceases. At this time,
all nodes of the network are either removed (i.e., affected by the
disease), or still susceptible (i.e., they survived the disease
without effect).
As we mentioned before, with our SIR-based sampling mechanism we use
the probability $p$ as a parameter to control the size of the
samples. Therefore, it is important to characterize the relation
between the infection probability $p$ and the size of the sample
node-set covered by the disease (i.e., the disease spread) according
to the SIR disease spreading model. Specifically, we are interested
in the range of the values for $p$ that whp result in disease
epidemic (or disease prevalence), as defined below. With small values
for $p$, the disease covers an infinitesimally small number of nodes
as compared to the total number of nodes $M$ in large networks; i.e.,
we have

$$\frac{R(t_0 + T)}{M} = 0 \quad \text{as } M \to \infty \qquad (1)$$

where $R(t)$ is the number of removed nodes at time $t$. The disease
dissemination is called epidemic only when the size of the covered
node-set is comparable to that of the network; i.e., when

$$\frac{R(t_0 + T)}{M} = s \quad \text{as } M \to \infty, \qquad (2)$$

where $s \in (0, 1]$.
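The SIR dynamics with an infectious period of one time step can be simulated in a few lines. The following is a minimal sketch under stated assumptions, not the authors' implementation: the adjacency-dictionary input, the function name `sir_spread`, and the ring-shaped example network are hypothetical, and the example network is far too small and regular to exhibit the large-network limit used in Equations 1 and 2.

```python
import random


def sir_spread(adj, origin, p, rng):
    """Run one SIR dissemination with infectious period tau = 1.

    Returns (removed, T): the set of nodes the disease reached and the
    number of time steps until no node is infected any more."""
    susceptible = set(adj) - {origin}
    infected = {origin}
    removed = set()
    steps = 0
    while infected:
        newly_infected = set()
        for node in infected:
            for neighbor in adj[node]:
                # Each infected node independently tries each
                # susceptible neighbor with probability p.
                if neighbor in susceptible and rng.random() < p:
                    newly_infected.add(neighbor)
        susceptible -= newly_infected
        removed |= infected        # tau = 1: infected nodes are removed
        infected = newly_infected
        steps += 1
    return removed, steps


if __name__ == "__main__":
    size = 200
    ring = {v: [(v - 1) % size, (v + 1) % size] for v in range(size)}
    for p in (0.2, 0.6, 1.0):
        removed, T = sir_spread(ring, 0, p, random.Random(7))
        print(f"p = {p:.1f}  R/M = {len(removed) / size:.3f}  T = {T}")
```

Setting p = 1 reproduces flooding, since every susceptible neighbor of an infected node is reached, while p = 0 leaves only the originator removed after a single step.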
Percolation Model for the Disease Spread: We formulate the problem
posed above as a percolation problem, namely a bond percolation
problem [26]. Consider a network $G(N, E)$ with $|N| \to \infty$, and
assume each link/bond $e \in E$ of the network is closed (i.e.,
connected) with probability $p$, or open (i.e., disconnected) with
probability $1 - p$. At $p = 0$, the network is fully disconnected.
As $p$ increases, connected components of finite size appear but they
are isolated from each other. Let $G'(N', E')$ be the subgraph of $G$
that represents the largest of these components. Since $N'$ is
finite, we have

$$\frac{|N'|}{|N|} = 0 \quad \text{as } |N| \to \infty. \qquad (3)$$

At some critical value $p = p_c$, whp most of the growing finite
components interconnect and comprise one infinite component, termed
the giant component (footnote 9). At this point the largest component
$G'$ is the giant component (because the connected components that
are still disconnected from the giant component are of finite size,
and hence, they are smaller). Unlike finite components, the size of
the giant component is comparable to the size of the entire network;
i.e.,

$$\frac{|N'|}{|N|} = s \quad \text{as } |N| \to \infty \qquad (4)$$

where $s \in (0, 1)$. This sudden transition from a sparsely
connected network to an almost globally connected network at the
critical point $p = p_c$ is termed the phase transition phenomenon
(see Figure 3-b). At the critical point, the giant component $G'$ is
a weakly connected tree-like graph that hardly contains any loops. As
$p$ increases beyond the critical value $p_c$ towards $p = 1$, not
only does the giant component become more strongly connected (with
more loops), but its relative size $s$ also increases with the merger
of other finite components. Finally, at $p = 1$, $G'$ becomes
identical to $G$.
Fig. 3. SIR epidemic model. (a) SIR state diagram (states S, I, R;
infectious period $\tau$). (b) Phase transition phenomenon
($|N'| / |N|$ versus $p$, with the critical point $p = p_c$ marked).
⁹ We should point out that our entire discussion here is probabilistic (and not deterministic). First, the process of marking the bonds as open or closed is a random process. Moreover, if $G$ is a random graph, the linkage in the original graph itself is randomized. Thus, statements such as the appearance of the giant component at $p = p_c$ claim that some event occurs with high probability over all possible cases of bond marking and, assuming random graphs, over an ensemble of random graphs with a particular degree distribution.
Considering the percolation model described above, one can answer the problem we posed as follows. Assume a disease originates at a node $n$ and disseminates throughout the social network with the infection probability $p$. Also, on the same social network consider a bond percolation model with the same probability $p$. One can observe that the disease spread maps to the percolation component that contains the originator of the disease $n$. Moreover, one can deduce that the disease is epidemic if and only if $n$ belongs to the giant component (compare the definitions for the epidemic and the giant component in Equations 2 and 4, respectively). Thus, the necessary and sufficient conditions for an epidemic can be listed as follows:
• The infection probability $p$ must be greater than the critical value $p_c$ for the giant component to exist whp.
• The originator of the disease $n$ must belong to the giant component.
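The mapping between disease spread and bond percolation described above can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names, the adjacency-dictionary representation, and the toy ring topology are assumptions made for illustration. Each link is marked once as closed (kept) with probability p, and the covered set is the percolation component that contains the originator.

```python
import random
from collections import deque


def ring_graph(n):
    """Toy topology (an assumption): node i is linked to i-1 and i+1 (mod n)."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def percolation_component(adj, origin, p, rng):
    """Nodes reachable from `origin` when every undirected link is kept
    ("closed" in the paper's terminology) independently with probability p."""
    kept = {}                      # lazily sampled state of each link
    covered = {origin}
    frontier = deque([origin])
    while frontier:
        u = frontier.popleft()
        for v in adj[u]:
            edge = (u, v) if u < v else (v, u)
            if edge not in kept:   # each link is marked exactly once
                kept[edge] = rng.random() < p
            if kept[edge] and v not in covered:
                covered.add(v)
                frontier.append(v)
    return covered


if __name__ == "__main__":
    rng = random.Random(42)
    adj = ring_graph(1000)
    for p in (0.2, 0.6, 1.0):
        size = len(percolation_component(adj, 0, p, rng))
        print(p, size / len(adj))
```

On a real power-law topology, repeating this for many values of p and averaging the relative component size would trace a curve of the kind sketched in Figure 3-b; the ring used here is only a placeholder that keeps the example self-contained.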
2) SIR Sampling Mechanism: We propose the SIR sampling mechanism to answer $\epsilon$-queries. With SIR sampling, we adopt the SIR disease dissemination model as a model for query dissemination (hereafter, we also adopt the corresponding epidemic terminology). Given an $\epsilon$-query, the SIR sampling mechanism disseminates the query with the minimum query forwarding probability $p$ (i.e., the same as the infection probability) such that the size of the covered node-set (and hence, the size of the sample object-set stored among the covered nodes) is sufficient to satisfy the $\epsilon$-query whp. The main challenge is how to select the appropriate value for $p$ to satisfy an $\epsilon$-query with a given $\epsilon$. Our solution is based on the percolation model described in the previous section. In the rest of this section, first we summarize our solution to address this challenge. We provide theoretical analysis and empirical verification for this solution in Sections IV and V, respectively. Subsequently, we describe how our SIR sampling mechanism is implemented in P2P networks and intuitively discuss why it is an efficient sampling mechanism.
Solution Summary: In Section II-A, we showed that in order to satisfy an $\epsilon$-query whp, we have to take a sample of (relative) size $s_{obj}$ from the P2P bag such that $s_{obj} \geq \epsilon$. In turn, to take an object-sample of size $s_{obj}$, the query must visit/cover a node-sample with (relative) size $s_{node}$ where nodes collectively store at least a fraction $s_{obj}$ of the network object-content. One can estimate the required $s_{node}$ as a function of $s_{obj}$ considering the distribution of the objects among the network nodes. In Section II-D, we assumed uniform object distribution among nodes; therefore, in this case $s_{node} = s_{obj}$. Thus, given the $\epsilon$ for an $\epsilon$-query one can determine the node-sample size $s_{node}$ required to satisfy the query whp. On the other hand, according to Figure 3-b, with SIR query dissemination the size of the covered node-set $s_{node} = \frac{|N'|}{|N|}$ is determined by the query forwarding probability $p$. We characterize this relation rigorously for a power-law network in Section IV. Therefore, in summary we can use $p$ as a control parameter to cover a node-sample of sufficient size $s_{node}$ storing a large enough object-sample to satisfy a particular $\epsilon$-query whp.
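The use of p as a control parameter can be sketched as a one-dimensional search. The following is an illustration under stated assumptions, not the paper's procedure: `min_forwarding_probability` is a hypothetical helper that bisects any non-decreasing coverage curve, and `toy_coverage` is an invented stand-in for the curve of Figure 3-b (it is not the paper's data).

```python
import math


def toy_coverage(p, p_c=0.05):
    """Hypothetical coverage curve s_node(p): zero up to p_c, then rising
    steeply and flattening out. Only a placeholder for the real relation."""
    return 0.0 if p <= p_c else 1.0 - math.exp(-8.0 * (p - p_c))


def min_forwarding_probability(coverage, epsilon, p_low=0.0, p_high=1.0, tol=1e-6):
    """Smallest p in [p_low, p_high] with coverage(p) >= epsilon, found by
    bisection. Assumes `coverage` is non-decreasing in p."""
    if coverage(p_high) < epsilon:
        raise ValueError("epsilon cannot be reached even at p_high")
    while p_high - p_low > tol:
        mid = (p_low + p_high) / 2.0
        if coverage(mid) >= epsilon:
            p_high = mid      # mid is sufficient; try a smaller p
        else:
            p_low = mid       # mid is insufficient; a larger p is needed
    return p_high


if __name__ == "__main__":
    for eps in (0.1, 0.5, 0.9):
        print(eps, round(min_forwarding_probability(toy_coverage, eps), 4))
```

With uniform object distribution, the target passed as `epsilon` is the required $s_{node} = s_{obj}$; with another distribution, the target would first have to be translated from $s_{obj}$ to $s_{node}$.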
Implementation: Since with any non-trivial $\epsilon$-query we have $\epsilon > 0$, the SIR sampling mechanism must be implemented such that the query dissemination is epidemic (see the definition for an epidemic disease in the previous section); otherwise, with large networks the relative size of the covered node-sample (i.e., $s_{node}$) remains close to zero and fails to satisfy the $\epsilon$-query.
To satisfy the conditions for epidemic query dissemination, firstly the query forwarding probability $p$ must be greater than the critical value $p_c$ (i.e., the critical forwarding probability for the P2P networks with power-law topology). Secondly, the query must originate at a node that whp belongs to the giant component. According to the bond percolation model that we defined in the previous section, the probability that a node belongs to the giant component is proportional to the connectivity degree of the node. On the other hand, as one of the main characteristics of the power-law networks, in such networks there are a number of very highly connected nodes, which for the same reason are also quickly reachable with a few hops from other nodes of the network [8]. Such highly connected nodes belong to the giant component whp.
Therefore, we define our simple but elegant SIR sampling algorithm to answer an $\epsilon$-query as follows:
• Step 1: Seeding the Query
First, the actual querier initiates a selective walker to locate a highly connected node of the network. With the selective walk, at each hop the query is forwarded to the neighbor with the maximum connectivity degree until the walker reaches a node with a connectivity degree higher than the degree of all its neighbors. In [8], it is shown that in a power-law network whp such a selective walker finds a highly connected node belonging to the giant component in $O(\log |N|)$ hops, where $|N|$ is the size of the network. Our experiments also verify this result.
• Step 2: Disseminating the Query
Next, the SIR query dissemination is initiated at the highly connected node located by the selective walker. The query is forwarded with a forwarding probability $p > p_c$, which is selected such that the size of the covered node-set satisfies the given $\epsilon$-query, as discussed above¹⁰.
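The two steps above can be sketched in Python as follows. This is a minimal illustration rather than the authors' simulator: the function names and the adjacency-dictionary representation are assumptions, ties in the selective walk are resolved by stopping when no neighbor has a strictly higher degree (which guarantees termination), and time is counted in synchronous rounds.

```python
import random


def selective_walk(adj, start):
    """Step 1: hop to the highest-degree neighbor until no neighbor has a
    strictly higher degree than the current node. Returns (node, hops)."""
    current, hops = start, 0
    while adj[current]:
        best = max(adj[current], key=lambda v: len(adj[v]))
        if len(adj[best]) <= len(adj[current]):
            break
        current, hops = best, hops + 1
    return current, hops


def sir_disseminate(adj, seed, p, rng):
    """Step 2: each node, when first infected, forwards the query to each
    neighbor independently with probability p, then becomes removed.
    Returns (covered nodes, number of forwards C, number of rounds T)."""
    covered = {seed}
    frontier = [seed]
    forwards, rounds = 0, 0
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                if rng.random() < p:
                    forwards += 1              # one message sent over link (u, v)
                    if v not in covered:       # duplicates are received but ignored
                        covered.add(v)
                        next_frontier.append(v)
        if next_frontier:
            rounds += 1
        frontier = next_frontier
    return covered, forwards, rounds


if __name__ == "__main__":
    # Toy topology (an assumption): a hub with five spokes and one extra leaf.
    adj = {0: [1, 2, 3, 4, 5], 1: [0], 2: [0], 3: [0], 4: [0], 5: [0, 6], 6: [5]}
    seed, hops = selective_walk(adj, 6)
    print(seed, hops, sir_disseminate(adj, seed, 0.7, random.Random(1)))
```

Because every node forwards at most once, each link carries at most two messages, one in each direction, which is the bound on the edge weight used in the efficiency argument.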
Efficiency: The SIR sampling mechanism is an efficient sampling technique. The query dissemination at Step 2 dominates the cost of the query with the SIR sampling. Thus, here we focus on the reasons for the efficiency of the query dissemination. Recall that the communication cost $C$ of a query is equal to the total weight of its final communication graph $G_{t_0+T}(N_{t_0+T}, E_{t_0+T})$. With SIR sampling, when Step 2 terminates, the communication graph $G_{t_0+T}$ models the giant component through which the query disseminates. In this communication graph, the weight $w_e$ of any edge $e \in E_{t_0+T}$ is at most 2, because with SIR each node of the network forwards the query to its neighbors at most once (i.e., when it is infected). Therefore, the total communication cost $C$ of the query is at most 2 times the number of edges $|E_{t_0+T}|$. On the other hand, with SIR the number of edges $|E_{t_0+T}|$ is proportional to the query forwarding probability $p$. The SIR sampling mechanism is efficient mainly because it allows the minimum value for $p$ (and, according to the above discussion, the minimum cost $C$) required to satisfy an $\epsilon$-query with a given $\epsilon$ to be selected analytically. By selecting an appropriate value of $p$ for each particular query, it is as if the SIR sampling dynamically prunes the actual network topology to a less strongly connected communication subgraph with the minimum¹¹ number of edges $|E_{t_0+T}|$ (hence minimum communication cost) that still covers a sufficient number of nodes $|N_{t_0+T}|$ to satisfy the $\epsilon$-query. In this way, unlike flooding (which is a specific case of the SIR query dissemination with fixed $p = 1$), SIR sampling is tunable for each particular $\epsilon$-query and can satisfy the query efficiently by avoiding redundant communication. Next, we show that due to the phase transition phenomenon, for most of the $\epsilon$-queries avoiding the redundant communication can result in up to two orders of magnitude improvement in the efficiency of the sampling mechanism as compared to flooding.
¹⁰ We assume the result-set of the query is directly communicated back to the querier. This traffic depends on the selectivity of the query. Our metric for the communication cost only captures the cost of the query dissemination.
Tuning the query forwarding probability $p$ to the best operating point especially improves the querying efficiency for the $\epsilon$-queries other than those that require an almost complete result-set with $\epsilon \approx 1$. These queries are in fact the common case with the typical P2P applications. This property can be explained by observing the skewed form of the phase transition diagram. According to Figure 3-b, large fractions $\frac{|N'|}{|N|}$ of the network can be covered with forwarding probabilities as low as $p_c$ (which is often much less than $p = 1$); hence, $\epsilon$-queries with quite large $\epsilon$ values can be satisfied with low communication cost. Particularly, with power-law networks, since the value of $p_c$ tends towards zero (see Section IV), tuning the forwarding probability to near-$p_c$ values results in up to two orders of magnitude improvement in the cost of such $\epsilon$-queries. Only for $\epsilon$-queries with $\epsilon \approx 1$ does the forwarding probability approach 1. For such queries, a marginal improvement in $\epsilon$ costs a large amount of extra communication (notice the flat section of the phase transition diagram), which is only partly spent to expand the coverage of the query and mostly wasted because of the duplicate queries generated in the overly connected communication graph. The SIR sampling mechanism avoids this extra communication overhead when it is not required.
Finally, here we point out that with $p \ll 1$, since the shortest paths between nodes might be pruned out of the communication graph, the sampling time $T$ of the query exceeds the minimum possible sampling time achieved with $p = 1$ (i.e., with flooding). However, our experiments show that the sampling time of SIR increases by at most a factor of 4 over the minimum possible sampling time (see Section V) and remains incomparably superior to that of the random walk sampling mechanism.
¹¹ Here, by “minimum” we are not claiming minimality among all possible communication graphs, but among the SIR communication graphs with different forwarding probabilities.
B. Variants of SIR Sampling
Since in real-world P2P networks some nodes may re-
frain from participating in query dissemination, here we
introduce a family of variants for the SIR sampling mech-
anism to model this behavior. With this family, unlike
the original SIR model where initially all nodes are in the
“susceptible” state, some nodes begin in the “removed”
state. In the context of the disease epidemic, these nodes
represent the people who are vaccinated. We term such
nodes the blocked nodes.
We consider three different variants with blocking for
the original SIR sampling mechanism, each representing
a particular real-world scenario:
1) SIR sampling with uniform blocking: In this
case, nodes are blocked with uniform probability.
This case models the networks where nodes au-
tonomously decide whether or not to participate in
the query dissemination.
2) SIR sampling with negative degree-correlated
blocking: In this case, the nodes with lower connec-
tivity degrees are blocked with higher probability.
For example, this case models the P2P file-sharing
networks where low degree nodes are usually also
low-bandwidth and therefore, to avoid congestion
and possible isolation, may refrain from participat-
ing in query dissemination with higher probability.
3) SIR sampling with positive degree-correlated blocking: This case is the opposite of the previous case: nodes with higher connectivity degrees are blocked with higher probability. This case models the scenario where, for example, high degree nodes of a power-law network are attacked and disabled.
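The three blocking variants can be sketched as a weighted choice of the nodes that start in the “removed” state. The paper does not give the exact blocking probabilities, so the weights below (constant, inversely proportional to degree plus one, and proportional to degree plus one) are assumptions chosen only to show the idea, and `choose_blocked` is a hypothetical helper. Sampling without replacement uses the random-key method of Efraimidis and Spirakis.

```python
import random


def choose_blocked(adj, fraction, mode, rng):
    """Pick round(fraction * |N|) nodes to start in the "removed" state.

    mode = "uniform":  every node equally likely
    mode = "negative": low-degree nodes favored (weight 1 / (degree + 1))
    mode = "positive": high-degree nodes favored (weight degree + 1)
    The weight functions are illustrative assumptions, not the paper's.
    """
    nodes = list(adj)
    if mode == "uniform":
        weights = [1.0 for _ in nodes]
    elif mode == "negative":
        weights = [1.0 / (len(adj[v]) + 1) for v in nodes]
    elif mode == "positive":
        weights = [float(len(adj[v]) + 1) for v in nodes]
    else:
        raise ValueError("unknown blocking mode: %r" % mode)
    target = round(fraction * len(nodes))
    # Weighted sampling without replacement: key = u ** (1 / w), keep the largest.
    keyed = [(rng.random() ** (1.0 / w), v) for v, w in zip(nodes, weights)]
    keyed.sort(reverse=True)
    return {v for _, v in keyed[:target]}


if __name__ == "__main__":
    # Toy star topology (an assumption): node 0 is the hub.
    adj = {0: list(range(1, 10))}
    adj.update({i: [0] for i in range(1, 10)})
    rng = random.Random(7)
    for mode in ("uniform", "negative", "positive"):
        print(mode, sorted(choose_blocked(adj, 0.3, mode, rng)))
```

A dissemination routine would then treat the returned nodes as already removed: they are never infected and never forward the query.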
IV. THEORETICAL ANALYSIS
Figure 3-b depicts the relation between the forwarding probability $p$ of the SIR dissemination model and the relative size of the covered node-set $s_{node} = \frac{|N'|}{|N|}$ assuming a hypothetical network. Here, we assume a power-law network (according to our model for P2P networks) and use the percolation theory [27], [28], [29] to analytically derive the function $s_{node}(p)$.
First, we provide some definitions. The generating-function formalism [30] can be used to characterize a probability distribution function. Particularly, the generating function $G_0(x)$ for the distribution of node-degree $k$ in a random graph is defined as:
\[ G_0(x) = \sum_{k=0}^{\infty} p_k x^k \tag{5} \]
where $p_k$ is the probability that a randomly chosen node of the graph has degree $k$. From Equation (5), one can deduce the $n$-th moment of the degree distribution as follows:
\[ \langle k^n \rangle = \left[ \left( x \frac{d}{dx} \right)^{n} G_0(x) \right]_{x=1} \tag{6} \]
For a random graph with the generating function $G_0(x)$, one can also define $G_1(x)$, which is the generating function for the distribution of the degree of the nodes we arrive at by following a randomly chosen link (excluding the chosen link itself). $G_1(x)$ can be derived from $G_0(x)$ as follows:
\[ G_1(x) = \frac{1}{\langle k \rangle} G_0'(x) \tag{7} \]
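To make Equations (5) to (7) concrete, the following Python sketch evaluates them numerically for a small, made-up degree distribution. The list-based representation (`pk[k]` holds $p_k$) and the function names are assumptions for illustration only.

```python
def g0(pk, x):
    """Equation (5): G0(x) = sum_k p_k x^k, with pk[k] = p_k."""
    return sum(p * x ** k for k, p in enumerate(pk))


def g0_prime(pk, x):
    """First derivative of G0 with respect to x."""
    return sum(k * p * x ** (k - 1) for k, p in enumerate(pk) if k > 0)


def moment(pk, n):
    """Equation (6) evaluated directly: <k^n> = sum_k k^n p_k."""
    return sum((k ** n) * p for k, p in enumerate(pk))


def g1(pk, x):
    """Equation (7): G1(x) = G0'(x) / <k>."""
    return g0_prime(pk, x) / moment(pk, 1)


if __name__ == "__main__":
    # Made-up distribution: degrees 1, 2, 3 with probabilities 0.5, 0.3, 0.2.
    pk = [0.0, 0.5, 0.3, 0.2]
    print(g0(pk, 1.0), moment(pk, 1), moment(pk, 2), g1(pk, 1.0))
```

Both generating functions evaluate to 1 at x = 1 because they describe normalized probability distributions, which is a convenient sanity check on any implementation.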
Finally, consider the bond percolation model described in Section III-A.1 for an arbitrary random graph. Assuming each link is closed with probability $p$, one can easily derive the generating functions $G_0(x;p)$ and $G_1(x;p)$ for a connected component (itself as a random graph) as follows:
\[ G_0(x;p) = G_0(1+(x-1)p) \tag{8} \]
\[ G_1(x;p) = G_1(1+(x-1)p) \tag{9} \]
In the rest of this section, first we derive the distribution of the size $s$ (i.e., the number of nodes) of the connected components in the bond percolation model described in Section III-A.1. Since the giant component appears at the critical probability $p_c$, the average size of the connected components goes to infinity at $p = p_c$. We use this criterion to find $p_c$. Finally, as a result of this discussion we derive the relation between the relative size $s_{node}$ of the giant component and the probability $p$ (which, as we discussed in Section III-A.1, maps to the query forwarding probability $p$). Particularly, we characterize the function $s_{node}(p)$ for a power-law network, which is a particular kind of random graph.
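Assume $H_0(x;p)$ is the generating function for the distribution of the size of the connected components, and $H_1(x;p)$ is the corresponding generating function for the component reached by following a randomly chosen link. The following equations follow by observing the structure of a giant component:
\[ H_1(x;p) = x\,G_1(H_1(x;p);p) \tag{10} \]
\[ H_0(x;p) = x\,G_0(H_1(x;p);p) \tag{11} \]
Therefore, the average size $\langle s \rangle$ of the connected components can be computed as:
\[ \langle s \rangle = H_0'(1;p) = 1 + G_0'(1;p)\,H_1'(1;p) \tag{12} \]
However, according to Equation 10 we have:
\[ H_1'(1;p) = 1 + G_1'(1;p)\,H_1'(1;p) \;\Longrightarrow\; H_1'(1;p) = \frac{1}{1 - G_1'(1;p)} \tag{13} \]
Thus:
\[ \langle s \rangle = 1 + \frac{G_0'(1;p)}{1 - G_1'(1;p)} = 1 + \frac{p\,G_0'(1)}{1 - p\,G_1'(1)} \tag{14} \]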
Finally, as we mentioned above, the phase transition occurs when $\langle s \rangle \to \infty$. Therefore, the critical probability $p_c$ is:
\[ p_c = \frac{1}{G_1'(1)} \tag{15} \]
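As a worked step that is not spelled out in the text, Equation (15) can be rewritten in terms of the moments of the degree distribution, using only the definitions in Equations (5) to (7):

```latex
\begin{align*}
G_1'(x) &= \frac{G_0''(x)}{\langle k \rangle}, \qquad
G_0''(1) = \sum_{k} k(k-1)\,p_k = \langle k^2 \rangle - \langle k \rangle, \\
p_c &= \frac{1}{G_1'(1)} = \frac{\langle k \rangle}{\langle k^2 \rangle - \langle k \rangle}.
\end{align*}
```

For a power-law degree distribution $p_k \propto k^{-\gamma}$ with $2 < \gamma < 3$, the second moment $\langle k^2 \rangle$ grows without bound as the degree cutoff increases while $\langle k \rangle$ stays finite, so $p_c$ shrinks towards zero in large networks. This is consistent with the earlier remark that $p_c$ tends towards zero for power-law networks.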
With $p \geq p_c$, $H_0(x;p)$ remains the distribution of the finite-size components, i.e., all components except the giant component. Thus, we have:
\[ H_0(1;p) = 1 - s_{node}(p) \tag{16} \]
Finally, using Equation 11 we find $s_{node}(p)$ as follows:
\[ s_{node}(p) = 1 - G_0(y;p) \tag{17} \]
where, according to Equation 10, $y = H_1(1;p)$ is the solution of:
\[ y = G_1(y;p) \tag{18} \]
It is difficult to solve these equations for $s_{node}(p)$ in closed form. However, we can solve the equations by numerical iteration. In Figure 4-a, we show the solution for a power-law random graph with $\gamma = 2.3$.
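One way to carry out the numerical iteration is sketched below in Python. It is an illustration under assumptions, not the authors' code: the degree distribution is taken to be a pure power law $p_k \propto k^{-\gamma}$ truncated to $k_{min} \le k \le k_{max}$, Equation (18) is solved by plain fixed-point iteration started from $y = 0$, and the helper names are hypothetical.

```python
def truncated_power_law(gamma, k_min, k_max):
    """Assumed degree distribution: p_k proportional to k**(-gamma) on
    k_min..k_max, zero elsewhere. pk[k] holds p_k."""
    pk = [0.0] * (k_max + 1)
    for k in range(k_min, k_max + 1):
        pk[k] = k ** (-gamma)
    total = sum(pk)
    return [q / total for q in pk]


def critical_probability(pk):
    """Equation (15), p_c = 1 / G1'(1) = <k> / (<k^2> - <k>)."""
    m1 = sum(k * q for k, q in enumerate(pk))
    m2 = sum(k * k * q for k, q in enumerate(pk))
    return m1 / (m2 - m1)


def giant_component_size(pk, p, iterations=2000):
    """Solve y = G1(y; p) (Eq. 18) by fixed-point iteration, then return
    s_node(p) = 1 - G0(y; p) (Eq. 17), using Eqs. (8) and (9)."""
    mean_k = sum(k * q for k, q in enumerate(pk))

    def g0(x):
        return sum(q * x ** k for k, q in enumerate(pk))

    def g1(x):
        return sum(k * q * x ** (k - 1) for k, q in enumerate(pk) if k > 0) / mean_k

    y = 0.0                                  # start below the smallest fixed point
    for _ in range(iterations):
        y = g1(1.0 + (y - 1.0) * p)          # G1(y; p) = G1(1 + (y - 1) p)
    return 1.0 - g0(1.0 + (y - 1.0) * p)     # 1 - G0(y; p)


if __name__ == "__main__":
    pk = truncated_power_law(gamma=2.3, k_min=4, k_max=100)
    print("p_c ~", round(critical_probability(pk), 4))
    for p in (0.02, 0.1, 0.3, 1.0):
        print(p, round(giant_component_size(pk, p), 4))
```

The parameters in the usage example mirror the skew factor, minimum degree, and cutoff reported for the simulated graphs, but the resulting curve should not be expected to reproduce Figure 4-a exactly, since the exact distribution used for the theoretical curve is not stated. Starting the iteration from $y = 0$ makes it converge to the smallest fixed point in $[0, 1]$, which is the physically relevant solution above $p_c$.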
V. EMPIRICAL ANALYSIS
We conducted two sets of experiments via simulation
to 1) verify our theoretical results, and 2) compare the
efficiency of our SIR sampling mechanism versus other
search mechanisms. For this purpose, we implemented a
discrete-time event-driven simulator in C++. We used an
Enterprise E220 SUN server to perform the experiments.
A. Experimental Methodology
With the first set of experiments, we studied the relation between the forwarding probability $p$ and the size of the sample node-set covered by our epidemic sampling mechanisms. Therefore, with these experiments the object content of the nodes is irrelevant. With the second set of experiments, we evaluated the efficiency of various search mechanisms in resolving $\epsilon$-queries, and therefore considered the object content characteristics. Our Monte Carlo simulation was organized as a set of “runs”. For the first set of experiments each run consists of 1) selecting a network topology, 2) selecting a querier, and finally 3) initiating 50 query disseminations per forwarding probability $p$ (for each one of the sampling mechanisms) while varying $p$ from 0 to 1, and recording the average size of the covered node-set as well as $C$ and $T$. For the second set of experiments a run comprises 1) selecting a network topology, 2) selecting an object-set (a multiset of tuples), 3) distributing the object-set among the network nodes, 4) selecting an $\epsilon$-query, 5) selecting a querier, and finally 6) initiating the query 50 times per $\epsilon$ (for each of the search mechanisms) while varying $\epsilon$ from 0 to 1, and recording the average value of their efficiency numbers $C$ and $T$. Each result data-point reported in Section V-B is the average result of 50 runs. The coefficient of variation across the runs was below 2.5%, which indicates the stability of the results.
We generated a set of 100 undirected power-law graphs $G(N,E)$, each with $|N| = 50000$ and $|E| \approx 200000$. The skew factors of the graphs are all about $\gamma = 2.3$ as measured in [21]. The minimum number of edges per node is 4, and the cutoff index of the graphs is $\nu = 100$. The graphs are generated using the preferential attachment model proposed by Barabasi et al. [31].
We considered a 5-dimensional content space and generated 100 object-sets. For each object-set, we first generated $|r| = 100000$ objects $u = \langle a_1, a_2, \ldots, a_5 \rangle$, where $a_i$ is an integer value uniformly selected from the interval $[1,10]$. Thereafter, we replicated the objects according to the object replication scheme defined in Section II-D with the total number of objects $n = 500000$. An object-set of $n$ objects is distributed uniformly among the set of network nodes $N$.
We also generated a workload of 10000 queries by constructing 10000 selection predicates based on the template $(a_1\,\theta\,X_1) \odot (a_2\,\theta\,X_2) \odot \ldots \odot (a_5\,\theta\,X_5)$, where $X_i$ is a constant integer uniformly selected from the interval $[1,10]$ for each predicate, $\theta \in \{<, =, >\}$, and $\odot \in \{\mathrm{AND}, \mathrm{OR}\}$. Finally, the querier was randomly selected from $N$.
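The object and workload generation described above can be sketched in Python as follows. Several details are assumptions made for illustration, since the text does not state them: each comparison operator and each connective is drawn independently and uniformly, predicates are evaluated strictly left to right without operator precedence, and all function names are hypothetical.

```python
import random

# Comparison operators allowed by the predicate template.
OPERATORS = {
    "<": lambda a, x: a < x,
    "=": lambda a, x: a == x,
    ">": lambda a, x: a > x,
}


def random_object(rng, dims=5, low=1, high=10):
    """One object: a tuple of `dims` integers drawn uniformly from [low, high]."""
    return tuple(rng.randint(low, high) for _ in range(dims))


def random_predicate(rng, dims=5, low=1, high=10):
    """One instance of the template: `dims` comparisons (operator, constant)
    joined by dims - 1 connectives, each AND or OR."""
    terms = [(rng.choice(sorted(OPERATORS)), rng.randint(low, high)) for _ in range(dims)]
    connectives = [rng.choice(["AND", "OR"]) for _ in range(dims - 1)]
    return terms, connectives


def matches(obj, predicate):
    """Evaluate the predicate on an object, left to right (an assumption)."""
    terms, connectives = predicate
    op, const = terms[0]
    result = OPERATORS[op](obj[0], const)
    for i, connective in enumerate(connectives, start=1):
        op, const = terms[i]
        value = OPERATORS[op](obj[i], const)
        result = (result and value) if connective == "AND" else (result or value)
    return result


if __name__ == "__main__":
    rng = random.Random(2004)
    objects = [random_object(rng) for _ in range(1000)]
    predicate = random_predicate(rng)
    selectivity = sum(matches(o, predicate) for o in objects) / len(objects)
    print(predicate, selectivity)
```

The replication scheme of Section II-D and the placement of objects on nodes are deliberately left out of this sketch.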
B. Experimental Results
Figure 4 illustrates the results of our first set of experiments. Figure 4-a depicts the relation between the query forwarding probability $p$ and the relative size of the covered node-set, i.e., the sample size $s_{node} = \frac{|N'|}{|N|}$. First, we notice how closely the results of our theoretical analysis conform with the performance of the basic SIR sampling mechanism in practice, especially for our $p$ values of interest close to the critical forwarding probability $p_c$. Also, we observe that while the performance of the SIR sampling with negative degree-correlated blocking is almost identical to that of the basic SIR sampling (even when more than half of the network nodes are blocked), with positive degree-correlated blocking the coverage for the same forwarding probability decreases significantly. This shows 1) the importance of the highly connected nodes in the performance of the SIR sampling, and 2) the independence of its performance from the nodes with lower connectivity degrees, which are often low-bandwidth and volatile. Figure 4-b confirms our previous conjecture that the communication cost of SIR sampling is linearly proportional to the query forwarding probability. Also, notice that with $p = 0.3$ almost 80% of the network nodes can be covered with only about 25% of the communication cost of the regular flooding (with $p = 1$). Besides, the size of the covered node-set becomes sublinearly proportional to the network size starting at $p_c \approx 0.05$, where the communication cost is almost two orders of magnitude less than that of the flooding. Finally, Figure 4-c illustrates how the sampling time decreases as the forwarding probability goes from $p_c$ towards 1, because the giant component becomes more strongly connected. Also, we observe that in the worst case, the sampling time with the SIR sampling only increases by a factor of 4 over the minimum possible sampling time with flooding.
Fig. 4. Verification of the analytical results. a. Node sample size $|N'|/|N|$ vs. forwarding probability $p$ (curves: Basic SIR; Uniform Blocking at 10%, 20%, and 30%; +ve Degree-Cor Blocking; -ve Degree-Cor Blocking at 33% and 52%; Theoretical Analysis). b. Communication cost (number of forwards, $C$) vs. forwarding probability. c. Sample time ($T$) vs. forwarding probability.
Figures 5 and 6 illustrate the results of our second set of experiments. With these experiments, we compared the efficiency of the SIR sampling in answering $\epsilon$-queries with that of the random walk and scope-limited-flooding search mechanisms. As we mentioned in Section I, unlike SIR sampling, these two search mechanisms are unable to determine whether they have covered a sufficiently large fraction of the network to satisfy a particular partial selection query. Nevertheless, to be able to compare our SIR sampling mechanism with these search mechanisms, for each particular $\epsilon$-query with given $\epsilon$ we calculated the absolute (not relative) number of nodes that must be covered to satisfy the query, and terminated the random walk and flooding as soon as their coverage exceeded the required number of nodes. The SIR sampling can decide on the required coverage on its own.
Fig. 5. SIR sampling vs. random walk. a. Communication cost (number of forwards, $C$) vs. $\epsilon$. b. Sample time ($T$) vs. $\epsilon$. Curves: Basic SIR; 1-RW, 16-RW, and 32-RW, each with and without self-avoidance.
Figure 5-b shows that, as one can expect, the sampling time of the SIR sampling is always incomparably shorter than that of the random walk, even with 32 parallel walkers. However, to our surprise, SIR sampling also outperforms random walk in communication cost (see Figure 5-a). Thus, to cover the same number of nodes the SIR-based dissemination uses a “lighter” communication graph as compared to that of the random walk. As illustrated in Figure 5-a, the random walk algorithms with different numbers of walkers incur the same communication cost. This should not be surprising; more walkers improve the sample time of the random walk by scanning the network in parallel, but a single random walker walks as much as 32 random walkers walk in aggregate to cover the same number of nodes. Also, notice that among random walk algorithms, self-avoiding random walk (which avoids repeated paths) always outperforms regular random walk in sample time. Figure 6-a shows that the SIR sampling always outperforms scope-limited flooding in communication cost. Notice the step-like diagram for the scope-limited flooding. Since at each hop flooding scans an exponentially larger number of new nodes, it cannot be tuned properly to cover a certain fraction of the network nodes with high resolution. Therefore, it cannot be efficient in answering $\epsilon$-queries. Finally, Figure 6-b shows that although the sampling time of the SIR sampling always exceeds that of the flooding (which is optimal), even in the worst case it remains tolerable.
Fig. 6. SIR sampling vs. scope-limited flooding. a. Communication cost vs. $\epsilon$. b. Sample time ($T$) vs. $\epsilon$. Curves: Scope-Limited Flooding; Basic SIR.
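The coarse, step-like behavior of scope-limited flooding noted above can be illustrated with a short Python sketch. It is an illustration only: the helper names are hypothetical, and a complete binary tree stands in for the network topology to keep the example small.

```python
def binary_tree(depth):
    """Toy topology (an assumption): complete binary tree, root is node 0."""
    n = 2 ** (depth + 1) - 1
    adj = {i: [] for i in range(n)}
    for i in range(1, n):
        parent = (i - 1) // 2
        adj[i].append(parent)
        adj[parent].append(i)
    return adj


def flooding_coverage_by_ttl(adj, seed):
    """Return [c_0, c_1, ...], where c_h is the number of nodes covered by
    flooding from `seed` with a hop limit (TTL) of h."""
    covered = {seed}
    frontier = [seed]
    coverage = [1]
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                if v not in covered:
                    covered.add(v)
                    next_frontier.append(v)
        if next_frontier:
            coverage.append(len(covered))
        frontier = next_frontier
    return coverage


def smallest_sufficient_ttl(coverage, required_nodes):
    """Smallest TTL whose coverage reaches `required_nodes`, with the coverage
    actually obtained, or None if even full flooding is not enough."""
    for ttl, count in enumerate(coverage):
        if count >= required_nodes:
            return ttl, count
    return None


if __name__ == "__main__":
    coverage = flooding_coverage_by_ttl(binary_tree(3), 0)
    print(coverage)
    print(smallest_sufficient_ttl(coverage, 8))
```

In this toy tree, a query that needs 8 nodes must use the TTL that covers all 15, because the only tunable quantity is the integer hop limit. The forwarding probability of SIR sampling, by contrast, is a continuous parameter.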
VI. CONCLUSION AND FUTURE WORK
In this paper, first we defined a query model that generalizes the typical P2P exact-match search queries to partial selection queries. Subsequently, we introduced our SIR sampling mechanism, which is specially designed to answer partial selection queries efficiently. SIR sampling is easy to implement because each node of the P2P network can simply toss a biased coin with probability $p$ to decide whether the query should be disseminated through a link or not. We showed through percolation analysis that the value of $p$ can be elegantly computed for each partial selection query from the query’s $\epsilon$ parameter defined by the user. We verified our analysis empirically via simulation. Also, with a comparative empirical study we showed that for the common case of the P2P partial selection queries, SIR sampling outperforms flooding by up to two orders of magnitude in communication cost while maintaining a tolerable response time. Also, as compared to a 32-random-walker, the SIR sampling has not only a faster response time but also a lower communication cost.
We intend to extend this study in two directions. First,
since the time-scale of the dynamic changes in the P2P
network topology might in some cases be comparable to
that of the query dissemination, we intend to use dynamic
percolation models to study the effects of the network dy-
namics in query dissemination. Second, we plan to ex-
tend our proposed family of epidemic-based search mech-
anisms to include search mechanisms to answer continu-
ous partial selection queries. For this purpose, we will
adopt the SIS (Susceptible-Infected-Susceptible) disease
dissemination model. Unlike SIR that models epidemic
diseases that disseminate once throughout the social net-
work and quickly disappear, SIS models endemic diseases
that become resident in the social network and continu-
ously disseminate.
REFERENCES
[1] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker,
“A scalable content-addressable network,” in Proceedings of
ACM SIGCOMM ’01, August 2001.
[2] I. Stoica, R. Morris, D. Karger, M.F. Kaashoek, and H. Balakrish-
nan, “Chord: A scalable peer-to-peer lookup service for internet
applications,” in Proceedings of ACM SIGCOMM ’01, August
2001.
[3] B.Y. Zhao, J. Kubiatowicz, and A. Joseph, “Tapestry: An infrastructure for fault-tolerant wide-area location and routing,” Tech. Rep. UCB/CSD-01-1141, UCB, April 2001.
[4] A. Rowstron and P. Druschel, “Pastry: Scalable, distributed ob-
ject location and routing for large-scale peer-to-peer systems,”
in Proceedings of ACM International Conference on Distributed
Systems Platforms (Middleware), November 2001.
[5] F. Dabek, M.F. Kaashoek, D. Karger, R. Morris, and I. Stoica,
“Wide-area cooperative storage with CFS,” in Symposium on
Operating Systems Principles, 2001.
[6] Gnutella, “Gnutella RFC,” 2004, http://rfc-gnutella.sourceforge.net/.
[7] Q. Lv, P. Cao, E. Cohen, K. Li, and S. Shenker, “Search and replication in unstructured peer-to-peer networks,” in Proceedings of the 16th International Conference on Supercomputing (ICS ’02), June 2002.
[8] L. Adamic, R.M. Lukose, A.R. Puniyani, and B.A. Huberman, “Search in power-law networks,” Physical Review E, vol. 64, no. 046135, 2001.
[9] Q. Lv, S. Ratnasamy, and S. Shenker, “Can heterogeneity make
gnutella scalable?,” in Proceedings of the 1st International Work-
shop on Peer-to-Peer Systems (IPTPS ’02), 2002.
[10] A. Crespo and H. Garcia-Molina, “Routing indices for peer-to-peer systems,” in Proceedings of the 22nd International Conference on Distributed Computing Systems (ICDCS), July 2002.
[11] B. Yang and H. Garcia-Molina, “Designing a super-peer net-
work,” in Proceedings of the 19th International Conference on
Data Engineering (ICDE’03), March 2003.
[12] A. Demers, D. Greene, C. Hauser, W. Irish, and J. Larson, “Epi-
demic algorithms for replicated database maintenance,” in Pro-
ceedings of the sixth annual ACM Symposium on Principles of
Distributed Computing, 1987.
[13] H. Hethcote, “The mathematics of infectious diseases,” SIAM Review, vol. 42, no. 4, pp. 599–653, October 2000.
[14] D.J. Watts and S.H. Strogatz, “Collective dynamics of small-world networks,” Nature, vol. 393, pp. 440–442, 1998.
[15] Piolet Networks, “Piolet,” 2004, http://www.piolet.com/.
[16] SETI@home, “Search for extraterrestrial intelligence at home,”
2004, http://setiathome.ssl.berkeley.edu/.
[17] NEES, “NEESgrid,” 2004, http://www.neesgrid.org/.
[18] V.N. Padmanabhan, H.J. Wang, and P.A. Chou, “Resilient peer-to-peer streaming,” in Proceedings of the 11th IEEE International Conference on Network Protocols (ICNP’03), November 2003.
[19] Limewire.com, “Gnutella,” 2004, http://www.limewire.com/.
[20] Sharman Networks, “Kazaa,” 2004, http://www.kazaa.com/.
[21] S. Saroiu, P.K. Gummadi, and S.D. Gribble, “A measurement
study of peer-to-peer file sharing systems,” in Proceedings of
Multimedia Computing and Networking (MMCN’02), January
2002.
[22] M. Ripeanu, “Peer-to-peer architecture case study: Gnutella net-
work,” in Proceedings of the First International Conference on
Peer-to-Peer Computing (P2P’01), August 2001.
[23] M. Jovanovic, “Modeling large-scale peer-to-peer networks and
a case study of gnutella,” M.S. thesis, University of Cincinnati,
2001.
[24] K.P. Gummadi, R.J. Dunn, S. Saroiu, S.D. Gribble, H.M. Levy,
and J. Zahorjan, “Measurement, modeling, and analysis of a
peer-to-peer file-sharing workload,” in Proceedings of the 19th
ACM Symposium on Operating Systems Principles (SOSP’03),
October 2003.
[25] B. Bollobas, Random Graphs, Academic Press, New York, 1985.
[26] D. Stauffer and A. Aharony, Introduction to Percolation Theory,
Taylor and Francis, second edition, 1992.
[27] M. Molloy and B. Reed, “A critical point for random graphs with a given degree sequence,” Random Structures and Algorithms, vol. 6, pp. 161–180, 1995.
[28] R. Cohen, K. Erez, D. Avraham, and S. Havlin, “Resilience of
the internet to random breakdowns,” Physical Review Letters,
vol. 85, no. 21, pp. 4626–4628, November 2000.
[29] M.E.J. Newman, S.H. Strogatz, and D.J. Watts, “Random graphs
with arbitrary degree distribution and their applications,” Physi-
cal Review E, vol. 64, no. 026118, 2001.
[30] H.S. Wilf, GeneratingFunctionology, Academic Press, second
edition, 1994.
[31] A.L. Barabasi and R. Albert, “Emergence of scaling in random
networks,” Science, vol. 286, pp. 509–512, 1999.
Description
Farnoush Banaei-Kashani, Cyrus Shahabi, Muhammad Sahimi. "Epidemic sampling for search in unstructured peer-to-peer networks." Computer Science Technical Reports (Los Angeles, California, USA: University of Southern California. Department of Computer Science) no. 828 (2004).
Asset Metadata
Creator: Banaei-Kashani, Farnoush (author); Sahimi, Muhammad (author); Shahabi, Cyrus (author)
Core Title: USC Computer Science Technical Reports, no. 828 (2004)
Alternative Title: Epidemic sampling for search in unstructured peer-to-peer networks (title)
Publisher: Department of Computer Science, USC Viterbi School of Engineering, University of Southern California, 3650 McClintock Avenue, Los Angeles, California, 90089, USA (publisher)
Tag: OAI-PMH Harvest
Format: 12 pages (extent), technical reports (aat)
Language: English
Unique identifier: UC16269115
Identifier: 04-828 Epidemic Sampling for Search in Unstructured Peer-to-Peer Networks (filename)
Legacy Identifier: usc-cstr-04-828
Rights: Department of Computer Science (University of Southern California) and the author(s).
Internet Media Type: application/pdf
Copyright: In copyright - Non-commercial use permitted (https://rightsstatements.org/vocab/InC-NC/1.0/)
Source: 20180426-rozan-cstechreports-shoaf (batch), Computer Science Technical Report Archive (collection), University of Southern California. Department of Computer Science. Technical Reports (series)
Access Conditions: The author(s) retain rights to their work according to U.S. copyright law. Electronic access is being provided by the USC Libraries, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name: USC Viterbi School of Engineering Department of Computer Science
Repository Location: Department of Computer Science. USC Viterbi School of Engineering. Los Angeles, CA, 90089
Repository Email: csdept@usc.edu