A flexible framework for replication in distributed systems
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
A FLEXIBLE FRAMEWORK FOR REPLICATION IN DISTRIBUTED
SYSTEMS
by
Eul Gyu Im
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
May 2002
Copyright 2002 Eul Gyu Im
UMI Number: 3073796
UMI Microform 3073796
Copyright 2003 by ProQuest Information and Learning Company.
All rights reserved. This microform edition is protected against
unauthorized copying under Title 17, United States Code.
ProQuest Information and Learning Company
300 North Zeeb Road
P.O. Box 1346
Ann Arbor, MI 48106-1346
UNIVERSITY OF SOUTHERN CALIFORNIA
The Graduate School
University Park
LOS ANGELES, CALIFORNIA 90089-1695
This dissertation, written by
Eul Gyu Im
Under the direction of his Dissertation
Committee, and approved by all its members,
has been presented to and accepted by The
Graduate School, in partial fulfillment of
requirements for the degree of
DOCTOR OF PHILOSOPHY
Graduate Studies
Date May 10, 2002
DISSERTATION COMMITTEE
Chairperson
Dedication
This dissertation is dedicated to my parents and my wife.
Acknowledgements
Many people helped me while I was working on this thesis. I think I was very
fortunate to meet so many good people at USC.
Many former and current GOST group members attended dry-runs, pointed out
many issues and gave me a lot of advice. I would like to thank every GOST member
for their help and support.
I would also like to thank my many friends at USC: JungHyun Han and June
Sup Lee at USC campus, Dongho Kim, Sungdo Moon, Inyoung Ko, Yongdae Kim,
Hyuckchul Jung, and Soonwook Hwang at ISI, my ex-officemate SungWook Ryu,
and also my ex-apartmentmate Taeyong Kim.
Dr. John Heidemann and Dr. Cyrus Shahabi helped me with my thesis proposals,
and Dr. Timothy Pinkston and Dr. Ellis Horowitz provided me with invaluable
advice for both my thesis proposal and defense. I appreciate their help from the
bottom of my heart.
Finally, I’d like to thank my advisor, Dr. Clifford Neuman, for his support and
understanding during my stay at ISI.
Contents
Dedication
Acknowledgements
List Of Tables
List Of Figures
Abstract
1 Introduction
1.1 Data Replication
1.2 Consistency Control
1.3 A Flexible Framework for Replication
2 Related Work
2.1 Multiple Replication Mechanisms
2.2 Support for Mobility
2.3 Conclusion
3 Design
3.1 Flexible Replicated Objects
3.2 Generic Phases for Replication Mechanisms
3.2.1 Pessimistic Mechanisms
3.2.2 Optimistic Mechanisms
3.2.3 Generic Phases and a Common Framework
3.3 Design Issues of Replication Mechanisms
3.3.1 Multiple Replication Mechanisms
3.3.2 Multiple Levels of Replication
3.3.2.1 Light-Weight Replicas
3.3.2.2 Caches
3.3.2.3 Configuration Examples
3.3.3 Granularity
3.4 Conclusion
4 Implementation
4.1 Overall Structure of the Framework
4.1.1 Replication Management Module
4.1.2 Replication Mechanisms
4.2 Attributes of an Object
4.3 System Object
4.4 Libraries
4.5 User Commands
4.5.1 mkreplica
4.5.2 vrm
4.6 Protocols
4.7 Update Logs
4.8 Example Mechanisms
4.8.1 Optimistic Mechanisms
4.8.1.1 PRSMETADATA
4.8.1.2 PRSOPERATIONS
4.8.1.3 Libraries
4.8.2 Pessimistic Mechanism
4.8.2.1 PRSMETADATA
4.8.2.2 PRSOPERATIONS
4.8.2.3 Update Propagation
4.8.3 Two-tier Replication Mechanisms
4.8.3.1 PRSMETADATA
4.8.3.2 PRSOPERATIONS
4.8.3.3 Update Propagation
4.8.4 Caching
4.8.4.1 PRSMETADATA
4.8.4.2 PRSOPERATIONS
4.8.4.3 System Object
4.8.4.4 Cache Replacement Algorithm
4.9 Examples of write operations
5 Comparison
5.1 Comparison of Replication Systems
5.1.1 Comparison of Replication Mechanisms
5.1.2 Comparison of Replication Systems
5.2 Example Applications of Previous Work
5.2.1 Replicated Counter
5.2.2 Replicated Document
5.2.3 Bibliographic Database
5.2.4 Rover Applications
5.2.5 Reference Data of Online Transaction Processing system
5.2.6 Overall Comparison
5.3 Quantitative Analysis
5.3.1 Number of messages
5.3.2 Example Configuration
5.3.3 Comparison
5.4 Simulation
5.5 Overhead
6 Future Work
6.1 Selection of Replication Mechanisms
6.2 Dynamic Change of Replication Parameters
6.3 Load Balancing and Replica Placement
6.4 Quality Control of New Replication Mechanisms
6.5 Conclusion
7 Conclusions
Reference List
List O f Tables
4.1 Example of a function table for a replication mechanism
5.1 Overall comparison of replication systems
5.2 Comparison of the availability and the number of messages
List O f Figures
3.1 Flexible Replicated Objects
3.2 Flexible Replicated Objects in a distributed system
3.3 Functional Model for Server-side Pessimistic Mechanisms
3.4 Functional Model for Client-side Pessimistic Mechanisms
3.5 Functional Model for Optimistic Mechanisms
3.6 Overall Design of Prospero Replication Service
3.7 Caches and Replicas
3.8 Volume replication and file sharing
4.1 The Prospero Repository
4.2 The Prospero Replication Service (PRS) module in Prospero
4.3 Life cycles of objects in the PRS
5.1 Comparison of Replication Mechanisms
5.2 Comparison of Replication Systems
5.3 OLTP Reference Data Replication [6]
5.4 Example configuration of replicated data
5.5 Probability of Consistency of Transaction Data
5.6 Average Number of Transmissions
5.7 Access Patterns of Replicas
5.8 Number of transmissions for 10 replicas
5.9 Average number of transmissions for two to 12 replicas
5.10 Response time of the SET operation
5.11 Breakdown of elapsed time of SET operation for an optimistic mechanism
Abstract
Distributed systems are widely used today and are gaining popularity with the
advance of technology. It is important that distributed systems support
replication in order to improve performance, availability, and reliability. As
the size of a distributed system increases, so does the number of users and
applications in the system. One of the most important design issues related to
replication is the selection of suitable mechanisms for various users and
applications. The appropriate replication mechanisms depend on access patterns,
frequency of updates, tolerance of data inconsistency, and other
characteristics. A single replication mechanism cannot meet the needs of all
applications and users in a distributed system because some applications, such
as online transaction processing, need near-real-time data consistency while
other applications, such as email, can tolerate inconsistency. Therefore,
distributed systems must be able to support a variety of replication mechanisms
at the same time.
The goal of this dissertation was to build a flexible framework for different
replication mechanisms in distributed systems that would be suitable for a wide
variety of users and applications. A framework was developed that supports
multiple replication mechanisms, allows different objects to be maintained with
different replication mechanisms, and enables application programmers to
provide their own replication mechanisms. The Prospero Replication System (PRS)
was implemented on top of Prospero. Simulation results show that the PRS has
better data availability and system performance than other systems. It will be
argued that this framework allows the needs of diverse users and applications
in large distributed systems to be met. This dissertation has produced two
significant contributions. First, after an examination of replication
mechanisms used in distributed systems, a flexible framework that supports
multiple replication mechanisms together was designed and developed. Second,
following an examination of different levels of replication between replicas
and caches, a unified framework for them was produced. This multi-level
replication allows the framework to work well with mobile computing by placing
different replicas on mobile sites.
Chapter 1
Introduction
Distributed systems today are widely distributed, support diverse applications,
and increasingly make use of mobile computers. Since resources are distributed
over networks and shared by users in distributed systems, some resources can
become bottlenecks that diminish system performance. Object replication plays
an important role in allowing distributed systems to cope with large
distribution, diversity, and mobility and still provide good performance. When
implemented properly, object replication can enhance the performance,
availability, and reliability of distributed systems.
One of the most important design issues related to replication is the selection
of suitable mechanisms for various users and applications. The appropriate
replication mechanisms depend on many factors, such as access patterns,
frequency of updates, tolerance of data inconsistency, and other
characteristics. A single replication mechanism cannot meet the needs of all
applications and users in a distributed system because some applications, such
as online transaction processing, need near-real-time data consistency while
other applications, such as email, can tolerate inconsistency. Therefore,
distributed systems must be able to support a variety of replication mechanisms
at the same time. The goal of this dissertation was to build a framework for
this purpose.
Before discussing the issues of replication, some background information is
reviewed first.
1.1 Data Replication
Replication of data at multiple servers is the primary mechanism for providing
high availability. Replication has been used to improve data availability and
response time in distributed systems. There are several kinds of replication:
data or object replication, server replication, and so on. Object replication
is used to increase data availability, whereas server replication is used to
balance loads and to improve server availability.
A replica is a replicated copy of an object and is stored in permanent storage,
such as on hard disks. When objects are replicated, some replicas act as primary
copies and the other replicas act as backup copies. This model is called the
master-slave model or primary-backup model [2, 29]. In this model, all the activities must
be processed at the primary copies, and the backup copies remain inactive until the
primary copies fail.
In the master-slave model, primary copies can become a bottleneck. To overcome
this problem, all replicas can be given equal privileges. In the peer-to-peer
model, all replicas have the same privileges and operations can be processed on
any replica. Replication systems such as Ficus [17] use the peer-to-peer model
for object replication.
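The difference between the two models can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the function name, the
model labels, and the convention that the first live replica in the list acts
as the primary are hypothetical and are not taken from any of the systems
cited above.

```python
def eligible_replicas(model, replicas, failed=frozenset()):
    """Return the replicas that may process an operation.

    Assumption: `replicas` is ordered, and the first live entry acts as the
    primary copy in the primary-backup model.
    """
    alive = [r for r in replicas if r not in failed]
    if model == "peer-to-peer":
        # Every live replica has the same privileges.
        return alive
    if model == "primary-backup":
        # Only the primary is active; a backup takes over if the primary fails.
        return alive[:1]
    raise ValueError(f"unknown model: {model}")
```

With three replicas, the primary-backup model sends every operation to a single
copy, which is where the bottleneck arises, whereas the peer-to-peer model can
spread operations over all live copies.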
1.2 Consistency Control
From a client’s viewpoint, transactions on a replicated object should appear
the same as those on non-replicated objects. On non-replicated objects,
transactions appear to be performed one at a time in some order. A replicated
system achieves this by ensuring a serially equivalent interleaving of clients’
transactions: the effects of transactions performed on replicated objects by
various clients are the same as if they had been performed one at a time on
single data items. This property is called one-copy serializability [9].
Accessing distributed or replicated data introduces the data consistency problem.
If there are read or write operations on several copies of data concurrently, the
operations may be rolled back to guarantee one-copy serializability. An important
objective of distributed and replicated systems is providing high availability of data
while preserving data consistency [3, 4]. Although numerous replicated systems have
been built, most are suited for small domains or limited kinds of applications.
When objects are replicated, they must be kept consistent with each other.
Consistency controls can be categorized as follows:
Strong Consistency Control: Traditionally, one-copy serializability has been
used as a consistency criterion for replicated data. One-copy serializability
uses a strong consistency replica control algorithm to map logical data into
physical replicas. Replication mechanisms that use strong consistency control
are called pessimistic mechanisms. Each replica is kept consistent, and the
results of operations on replicas must be the same as the results of operations
done on a single object. Since more than one replica is involved in an
operation, strong consistency control can degrade performance.
Weak Consistency Control: With weak consistency control schemes, replicas can
be inconsistent because of concurrent updates. In these schemes, a server
processes only its local copy, on the assumption that concurrent accesses to an
object are rare. Such schemes therefore perform better than strong consistency
control schemes when conflicts between accesses are infrequent.
Many consistency control mechanisms have been proposed to date: read-one
write-all mechanisms, quorum mechanisms, optimistic mechanisms, two-tier
replication mechanisms, and others.
In the read-one/write-all mechanism, only one replica is accessed for read
operations and all the replicas must be accessed for write operations. This is
a pessimistic mechanism, and replicas are kept consistent most of the time. If
one replica is not available, write operations cannot be performed. In the
quorum mechanism, there is a read quorum and a write quorum. Since quorums can
be smaller than the total number of replicas, read or write operations can be
successfully performed on replicas even though one or two replicas are
unavailable. Optimistic mechanisms and two-tier replication mechanisms use weak
consistency controls. In optimistic mechanisms, operations are performed on any
of the replicas and conflict resolution algorithms are used if there are any
conflicts among updates.
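The availability difference between read-one/write-all and quorum mechanisms
can be illustrated with a minimal Python sketch. The function names are
hypothetical, and the overlap rules r + w > n and 2w > n are the usual
quorum-intersection conditions, assumed here rather than stated in the text
above.

```python
def valid_quorums(n, r, w):
    """Check the usual overlap rules (an assumption of this sketch): every read
    quorum meets every write quorum, and any two write quorums meet."""
    return 1 <= r <= n and 1 <= w <= n and r + w > n and 2 * w > n


def can_read(available, r):
    """A read succeeds if at least r replicas are reachable."""
    return available >= r


def can_write(available, w):
    """A write succeeds if at least w replicas are reachable."""
    return available >= w


def read_one_write_all(n):
    """Read-one/write-all is the special case r = 1, w = n."""
    return 1, n
```

For five replicas, read-one/write-all corresponds to (r, w) = (1, 5), so one
unavailable replica blocks all writes. Under the same assumptions, (r, w) =
(3, 3) still allows both reads and writes with two replicas unavailable.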
1.3 A Flexible Framework for Replication
There are several issues related to designing replication systems in
distributed systems. If an object is replicated, replicated objects must be
kept consistent with the original object. One of the most important design
issues is the selection of replication mechanisms that will be supported by the
system. Selecting appropriate replication mechanisms depends on many factors,
including access patterns, document sharing patterns, frequency of updates, and
tolerance of data inconsistency. For example, an online transaction processing
(OLTP) application has three kinds of data, each of which must be maintained at
a different level of consistency: transaction data that must be maintained with
very strong consistency control, reference data that requires near-real-time
consistency, and analysis data that can tolerate inconsistency and can be
maintained with weak consistency control.
To receive widespread use, a replication system in distributed systems must
meet the diverse needs of numerous users and must be scalable to large
distributed systems. In a large distributed system with diverse users and
applications, it is not enough to support all users with a single replication
mechanism. Moreover, there are often a variety of ways to implement a given set
of semantics. Having a fixed set of replication mechanisms prevents
applications from exploiting new and improved mechanisms that may better meet
users’ needs. Therefore, replication systems must be flexible enough to support
more than one replication mechanism and to allow application programmers to add
new mechanisms.
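The two requirements above, several mechanisms side by side and room for new
ones, can be sketched as a small registry in Python. Everything here is a
hypothetical illustration: the class, the method names, and the stand-in write
handlers are assumptions and are not the interfaces of the framework described
in this dissertation.

```python
class MechanismRegistry:
    """Hypothetical registry: mechanisms are looked up by name, per object."""

    def __init__(self):
        self._mechanisms = {}  # mechanism name -> write handler
        self._assigned = {}    # object name -> mechanism name

    def register(self, name, write_handler):
        """Add a mechanism; application programmers may add their own."""
        self._mechanisms[name] = write_handler

    def assign(self, obj, mechanism):
        """Choose the mechanism used for one object."""
        if mechanism not in self._mechanisms:
            raise KeyError(f"unknown mechanism: {mechanism}")
        self._assigned[obj] = mechanism

    def write(self, obj, value):
        """Dispatch a write to whichever mechanism the object uses."""
        return self._mechanisms[self._assigned[obj]](obj, value)


def pessimistic_write(obj, value):
    # Stand-in: a strong-consistency mechanism would update the replicas first.
    return ("pessimistic", obj, value)


def optimistic_write(obj, value):
    # Stand-in: a weak-consistency mechanism would update the local copy first.
    return ("optimistic", obj, value)


registry = MechanismRegistry()
registry.register("pessimistic", pessimistic_write)
registry.register("optimistic", optimistic_write)
registry.assign("transaction_data", "pessimistic")
registry.assign("analysis_data", "optimistic")
```

In this sketch the OLTP transaction data and analysis data from the example
above are handled by different mechanisms, and a further mechanism, for
instance one for the reference data, could be registered later without touching
the existing ones.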
Because of the increased popularity of mobile computers, the support for
mobility is also an important design issue for replication systems. Due to the
high possibility of disconnection of mobile computers, replication mechanisms
used for stable computers may not directly apply to mobile computers.
This dissertation describes a flexible framework that supports multiple
replication mechanisms and allows the addition of new replication mechanisms.
It enables different objects to be replicated by different mechanisms. In
addition, the framework provides different levels of replication. For mobile
computers, the high possibility of disconnection increases the overhead of
regular replicas, and systems can achieve better performance by placing second-
or third-level replicas on mobile nodes. Details on multi-level replication are
discussed in Section 3.3.2.
The contributions of this dissertation are:
• The examination of replication mechanisms used in distributed systems and
the proposal of a flexible framework which can support various replication
mechanisms currently available.
• The examination of different levels of replication between replicas and caches
and the design of a unified infrastructure for replication and caching that gives
more flexibility to the system.
• The implementation of a new framework and the comparison of it to other
replication systems.
In summary, the goal of this new framework is to provide the following features:
• Flexibility:
  - Supporting multiple replication mechanisms
  - Allowing the addition of new replication mechanisms
  - Providing different levels of replicas
• Scalability:
  - The performance of the overall system is not seriously degraded even as
    the number of replication mechanisms increases.
  - Even as the number of replicated objects grows, the overhead of the
    framework is negligible in comparison with that of the replication
    mechanisms.
  - The system allows the addition of new servers without interfering with
    other servers.
To support multiple replication mechanisms together, several issues must be
resolved. The system must be cost-effective while satisfying various users’
needs. There are many aspects of cost. CPU usage and network bandwidth are
performance aspects of cost. There are also storage costs and the development
cost of deploying new mechanisms. In this dissertation, the term
'cost-effective' takes all of the above costs into account. If several
replication mechanisms are supported in a single system, the system may become
inefficient because different replication mechanisms use different operations
and attributes, and as a result different operations and attributes must be
called whenever different replication mechanisms are used for objects. So the
system must be optimized to perform well enough to support multiple replication
mechanisms together, to minimize system overhead, and to reduce the overall
cost of supporting multiple replication mechanisms.
Scalability is also an important issue. As explained above, there are several
aspects of scalability of replication systems: the number of replicated
objects, the number of servers, and the number of replication mechanisms. Since
the system must be able to support large distributed systems, it must be
scalable as the number of replicated objects or the number of servers grows. If
the system supports more than one replication mechanism, it must also be
scalable with the number of supported replication mechanisms.
In addition, the following issues should be considered in designing the system:
• Transparency to users
Some users may not know about replication and will regard replicas the same
as other objects. By defining common interfaces for objects, the system hides
details about each replication mechanism and gives users a transparent
view of replicated objects. Transparency of replicas allows seamless support
of clients’ applications.
• Support for mechanism-specific characteristics
Different mechanisms use different operations and attributes, and these
different operations and attributes must be supported together in a common
framework.
• Specification for a mechanism
To provide flexibility to the framework, the system must provide an efficient
way of adding new mechanisms and selecting an appropriate mechanism for
an object.
To solve the above issues, a common framework for multiple replication
mechanisms was designed. This common framework was drawn from generic phases of
replication mechanisms, which are explained further in Section 3.2. The generic
phases shown in this dissertation can represent algorithms of both optimistic
mechanisms and pessimistic mechanisms for both client-side and server-side
mechanisms. This dissertation will show that replication mechanisms can be
generalized and can fit into a single set of phases. Based on this generalized
single set of phases, a set of common libraries and attributes is provided to
support multiple replication mechanisms.
In addition, to support various replication mechanisms in a framework, a set
of common operations for replicas is defined. This set of common operations can
represent the object life cycle, which includes object creation, modification,
deletion, and so on. This set can be applied to both replicated and
non-replicated objects and can give clients a transparent view of both.
See Section 4.1.2 for more details. Also, method-specific attributes are
supported together in the common framework.
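The idea of one set of common operations for replicated and non-replicated
objects can be sketched as follows. This is a minimal Python illustration under
stated assumptions: the class and method names are hypothetical, the `read`
operation is added only so the example can be checked, and the replicated
variant simply applies each operation to every copy as a stand-in for a real
replication mechanism.

```python
from abc import ABC, abstractmethod


class ObjectOps(ABC):
    """Common life-cycle operations (hypothetical names)."""

    @abstractmethod
    def create(self, value): ...

    @abstractmethod
    def modify(self, value): ...

    @abstractmethod
    def read(self): ...

    @abstractmethod
    def delete(self): ...


class PlainObject(ObjectOps):
    """A non-replicated object holding a single value."""

    def __init__(self):
        self._value = None

    def create(self, value):
        self._value = value

    def modify(self, value):
        self._value = value

    def read(self):
        return self._value

    def delete(self):
        self._value = None


class ReplicatedObject(ObjectOps):
    """Stand-in mechanism: every operation is applied to all copies."""

    def __init__(self, n_replicas):
        self._replicas = [PlainObject() for _ in range(n_replicas)]

    def create(self, value):
        for replica in self._replicas:
            replica.create(value)

    def modify(self, value):
        for replica in self._replicas:
            replica.modify(value)

    def read(self):
        return self._replicas[0].read()

    def delete(self):
        for replica in self._replicas:
            replica.delete()


def client_update(obj, value):
    """Client code is the same whether or not the object is replicated."""
    obj.modify(value)
    return obj.read()
```

Because `client_update` depends only on the common operations, the client
cannot tell which kind of object it is handed, which is the transparency
property described above.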
The framework has been implemented on top of the Prospero system [30]. Two
simulations were done to demonstrate the advantages of this new framework.
Read/write transactions on an OLTP application were simulated, and the results
showed that this framework had better availability of objects and system
performance than previous replication systems. Another simulation was done with
several access patterns of replicas. It showed that performance can be improved
by changing replication mechanisms dynamically based on the access patterns of
replicas.
The system was evaluated and the performance of the system was compared with
that of the View Consistency system of Ficus. View Consistency is a client-enforced
replication mechanism that is implemented on top of the Ficus file system. While
it only supports client-side optimistic mechanisms, it adds about 1 to 8 percent
overhead to the base system. The system proposed in this dissertation adds about
3.8 percent overhead to the base system. Even though the overhead values cannot
be compared directly, they indicate that the system proposed in this dissertation
provides good performance while supporting multiple replication mechanisms.
The rest of this dissertation is organized as follows: after the introduction in
Chapter 1, Chapter 2 describes previous work done on supporting multiple replication
mechanisms and multi-level replication. The design and implementation of the
framework are addressed in Chapter 3 and Chapter 4. Chapter 5 details simulations
and evaluations. Chapter 6 describes future work. Chapter 7 has conclusions.
Chapter 2
Related Work
The main intent of this dissertation is to develop a system for supporting multiple
replication mechanisms together in a framework and to provide for multi-level replication.
The rest of this chapter describes previous research that has been done on
these topics.
2.1 Multiple Replication Mechanisms
The designers of Deceit [36] correctly pointed out that the level of consistency required
between replicas differs depending on the situation, and Deceit allows users
to specify the level of consistency of individual files. In Deceit, the desired level
of consistency can only be specified by adjusting the number of replicas and other
parameters of replicas. The problem with Deceit is that only a limited number of replication
mechanisms are supported. In the environments for which Deceit is intended,
the mechanisms provided are sufficient, but in a much larger system with a wide
spectrum of uses, support for additional, application-specific replication techniques
is necessary.
Munin [7] is a distributed shared memory system which supports multiple consistency
protocols for shared variables. The declaration of each shared variable is
annotated by its expected access pattern, and the system chooses a consistency protocol
based on the annotations. Even though this scheme works well with distributed
shared memory systems, it cannot be directly applied to distributed systems because
it concentrates on shared variables rather than objects or replicas.
Both Deceit and Munin support only a fixed set of consistency protocols. Even
though they work well with small research domains, in much larger distributed
systems it is not enough to provide only a fixed set of replication mechanisms because
of the diversity of users and applications. There is similar work [10, 38] supporting
a fixed set of replication mechanisms.
DCOM [41] provides multiple replication schemes and multiple concurrency controls
for its objects. COM separates objects with different concurrency semantics
into distinct execution contexts, called apartments. An object comes into life, lives,
and dies within its apartment, and to use this object in a different apartment, the
object must be marshalled across. In DCOM, there are four different variations of
concurrency, also called threading models, for in-process servers, and these threading
models can be categorized into two schemes: a single-thread scheme and a multi-thread
scheme. If an object is created in a Single-Threaded Apartment (STA), all
method invocations to this object are serialized by COM using a message queue.
If an object is created in a Multi-Threaded Apartment (MTA), any thread within
the MTA can concurrently use this object, so all global, static, and instance states
must be protected against concurrent access. In DCOM, objects are not replicated.
Instead, proxies for this object are provided to remote servers and clients, and concurrency
controls are imposed by the process that has the instance of this object.
The replication scheme in Clouds is quite similar to that of DCOM. In Clouds [10],
two varieties of consistency preserving (cp) threads are used: gcp-threads for global
consistency and lcp-threads for local consistency. Both gcp-threads and lcp-threads
support only pessimistic mechanisms.
In Mariposa [35], depending on write frequency, one of three different update
propagation mechanisms is selected: triggers, side files, and table scans. The
fundamental idea of using different mechanisms for different access patterns is similar
to the research described in this dissertation, but it is limited to only one aspect of
variations, i.e. write frequency.
Subcontract [19], which tries to provide a flexible base for distributed programming,
attempts to give users more flexibility. In Subcontract, different objects can
select different remote procedure call (RPC) modules in order to use different consistency
controls, and application programmers can even add their own RPC modules.
Even though replication mechanisms can be defined and built as one of the communication
models, a lot of programming effort is required to add new replication
mechanisms because the system focuses on communication with other servers rather
than replication of objects. Another problem is that Subcontract only supports a
fixed set of meta-data for objects, which limits the ability of programmers to build
new replication mechanisms since different mechanisms need different meta-data.
For example, optimistic methods require version vectors, and voting methods need
quorums, but these meta-data are not supported in Subcontract.
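To make the meta-data requirement concrete, the following minimal sketch shows the kind of per-object state the two families of mechanisms rely on. It is an illustration only and is not taken from Subcontract or from the system proposed here: the function names `compare_version_vectors` and `quorums_intersect` are hypothetical, and version vectors are modeled simply as dictionaries from replica identifiers to update counters.

```python
def compare_version_vectors(a, b):
    """Compare two version vectors (dicts: replica id -> update counter).

    Returns 'equal', 'a_newer', 'b_newer', or 'conflict' (concurrent updates).
    """
    keys = set(a) | set(b)
    a_ahead = any(a.get(k, 0) > b.get(k, 0) for k in keys)
    b_ahead = any(b.get(k, 0) > a.get(k, 0) for k in keys)
    if a_ahead and b_ahead:
        return "conflict"
    if a_ahead:
        return "a_newer"
    if b_ahead:
        return "b_newer"
    return "equal"


def quorums_intersect(n, read_quorum, write_quorum):
    """Usual voting condition for n replicas: r + w > n and 2w > n."""
    return read_quorum + write_quorum > n and 2 * write_quorum > n


# Hypothetical usage: two replicas updated independently are in conflict.
status = compare_version_vectors({"s1": 2}, {"s2": 1})
```

An optimistic mechanism would store one such vector with every replica, whereas a voting mechanism would store the quorum sizes, which is why a fixed set of meta-data fields is restrictive.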
Georges Brun-Cottan et al. [5] proposed an architecture for adaptable replicated
objects. In their architecture, the type-specific logic of a replicated object, such as
concurrency control, is separated from the generic logic, such as consistency management.
As a result, programmers can adapt a replicated object to the specific demands
of an application by replacing components that implement a different consistency
manager. But in this scheme, when potential conflicts arise between operations for
replicas, all consistency managers of replicas have to participate in decisions to grant
write locks, making it difficult to build optimistic replication mechanisms and even
harder to support poorly connected computers.
Ashvin Goel generalized session guarantees [39] and proposed a View Consistency
model [14] which is built on top of an optimistic model. View Consistency has a suite
of consistency schemes that differ in terms of consistency criteria and allows clients
to select different consistency schemes based on their consistency criteria. His work
focused on providing users with replicas that meet the specified criterion. In his
scheme, different users and different processes can maintain an object with different
consistency schemes but only within optimistic mechanisms. Although his model
works well within the optimistic replication mechanisms, pessimistic mechanisms
cannot be supported together with optimistic mechanisms within the same system.
Wiesmann et al. provided an abstract and neutral framework to compare replication
techniques used in both distributed systems and databases [42]. The five phases
of the framework - client contact, server coordination, execution, agreement coordination,
and client response - provide a common model for different mechanisms.
Though the framework can support several mechanisms together, it is a conceptual
model focused mainly on comparison rather than a system, which makes it
difficult to implement.
There are also other works on supporting a flexible framework [8, 27, 22], but
they also have shortcomings in providing a general framework for a wide variety of
users and applications.
To overcome the problems of previous work, the system proposed in this dissertation
provides a framework that supports multiple replication mechanisms. Since
each object has its own replication information, different objects can be maintained
by different replication mechanisms. It also allows application programmers to add
new replication mechanisms.
2.2 Support for Mobility
To support mobility, Gray et al. [16] proposed a two-tier replication scheme which
supports two kinds of replicas: base replicas and mobile replicas. A mobile replica
accumulates tentative transactions which are reprocessed at a corresponding base
replica when the mobile replica reconnects to the base. This scheme uses coarse
granularity for replication and requires every replica, including mobile replicas, to
have the same set of replicas. Since mobile nodes have only limited resources, this
requirement may prevent the system from using replication more widely. Another
problem is that if a mobile replica is changed but the corresponding base replica is
not available, changes to the mobile replica cannot be propagated to other replicas
even though other base replicas are available.
Coda [33, 34, 24, 23] supplies caches as second-level replicas to support mobility,
but items in caches are swapped out by cache replacement algorithms even though
users may want to keep them in the cache. Since the cache miss penalty in mobile
computers is much higher than that in stable computers, using caches in mobile
nodes can cause system degradation. Another problem is that this system places
base replicas only at a set of secure computers, which reduces the flexibility of
replication. Lastly, as in Gray’s scheme, updates to caches cannot be propagated if
the corresponding base replica is not available.
The Bayou project [40] provides data sharing among mobile users and pairwise
communication between any replicas. In this system any two mobile computers
can communicate and can propagate their changes to one another. Unfortunately,
changes to replicas remain tentative until the changes are committed to a server designated
as the primary. In addition, when changes are propagated to other replicas,
the pairwise communications cause more network traffic than other algorithms.
Rover [21] uses a client/server architecture. Clients are Rover applications that
typically run on mobile computers, and servers typically run on stationary hosts.
Rover applications employ a check-in/check-out model of data sharing. The problem
with this approach is that applications must check in any objects that are replicated
in other hosts after making changes even though the applications run on stationary
hosts. In other words, replicas on stable hosts are treated the same as those on
mobile hosts.
2.3 Conclusion
There has been extensive work on replication, but previous replication systems lack
the flexibility to support diverse applications in distributed systems. As distributed
systems become larger and larger, replication systems need to provide more flexible
frameworks for supporting diversity of objects and users. I designed a framework
for replication systems, focusing on flexibility. This framework provides users multiple
replication mechanisms and multiple levels of replication. To support mobility,
different levels of replicas can be placed on mobile nodes. Example replication mechanisms
shown in Section 4.8 include the two-tier replication mechanism. The two-tier
replication mechanism presented in this dissertation is based on Gray’s mechanism
but has been modified for better performance. Unified control of replication and
caching is also provided in the framework proposed in this dissertation.
Chapter 3
Design
This dissertation proposes a flexible framework for replication which supports multiple
replication mechanisms and enables application programmers to add their own
replication mechanisms to the framework. The framework is implemented on top of
Prospero, and it is called the Prospero Replication System (PRS). In this chapter,
design details of the PRS are presented, such as the underlying object model, generalized
phases of replication operations, levels of replication, and so on. In the PRS,
users and applications can select replication mechanisms suited to their needs and in
addition, replication mechanisms, including application-specific mechanisms, can be
added to the framework. The PRS also supports multi-level replication mechanisms,
such as two-tier replication mechanisms, which can work well with loosely connected
computers.
The overall architecture and implementation of the framework are explained in
Chapter 4. This architecture is designed based on the generic phases for replication
mechanisms and the Flexible Replicated Objects model which are addressed in the
following sections.
3.1 Flexible Replicated Objects
[Figure 3.1 diagram: clients, external interfaces, objects with attributes, internal interfaces, and replication managers]
Figure 3.1: Flexible Replicated Objects
The Flexible Replicated Objects model which is proposed in this dissertation extends
the object-oriented model to a distributed environment to support replication
in distributed systems. In this model, each object has attributes, internal interfaces,
and external interfaces as shown in Figure 3.1. To access objects and their
attributes, the internal or external interfaces are used: Replication Managers use the
former and clients, such as application programs and users, use the latter.
Each host has one replication manager and objects in the same host share a replication
manager. A replication manager contains multiple replication mechanisms
and communicates with other replication managers in other hosts to keep replicated
objects consistent. Methods provided by the replication manager to clients include
create, delete, read, write and propagate. Through these methods, remote or local
objects can be accessed by clients or other replication managers. Each object has a
special attribute which specifies a replication mechanism to be used for that object.
When a system gets read or write requests for an object from clients, the system
invokes the appropriate methods within the replication managers according to the
attributes of the requested object. For example, write methods can be different
if replication mechanisms are different, and the system calls the appropriate write
method according to the attributes of the requested object. A replicated object can
also have attributes that are specific to a replication mechanism.
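The attribute-driven dispatch described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PRS implementation: the class names, the attribute key `replication-mechanism`, and the two mechanism names (`primary-copy`, `optimistic`) are all hypothetical.

```python
class ReplicatedObject:
    """An object whose attributes name the replication mechanism to use."""

    def __init__(self, name, mechanism, **mechanism_attrs):
        self.name = name
        self.value = None
        # The special attribute selects the mechanism; the remaining
        # attributes are specific to that mechanism (e.g. a version counter).
        self.attributes = {"replication-mechanism": mechanism, **mechanism_attrs}


class ReplicationManager:
    """One manager per host; it holds several mechanisms side by side."""

    def __init__(self):
        self._write_methods = {}

    def register(self, mechanism, write_method):
        self._write_methods[mechanism] = write_method

    def write(self, obj, value):
        # Look up the write method from the object's own attribute.
        mechanism = obj.attributes["replication-mechanism"]
        return self._write_methods[mechanism](obj, value)


def primary_copy_write(obj, value):
    obj.value = value
    return "written via primary-copy"


def optimistic_write(obj, value):
    obj.value = value
    obj.attributes["version"] = obj.attributes.get("version", 0) + 1
    return "written optimistically"


# Hypothetical usage: two objects on one host, each with its own mechanism.
manager = ReplicationManager()
manager.register("primary-copy", primary_copy_write)
manager.register("optimistic", optimistic_write)
ledger = ReplicatedObject("ledger", "primary-copy")
mailbox = ReplicatedObject("mailbox", "optimistic", version=0)
ledger_result = manager.write(ledger, 100)
mailbox_result = manager.write(mailbox, "hello")
```

The client calls the same `write` in both cases, which is the sense in which the model stays replication transparent.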
From the client’s point-of-view, the model is replication transparent, so the client
does not need to know the underlying structure of replication. Operations related
to replication are handled internally by the replication managers. A variety of consistency
strategies can be implemented and added to this model. The essence of the
model is replication managers which can handle multiple replication strategies and
can provide diversity to users.
Figure 3.2 shows an example configuration of the Flexible Replicated Objects
model in a distributed system. Clients access objects and their attributes through
the external interfaces, and replication managers communicate with each other to
keep replicated objects consistent. A replication manager in a host handles replicated
objects in that host.
[Figure 3.2 diagram: clients, external interfaces, objects and attributes, internal interfaces, and replication managers on two hosts]
Figure 3.2: Flexible Replicated Objects in a distributed system
3.2 Generic Phases for Replication Mechanisms
To design a common framework for multiple replication mechanisms, generic phases
that fit a range of replication mechanisms were devised. Wiesmann et al. [42] proposed
generic phases for replication mechanisms used in the database area or distributed
systems area. But their scheme lacks some functionality, such as synchronization
processes after partial updates to a set of replicas, so it cannot cover replication
mechanisms like quorums.
The following general phases are designed to support multiple replication mechanisms:
Client Request: A client sends requests to any servers that have replicas of the
intended object. Since operations on replicated objects are transparent to
clients, clients can send requests for replicas without knowing attributes of
replicas and their replication policies.
Server Coordination for Execution: In this phase, the server that receives requests
from a client acts as a coordinator server. The coordinator selects other
servers that will process clients’ requests, according to policies of the replication
mechanism of the requested object. The requests from the client are
broadcast to the selected servers, if necessary.
Execution: Actual operations are executed on the selected servers in this phase.
The results of execution are sent to the coordinator server.
Server Response: The coordinator server collects all the responses from other
servers and sends appropriate responses to the client according to the policies
of the replication mechanism.
Server Coordination for Propagation: Since some replication mechanisms allow
a partial set of replicas to be changed, changes must be propagated to the rest
of the replicas to make the whole set of replicas consistent. Different replication
mechanisms have different propagation algorithms, so the actual procedures
of this phase differ from one replication mechanism to another. Servers
that have inconsistent replicas will be selected and get the propagation requests,
based on the propagation policies of replication mechanisms.
Synchronization and Conflict Resolution: The synchronization process will be
done in this phase. Because of possible concurrent updates to replicas, servers
may find conflicts between replicas while processing synchronization. Conflict
resolution algorithms are defined in each replication mechanism.
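The six phases listed above can be sketched as one coordinator routine whose mechanism-specific decisions are delegated to a policy object. This is a minimal sketch under stated assumptions rather than the PRS code: `Replica`, `QuorumWritePolicy`, and `handle_write` are hypothetical names, the policy writes to the first `w` replicas only, and conflict resolution is reduced to copying from a replica with the highest version.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.value = None
        self.version = 0


class QuorumWritePolicy:
    """Hypothetical policy: execute on w replicas, propagate to the rest."""

    def __init__(self, w):
        self.w = w

    def select_for_execution(self, replicas):
        return replicas[: self.w]

    def respond(self, results):
        return "ok" if len(results) >= self.w else "failed"

    def select_for_propagation(self, replicas):
        newest = max(r.version for r in replicas)
        return [r for r in replicas if r.version < newest]

    def synchronize(self, stale, replicas):
        source = max(replicas, key=lambda r: r.version)
        for r in stale:
            r.value, r.version = source.value, source.version


def handle_write(value, replicas, policy, trace):
    """Run one write through the six generic phases, recording each phase."""
    trace.append("client request")
    selected = policy.select_for_execution(replicas)
    trace.append("server coordination for execution")
    results = []
    for r in selected:
        r.value, r.version = value, r.version + 1
        results.append(r.name)
    trace.append("execution")
    response = policy.respond(results)
    trace.append("server response")
    stale = policy.select_for_propagation(replicas)
    trace.append("server coordination for propagation")
    policy.synchronize(stale, replicas)
    trace.append("synchronization and conflict resolution")
    return response


# Hypothetical usage: four replicas, three of which take part in execution.
replicas = [Replica(f"r{i}") for i in range(4)]
trace = []
response = handle_write("v1", replicas, QuorumWritePolicy(w=3), trace)
```

A different mechanism would supply a different policy object while `handle_write` stays unchanged, which is the property a common set of phases is meant to provide.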
Replication mechanisms can be categorized into two kinds of mechanisms: pessimistic
mechanisms and optimistic mechanisms. Both pessimistic mechanisms and
optimistic mechanisms can be enforced from either server-side or client-side. The
next sections show how these replication mechanisms are supported in the generic
phases.
3.2.1 Pessimistic Mechanisms
[Figure 3.3 diagram, panel "Pessimistic (server side)": Client Request, Server Coordination, Execution, Server Response, Server Coordination, Synchronization & Conflict Resolution]
Figure 3.3: Functional Model for Server-side Pessimistic Mechanisms
[Figure 3.4 diagram, panel "Pessimistic (client side)": Client Request, Server Coordination, Execution, Server Response, Server Coordination, Synchronization & Conflict Resolution, across Replicas 1 to 4]
Figure 3.4: Functional Model for Client-side Pessimistic Mechanisms
[Figure 3.5 diagram, panel "Optimistic (server side)": Client Request, Server Coordination, Server Response, Server Coordination, Synchronization & Conflict Resolution, across Replicas 1 to 4]
Figure 3.5: Functional Model for Optimistic Mechanisms
applied. Because the client’s requests are processed on only one server, the functional
models for server-side optimistic mechanisms and client-side optimistic mechanisms
are the same except that server coordination for propagation is done by the client
in the client-side optimistic mechanisms.
3.2.3 Generic Phases and a Common Framework
The phases introduced above can support both pessimistic and optimistic
mechanisms and can be used to design a common framework for replication mechanisms.
The PRS provides common libraries based on these generic phases. Chapter 4
shows implementation details of the PRS.
3.3 Design Issues of Replication Mechanisms
3.3.1 Multiple Replication Mechanisms
Replication mechanisms are used to keep replicated data consistent and to give
clients a consistent view of the data. Some replicated data need to be maintained
with strong consistency control while other data can tolerate inconsistency of data
and can be maintained with weak consistency control. For example, data in email
applications can be maintained with weak consistency control since the order of
operations on email applications can be changed without affecting the final results,
but online transaction data must be maintained with strong consistency control
to ensure the integrity of data. So, in a large distributed system, one replication
mechanism cannot meet the needs of all users because there are various applications
and objects which require different replication strategies. If the system is to be
used widely, it is very important to support more than one replication mechanism.
In addition, the system must allow application programmers to be able to provide
their own mechanisms when appropriate. To meet all of these needs the system must
provide a flexible infrastructure for multiple replication mechanisms.
[Figure 3.6 diagram: three Prospero servers and their objects]
Figure 3.6: Overall Design of Prospero Replication Service
For this research, a Prospero Replication Service (PRS) module, which is a framework
that supports multiple replication mechanisms, was designed and implemented.
As shown in Figure 3.6, the PRS is based on the Flexible Replicated Objects model and
is implemented on top of Prospero. Prospero servers have libraries that are used for
various replication mechanisms, and each object has attributes that define a replication
mechanism to be used. The PRS provides applications and servers with APIs
which are used to manage replicas. Multiple replication mechanisms are stored as
libraries, and each object has meta-data indicating a replication mechanism to be
used to replicate it. Each replication mechanism provides the Prospero servers with
a list of functions that define operations on replicas. When a replicated object is
accessed by clients, the appropriate function is called at run-time according to the
meta-data of the object. The prototypes of these functions are predefined in the
system so that replication mechanisms can have the same interfaces for the same
operations.
Application programmers can also add their own replication mechanisms to the
framework. A newly added replication mechanism must provide a given set of operations
for replicas according to the predefined prototypes. The operations include
read, write, create and delete. When building these operations programmers can
use the APIs provided by the Prospero server to manipulate attributes and objects.
Details on the framework and the APIs are explained in Chapter 4.
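The requirement that every mechanism supply the same predefined operations can be illustrated with an abstract interface and a registry. The sketch below is a hypothetical illustration of that idea, not the actual prototypes used in the PRS (which are described in Chapter 4): the names `ReplicationMechanism`, `Registry`, and `LastWriterWins`, the meta-data key `mechanism`, and the use of a plain dictionary as the object store are all assumptions.

```python
from abc import ABC, abstractmethod


class ReplicationMechanism(ABC):
    """Stand-in for the predefined prototypes: create, read, write, delete."""

    @abstractmethod
    def create(self, store, name):
        ...

    @abstractmethod
    def read(self, store, name):
        ...

    @abstractmethod
    def write(self, store, name, value):
        ...

    @abstractmethod
    def delete(self, store, name):
        ...


class Registry:
    """Mechanisms are stored by name and looked up from object meta-data."""

    def __init__(self):
        self._mechanisms = {}

    def add(self, mechanism_name, mechanism):
        if not isinstance(mechanism, ReplicationMechanism):
            raise TypeError("mechanism must implement the predefined operations")
        self._mechanisms[mechanism_name] = mechanism

    def lookup(self, metadata):
        return self._mechanisms[metadata["mechanism"]]


class LastWriterWins(ReplicationMechanism):
    """A deliberately trivial application-supplied mechanism."""

    def create(self, store, name):
        store[name] = None

    def read(self, store, name):
        return store[name]

    def write(self, store, name, value):
        store[name] = value

    def delete(self, store, name):
        del store[name]


# Hypothetical usage: register a new mechanism and resolve it at run-time.
registry = Registry()
registry.add("last-writer-wins", LastWriterWins())
store = {}
mechanism = registry.lookup({"mechanism": "last-writer-wins"})
mechanism.create(store, "notes")
mechanism.write(store, "notes", "draft 1")
current = mechanism.read(store, "notes")
```

Because the abstract base class refuses to instantiate a subclass that omits any of the four operations, an incomplete mechanism is rejected before it can be registered.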
Mechanisms can also be added on the client side so that consistency control can
be enforced by clients [25]. If an application developer finds a more efficient way of
managing replicated objects, client side mechanisms can be more efficient.
Since this framework supports multiple replication mechanisms, there are space
and performance overheads. Meta-data used to support replication mechanisms are
stored with objects. This results in additional use of disk space. Also, the system
performance is degraded since the server must handle the replicas of each object
differently. The system must use the meta-data associated with the object to find
the appropriate replication mechanism at run-time. A more precise analysis of the
system and its overheads can be found in Chapter 5.
3.3.2 Multiple Levels of Replication
The increasing use of mobile computers presents other challenges for replication
systems due to the high possibility of disconnection. Replicas supported on mobile
hosts must be different from those on stable hosts.
To support replication mechanisms for mobile computers, the PRS defines three
different levels of replicas: caches, light-weight replicas, and regular replicas. Caching
is supported as one of the replication mechanisms in the PRS. By supporting caching
together with other replication mechanisms, the PRS can take advantage of replication
information when managing caches.
Caches and replicas have many similar characteristics. The goal of both caches
and replicas is to improve the availability of objects and to reduce read/write latency,
but caches can have lower space and performance overheads than replicas, and cache
replacement algorithms are used to replace items in caches if there are conflicts. In
some senses, replicas can be regarded as caches with different cache manipulation
algorithms. If the cost of network connection and the cache miss penalty are high, as
in mobile computers, replicas can have better performance because items in caches
can be swapped out by cache replacement algorithms even though users may want to
continue to access them, whereas replicas continue to exist on mobile computers
unless users delete them explicitly.
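The difference in lifetime between cached items and replicas can be seen in a small sketch. The least-recently-used policy used here is an assumption chosen for illustration (the text does not name a particular replacement algorithm), and `LRUCache` and `ReplicaStore` are hypothetical names.

```python
from collections import OrderedDict


class LRUCache:
    """Items may be evicted by the replacement algorithm at any time."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def put(self, name, value):
        self._items[name] = value
        self._items.move_to_end(name)
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used item

    def get(self, name):
        if name not in self._items:
            return None  # cache miss: a server would have to be contacted
        self._items.move_to_end(name)
        return self._items[name]


class ReplicaStore:
    """Replicas stay in place until the user deletes them explicitly."""

    def __init__(self):
        self._items = {}

    def put(self, name, value):
        self._items[name] = value

    def get(self, name):
        return self._items.get(name)

    def delete(self, name):
        del self._items[name]


# Hypothetical usage: the same three objects placed in a cache and as replicas.
cache = LRUCache(capacity=2)
replicas = ReplicaStore()
for name, value in [("a", 1), ("b", 2), ("c", 3)]:
    cache.put(name, value)
    replicas.put(name, value)
cached_a = cache.get("a")
replicated_a = replicas.get("a")
```

On a disconnected mobile host the miss on `a` could not be served at all, whereas the replica of `a` remains readable.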
Different environments and applications need different levels of replication to get
better performance. As said earlier, three different levels of replicas were defined
in the PRS so that the PRS is flexible enough to work well in different situations.
Among them, light-weight replicas and caches are discussed in detail in the following
sections.
3.3.2.1 Light-Weight Replicas
When a replica is altered, the changes must be propagated to the other replicas. If
some replicas are unavailable, the changes must be logged and kept until the propagation
succeeds. These days, mobile computers are becoming more and more popular,
and they are disconnected much of the time. If a replica is on a mobile computer,
the possibility of the replica being unavailable is very high due to disconnection.
As a consequence, as propagations continue to fail, the logs become larger and the
replication system must pay higher prices to handle propagation and reconciliation
of disconnected replicas. In addition, the possibility of conflicts between replicas
increases with the duration of disconnection, which increases the cost of reconciliation.
As the number of replicas grows, the problems with disconnected replicas
become even worse. Using caches on mobile computers can solve only some of the
problems and even cause another problem, since cached objects can be swapped out
by cache replacement algorithms and the servers on mobile computers cannot cache
the objects again if disconnected.
Locks can be used on cached data to avoid undesired replacements. But, still, if
the corresponding source object is not available, the changes to the caches cannot
be propagated even though other replicas are available. Therefore, there is a need
for a system that supports the characteristics of both caches and replicas together.
The PRS can provide the two-tier replication mechanism to support mobile computers.
In the two-tier replication mechanism there are two kinds of replicas: regular
replicas and light-weight replicas (LWRs). Regular replicas are used on stable computers
whereas LWRs are used on mobile computers or poorly connected computers.
By sacrificing the consistency of LWRs, the overall performance is improved. Operations
on regular replicas are different from those of LWRs because operations on
LWRs remain tentative until LWRs are connected with regular replicas. LWRs are
hybrids between caches and replicas, with characteristics that lie somewhere between
replicas and caches.
Each LWR has a link to a corresponding regular replica which is responsible for
propagation of updates from/to LWRs and resolutions of update conflicts. Updates
on an LWR remain tentative and, when the LWR is connected to regular replicas,
updates are propagated to the corresponding regular replica, and are propagated to
other regular replicas. When a regular replica is updated, linked LWRs are notified
of the changes and marked as stale. If an LWR is disconnected during updating
of the regular replica, the server maintains logs of the changes for a finite time for
later propagation. The logs will be deleted after the specified time has passed or
the propagation to the LWR has succeeded. The structure of these logs is explained
in Section 4.7. LWRs have a list of regular replicas so that they can contact other
regular replicas if the corresponding regular replica is not available. This list may
be incomplete at first and is filled in as communication between replicas proceeds.
Section 4.8.3 shows how two-tier replication mechanisms are supported in the PRS.
If the number of LWRs that have links to a regular replica exceeds a certain
threshold, the regular replica is split into two to avoid becoming a bottleneck.
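The behaviour of an LWR on reconnection can be sketched as follows. This is a simplified illustration under stated assumptions, not the mechanism of Section 4.8.3: the class and method names are hypothetical, availability is a plain flag, and propagation among regular replicas, change logs with expiry, and conflict resolution are all omitted.

```python
class RegularReplica:
    def __init__(self, name, available=True):
        self.name = name
        self.available = available
        self.value = None
        self.linked_lwrs = []

    def apply(self, updates, origin=None):
        for value in updates:
            self.value = value
        # Notify linked LWRs (other than the sender) that their copy is stale.
        for lwr in self.linked_lwrs:
            if lwr is not origin:
                lwr.stale = True


class LightWeightReplica:
    def __init__(self, primary, known_regulars):
        self.primary = primary                       # corresponding regular replica
        self.known_regulars = list(known_regulars)   # possibly incomplete list
        self.tentative = []
        self.stale = False
        primary.linked_lwrs.append(self)

    def write(self, value):
        # Updates stay tentative until a regular replica accepts them.
        self.tentative.append(value)

    def reconnect(self):
        # Try the corresponding regular replica first, then the known others.
        for target in [self.primary] + self.known_regulars:
            if target.available:
                target.apply(self.tentative, origin=self)
                self.tentative = []
                return target.name
        return None  # nothing reachable: updates remain tentative


# Hypothetical usage: the corresponding regular replica is down on reconnect.
base1 = RegularReplica("base1", available=False)
base2 = RegularReplica("base2")
mobile = LightWeightReplica(base1, [base2])
other = LightWeightReplica(base2, [base1])
mobile.write("x")
accepted_by = mobile.reconnect()
```

The fallback list is what allows the update to reach `base2` even though `base1`, the replica the LWR is linked to, is unavailable.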
To reduce the number of regular replicas, LWRs can also be placed on stable
hosts. A smaller number of regular replicas means fewer transmissions
when updates are propagated. Therefore, by using LWRs instead of regular
replicas, overall system performance can be enhanced at the expense of consistency of
LWRs. There are trade-offs between overall performance and consistency of LWRs,
and the best configuration depends on the access patterns of replicas and consistency
requirements of replicas. In the PRS, users can specify the type of replicas to be
placed on a host. To help users make a better decision, the PRS can provide users
with statistical information, such as read/write delay, read/write frequency, conflict
rate and rollback rate.
This two-tier replication scheme is similar to that of Gray et al. [16] except that
in Gray’s mechanism, the inaccessibility of the corresponding regular replica makes
it impossible to propagate the changes on LWRs even though other regular replicas
are available. In this scheme, mobile replicas have a list of regular replicas and can
contact other replicas if necessary. Coda [33] uses caches as second-level replicas,
but the same problem of inaccessibility of corresponding replicas exists. In addition,
in Gray’s scheme, every node must have the same set of replicas. In the PRS,
file granularity is used to enable selective replication, so different nodes can have
different sets of replicas. In mobile environments it is critical to be able to replicate
objects selectively due to limited resources.
A problem with previous work that supports mobile computers is that most of
the systems support only a single consistency control mechanism, i.e. an optimistic
mechanism. This prevents users from using diverse applications on mobile computers
or loosely connected computers.
In the PRS, Gray’s leases [15] can be used for applications that need strong
consistency in mobile environments. Leases are similar to locks except that leases
have an expiration time. A mobile node can acquire leases. Upon acquisition of
leases, the mobile node has total control of the object for a certain period that is
specified in leases. Leases can be renewed and the mobile node must propagate
updates before leases expire. Otherwise, it will lose the changes that were made
during the leasing period.
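The lease rules described above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the PRS implementation: the `lease_t` type, the function names, and the use of abstract integer clock ticks are all hypothetical.

```c
#include <stdbool.h>

/* Hypothetical lease record; times are abstract clock ticks. */
typedef struct {
    int  holder;      /* id of the node holding the lease, -1 if none */
    long expires_at;  /* tick at which the lease stops being valid */
} lease_t;

/* A lease is valid only for its holder and only before it expires. */
bool lease_valid(const lease_t *l, int node, long now)
{
    return l->holder == node && now < l->expires_at;
}

/* Grant the lease if it is free, expired, or already held by this node. */
bool lease_acquire(lease_t *l, int node, long now, long duration)
{
    if (l->holder != -1 && now < l->expires_at && l->holder != node)
        return false;
    l->holder = node;
    l->expires_at = now + duration;
    return true;
}

/* Renewal extends a still-valid lease; an expired lease cannot be renewed. */
bool lease_renew(lease_t *l, int node, long now, long duration)
{
    if (!lease_valid(l, node, now))
        return false;
    l->expires_at = now + duration;
    return true;
}

/* Updates are accepted only while the lease is valid; after expiry the
   changes made during the leasing period are lost. */
bool lease_propagate(const lease_t *l, int node, long now)
{
    return lease_valid(l, node, now);
}
```

In this sketch a node that calls `lease_propagate` after `expires_at` gets `false`, which corresponds to losing the changes made during the leasing period.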
3.3.2.2 Caches
Replication and caching have many similar characteristics, so the PRS supports
caching as one of the replication mechanisms. Section 4.8.4 explains how caching
is supported in the PRS. The most significant benefit of supporting caching
and replication together is that cache management modules can use the replication
information for better performance.
3.3.2.3 Configuration Examples
There are many possible configurations of caches and replicas in distributed systems
as shown in Figure 3.7. In Figure 3.7 (a), only one object is used and clients have
cached data, so there is a bottleneck since every cache must contact a single central
object to update changes. Objects can be replicated to avoid the bottleneck and
improve data availability as well as fault tolerance, as shown in Figure 3.7 (b) which
has three replicas and five cached objects. In Figure 3.7 (c), light-weight replicas are
used instead of caches on some nodes. Using light-weight replicas rather than caches
gives users a more persistent view for the objects. Regular replicas are used in all
[Figure 3.7: Caches and Replicas. Panels (a) to (d). Legend: □ = LWR; ○ = full replica; △ = cache; two line styles distinguish loosely connected and tightly connected nodes.]
nodes in Figure 3.7 (d), and all replicas have equal privileges. If regular replicas are
used in all nodes including mobile nodes, it could generate more space overhead and
performance overhead than other configurations while it gives equal privileges for
replicas on both mobile nodes and stable nodes.
Which configuration is better depends on many factors such as the requirements
of applications, object access patterns and network latencies. The PRS allows users
to choose any of the above configurations to cope best with specific
situations.
3.3.3 Granularity
The granularity of a replica affects network traffic, performance, and the scalability
of distributed systems. Mechanisms using coarse granularity, such as volume
granularity, replicate a volume which includes files as well as directories. Volume
granularity takes advantage of the locality of accesses. One of the problems with
volume granularity is that the entire volume must be replicated together even if users
want to replicate only a small part of it. With fine granularity, like file granularity,
the system can replicate objects selectively. The selective replication can reduce the
number of replicas for an object since objects are replicated only as needed.
However, fine granularity has space overhead since each object needs to have its
own replication attributes and values. There is also performance overhead because
servers must handle each object differently depending on the attributes of the object.
The PRS uses fine granularity for object replication. Even though supporting fine
granularity has overhead, the system can achieve better performance since
the system can select the most appropriate mechanism for each object and objects
are replicated only where they are used.
Figure 3.8 shows an example of volume replication and actual sharing of files. In
volume replication, all files in the same volume must be replicated together when
a user wants to replicate an object in that volume. In Figure 3.8, objects a, b, and
c in the same volume are replicated at five nodes with volume granularity. The
actual sharing is represented with dotted lines, and object a can be replicated at
only three nodes with file granularity and object b can be replicated at only two
nodes since only two hosts use object b. Note also that object c does not have to be
replicated because it is not really shared. As the number of replicas or the size of a
volume grows, the performance of volume replication can degrade sharply because
of false sharing. This false sharing problem is similar to the false sharing problem
in distributed shared memory systems described in [20].
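The Figure 3.8 example can be put in numbers with a short C sketch. The function names are hypothetical, and the count for object c (one copy, since it is not shared) is an assumption drawn from the description above.

```c
#include <stddef.h>

/* Copies stored when the whole volume is replicated at every node. */
size_t volume_copies(size_t n_objects, size_t n_nodes)
{
    return n_objects * n_nodes;
}

/* Copies stored when each object is replicated only where it is used.
   users_per_object[i] is the number of nodes that actually use object i. */
size_t file_copies(const size_t *users_per_object, size_t n_objects)
{
    size_t total = 0;
    for (size_t i = 0; i < n_objects; i++)
        total += users_per_object[i];
    return total;
}
```

With three objects at five nodes, volume granularity stores 3 x 5 = 15 copies, whereas file granularity with usage counts {3, 2, 1} stores 3 + 2 + 1 = 6.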
[Figure 3.8: Volume replication and file sharing. Legend: o = a volume replica; O = an object; dotted line = actual sharing.]
Ficus [17] and Coda [33] use volume granularity for replication, and many unnecessary
files are replicated because of false sharing. To overcome this, Ratner
proposed selective replication for the Ficus file system [32] to reduce the granularity
to a file. But in his scheme, to replicate a file, all directories from the root directory
to the file are replicated as well as the file itself.
The granularity of operations also affects system performance. When changes
to a replica are propagated, coarse granularity causes more false sharing problems
than fine granularity, especially when pessimistic methods are used. Therefore, coarse
granularity replication can have a scalability problem in large distributed systems,
though it may work well in small ones.
3.4 Conclusion
The PRS supports file granularity for replication and operations. With the support
of both file granularity and multiple replication mechanisms, users can specify
exactly which files are to be replicated and which replication mechanism is to be
used for each file. This feature gives great flexibility to applications and users and
compensates for the overheads of file granularity since applications can select replication
mechanisms according to their needs. The analysis of the PRS can be found
in Chapter 5.
Chapter 4
Implementation
There are many implementation issues in building a flexible framework that supports
multiple replication mechanisms. First, different mechanisms need different
operations and meta-data which must be supported together in a single framework.
When a request arrives for a replica, the system must provide efficient ways of calling
the correct functions at run-time. Second, the framework must provide easy ways
of adding new mechanisms, because the framework cannot include every
replication mechanism in advance and it will not be widely used if it is difficult
for application programmers to add new mechanisms.
The following sections explain how the PRS is implemented to support replication
in a distributed system. This model assumes that the distributed system has a set
of objects and those objects can be replicated on multiple nodes, including mobile
ones.
The PRS has been implemented on Prospero as one of the middleware services
for the Prospero Directory service. As shown in Figure 4.1, Prospero has layered
semantics with a simple data model for the repository. The Prospero repository contains
[Figure 4.1: The Prospero Repository. Layers shown: Applications; Semantic Middleware; MW services (e.g., replication, IP protection, information flow); Repository holding meta-data (semantic), meta-data (MW services), and raw data.]
objects and their attributes. Attributes define semantics for applications and provide
meta-data for middleware services. Middleware services include replication services,
IP protection services, information flow services, and garbage collection services.
4.1 Overall Structure of the Framework
The proposed framework was implemented as the Prospero Replication Service
(PRS) module in the Prospero system as shown in Figure 4.2. The PRS has multiple
replication mechanisms which are managed by the Replication Management module
through APIs. Clients send requests to a server to access an object, and the server
locates the replication mechanism to use according to the attributes of the requested
object, and then the server calls the correct function in the replication mechanism
through the replication management module. The PRS module sends requests to
other Prospero servers for consistency control, if necessary.
The Replication Management Module accesses meta-data and objects through
APIs provided by the server. The functions in replication mechanisms use library
functions in the module to access meta-data and objects.
As shown in Figure 4.2, there are also client side mechanisms so that consistency
control can be enforced by clients. If an application developer finds a more efficient
way of managing replicated objects, the mechanism can be built on the client side.
Clients and servers can be on the same host, and servers can be added without
interfering with other servers.
[Figure 4.2: The Prospero Replication Service (PRS) module in Prospero. Two clients, each with applications and client-side replication mechanisms, connect over the network to two Prospero servers. Each server contains a PRS, consisting of a Replication Management Module and replication mechanisms, which accesses objects and attributes through APIs.]
4.1.1 Replication Management Module
The Replication Management module provides interfaces for object management
and communicates with other servers for consistency control. This module hides
details about the objects and their attributes from application programmers by
providing APIs which allow programmers to access local and remote objects and
attributes. The APIs are provided as libraries, and application programmers can
add new libraries to the module as well as share the libraries with others.
The Replication Management module maintains meta-data used for replication
which are stored with each object in permanent storage. There are also mechanism-
specific meta-data. Section 4.2 explains the meta-data used in this module.
4.1.2 Replication Mechanisms
To add a replication mechanism to the framework, application programmers must
provide a set of functions as shown in Table 4.1. The table has both client and server
side entries, so client-enforced mechanisms can be supported. The client side entries
can have one of the following values:
TOANY: The client can send requests to any server that has replicas of the requested
object. This is the default value if consistency control is enforced by
servers.
function name: This indicates a function name to be used on the client side and
enables clients to have application-specific consistency control.
The server side entries can have a function name or a blank. If the entry is a
blank, only local operations are performed. Read and Write operations read and
Operation    Client Side    Server Side
Read         TOANY          QREAD
Write        TOANY          QWRITE
Create       TOANY          QCREATE
Delete       TOANY          QDELETE
Insert       TOANY          QINSERT
Remove       TOANY          QREMOVE
GET          TOANY          QGETATTRIBUTE
SET          TOANY          QSETATTRIBUTE
DESTROY      TOANY          QDESTROY
Table 4.1: Example of a function table for a replication mechanism
write contents of objects, and Insert and Remove operations insert or delete object
entries in directories. Create and Delete operations are used to create or delete
replicas. GET, SET, and DESTROY operations manipulate the attributes of objects.
The Prospero server invokes the correct functions by mapping entries of the table
to real functions. Operations in the table are extensions of common I/O operations
in operating systems. Example mechanisms supported in the PRS are addressed in
Section 4.8.
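The mapping from table entries to real functions can be sketched with an array of function pointers. This is only an illustration under stated assumptions: the `prs_table` type, the handler signature, and the stand-in functions `q_read` and `q_write` are hypothetical and are not the Prospero API. A NULL entry plays the role of a blank server-side entry, for which only the local operation is performed.

```c
#include <stddef.h>

/* Operations of Table 4.1, used as indices into the function table. */
typedef enum { OP_READ, OP_WRITE, OP_CREATE, OP_DELETE, OP_INSERT,
               OP_REMOVE, OP_GET, OP_SET, OP_DESTROY, OP_COUNT } prs_op;

/* Hypothetical handler signature. The return values in this sketch
   merely identify which function ran. */
typedef int (*prs_handler)(const char *object);

/* Server-side column of the table; NULL stands for a blank entry. */
typedef struct {
    const char *package;           /* mechanism name, e.g. "QUORUM" */
    prs_handler server[OP_COUNT];  /* one entry per operation */
} prs_table;

/* Stand-ins for mechanism functions such as QREAD and QWRITE. */
int q_read(const char *object)  { (void)object; return 1; }
int q_write(const char *object) { (void)object; return 2; }

/* Used when the table entry is blank: local operation only. */
int local_only(const char *object) { (void)object; return 0; }

/* Look up the handler for an operation and call it. */
int prs_dispatch(const prs_table *t, prs_op op, const char *object)
{
    if ((int)op < 0 || op >= OP_COUNT)
        return -1;                       /* unknown operation */
    prs_handler h = t->server[op];
    return h ? h(object) : local_only(object);
}
```

Because the lookup is a single array index, choosing a different mechanism for an object only requires attaching a different table to it.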
To add new mechanisms, dynamic link libraries (DLLs) can be used. With DLLs,
the existing server does not need to be recompiled. The libraries are put in specified
directories to make them secure. Mechanisms can also be added on the client side.
The client-enforced mechanism is less secure, but it can take advantage of better
knowledge of the objects and can provide more application specific mechanisms.
Figure 4.3 shows the life cycles of objects in the PRS. The operations in Table 4.1
are the minimal set of operations needed for modeling all possible life cycles
in Figure 4.3. Insert and Remove operations are used with directory objects, and
SET, GET and DESTROY operations are used to handle attributes of objects. The
rest of the operations are applied to all kinds of objects.
4.2 Attributes of an Object
Each object uses the following attributes for replication management. The PRSPACKAGE
and PRSOPERATIONS attributes define the replication mechanisms to be
used for each object. By setting different attribute values, different objects can have
different replication mechanisms.
PRSPACKAGE: This specifies the name of the replication package. When the
first replica is created, this attribute is set and cannot be changed later. Setting
this attribute specifies a suite of operations.
PRSOPERATIONS: This contains the table, shown in Table 4.1, that defines
individual operations. The system assigns a table to this attribute when the
PRSPACKAGE attribute is specified. If an operation is invoked for an object,
the system looks up the PRSOPERATIONS attribute and calls the function
specified in the table.
PRSMETADATA: Different mechanisms need different meta-data. This attribute
contains meta-data specific to a replication mechanism. For example, an optimistic
mechanism needs to have version vectors, whereas quorum consensus
mechanisms need to have quorums.
REPLICA: This has a list of replicas of the object. The list may be incomplete
and as the propagation and reconciliation processes go forward, the list is
eventually completed. If a replication mechanism requires a complete list,
programmers of the replication mechanism should make sure that all replicas
have a complete list after adding or deleting replicas.
Replicas in the list are distinguished by replica identifiers. The replica identifier
includes a host-specific object name and the name of the host where that replica
is located, such as (darkstar.isi.edu, /path/name/filename).
4.3 System Object
Each server has one special object called a system object, which is used to keep
system-wide resources for replication. The object holds the meta-data and functions
associated with each repository. Update logs are an example of meta-data stored in
the system object, and application programmers can add meta-data and functions
to the system object if necessary.
The system object can become a bottleneck because it is
accessed frequently. Several techniques can improve the performance of
accessing the system object. Locks can be used to avoid conflicting accesses. By
allowing locks on each attribute of the system object rather than on the system
object itself, the system can avoid false sharing and allow concurrent accesses of the
system object.
Update Logs: In optimistic mechanisms, operations are performed on a single
replica and are propagated later. To propagate operations, the system must
keep logs of the operations. The server collects and maintains the update logs
as an attribute of the system object. In other words, all the update logs of
the server are stored in the system object so that the server does not need
to check all objects to find update logs periodically. The structure of update
logs is addressed in Section 4.7 which illustrates what kinds of information are
saved in update logs.
The update logs are deleted when the server finds out that the operations have been
successfully applied to all the other replicas.
Accessing update logs can be a bottleneck, too. This can be optimized
because write operations on update logs can be executed concurrently. Only
delete operations on update logs may cause conflicts with other operations.
By allowing concurrent read and write operations, the system can have better
performance when accessing update logs.
4.4 Libraries
The following functions are added to the libraries in the framework, and application
programmers can also add additional functions to the libraries and share functions
with others. These functions and the set of operations defined in the PRSOPERATIONS
attribute can define and support all the actions in all the phases
described in Section 3.2.
check-op-onserver()
This function checks if an operation can be performed on the specified server.
It therefore checks the liveness of servers as well as access control for the operation.
where hsoname represents the host-specific object name. The ’-l’ option is used when
creating second-level replicas:
mkreplica -l -h BOREAS.ISI.EDU -s POS/local_vsystems/egim/egim/file1 -p TWO-TIER my_file1
The above command will create a second-level replica called my_file1, which is a
child replica of the object file1 on the host BOREAS.ISI.EDU, and file1 will become
the master replica of my_file1. The ’-l’ option works only with the two-tier replication
mechanism. Otherwise, the command will return an error.
The ’-r’ option is used for recursive replication, that is, to replicate objects
in directories. More details can be found later in this section. The hsoname given
with the ’-r’ option has to be a directory name. For example:
mkreplica -r -h BOREAS.ISI.EDU -s POS/local_vsystems/egim/egim/dirname1 -p OPTIMISTIC my_dir1
The above command will replicate the objects in the dirname1 directory on the host
BOREAS.ISI.EDU to the my_dir1 directory on the local host. If my_dir1 does not exist on
the local host, the command will create it. The ’-p’ option specifies the replication
protocol to be used. The protocol names supported by this command are
read-one/write-all (ROWA), the quorum mechanism (QUORUM), the optimistic
mechanism (OPTIMISTIC), and the two-tier replication mechanism (TWO-TIER).
Directory replication can be done with the ’-r’ option of the ’mkreplica’ command.
In Prospero, a directory is considered to be a Prospero object. The Prospero
object has attributes, and the links to file entries of a directory are saved as attributes
of the directory. If the same procedure is used for directory replication as
for file replication, the replicated directory will have the links to objects in the source
directory, instead of having objects in the source directory replicated. This is not
what users expect from directory replication. Users usually want to replicate all the
objects in the directory, rather than create a replica of the directory itself that only has
links to objects in the source directory.
So in the PRS, directory replication works as follows: First, a directory is created
on the target host. Then, all the objects in the source directory are replicated. If
there are any subdirectories, the same process is applied to each of them.
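The directory replication steps above amount to a recursive deep copy. The following C sketch shows the idea on an in-memory tree. The `node` type and helper functions are hypothetical stand-ins for Prospero objects and server calls, and the fixed limits (8 entries, 31-character names) exist only to keep the example short.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory object: a file, or a directory with entries. */
typedef struct node {
    char         name[32];
    int          is_dir;
    struct node *child[8];   /* entries of a directory */
    int          nchild;
} node;

/* Allocate a zeroed node; returns NULL on allocation failure. */
node *node_new(const char *name, int is_dir)
{
    node *n = calloc(1, sizeof *n);
    if (n) {
        strncpy(n->name, name, sizeof n->name - 1);
        n->is_dir = is_dir;
    }
    return n;
}

/* Add an entry to a directory; returns the entry, or NULL if it cannot be added. */
node *node_add(node *dir, node *c)
{
    if (!dir || !c || dir->nchild >= 8)
        return NULL;
    dir->child[dir->nchild++] = c;
    return c;
}

/* Replicate src: a file is copied; for a directory, the target directory
   is created first and then every entry is replicated into it, recursively. */
node *replicate(const node *src)
{
    node *dst = node_new(src->name, src->is_dir);
    if (!dst)
        return NULL;
    for (int i = 0; i < src->nchild; i++)
        node_add(dst, replicate(src->child[i]));
    return dst;
}
```

The replica returned by `replicate` shares no nodes with the source tree, which is the difference from copying only the directory object and its links.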
4.5.2 vrm
A new option, ’-a’, is added to the vrm command of Prospero. This option is used to
delete all the replicas of an object. Without this option, only the specified replica is
deleted. For example, vrm -a link1 will delete link1 and all of its replicas, whereas
vrm link1 will delete only link1, update the list of replicas, and propagate the
information to the other replicas.
4.6 Protocols
To support replication, the following protocols are added to the Prospero system.
CHECK-EDIT-OBJECT-INFO
This protocol command is used for the first phase of the two-phase commit
algorithm [28] [37]. In the first phase of the two-phase commit algorithm, objects
are locked, and then in the second phase, the locked objects are modified.
This protocol command calls the chk_edit_object_info() function on the server
side. This function checks the requested object and its access control list to
see if the operation can be successfully performed on it.
COMMIT-EDIT-OBJECT-INFO
This protocol command is used for the second phase of the two-phase commit
algorithm. After successful completion of the CHECK-EDIT-OBJECT-INFO
protocol, clients will send this protocol command. Then, the cmt_edit_object_info()
function is called on the server side, which calls the edit_object_info_sub()
function with a flag indicating that the call is from the cmt_edit_object_info()
function. The CHECK-EDIT-OBJECT-INFO protocol and the COMMIT-
EDIT-OBJECT-INFO protocol can be used for pessimistic mechanisms.
CREATE-REPLICA
The user command ’mkreplica’ uses this protocol command, and the create_replica()
function is called on the server side. This function creates an
object and copies all attributes of the source object.
DELETE-ALL-REPLICAS
The ’-a’ option of the vrm user command uses this protocol to delete all the
replicas of an object. The delete_all_replicas() function is called on the server
side. This function gets the list of replicas and sends the DELETE-OBJECT
request to other servers. If any replica cannot be deleted, a log is created for
later deletion. For the structure of the log, see the following section.
4.7 Update Logs
Update logs are generated when a replica is unavailable for an operation or when a
mechanism requires delayed propagation rather than immediate propagation.
A log has the following entries:
Name of the operation
The operation that is performed.
Initiating object
This indicates the object which is initially updated.
List of updated replicas
The list of successfully updated replicas.
List of replicas to be updated
The list of replicas that need to be updated. This entry is checked before
examining the list of updated replicas. If this entry is not NULL, the list of
updated replicas is disregarded.
Version number
The version number when the operation is performed.
Arguments of the operation
Any arguments of the operation.
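The log entries listed above could be represented roughly as in the C sketch below. The field names, the fixed array size, and the `pending_replicas` helper are hypothetical. In particular, the rule used when the to-be-updated list is NULL (every known replica not in the updated list is pending) is an assumption, since the text only states that a non-NULL list takes precedence.

```c
#include <stddef.h>
#include <string.h>

#define MAX_REPLICAS 8

/* Hypothetical layout of one update-log entry. */
typedef struct {
    const char  *op_name;                /* name of the operation */
    const char  *initiator;              /* initiating object */
    const char  *updated[MAX_REPLICAS];  /* successfully updated replicas */
    size_t       n_updated;
    const char **to_update;              /* replicas to be updated, or NULL */
    size_t       n_to_update;
    long         version;                /* version number at the time of the operation */
    const char  *args;                   /* arguments of the operation */
} update_log;

/* Return nonzero if the replica appears in the updated list. */
int already_updated(const update_log *log, const char *replica)
{
    for (size_t i = 0; i < log->n_updated; i++)
        if (strcmp(log->updated[i], replica) == 0)
            return 1;
    return 0;
}

/* Fill out[] with the replicas that still need the operation and return
   the count; the caller must provide room for all candidates. The
   to_update list is checked first and, if not NULL, the updated list is
   disregarded. */
size_t pending_replicas(const update_log *log,
                        const char **all, size_t n_all,
                        const char **out)
{
    size_t n = 0;
    if (log->to_update != NULL) {
        for (size_t i = 0; i < log->n_to_update; i++)
            out[n++] = log->to_update[i];
        return n;
    }
    for (size_t i = 0; i < n_all; i++)   /* assumed fallback rule */
        if (!already_updated(log, all[i]))
            out[n++] = all[i];
    return n;
}
```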
4.8.1.1 PRSMETADATA
The following meta-data are specific to the optimistic mechanisms, and they are
stored in the PRSMETADATA attribute of each object.
Version Vectors: The system uses version vectors to find out up-to-date replicas or
inconsistency between replicas. When a replica is changed, the version vector
of the replica is updated. The version vectors are lists of pairs of the replica
identifier and the replica version number.
Old Copies: Old copies of the object and their timestamps are kept for possible
rollbacks. An old copy is destroyed when the largest synchronized timestamp
becomes later than the timestamp of the old copy.
Largest Synchronized Timestamp: This is the timestamp of the latest successful
update propagation.
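How version vectors reveal an up-to-date replica or an inconsistency can be sketched with the usual dominance comparison. The C code below is an illustration only: the `vv_entry` type and function names are hypothetical, and treating a missing entry as version 0 is an assumption.

```c
#include <stddef.h>
#include <string.h>

/* One (replica identifier, version number) pair. */
typedef struct { const char *replica; unsigned version; } vv_entry;

typedef enum { VV_EQUAL, VV_NEWER, VV_OLDER, VV_CONFLICT } vv_order;

/* Version recorded for a replica; a missing entry counts as version 0. */
unsigned vv_get(const vv_entry *v, size_t n, const char *replica)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(v[i].replica, replica) == 0)
            return v[i].version;
    return 0;
}

/* Compare vector a with vector b entry by entry. If each side is ahead
   in at least one entry, the replicas were updated independently and
   are in conflict. */
vv_order vv_compare(const vv_entry *a, size_t na,
                    const vv_entry *b, size_t nb)
{
    int a_ahead = 0, b_ahead = 0;
    for (size_t i = 0; i < na; i++) {
        unsigned vb = vv_get(b, nb, a[i].replica);
        if (a[i].version > vb) a_ahead = 1;
        if (a[i].version < vb) b_ahead = 1;
    }
    for (size_t i = 0; i < nb; i++)   /* also covers entries present only in b */
        if (b[i].version > vv_get(a, na, b[i].replica))
            b_ahead = 1;
    if (a_ahead && b_ahead) return VV_CONFLICT;
    if (a_ahead) return VV_NEWER;
    if (b_ahead) return VV_OLDER;
    return VV_EQUAL;
}
```

A result of `VV_CONFLICT` is the case in which a conflict resolution routine would be needed; `VV_NEWER` and `VV_OLDER` only require copying the newer replica over the older one.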
4.8.1.2 PRSOPERATIONS
This attribute defines operations that support optimistic mechanisms. Each operation
is defined as follows:
• Read, GET: Since optimistic mechanisms do not need to read other replicas,
the only thing the read operation has to do is to read the local copy of the
requested object.
• Write, Insert, Remove, SET, DESTROY: After changing the local copy, this
operation adds the changes and a timestamp to the update logs of the system
object for later propagation, then saves the old copy.
• Create: After creating a new replica using the server APIs, the new replica
is added to the list of replicas, or the REPLICA attribute, and the new list is
propagated to the other replicas.
• Delete: After deleting a replica, the list of replicas is updated and propagated
to the other replicas.
4.8.1.3 Libraries
The following functions are added to the libraries for optimistic mechanisms.
Update propagation: This function is called by the propagate_all_updates function,
which is invoked by users or servers. A star topology is used for propagation. If
conflicts are detected, the server calls the conflict resolution routine.
If, when propagating updates, a replica is unavailable because of network partitions,
then the server saves the log, skips to the next available replica,
and continues to propagate updates. The server periodically retries
propagation to the unavailable replica for a certain period
of time. If propagation still fails, the server notifies the owner of the object of
the failure. The largest synchronized timestamp attribute is updated when
propagation is successful for all replicas.
When a system is reconnected after disconnection, this function is called by
an explicit user command or by the server when the server detects any inconsistency
between replicas.
Conflict Resolution: If a conflict between replicas is found, the automatic resolution
algorithm is applied first. If the conflict cannot be resolved automatically,
Write Quorum (w): the minimum number of replicas that must be written in order to
complete a write operation.
Write Lock: The server must get w write locks before sending write requests.
The locked objects keep information about which server gets the lock. Locks
are reset after a certain period of time to prevent lost locks. The write operation
is addressed in the next section.
4.8.2.2 PRSOPERATIONS
Operations for the voting mechanism are defined as follows:
• Read, GET: r replicas are read and the contents of the replica with the latest
version are returned as a result.
• Write, Insert, Remove, SET, DESTROY: These operations use the two-phase
commit algorithm. In the first phase, w replicas are locked. In the second
phase, write requests are sent to the replicas which have write locks. If the
server cannot get enough locks, it frees all locks and tries the operation again
later.
• Create: After creating a new replica, quorums and the list of replicas are
changed and propagated to the other replicas.
• Delete: When deleting a replica, the system changes the quorums and the
list of replicas and propagates the changes to the other replicas.
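The read and write operations of the voting mechanism can be sketched in C over an in-memory array of replicas. This is a simplified illustration and not the PRS code: the `replica` type, the function names, and the `available` flag that simulates unreachable servers are hypothetical, and the caller is assumed to supply the new version number. Server ids must be nonzero because 0 marks a free lock.

```c
#include <stddef.h>

/* Hypothetical in-memory stand-in for one replica on a remote server. */
typedef struct {
    int      available;  /* 0 simulates an unreachable replica */
    int      locked_by;  /* id of the server holding the write lock, 0 if free */
    unsigned version;
    int      value;
} replica;

/* Read: consult r available replicas and return the value with the latest
   version. Returns 0 on success, -1 if fewer than r replicas can be read. */
int quorum_read(const replica *reps, size_t n, size_t r, int *value)
{
    size_t seen = 0;
    unsigned best_version = 0;
    int best_value = 0;
    for (size_t i = 0; i < n && seen < r; i++) {
        if (!reps[i].available)
            continue;
        if (seen == 0 || reps[i].version > best_version) {
            best_version = reps[i].version;
            best_value = reps[i].value;
        }
        seen++;
    }
    if (seen < r)
        return -1;
    *value = best_value;
    return 0;
}

/* Write: phase one locks w replicas; phase two writes to the locked ones.
   If w locks cannot be obtained, all locks are freed and -1 is returned
   so that the caller can try again later. */
int quorum_write(replica *reps, size_t n, size_t w, int server,
                 unsigned new_version, int value)
{
    size_t locked = 0;
    for (size_t i = 0; i < n && locked < w; i++) {
        if (reps[i].available && reps[i].locked_by == 0) {
            reps[i].locked_by = server;
            locked++;
        }
    }
    int ok = (locked == w);
    for (size_t i = 0; i < n; i++) {
        if (reps[i].locked_by != server)
            continue;
        if (ok) {
            reps[i].version = new_version;
            reps[i].value = value;
        }
        reps[i].locked_by = 0;   /* release the lock in either case */
    }
    return ok ? 0 : -1;
}
```

Choosing r and w so that r + w exceeds the number of replicas is the usual condition under which every read quorum overlaps the most recent write quorum.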
4.8.2.3 Update Propagation
The update propagation function is added to the libraries. Because only w replicas are
updated with write operations, updates must be propagated to the other replicas.
This function sends updates to the rest of the replicas. When two replicas are
different, the latest replica is copied to the other after comparing the last modified
times.
4.8.3 Two-tier Replication Mechanisms
As addressed in Section 3.3.2.1, two-tier replication mechanisms [16] are useful for
supporting mobile computers. In two-tier replication mechanisms, there are two
types of replicas: regular replicas and light-weight replicas (LWRs). LWRs are similar
to caches in the following respects: LWRs are copied from regular replicas, and
the updates to the regular replicas are propagated to the LWRs as in the callback
method of caching. However, unlike caches, LWRs are not swapped out by replacement
algorithms and LWRs can be created and deleted by explicit requests from
clients or servers, just like regular replicas.
Each LWR has a link to a regular replica, and the regular replica keeps lists of
which LWRs have links to it. Updates to a regular replica are propagated to other
regular replicas, and each regular replica notifies the LWRs of the changes. Updates
to LWRs remain tentative until LWRs are connected and the updates are propagated
to regular replicas. Since the optimistic strategy is used for regular replicas, some
functions and meta-data are the same as those for optimistic mechanisms. Again,
the contribution of this dissertation is not the two-tier replication mechanism itself
but the framework that can support the mechanism.
The framework does not limit the number of tiers that it can support. However,
no current application needs replication mechanisms with more than
two tiers.
4.8.3.1 PRSMETADATA
The following meta-data are used to support two-tier replication mechanisms, and
they are saved in the PRSMETADATA attribute.
Master Replica: This points to the regular replica that the LWR is copied from.
The value of this attribute can be changed over time.
List of Mobile Replicas: Regular replicas use this attribute to keep track of mobile
replicas which have copied contents from them.
Old Copies: Old copies of regular replicas and their timestamps are kept for possible
rollbacks until the largest synchronized timestamp becomes later than the
timestamp of the old copy.
Version Vectors: The system uses version vectors to determine the most up-to-date
regular replicas or inconsistency between regular replicas. When a regular
replica is changed, the version vector of the replica is updated.
Largest Synchronized Timestamp: This is the latest timestamp of successful
propagation.
4.8.3.2 PRSOPERATIONS
Operations for two-tier replication mechanisms are defined as follows.
• Read, GET: Only the local copy is read.
• Write, Insert, Remove, SET, DESTROY: After changing a local copy, this
function adds the changes to the update logs of the system object, saves the
old copy and a timestamp, and calls the propagate_all_updates function of the
system object with a certain delay. The following is the write algorithm:
two_tier_write_replica(operation_name, object)
{
    version_vector = get_version_vectors(object);
    master_repl = get_master_replica(object);
    if (master_repl) {
        /* This is a 2nd level replica: try its master (regular) replica first */
        retval = update_master_replica();
        if (retval == FAILED) {
            /* Master unreachable: log the update for later propagation */
            write_update_log();
        }
    } else {
        /* This is a 1st level replica */
        write_update_log();
    }
}
• Create: After creating a regular replica, the new replica is added to the list
of replicas and the list is propagated to the other replicas. LWRs will get the
updated list eventually when reconnected. After creating an LWR, only the
corresponding regular replica is notified.
• Delete: When deleting a regular replica, the function notifies the other replicas
of the change. LWRs can select other regular replicas as new master replicas
if the corresponding regular replica is deleted. When an LWR is deleted, only
the corresponding regular replica is notified.
4.8.3.3 Update Propagation
The propagation function has two parts: one for regular replicas and the other for
LWRs. If the corresponding regular replica is available, the function sends updates
to the LWR. Otherwise, the LWR tries other regular replicas on the list. For regular
replicas, the function is the same as that of optimistic mechanisms except that
connected LWRs are notified of the updates.
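The fallback used by an LWR when its corresponding regular replica is not available can be sketched as follows. This is an illustration only: the regular_replica structure, its reachable flag, and the function name are hypothetical, and the sketch simply takes the first reachable entry on the list, which may differ from the selection rule used in the PRS.

```c
#include <stddef.h>

/* Hypothetical view of one regular replica as seen from an LWR. */
struct regular_replica {
    const char *host;   /* name of the server holding the replica */
    int reachable;      /* nonzero if the server currently answers */
};

/* Return the index of the regular replica the LWR should exchange updates
 * with: the current master replica if it is reachable, otherwise the first
 * reachable replica on the list, or -1 if none can be reached. */
int choose_regular_replica(const struct regular_replica *list, size_t n,
                           size_t master)
{
    if (master < n && list[master].reachable) {
        return (int)master;
    }
    for (size_t i = 0; i < n; i++) {
        if (list[i].reachable) {
            return (int)i;
        }
    }
    return -1;
}
```

A return value of -1 corresponds to the disconnected case, in which the updates stay in the update log until a regular replica becomes reachable.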
4.8.4 Caching
The PRS can support caching mechanisms as one of the replication mechanisms.
Each server selects its caching mechanism, or caching strategy, by specifying a cache
replacement algorithm as an attribute of the system object. The advantage of supporting
caching and replication together is that caching can have better performance
using replication information.
Caching mechanisms use the following algorithm when remote objects are accessed:
get_remote_object(object_name)
{
    /* lookup caches */
    if (lookup_caches(object_name) == TRUE) {
        return_cached_object(object_name);
    } else {
        /* Request the object from a remote server */
        obj = get_object(object_name);
        /* Store the object in the caches */
        cache_replacement(obj);
        return_object(obj);
    }
}
The system object keeps information about caches, such as the list of cached
objects, available caches size, a cache replacement algorithm, and so on. How to
write back cached objects depends on caching policies that are also defined in the
system object. See section 4.3 for more details. The caching mechanism presented
here is one way of supporting caching in the PRS. Application programmers can
develop their own caching mechanisms in the PRS depending on their needs.
4.8.4.1 PRSMETADATA
Last Referenced Timestamp: This indicates the last time the cached object was
referenced. It is used by cache replacement algorithms.
Master Object: This indicates the object from which the cached object is copied.
4.8.4.2 PRSOPERATIONS
• Read, GET: If the requested object is in a cache, the server updates the
last referenced timestamp. Otherwise, the server gets the object from other
servers and creates a cached replica. If there is not enough space when placing
items in a cache, the server calls the cache replacement algorithm. The cache
replacement algorithm is illustrated in Section 4.8.4.4.
• Write, Insert, Remove, SET, DESTROY: These operations are the same as
the read operation except that updates are propagated to the master object.
• Create: This is used by the above read or write operations when a cached
replica needs to be created. This function copies an object to a cache, and
puts its information in the list of cached objects.
• Delete: When an item in a cache needs to be removed, the read or write
operations use this function to delete the specified cached object and remove
it from the list of cached objects.
4.8.4.3 System Object
The following meta-data are added to the system object to support caching mechanisms.
Available cache size: This keeps track of the available cache space on the server.
List of cached objects: This attribute contains the list of cached objects.
4.8.4.4 Cache Replacement Algorithm
The cache replacement algorithm is added to the system object as an attribute. The
server first checks if the object is replicated on the server. If so, the object is not
cached and the replica of the object is used instead. Servers can have different cache
replacement algorithms depending on their environments.
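As one possible cache replacement algorithm, a least-recently-used policy can be built on the Last Referenced Timestamp meta-data. The sketch below is an assumption made for illustration, not the algorithm shipped with the PRS; the cached_object structure and the lru_victim function are hypothetical names.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical entry in the list of cached objects kept by the system object. */
struct cached_object {
    const char *name;         /* name of the cached object */
    size_t size;              /* space the object occupies in the cache */
    time_t last_referenced;   /* Last Referenced Timestamp meta-data */
};

/* Return the index of the least recently referenced cached object,
 * or -1 if the list is empty. The caller would delete that object
 * and add its size back to the available cache size. */
int lru_victim(const struct cached_object *list, size_t n)
{
    size_t victim = 0;

    if (n == 0) {
        return -1;
    }
    for (size_t i = 1; i < n; i++) {
        /* difftime(a, b) < 0 means a is earlier than b */
        if (difftime(list[i].last_referenced,
                     list[victim].last_referenced) < 0) {
            victim = i;
        }
    }
    return (int)victim;
}
```

Because the algorithm is stored as an attribute of the system object, a server could replace this policy with another one without changing the read and write operations.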
4.9 Examples of write operations
The following module is called by Prospero servers when write operations are invoked.
This module gets the list of replicas and then, using the prspackage and prsoperations
attributes, the set of operations is selected, and the write function in the
set is invoked with the arguments passed to this module. The PRS has a table that
contains sets of operations for each mechanism.
int
write_replicas(init_obj, operation_name, attr_of_replication,
               num_of_var_args, ...)
{
    /* get the list of replicas */
    replicas = attr_of_replication.replicas;
    /* get the set of functions */
    an_entry = str2replica_protocol_entry
                   (attr_of_replication.prspackage,
                    attr_of_replication.prsoperations);
    /* for variable number of arguments */
    va_start(ap, num_of_var_args);
    /* call the appropriate write function */
    if (an_entry->write_function) {
        retval = (an_entry->write_function)(operation_name, init_obj,
                                            replicas, num_of_var_args, ap);
    } else {
        retval = SUCCESS;
    }
    va_end(ap);
    return retval;
}
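The table of operation sets mentioned above can be pictured as an array of entries keyed by the two attribute values, each holding function pointers. The following is a minimal sketch under assumptions: the attribute strings ("optimistic", "quorum", "caching", "default"), the entry layout, and all function names are hypothetical, and only the write function is shown.

```c
#include <stddef.h>
#include <string.h>

#define PRS_SUCCESS 0
#define PRS_FAILED  (-1)

/* Signature assumed for this sketch; the real write functions take more
 * arguments (object, replicas, variable argument list). */
typedef int (*prs_write_fn)(const char *operation_name);

struct replica_protocol_entry {
    const char *prspackage;       /* value of the PRSPACKAGE attribute */
    const char *prsoperations;    /* value of the PRSOPERATIONS attribute */
    prs_write_fn write_function;  /* may be NULL if nothing must be done */
};

/* Stub write functions standing in for the mechanism-specific code. */
int opt_write(const char *operation_name)
{
    (void)operation_name;
    return PRS_SUCCESS;
}

int quorum_write(const char *operation_name)
{
    (void)operation_name;
    return PRS_SUCCESS;
}

/* Hypothetical table contents, one entry per mechanism. */
const struct replica_protocol_entry protocol_table[] = {
    { "optimistic", "default", opt_write },
    { "quorum",     "default", quorum_write },
    { "caching",    "default", NULL },
};

/* Find the entry matching both attribute values, or NULL if there is none. */
const struct replica_protocol_entry *
lookup_protocol_entry(const char *prspackage, const char *prsoperations)
{
    size_t n = sizeof protocol_table / sizeof protocol_table[0];

    for (size_t i = 0; i < n; i++) {
        if (strcmp(protocol_table[i].prspackage, prspackage) == 0 &&
            strcmp(protocol_table[i].prsoperations, prsoperations) == 0) {
            return &protocol_table[i];
        }
    }
    return NULL;
}

/* Invoke the selected write function; an entry without one succeeds trivially. */
int dispatch_write(const char *prspackage, const char *prsoperations,
                   const char *operation_name)
{
    const struct replica_protocol_entry *entry =
        lookup_protocol_entry(prspackage, prsoperations);

    if (entry == NULL) {
        return PRS_FAILED;   /* assumption: unknown mechanisms are rejected */
    }
    if (entry->write_function == NULL) {
        return PRS_SUCCESS;
    }
    return entry->write_function(operation_name);
}
```

Keeping the mechanisms behind a table of this kind is what would let a new mechanism be added by registering one more entry, without touching the dispatching module.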
This module is an example write function for an optimistic mechanism. In this
module, version vectors are used and update logs are generated with version information.
The version vectors attribute is a mechanism-specific attribute, and it is
stored in the PRSMETADATA attribute of each object.
/* write function for an optimistic mechanism */
int
opt_write_replicas(operation_name, init_obj, replicas,
                   num_of_var_args, ap)
{
    /* get version vectors */
    version_vectors = get_version_vectors_token(init_obj);
    /* increase version number */
    increase_version_number(init_obj);
    /* write update logs */
    write_update_log(operation_name, init_obj, version_vectors,
                     num_of_var_args, ap);
    return SUCCESS;
}
Chapter 5
Comparison
In this chapter, the PRS proposed in this dissertation is analyzed and compared
with other replication systems. The analysis is intended to show the flexibility
of the PRS, which is the main contribution of this dissertation. First, different
replication strategies and systems are analyzed in terms of availability and consistency
of replicas. Then, example applications implemented in previous research are
described and surveyed to determine whether each replication system can support
such applications. Last, an example configuration of objects and replicas for some
of the applications is constructed, and their availability and the number of messages
for read and write operations are compared.
5.1 Comparison of Replication Systems
Replication increases the availability of an object, but if a replicated copy is inconsistent
with the original copy, operations on a replica may be incorrect. Availability
and consistency are among the most important factors in designing replication systems.
Availability of an object can be defined as the success rate of operations on the
object, and consistency of an object can be defined as the probability of correctness
of the object. In this section, the PRS is compared with other replication systems
in terms of availability and consistency of replicated data. For the comparison,
different replication strategies are analyzed first, and then different replication systems
are compared.
5.1.1 Comparison of Replication Mechanisms
The following replication mechanisms are considered in the comparison. More details
on these mechanisms can be found in [11] and [16].
[Figure: consistency (vertical axis) plotted against availability of replicas (horizontal axis) for Primary Copy, Tokens, read-one/write-all (write), Missing Writes, Voting, read-one/write-all (read), Version vector, and Two-tier replication.]
Figure 5.1: Comparison of Replication Mechanisms
Primary Copy: One replica of an object is designated as the primary copy, and is
responsible for the activities performed on the object. The primary copy must
be involved with all activities of the object. If the primary copy of an object
fails, a new primary copy can be elected among the other replicas. Since all
activities are performed on the primary copy, the consistency of an object is
very good but the availability is low.
Tokens: Each object has a token associated with it, and a replica which has a token
acts as a primary copy just like the Primary Copy mechanism. But this scheme
is more flexible than the Primary Copy mechanism since the primary copy can
be changed according to the loads of replicas.
Read-one/Write-all: Clients read the nearest copy of an object and write to all
copies. Since all copies are updated together the consistency of replicas is good
but the availability of replicas for write operations is low, because if any replica
is unavailable the write operation fails. However, the availability of replicas
for read operations is very high since the system needs to read only one object
to complete the operation.
Voting: In the voting scheme, every transaction must collect a read quorum of r
votes to fulfill a read operation, or a write quorum of w votes to fulfill a write
operation. With this scheme, transactions can be successfully executed even
though some sites have failed.
Missing Writes: In the missing writes mechanism, transactions run in two modes:
normal mode and failure mode. When the system is in normal mode, the
read-one/write-all scheme is used. If some copies cannot be updated, the
transaction must run in failure mode. When the system is in failure mode, a
voting mechanism is used. This has better consistency than the voting scheme
since this scheme uses the read-one/write-all scheme in normal mode. The
availability is the same as the voting scheme.
Version vectors: This is an optimistic scheme that uses version vectors to keep
track of changes in replicas. In this scheme activities can be performed on any
replicas of an object and activities are logged and propagated later. So, it has
high availability but low consistency because of possible concurrent updates.
Two-tier replication: This scheme uses second-level replicas on loosely connected
sites. Second-level replicas are kept less consistent than first-level replicas to
reduce the overall costs of propagation and conflict resolution. Since replicas
are maintained with an optimistic mechanism, the availability of an object is
as good as the version vectors scheme, but the consistency is lower.
Figure 5.1 shows an overall comparison of replication mechanisms. In general,
pessimistic mechanisms have high consistency, whereas optimistic mechanisms have
high availability. Which mechanism to use depends on the requirements of objects
and applications.
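For the voting scheme, the read and write quorums are normally chosen so that any read quorum overlaps any write quorum and any two write quorums overlap. The check below is a sketch of that standard constraint, assuming one vote per replica; the function name is hypothetical, and the constraint itself is the usual one for quorum consensus rather than something stated in this chapter.

```c
/* Return 1 if a read quorum of r votes and a write quorum of w votes are
 * valid for n replicas with one vote each, 0 otherwise.
 *   r + w > n : every read quorum intersects every write quorum
 *   2 * w > n : every two write quorums intersect */
int quorum_is_valid(unsigned r, unsigned w, unsigned n)
{
    if (r == 0 || w == 0 || r > n || w > n) {
        return 0;
    }
    return (r + w > n) && (2 * w > n);
}
```

Read-one/write-all is the special case r = 1 and w = n, which satisfies both conditions but makes a write fail as soon as one replica is unreachable.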
5.1.2 Comparison of Replication Systems
Many replication systems have been proposed so far. Among them, Coda,
Ficus, Bayou, Rover and LotusNotes support only optimistic mechanisms, while
voting, Clouds and the system proposed by Brun-Cottan support only pessimistic
mechanisms. Though the previous systems work well with small domains, they are
not flexible enough to support very large distributed domains. The Prospero Replication
Service can support both pessimistic and optimistic mechanisms as well as
other replication mechanisms, so it is the most flexible system so far. Figure 5.2
shows where replication systems are located on a graph of consistency and availability.
[Figure: consistency (vertical axis) plotted against availability of replicas (horizontal axis). Plotted systems: Prospero Replication Service; Clouds, Brun-Cottan; Voting; Coda, Ficus, Bayou, Rover, LotusNotes.]
Figure 5.2: Comparison of Replication Systems
5.2 Example Applications of Previous Work
Previous replication systems show many applications that can be supported in their
systems. To prove the PRS is the most flexible, applications proposed by others
were collected and the PRS was shown to support all of them.
5.2.1 Replicated Counter
The replicated counter, implemented in Brun-Cottan's work, is used in a parking
lot simulation [5]. The counter maintains the number of free spaces in a parking
lot with multiple entrances and exits. Since decreasing the counter could result
in conflicts, the counter must be maintained with pessimistic consistency control.
So PRS, Brun-Cottan, voting and Clouds can support this application, but Coda,
Ficus, Bayou, Rover and LotusNotes cannot.
5.2.2 Replicated Document
Brun-Cottan also presented a Replicated Document application which is used in a
distributed cooperative editing session [5]. Replicas of shared documents are held by
the cooperating editors and may be modified concurrently by clients. This application
must be maintained with pessimistic consistency control to prevent concurrent
updates and conflicts of updates. So PRS, Brun-Cottan, voting and Clouds can
support this application, but Coda, Ficus, Bayou, Rover and LotusNotes cannot.
5.2.3 Bibliographic Database
A bibliographic database is implemented in the Bayou system [40]. This application
allows users to cooperatively manage databases of bibliographic entries. A user can
freely read and write any copy of the database. An entry's key is tentatively assigned
when the entry is added. A user must be aware that the newly assigned keys are
only tentative and may change when the entry is 'committed.' The database is
maintained with weak consistency control and can reside on mobile sites.
This application can be easily supported by PRS, Coda, Ficus, Bayou, Rover
and LotusNotes, because they can support optimistic mechanisms.
5.2.4 Rover Applications
Designers of Rover [21] developed several applications which are mobile-transparent
and mobile-aware. Because of the nature of mobility, these applications are best
maintained with weak consistency controls.
Rover Webcal: Rover Webcal is a distributed calendar tool which stores items
such as appointments, daily to-do lists and daily reminders. Items can be
replicated in Rover Webcals on different sites including mobile sites. Consistency
is maintained on a per-item basis in this application.
Rover Exmh: Rover Exmh is a mail tool for mobile environments. Upon start-up,
it prefetches a list of mail folders, the mail folders the user has recently visited,
and the messages in the user's inbox folder. During periods of disconnection
it could delay servicing of user requests, such as send and delete.
Rover HTTP Proxy: Rover HTTP proxy intercepts all HTTP requests on a mobile
site and, if the requested item is not locally cached, returns a null response
to the browser and queues the request in the operation log. When a connection
becomes available, the page is automatically requested. In the meantime,
the user can continue to browse already available pages and issue additional
requests for pages without waiting.
The client and server cooperate in prefetching. The client specifies the depth of
prefetching for pages, while the server automatically prefetches in-line images.
PRS, Coda, Ficus, Bayou and Rover can support all the above applications, while
LotusNotes can support only Rover WebCal and Rover Exmh.
5.2.5 Reference Data of Online Transaction Processing system
In an online transaction processing (OLTP) system [6] for trading businesses, reference
data is replicated to improve the availability of data and reduce response time.
Because of the nature of the trading business, the infrastructure must provide a high
degree of availability and integrity.
[Figure: diagram labeled with New York (Headquarters), Ref Data, Replication Service Component, and the branch sites Los Angeles, Chicago, Mexico City, Buenos Aires, Sydney, Milan, Paris, Singapore, and Tokyo.]
Figure 5.3: OLTP Reference Data Replication [6]
To obtain a high degree of availability and integrity, reference data is replicated
on hosts at the trading branches and maintained with a master/slave model. All
updates are processed onto the master data located at headquarters, and the changes
Applications (columns): Counter, ReplDoc, Biblio DB, WebCal/Exmh, Proxy, OLTP
PRS O O O O
[Brun-Cottan] O
Coda O O
Ficus O O
Bayou O O
Rover O O
Clouds O
LotusNotes O
Voting O
Table 5.1: Overall comparison of replication systems
5.3 Quantitative Analysis
For a communication performance comparison of the PRS, three replication mechanisms
are analyzed: the quorum consensus mechanism, the optimistic mechanism,
and the two-tier replication mechanism. The number of messages sent by read and
write operations is analyzed for each mechanism and, using example configurations
of objects and replicas, the PRS is compared with other replication systems.
5.3.1 Number of messages
In this section, the number of messages sent by read and write operations of the
quorum consensus mechanism, the optimistic mechanism and the two-tier replication
mechanism was analyzed.
For a read operation, the optimistic mechanism and the two-tier replication mechanism
need two messages: a read request and a response to the request. For the
quorum consensus mechanism, $2q_r$ messages are sent, where $q_r$ is the read quorum.
The following shows the number of messages needed for a write operation and
its propagation.
• The quorum consensus mechanism
$4q_w + 2(R - q_w)$
where $q_w$ is the write quorum, and $R$ is the number of replicas.
Because the two-phase commit algorithm is used for this mechanism, two request
messages and two response messages are sent for each replica until a
write quorum is reached. The rest of the replicas need two messages each for
propagation.
• The optimistic mechanism
$2 + 2(R - 1) + (\mathit{WPH} \times \mathit{delay} \times (R - 1)) \times 2(R - 1)$
where $\mathit{WPH}$ is the average number of writes per hour for each replica, and $\mathit{delay}$
is the mean time between an update of the original copy and the corresponding update of the replicas.
The write operation uses two messages to update the local copy and $2(R - 1)$
messages to propagate the changes. In addition, if there are concurrent
write operations, conflicts occur. The term $(\mathit{WPH} \times \mathit{delay} \times (R - 1))$ represents
the average number of conflicts, and each conflict needs $2(R - 1)$ messages since
these conflicts are resolved and propagated to the other replicas.
• The two-tier replication mechanism
Update to regular replicas: $2 + 2(R_s - 1) + (\mathit{WPH} \times \mathit{delay} \times (R_s - 1)) \times 2(R_s - 1) + R_m$
Update to mobile replicas: $1 + 2 + 2(R_s - 1) + (\mathit{WPH} \times \mathit{delay} \times (R_s - 1)) \times 2(R_s - 1) + R_m$
where $R_s$ is the number of regular replicas, and $R_m$ is the number of mobile
replicas.
When a regular replica is updated, the number of messages is the same as that
for the optimistic mechanism, plus the $R_m$ messages that notify the mobile
replicas. When a mobile replica is updated, the changes are propagated
to a regular replica upon connection, and then the changes are propagated
to the other regular replicas. So one additional message is sent when an update
of a mobile replica is propagated, and the remaining terms are the same as for regular
replicas.
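The message counts above can be written as small functions, which makes it easy to evaluate them for a given configuration. The function names are hypothetical, and the delay is assumed to be expressed in hours so that it matches the writes-per-hour rate.

```c
/* Read under quorum consensus: one request and one response for each
 * member of the read quorum qr. */
double quorum_read_messages(unsigned qr)
{
    return 2.0 * qr;
}

/* Write under quorum consensus: four messages (two-phase commit) for each
 * of the qw quorum members, then two messages for each remaining replica
 * out of r replicas in total. */
double quorum_write_messages(unsigned qw, unsigned r)
{
    return 4.0 * qw + 2.0 * ((double)r - (double)qw);
}

/* Write under the optimistic mechanism with r replicas. wph is the average
 * number of writes per hour per replica; delay is the mean propagation
 * delay, assumed here to be in hours. */
double optimistic_write_messages(unsigned r, double wph, double delay)
{
    double others = (double)r - 1.0;
    double conflicts = wph * delay * others;   /* average number of conflicts */

    return 2.0 + 2.0 * others + conflicts * 2.0 * others;
}

/* Two-tier write issued at a regular replica: the optimistic cost among the
 * rs regular replicas plus one notification per mobile replica (rm). */
double two_tier_regular_write_messages(unsigned rs, unsigned rm,
                                       double wph, double delay)
{
    return optimistic_write_messages(rs, wph, delay) + (double)rm;
}

/* Two-tier write issued at a mobile replica: one extra message carries the
 * update to a regular replica upon connection. */
double two_tier_mobile_write_messages(unsigned rs, unsigned rm,
                                      double wph, double delay)
{
    return 1.0 + two_tier_regular_write_messages(rs, rm, wph, delay);
}
```

As an illustration with assumed replica counts, quorum_write_messages(4, 4) evaluates to 16 and optimistic_write_messages(6, 0.1, 1.0) evaluates to 17; with two regular and two mobile replicas (again an assumption), the two-tier formulas give 6.2 and 7.2 messages for updates at a regular and at a mobile replica respectively.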
5.3.2 Example Configuration
Figure 5.4 shows an example configuration of replicated data in a distributed system.
There are three objects: A, B, and C. Object A is used by the Replicated Document
application which uses pessimistic consistency control. Object A is replicated over
a local area network, and all replicas are frequently accessed by clients. Object B is
a bibliographic database which is shared by three institutions. Object C is used by
the Rover WebCal application, and it is replicated on several sites, including mobile
sites. All applications used here are explained in Section 5.2. The numbers shown
on the lines represent the probability of connection, and connections show that the
objects are replicated.
[Figure: objects and replicas drawn as circles and connected by lines labeled with connection probabilities. A: Replicated document; B: Bibliographic DB; C: WebCal.]
Figure 5.4: Example configuration of replicated data
5.3.3 Comparison
Using the example configuration in Figure 5.4, the availability and the number of
messages sent by the three different replication mechanisms are analyzed. The first
analysis shows results when only the quorum consensus mechanism was used for all
objects. Replication systems, such as voting, Clouds and the system proposed by
Brun-Cottan, support only pessimistic mechanisms and will have similar results to
this case. The second analysis shows results when only the optimistic mechanism
is used for all objects, which is the case in Coda, Ficus, Bayou, Rover and LotusNotes.
In the last analysis different objects are maintained with different replication
mechanisms as in the PRS. Since each object can select the most suitable replication
mechanism, the performance of the PRS is better than that of others.
Mechanism    Object   Availability       Messages
                      Read     Write     Read    Write
Quorum       A        1.00     1.00      2       16
             B        1.00     0.97      4       22
             C        0.55     0.15      4       14
Optimistic   A        N/A      N/A       N/A     N/A
             B        1.00     1.00      2       17
             C        1.00     1.00      2       33 + a
PRS          A        1.00     1.00      2       16
             B        1.00     1.00      2       17
             C        1.00     1.00      2       6.7
Table 5.2: Comparison of the availability and the number of messages
As for parameters used in the analysis, WPH is set to 1 for object A and 0.1
for the other objects since object A is replicated in local systems and frequently
updated. The analysis results are shown in Table 5.2.
• The quorum consensus mechanism
The quorum numbers are set as follows.
A: $q_r = 1$, $q_w = 4$
B: $q_r = 2$, $q_w = 5$
C: $q_r = 2$, $q_w = 3$
Since object C is replicated on mobile sites and the read and write quorums
are 2 and 3 respectively, the availability is very low.
• The Optimistic Mechanism
Because object A should be maintained with strong consistency control, it
cannot be supported with this mechanism. As for object C, many messages
are sent when the system tries to propagate changes to other replicas because
of the high probability of disconnection of mobile sites and consequently, there
are repeated trials of propagation. The assumed propagation delay is one
hour. The variable a represents the number of messages sent when the system
propagates changes after conflict resolution.
• The PRS
Since the PRS can provide each object with multiple choices of
replication mechanisms, objects can have different replication mechanisms. Because
object A cannot tolerate weak consistency control, it is maintained with
a pessimistic mechanism, i.e., the quorum consensus mechanism. Object B and
object C can tolerate inconsistency of replicated data, so they are maintained
with optimistic mechanisms. Since replicated data of object C can reside on
mobile nodes, object C is maintained with the two-tier replication mechanism,
which is a variant of the optimistic mechanisms. By allowing objects to select
the most suitable replication mechanism, the PRS provides better availability
and performance for all of the objects.
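The low availability of object C under the quorum consensus mechanism can be understood by computing the probability that enough replicas are reachable to form a quorum. The sketch below assumes that every replica is reachable independently with its own probability; this model, the MAX_REPLICAS limit, and the function name are assumptions made for illustration and are not necessarily the method used to produce Table 5.2.

```c
#include <stddef.h>

#define MAX_REPLICAS 32   /* arbitrary limit chosen for this sketch */

/* Probability that at least q of n replicas are reachable, where p[i] is
 * the probability that replica i is reachable and replicas are assumed to
 * be independent. Returns 0.0 if the arguments are out of range. */
double quorum_availability(const double *p, size_t n, size_t q)
{
    /* dist[k] = probability that exactly k of the replicas seen so far
     * are reachable */
    double dist[MAX_REPLICAS + 1] = { 0.0 };
    double total = 0.0;

    if (n > MAX_REPLICAS || q > n) {
        return 0.0;
    }
    dist[0] = 1.0;
    for (size_t i = 0; i < n; i++) {
        /* update from high k to low k so dist[k - 1] is still the old value */
        for (int k = (int)i + 1; k >= 0; k--) {
            double from_below = (k > 0) ? dist[k - 1] * p[i] : 0.0;
            dist[k] = dist[k] * (1.0 - p[i]) + from_below;
        }
    }
    for (size_t k = q; k <= n; k++) {
        total += dist[k];
    }
    return total;
}
```

Under this model, replicas on mobile sites with small connection probabilities pull the result down quickly as the quorum size grows, which is the effect described for object C.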
5.4 Simulation
One of the most important contributions of this dissertation is to design a flexible
framework for replication in distributed systems. To prove the PRS is more flexible
than other systems, two simulations were done. In the first simulation, three kinds
of data from OLTP applications were used: reference data that requires a near-
real-time consistency, analysis data that can be maintained with weak consistency
controls, and transaction data that must be maintained with strong consistency
controls.
The number of transmissions and the probability of the consistency of replicas
were calculated for four different cases: optimistic mechanisms only, two-tier mechanisms
only, read-one/write-all mechanisms only, and a case in which the PRS is
used. For each object and its replicas, three parameters are specified: frequency of
updates, frequency of propagations, and probability of connection. Random numbers
are generated based on these parameters for up to 10 replicas, and the simulation
was done with these values.
Figure 5.5 shows the probability of consistency for transaction data. The X
axis represents the number of replicas and the Y axis represents the probability
of consistency. Since transaction data requires strong consistency control, in the
PRS the data was maintained with the read-one/write-all mechanism which is a
pessimistic mechanism. In this simulation, the probability of consistency is defined
as the probability of consistency of returned data of read or write operations. So
the consistency of the read-one/write-all mechanism remains perfect because data
is locked and it cannot be accessed by others while updating is in progress.
[Figure: probability of consistency (vertical axis, 0 to 1) plotted against the number of replicas (horizontal axis, up to 10) for ROWA, two-tier, optimistic, and PRS.]
Figure 5.5: Probability of Consistency of Transaction Data
Figure 5.5 shows that the PRS and the read-one/write-all mechanism have good
enough consistency to support transaction data, but the optimistic mechanisms and
the two-tier mechanisms do not have enough probability of consistency to support
transaction data. With more frequent propagations, the consistency can be better
for the optimistic mechanisms and the two-tier mechanisms but they still cannot
guarantee the consistency of data all the time.
Figure 5.6 shows the average number of transmissions for each case. In the PRS,
transaction data is maintained with a read-one/write-all mechanism, reference data
with an optimistic mechanism, and analysis data with a two-tier mechanism. Even
though the optimistic and two-tier mechanisms have a smaller number of transmissions,
they cannot provide enough consistency for transaction data. The PRS
and the read-one/write-all mechanism can guarantee consistency of the data and
[Figure: number of transmissions (vertical axis) plotted against the number of replicas (horizontal axis) for ROWA, optimistic, two-tier, and PRS.]
Figure 5.6: Average Number of Transmissions
between these two, the PRS has fewer transmissions than the read-one/write-all
mechanism. Therefore, this simulation shows that if we consider both consistency
and the number of transmissions, the PRS is the most suitable for OLTP applications.
In the second simulation, the number of transmissions of the PRS and a two-tier
replication mechanism is compared. The access patterns of replicas may change
over time. Therefore, if a replication mechanism with fixed parameters is used for
an object and its replicas, the system performance may vary with changes in access
patterns, while systems with different mechanisms or different parameters may get
better performance. In the PRS that I proposed, multiple replication mechanisms
can be supported and, in addition, application programmers can implement and use
replication mechanisms that dynamically adapt to environments.
The two-tier replication mechanism has two kinds of replicas: regular replicas
and second-level replicas. To have better performance, instead of using regular
replicas for all objects, second-level replicas can be used for objects that are less
frequently accessed. In the second simulation, update frequencies are generated for
10 replicas and the number of transmissions is measured. In the two-tier replication
mechanism, levels of replicas are predetermined by applications and are not changed
over time. In the PRS, levels of replicas keep changing based on access patterns.
Figure 5.7 shows the update frequencies of 10 replicas. In this simulation, the
two-tier replication mechanism uses initial access patterns, so replicas #2, #5, and #9
will be second-level replicas and the others will be regular replicas. Since the
most suitable mechanism depends on the environment, in the PRS the levels
of replicas are changed dynamically based on the access patterns. In this simulation,
the number of transmissions is calculated for the two-tier mechanism and the PRS.
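One simple way to change replica levels dynamically is to re-evaluate them against a frequency threshold at each measurement interval. The following sketch is hypothetical: the threshold rule, the choice to always keep at least one regular replica, and all names are assumptions for illustration and are not taken from the simulation code.

```c
#include <stddef.h>

enum replica_level { REGULAR_REPLICA, SECOND_LEVEL_REPLICA };

/* Assign a level to each of n replicas from its current update frequency.
 * Replicas updated at least `threshold` times per interval become regular
 * replicas; the others become second-level replicas. If no replica reaches
 * the threshold, the most frequently updated one is kept regular so that
 * the object always has a regular replica (an assumption of this sketch). */
void assign_replica_levels(const double *update_freq,
                           enum replica_level *level,
                           size_t n, double threshold)
{
    size_t busiest = 0;
    size_t regular_count = 0;

    for (size_t i = 0; i < n; i++) {
        if (update_freq[i] > update_freq[busiest]) {
            busiest = i;
        }
        if (update_freq[i] >= threshold) {
            level[i] = REGULAR_REPLICA;
            regular_count++;
        } else {
            level[i] = SECOND_LEVEL_REPLICA;
        }
    }
    if (n > 0 && regular_count == 0) {
        level[busiest] = REGULAR_REPLICA;
    }
}
```

Calling such a routine periodically with fresh frequencies corresponds to the PRS case described above, whereas calling it once with the initial access pattern corresponds to the fixed two-tier case.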
[Figure: update frequency plotted over time for replicas repl 1 through repl 10.]
Figure 5.7: Access Patterns of Replicas
Figure 5.8 shows the number of transmissions when 10 replicas are used. The PRS
has about 15 percent fewer transmissions than the two-tier mechanism. Figure 5.9
shows the average number of transmissions for two to 12 replicas. The PRS has
about 5 percent to 19 percent fewer transmissions.
[Figure: number of transmissions (vertical axis, up to 30000) plotted against time (horizontal axis) for Two-tier and PRS.]
Figure 5.8: Number of transmissions for 10 replicas
5.5 Overhead
Figure 5.10 shows the server response time of the SET operation which changes
the attributes of an object. The operations with two replication mechanisms were
tested: the optimistic mechanism and the quorum mechanism. In the optimistic
[Figure: average number of transmissions (vertical axis, up to 35000) plotted against the number of replicas (horizontal axis, 1 to 12) for PRS and Two-tier.]
Figure 5.9: Average number of transmissions for two to 12 replicas
[Figure: response time (vertical axis, 0 to 50) plotted against the number of replicas (horizontal axis, 1 to 10) for the optimistic and quorum mechanisms.]
Figure 5.10: Response time of the SET operation
[Figure: elapsed time broken down into the components Prospero, PRS, and Optimistic.]
Figure 5.11: Breakdown of elapsed time of SET operation for an optimistic
mechanism
mechanism, logs for write operations must be recorded for later propagation, and
in the quorum mechanism n replicas must be updated for read or write operations,
where n is the size of the quorum for the operation. As the number of replicas
increases, the size of the meta-data of the object grows. Since the SET operation
changes the meta-data of a replica, the response time of the operation increases as
the number of replicas grows.
Figure 5.11 shows the breakdown of the elapsed time for the SET operation of an
optimistic mechanism. For the SET operation, the elapsed time is calculated
as follows:

$$T_{\text{elapsed}} = T_{\text{Prospero-SETop}} + T_{\text{replication-specific}} + T_{\text{read-replication-info}} + T_{\text{process-replication-info}}$$

In this formula, $(T_{\text{read-replication-info}} + T_{\text{process-replication-info}})$ is the overhead of
the PRS.
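As a rough illustration of how the formula can be applied, the sketch below sums the four terms and reports the PRS overhead both in absolute terms and relative to the base Prospero time. The class name, the field names, and the timing values are hypothetical and chosen only to make the arithmetic concrete; they are not measurements from these experiments.

```python
from dataclasses import dataclass


@dataclass
class SetTiming:
    """Timing terms of one SET operation (arbitrary units, e.g. ms)."""
    prospero_set_op: float           # T_Prospero-SETop
    replication_specific: float      # T_replication-specific
    read_replication_info: float     # T_read-replication-info
    process_replication_info: float  # T_process-replication-info

    def elapsed(self) -> float:
        # T_elapsed is the sum of the four terms in the formula.
        return (self.prospero_set_op + self.replication_specific
                + self.read_replication_info + self.process_replication_info)

    def prs_overhead(self) -> float:
        # Only the last two terms are counted as overhead of the PRS.
        return self.read_replication_info + self.process_replication_info

    def prs_overhead_vs_base(self) -> float:
        # PRS overhead as a fraction of the base Prospero SET time.
        return self.prs_overhead() / self.prospero_set_op


# Hypothetical values, not taken from the experiments.
t = SetTiming(20.0, 4.0, 0.5, 0.25)
print(t.elapsed(), t.prs_overhead(), t.prs_overhead_vs_base())
```

Dividing by the base Prospero time is one plausible reading of "overhead compared to the base system"; dividing by the total elapsed time would give a slightly smaller figure.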
In optimistic approaches, when a replica is changed, update logs are generated
for later propagation. So within the elapsed time of the PRS, the following
steps are performed: (1) reading the attributes of the requested object, (2) selecting
an appropriate function according to the attributes, and (3) writing update logs. The
third item and a part of the first item are mechanism-specific overheads. The real
overheads of the PRS are the second item and a part of the first item, i.e., reading the
PRSPACKAGE attribute and the PRSOPERATIONS attribute. In other words, the overhead of the
In Figure 5.11, the overheads of the PRS and Prospero are broken down in detail.
In these experiments, the PRS adds about 3.8 percent overhead to the base Prospero
system.
View Consistency has 1 to 8 percent overhead compared to the elapsed times of
base Ficus when attributes are obtained during file opens and are not updated
from the server on each operation. Since Ficus already supports optimistic
mechanisms, the overheads shown in their paper do not include the overheads of
optimistic mechanisms.
We cannot directly compare the overhead of the PRS with that of View Consistency
because the base systems are different. But the results show that the PRS provides
more flexibility than View Consistency while adding only a small overhead to the
base system.
The current experiments were conducted on five Sun Ultra-1s and four Sun Ultra-2s
that were connected by 10 Mbps Ethernet.
As the number of available replication mechanisms increases, it becomes more
difficult to select the most suitable replication mechanism for an object.
Application programmers must be able to select, easily and from among all the
available replication mechanisms, the most suitable one for given replicas based
on the characteristics of the objects. Moreover, the replication mechanisms in use
can be changed dynamically if necessary, so the system can achieve better
performance by switching to new replication mechanisms as its environment changes.
This feature of the PRS introduces a new research area: designing and selecting the
most suitable mechanisms for replicated objects. In other words, with various kinds
of objects and replication mechanisms, there should be efficient ways of binding
objects to the replication mechanisms that give the best performance for those objects.
6.2 Dynamic Change of Replication Parameters
In distributed systems, many parameters change continually. For example, network
delays vary with the activities of other applications or users, and access patterns
for objects also change over time. Because of such changes, some replication
mechanisms may become inefficient. To overcome this problem, the system must
be able to adapt to dynamically changing environments. In the PRS, application
programmers can build applications or replication mechanisms that change
dynamically based on their environments. This feature opens another research area
in replication in distributed systems.
In the PRS, to cope with dynamic changes, application programmers can
provide objects with more than one replication mechanism and can specify that
objects use different mechanisms in different situations. For example, in normal or
stable modes, the read-one/write-all mechanism is used, and if some replicas are
unavailable or unstable, different replication mechanisms can be used. This is similar
to the missing write mechanism [12], but the PRS can support more diverse mechanisms.
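The mode switch described above can be sketched as follows. This is a minimal illustration, not the PRS interface: the function names are hypothetical, and the choice of a majority quorum as the fallback for the degraded mode is an assumption, since the text only says that "different replication mechanisms can be used".

```python
from typing import Callable, Dict, List


def read_one_write_all(replicas: Dict[str, bool], op: str) -> List[str]:
    """Stable mode: read any one replica, write every replica."""
    names = sorted(replicas)
    return names[:1] if op == "read" else names


def majority_quorum(replicas: Dict[str, bool], op: str) -> List[str]:
    """Degraded mode (assumed fallback): use a majority of reachable replicas."""
    up = sorted(name for name, ok in replicas.items() if ok)
    needed = len(replicas) // 2 + 1
    if len(up) < needed:
        raise RuntimeError("no majority of replicas is reachable")
    return up[:needed]


def select_mechanism(replicas: Dict[str, bool]) -> Callable[[Dict[str, bool], str], List[str]]:
    """Pick the mechanism for the current situation.

    `replicas` maps a replica name to True if it is currently reachable.
    """
    if all(replicas.values()):
        return read_one_write_all
    return majority_quorum


status = {"a": True, "b": True, "c": True}
targets_when_stable = select_mechanism(status)(status, "write")
status["c"] = False
targets_when_degraded = select_mechanism(status)(status, "write")
```

Each mechanism returns the list of replicas an operation must contact, so the caller does not need to know which mechanism is active.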
The two-tier replication mechanism can be another example. The two-tier
replication mechanism has two kinds of replicas: regular replicas and second-level
replicas. To have better performance, instead of using regular replicas for all objects
in stable computers, second-level replicas can be used for objects that are less
frequently accessed. In the PRS, application programmers can build a modified two-tier
replication mechanism in which replicas change their levels of replication
dynamically based on system parameters. So a replica can be either a regular replica or a
light-weight replica according to parameter values of the replica.
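A minimal sketch of such a level change is shown below, assuming that the only parameter is the number of accesses in an observation window. The class, the two thresholds, and the windowing scheme are hypothetical and are not taken from the PRS implementation.

```python
from dataclasses import dataclass

REGULAR = "regular"
SECOND_LEVEL = "second-level"


@dataclass
class Replica:
    name: str
    level: str = REGULAR
    accesses: int = 0  # accesses seen in the current observation window

    def record_access(self) -> None:
        self.accesses += 1

    def end_window(self, promote_at: int, demote_below: int) -> str:
        """Re-evaluate the level at the end of a window, then reset the counter.

        Two thresholds (promote_at > demote_below) give some hysteresis, so a
        replica near the boundary does not change level in every window.
        """
        if self.accesses >= promote_at:
            self.level = REGULAR
        elif self.accesses < demote_below:
            self.level = SECOND_LEVEL
        self.accesses = 0
        return self.level
```

A replica whose access count falls between the two thresholds keeps its current level.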
In summary, application programmers can build replication mechanisms that change
their parameters based on their needs. For example, in the primary copy
mechanism, the primary copy can be changed dynamically based on conditions such
as response time and access patterns. With the PRS, multiple replication
mechanisms, or replication mechanisms with different parameters, can be built in the
framework, and applications can change replication mechanisms or replication
parameters dynamically based on their conditions. The simulation in Section 5.4 shows
that the PRS can support parameter changes of two-tier replication mechanisms to adapt
to changes in the access patterns of replicas. Many similar kinds of applications can be
supported in the PRS.
Within this dynamic feature of the PRS, another research issue emerges. More
research is needed on finding optimal or near-optimal values of parameters for
replication mechanisms, so that such values can serve as guidelines for easily
determining the most suitable replication mechanisms. There are many parameters to
consider, so research is also needed on deciding which replication mechanism
is best for given parameters.
6.3 Load Balancing and Replica Placement
An issue needs to be considered when multiple replication mechanisms
are supported in a single server. As the number of replication mechanisms provided
in a system grows, both storage overhead and performance overhead increase.
Since a system with more replication mechanisms can incur more overhead than
one with a small number of replication mechanisms, replicated objects must be
distributed based on the replication mechanisms and usage of the objects. In other
words, if replicated objects with the same or similar replication mechanisms are
placed in a single server, the overall performance of the distributed system can
improve. This is an extension of the replica placement issue.
There is some work on replica placement [1, 18, 31, 43]. To adequately distribute
loads of an object, the replicas of that object must be distributed evenly, so that no
replica will create a bottleneck. One way to distribute loads is to create and delete
replicas dynamically. Maffeis et al. [26] present replication heuristics and polling
algorithms for object replication.
With the support of multiple replication mechanisms, as in the PRS, another
parameter must be considered when a replica is placed on a server. If replicas
with many different replication mechanisms are placed on a server, the server must
have all the libraries and attributes that are used by those replication mechanisms,
and the server will have more system overhead because of the size of the libraries as
well as other resources used for replication. Even though an increasing number of
replication mechanisms in a server does not cause a significant problem, for a system
to have better performance, replicas with the same replication mechanisms should
be placed on the same server, and the number of replication mechanisms on a server
should be kept below a certain limit. In other words, placement of replication mechanisms
among servers must be considered as well as placement of replicas.
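One way to picture this placement rule is the greedy sketch below, which prefers a server that already hosts a replica's mechanism and enforces a cap on the number of mechanisms per server. The function, its parameters, and the greedy strategy are illustrative assumptions rather than an algorithm from this dissertation, and the sketch deliberately ignores other constraints, such as keeping replicas of the same object on different servers.

```python
from typing import Dict, List, Set, Tuple


def place_replicas(replicas: List[Tuple[str, str]],
                   servers: List[str],
                   max_mechanisms: int) -> Dict[str, List[str]]:
    """Assign (replica, mechanism) pairs to servers.

    Greedy rule: prefer a server that already hosts the replica's mechanism;
    otherwise use a server that still has room for a new mechanism.
    """
    placement: Dict[str, List[str]] = {s: [] for s in servers}
    hosted: Dict[str, Set[str]] = {s: set() for s in servers}

    for replica, mechanism in replicas:
        # Servers that already carry the libraries for this mechanism.
        candidates = [s for s in servers if mechanism in hosted[s]]
        if not candidates:
            # Fall back to servers that are still below the mechanism cap.
            candidates = [s for s in servers if len(hosted[s]) < max_mechanisms]
        if not candidates:
            raise ValueError(f"no server can host mechanism {mechanism!r}")
        # Among the candidates, pick the one holding the fewest replicas.
        target = min(candidates, key=lambda s: len(placement[s]))
        placement[target].append(replica)
        hosted[target].add(mechanism)
    return placement
```

Because the cap is a hard limit here, a placement request fails when every server is full; a real system might instead relax the cap or add a server.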
6.4 Quality Control of Replication Mechanisms
Application programmers can freely add new replication mechanisms to the PRS,
but a new mechanism can be very costly for the server to support. Therefore, there
should be a quality assurance process before a new replication mechanism is accepted
by the server.
The quality assurance process must check the following: correctness of the
mechanism, performance of the mechanism, space overhead of the mechanism, and so on.
When a new mechanism is added, these properties must be checked first, and it must
also be ensured that the new mechanism provides all the functionality required
by the server and satisfies certain performance and system criteria. How to
check new replication mechanisms needs to be researched, and formal specifications
for a new replication mechanism must be developed.
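As a small illustration of the simplest part of such a process, the sketch below checks that a candidate mechanism supplies every required operation and stays within a space limit. The list of required operations, the function name, and the size limit are assumptions made for this example; correctness and performance checks, which the text identifies as open research, are not covered.

```python
from typing import Callable, Dict, List

# Assumed set of operations a server requires; the real list is not given here.
REQUIRED_OPERATIONS = ("create", "read", "write", "delete")


def check_mechanism(operations: Dict[str, Callable],
                    library_bytes: int,
                    max_library_bytes: int) -> List[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    for name in REQUIRED_OPERATIONS:
        # Every required operation must be present and callable.
        if not callable(operations.get(name)):
            problems.append(f"missing operation: {name}")
    if library_bytes > max_library_bytes:
        problems.append("library exceeds the space limit")
    return problems
```

Returning a list of problems rather than a single pass/fail flag lets the server report every failed criterion at once.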
6.5 Conclusion
The introduction of the PRS opened several new research areas, such as static and
dynamic selection of replication for objects, and issues of placement of both replicas
and replication mechanisms. This list of future work shows that the PRS created
new directions for replication in distributed systems.
Chapter 7
Conclusions
Distributed systems today have the following characteristics: large-scale distribution,
diversity of users and applications, and popularity of mobile computers. To accommodate
these characteristics, the goal of this dissertation was to build a flexible
replication system that is suitable for large distributed systems and allows the needs of
various users and applications to be met.
To achieve this goal, a flexible framework for replication was proposed and the
PRS was implemented on top of Prospero. As distributed systems grow larger,
there will be more users and applications. Previous replication systems
have failed to provide solutions that satisfy the demands of diverse users and
applications all at the same time. The PRS supports multiple replication mechanisms
together and provides application programmers with ways to add new replication
mechanisms. To provide a common framework for various replication mechanisms,
generic phases for multiple replication mechanisms were identified, and the libraries in
the PRS were created based on these generic phases. In the PRS, each object can use
a different replication mechanism and has PRSMETADATA and PRSOPERATIONS
attributes, which define the replication mechanism to be used for that object.
Application programmers can add their own replication mechanisms, including client-side
replication mechanisms, by providing functions for the PRSOPERATIONS attribute.
The PRSOPERATIONS attribute represents a set of operations for an object, and
this set of operations can define all the activities for an object from object creation
to object deletion.
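The idea of selecting an object's operations through an attribute can be sketched as below. This is a hedged illustration only: the registry, the function names, and the encoding of the PRSOPERATIONS attribute as a mechanism name are assumptions made for this example, not the PRS or Prospero API.

```python
from typing import Any, Callable, Dict

# Registry of replication mechanisms: name -> {operation name -> function}.
# Both the registry and the mechanism below are illustrative assumptions.
MECHANISMS: Dict[str, Dict[str, Callable[..., Any]]] = {}


def register_mechanism(name: str, operations: Dict[str, Callable[..., Any]]) -> None:
    """Add a new replication mechanism by supplying its operation functions."""
    MECHANISMS[name] = dict(operations)


def invoke(obj: Dict[str, Any], operation: str, *args: Any) -> Any:
    """Look up the object's mechanism from its attributes and dispatch to it."""
    mechanism = obj["attributes"]["PRSOPERATIONS"]
    try:
        func = MECHANISMS[mechanism][operation]
    except KeyError:
        raise LookupError(f"{mechanism!r} does not provide {operation!r}") from None
    return func(obj, *args)


def optimistic_set(obj: Dict[str, Any], key: str, value: Any) -> None:
    # Apply the change locally and log it for later propagation.
    obj["attributes"][key] = value
    obj.setdefault("update_log", []).append((key, value))


register_mechanism("optimistic", {"set": optimistic_set})
```

With this layout, adding a mechanism means registering one more table of functions, and two objects on the same server can use different mechanisms simply by carrying different attribute values.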
In addition to multiple replication support, the PRS also supports multiple
levels of replication, which can improve the overall performance of systems that have
mobile nodes. In multi-level replication schemes, second- or third-level replicas can
be placed on mobile nodes to reduce the loads on first-level replicas and, as a result,
the overall performance will be improved. Caching is supported as one of the
multiple levels and is considered to be one of the replication mechanisms. The unified
control of caching and replication enables caching to produce better results by using
replication information. Lastly, this dissertation introduced flexible replicated
objects, which can be used as a base model in designing replication in distributed
environments. This model is an extension of the object-oriented model in distributed
environments, and it has replication managers which communicate with each other
to manage replicas. This model presents the abstract view of a flexible framework
for replication.
To evaluate the system, two simulations were performed. In the first simulation,
read and write operations were simulated using example configurations of replicas,
and the number of packets and the availability of the object for each operation were
measured. The results show that the PRS is more flexible and provides better
performance in terms of availability and network traffic. In the second simulation, access
patterns of replicated objects were generated, and the results show that the PRS
dynamically changes replication parameters to obtain better results. Therefore the PRS
can deal with changes of environments efficiently.
Experiments were done on the PRS and the results of the experiments were
compared with those of the View Consistency system of Ficus. View Consistency
supports only client-side optimistic mechanisms and adds about 1 to 8 percent
overhead to the base system, whereas the PRS adds about 3.8 percent overhead
to the base system. Even though the overhead values cannot be compared directly
because their environments and systems are different, the results indicate that the
PRS provides good performance while supporting multiple replication mechanisms.
The benefits of the PRS can be summarized as follows: the PRS can support
diverse users and applications by providing a framework that can support various
replication mechanisms, so it can satisfy the needs of many users and
applications in distributed systems. As the size of distributed systems grows, new
demands and requirements for replication will arise. For example, as mobile
computers become popular, mobility should be supported in replication systems. As
another example, hand-held devices have very limited resources and require
replication that uses fewer resources. The PRS provides flexibility to users by allowing
new replication mechanisms to be added easily.
There are also some limitations to the PRS. As the number of supported
replication mechanisms increases, the system has more overhead because the size of
the libraries becomes bigger. The diversity of replication mechanisms used by clients
may also add system overhead. For example, if the number
of actively used replication mechanisms is large, the system must keep all these
Reference List
[1] Swarup Acharya and Stanley B. Zdonik. An efficient scheme for dynamic data
replication. Technical Report CS-93-43, Computer Science Department, Brown
University, 1993.
[2] P. A. Alsberg and J. D. Day. A principle for resilient sharing of distributed
resources. In Proceedings of the Second International Conference on Software
Engineering, October 1976.
[3] L. Amsaleg, G. Muller, I. Puaut, and X. Rousset de Pina. Experience with
building distributed systems on top of the Mach microkernel, 1995.
[4] Henri E. Bal, M. Frans Kaashoek, Andrew S. Tanenbaum, and Jack Jansen.
Replication techniques for speeding up parallel applications on distributed
systems. Technical report, Dept. of Mathematics and Computer Science, Vrije
Universiteit, Netherlands, 1992.
[5] Georges Brun-Cottan and Mesaac Makpangou. Adaptable replicated objects
in distributed environments. Rapport de Recherche No 2593 (ISSN 0249-6399),
Institut National de Recherche en Informatique et en Automatique,
Rocquencourt (France), May 1995.
[6] Marie Buretta. Data Replication: Tools and Techniques for Managing
Distributed Information. Wiley Computer Publishing, 1997.
[7] John B. Carter, John K. Bennett, and Willy Zwaenepoel. Implementation and
performance of Munin. ACM Operating Systems Review, 25(5):152-164,
October 1991.
[8] S. J. Caughey, G. D. Parrington, and S. K. Shrivastava. Shadows - a flexible
support system for objects in distributed systems. In Proceedings of the Third
International Workshop on Object Orientation in Operating Systems,
pages 73-82, Asheville, NC (USA), December 1993.
[9] George Coulouris, Jean Dollimore, and Tim Kindberg. Distributed Systems:
Concepts and Design. Addison-Wesley, second edition, 1994.
[10] Partha Dasgupta, Richard J. Leblanc Jr., Mustaque Ahamad, and Umakishore
Ramachandran. The Clouds distributed operating system. IEEE Computer,
24(11):34-44, November 1991.
[11] Susan B. Davidson, Hector Garcia-Molina, and Dale Skeen. Consistency in
partitioned networks. ACM Computing Surveys, 17(3):341-370, September 1985.
[12] Derek L. Eager and Kenneth C. Sevcik. Achieving robustness in distributed
database systems. ACM Transactions on Database Systems, pages 354-381,
September 1983.
[13] D. K. Gifford. Weighted voting for replicated data. In Proceedings of the 7th
Symposium on Operating Systems Principles, pages 150-162, Pacific Grove, CA
(USA), December 1979.
[14] Ashvin Goel, Calton Pu, and Gerald Popek. View consistency for optimistic
replication. In Proceedings of the 17th IEEE Symposium on Reliable
Distributed Systems, October 1998.
[15] C. G. Gray and D. R. Cheriton. Leases: an efficient fault-tolerant mechanism for
distributed file cache consistency. In Proceedings of the 12th ACM Symposium
on Operating Systems Principles (SOSP), volume 23(5), pages 202-210, 1989.
[16] Jim Gray, Pat Helland, Patrick O'Neil, and Dennis Shasha. The dangers of
replication and a solution. In Proceedings of the 1996 ACM SIGMOD
International Conference on Management of Data, pages 173-182, June 1996.
[17] Richard G. Guy, John S. Heidemann, Wai Mak, Thomas W. Page, Jr.,
Gerald J. Popek, and Dieter Rothmeier. Implementation of the Ficus replicated
file system. In USENIX Conference Proceedings, pages 63-71. USENIX, June
1990.
[18] James Gwertzman. Autonomous replication in wide-area internetworks.
Technical Report TR-17-95, Harvard University, April 1995.
[19] Graham Hamilton, Michael L. Powell, and James G. Mitchell. Subcontract:
A flexible base for distributed programming. In Proceedings of the 14th
Symposium on Operating Systems Principles, pages 69-79, Asheville, NC (USA),
December 1993.
[20] Ayal Itzkovitz and Assaf Schuster. Multiview and millipage - fine-grain sharing
in page-based dsms. In Proceedings of the Third USENIX Symposium on
Operating Systems Design and Implementation, New Orleans, Louisiana (USA),
February 1999.
[21] Anthony D. Joseph, Alan F. deLespinasse, Joshua A. Tauber, David K. Gifford,
and M. Frans Kaashoek. Rover: A toolkit for mobile information access. In
Proceedings of the Fifteenth Symposium on Operating Systems Principles. ACM,
1995.
[22] Anne-Marie Kermarrec, Ihor Kuz, Maarten van Steen, and Andrew S.
Tanenbaum. Towards scalable web documents. Technical Report IR-452, Vrije
Universiteit Computer Science Department, October 1998.
[23] James J. Kistler and Mahadev Satyanarayanan. Disconnected operation in the
Coda file system. ACM Transactions on Computer Systems, 10(1), February
1991.
[24] James Jay Kistler. Increasing file system availability through second-class
replication. In Proceedings of the Workshop on Management of Replicated Data,
pages 65-69. IEEE, November 1990.
[25] Mark C. Little and Santosh K. Shrivastava. Using application specific
knowledge for configuring object replicas. In International Conference on Distributed
Computing Systems, pages 169-176, 1996.
[26] Silvano Maffeis and Clemens H. Cap. Replication heuristics and polling
algorithms for object replication and a replicating file transfer protocol, July 1992.
[27] Mesaac Makpangou, Yvon Gourhant, Jean-Pierre Le Narzul, and Marc Shapiro.
Fragmented objects for distributed abstractions. In T. L. Casavant and
M. Singhal, editors, Readings in Distributed Computing Systems,
pages 170-186. IEEE Computer Society Press, July 1994.
[28] Elliot Moss. Nested Transactions: An Approach to Reliable Distributed
Computing. The MIT Press, Cambridge, Massachusetts, 1985.
[29] Keith Marzullo and Navin Budhiraja. Tradeoffs in implementing primary-backup
protocols.
[30] B. Clifford Neuman. The Virtual System Model: A Scalable Approach to
Organizing Large Systems. PhD thesis, University of Washington, 1992.
[31] Michael Rabinovich, Irina Rabinovich, Rajmohan Rajaraman, and Amit
Aggarwal. A dynamic object replication and migration protocol for an internet
hosting service. In International Conference on Distributed Computing Systems,
pages 101-113, 1999.
[32] David H. Ratner. Selective Replication: Fine-Grain Control of Replicated Files.
PhD thesis, University of California, Los Angeles Department of Computer
Science, 1995.
[33] M. Satyanarayanan, J. J. Kistler, P. Kumar, M. E. Okasaki, E. H. Siegel, and
D. C. Steere. Coda: A highly available file system for a distributed workstation
environment. IEEE Transactions on Computers, 39(4):447-459, April 1990.
[34] Mahadev Satyanarayanan. Mobile information access. IEEE Personal
Communications Magazine, 3(1), February 1996.
[35] Jeff Sidell, Paul M. Aoki, Sanford Barr, Adam Sah, Carl Staelin, Michael
Stonebraker, and Andrew Yu. Data replication in Mariposa. In Proceedings of the
Twelfth International Conference on Data Engineering, pages 485-494, 1996.
[36] Alexander Siegel. Performance in Flexible Distributed File Systems. PhD
thesis, Cornell University, May 1992.
[37] M. Singhal and N. Shivaratri. Advanced Concepts in Operating Systems.
McGraw-Hill, 1994.
[38] E. Gressier-Soudan and T. Cornilleau. A combined-consistency approach:
Sequential & causal-consistency. ACM Operating Systems Review, 1996.
[39] D. B. Terry, A. J. Demers, K. Petersen, M. J. Spreitzer, M. M. Theimer, and
B. B. Welch. Session guarantees for weakly consistent replicated data. In
Proceedings of the Third International Conference on Parallel and Distributed
Information Systems, pages 140-149, September 1994.
[40] Douglas B. Terry, Marvin M. Theimer, Karin Petersen, Alan J. Demers, Mike J.
Spreitzer, and Carl H. Hauser. Managing update conflicts in Bayou, a weakly
connected replicated storage system. In Proceedings of the Fifteenth Symposium
on Operating Systems Principles. ACM, 1995.
[41] Thuan L. Thai. Learning DCOM. O'Reilly, April 1999.
[42] M. Wiesmann, F. Pedone, A. Schiper, B. Kemme, and G. Alonso.
Understanding replication in databases and distributed systems. In Proceedings of 20th
International Conference on Distributed Computing Systems (ICDCS 2000),
pages 264-274, Taipei, Taiwan, R.O.C., April 2000. IEEE Computer Society
Technical Committee on Distributed Processing.
[43] Ouri Wolfson, Sushil Jajodia, and Yixiu Huang. An adaptive data replication
algorithm. ACM Transactions on Database Systems, 22(2):255-314, June 1997.
Asset Metadata
Creator: Im, Eul Gyu (author)
Core Title: A flexible framework for replication in distributed systems
School: Graduate School
Degree: Doctor of Philosophy
Degree Program: Computer Science
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: Computer Science, OAI-PMH Harvest
Language: English
Contributor: Digitized by ProQuest (provenance)
Advisor: Neuman, Clifford (committee chair), Horowitz, Ellis (committee member), Pinkston, Timothy M. (committee member)
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c16-223888
Unique identifier: UC11334881
Identifier: 3073796.pdf (filename), usctheses-c16-223888 (legacy record id)
Legacy Identifier: 3073796.pdf
Dmrecord: 223888
Document Type: Dissertation
Rights: Im, Eul Gyu
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the au...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus, Los Angeles, California 90089, USA