THE EVOLUTION OF KNOWLEDGE CREATION ONLINE:
WHAT IS DRIVING THE DYNAMICS OF WIKIPEDIA NETWORKS
by
Ruqin Ren
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree of
DOCTOR OF PHILOSOPHY
(COMMUNICATION)
December 2020
Copyright 2020 Ruqin Ren
ACKNOWLEDGEMENTS
I completed this dissertation at an unusual time, when we are all experiencing uncertainty, turmoil, and anxiety in our daily lives. I am indebted to the many great teachers, friends, and family members who supported me in different ways so that I could stay focused on finishing this project.
I thank the Annenberg School for its continued financial and administrative support over the past six years, which allowed me to finish this dissertation.
My advisor, Professor Peter Monge, has been a great guide on my research journey. He is
always there to give me careful, detailed, and patient feedback throughout my Ph.D. He pushed
me to think carefully about the theoretical development and methodological analyses in this
dissertation writing process and many other research projects. He provided all the help he could
whenever possible, and I always trusted his suggestions and opinions when I needed someone to consult for my research. It is a great honor to have Prof. Monge as my advisor and mentor, and these past six years have left me with many unforgettable and precious memories.
I would also like to thank my dissertation committee members, Professors Janet Fulk and
Ann Majchrzak. Their feedback, questions, and different perspectives on my work always gave me new inspiration and ideas for making this work better. Professor Janet Fulk is a great
mentor who helped me throughout the dissertation project and many other research projects. She
taught me how to become a better writer, a better thinker, and a better collaborator. Professor
Ann Majchrzak is always helpful and resourceful when I needed feedback on any of my research
projects. I know I can always count on her to provide valuable guidance in theoretical
developments and analyses.
My parents have been a constant source of inspiration as I try to understand who I am and
what I want to do. Mom is the hardest-working person I know. She is always ready to take on
more obligations and always ready to serve. I thank Dad for trying to listen to me and trying to
understand me (though sometimes he does get it). This dissertation is dedicated to you, your
love, and your hard work for the family.
To my husband Lang: I am lucky to have someone who understands what it means to
write a dissertation and get a Ph.D. You always trusted in me, and you knew I would be able to do
it. And yes, I did!
Finally, to my little buddies, Chi and Gamma. They are the loveliest, smartest, and most
playful dogs. I cannot remember how many afternoons they spent with me, napping or barking,
while I was writing. Thanks for reminding me that life is all about happiness and fun.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER 1: INTRODUCTION
CHAPTER 2: MODELS OF SOCIAL EVOLUTION
2.1 Evolutionary Epistemology
2.2 Measuring Selection
2.3 Community Ecology
2.4 What is Being Selected?
2.5 Network-based Characteristics
CHAPTER 3: EVOLUTIONARY DYNAMICS AND ONLINE KNOWLEDGE CREATION
3.1 Content-based Characteristics
3.3 Comparing Content-based vs. Network-based Characteristics
3.4 Comparing Different Types of Network-based Characteristics
3.5 The Effects of Network-based Characteristics on Content Exploration
CHAPTER 4: DATA AND METHOD
4.1 Wikipedia as a Research Site
4.2 Data Collection
4.3 Operationalization of Concepts
4.4 Analysis Plan
CHAPTER 5: RESULTS
5.1 Descriptive Statistics of the WikiProject Aquarium Fishes
5.2 Results for Hypothesis 1
5.3 Results for Hypothesis 2
5.4 Results for Hypothesis 3
5.5 Summary of Hypothesis Testing Results
CHAPTER 6: DISCUSSION
6.1 Theoretical Implications
6.2 Methodological Strengths
6.3 Practical Implications
6.4 Limitations and Future Work
6.5 Conclusion
References
LIST OF TABLES
Table 2.1 Eight Possible Relations Between Organizational Populations (Adapted from Aldrich and Ruef, 2006)
Table 2.2 List of Network Metrics Used in the Hilbert et al. (2016) Study
Table 5.1 Descriptive Statistics and Correlations
Table 5.2 Model Estimates Predicting Content Exploration
Table 5.3 ARMA Model Comparisons
Table 5.4 Augmented Dickey-Fuller Test Results
Table 5.5 Examining Each Network-based Endogenous Series for Predicting Content Exploration
Table 5.6 Hypothesis Testing Results
LIST OF FIGURES
Figure 3.1 Theoretical Model of Network Configurations That Predict Content Exploration
Figure 4.1 A Graphical Representation of the Development Trajectory of the Wikipedia Article “The Hobbit”, Showing the Article’s Content Exploration Process. Graph adapted from Arazy et al. (2020)
Figure 4.2 Some Exemplary Wikipedia Article Development Trajectories. Graphs ordered by the average level of exploration throughout the data observation period. Graph adapted from Arazy et al. (2020)
Figure 4.3 A Graphical Representation of How Transfer Function Models Decompose Three Dynamics in Modeling a Response Variable Y
Figure 5.1 Total Pageviews per Week. The X-axis represents the sequence of weeks (from Week 1 to Week 163). The Y-axis value is the weekly total pageviews (natural log transformed)
Figure 5.2 Total Revisions per Week. The X-axis represents the ID of the week (from Week 1 to Week 163). The Y-axis value is the weekly total revisions (natural log transformed)
Figure 5.3 Histogram of Revision Counts by Editors. The X-axis shows the number of revisions (natural log transformed) made by an editor. The Y-axis shows the count of editors corresponding to each level of “number of revisions (log)”
Figure 5.4 Number of Articles at Each Quality Level. The X-axis represents the number of articles and the Y-axis represents the quality levels from low to high (where Start is the lowest level and FA the highest quality in the dataset)
Figure 5.5 Average Percentile Rankings of Natural Selection Forces per Population Structure. The X-axis represents each different population partitioning structure; the Y-axis shows the average percentile ranking of natural selection value. The “Content” panel on the left side includes the 11 content-based characteristics, and the “Network” panel on the right side includes the 10 network-based characteristics
ABSTRACT
This project aims to understand the networked nature of socio-cultural evolution in online
knowledge creation systems and the evolution of online knowledge creation networks. The study
analyzes knowledge creation as an evolutionary process by examining the interconnections
among knowledge creation artifacts and the human actors who worked on these artifacts. The
research context is in an open, self-organized, collaborative knowledge network known as
Wikipedia. Specifically, the networked nature of evolution is explored in Hypotheses 1 and 2, where different kinds of network metrics were examined to find out which ones are particularly important in driving natural selection in a knowledge creation system. Natural selection was measured using the Price equation. The results suggested that network-based
characteristics are just as influential as content-based characteristics. Understanding the
evolutionary processes of knowledge creation cannot leave out the critical roles played by
network structural properties. The evolutionary dynamics exhibited in communication networks are presented in the study of Hypothesis 3, where time series regression techniques were used to
model how temporal changes in communication networks lead to consequential changes in
content development trajectories in a knowledge creation system. Over two years of article-
editor network data were collected along with the content-based characteristics of knowledge
creation (Wikipedia articles). Three network structural configurations -- network embeddedness, network connectivity, and network redundancy -- were shown to have lagged temporal effects on how the contents of knowledge products evolve over time. This suggests potential causal links between network configurations and knowledge creation trajectories.
CHAPTER 1: INTRODUCTION
How is knowledge created collectively? Understanding how knowledge is produced and
exchanged in our society critically affects the way we see the state of the world as it is and might
be (Benkler, 2006). Thus, this has always been one of the most fundamental questions regarding
knowledge management and information systems studies (Faraj et al., 2011; Kane et al., 2014).
Researchers have long been concerned about the processes of creating, communicating, and
retaining valuable knowledge (Campbell, 1974; Heyes & Hull, 2001; Popper, 1959, 1972;
Toulmin, 1967, 1972). In the social sciences, one such effort uses evolutionary and ecological theory as a basis for studying the human social processes involved in knowledge creation,
transmission, and retention. This perspective argues that both organizations and their
environments are subject to operating principles like those in natural environments (Malik &
Probst, 1982). This insight is even more compelling today because of the new forms of
organizing made possible by the Internet, such as knowledge collaboration networks (Benkler,
2006; Faraj et al., 2011; Kane & Ransbotham, 2016).
This type of collaborative network can be formally defined as an emergent system consisting of synergies among three elements: (a) data/information/knowledge, (b) technology, and (c) people with certain qualities or expertise to provide valuable information (Glenn, 2016).
The dynamic interaction among the three elements holds the promise of providing better
intelligence than any of these elements acting alone. Prominent examples of this type of
organizational form, such as Wikipedia, StackOverflow, GitHub and Kaggle, have attracted
much scholarly attention to understanding how knowledge is created collectively in a
technology-mediated environment. Collective knowledge creation occurs online where the
cumulative efforts of users come together through internet-based technologies. Here, people
work on projects which lack the extensive management structures of traditional organizations
(Benkler, 2006). Instead, the technological design itself serves to integrate the knowledge
production process. This research project analyzes the evolutionary process of knowledge
creation by examining the interconnections among these elements and exploring how the
dynamic process unfolds in an open, self-organized, collaborative knowledge network known as
Wikipedia.
Knowledge collaboration networks have a set of new characteristics that challenge the
traditional knowledge creation model (Lee & Cole, 2003). Applying evolutionary theory to the
knowledge creation processes in online environments, this dissertation attempts to achieve the
following three goals. First, it presents a system-level explanation of the knowledge creation
process, linking together knowledge workers, information entities they collectively create, and
the external environment that provides resources for and imposes natural selection pressures on
the system. The technology provides a material basis for the information to be created, presented,
and modified by human actors. The people provide content to be shown on the technical
platform. The knowledge collectively created by these people is what makes the technical
platform successful. These elements are naturally interconnected as a network, and the
interacting processes generate the dynamic forces driving the evolution of knowledge products.
Specifically, capturing these interconnected relationships among people (editors on Wikipedia)
and the information products they create (content on Wikipedia) is important for understanding
the evolutionary changes that occur within this knowledge creation system. The evolutionary
perspective is often ignored in the discussion of collaborative knowledge creation activities
online. By applying key evolutionary concepts such as variation, selection, and retention to
understanding this knowledge creation ecosystem, it is possible to see how the technology-
mediated environment may selectively favor certain types of collaboration network patterns and
information characteristics, and thus lead to varied knowledge creation outcomes.
Second, this analytic angle moves the level of analysis from that of individuals or small
communities, toward populations and their relations with the environment (Aldrich & Ruef,
2006). It has long been recognized that evolutionary theory is concerned with population-level
dynamics rather than with individuals (Astley, 1985; Nowak, 2006). While the literature has
mostly focused on investigating how individual editors work together in this collective project of
encyclopedia writing, this study attempts to consider the population of Wikipedia articles and the
population-level dynamics exhibited collectively by these articles.
Third, it contributes to the recent theoretical development regarding how to define the
partitioning structures of a population. While most previous work focused on individual entities’
traits, such as a bird’s beak length or a Wikipedia article’s reference count, to categorize a
population, this study also evaluates network-based metrics as a partitioning trait of a population.
It tests whether network-based or content-based characteristics are relatively more important in
terms of the natural selection pressure they undergo -- a direct test of a recent theoretical
development of organizational evolution theory (Hilbert, Oh, & Monge, 2016). Additionally, it
investigates why certain network-based characteristics may be driving the evolution of content
by causally analyzing the relationship between network-based characteristics and whether the
article exhibits high levels of content exploration, a tendency to deviate from a set development
trajectory.
This study draws upon general theories about evolutionary epistemology (Campbell,
1974, 1997; Popper, 1972) and models of organizational knowledge creation (G. K. Lee & Cole,
2003; March, 1991) to address the practice of knowledge creation in a distributed online
community setting. An evolutionary model is proposed as the basis for examining knowledge
creation in online environments. Wikipedia, an international project that uses Wiki software to
collaboratively create an encyclopedia, is becoming increasingly important for information
search and knowledge retrieval purposes (Kittur & Kraut, 2008; Ren & Yan, 2017). The
emergence of Wikipedia as a knowledge resource that anyone can edit also offers a rich research
site for the study of knowledge epistemology. Every edit log is publicly available, accompanied by rich historical data; thus, it is now possible to closely observe how evolutionary processes
unfold in this unique new context.
The next section reviews the literature on foundational evolutionary epistemology theory
(Campbell, 1974; Heyes & Hull, 2001; Popper, 1972), especially the theory of selection (Frank,
2012b; Heylighen, 1997; Kim, 2001), and recent theoretical developments in terms of the
definition of population partitioning structures (Hilbert et al., 2016; S. Lee & Monge, 2011;
Monge et al., 2008). After that, the dissertation describes the evolutionary processes observed on
Wikipedia and discusses how the evolutionary perspective applies to the online environment.
Then hypotheses are developed to test for the relative importance of content-based and network-
based partitioning structures, and to further investigate why network-based partitioning
structures may explain the evolutionary outcomes of knowledge creation. Then empirical testing
methods and results are reported. Lastly, the dissertation provides a discussion about the
implications of the results and suggests possible interpretations of the evolutionary patterns that
may apply to other knowledge systems. This effort also seeks to provide practical insights for
wiki organizers and practitioners working in such knowledge collaboration systems, by showing
what the important drivers of knowledge creation are and what kind of relationships exist
between the collaboration network structures and knowledge outcomes. Understanding these
issues can inform decision-making about how and where to invest resources for developing more
successful wiki systems.
CHAPTER 2: MODELS OF SOCIAL EVOLUTION
This chapter provides a background about different perspectives in studying social and
cultural evolutionary change. The first section overviews key theories in the field of evolutionary
epistemology, which examines the evolutionary process of human knowledge creation. The next
section discusses general selection theory and provides more formal and mathematical
definitions of evolutionary change that are applicable to both biological evolutionary systems
and social-cultural evolutionary systems. The third section introduces the theory of community
ecology, which allows for a slightly different interpretation about how to define populations and
subpopulations. The last section in this chapter brings together the streams of research and
discusses hypotheses to be developed for empirical testing.
2.1 Evolutionary Epistemology
Evolutionary epistemology is an analytical framework that combines biological
evolutionary processes with the philosophy of science and knowledge (Sereno, 1991). In this
stream of literature, many philosophers (Campbell, 1974; Heyes & Hull, 2001; Popper, 1959,
1972; Toulmin, 1967, 1972) have drawn from the classic variation-selection-retention (V-S-R)
paradigm of biological evolution and applied it to human knowledge accumulation processes
(Sereno, 1991). The overall thesis of evolutionary epistemology is that knowledge development
is “a direct extension” of general evolutionary development, and the dynamics of the two
processes are similar (Hahlweg & Hooker, 1989, p. 23). It is imperative to point out here that the
term “evolutionary epistemology” refers to two slightly different yet related historical
investigative efforts (Bradie, 1986). Evolutionary epistemology, on one hand, refers to an
attempt to study human cognition as a biological phenomenon that evolves (Plotkin, 1982). This
research effort is a “straightforward extension of the biological theory of evolution” to the
“biological substrates of cognitive activity, e.g., their brains, sensory systems, etc.” (Bradie, 1986). The other program of evolutionary epistemology “attempts to account for the evolution of ideas, scientific theories and culture in general by using models and metaphors drawn from evolutionary biology” (Bradie, 1986, p. 401). The main distinction between these two is that the
first stream is essentially part of evolutionary biology that studies biological bases of cognition –
just like other subfields of evolutionary biology that study respiratory systems or visual systems,
while the second is a direct application of evolutionary theory to the study of cultural phenomena
(Callebaut & Pinxten, 2012). These two programs are not contradictory. In fact, they share the
same theoretical roots of Darwinian evolutionary theory which are applied to different fields –
biological and sociological, respectively. Some authors, including Campbell (1974), Popper
(1984), and Plotkin (1982), advocate for both programs. Campbell (1974), for example, argued for the
applicability of a general selectionism model that can explain not only the evolution of all
biological structures (certainly including cognitive ones) but also the growth of scientific
knowledge. However, since our interest is in the sociocultural phenomenon of knowledge
creation, rather than biological structures of human cognition, the second program is more
appropriate for this context.
Even though sociocultural evolution does not directly involve DNA transmission as in
biological processes, several principles of evolutionary theory can apply to the field of evolution
of science (Hahlweg & Hooker, 1989; Heyes & Hull, 2001). First, evolutionary epistemology
proposes a universal selection theory similar to biological evolution, and this is fundamental to
all inductive knowledge achievements (Campbell, 1974; Heyes & Hull, 2001). Karl Popper
(1972) distinguishes a purely objective world from the subjective world where human minds
make sense of this objective world. Humans then create explications as products of our mental
states. This is the dialectic nature of objective versus subjective knowledge. However, there is no
certainty that we are approaching truth or reality with objective knowledge; instead, we can only
have tentative hypotheses about the problems we have. Thus, in the end, science may be hard to
justify; but we can still rationally criticize the theories and other aspects of knowledge, and
tentatively adopt those that seem best to withstand our criticism and that have the greatest
explanatory power. To some extent, the knowledge creation process can be summarized as a trial-and-error process (Popper, 1972).
“It is primarily through the works of Karl Popper that a natural selection epistemology is
available today,” wrote Campbell (1974, p.413). In the same spirit, Campbell proposed a
universal selectionism theory as a blind-variation-and-selective-retention process (BVSR
process), which is “fundamental to all inductive achievements, to all genuine increases in
knowledge…” (Campbell, 1960, p. 380). Just as Darwinian evolution explains the fit between
organisms and the environment, Campbell’s general selectionist theory attempts to explain the fit
between our beliefs of knowledge and reality (Campbell, 1974). Essentially, all knowledge
creation achievements can be traced back to the many blind variations and selection processes
that happened across time. Knowledge increases are thus products of the evolutionary processes
in which we first create many blind variations of beliefs and then test these beliefs against reality
(Campbell, 1974). The creators of the blind variations do not know the relationship between the
proposed belief and the reality. People will only know which are the “fitter” beliefs by checking
which ones can withstand many rounds of reality checks. Those that survived many rounds of
reality checks are retained for a longer time and replicated more, while those that cannot survive turn out to be our false beliefs about reality and are eventually discarded.
The notion of BVSR thus could be developed to allow us to empirically link the degree of
selection (input of the evolutionary process) with the degree of knowledge retention (output of
the evolutionary process) – which will be discussed in detail in Chapter 3.
This BVSR model, a traditional way of explaining how human beings incrementally
achieve knowledge acquisition, shares a certain level of similarity with the modular contribution
model commonly observed in many collective intelligence systems today (Benkler, 2006; Deng,
Joshi, & Galliers, 2016; Weidema, López, Nayebaziz, Spanghero, & van der Hoek, 2016).
Modularity is “a property of a project that describes the extent to which it can be broken down
into smaller components, or modules, that can be independently produced before they are
assembled into a whole” (Benkler, 2006, p. 96). Some of the knowledge-intensive tasks today can
be achieved without recruiting professional scientists to do them; rather, these seemingly daunting tasks can be accomplished through a combination of highly modularized sub-tasks. Large-scale participation from “non-professional” crowds is encouraged, without worrying that these
crowd workers may introduce many errors. It is actually quite important that these contributions
are diverse in their quality, quantity, and focus, in order to maximize participants’ autonomy and
flexibility to define the nature of their contributions to the project (Benkler, 2006). After
receiving diverse and varied contributions for these modularized sub-tasks, task organizers can
build in task redundancy to probabilistically average out the errors.
As an illustrative example, the NASA Clickworker project launched the task of mapping
Mars craters to a crowd of online users, by breaking up the Mars map into many small segments
(Benkler, 2006; Kanefsky et al., 2000). The participants can easily use a marking tool provided
by the technical platform to help identify craters shown on a small segment of the map – taken as
a part of the complete map. To create a mechanism of error correction, the platform assigns the
same image segment to multiple users to work on, so that an image will be evaluated
independently multiple times by different people. The results will be more accurate and reliable
than when only one person is making judgments. As such, the modular contribution model can
be understood as a process of contributing many “blind” versions of variations to the sub-tasks,
followed by selection forces that retain only the fittest individual entities by building in task
redundancy to check for errors.
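To make this error-averaging logic concrete, here is a minimal illustrative Python sketch (not code from the Clickworker project; the 20% per-annotator error rate and the redundancy levels are invented numbers). It simulates how assigning the same segment to k independent annotators and taking a majority vote drives down the labeling error:

```python
import random

random.seed(42)
p_error = 0.2   # invented probability that a single annotator mislabels a segment

def majority_accuracy(k: int, trials: int = 100_000) -> float:
    """Estimate how often a k-annotator majority vote labels a segment correctly."""
    correct = 0
    for _ in range(trials):
        right_votes = sum(1 for _ in range(k) if random.random() > p_error)
        if right_votes > k / 2:   # a strict majority voted correctly
            correct += 1
    return correct / trials

for k in (1, 3, 5, 9):
    print(k, round(majority_accuracy(k), 3))
# Accuracy climbs from about 0.80 with one annotator to about 0.98 with nine.
```

The same probabilistic logic underlies the BVSR account: many independently produced, error-prone variations, combined with a selective check, retain mostly the correct outcomes.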
2.2 Measuring Selection
In the field of biology, many scholars have attempted to quantitatively describe
evolutionary changes. It is helpful to review them briefly, even though they were initially not
proposed to study sociocultural evolution. Understanding and applying these fundamental
mechanisms of evolutionary theory will be useful in our examination of knowledge creation
systems.
Some notation will be useful. $w_i$ is defined to be the fitness of element or individual member $i$ in the population at a given time. $\bar{w}$ is the average fitness of the whole population at that time; so $\bar{w}$ equals the average of $w_i$ across all individual members of the population. (The current notation system is chosen for its accessibility and wide acceptance in the literature, e.g., Frank, 1995; Okasha, 2008. Other notation systems also exist, such as those of Fisher, 1958, and Price, 1972. The choice of notation does not change the essential content of the following discussion.)
Fisher derived a fundamental theorem of natural selection and described the relationship
between the rate of fitness increase for a population and variance in fitness. He stated the
theorem that “the rate of increase in fitness …at any time is equal to its genetic variance in
fitness at that time” (Fisher, 1958, p. 46). Other scholars later recommended using the wording “additive genetic variance” instead of “genetic variance” to be more precise and compatible with current terminology (Okasha, 2008; Price, 1972). This is a fundamentally important theorem about evolution, because it was one of the earliest theories to point out a general and predictable rule of evolution. The idea is that, as a result of natural selection, average population fitness changes at a rate that can be mathematically determined: the additive genetic variance of fitness. Formally, Fisher's theorem is expressed as

$$\frac{d\bar{w}}{dt} = \mathrm{Var}(w),$$

where $\frac{d\bar{w}}{dt}$ is the rate of change of average population fitness at time $t$. For the purpose of understanding socio-cultural evolution, additive genetic variance in fitness can simply be understood as the variance in fitness due to natural selection, or $\mathrm{Var}(w)$, because genes are not involved. The calculation of $\mathrm{Var}(w)$ follows the classic definition of variance, the mean squared deviation. Fisher was primarily concerned with a continuous-time model in his original work, hence the continuous derivative symbol. Later scholars argued that the application to discrete-time situations is also valid (Okasha, 2008).
Because Fisher was primarily interested in biological evolution, it was meaningful for him and other scholars to discuss the precise definition of additive genetic variance in fitness.
Biologists decompose total variance in fitness, Var(w), into several sources, such as that caused
by genetic differences, environmental differences, or pure chance (Okasha, 2008; Price, 1972).
The focus of Fisher’s theorem is on genetically caused variance in fitness, or Var(genetic).
Var(genetic) can also be partitioned into two parts, an additive component, Var(additive genetic),
and a non-additive component, Var(non-additive genetic). This is where the terminology of
“additive genetic variance” comes from. Further discussion of these concepts is beyond the scope
of the current project. See Okasha (2008) for more details.
Price (1970) introduced the Price equation, which offers an “exact and complete”
mathematical description of evolutionary change under all conditions (Frank, 1997, p. 1712).
This equation provides a way to generalize “from genetical selection to obtain a general selection
theory” (Price, 1995, p. 389). Thus, his theory could be equally applied to biological evolution
as well as sociocultural evolution.
To begin with, selection is defined as the act or process of “producing a corresponding set” (Price, 1995, p. 392). Given a population $P$ and a trait of interest $I$ that defines this population in a certain way, we can say that the set $P$ contains $q_i$ amounts of distinct individuals with trait value $z_i$. In empirical research, the traits, or properties, could be many things with quantitative values, such as number of fingers, body length, body weight, the revenue of a company, or the payoff of a decision (Price used the terms “character” and “property” instead of “trait”; while they all carry essentially the same meaning, I avoid “character” here for clarity).
Price then defines a corresponding set $P'$ as follows: “…a set $P'$ is a corresponding set of $P$ if there exists a one-to-one correspondence such that for each trait element of $P$, there is a corresponding element of $P'$” (1995, p. 392). Typically, $P$ can be thought of as a parent population and $P'$ an offspring population that corresponds to it; or $P$ can be a population at an earlier time point and $P'$ the corresponding population at a later time point. In the biological world, for example, $P$ could be an ancestor group of animals and $P'$ the current group of animals. In socio-cultural evolution, $P$ could be the population of convenience stores in one city and $P'$ the population of convenience stores in the same city after one year. Specifically, the one-to-one correspondence between the elements exists if $q'_i$ has been derived directly or indirectly from $q_i$, due to inheritance, replication, or other reasons. With this definition of a corresponding set, selection can then be defined with further specificity: “Selection on a set $P$ in relation to a property $I$ is the act or process of producing a corresponding set $P'$ in a way such that the amounts $q'_i$ are non-randomly related to the corresponding $z_i$ values” (Price, 1995, p. 392; Price’s original notation has been modified by the author to reflect the notation widely used in the current literature. For example, Price used the notation $x$ to indicate what is $z$ here, the quantitative value of a trait, and he used $w$ to represent what is $q$ here, the amount of that trait in a population.)
To provide an illustrative example, assume a population of gas stations in a city (Usher & Evans, 1996). The trait of interest, $I$, is the number of car wash spots hosted by a gas station, one defining property of a gas station (there could be many other traits depending on research interests). Based on differences in $I$, we can categorize the gas stations into two categories: those that have one car wash spot and those that have two. The ones with one car wash spot are labeled with trait element $i = 1$, those with two are labeled with trait element $i = 2$, and the list could continue. If there are 10 gas stations with one car wash spot, then $q_1 = 10$, $z_1 = 1$; and if there are 15 stations with trait element $i = 2$, then $q_2 = 15$, $z_2 = 2$.
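As a worked check of this setup (arithmetic added for illustration, not part of the original example), normalizing the counts into frequencies gives the population's average trait value:

$$\bar{z} = \frac{\sum_i q_i z_i}{\sum_i q_i} = \frac{10 \cdot 1 + 15 \cdot 2}{10 + 15} = 1.6,$$

that is, the average gas station in this population hosts 1.6 car wash spots. When the $q_i$ are already expressed as frequencies summing to one, as in the equations that follow, the denominator drops out and $\bar{z} = \sum_i q_i z_i$.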
After explicating Price’s definitions of the key concepts, we now discuss the Price equation. Let $P$ be a set containing $q_i$ amounts of trait element $i$, which have trait value $z_i$. There is also the corresponding set $P'$, defined as described above. The set $P'$ contains $q'_i$ amounts of distinct elements with properties $z'_i$. In other words, the index of trait element $i$ identifies subpopulations within the set $P$, and each subpopulation takes up a fraction $q_i$ of the total population. The change from $P$ to $P'$ is termed a selection process that gives rise to the effect $\bar{z} \to \bar{z}'$ in a population trait $I$. This effect can be calculated as the change in the average value of the trait (Knudsen, 2004; Price, 1995):

$$\Delta\bar{z} = \bar{z}' - \bar{z} = \sum_i q'_i z'_i - \sum_i q_i z_i \qquad (1)$$
So $\Delta\bar{z}$ represents the change in the average value of a trait, the result of the difference in average trait value between the two sets ($P$ and $P'$). The power of the Price equation is the unique way it connects the posterior set ($P'$) with the anterior set ($P$) by using the trait index $i$ (Frank, 1998). This is done by defining the change from $q_i$ to $q'_i$:

$$q'_i = q_i \cdot \frac{w_i}{\bar{w}} \qquad (2)$$

The component $w_i$ is usually referred to as the fitness of trait element $i$ in the population, and $\bar{w}$ is the average fitness of the whole population. The multiplying factor $w_i/\bar{w}$ thus gives the proportion of the amount $q_i$ (for each trait index $i$) from set $P$ that is actually contained in set $P'$. This equation can be used to substitute for $q'_i$ in equation (1). Intuitively, $z'_i$, the value of trait element $i$ in the posterior set $P'$, is the trait value in the anterior set, $z_i$, plus the amount of change $\Delta z_i$:

$$z'_i = z_i + \Delta z_i \qquad (3)$$
The derivation below starts with equation (4), which is a restatement of equation (1). With some rearrangement, we can remove the variables related to set $P'$ ($q'_i$ and $z'_i$) and work only with variables related to population $P$. The end goal is an equation that describes the change in the average value of a trait ($\Delta\bar{z}$) by referring only to population $P$. The second line of the derivation replaces $q'_i$ with the right side of equation (2) and replaces $z'_i$ with the right side of equation (3). After this substitution, all terms in the second line refer only to population $P$ and not to population $P'$. The third line expands the first sum term by splitting $z_i + \Delta z_i$ into its two parts, each multiplied by $q_i \frac{w_i}{\bar{w}}$ separately. The fourth line combines the first and the third terms, which share a common factor $q_i z_i$.
$$
\begin{aligned}
\Delta\bar{z} &= \sum_i q'_i z'_i - \sum_i q_i z_i \qquad (4)\\
&= \sum_i q_i \frac{w_i}{\bar{w}} (z_i + \Delta z_i) - \sum_i q_i z_i \\
&= \sum_i q_i \frac{w_i}{\bar{w}} z_i + \sum_i q_i \frac{w_i}{\bar{w}} \Delta z_i - \sum_i q_i z_i \\
&= \sum_i q_i \left( \frac{w_i}{\bar{w}} - 1 \right) z_i + \sum_i q_i \frac{w_i}{\bar{w}} \Delta z_i
\end{aligned}
$$
Multiplying $\bar{w}$ on both sides of the equation, rearranging terms, and taking expectations leads to the following covariance equation (the derivation procedure can be found in Knudsen, 2004):
$$
\begin{aligned}
\bar{w}\,\Delta\bar{z} &= \bar{w} \sum_i q_i \left( \frac{w_i}{\bar{w}} - 1 \right) z_i + \bar{w} \sum_i q_i \frac{w_i}{\bar{w}} \Delta z_i \\
&= \sum_i q_i w_i z_i - \bar{w} \sum_i q_i z_i + \sum_i q_i w_i \Delta z_i \\
&= E(w_i z_i) - \bar{w}\,\bar{z} + E(w_i \Delta z_i) \\
&= \mathrm{Cov}(w_i, z_i) + E(w_i \Delta z_i) \qquad (5)
\end{aligned}
$$
Equation (5) can be rewritten as follows, by moving $\bar{w}$ to the right side:

$$\Delta\bar{z} = \mathrm{Cov}\left(\frac{w_i}{\bar{w}},\, z_i\right) + E\left(\frac{w_i}{\bar{w}}\,\Delta z_i\right) \qquad (6)$$
Equation (6) is the Price equation (Frank, 1995; Price, 1972), where the left side of the equation is the change in average trait value. The right side shows that the change in mean trait value comes from two sources: the covariance term, $\mathrm{Cov}(w_i/\bar{w},\, z_i)$, and the expectation term, $E\left((w_i/\bar{w})\,\Delta z_i\right)$. The covariance term describes the selection effect, which links each trait value $z_i$ with its relative fitness $w_i/\bar{w}$. Given the mathematical definition of covariance, if the trait value $z_i$ is independent of relative fitness, the covariance term equals 0. A positive relation between trait value $z_i$ and relative fitness gives a positive covariance, and a negative relation gives a negative covariance. A positive covariance means selection favors an increase in the trait value, and a negative covariance implies a decrease in the trait value. For example, if individual members with longer body length (a trait value) tend to have more offspring (high fitness), then the covariance term is positive and selection acts to increase the average body length of the group of individuals.
The expectation term describes the transmission effect, the extent to which the posterior set is an exact copy of the anterior set. The $\Delta z_i$ are the changes in trait values between each pair of elements $i$ of the two sets. When the expectation term is zero, selection is the only factor involved in the evolution of the trait. However, the posterior set could certainly differ from the anterior set, perhaps because of genetic mutation or recombination, or because of a change in the environment (Gardner, 2008). In these situations, $\Delta z_i$ is non-zero. A larger value of $\Delta z_i$ indicates lower transmission fidelity.
In some contexts, the change in mean trait value is mainly related to the covariance between the trait value and its fitness (the covariance term), rather than to the transmission effect (the expectation term). In our investigation of socio-cultural evolution online, we are less concerned about the fidelity of transmission between generations, as online materials can very easily be replicated or copied without information loss. The covariance term captures the Darwinian idea of the ‘survival of the fittest’ because it shows how natural selection operates to favor those traits that are positively correlated with reproductive success. “Discarding the …change due to transmission, the Price equation can be used to provide a formal statement of natural selection” (Gardner, 2008, p. 199):

$$\bar{w}\,\Delta\bar{z} = \mathrm{Cov}(w_i, z_i) \qquad (7)$$
This is also called the reduced form of the Price equation (Frank, 1995). The power of the Price equation comes from two aspects. First, it applies not only to biological evolution but also to anything else that evolves. For instance, the trait value $z_i$ could be height, hair color, the revenue of a type of company, or the payoff of an altruistic behavior. There have been important applications of the Price equation to socio-economic evolution (El Mouden et al., 2014; Knudsen, 2004). In fact, because $z_i$ could be anything that evolves, we can set the trait value $z_i$ equal to $w_i$ -- the fitness of each group. Because the covariance of a trait with itself is equal to its variance, it can be further derived that

$$\bar{w}\,\Delta\bar{w} = \mathrm{Cov}(w_i, w_i) = \mathrm{Var}(w_i)$$

With rearrangement:

$$\Delta\bar{w} = \frac{\mathrm{Var}(w)}{\bar{w}} \qquad (8)$$

This essentially shows that the change in average fitness is related to the variance of fitness. This is also the discrete-time version of Fisher's theorem discussed earlier. This conclusion will be a useful starting point for hypotheses development later.
Second, the Price equation captures the process of evolution by linking the ancestral and
descendant populations. The Price equation quantifies a classic idea in evolutionary theory that
“neither genes, nor cells, nor organisms, nor ideas evolve. Only populations can evolve”
(Nowak, 2006, p. 14). Selection operates at the population level, not the individual level.
Selection is thus theorized as a process that links two (somewhat related) populations or sets at
two different time points. Usually, they are the parent population and the offspring population in
the biological field; they could also be social or cultural elements that replicate or maintain
themselves in a relatively stable manner over a period of time. The power of the Price equation is
that it provides an unusual way of establishing the correspondence: deriving the elements of the posterior set from those of the anterior set (using the index $i$).
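To make the machinery of equations (1) through (6) concrete, the following short Python sketch (a toy illustration; the two-group frequencies, trait values, fitnesses, and transmission changes are invented numbers) computes the covariance and expectation terms for a small population and checks that their sum matches the directly computed change in mean trait value:

```python
# Toy numeric check of the Price equation (equation 6); all numbers are invented.
q  = [0.5, 0.5]    # group frequencies q_i at time t (sum to 1)
z  = [1.0, 2.0]    # trait values z_i
w  = [3.0, 1.5]    # group fitnesses w_i
dz = [0.2, -0.1]   # trait change during transmission (imperfect fidelity)

w_bar = sum(qi * wi for qi, wi in zip(q, w))   # average fitness, 2.25
z_bar = sum(qi * zi for qi, zi in zip(q, z))   # average trait value, 1.5

# Posterior population via equations (2) and (3)
q_post = [qi * wi / w_bar for qi, wi in zip(q, w)]
z_post = [zi + di for zi, di in zip(z, dz)]

# Left side of equation (1): direct change in mean trait value
dz_bar = sum(qp * zp for qp, zp in zip(q_post, z_post)) - z_bar

# Right side of equation (6): covariance term plus expectation term.
# Because E[w_i / w_bar] = 1, Cov(w_i / w_bar, z_i) = E[(w_i / w_bar) z_i] - z_bar.
cov_term = sum(qi * (wi / w_bar) * zi for qi, wi, zi in zip(q, w, z)) - z_bar
exp_term = sum(qi * (wi / w_bar) * di for qi, wi, di in zip(q, w, dz))

print(round(dz_bar, 6), round(cov_term + exp_term, 6))  # both print -0.066667
```

Setting the transmission changes to zero makes the expectation term vanish, leaving the covariance term alone to account for $\Delta\bar{z}$, which is exactly the reduced form in equation (7).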
Later, evolutionary economists (Knudsen, 2004) and evolutionary population biologists (Frank, 1998, 2017; Gardner, 2008) used the fundamental ideas expressed in the Price equation to decompose population growth. The basic idea is to first divide a population into
different groups based on their diverse traits i (groups here are what Hilbert et al., 2016 called
partitions or types), and then measure the differential selection force and growth rate of different
groups (within a population) over time (Frank, 2017). Commonly used biological characteristics
that describe group differences could be genes, weight, height, hair color, and so on.
Formally, the growth rate of a group can be termed fitness. In biological systems, fitness often refers to “the number of offspring” or the “rate of reproduction”, while economists refer to it as a “payoff” or “return on investment”. $\bar{w}$ is often used to denote the fitness of the entire population. It is calculated as the number of offspring at time $t + 1$ divided by the number of ancestors at time $t$:

$$\bar{w} = \frac{\text{count at time } t+1}{\text{count at time } t}$$

Accordingly, the $i$th group in a population has fitness $w_i$. This differential growth rate of the groups is what produces the evolution of the whole population. Groups with superior relative fitness gain population share and will dominate the population in the long run, while groups with inferior fitness will be selected against and decline in number over time. The group-level fitness values, combined (weighted by group frequency), equal the population fitness.
We already have equation (8):

$$\Delta\bar{w} = \frac{\mathrm{Var}(w)}{\bar{w}}$$

This formula states that the selection force (i.e., the change in population fitness) equals the mean-normalized variance of group-level fitness. This is a powerful equation because we ultimately care about selection at the population level, instead of the group (i.e., subpopulation) level.
Consider an illustrative example, which expands a toy dataset provided in another source (Hilbert et al., 2016, p. 40). Suppose there is a small Labrador Retriever population containing 8 units at time $t$, divided into two groups on the basis of whether they are chocolate or yellow colored Labradors. The population grows in size and comes to have 18 offspring units at time $t + 1$ as a result of the breeder’s selection. For example, the breeder may choose to provide more mating opportunities for one group and fewer, or none at all, for the other group; the breeder may also choose to send some less-desirable puppies away or offer them for sale. So, the overall population-level fitness $\bar{w}$ is equal to 2.25 ($= 18/8$). Group-level fitness can be measured from the sub-populations, in this case the two groups of chocolate and yellow colored Labradors. The identifying trait index is the color index, such as 1 = yellow, 2 = chocolate. The yellow colored Labradors grow from 4 units to 12 units from time $t$ to $t+1$, and the chocolate ones grow from 4 units to 6 units. Thus, we can calculate $w_{\text{yellow}} = 3$ and $w_{\text{chocolate}} = 1.5$. The chocolate colored Labradors are less popular in the pet market and have lower fitness than the yellow colored group; with time, this trend will be observed as growth in the size of the yellow colored group and a relative shrinking of the chocolate group. The selection force operates on the different groups based on their different traits (in this case, their differently colored coats). Using equation (8),

$$\Delta\bar{w} = \frac{\mathrm{Var}(w)}{\bar{w}} = \frac{0.5\,(3 - 2.25)^2 + 0.5\,(1.5 - 2.25)^2}{2.25} = \frac{0.5625}{2.25} = 0.25.$$
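The same calculation can be verified in a few lines of Python (a minimal sketch using the group counts from the Labrador example above):

```python
# Numeric check of the Labrador example against equation (8).
n_t  = {"yellow": 4, "chocolate": 4}    # population at time t (8 units)
n_t1 = {"yellow": 12, "chocolate": 6}   # population at time t + 1 (18 units)

w_bar = sum(n_t1.values()) / sum(n_t.values())       # 18 / 8 = 2.25
w = {g: n_t1[g] / n_t[g] for g in n_t}               # yellow 3.0, chocolate 1.5
q = {g: n_t[g] / sum(n_t.values()) for g in n_t}     # frequencies at time t, 0.5 each

# Frequency-weighted variance of group fitness: Var(w) = sum_i q_i (w_i - w_bar)^2
var_w = sum(q[g] * (w[g] - w_bar) ** 2 for g in w)   # 0.5625

print(var_w / w_bar)  # change in population fitness per equation (8): 0.25
```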
2.3 Community Ecology
Reviewing the Price equation as a general selection theory makes clear that a first step in any effort to quantify evolutionary change is to define a unit of selection, or a pattern that operates in the evolutionary process. In more formal terms, we first need to correctly identify what constitutes the subpopulation index $i$ and the corresponding trait value $z$, so that application of the above formula is possible. Most sociocultural evolution models assume that some sort of unit exists (Aunger, 2000; Dawkins, 1976; Pocklington &
1997). For example, the concept “meme” was explored as a basic unit of information that can be
replicated and inherited in cultural evolution (Aunger, 2000; Dawkins, 1976). “Meme” was a
concept analogous to biological genes – the basic unit of biological evolution. Pocklington and
Best (1997) suggested looking for a basic unit of cultural evolution called the “cognitive motif”, a combination of multiple verbal terms that represents a particular meaning in human
discourse. This concept is conceptually similar to proteins in biological evolution.
In biological evolution, genes, cells, organisms or even entire species could be units of
selection, which interact with the environment with different success rates in reproduction (Hull,
2001). Organizational scholars identify organizational forms (Aldrich & Ruef, 2006; McKelvey, 1982): organizations that share similar characteristics can be grouped together into organizational forms (Monge & Poole, 2008). Many possible grouping characteristics exist, depending on research interests, for example, organization size, industry, whether the organization is for profit, its development stage, and so on.
Aside from these characteristics that can be identified by examining each individual
organization, community ecology shifts focus to examine the relationships that the organizations
are embedded in. The ecological perspective of organizations is built on a central idea that very
few organizations in the world today can live in isolation; almost all of them exist in social
communities or ecologies, and are connected by various types of relations, interactions, or
information flows that constitute communication networks (Hannan & Freeman, 1977;
McKelvey, 1982; Monge & Poole, 2008; Powell et al., 2005). Organizational communities are
typically defined as “a spatially or functionally bounded set of populations” of organizations
(Aldrich & Ruef, 2006, p.260). These organizations are interconnected by multiple types of
relationships.
The relations among organizations within the same population can often be classified as
mainly collaborative, competitive, or neutral (Aldrich & Ruef, 2006). Such relationships are the
bases for the emergence of communities. Specifically, Aldrich and Ruef (2006) identified eight possible relations that fall into two main categories: competition and mutualism (see Table 2.1 for details).
Competition is a kind of relationship in which at least one population is harmed. Two
organizations or entities can be in full, partial, or predatory competition. Full competition is
when neither organization can benefit from this relationship, such as when two competing
companies engage in a price war and both end up operating below cost price. Partial competition
is when only one party loses in the competition. This may happen when one company lowers its
price hoping to gain more market shares from its competitor, but customers are still purchasing
from its competitor. Predatory competition is when one party wins at the expense of the other
party, such as when one company develops a successful new technology and increases its market
share while its competitor loses current customers who are now attracted by the new technology.
Mutualism is a kind of relationship in which neither party harms the other and one may even benefit from the other. Two organizations in a mutualistic relationship can be in partial mutualism, full mutualism, or symbiosis. Partial mutualism happens when
one party benefits while the other party does not lose. For example, when a new social media app
enters the market and attracts a group of users who never used any social media app before, this
new app benefits and no competitor loses its current user base. Full mutualism is when two
organizations occupying the same resource niche have a mutually beneficial relationship. For
example, two high-tech companies in the same industry establish an R&D alliance that could
benefit both parties in their internal R&D efforts. Symbiosis is a relationship where two
organizations from different resource niches have a mutually beneficial relationship. For
example, if a university and a high-tech company establish a research alliance that benefits both organizations, they have a symbiotic relationship.
In addition, two organizations can have a relationship that has no substantial impact on
each other – a neutral relationship. Or, the relationship may be dominated by one party and the
party in dominance decides the nature and outcome of this relationship – a dominance
relationship.
Table 2.1
Eight Possible Relations Between Organizational Populations (Adapted from Aldrich and Ruef, 2006)

| Type of relation | Effect of one population on the other | Definition |
|---|---|---|
| Full competition | −,− | Growth in each population detracts from growth in the other |
| Partial competition | −,0 | Relations are asymmetric, with only one having a negative effect on the other |
| Predatory competition | +,− | One population expands at the expense of the other |
| Neutrality | 0,0 | No effect on each other |
| Partial mutualism | +,0 | Relations are asymmetric, with only one population benefiting from the presence of the other |
| Full mutualism | +,+ | Two populations in overlapping niches benefit from the presence of the other |
| Symbiosis | +,+ | Two populations are in different niches and benefit from the presence of the other |
| Dominance | Effects depend on the specific outcome | A dominant population controls the flow of resources to other populations |

Note. + positive effect, − negative effect, 0 no effect
In reality, populations within a community can have complex relationships. This means
that multiple types of relationships may exist in the same community, because many different
populations are present in the same community for different purposes and they need to build
different types of linkages with each other (Monge et al., 2008). For example, a study about the
emergence of the biotechnology community over a 10-year period identified six different
populations, including new biotechnology firms, public research institutions, financial
institutions, government regulatory agencies, other biomedical companies, and pharmaceutical
companies. The types of relationships that exist among these populations include research and
development exchange, capital flows, regulation information exchange, intellectual property
information exchange, etc., (Powell et al., 2005). This study was less concerned with the possible
within population competitive relationships. Bryant and Monge (2008) studied the evolution of
the eight populations that comprised the children’s television community between 1953 – 2003.
This community consisted of educational content creators, entertainment content creators,
content programmers, toy/licensed product manufacturers, advertisers, advocacy groups,
governmental bodies, and philanthropic organizations. They described the relationships among
these populations as competitive, neutral, or positive (i.e., mutual or symbiotic).
Since ample empirical observations showed multiple relationships could be present in
organization communities simultaneously, it is natural to further analyze the sources of such
relationships. Shumate, Fulk, and Monge (2006) suggested that the emergence and evolution of
an HIV/AIDS NGO community depended on the communication linkages among the members.
They identified possible relationships that may predict the development or demise of NGO
community linkages, including whether they belong to the same cohort, linkages with IGOs, past
alliance connections, and geographic proximity. Their results showed that past alliance
connections and geographic proximity were the two most important predictors for community
linkage establishment.
The ecological perspective allows us to consider the ways in which community changes
are dependent on not only the members of the community but also the linkages they have (or
lack) within the communication network. There are three advantages provided by the
evolutionary/ecological approach for studying organizational change (Monge et al., 2008). First,
the ecological perspective emphasizes that it is important to take into account the populations or
even communities in which organizations are embedded. This shifts analytical focus from single,
individual organizations toward groups or populations of organizations. It also shifts focus from
individual characteristics possessed by single organizations to collective/aggregated
characteristics possessed by a group or population of organizations. Second, it adopts a unique
perspective that focuses on how resources are driving the change in organizations. This
perspective explains that because resources are limited, organizations or populations need to
identify and compete for their own resource niches. Just as resource niches can only sustain a
limited number of organizations, networks themselves also have limited carrying capacity
(Monge et al., 2008). Each agent in a network may have a limit on the number of links that it can
maintain. This means that organizations must carefully design strategies for linkage formation
and retention, and that overloaded networks will not last long. In the current study context, the
network relationships between a system of knowledge collaboration workers and the information
entities they created are thus not random, but a direct result of careful planning (see Forte et al.,
2009 for a detailed discussion about how Wikipedia is a self-organized community). Third,
evolutionary theory provides a generalized, dynamic theory of change (Baum & Rao, 2004) as
the previous discussion of the Price Equation amply demonstrates. Thus, rather than taking a
static perspective on organizations, evolutionary theory highlights how organizations and their
communication networks change over time. It will prove to be a useful theoretical tool for
analyzing the evolution of knowledge creation online as the participants keep adding, modifying,
editing, and maintaining the information products over a period of time.
This section provided a theoretical base for treating network relationships as important
defining characteristics for organizations and argued that relationship linkages constitute an essential
driving force for organizational change, along with individual characteristics. The next part
explains in detail how this illuminates research on quantitative models of social evolution.
2.4 What is Being Selected?
Natural selection effects arise from the fact that different groups within a population have
differential reproductive rates. A group is a sub-population where members share similarity in a
trait of interest. The task of identifying such “groups” is not easy (Frank, 2012a; Hilbert et al.,
2016). The best way of choosing an appropriate grouping trait “varies with biological context,
intellectual goal and subjective bias about what is ultimately meaningful” (Frank, 2012a, p. 230).
The literature has suggested different approaches to defining units of selection, but the best ways
to treat them are still under discussion (Lewens, 2015; Schaden & Patin, 2018).
In biological evolutionary processes, traits, characteristics, or properties of individuals
are often investigated as the defining indexes. The defining index could also be the value of a particular
phenotype, gene frequencies, and so on. Further, researchers from different fields define
“traits” from their own analytical perspective, so there can also be molecular, behavioral,
ecological, and geographic indicators (Hilbert et al., 2016). Researchers studying organizational
change have proposed a number of different ways to identify these traits. For example,
organizations can be classified based on their different organizational forms (Aldrich & Ruef,
2006; Hannan & Freeman, 1977; McKelvey, 1982), routines adopted by the organizations
(Nelson & Winter, 1982), and cultural norms and institutionalized habits (Boyd & Richerson, 2007).
Scholars studying evolutionary epistemology are not only concerned with collectives of
organizations but also collectives of human knowledge in the form of scientific research outcomes,
information products, language, etc. So, the populations of interest go beyond classic
organizational entities and could also include information products, such as images, texts,
videos, etc. These information products are not capable of reproducing themselves. Knowledge
outcomes or information products must be created by knowledge workers who are embedded in
groups, social institutions, or organizations. Thus, it is natural that an investigation using the
evolutionary epistemology perspective takes both organizations and knowledge products into
account.
2.4.1 Memetics
A classic view in cultural evolution looks for traits by examining the “content” of
information under selection. Although this concept has led to much interesting philosophical
discussion, it has received considerably less support in empirical research. The current project does not
directly apply the concept of memes, but a brief overview of the history of this once promising
notion will help us better understand other approaches that will be useful in this project. This
section goes over the definitions of memetics and some criticisms the field received, which might explain
why it did not lead to more fruitful empirical research as was once expected.
Theories and empirical studies in memetics (Aunger, 2000; Gupta et al., 2016; Heylighen
& Chielens, 2009) suggest the concept of the “meme” as the basic unit propagating and evolving in
sociocultural processes. Originally, Dawkins (1976) defined a meme as “a unit of cultural
transmission, or a unit of imitation” (p. 206). He suggested that examples of memes under this
definition include tunes, ideas, catch-phrases, clothes fashions, popular songs, ways of making
pots or of building arches (Dawkins, 1982). Later, Dawkins modified the definition of a meme as
“a unit of information residing in a brain … It has a definite structure, realized in whatever
physical medium the brain uses for storing information” (Dawkins, 1982, p. 109). With this
definition, a meme physically resides in the brain, and there are phenotypes of memes
existing in the outside world, which are the observable things such as tunes, ideas, catch-phrases,
popular songs, etc. Put another way, Dawkins first defined memes to be cultural artifacts and
later emphasized memes as neural configurations in brains.
Initially, the notion of memes seemed very appealing for its promise to scientifically
quantify memes just as we can quantify and measure genes, which could potentially advance the
study of cultural evolution. However, the idea of memes has not been free
from criticism (Chesterman, 2005). Below is a summary of some of the main criticisms.
Concept. Definitions of memes are varied and, ironically, keep evolving. Roughly
speaking, all memetics scholars agree that memes are units of cultural transfer or evolution. But
they disagree about what exactly these units are. Chesterman (2005) provided a summary list of
definitions of memes in the literature, which even includes inconsistent definitions from
Dawkins himself:
A cultural element or behavioral trait whose transmission and consequent
persistence in a population, although occurring by non-genetic means (esp. imitation), is
considered as analogous to the inheritance of a gene. (Oxford English Dictionary, 2001)
A unit of cultural transmission, or a unit of imitation. (Dawkins, 1976, p. 206)
A unit of information residing in a brain … It has a definite structure, realized in
whatever physical medium the brain uses for storing information. (Dawkins, 1982, p. 109)
A meme is a node of semantic memory and its correlates in brain activity. (Wilson,
1998, p. 149)
Chesterman (2005) commented that the varied definitions of memes cause confusion
about exactly what kinds of unit memes are meant to be: “units of culture, of information, of
memory, of the brain? All of these, or some of them?” (p. 21). Unclear definitions lead to
difficulties in empirical operationalization, which is one of the reasons why this interesting field
lacks further empirical evidence to validate its theories.
Methodology. Another stream of criticism opposes the definition of memes as proposed
by Dawkins but does not necessarily question the theoretical foundation of memetics. The
challenge of using memes as defining units in cultural evolution is that it requires researchers to
find an exact meme, just like biologists identify a gene. In empirical research, this turned out to
be very difficult or nearly impractical (Edmonds, 2002; Gatherer, 1998). If one adopts the
definition that a meme is a unit of the brain, it is equally difficult to prove that the replicated neural
configurations between two brains are identical (high transmission fidelity with reasonable
variation or error). Plus, in the long run, we need to be able to empirically prove that the
replication success rate of the “host” of the meme is positively associated with having the meme,
just like certain biological traits give organisms evolutionary advantages measured as replication
success. These challenges make empirical validation of memetics difficult, which is another
reason why the current project does not directly apply the concept of memes.
Agency. Another stream of criticism questions whether the concept of memes is actually
appropriate to be considered as the objects that evolve. Some scholars argue that memes are
“driven to self-create” (Jenkins, 2009), a view consistent with Dawkins (1976). Memes are thus
personified as “actor, agent, and doer” and have an “innate will” to be preserved for longer
and to be replicated more often (Wiggins & Bowers, 2015, p. 1895).
Contrary to this view, others point out that memes cannot really cause anything,
although they are often described as doing so. This may be a result of the biological metaphor
being applied too literally. Unlike biological organisms, cultural memes cannot replicate
themselves. It is humans who create, remix, and spread cultural products
(Wiggins & Bowers, 2015). Thus, memes are the product of social actions performed by human
agents; they do not act on their own.
This is still an ongoing discussion, with researchers drawing from memetics theory,
structuration theory, and participatory popular culture to address it from different perspectives
(Wiggins & Bowers, 2015). It is probably the metaphorical language that sometimes misleads
people to believe that memes should have agency so that they replicate themselves. In fact,
memes do not need to have agency. According to the Price equation reviewed earlier, strict
causation of set P on set P′ is not necessary. The only linkage required between the two sets of
populations in order to describe evolutionary change is that the anterior set and the posterior set
are reasonably similar. For the purpose of our current project, we do not assume that cultural
artifacts have the agency to replicate themselves. The underlying assumption is that cultural
evolution studies the non-genetic transmission of cultural elements, rather than the genetic
transmission of them.
2.4.2 Variants of Memes
All these criticisms above were serious attacks on the field of memetics, which largely
remained a highly appealing research direction but did not end up producing enough credible
studies to sustain its own transmission. In an effort to expand and develop the field of memetics,
many scholars (Cavalli-Sforza, 1981; Lumsden & Wilson, 1985; Lynch, 1998) contributed to the
discussion about how to make the “units of selection” problem in cultural evolution more
tractable, without calling the unit a “meme.” Alongside memes, many similar concepts were also
proposed by cultural evolutionists (Gatherer, 1998).
Lumsden and Wilson defined the concept of the “culturgen” as a unit of cultural selection:
“the node of semantic memory” (Lumsden & Wilson, 1985, p. 348). Lynch (1998) suggested
using the mnemon, which denotes an abstract unit of memory.
“The principle abstractions manipulated with memetics theory are memory abstractions, or
mnemons” (Lynch, 1998, Section 4). Mnemons exclude the “inanimate propagating items” or
cultural artifacts (Lynch, 1998, Section 4). Mnemons are understood to be units of memory, awareness,
and beliefs stored in human brains. This concept is not in disagreement with Dawkins’ (1982)
definition. Richerson and Boyd (1978) used a less specific concept of
“culture-type”. This definition states that natural selection works on culture-types, and positively
selects for “culturally coded information” which “produce phenotypes (the straightforward and
visible artifacts) that are more successful in passing the culture-type to the next generation”
(Richerson and Boyd, 1978, p.132). Their definition emphasizes a parallel relationship between
visible characteristics represented by cultural artifacts (phenotypes in the outside world) and
invisible internal configurations that encode cultural information. Cavalli-Sforza (1981) and
Guglielmino et al. (1995) both discussed “cultural traits,” a concept that focuses on visible cultural
elements such as artifacts or human behaviors.
In the Internet era, memes have again become a catchphrase, and the concept of memes
is now cultivating a small yet growing field about Internet memes (Wiggins & Bowers, 2015).
This field offers an internet-inspired translation of the classic ideas on meme
transmission. It is mostly focused on the viral circulation of popular culture artifacts on the
Internet, such as videos, hashtags, emojis, and gifs. Typically humorous and provocative in
nature, they are copied and spread rapidly by internet users, often with slight variations. The
studies of Internet memes are conceptually consistent with Dawkins’s (1976) definition of memes
or cultural traits, as these concepts all focus on directly observable cultural artifacts and
are less concerned with detecting the underlying neural configurations in our brains that carry
these emojis and gifs (as in Dawkins, 1982).
2.4.3 How to Quantify Cultural Elements?
There is obviously some disagreement in the above discussion about which is the best
way to categorize cultural elements in transmission, but these efforts all stem from a fundamental
search for a way to scientifically quantify basic units in cultural evolution. According to the
definition provided in the Price Equation, natural selection works based on the different values of
characteristics that each individual group possesses. Identifying culturally coded neural
configurations to be bases of selection may still be challenging, but there are already successful
attempts to identify observable characteristics of cultural elements as bases of evolution
(Blumenstock, 2008; Candelario et al., 2017; Dang & Ignat, 2016; Kane & Ransbotham, 2016;
Lewoniewski et al., 2017).
Then, the question becomes an empirical one: how do we identify the cultural elements
that undergo selection? Some scholars, primarily information systems researchers and computer
scientists, define traits or characteristics based on the observable content or structure of a target
cultural artifact (Blumenstock, 2008; Candelario et al., 2017; Dang & Ignat, 2016; Kane &
Ransbotham, 2016; Lewoniewski et al., 2017). For example, a novel can be described by many
different traits or characteristics, such as its length, number of nouns, number of verbs, number
of characters, number of continents the storyline covers, sentiment level, etc. An online gif could
be quantified by the number of colors used, distribution of colors by sizes, color contrast, topic,
level of emotional arousal, novelty, etc. The list could go on and there is obviously no exhaustive
way to quantify a cultural element.
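As a minimal illustration of this trait-quantification approach, consider the following Python sketch (a toy example constructed for this discussion, not a measurement instrument from the literature), which computes a few directly observable characteristics of a short text artifact:

    import re

    def quantify_text_traits(text):
        """Return a dictionary of simple, directly observable traits of a text.
        The traits chosen here (size, word count, lexical diversity) are
        illustrative only; any observable property could be added."""
        words = re.findall(r"[A-Za-z']+", text)
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        return {
            "char_length": len(text),                  # raw size of the artifact
            "word_count": len(words),                  # a common proxy for completeness
            "sentence_count": len(sentences),
            "lexical_diversity": len({w.lower() for w in words}) / max(len(words), 1),
        }

    print(quantify_text_traits("Memes spread. Some memes spread faster than others."))

Each trait value can then serve as a basis for partitioning a population of artifacts into subpopulations whose differential growth can be tracked over time.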
The current study adopts this approach and, instead of finding units of evolution that
reside in human brains, focuses on using observable characteristics to define populations and
subpopulations. Some successful empirical research about sociocultural evolution so far has
adopted this approach. For example, in prior research scholars have identified the following
cultural artifacts as objects that can evolve: viral videos, images or gifs spreading on the Web
(Jenkins, 2009; Wiggins & Bowers, 2015), ideas contained in textual content that spread from
one language version to another language version by means of translation (Chesterman, 2005),
organizational forms (Aldrich & Ruef, 2006; Hannan & Freeman, 1977; McKelvey, 1982),
routines adopted by the organizations (Nelson & Winter, 1982), and cultural norms and
institutionalized habits (Boyd & Richerson, 2007). These varied cultural phenomena or artifacts
are all observable in popular culture, in translated texts, or in organizations, rather than residing
in human brains as neural configurations. Specifically, in connection with Wikipedia research, most
of the extant literature has proposed useful population-defining characteristics that can be
measured in a straightforward and easily observable manner. The current project therefore follows the
prior literature that identified defining characteristics of Wikipedia articles, and these characteristics will
be discussed in detail in the next chapter.
2.5 Network-based Characteristics
Adopting community ecology theory, Monge et al. (2008) added a network-based
interpretation to the issue of how to identify traits under selection in sociocultural evolution. In
addition to the classic trait-based perspective, they suggested that there are analytical gains if we
expand our focus beyond the traits of social groups and cultural phenomena to include their
network properties as a basis for defining evolving populations. Traditionally, the analytical
focus is on individual members’ “properties rather than on the networks that link them. However,
a full understanding of the evolution of organizational communities requires insight into both
organizations and their networks” (Monge et al., 2008, p. 449).
Following this theoretical development, an empirical examination testing this idea has
shown that network metrics indeed served as important selection criteria under natural selection,
and that network metrics can actually identify stronger selection forces than
traditional trait-based approaches (Hilbert et al., 2016). Importantly, this study linked the network
structure of populations with the evolutionary dynamics of those populations and implied that
network metrics are well suited to identify which groups are fitter and which are less fit.
Until now, no research has applied the idea of examining network-based characteristics to
study the evolutionary dynamics of knowledge creation. That will be the focus of discussion in
the next chapter. This chapter so far has provided the necessary background to understand the
theoretical basis of cultural evolution, quantitative models of evolutionary dynamics, and the
different approaches of identifying the units of selection as a foundation for examining
evolutionary dynamics in an online knowledge creation community. The next chapter applies
this framework to analyze a typical knowledge collaboration system, Wikipedia, and examines
how we can use these theories to help get a more complete understanding of how people create
knowledge online, and how the knowledge created evolves in a self-organized community.
CHAPTER 3: EVOLUTIONARY DYNAMICS AND ONLINE KNOWLEDGE
CREATION
Even though Darwinian natural selection is meant to be blind and un-purposeful
(Campbell, 1965, 1974), human organizations and systems are not completely blind. According
to the evolutionary perspective of organization theory, human organization is defined as an
ecosystem designed purposefully to guide the evolution and find the most “fit” organization-
environment interaction (Lovas & Ghoshal, 2000). These strategic and purposeful structural
designs could certainly fail, but nevertheless, the intention underlying these designs is usually
aiming for fitness and long-term development. For Wikipedia, the slogan “the free
encyclopedia that anyone can edit” represents such a long-term goal and the main ambition of
the organization as articulated by its designers. It is hoped (but not guaranteed) that the ultimate
“fitness” goal is well-articulated by the strategic intent of the organization, so that resources of
variation and selection can be distributed to places guided by the strategic intent, and the whole
organization is hopefully adaptive towards the fitness goal.
If the Wikipedia platform could be considered as a dynamically evolving knowledge
creation system, then it is vital to clarify what exactly is evolving in such a system and in what
form we can observe the evolutionary processes. For traditional knowledge creation institutions
in human history, theorists have posited that they are all part of a larger, encompassing global
knowledge process. This knowledge evolution system includes different levels of selectors such
as “habit and instinct, visually-guided thought, memory-supported thought, social learning,
language, and finally science” (Kim, 2001, p.105). As discussed in chapter 1, all knowledge
created by humans was first developed as blind “beliefs” and we developed all kinds of ways to
test the validity of these beliefs. These methods for choosing the “fitter” beliefs are called
selectors (Bradie, 1986; Callebaut & Pinxten, 2012). Together, they form a pyramid of selectors
that can help humans develop and validate knowledge (Kim, 2001).
Selectors can be direct or indirect. For example, we may develop some knowledge to
predict the arrival of hurricanes, and this knowledge can be tested against nature when the next
hurricane actually arrives. This process of reality checking is a direct selector – it selects out
those beliefs that do not fit with reality and selects for those beliefs that survive the reality
check.
However, human knowledge often cannot be tested against nature directly.
Vicarious selectors, or indirect selectors, are a concept complementary to direct selectors.
Vicarious selectors function as shortcuts for direct selectors, whereby an indirect set of
selection criteria substitutes for a more direct form of selection (Allchin, 1999; Campbell, 1974).
Human knowledge creation is mostly a vicarious process, because we usually do not have the
chance, due to high costs, limited time, availability of technology, risks, etc., to directly test a
hypothesis or idea against reality, even though doing so would be more accurate than relying on
vicarious selectors.
vicarious selectors. For example, archaeologists may develop a belief about the purpose of an
Early Stone Age tool, but there is no direct way to know whether these beliefs are correct (no
direct selector). What they can do is to, say, find other relevant artifacts from the nearby site to
validate this belief, or to apply well-received knowledge from other disciplines, including
physics, chemistry, biology, to help test this belief. All these methods can serve as indirect
selectors which help us make informed guesses about the purpose of this tool, but there is no
guarantee that the current conclusion is the truth. Archaeologists may have to modify or even
reject the current conclusion in the future. In a more recent example, medical professionals
have experienced the same trial-and-error process with beliefs about the new coronavirus
that causes COVID-19. A few months into the pandemic, we are already updating many beliefs
about this virus, such as whether asymptomatic carriers can infect others and in what ways, and
whether this virus is airborne and to what extent. When scientists first investigated this unknown
virus, they had to rely largely on past knowledge of similar illnesses such as that caused by the
SARS virus. That past knowledge may or may not hold in a new situation. More empirical data
about the new virus provides different ways (selectors) to test if our hypotheses about the new
virus were correct (Moghadas et al., 2020; Nishiura et al., 2020).
Within the Wikipedia system, there are two types of selection happening: Wikipedia as a
selector of knowledge produced in external resources and Wikipedia itself as a selector system
that modifies the content on this platform in a dynamic way. First, Wikipedia as a whole is a
selector of the knowledge and information materials produced elsewhere, by citing and
referencing the original content from external resources (Wikipedia, 2019, NOR). Wikipedia
positions itself as merely a reference resource where original research and information produced
elsewhere gets collected and curated. This is reflected in a core Wikipedia content policy called
NOR – no original research. OR (original research), as used on Wikipedia, refers to “material such
as facts, allegations, and ideas for which no reliable, published sources exist” (Wikipedia, 2019,
NOR). In order for editors to demonstrate that they are not adding OR, they must cite “reliable,
published sources that are directly related to the topic of the article, and directly support the
material being presented” (Wikipedia, 2019, NOR). By collectively writing and curating this
knowledge repository online, editors are vicariously selecting the “fitter” information materials
originally published elsewhere, that they consider appropriate and meaningful to be posted on
Wikipedia.
Second, if we unpack the knowledge processes within Wikipedia, there are selectors in
the form of writing and editing. The Wikipedia platform takes information generated elsewhere as
input, and it delivers encyclopedia content as output to users. Imagine one specific
scientific finding that has been incorporated into Wikipedia; there could still be many different
ways of writing about the same piece of content. Some of the writings meet certain selection
criteria better than others. This selector is called “language” (Kim, 2001). The explicit verbal
representation of meanings is in itself a contingent and fluid process, and it is only an
approximate resemblance of the actual information (Richerson & Christiansen, 2013). Thus,
there are different characteristics of language that could lead to different success rates of these
encyclopedia items.
These types of selection happening within Wikipedia show that not all content published
on Wikipedia will have equal reproduction rates. Some of the articles possessing certain
attributes will be more likely to enjoy successful outcomes while other articles that do not
possess these desirable attributes will be less likely to receive critical resources for survival. In
the previous chapter, two major types of defining traits were mentioned: content-based and
network-based. The next sections explicate what these vicarious selectors, or traits, might be.
3.1 Content-based Characteristics
The first set of partitioning characteristics that may have some predictive value in
determining evolutionary outcomes is content-based characteristics. Two streams of research
shed light on identifying the sets of characteristics that matter most for driving evolutionary
change in Wikipedia. First, much empirical work has been done in the field of computer science,
where researchers look for specific traits (features in computer science terms) to describe
individual articles. There are numerous such measurement dimensions, both objective and
subjective, that could be used to distinguish individual articles. These measures have been shown
to be differentially associated with the success rates of Wikipedia articles (again, success rates
mean different things for different researchers). For example, Blumenstock (2008) and
Kane and Ransbotham (2016) showed that article length or edit length is positively
associated with the quality of the information an article contains. Kräenbring et al. (2014) and Candelario et
al. (2017) both suggested that accuracy and completeness are two important dimensions of
medication-related content on Wikipedia. More complete content and more accurate language
representations are positive indicators of the quality of medication-related information.
Lewoniewski et al. (2017) suggested the verifiability of references across language versions to be
a positive quality dimension of Wikipedia articles. Other scholars used the implicitly measured
life cycle of edits and texts (i.e., edit longevity) to represent the quality of edits, where longer
edit longevity means higher quality (Adler & De Alfaro, 2007; Halfaker et al., 2009; Qin &
Cunningham, 2012).
Second, another useful stream of research in the field of library science and information
science has suggested a set of selection criteria for evaluating text-based information products.
These criteria constitute the vicarious selectors of information within the Wikipedia system, as
explained previously. There are a few lists compiled for such criteria. In developing the
assessment guideline for typical reference texts used in libraries, Wong (2011, p. 530) proposed
the following eight dimensions when evaluating encyclopedia quality: (a) scope (that an article is
comprehensive and complete in delivering the necessary information), (b) format (appropriate
form of presentation in terms of tables, pictures, formats, etc.), (c) uniqueness (that it contains
features that set it apart from other encyclopedias), (d) authority (high reputation and reliability),
(e) accuracy (that information is accurate), (f) objectivity (that an article keeps an objective
viewpoint), (g) currency (that the content is up-to-date), (h) accessibility in indexing and
arrangement (that the content is organized in a way that has structural and semantic consistency).
Castelfranchi (2001) proposed selection criteria such as credibility, importance,
and plausibility (that content is true or reasonable). Another group of scholars studied the
transmission of urban legends and proposed two types of criteria: informational selection
criterion (that the information is true or contains moral education value) and emotional selection
criterion (that the content is able to evoke strong emotions like anger, fear, or disgust) (Heath et
al., 2001). Communicability, the extent to which the contents “are likely to be expressed in
interpersonal discourse” (Schaller et al., 2002, p. 863) has also been proposed as a criterion of
information transmission.
Specifically, for evaluating the quality of Wikipedia articles, some scholars borrowed
from these general lists of selection criteria and developed lists of criteria specifically designed
for this context. For example, Lewandowski and Spree (2011) derived a list of 14 main criteria
based on a review of previous literature, including: labeling/lemmatization, scope,
comprehensiveness, size, accuracy, recency, clarity and readability, writing style, viewpoint and
objectivity, authority, bibliographies, accessibility, additional material, Wikipedia ranking
(Featured Article or Good Article). Beyond that, Wikipedia also provided a list of its own to
guide users in conducting peer review and evaluating article quality. The model that Wikipedia
uses includes eight main criteria: well-written, comprehensive, well-researched, neutral,
stable, correct and consistent Wikipedia style, media content with proper copyright status,
appropriate length (Wikipedia, 2019, featured article criteria).
Reviewing these selection criteria is useful because it is expected that all other things
being equal, a piece of information that scores better on one of these criteria is predicted to
become more numerous than information that scores worse. The two streams of researchers may
not use the same terminology when referring to these “characteristics” or “selection criteria;”
however, a comparison of their works shows interesting similarities. For example, a commonly
used characteristic by computer scientists --“length of an edit” -- corresponds well with a
selection criterion of “completeness” by information science scholars. The more complete piece
of content contains more information, and a proxy for measuring completeness could be “length
of an edit.” According to the above review of selection criteria in the extant literature, it is clear
that many scholars agree that indicators exist for predicting the differential success rates
of information products, though different researchers may choose to focus on different sets of criteria.
Wikipedia is an increasingly important venue of collective knowledge creation where
Internet users worldwide are accessing, reading, editing and using its content. Many researchers
have asked the question of how Wikipedia succeeds in developing high-quality content (Arazy &
Nov, 2010; Dang & Ignat, 2016; Kane & Ransbotham, 2016; Li et al., 2015; Ren & Yan, 2017).
From the perspective of evolutionary theory, the question is to test if the evolutionary dynamics
observed in the Wikipedia articles population is related with these different traits or selection
criteria suggested in the literature. The apparent success rate of different types of subpopulations
could be examined in relation to the degree to which the subpopulations of interest fulfill the
proposed selection criteria. Understanding the linkage between selection criteria and success of
knowledge creation will give us a better idea as to which sets of criteria may be most important.
This assessment could then be used to guide platform managers or staff in more efficient
resource distribution. Although there have been abundant empirical studies that measure a set of
article-level characteristics and explore their relationship with article-level success, the current
project differs from these studies because it will analyze how different selection criteria about the
article content will be related to the selection forces imposed on these articles. The input variable
of interest is not a specific value of a characteristic but a set of population partitioning methods
(i.e., selection criteria) that play different roles in determining the natural selection force
imposed on the population. The outcome variable of interest is not the success of individual
articles, but average population fitness which is derived from calculating each subpopulation’s
fitness level under different selection criteria.
The current study will consider some previously identified characteristics as selection
criteria, to explore whether and how they are driving evolutionary dynamics of a Wikipedia
article population. These characteristics were mainly based on a list suggested by Dang and
Ignat (2016), with some modifications. Their proposed list has several advantages, including (a)
it reaches a balance between number of characteristics and prediction accuracy, (b) the
characteristics are highly interpretable and meaningful, and (c) this list has been validated
multiple times by both industry practitioners (Halfaker & Warncke-Wang, 2019) and other
researchers (Shen et al., 2017).
Next, these content-based characteristics are categorized into five groups based on their
theoretical relatedness, and each group will be discussed in detail.
(1) Scope of content is the extent to which an article is comprehensive and complete in
delivering the necessary information (Wong, 2011). It matters for the success of encyclopedic
articles because this trait fulfills an important purpose for the readers – to look for a
comprehensive information source about a topic area. Each Wikipedia article should be a concise
yet comprehensive summary of the relevant key information about the topic, and a complete
presentation should help people decide if they want to continue exploring some aspects of the
article with more depth. The more comprehensive and complete an article is, the more likely that
a reader will get informational benefits by reading this article, and thus be more likely to spread
or use this article in the future.
Empirically, the scope of content is also one of the most frequently studied characteristics
of Wikipedia articles that can predict article success. Multimedia content is available in
Wikipedia articles, including both textual content and visual content (i.e., images). In general,
longer textual content indicates more comprehensive and complete information, or at least it
reflects more efforts from the editors during the writing process. Thus, it is likely that longer
content can provide more informational utility for consumers. The literature has confirmed that
content length positively predicts article quality (Blumenstock, 2008). In addition to textual
content, visual content is an important element in Wikipedia articles too. Images provide
information that is easily accessible and straightforward, which often complements the textual
content. Using more images in an article could provide more comprehensive, accurate and
complete information to readers; it can also make information more concise, because some images
serve as summaries of a large amount of text. Thus, the scope of both textual and visual
content should positively predict the fitness of an article.
(2) Using references is necessary for the success of an article, because it serves to
validate the accuracy of an article. Wong (2011) suggested that accuracy of information is an
important dimension of information quality evaluation for encyclopedia entries. This trait is
especially important for an online information source that is collectively contributed by a crowd
of online users instead of being written by a small group of professional editors and researchers,
which is what typically happens at traditional encyclopedias. Readers must feel that the content
is well-researched, valid, and accurate so that they want to keep coming back to use this source.
Having rich references cited in an article indicates that the content may be sourced from
authoritative publishing venues. As required by the Wikipedia editing policy, the references
should come from reliable and published sources, and there are specific requirements about what
constitutes these sources (“Wikipedia,” 2019). For references that do not satisfy these guidelines,
others will modify them to improve the quality of references. By linking external references, the
writers could suggest that they have responsibly checked the references for accuracy, and these
resources could also be further validated by information consumers if they wish. This provides a
promise of information accuracy and also the possibility for readers to expand the scope of the
content if they want to explore it in greater depth. Empirical research has also demonstrated that
using external references is a significant predictor of article quality (Dang & Ignat, 2016; Shen et
al., 2017).
(3) Formatting is also a critical part of Wikipedia knowledge creation and has been
regarded as an important trait for evaluating article quality (Wong, 2011). Good formatting makes
sure that the article is presented with appropriate tables, figures, headings, etc. In order to
support consistent and clean formatting, the Wikipedia platform has designed their own mark-up
language, so that readers will be able to easily locate relevant information based on the
formatting elements that are consistently used throughout the platform.
Wikipedia mark-up language provides a series of formatting elements, including info
boxes, templates, pictures or images, etc. It has been suggested that using correct and rich
formatting elements will help improve the overall writing style of Wikipedia articles
(Lewandowski and Spree, 2011). Articles that involve one or more formatting elements should
indicate several things about the article. First, the content creators of this article are quite
familiar with the usage of mark-up language, so that they know how to insert and use these
different elements. Second, they have devoted substantial efforts in writing the content which
goes beyond plain text writing. The editors spent time in formatting the content in a structured
way, so that the written presentation would be accessible and clear. Third, readers would be able
to more easily locate relevant information using these formatting elements as guidance. For
example, they would be able to know that an info box provides a summary of key information,
and in some situations, readers could just skim through the info box to see if it contains the
specific information they are looking for and skip the rest of the article. In this sense, using more
formatting elements helps increase the information utility of Wikipedia content.
(4) Indexing accessibility is a concept that describes the extent to which the content has
been indexed or arranged in a consistent and accessible manner (Wong, 2011). This corresponds
to what Lewandowski and Spree (2011) termed labeling or lemmatization of articles. In
Wikipedia, the indexing or labeling of articles can be done in a way similar to how traditional
libraries arrange and index books by the broader topic areas that these books belong to.
Wikipedia uses “categories” to identify the topic areas to which an article belongs. At the top
level, for example, there are categories like culture and the arts and geography and places. Then
culture and the arts can be further categorized into performing arts, visual arts, games and toys,
etc. Performing arts can also be categorized into sub-categories like dance, film, music, etc. The
list goes on. Editors can tag each article with one or more categories associated with its
content. Readers will be able to access (i.e., click on) the categories shown at the end of each
article to explore other “neighboring” articles that are also under the same category tag. As such,
readers can obtain an experience similar to that of browsing a book shelf in a physical library,
where books associated with a similar category are arranged together. Thus, the correct usage of
labeling or indexing function in Wikipedia articles can be vital for readers to retrieve relevant
information, especially when they do not clearly know the key words they should be looking for.
To some extent, indexing articles based on their categories is an important part of Wikipedia
knowledge organization, just like traditional libraries must carefully consider the system they use
for indexing books. The literature has also shown that usage of categories in an article is a
significant predictor of article quality (Dang & Ignat, 2016; Shen et al., 2017).
(5) Clarity and readability is suggested to be an important selection criterion of
Wikipedia articles, according to Lewandowski and Spree (2011). This means that higher quality
articles should use clear and readable language. Clearly written and easily understandable
language presentation of an article can help readers understand the content of it, and thus obtain
higher information utility from the encyclopedia entry. If readers feel that they have found the
information they need, it would be easier for them to remember, use and spread the content of
this article in the future. A clear and readable writing style could thus lead to higher fitness for
the articles that possess these traits.
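To make these five groups of content-based characteristics concrete, the sketch below extracts rough proxies for each group from raw wikitext. All regular-expression patterns are deliberate simplifications invented for illustration (real measurement pipelines parse wikitext much more carefully), and the readability measure is the standard Flesch reading-ease formula with a crude vowel-group syllable heuristic:

    import re

    def extract_content_features(wikitext):
        """Rough, illustrative proxies for the five content-based characteristic
        groups discussed above; not a production-grade wikitext parser."""
        # Strip references, templates, and link brackets to approximate plain text.
        plain = re.sub(r"<ref[^>]*?/>|<ref[^>]*>.*?</ref>", "", wikitext, flags=re.S)
        plain = re.sub(r"\{\{[^{}]*\}\}|\[\[|\]\]", "", plain)
        words = re.findall(r"[A-Za-z']+", plain)
        sents = [s for s in re.split(r"[.!?]+", plain) if s.strip()]

        def syllables(word):  # crude heuristic: count vowel groups, minimum one
            return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

        n_words, n_sents = max(len(words), 1), max(len(sents), 1)
        flesch = (206.835 - 1.015 * (n_words / n_sents)
                  - 84.6 * (sum(map(syllables, words)) / n_words))
        return {
            # (1) Scope of content: textual length and image count
            "text_length": len(plain),
            "image_count": len(re.findall(r"\[\[(?:File|Image):", wikitext)),
            # (2) References
            "ref_count": len(re.findall(r"<ref", wikitext)),
            # (3) Formatting: headings and template/infobox usage
            "heading_count": len(re.findall(r"^==+.*?==+", wikitext, flags=re.M)),
            "template_count": len(re.findall(r"\{\{", wikitext)),
            # (4) Indexing accessibility: category tags
            "category_count": len(re.findall(r"\[\[Category:", wikitext)),
            # (5) Clarity and readability: Flesch reading ease (higher = easier)
            "flesch_reading_ease": round(flesch, 1),
        }

In the analyses that follow, values like these would serve not as direct predictors of individual article success but as criteria for partitioning the article population into subpopulations.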
3.2 Network-based Characteristics
The network approach proposes another way to find selection criteria of knowledge
creation in the Wikipedia system. This approach not only explores the content of each individual
article, but also examines how each individual article is positioned relative to other articles
within the population. Generally, it is assumed that there will be differential success rates
associated with articles occupying different network positions. The network positions of these
articles can be measured in many different ways. Hilbert et al. (2016) tested the idea that
network-based characteristics are also important predictors of a population’s growth rate, above
and beyond traits about the entities themselves. Their hypothesis was generally supported with
diverse data sets including YouTube videos, entrepreneurs receiving online crowdfunding at
Kiva.org, and international trading relations. The growth rate or fitness of each population can be
traced back to not just the basic traits, but also the network positions of these entities of interest.
Table 2.2 shows the list of network-based traits used in the Hilbert et al. (2016) study.
Table 2.2
List of Network Metrics Used in Hilbert et al. (2016) Study
In addition to evolutionary theorists, computer scientists also arrived at similar
conclusions that network-based characteristics are important predictors in studying the dynamics
of collective knowledge creation. Among them, some researchers measured social network
metrics more directly while others derived network metrics indirectly. Qin et al. (2015)
constructed editor-editor communication networks based on WikiProject talk pages. They
hypothesized that network structures influence project effectiveness and examined the
hypotheses using a longitudinal dataset of 362 WikiProjects. The findings suggest that an
intermediate level of cohesion improves effectiveness for a WikiProject. Ransbotham et al.
(2012) used a locally weighted degree centrality and closeness centrality of an article in
predicting viewership of Wikipedia articles. (A locally weighted degree centrality is defined as a
focal article’s number of connections to other articles made by shared editors, weighted by the
number of contributions each editor made; see p. 393 of that article for a detailed explanation, as
the authors created a slightly different measure of degree centrality than its classic
operationalization.) In the editor-article affiliation network, both the
locally weighted degree centrality and closeness centrality were positive predictors of number of
views of an article. Kane and Ransbotham (2016) again used two network measures to predict
information quality of Wikipedia articles: weighted local degree centrality of a focal article and
eigenvector centrality of a focal article. Both degree centrality and eigenvector centrality were
positive and significant predictors of an article’s quality score (see detailed explanation on pp.
428–429).
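As an illustration of how such centrality measures can be computed, the sketch below builds a toy article-article network and derives weighted degree and eigenvector centrality using the networkx library. This is a generic computation for illustration only, not a reimplementation of the locally weighted measure constructed by Kane and Ransbotham (2016):

    import networkx as nx

    G = nx.Graph()
    # Edges represent ties between articles; weights are toy values standing in
    # for the number of shared editors connecting each pair of articles.
    G.add_weighted_edges_from([
        ("Article_A", "Article_B", 3),
        ("Article_A", "Article_C", 1),
        ("Article_B", "Article_C", 2),
        ("Article_C", "Article_D", 1),
    ])

    # Weighted degree: sum of tie weights incident to each article.
    weighted_degree = dict(G.degree(weight="weight"))
    # Eigenvector centrality: articles tied to well-connected articles score higher.
    eigen = nx.eigenvector_centrality(G, weight="weight", max_iter=1000)

    for node in G.nodes:
        print(node, weighted_degree[node], round(eigen[node], 3))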
Some other scholars also explored how to indirectly mine the connection between editors
and their content contributions to predict Wikipedia content quality. de La Robertie et al. (2015)
designed algorithms for calculating a Quality index for each article, defined over the qualities of
its individual pieces of content, and an Authority index for each author, defined over the
qualities of one’s individual pieces of contributions to multiple articles. Their empirical work
confirmed the intuition that good articles are likely to be written by good editors, and conversely,
good editors are more authoritative and reliable if they get involved with good articles. The
quality of editors and writers can be partly inferred from their network positions in co-edit
behavior networks (a co-edit behavior tie exists when two editors have worked on the same
piece). Suzuki (2015) proposed a survival-ratio based quality index that calculates “approval” by
peer editors in the form of unchanged texts. The idea is that if an editor leaves a text intact
without editing, it means the editor approves the text. Only changes indicate disapproval.
Therefore, if a text survives multiple editors’ reviews, it is considered that the text is approved by
multiple editors; thus, the quality of the text should be high. Adler and de Alfaro (2007) proposed a
content-driven reputation system using this logic. If editor A created some content first, and then
editor B subsequently revised the same article, by preserving it, B provided a vote of confidence
in these edits and also in author A. A reputation system will increase the reputation of A
depending on the number of preserved edits and on the reputation of B, in the sense that “votes”
from a higher-reputation editor weigh more than those from a lower-reputation editor. Even
though these researchers did not explicitly use traditional social network metrics, they
nevertheless emphasized the importance of analyzing the social graphs generated by Wikipedia
editing behaviors, such as co-editing relationship between two editors, and the relationship
between editors and content they worked on.
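A heavily simplified sketch of this “vote of confidence” logic is shown below. It is not Adler and de Alfaro’s (2007) actual algorithm, which models text survival far more carefully; the update rule and the 0.1 learning rate here are inventions for illustration only:

    def update_reputation(rep, author, reviewer, preserved_fraction, lr=0.1):
        """Toy content-driven reputation update: if a reviewer preserves an
        author's text, the author's reputation rises in proportion to how much
        text survived and to the reviewer's own reputation."""
        vote = preserved_fraction * rep.get(reviewer, 0.0)  # reputable votes weigh more
        rep[author] = rep.get(author, 0.0) + lr * vote

    reputation = {"A": 0.5, "B": 0.8}
    # Editor B revises A's article and preserves 90% of A's text.
    update_reputation(reputation, author="A", reviewer="B", preserved_fraction=0.9)
    print(reputation)  # A's reputation rises because a reputable editor kept A's edits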
Taken together, the extant literature consistently showed the significance of considering
network-based characteristics in future research. There are at least four types of networks
constructed in Wikipedia collaboration literature and they all meaningfully represent certain
aspects of the Wikipedia community. They are (a) communication networks as in Qin et al.
(2015) that are composed of direct conversations among editors; (b) article-editor two-mode
affiliation networks as in Ransbotham et al. (2012) and Kane and Ransbotham (2016); (c) article-
article networks with edges being hyperlinks between articles; and (d) article-article networks
with edges being a co-editorship connection, meaning that one or more editors have worked on
both articles. Note that network (d) can be derived from network (b) through a one-mode
projection of the two-mode affiliation network, as sketched below.
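The derivation of network (d) from network (b) is a standard one-mode projection of a bipartite graph. A sketch using networkx follows, with toy article and editor names chosen for illustration:

    import networkx as nx
    from networkx.algorithms import bipartite

    B = nx.Graph()
    articles = ["Article_1", "Article_2", "Article_3"]
    editors = ["editor_x", "editor_y"]
    B.add_nodes_from(articles, bipartite=0)
    B.add_nodes_from(editors, bipartite=1)
    # An edge in the two-mode network means the editor has worked on the article.
    B.add_edges_from([("Article_1", "editor_x"), ("Article_2", "editor_x"),
                      ("Article_2", "editor_y"), ("Article_3", "editor_y")])

    # Weighted projection: edge weights count the editors two articles share.
    coedit = bipartite.weighted_projected_graph(B, articles)
    print(list(coedit.edges(data=True)))
    # [('Article_1', 'Article_2', {'weight': 1}), ('Article_2', 'Article_3', {'weight': 1})]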
The analysis of the current project considers network (d), the article-article co-editorship
network. This type of network is chosen because the focus of the current project is primarily on
the relative network positions of articles embedded within a population of articles, while the
information flows that meaningfully connect these articles are made by the human editors who
created and edited these articles. This network originates from a bipartite article-editor network
that captures how human actors (editors) can potentially bring in information and resources for
the purpose of knowledge creation. Thus, compared to other types of networks such as (a) and
(c), it best represents our theoretical interests in studying the link between network configuration
and information flow, and eventually knowledge creation outcomes. Network type (a) is made up
of only editors and network type (c) is made up of only artifacts; thus, neither can link
editors’ actions with the artifacts they created. In addition, the article-article network is a
commonly studied network type in prior literature about knowledge creation systems (Qin et al.,
2015; Kane and Ransbotham, 2016). Many scholars have emphasized that understanding articles
as artifacts co-produced by human actors may shed light on research about knowledge creation
systems.
3.3 Comparing Content-based vs. Network-based Characteristics
Now that both content-based and network-based characteristics have been introduced as
likely correlates of the evolutionary dynamics of a Wikipedia article population, it
is worth testing which type of characteristic is relatively more impactful in predicting
evolutionary success. Characteristics are used to discriminate subpopulations from each other.
Among the different characteristics or the different ways of identifying subpopulations, some are
more powerful in identifying the selection pressure than others. For example, if we compare a
content-based characteristic “content length” with a network-based characteristic “betweenness
centrality”, we would partition the population of articles in two different ways – by
the length of content or by the centrality metric. Each way of partitioning the population leads to
a different natural selection pressure score based on the Price Equation. A higher natural
selection pressure score for a partitioning method indicates that the method more effectively
differentiates advantageous members from less
advantageous ones. If the result shows that content length is a characteristic that attracts more
natural selection pressure than a network-based characteristic, it means the network metric
cannot as effectively discriminate advantaged members from less advantaged members. Thus,
from an individual article’s perspective, it is more useful to have longer length than a centralized
position in the network.
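To fix ideas, the selection component of the Price equation reviewed in Chapter 2 can be written, in simplified form, as the covariance between a subpopulation’s characteristic value and its fitness, scaled by mean fitness. The sketch below computes this term for two hypothetical partitionings of the same article population; all subpopulation fitness and trait values are invented for illustration:

    def selection_pressure(z, w):
        """Selection term of the Price equation, in simplified form:
        NS = Cov(w, z) / mean(w), computed over subpopulations defined
        by one partitioning characteristic z with fitness values w."""
        n = len(z)
        z_bar, w_bar = sum(z) / n, sum(w) / n
        cov = sum((zi - z_bar) * (wi - w_bar) for zi, wi in zip(z, w)) / n
        return cov / w_bar

    # Partitioning 1: subpopulations by content length (short=0, medium=1, long=2).
    ns_content = selection_pressure(z=[0, 1, 2], w=[0.8, 1.0, 1.4])
    # Partitioning 2: subpopulations by betweenness centrality tercile.
    ns_network = selection_pressure(z=[0, 1, 2], w=[0.7, 1.0, 1.6])
    print(ns_content, ns_network)  # the larger NS marks the stronger selection criterion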
By analyzing the evolutionary change brought by each one of the characteristics, we will
be able to comparatively study which one or which set of characteristics induces stronger
selection pressure. The selection pressure is imposed by the users and editors of Wikipedia, who
search for, edit, and spread the pieces that are good fits for their information consumption needs.
In other words, selection is the outcome of the interaction of characteristics with their local
environment (Rose & Rose, 2000). More specifically, the local environment is made up of
editors and users who have information consumption demands that are waiting to be fulfilled by
these articles.
Traditionally, the content-based characteristics received much more scholarly attention as
those are more easily observable than network-based characteristics. It is intuitive that the more
complete, accurate, and accessible articles will be more likely to achieve reproductive success in the
future. Recent theory developments have suggested that network-based characteristics should be
an important part of our consideration in terms of what is driving the evolution (Monge et al.,
2008). Hilbert et al. (2016) not only empirically confirmed the idea that network-driven
evolution is important but also compared the relative influence of content-based vs.
network-based characteristics. Specifically, they tested eight empirical networked populations
evolving over time, which include hyperlink networks of YouTube videos, organizational
networks from the microcredit crowdsourcing platform Kiva, and the international trade network
among 118 countries. Network metrics as selection criteria were found to identify stronger
evolutionary natural selection than content-based population partitioning methods in most of the
empirical contexts, except for only one network – the PBS documentary videos network on
YouTube. For this exception, the content-based characteristics were found to be more powerful
in identifying natural selection than the network-based characteristics. This study provided
indicative findings showing that it is important to not only consider who these members are, but
also to consider their relevant interdependent relationships as members of groups. In economic
terms, this could be expressed as an intuition that some countries may enjoy better economic
growth not only because of their basic characteristics like the number of college students or the
number of inhabitants, but also because of how they connect with other countries (Hilbert et al.,
2016). Similarly, in terms of the online community, a YouTube video may become viral not only
because of the content of the video or the author of the video, but also because of how this
particular video is related to other videos or is recommended along with other videos. This
notion suggested an alternative approach that uses network metrics derived from the relative
network positions of members that evolve jointly. In other words, selection acts not only on who
members are (characteristics of individual members), but also who they are connected with
(network positions derived from interdependent relationships) (Hilbert et al., 2016).
Selecting eight different empirical contexts for this study reflects an effort to conduct an
exploratory analysis whose findings would hopefully generalize to a wide range of contexts. This
decision made sense because, in their study, there was not enough theoretical basis to conclude
whether network-based or content-based characteristics would necessarily be the more influential type.
Instead of suggesting hypotheses before empirical testing, they framed the problem as an open-
ended question and explored eight different contexts to see whether there was a consistent finding
across the board.
Knowledge collaboration networks were not directly examined in their study. Applying
the initial findings from Hilbert et al. (2016) to this context may provide further evidence about
this new development in evolutionary theory and may also provide new insights for research on
knowledge collaboration communities. As introduced earlier, the Wikipedia community may
indeed exemplify unique characteristics that are different from those datasets explored in the
research literature, and there are some (unknown) underlying reasons that might make the
network-driven evolution less extreme than in prior situations. Wikipedia is, by definition, a
knowledge creation community and the “products” of the editors’ co-creation activities are
extremely important. Thus, it is reasonable to argue that the actual content of these informational
products is quite important for the readers and editors. The knowledge artifacts contain
information that is itself important in the evolutionary trajectories and such information is not
overpowered by the information contained in network structures. The datasets used by Hilbert et
al. (2016) include two YouTube video networks (Democracy Now hyperlink network and
NOVA PBS video hyperlink network) which share some similarities with the Wikipedia
community, because they are all online information-sharing communities. One of them (the NOVA PBS network) did not support their general claim that network-driven evolution leads to larger
magnitudes of evolutionary forces. There is no further evidence to conclude whether this is a coincidence or a signal of some underlying property of online informational communities in general. In summary, the current study highlights the importance of considering network-based characteristics because they are at least as important as content-based characteristics.
Thus, the current project aims to replicate a general finding reported in Hilbert et al.
(2016) in the context of knowledge creation systems:
H1: In a knowledge creation system, network-based characteristics identify stronger
selection pressures than content-based characteristics.
The force of selection pressure brought by each different way of population partitioning
will be calculated by the Price equation, as discussed in Chapter 2. The equation measures a
population’s average level of natural selection due to the varied characteristic values possessed
by each individual member. Each different way of partitioning the population leads to a different
natural selection value (NS value). H1 will be tested by comparing the NS values obtained from
the content-based characteristics with the NS values obtained from the network-based
characteristics. The detailed metrics used to obtain these NS values will be introduced in the
Method section. A permutation test will allow for such a comparison between the two sets of NS
values (Pesarin, 2001). The null hypothesis of H1 can be expressed as
H1null: NS_content-based = NS_network-based
Note that H1 aims to arrive at a general conclusion about which group of NS values is
greater than the other group, and this test does not test each possible pair of combinations. The
details of the permutation test and further comparisons for each possible pair of combinations
will be introduced in the Method section.
3.4 Comparing Different Types of Network-based Characteristics
This section focuses on the network-based characteristics and tests whether each
characteristic has different abilities in driving the evolution outcome. It addresses an issue that is
yet to be explored in Hilbert et al. (2016), where they called for more research to study “different
kinds of network metrics” that seem particularly apt at identifying large evolutionary forces
(p.45). In their study, they treated different kinds of network metrics as belonging to the broader
category of “network metrics”, rather than considering whether each different kind of metric also identifies different levels of selection pressure. This study proposes H1, which aims
to replicate the results about the difference between content-based and network-based
characteristics. H2 now tries to further unpack which specific network-based characteristics are
more influential in driving the evolution. Three important types of network structures – network
embeddedness, network connectivity, and network redundancy – will be discussed in detail
below. They are commonly used network structural constructs in prior research about online
knowledge systems (Qin et al., 2015; Kane and Ransbotham, 2016).
Network embeddedness describes the role of social capital in production networks
(Grewal et al., 2006; Ransbotham et al., 2012). Social capital is defined as “the sum of the actual
and potential resources embedded within, available through, and derived from the network of
relationships possessed by an individual or social unit” (Nahapiet and Ghoshal, 1998, p.243). In
the context of knowledge creation networks, researchers have linked the concept of social capital
with the structural characteristic of network embeddedness (Ransbotham et al., 2012). They refer
to network embeddedness as the extent to which a particular piece of content is connected to
other pieces of content “through the network of content creators.” In Wikipedia networks,
network embeddedness could indicate how well a focal article is connected to other articles in
the population through the editors that work on these articles. Higher embeddedness means the
article may be holding a key position because many other articles are connected with this article,
or because this article is often on the connection path among other articles, or because the article
is connected with other important articles (just like having important friends makes someone an
important person). For a collaborative knowledge production project like Wikipedia, the primary
resources are information and knowledge, and social capital enhances the quality of such
resources by facilitating the combination and exchange of information, two important processes
for building knowledge and intellectual products (Nahapiet and Ghoshal, 1998; Ransbotham et
al., 2012). Combination and exchange of information happens when editors can “transfer” the
knowledge they learned from one piece of an article to another (Ransbotham et al., 2012, p.390).
So, articles with higher embeddedness are relatively more important in the network, receive
more attention and resources from the editors, and can potentially receive better information
resources as these editors bring their learning and experiences accumulated elsewhere to
contribute to the focal article.
Connectivity of an article’s local environment is another important type of characteristic.
A highly connected or clustered local neighborhood means that the neighbors are themselves well-connected, so that information can flow smoothly and efficiently in this neighborhood without a few central actors dominating or controlling the flow of resources. This is often a
structural signal of highly efficient information exchange patterns in this local neighborhood
(Lazer & Friedman, 2007). The focal article will certainly derive advantages by being part of this
highly connected neighborhood, and it can also be a well-connected information exchange hub that serves the local network by closing triangles or by closing cliques (a clique is an
everyone-connects-with-everyone network or part of a network). Having a highly connected
local environment or being in a position that facilitates connectivity in a local environment can
bring more resources to the focal article, and thus make it more likely that the article will be useful and provide more information utility to the readers.
Redundancy of an article’s local environment can be another important factor that
impacts how much new and unique information that article has access to. If embeddedness is
describing how much information resources are available to a focal article, then redundancy of
an article’s local environment determines how much of those resources are truly unique and can
contribute meaningful new content to the article. Wang (2011) emphasized that the value of
encyclopedia entries partly hinges on the extent to which they can provide unique information to
readers. Imagine an article that occupies a highly embedded position in a highly redundant local neighborhood: the amount of total information available to the article may be high, but the amount of new and unique information would still be low. In this sense, it matters that an article
has access to diverse and rich information resources in a local neighborhood so that the content
presented is truly unique and useful for the readers. This means that a non-redundant local
neighborhood could help an article to become more fit.
Based on the above review, all three types of network-based characteristics have received
theoretical support about the roles they may play in influencing the development processes of
knowledge products. In order to determine which specific type is the most influential in driving
the natural selection pressure in this context, we first test the global hypothesis of whether there is at least one type that is significantly different from the other types. Thus,
H2: In a knowledge creation system, network embeddedness, network connectivity, and
network redundancy identify different levels of selection pressure.
Formally, the null hypothesis of H2 can be expressed as
H2null: NS_embeddedness = NS_connectivity = NS_redundancy
Testing for H2 requires special consideration because there will be multiple pairwise
comparisons embedded in H2. If the testing of H2 fails to reject the null hypothesis, finding that all three types of network-based characteristics are equal in determining the natural selection pressure, then we will not proceed to the following subset hypotheses. If the testing of H2 rejects
the null hypothesis and shows that at least one group of network-based characteristic is
significantly different from other groups, we can then continue to ask, which group is the most
influential one? All three possible pairwise comparisons will be explored.
H2(a): In a knowledge creation system, network embeddedness identifies selection
pressure that is significantly different from network connectivity.
H2(b): In a knowledge creation system, network connectivity identifies selection pressure
that is significantly different from network redundancy.
H2(c): In a knowledge creation system, network redundancy identifies selection pressure
that is significantly different from network embeddedness.
Formally, each null hypothesis can be denoted as
H2(a)null: NS_embeddedness = NS_connectivity
H2(b)null: NS_connectivity = NS_redundancy
H2(c)null: NS_redundancy = NS_embeddedness
Together, the univariate hypotheses of H2(a-c) constitute the three pairwise comparisons
that can help identify which pair is contributing most to the significance observed in H2. If H2(a)
is supported, then network embeddedness and connectivity do identify significantly different
NS values. The same can be said about H2(b) and H2(c). Note that this procedure does not
suggest a monotonically ordered relationship among the three sets of NS values, such as NS_embeddedness > NS_connectivity > NS_redundancy, because there is no theoretical basis to make this
conclusion. However, the ordering, if any, will be an interesting issue to be explored in the post-
hoc analysis. The testing of the individual hypothesis will rely on permutation tests (Ellis et al.,
2017; Pesarin, 2001; Wheeler et al., 2016), similar to H1.
3.5 The Effects of Network-based Characteristics on Content Exploration
The above-mentioned hypotheses were based on an implicit assumption that network
configurations represent some aspects of information flow patterns in a network. This is a long-
held but rarely-tested assumption. Some research established correlational relationships between
network configurations and knowledge production outcomes. For example, Kane and Ransbotham (2016) used article-editor network embeddedness to predict the quality of content
creation using a cross-sectional dataset. No study yet has directly examined the role played by
network configuration as an information flow process and it is not clear whether a causal link
exists between network configurations and content creation. If we generally assume that network
structure matters because it facilitates or prohibits information flow in certain ways, there should
be directly observable changes in terms of knowledge production content. If we can confirm that the level of content exploration – the extent to which the “realized” content development trajectory deviates from an “expected” baseline trajectory – varies along with variations in the network structure, then we might conclude that indeed, network structure matters because of its influence
on how information gets exchanged and recombined during the content production process.
There is ample research about network configuration, information flow, and systematic
performance. Exploration is a classic strategy often adopted in knowledge creation and in
innovation processes in large scale systems (Lazer & Friedman, 2007; March, 1991; Wilden et
al., 2018). Exploration of resources or solutions indicates that there is high diversity and rich
possibilities in terms of the final outcome. When exploration is high, it means that content
creators explore many different options before converging onto the final one. When exploration
is low, it means that content creators do not spend much time trying out different options and
quickly adopt one dominant option. Network configuration has been commonly linked with the
underlying mechanism of information exploration, which then leads to variance in overall
systematic performance. Lazer and Friedman (2007) tried to unpack the relationship between
network configuration and information exploration as a strategy by conducting a computer
simulation study. This study models how different communication network structures impact
information sharing processes among the actors in a system and then evaluates system-level
performance in both short-term and long-term time frames. Their simulation showed that an
inefficient (poorly connected) network maintains diversity of information in the system and is
thus better for exploration than an efficient (well-connected) network. Still, more empirical
research using real-world data is required to further validate their observation and to establish
causal connections between network structure and exploration as a strategy of content creation.
This paper proposes to directly examine the causal relationship between network
configuration and content exploration over time in a knowledge creation system. Based on the
discussion above, the three types of network configurations of interest should all bear some
importance in terms of content exploration. It is thus hypothesized that:
H3(a): Over time, higher levels of network embeddedness in a knowledge creation
system will cause subsequently higher levels of content exploration.
H3(b): Over time, higher levels of network connectivity in a knowledge creation system
will cause subsequently higher levels of content exploration.
H3(c): Over time, lower levels of network redundancy in a knowledge creation system
will cause subsequently higher levels of content exploration.
A graphical representation of H3 is shown in Figure 3.1.
Figure 3.1 Theoretical Model of Network Configurations That Predict Content Exploration.
The square feedback loop on top of each box represents a control for the influence of the variable’s past history on itself. The curved arrows (on the left side of the boxes) connecting the three explanatory variables indicate that the model controls for the influence of the other two explanatory variables when regressing the response variable on each explanatory variable.
CHAPTER 4: DATA AND METHOD
4.1 Wikipedia as a Research Site
As an example of online knowledge collaboration communities (Faraj et al., 2011), the
free encyclopedia Wikipedia probably constitutes the most well-known collaborative system
where any user can create and edit articles (de La Robertie et al., 2015). Recent statistics report
almost 35 million articles in more than 280 languages, among which there are close to 5 million
English-language articles. This collaborative process, involving more than 55 million
contributors worldwide, generates 10 million edits each month, which is approximately 10 edits
per second (de La Robertie et al., 2015).
The emergence of Wikipedia as an open knowledge collaboration project that “anyone
can edit” offers a rich research site for the study of self-organized user collaboration communities
online. This site is often considered to embody some principles of self-organization because the
tasks are not mandatory – users freely choose to contribute or not; the managerial rules are social
norms determined and negotiated by the community’s members rather than imposed by a central
power; and the writing tasks are not assigned to users but voluntarily accepted by those who are
willing to do so; in addition, the contributors generally do not receive monetary compensation
for their work (Benkler, 2006; Forte et al., 2009).
Since Wikipedia has established a peer-to-peer content creation paradigm, users are able to contribute to and nurture collective intelligence by adding, revising, and deleting small chunks of information (“wikis”), which eventually become part of the knowledge ecosystem shared by all internet users. On Wikipedia, geographically dispersed users collectively
achieve the goal of encyclopedia writing and knowledge collaboration. The main strength of
Wikipedia is to allow anyone to contribute to its content. The potential pitfall of such an open
collaborative editing process is the emergence of doubtful or even radically poor-quality content.
Thus, the Wikipedia community has been working on maintaining high-quality standards for
Wikipedia entries, while keeping the community open to all potential contributors.
The previous chapter has discussed that the Wikipedia system can be considered as part
of the general human knowledge creation process, thus the general selection theories reviewed in
Chapter 2 should apply to this collaborative knowledge creation system. Specifically, processes
of variation, selection, retention, mutation, and replication, among others, influence the survival
and vitality of the processes of knowledge creation and curation that are expressed in the form of
Wikipedia entries. The next part describes how to operationalize the theoretical concepts using
specific measurements from the Wikipedia context.
4.2 Data Collection
This study tests theoretical predictions by analyzing one WikiProject as the population of
interest. A WikiProject is composed of a group of volunteer contributors who commit to
develop, maintain, and organize articles related to a focal topic, such as medicine, fashion,
history, etc. (Ransbotham et al., 2012). Each WikiProject population includes English documents
from Wikipedia articles around that topic that have been reviewed by the Editorial Team
Assessment of that project. Overall, the English Wikipedia currently has over 2,000 WikiProjects, about 1,000 of which are actively monitored by around 30–2,000 editors, and all projects have varying levels of activity.² Some other WikiProjects are not manually monitored by human editors but watched by robots. These projects cover a wide range of topics, from the mainstream to the obscure.
² https://en.wikipedia.org/wiki/Wikipedia:WikiProject. Data retrieved on June 15, 2020.
There are two advantages in choosing WikiProject articles. First, articles belonging to
these specialized WikiProjects receive a peer-review quality score. It could be potentially useful
as a dependent variable in the analysis about fitness. Second, articles belonging to the same
WikiProject are reasonably associated around a topic area. Thus, the networks constructed for
these projects will not be too sparse. Editors can be expected to have engaged in activities that
exchange and transfer knowledge from one article to another in the same WikiProject.
For each article in this sample, several things were collected: (a) the editing history of the article; (b) the co-editing network of the articles; and (c) fitness measures.
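As an illustration of what collecting the editing history could look like, the sketch below queries the public MediaWiki API with the R packages httr and jsonlite. The endpoint and query parameters follow the documented API, but the function name, the example article, and the simplified field handling are illustrative assumptions rather than the actual collection pipeline.

library(httr)
library(jsonlite)

# Fetch up to 500 revisions (timestamp, user, size) for one article.
get_revisions <- function(title) {
  resp <- GET("https://en.wikipedia.org/w/api.php",
              query = list(action  = "query",
                           prop    = "revisions",
                           titles  = title,
                           rvprop  = "timestamp|user|size",
                           rvlimit = 500,
                           format  = "json"))
  parsed <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
  parsed$query$pages[[1]]$revisions   # one row per edit session
}

# head(get_revisions("Neon tetra"))   # hypothetical example article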
4.3 Operationalization of Concepts
4.3.1 Fitness
Though there are rich possibilities of interpreting what a “successful” or “fit” Wikipedia
article should be like, this project focuses on viewership of the article. This measure is close to
the idea that evolutionary fitness is represented by replication, the process of making more
replicates (i.e., copies, offspring) of the piece of information. To say that a Wikipedia article
acquired “numerous” copies in readers’ consumption processes would indicate that the article
content has achieved success in spreading to a wider audience. Since Wikipedia is generally
transmitted online, we can proxy the spread of an article during a period of time by the change in its page views. Viewership has been theorized to reflect the market value of online
content (Miller, 2009). Higher page views indicate fitter content. This measure has been used by
previous research as a measure of Wikipedia articles’ market value (Ransbotham et al., 2012).
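As a sketch of how this viewership measure could be collected, the public Wikimedia pageviews REST API returns daily view counts per article, from which weekly changes can be aggregated. The URL format follows the documented endpoint; the function and variable names are illustrative assumptions.

library(httr)
library(jsonlite)

# Daily user page views for one article between two dates ("YYYYMMDD").
get_pageviews <- function(title, start, end) {
  url <- sprintf(paste0("https://wikimedia.org/api/rest_v1/metrics/",
                        "pageviews/per-article/en.wikipedia/all-access/",
                        "user/%s/daily/%s/%s"),
                 URLencode(title), start, end)
  fromJSON(content(GET(url), as = "text", encoding = "UTF-8"))$items
}

# views <- get_pageviews("Neon_tetra", "20200101", "20200614")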
4.3.2 Content-based Characteristics
An article’s content-based characteristics could be measured through automatic text
analysis techniques or collected from an article’s editing history.
(1) Scope of content is measured in two different ways, content length and image by
content length. Content length is measured as the length of the article in bytes.
Generally, longer content length is associated with broader scope of content and more
complete content (Blumenstock, 2008). Image by content length is measured as the
number of images in an article divided by content length. It captures the portion of
images relevant to the content length, and thus represents the scope of visual content
in an article. Between two edit sessions of the same article, editors may choose to
maintain the original version’s scope of content or choose to modify the scope of
content as they see fit.
(2) External reference is measured as the number of references of an article. For each
Wikipedia article, there will usually be a reference section at the end of the content.
Again, between two edit sessions, editors make decisions about whether they want to
keep, add, or delete the external references (Dang & Ignat, 2016; Warncke-Wang et
al., 2015).
(3) Indexing of articles is measured in two different ways, number of links to other
Wikipedia articles and number of categories tagged in an article. Number of links to
other Wikipedia articles reflects to what extent the terms used in the focal article have
been connected to other supporting articles, which will help the readers to dig deeper
and easily find the information necessary to understand the focal article. Number of
categories shows the extent to which editors have tagged the focal article with topic
areas that this article belongs to. Both of these functions serve to help readers locate
supplementary information and obtain rich information resources that will improve
the efficiency of information seeking in Wikipedia (Dang & Ignat, 2016; Warncke-
Wang et al., 2015).
(4) Formatting of articles is reflected at least in two different ways in Wikipedia. First,
editors use citation templates to format in-text citations in a consistent way, as an
alternative to formatting the in-text citations by hand. This procedure is similar to
how academic citation software helps to automatically insert in-text citations and
compile a reference list in connection with in-text citations. In Wikipedia mark-up
language, such citation templates are often enclosed by a pair of “{{citation
content}}” tag (Dang & Ignat, 2016; Warncke-Wang et al., 2015).
Second, the infobox is a particularly interesting type of template used in Wikipedia
(Warncke-Wang et al., 2015). Infoboxes are often shown in the sidebar of a page and
look like small tables that summarize key features of the page’s subject. For
example, a template “{{Infobox sportsperson}}” includes parameters specific to
sportsperson biographies, such as ethnicity, national team, agent, height, weight, and
so on. And this infobox has been applied to 52,256 encyclopedic entries about
sportspersons. As another example, a template “{{Infobox K-pop artist awards}}” is
the infobox often used in K-pop artist articles and lists to show a summary of notable
awards won or nominated. This infobox has been applied to 63 K-pop artist articles
and it includes pre-set award names such as Melon Music Awards or MTV Asia
Music Awards, to name a few. By using these pre-set topic-specific infoboxes, editors
can easily set up a framework of content presentation necessary for that type of
content, and others can also easily fill in missing information or modify the information in an infobox by following its structure. Readers benefit from
infobox usage as they can quickly find relevant information that is summarized in a
table rather than sprinkled throughout the texts.
The third measure of formatting is the number of level 2 headings used in an article.
Usage of headings is encouraged in Wikipedia, as headings introduce sections and subsections to content, clarify articles by breaking up text, and automatically populate the table of contents shown at the top of each article.
Wikipedia mark-up language allows a six-level hierarchy of headings, starting at 1
and ending at 6. The literature has focused on the usage of level 2 headings mainly because they are not as common as level 1 headings and not as rare as level 3 and deeper headings. Carefully-written articles will see more of such headings used
throughout an article. In the mark-up language, level 2 headings are marked with
“==Heading 2 title==”.
(5) Clarity and readability are measured by three measurements, drawn from a family of readability scores suggested by Dang and Ignat (2016) and validated in Shen et al. (2017). They are based on the general assumption that a good article should be not only well-organized and well-formatted, but also well-written. They have shown that models using readability scores as part of the
feature set outperform models only using structure-derived features. In particular,
they found that the measure of difficult_words actually beats a long-standing best
predictor of Wikipedia article quality, article_length, in terms of contribution to
prediction accuracy. These readability scores are somewhat related by virtue of their algorithm designs, but this has been shown not to be a problem in the prior literature. This is because, first, each readability score adds prediction power in a cumulative manner, meaning that the scores are related but not overlapping. Second, each readability score contributes to the model to a similar degree, meaning that no one score is less important than the others (Dang & Ignat, 2016).
Flesch reading score is defined as a function of average sentence length and average syllables per word. The idea is that shorter sentences and shorter words are two positive indicators of readable texts. Easily readable texts should use shorter sentences made up of shorter words. The Flesch reading score (Kincaid et al., 1975) was designed such that higher values indicate better readability (easier texts).
Coleman Liau index (Coleman & Liau, 1975) measures readability as a function of
average word length and average sentence length, with different weighting values.
Longer word length and longer sentence length are two indicators of long and
difficult texts.
Difficult words counts the number of words that do not appear in a list of 3,000 common English words that fourth-grade American students can reliably understand (Chall & Dale, 1995). A higher difficult word count indicates that more difficult words are used in a text, which might create hurdles for the audience in consuming the information. The formatting and readability measures above are sketched computationally below.
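The sketch below illustrates how the formatting counts and readability scores could be computed in R directly from their published definitions. The regular expressions are simplifications of Wikipedia’s mark-up conventions rather than a full parser, and the word, sentence, syllable, and token counts are assumed to come from a separate tokenizer.

library(stringr)

# Formatting measures extracted from raw wikitext with simple patterns.
formatting_features <- function(wikitext) {
  c(citation_templates = str_count(wikitext, regex("\\{\\{\\s*cit", ignore_case = TRUE)),
    infoboxes          = str_count(wikitext, regex("\\{\\{\\s*infobox", ignore_case = TRUE)),
    # level 2 headings: ==Title== alone on a line (not === or deeper)
    level2_headings    = str_count(wikitext, regex("(?m)^==[^=].*?==\\s*$")))
}

# Flesch reading ease: shorter sentences and shorter words score higher.
flesch_reading_ease <- function(words, sentences, syllables) {
  206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
}

# Coleman-Liau index: L = letters per 100 words, S = sentences per 100 words.
coleman_liau_index <- function(letters, words, sentences) {
  0.0588 * (letters / words * 100) - 0.296 * (sentences / words * 100) - 15.8
}

# Difficult words: tokens absent from the Dale-Chall familiar-word list.
difficult_words <- function(tokens, easy_word_list) {
  sum(!tolower(tokens) %in% easy_word_list)
}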
4.3.3 Network-based Characteristics
The operationalization of the three types of network-based characteristics will be discussed in detail below; a computational sketch follows at the end of this subsection. There is a total of ten network-based characteristics, and they all
describe some aspect of how an article is located in a network. These characteristics are
conceptually similar, and some of them will be highly correlated. The existence of correlation
among these variables does not cause a problem for testing H1 and H2, because they are not
handled by regression tests. H1 and H2 only uses NS values obtained from these different ways
of partitioning the population and thus the NS values are directly comparable (Hilbert et al.,
2016). H3 will be using a regression technique and further steps will be taken to eliminate the
correlated variables. These steps will be discussed in the Analyses Plan section later.
(1) Network embeddedness
To describe the network embeddedness of an article, seven different metrics will be used to capture its different aspects. These seven metrics are all members of a family of centrality measures. The first three – degree, closeness, and betweenness centrality – have been regarded as “prototypical measures” that capture the most important aspects of embeddedness (Brandes et al., 2016, p. 153). Eigenvector centrality is also a well-known measure (Bonacich, 1972), along with some of its variants, including PageRank (Brin & Page, 1998) and Hub and Authority (Kleinberg, 1998).
Total degree centrality measures the percentage of connections that articles have to other
network articles. Higher degree centrality reflects that articles are well-connected to other
articles. Well-connected articles reflect the fact that editors working on these articles
have worked on many different other articles (Wasserman & Faust, 1994).
Closeness centrality is the inverse of the sum of shortest-path lengths (i.e., distances) from the focal node to all other nodes in the network. A shorter total distance means that a node is closer to other nodes in general. A larger closeness centrality value indicates that a node is in a more central position and has easy access to many other nodes. Easy
access usually represents advantages in information and resource benefits (Brandes et al.,
2016; Wasserman & Faust, 1994).
Betweenness centrality is based on the idea that sitting on the shortest paths between
many other pairs of nodes can provide a unique opportunity to tap into the
communication and information flow. It is also a concept related to shortest paths
between nodes or distance. If all shortest paths need to go through a focal node, then this
node matters for this whole network, because many other nodes’ communication
channels are dependent on this focal node (a broker). To summarize, closeness centrality
is often interpreted as access or efficiency in information transfer, and betweenness
centrality is often interpreted as the potential for information control between other actors
(Brandes et al., 2016; Wasserman & Faust, 1994).
Eigenvector centrality was proposed by Bonacich (1972) as a measure of node prestige. It is based on the assumption that a node is important when the nodes that it connects to are important nodes. It takes into account not only how many connections a node has, but also how many links its connections have, and so on through the network. This is a useful concept because it extends the span of influence to more than one step away from the focal node.
The PageRank measure (Brin & Page, 1998) is a variant of eigenvector centrality and has proven highly useful for ranking the importance of Web pages. It was first designed to represent the relative importance of Web pages by examining the hyperlinks they receive. Presumably, if a page receives more hyperlinks from more important
pages, then more traffic will likely be directed here, and thus a page becomes highly
visible and important. Each connecting link contributes a PageRank score, and a node-level PageRank measure is the sum of the PageRank scores of the other nodes linking to it.
Hub is also a variant of eigenvector centrality (Kleinberg, 1998). A high hub score means that a node links to many good and important resources (nodes) in a network. The hub score of a node is proportional to the sum of the authority scores of the nodes it links to. In this sense, it is conceptually similar to eigenvector centrality in that it recursively defines a node’s position based on its connections’ positions.
(2) Network redundancy
This is a concept describing the extent to which one’s connections are bringing in unique
and new resources. A centrally-positioned and well-connected node may benefit from
many advantages, but it does not mean that the node automatically has access to unique
resources. On the contrary, a well-connected node may be surrounded by a neighborhood
of highly similar connections; thus, the resources are abundant yet redundant. In the
context of knowledge creation, redundant information resources are not the most
desirable because as Nahapiet and Ghoshal (1998) pointed out, knowledge creation relies
on exchange and recombination of diverse information. Redundancy can be measured in
two different ways, constraint (Burt, 2001) and effective network size (Borgatti et al.,
1998; Burt, 1992).
Constraint measures the extent to which “a network is directly or indirectly concentrated
in a single contact” (Burt, 2001, p.39). Think of a small team with one manager. If all
team members spend most of their time reporting to the manager and communicating
within this team, then the manager is likely a centrally-positioned person, but these rich
connections will not bring the manager much new knowledge, as they only have access to
a similar pool of knowledge themselves. If the same team now spends more time also talking to external teams and engages in cross-functional collaboration, the manager will be able to receive a much more diverse set of information and resources from the team
members. The latter situation is when the manager enjoys lower constraint.
Effective network size is the number of connections that a focal node is directly
connected to, minus a “redundancy” factor (Borgatti et al., 1998; Burt, 1992). It is a
weighted method of describing the span of one’s connections – the more different regions
one can reach, the greater the potential of information and control benefits. Note that
higher values of effective network size indicate less redundancy (a good thing), while
higher values of constraint indicate more redundancy (a bad thing).
(3) Network connectivity
It describes the process of one node connecting to various parts of a network. In
constructing an article co-editorship network, a basic assumption is that editors working
on these articles serve as information flow channels among the contents. It is thus
important that editors can have many opportunities to work on many different articles in a
topic area. When editors work on different yet related articles, they can collectively build
norms for article writing in this area, increase background knowledge of a certain topic
area, and learn about best practices in collaborative editing and community governance
(Halfaker et al., 2012). All these collaborative experiences cannot be easily gained by
only working on a single article. Network connectivity reflects the extent to which a local
network has articles that are themselves connected, instead of just being connected by a
single central article. This type of well-connected local environment provides a fostering
community for editors who practice their editing skills, accumulate relevant information
that will benefit their future work, and thus provides advantages for the content creation
of the focal article.
The following metrics describe the connectivity of a local environment. A local environment
in this context is defined as a focal article’s two-step network neighborhood. The step of
two is chosen because this will lead to a network that is large enough to have a
meaningful number of possible connections, and small enough to remain relevant for the
focal node.
Clustering coefficient is a metric that describes to what extent a network closes triangles
(Newman, Watts, & Strogatz, 2002; Watts & Strogatz, 1998). Assume that in a network,
A and B are connected, and B and C are also connected, then the probability that A and C
are also connected can be measured by the clustering coefficient. On average, the more triangles that are closed in a network, the higher the clustering coefficient value.
Density measures the portion of realized network connections out of all possible
connection opportunities in a network (Wasserman & Faust, 1994). A maximal complete
network where all possible edges are connected has a density score of 1. A network with
no edge has a density score of 0. Higher density means that the network is well-
connected, while lower density means that many potential connections have not been
made yet.
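As a computational illustration of the network-based characteristics just described, the following R sketch computes them with the igraph package on a hypothetical article co-editorship network g (articles as named nodes, shared editors as edges). The centrality, constraint, clustering, and density calls are igraph built-ins (function names as in igraph 1.x); effective network size is not built in, so it is computed here from Borgatti’s simplification, effective size = n − 2t/n, where n is the number of alters and t the number of ties among them.

library(igraph)

network_characteristics <- function(g) {
  # Borgatti's simplified effective size on each one-step ego network.
  eff_size <- sapply(seq_len(vcount(g)), function(i) {
    alters <- neighbors(g, i)
    n <- length(alters)
    if (n == 0) return(0)
    t <- ecount(induced_subgraph(g, alters))  # ties among alters only
    n - 2 * t / n
  })
  # Two-step local environments for the connectivity measures.
  ego2 <- make_ego_graph(g, order = 2)
  data.frame(
    article     = V(g)$name,
    # (1) network embeddedness: the centrality family
    degree      = degree(g),
    closeness   = closeness(g),
    betweenness = betweenness(g),
    eigenvector = eigen_centrality(g)$vector,
    pagerank    = page_rank(g)$vector,
    hub         = hub_score(g)$vector,
    authority   = authority_score(g)$vector,
    # (2) network redundancy
    constraint  = constraint(g),
    eff_size    = eff_size,
    # (3) network connectivity of the two-step neighborhood
    clustering  = sapply(ego2, transitivity, type = "global"),
    density     = sapply(ego2, edge_density)
  )
}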
4.3.4 Content Exploration
The operationalization of content exploration was adopted from Arazy et al. (2020). In
the areas of design, engineering and machine learning, an artifact’s design is often quantified
through positioning it within a space of all possible alternatives, and this approach has been used
to explain how problems are solved (Mason, 2013). The space of possibilities is considered a
feature space, and the process of surveying the space of possibilities is referred to as search
(Mason, 2013). The end goal is for creators to identify a position in the space that leads to the
best outcome, such as highest quality, lowest cost, highest popularity, etc. The decision to arrive
at such a position is thus a search problem.
Content exploration, according to Arazy et al. (2020), was defined as the degree to which an article explores a two-dimensional feature space constructed from the article’s starting and ending positions. At the beginning stage of content creation, there are many possibilities and
many possible directions for the content to explore. Imagine a two-dimensional feature space
constructed by X and Y axes. Each position (i.e., a point defined by X and Y values) represents a
configuration of the article’s content out of all the possibilities. Some are realized while many
other possibilities are never realized or implemented. A realized possibility can be described as a
point in this feature space, given the X and Y values. The distance between two points in the
space reflects the extent to which two solutions are dissimilar. The process of generating and
evaluating a wide range of artifacts (positions in the feature space) is referred to as exploration
(Mason, 2013; Navinchandra & Riitahuhta, 2011). The realized positions of that artifact
constitute a line or a development trajectory in this space, which is analogous to a regression line
in a two-dimensional regression graph. Exploration happens when an artifact (i.e., a Wikipedia article) travels along the content development trajectory.
Content creators/editors negotiate in which direction they would want the artifact to go
and this is a dynamic process in which many people could have many different ideas about what this artifact should look like. Editors can work on the current version of the artifact by
modifying, adding or deleting content to push the current version into a direction that they would
like to see. Content creation can thus be understood as a process of different forces (in this case,
editors with different opinions) pulling and shaping the content trajectories into different
directions throughout the feature space.
In the context of Wikipedia article creation, Arazy et al. (2020) suggested modeling content exploration as a dynamic process that moves from a starting point to an end point in a
two-dimensional space. That is, the end of an observation period is assumed to be the end of an
article’s history. In reality, these articles continue to move into new positions in the feature
spaces, but we will not be able to observe that because we are limited by the fact that the data
collection process has to stop at a certain time point and will not indefinitely continue. (This
problem can be addressed if a researcher chooses to use a content creation site that has stopped
updating and when the complete history of content development is available.) The starting and ending points are connected by a straight line – the baseline trajectory. As an article gets created and
developed, the article may move from position to position. Some may fall onto the baseline
trajectory while some will not. These positions can be described by pairs of X and Y values.
In particular, the X value is the distance (Hamming, 1950) between the article’s current status
and the article’s initial status, and the Y value is the distance between the article’s current status
and the ending status. (The detailed computation procedure of the distances can be found in
Arazy et al., 2020). Points that fall onto the baseline are considered to have zero exploration,
because they move in a direction that they “should” go by not deviating from the baseline
trajectory at all. Points that are closer to the baseline trajectory are considered to have low
exploration. Points that are farther away from the baseline trajectory are considered to have
higher exploration.
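A simplified computational sketch of this measure is given below, reducing each article version to a binary feature vector and using plain Hamming distances. Because the baseline trajectory runs from (0, D) to (D, 0), where D is the distance between the inception and ending versions, a point’s deviation from the baseline reduces to |X + Y − D| / √2. This is a simplification in the spirit of Arazy et al. (2020); their distance computation is more elaborate.

# Hamming distance between two equal-length binary feature vectors.
hamming <- function(a, b) sum(a != b)

# Exploration level of one article version: the perpendicular distance of
# the point (X, Y) from the baseline trajectory through (0, D) and (D, 0).
exploration_level <- function(current, inception, ending) {
  D <- hamming(inception, ending)
  x <- hamming(current, inception)   # X: distance from the initial status
  y <- hamming(current, ending)      # Y: distance from the ending status
  abs(x + y - D) / sqrt(2)           # zero when the point lies on the baseline
}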
As an illustration, Figure 4.1 shows the concept of the trajectory of the Wikipedia article
of “The Hobbit”. The development process moves from the upper left-hand corner (inception
status of the article) to the lower right-hand corner (ending status of the article or the end of the
data observation period). The shortest straight line between the inception status and the ending
status is the baseline trajectory, where all points have zero exploration. The realized trajectory
curved away from the shortest straight line. As the article evolves over time, the trajectory
explores multiple points within the feature space. The distance between a certain point on the
actual trajectory and the baseline trajectory is the level of content exploration.
Figure 4.1 A Graphical Representation of the Development Trajectory of the Wikipedia Article
“The Hobbit”, Showing the Article’s Content Exploration Process. Graph Adapted from Arazy et
al. (2020)
Figure 4.2 presents some exemplary Wikipedia article trajectories, as provided in Arazy et
al. (2020). The numbers are article IDs used in their dataset. This graph gives a general
impression of what typical development trajectories of Wikipedia articles look like.
Figure 4.2 Some Exemplary Wikipedia Article Development Trajectories. Graphs ordered by
the average level of exploration throughout the data observation period. Graph adapted from
Arazy et al. (2020)
As can be seen from the graphs and the above discussion, several observations about this
variable can be made. First, it can be understood as a time series because it is concerned with a
dynamic process. In this dissertation, each version is defined as a weekly observation of the
article. Each article version has a content exploration level, as calculated by its distance between
the current position to the baseline trajectory. Editors’ behaviors drive the article’s content to
change over time. When editors try to drastically change the content of the current version from
its prior version, this leads to a big change in the development trajectory and thus a larger value
of the content exploration level. When editors only make minor edits to a prior version, this leads
to a small change in the development trajectory and thus a smaller value of the content
exploration level.
Second, the degree of content exploration varies across the dataset. Some articles will
experience a more rugged trajectory than others. More rugged trajectories are often a result of
many different forces and opinions pulling the article into their desired directions. When these
opinions are in conflict with each other, such as Article IDs 86 and 45 in Figure 4.2, the
trajectory may be particularly rugged because the directions are not consistent. When most
people agree with the general direction of the article, such as Article IDs 29 and 92 in Figure 4.2,
the trajectory may be smooth and stays close to the baseline trajectory because people are not
trying to push the content into a different direction drastically different from its current version.
Third, the trajectory starts at a low exploration level and ends at a low exploration level,
and the exploration happens in the middle stage of the observation period. This is due to the way
the variable is calculated. At the beginning stage, the “current” status will not be too different
from the inception status (the point on upper left-hand), and over time, the “current” status will
get close to the ending status (the point on lower right-hand).
In Arazy et al. (2020), the variable of content exploration was treated as a dependent
variable, and they included independent variables such as article age, conflict level among
editors, editor’s retention rate and percentage of newcomers to predict the level of content
exploration. They used a fixed effects regression technique and found that compositional features
of the editor community, such as retention rate and percentage of newcomers among editors, can
predict content exploration level. They controlled for the factor of “time” by including two
proxies of “time” in the Wikipedia community, one is the edit-session number and the other is
article age. For each article, they recorded multiple editing sessions throughout the article’s
history – the variable of edit-session number. Small edit-session number shows that this is an
early version of the article, and larger edit-session number indicates a later version of the article.
They also recorded the age of each editing session (number of hours since the inception of this
article) – the variable of article age. A fixed effects model was used in their study to control for the
fact that the data were nested under the specific edit session (a slice of observation taken from
that article’s development history). This choice made sense because this study mainly focused on
proposing a new algorithm, understanding the basics of this construct, and analyzing its
relationship with editor community’s composition.
Their analysis also showed that it is feasible to allow for time-varying effects in modeling
content exploration. This measure was proposed to track the evolution of Wikipedia articles’
content development process, so it is naturally a longitudinal observation throughout a period of
time. There are possibilities for using more sophisticated time series model to understand this
variable as a dynamic process, even though they reduced the “time-varying” component to two
proxy measures, article age and edit-session number. A limitation of their work, as recognized in
Arazy et al. (2020), is that they focused on analyzing what might be the antecedents and
consequents of this concept. So, their modeling “focused more on explaining multiple relationships across a complex set of variables… rather than focusing on a single relationship with the overriding goal of establishing a causal relationship, as is common in econometrics” (p. 46). They recommended that a future research direction is to identify more specific and
causal relationships between its antecedents and content exploration. Thus, this measure is an appropriate way to operationalize the construct of content exploration, which captures a dynamic process over time.
4.4 Analyses Plan
4.4.1 Tests for H1
H1 proposed that network-based characteristics will be relatively more influential than
the content-based characteristics in attracting selection pressure. The output variable will be a relative fitness measure – specifically, page views as the indicator of fitness. The force of natural
selection brought by each different way of population partitioning will be calculated by the Price
equation, as discussed in Chapter 2. The Price equation measures a population’s average level of
natural selection due to the varied trait values possessed by each individual member. Further,
each different way of partitioning the population leads to a different natural selection value.
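As a minimal computational sketch, and under the assumption that the NS value corresponds to the selection component of the Price equation, Cov(w, z) / mean(w), one NS value per characteristic and time point could be computed as follows (variable names are illustrative):

# Selection component of the Price equation: NS = Cov(w, z) / mean(w),
# where z is the characteristic used to partition the population and
# w is the fitness (here, page views). Names are illustrative.
ns_value <- function(z, w) {
  cov(w, z) / mean(w)
}

# e.g., ns_value(articles$content_length, articles$page_views)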
The comparison method adopts that used in Hilbert et al. (2016), which directly compares
the two vectors of natural selection force imposed by content-based characteristics and network-
based characteristics. To obtain the statistical significance of the comparison between these two
vectors of selection pressure, an empirically generated sampling distribution will be created
(Pesarin, 2001; Wheeler et al., 2016). The procedure is as follows: First, in the original dataset,
there are two vectors of data points at each time point: an article’s content-based NS values and
network-based NS values. Next, one value is randomly drawn from the content-based NS vector
and another value is randomly drawn from the network-based NS vector (at time t). As such,
each pair of data points constitutes a randomly selected content-based NS value and a randomly
selected network-based NS value, also at that time. The procedure was repeated 10,000 times
to generate a new list of paired NS values. The Fisher-Pitman test, a nonparametric counterpart
of the F test in one-way ANOVA, was used to analyze whether content-based or network-based
characteristics are generally more powerful in driving the evolutionary change of Wikipedia
articles. This test is desirable compared to the typical ANOVA test because it relaxes the normality and equal-variance assumptions required in parametric tests. Given the limited understanding of natural selection forces in cultural evolution, there is no theoretical basis to assume normality and/or equal variances. The procedure will be implemented
using the R package coin (Zeileis et al., 2008).
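A minimal sketch of that implementation is shown below, assuming ns is a data frame with one column of NS values and a factor marking whether each value came from a content-based or a network-based partitioning; the Monte Carlo argument is nresample in recent versions of coin (B in older versions).

library(coin)

# ns <- data.frame(value  = c(ns_content, ns_network),   # hypothetical vectors
#                  family = factor(rep(c("content", "network"),
#                                      c(length(ns_content), length(ns_network)))))

# Fisher-Pitman permutation test with 10,000 Monte Carlo resamples.
oneway_test(value ~ family, data = ns,
            distribution = approximate(nresample = 10000))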
4.4.2 Tests for H2
H2 and the three competing hypotheses H2(a-c) ask which type of network-based
population partitioning method identifies stronger natural selection pressure. Similar to the
methodology used for testing H1, the Price equation will be used to determine the average level
of natural selection due to the three network-based characteristics (network embeddedness,
connectivity and redundancy) possessed by each individual member. Each different way of
partitioning the population leads to a different natural selection value.
As stated in H2, we are first interested in testing the global null hypothesis across the three vectors of natural selection values, based on the different population partitioning methods.
H2null: NS_embeddedness = NS_connectivity = NS_redundancy
Testing this hypothesis will use the same method introduced in Section 4.4.1; the Fisher-Pitman permutation test can handle more than two factor levels in an explanatory variable (Berry
et al., 2002).
Taking H2 and H2(a-c) together forms a multiple testing problem (Dmitrienko et al.,
2009), which requires more conservative treatment of the null hypothesis rejection criteria than
testing a single hypothesis, because the multiplicity of hypotheses increases the probability of
erroneously rejecting a true null hypothesis (increased probability of Type I error). The multiple
testing is handled by the following procedure. If the testing of H2 turns out significant, judging
by a pre-specified significance level (α level) of 0.05, then we proceed to the following subset
hypotheses for pairwise comparisons among the three vectors H2(a-c). Otherwise, the procedure
stops. Testing a set of interrelated hypotheses requires correction of the α level to ensure that we
are not increasing the probability of Type I error. The treatment of the α level adopted here follows the Bonferroni-Holm procedure (Dmitrienko et al., 2009), which suggests a step-wise testing method to adjust the p values for each hypothesis. For example, we will need to simultaneously find significance for both H2(a) and H2 under the adjusted condition to reject the null of H2(a). Simultaneous significance of both H2(b) and H2 is required to reject the null of H2(b). The same procedure applies to H2(c).
4.4.3 Tests for H3
H3 proposed to directly examine the causal relationship between three types of network
configuration and content exploration over time in a knowledge creation system. This hypothesis
does not directly examine the measure of fitness but tries to causally unpack the mechanism of
why network configurations matter in this context. The response variable of interest is the level
of content exploration.
The testing process will be conducted in the following steps. First, the output series is
analyzed by just examining its own past history as a dynamic process. This is sometimes called a
pre-whitening process (Box et al., 2008). An AutoRegressive Integrated Moving Average
(ARIMA) model will be used to model the dependent series. This establishes a baseline model to
predict Y series only using historical information of Y without considering the input series’
effects. Second, transfer function models will be estimated by only including the dependent
series’ past lags as predictors of its current value, without the input series. This step builds a
baseline model for the full transfer function model. Third, transfer function models will identify
the lagged effects of input series (content-based and network-based characteristics) on the output
series. If the parameter estimation shows that certain network configurations have significant
lagged effects on content exploration, we will be able to conclude that network configurations
indeed cause changes in content exploration levels over time. This will lend support for H3.
Next, each step is introduced in further detail.
ARIMA model is a univariate time series model that relates the present value of a series
to its past values and past prediction errors. It usually consists of three terms – AutoRegressive
(AR) term, differencing term, Moving Average (MA) term. The AR term, usually denoted as
AR(p), indicates how many previous times (order p) we use to predict the outcome variable at
the present time. AR(1), for example, stands for autoregressive term of order 1, which means we
use value of the variable and the immediate past time to predict the value of the variable at the
current time. The MA term, usually denoted as MA(q), indicates how many past times (order q)
of error we include in the model to predict the present time. MA (2), for example, stands for
moving average term of order 2, which means we use the error terms in the past two periods to
predict the value of the variable at the current time. The differencing term (order d) denotes the
number of times the data have had past values differenced (subtracted). This step aims to get the
time series closer to a stationary or stable status. A stationary time series does not have an
obvious overall trend (sometimes called a global trend) over time, rather, the series fluctuates
around a stable mean value over time. (See Enders (2014) for technical definitions of
stationarity).
An ARIMA model is often written in the form of ARIMA(p,d,q) model, with p,d,q
indicating the AR term, differencing term, and MA term, respectively. It thus describes a
univariate time series model where the current time value can be decomposed into the part that is
influenced by its past values (AR with order p) and past errors (MA with order q), and the part
that is left after taking differences in the specified order (differencing with order d). When the
time series is already stationary, the model can be simplified to an ARMA(p,q) model, or ARIMA(p,0,q) model, meaning that the differencing term is not needed. See Enders (2014)
and Chatfield and Xing (2019) for more details about ARIMA models.
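In backshift-operator notation, consistent with the description above, an ARIMA(p,d,q) model can be written as

$$\Big(1 - \sum_{i=1}^{p} \phi_i B^i\Big)(1 - B)^d Y_t = c + \Big(1 + \sum_{j=1}^{q} \theta_j B^j\Big)\varepsilon_t,$$

where B is the backshift operator (B Y_t = Y_{t-1}), the φ_i are the AR coefficients, the θ_j are the MA coefficients, c is a constant, and ε_t is white noise.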
Specifically, the estimation of an ARIMA process was obtained by using the following
procedure. First, if an initial examination of the content exploration as a time series object
revealed that this series is not stationary, then it is necessary to first turn it into a stationary
series. This step identifies the order of d. Next, values of the p and q terms are explored, starting from the simplest models such as p = 1, q = 1, and proceeding to larger p and q values to test more complicated models. A satisfactory model is obtained when the residuals of the ARIMA model approximate a white noise series. The Augmented Dickey-Fuller test and the Ljung-Box Q test will be used to assess whether the ARIMA model has achieved a good result. These steps are conducted with the R packages TSA and tseries.
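A sketch of this estimation sequence in R is shown below; the (1, 1, 1) order is a placeholder for the iterative search over p and q, and the series name is illustrative.

library(tseries)

y <- ts(article$exploration)          # weekly content exploration series (hypothetical)

adf.test(y)                           # Augmented Dickey-Fuller: stationarity (order d)
fit <- arima(y, order = c(1, 1, 1))   # candidate ARIMA(p, d, q) model
Box.test(residuals(fit), lag = 10,    # Ljung-Box Q: residuals should look
         type = "Ljung-Box")          # like white noise (large p value)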
Transfer function models (Williams and Monge, 2001, pp. 209-213), a technique of
multivariate time series modeling (Enders, 2014), will be used to model the dynamic process of
how network configurations cause changes in the response variable – content exploration. This
model is also sometimes called an AutoRegressive Distributed Lag model (Demirhan, 2020;
Pesaran & Shin, 1998). This is a dynamic regression technique that allows the explanatory
variable(s), a dynamic process, to influence the response variable, also a dynamic process. This
technique will describe how the changes in the explanatory variable get transferred to the
response variable, by identifying a transfer function that is conceptually similar to the regression
coefficient in classic regression models.
Formally, transfer function models can be written as
Y_t = c + \delta_1 Y_{t-1} + \delta_2 Y_{t-2} + \cdots + \delta_p Y_{t-p} + \omega_0 X_t + \omega_1 X_{t-1} + \cdots + \omega_q X_{t-q} + \varepsilon_t
where the \delta_i are coefficients that reflect the lagged effects of the Y series' history on its
current value, the \omega_j coefficients reflect the lagged effects of the X series' history on Y, c is
a constant, and \varepsilon_t is the error term. p represents the number of lags of the dependent
variable to be modeled, and q denotes the number of lags of the exogenous variable to be modeled.
Conceptually, these lagged coefficients are similar to the beta coefficients used in traditional
linear regressions to represent the fact that changes in X lead to (transfer to) changes in Y. The
main difference is that in transfer function models, the "coefficient" is not a single value but a
polynomial that accounts for several past values of X, each of which can have a different and
time-varying effect on Y.
By examining this equation, we can identify several important features of this method. First, it
handles both input and output variables as time series, while still providing coefficient
estimates that can be interpreted similarly to traditional regression coefficients. Second, it
accounts for the lagged effects of the output series' past values on its current value. Third, it
also accounts for the lagged effects of the input series' past values on the current value of the
output series. The latter two points help to explain more of the variance of the output series,
because the model takes into account the fact that both the input series and the output series can
be dynamic processes themselves, with the input series exerting additional effects on the output series.
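As an illustration of how such a model can be specified in practice, the following R sketch fits an ARDL-style transfer function model with three lags of the dependent series and up to ten lags of one input series, using the dynlm package. The series names and placeholder data are hypothetical; the analyses reported in Chapter 5 were estimated in EViews 10, so this sketch is an illustrative equivalent rather than the actual estimation code.

library(dynlm)

# hypothetical stationary series: y = content exploration, x = one network metric
y <- ts(rnorm(162))
x <- ts(rnorm(162))

# ARDL(p = 3, q = 10): lags 1-3 of y, plus the current value and lags 1-10 of x
fit <- dynlm(y ~ L(y, 1:3) + x + L(x, 1:10))
summary(fit)  # the lagged x coefficients play the role of the omega_j terms above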
Graphically, an output series Y can thus be decomposed into several dynamics, as shown
in Figure 4.3.
Figure 4.3 A Graphical Representation of How Transfer Function Models Decompose Three
Dynamics in Modeling a Response Variable Y.
CHAPTER 5: RESULTS
This chapter details the results of the analyses performed using the Wikipedia project
data. The first section (5.1) provides descriptive statistics, giving empirical background on
the dataset. The sections that follow (5.2, 5.3, 5.4) report the hypothesis testing results.
5.1 Descriptive Statistics of the WikiProject Aquarium Fishes
The WikiProject of Aquarium Fishes was used for this project. This WikiProject was
chosen because it is large enough to include meaningful information while small enough to be
handled by network analysis software. The final dataset includes 394 articles from the project.
The WikiProject Aquarium Fishes covers a topic area that is neither overly active nor overly popular
compared to projects of wide interest to the general public. It is not a place that gets
constant attention from the "fluid" or one-time editors who come to work on entries when they are
related to sensational news and leave when the entries are no longer popular (Keegan et al.,
2013). The internal dynamics of its editors are relatively stable, which is ideal for our research
focus on the internal dynamics generated by editors' network behaviors rather than the
momentum brought by external environmental shocks.
Figure 5.1 shows the natural log transformed series of weekly pageviews
summed over all 394 articles. Figure 5.2 shows the natural log transformed series of
weekly total revisions summed over all 394 articles. The natural log transformation reduces
the variability of the raw series so that the data visualization does not look as stretched
out in the graph; the model results reported later use the original data, not the log
transformed data. These two graphs show that the volume of weekly activity (pageviews of
articles and revisions to articles) in this WikiProject does not fluctuate much, meaning
that the WikiProject as a whole did not become more or less popular over this period of time.
Second, the editors working on this project tend to be "stable" or repeating users. This dataset
includes 2,610 unique users. Users are defined as registered accounts and unique IP addresses.
Note that sometimes a registered user can choose not to log in, so that his or her activities are
recorded as coming from an IP address. In such cases, we cannot identify that the IP
address actually belongs to a registered user. We also cannot tell when the same user changes
network proxy settings and thus appears under multiple IP addresses. This dataset
therefore has no way to link multiple IP addresses together even if the visits were actually made by the
same person. 1,679 of the 2,610 users were one-time users (64.3%), and 931 (35.7%) were
repeating users who contributed more than once. The histogram in Figure 5.3 shows the
distribution of revision counts by editor counts. When the WikiProject contains a high
percentage of repeating editors, the network of information flow constructed by these editors is
more meaningful than when the percentage is low. If a large group of people
constantly works on the same project, they are likely to bring the knowledge and experiences
learned from one article to another. Information exchange and accumulation of experience are
less likely to happen when most of the editors are one-time users visiting for random reasons
(e.g., arriving from a Google search) who may never come back.
Figure 5.1 Total Pageviews per Week. The X-axis represents the sequence of weeks (from Week
1 to Week 163). The Y-axis value is the weekly total pageviews (natural log transformed).
Figure 5.2 Total Revisions per Week. The X-axis represents the ID of the week (from Week 1 to
Week 163). The Y-axis value is the weekly total revisions (natural log transformed).
Figure 5.3 Histogram of Revision Counts by Editors. The X-axis shows the number of revisions
(natural log transformed) made by an editor. The Y-axis shows the count of editors corresponding to
each level of "number of revisions (log)".
At the time of data collection in February 2020, there was an initial set of 910 articles
in the WikiProject Aquarium Fishes. Some articles were then excluded based on the
following rules: (1) the collection process removed articles that were not evaluated or that, for
other reasons, did not receive a valid evaluation score; only those evaluated to have a
quality score of Featured Article, Good Article, A, B, C, or Start were included. (2) The dataset
also excluded articles created after January 1, 2017, the beginning of the data
observation period. This avoids missing data in the sample due to the non-presence of
these articles at the beginning of the data observation period. These two exclusion steps led to a
final dataset of 394 articles. Figure 5.4 shows the distribution of articles across quality levels
(Start being the lowest and FA the highest quality). 196 of the 394 articles (49.7%)
belong to the Start level, the minimum quality level collected in this dataset.
The data observation period thus starts on January 1, 2017 and ends on February 23,
2020, spanning 165 calendar weeks. The text of each article was collected at the end of
each calendar week (at Sunday midnight). To ensure that each observation is exactly one week
apart, the first and last incomplete calendar weeks were removed. This step leads to 163
weekly observations at equal intervals.
Figure 5.4 Number of Articles in Each Quality Level. The X-axis represents the number of articles
and the Y-axis represents the quality levels from low to high (Start is the lowest level and FA
the highest quality in the dataset).
5.2 Results for Hypothesis 1
Hypothesis 1 suggested that in a knowledge creation system, network-based
characteristics exert stronger selection pressures than content-based characteristics. The analysis
started with calculating the natural selection pressure based on weekly pageviews as an indicator
of fitness. A WikiProject is considered to be a population and the dataset includes 163 weekly
observations of the same population over time. Because the evolutionary forces can only be
captured (using the Price Equation) based on change between two observations, this dataset
generated 162 NS values for the WikiProject. In order to meaningfully compare the evolutionary
forces imposed under different partitioning methods, the raw NS values were ranked by percentiles
from lowest to highest, with percentiles closer to zero representing smaller NS forces.
This treatment follows that used in Hilbert et al. (2016); it removes the variance in the absolute
magnitudes of the NS values while keeping their rank order, and it also makes
the results reported here consistent with prior studies.
The content-based characteristics include the following 11 items: content length, images
by content length, number of categories, number of templates, number of level 2 headings,
number of links to other Wikipedia pages, number of references, whether the article contains an info
box, number of difficult words, the Flesch reading score, and the Coleman-Liau index. The network-
based characteristics include the following 10 items: degree centrality, betweenness centrality,
closeness centrality, eigenvector centrality, PageRank centrality, hub centrality, constraint,
effective size, density, and transitivity. The detailed measurements of these characteristics can be
found in Chapter 4. These 21 population partitioning methods thus lead to 21 sets of NS values
that reflect the evolutionary forces imposed by changes in each of these characteristics. Figure
5.5 compares the average percentile rankings from both groups, distinguishing between
population structures made of content-based versus network-based characteristics.
Figure 5.5 Average Percentile Rankings of Natural Selection Forces per Population Structure.
The X-axis represents each population partitioning structure; the Y-axis shows the average
percentile ranking of the natural selection value. The "Content" panel on the left includes the 11
content-based characteristics, and the "Network" panel on the right includes the 10 network-
based characteristics.
The average content-based NS value is 0.504 and the average network-based NS value is
0.495. To determine whether this difference is statistically significant, a Fisher-Pitman
permutation test was used, building randomly sampled distributions of the NS values while
controlling for the weekly observation period. The permutation test relied on a randomly generated
sampling distribution of NS values from 10,000 simulations. This method examines
whether one partitioning method significantly reduces or increases NS values compared to
the other group (content-based vs. network-based), holding constant the week of the observation.
The null hypothesis of the Fisher-Pitman test is the absence of difference between the two
groups, or, stated differently, the equality of the two groups of NS values. The result shows that
values as extreme as the observed one occur in 23.83% of all random simulations (2,383 cases out of
10,000 simulations, two-sided test). In other words, 23.83% of the simulations generated
values more extreme (greater or smaller) than the observed one, and in 76.2% of
the simulations less extreme values were observed. With p = .24, the
null hypothesis cannot be rejected, and these two partitioning methods do not produce
significantly different evolutionary forces. Thus, H1 is not supported.
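To make the procedure concrete, a stratified (within-week) permutation test of this kind could be sketched in R as follows; the data frame ns, its columns, and the placeholder values are hypothetical illustrations, not the actual analysis code:

set.seed(42)

# hypothetical data: one NS percentile per characteristic (21) per week (162)
ns <- data.frame(
  value = runif(21 * 162),
  group = rep(c(rep("content", 11), rep("network", 10)), times = 162),
  week  = rep(seq_len(162), each = 21)
)

obs_diff <- mean(ns$value[ns$group == "network"]) -
            mean(ns$value[ns$group == "content"])

perm_diffs <- replicate(10000, {
  g <- ns$group
  for (w in unique(ns$week)) {   # permute group labels within each week
    idx <- which(ns$week == w)
    g[idx] <- sample(g[idx])
  }
  mean(ns$value[g == "network"]) - mean(ns$value[g == "content"])
})

p_value <- mean(abs(perm_diffs) >= abs(obs_diff))  # two-sided p-value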
5.3 Results for Hypothesis 2
Hypothesis 2 tests for differences among the three network-based NS values: network
embeddedness, network connectivity, and network redundancy. The omnibus null
hypothesis is that there are no differences among the sets of NS values; the substantive
alternative hypotheses are that there are differences (inequality) between specific NS
values. A Fisher-Pitman permutation test was conducted to examine whether at least one group
of NS values is significantly different from the others, while controlling for the weekly
observation period. The permutation test relied on a randomly generated sampling distribution of
NS values from 10,000 simulations. The result shows that values as extreme as the observed one
occur in only 1.21% of all random simulations (121 cases out of 10,000 simulations, two-sided test).
That is, only 1.21% of the simulations generated values more extreme (greater or smaller) than the
observed one, and in over 98% of the random simulations less extreme values were observed.
With p = 0.012, the null hypothesis was rejected. At least one group of the NS values was
significantly different from the others. H2 was supported.
Now that the global test has shown that at least one group is significantly different from the
others, we proceed to subset hypothesis testing to identify which groups contribute most to the
significance found in H2. H2(a-c) proposed to check all pairwise relationships between the three
network characteristics. The mean of NSconnectivity is 0.52, the mean of NSembeddedness is 0.49,
and the mean of NSredundancy is 0.48. We want to confirm whether the observed larger mean value of
NSconnectivity is indeed significantly greater than the other two groups. H2(a) has the null
hypothesis of equality of NS values between network embeddedness and connectivity. H2(b) tests
the equality of NS values between connectivity and redundancy. H2(c) tests the equality of NS
values between redundancy and embeddedness. Results show that for H2(a), the pairwise comparison
between embeddedness and connectivity, the obtained value for the statistical test was p = 0.034;
for H2(b), the pairwise comparison between connectivity and redundancy, the p-value was 0.015;
for H2(c), the pairwise comparison between embeddedness and redundancy, the p-value was 0.20.
Since this involves a multiple hypothesis testing problem, as discussed earlier, we need to
take into account the increased likelihood of making a Type I error when these hypotheses are
interrelated. We adopted the Bonferroni-Holm procedure to correct the significance
evaluation level, accommodating the fact that there are four hypotheses (H2 and H2a-c) in
this family. This procedure ranks the four hypotheses by their p-values from smallest (most
significant) to largest (least significant). The smallest p-value reported here is for H2 (p = 0.012).
The most significant p-value is evaluated against an adjusted α level of 0.05/4 (= 0.0125),
based on the fact that there are four interrelated hypotheses. Since 0.012 < 0.0125, H2 was
supported. We can move on to evaluate the second smallest p-value of 0.015, generated from
H2(b), comparing connectivity and redundancy. This p-value is evaluated against an updated
α level of 0.05/(4-2+1) = 0.017. Because 0.015 < 0.017, H2(b) was supported. The remaining
hypotheses were not significant under the adjusted α levels, and we omit the analysis process
for brevity (see Dmitrienko et al., 2009 for a detailed explanation of this algorithm). H2(a) and
H2(c) thus received no support.
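The same decisions can be reproduced with R's built-in Holm adjustment, which converts the raw p-values into adjusted p-values that are compared directly against α = 0.05 (a one-line check using the p-values reported above):

p <- c(H2 = 0.012, H2b = 0.015, H2a = 0.034, H2c = 0.20)
p.adjust(p, method = "holm")
#    H2   H2b   H2a   H2c
# 0.048 0.048 0.068 0.200   -> only H2 and H2(b) fall below 0.05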
The findings show that connectivity-driven natural selection is indeed significantly
different from the other groups, judging by the supported H2(b); the difference between the
redundancy and embeddedness groups was not significant. Recall that the mean value of
NSconnectivity is 0.52, the mean of NSembeddedness is 0.49, and the mean of NSredundancy is 0.48.
The difference between NSconnectivity and NSredundancy is significant, so
NSconnectivity > NSredundancy. The difference between NSembeddedness and NSredundancy is not
significant, so NSembeddedness = NSredundancy. Taking these results together, the three vectors can
now be ranked in the order NSconnectivity > NSembeddedness = NSredundancy.
5.4 Results for Hypothesis 3
H3 proposed to directly examine the causal relationship between network configurations
and content exploration over time in a knowledge creation system. Although this hypothesis does
not directly involve an analysis of the evolutionary forces (natural selection pressure), it aims to
unpack the causal mechanism that could potentially explain what we observed in H2 – why
certain network configurations drive the evolutionary outcomes. The proposed explanation
is that if network configurations describe some aspects of the information flow brought about by the
editors' activities, it should be possible to observe variance in the level of content exploration as
a direct result of changes in network configurations.
Descriptive statistics for these measures are reported in Table 5.1. The table details the
key metrics for 394 articles and their correlations.
The estimates for the univariate ARIMA model and the transfer function models that take
exogenous variables into account are reported in Table 5.2. Note that ARIMA and
transfer function models are two different techniques: an ARIMA model is a univariate model that
deals only with the outcome series, while transfer function models are regression techniques that
use exogenous variable series to predict the outcome series. ARIMA models are commonly used
to describe the temporal variance of a time series. Transfer function models allow modeling
the temporal variance of a target time series while accounting for influences from other time series
(predictor time series). Specifically, in this case, transfer function models were used to test whether
changes in network configurations cause changes in the level of content exploration at a future
time (Demirhan, 2020; Enders, 2014). The models are presented side by side for the purpose of
comparing how these two techniques predict the response variable in different ways. In the
current project, transfer function models are used to test the hypotheses because we care
about analyzing the lagged effects of network metrics on content exploration, which cannot be
addressed with ARIMA models alone. Because they are two distinct modeling techniques
assessing different things, coefficients from Model 1 (ARIMA model) and Models 2 and
3 (transfer function models) should not be directly compared against each other. The ARIMA
model uses its own history to predict itself, so the coefficient terms are based on its own lags,
cyclicality, and error lags. The TF model not only accounts for the variance of a series through its own
history, but also includes the impacts of other predictors. Model 2 and Model 3 are generated
using the same technique and can thus be directly compared.
Table 5.1
Descriptive Statistics and Correlations

Variable                      Mean       St. Dev.
1. Content length             4,810.19   79.51
2. Num of references          7.58       0.36
3. Num of internal links      40.41      0.95
4. Num of templates           12.12      0.56
5. Num of categories          4.29       0.1
6. Images by length           0.000419   0.0001
7. Num of lvl2 headings       7          0.09
8. Flesch reading score       47.3       0.17
9. Coleman-Liau index         15.1       0.03
10. Difficult words           231.37     3.44
11. Degree centrality         0.21       0.09
12. Closeness centrality      0.002      0.001
13. Betweenness centrality    0.001      0.001
14. Eigenvector centrality    0.36       0.08
15. PageRank                  0.04       0.01
16. Hub                       0.29       0.09
17. Constraint                0.36       0.05
18. Transitivity              0.55       0.11
19. Effective size            0.8        0.16
20. Density                   0.66       0.08
21. Content exploration       11.45      5.62

Correlations (variables identified by number; upper triangle):

        2      3      4      5      6      7      8      9     10     11     12     13     14     15     16     17     18     19     20     21
1    0.99   0.99   0.96   0.84   0.87   0.97  -0.72   0.3      1  -0.25   0.76  -0.52  -0.06   0.36  -0.26  -0.11  -0.43  -0.25  -0.44  -0.78
2           0.99   0.98   0.86   0.88   0.96  -0.69   0.27   0.99  -0.23   0.77  -0.51  -0.03   0.36  -0.23  -0.14  -0.44  -0.25  -0.46  -0.74
3                  0.97   0.84   0.88   0.96  -0.72   0.3    0.99  -0.26   0.78  -0.53  -0.05   0.39  -0.26  -0.11  -0.46  -0.28  -0.48  -0.76
4                         0.89   0.89   0.92  -0.59   0.17   0.95  -0.2    0.78  -0.45   0.02   0.37  -0.18  -0.19  -0.45  -0.28  -0.48  -0.64
5                                0.94   0.81  -0.35  -0.09   0.83  0.002   0.55  -0.19   0.21   0.11   0.03  -0.36  -0.24  -0.06  -0.27  -0.41
6                                       0.86  -0.49   0.03   0.87  -0.09   0.59  -0.31   0.11   0.19  -0.08  -0.27  -0.29  -0.12  -0.31  -0.56
7                                             -0.77   0.39   0.98  -0.18   0.77  -0.47  -0.11   0.24  -0.25  -0.14  -0.28  -0.13  -0.29  -0.83
8                                                    -0.8   -0.74   0.35  -0.66   0.54   0.37  -0.34   0.46  -0.19   0.28   0.19   0.25   0.92
9                                                            0.32  -0.4    0.52  -0.42  -0.58   0.34  -0.55   0.43  -0.17  -0.18  -0.15  -0.67
10                                                                 -0.24   0.75  -0.51  -0.08   0.34  -0.26  -0.11  -0.4   -0.22  -0.41  -0.79
11                                                                        -0.51   0.38   0.76  -0.89   0.93  -0.88   0.82   0.78   0.8    0.28
12                                                                               -0.54  -0.37   0.65  -0.52   0.28  -0.57  -0.52  -0.59  -0.67
13                                                                                       0.12  -0.53   0.3   -0.21   0.57   0.65   0.56   0.57
14                                                                                             -0.5    0.92  -0.82   0.33   0.36   0.32   0.38
15                                                                                                    -0.74   0.72  -0.94  -0.9   -0.93  -0.25
16                                                                                                           -0.85   0.64   0.6    0.63   0.43
17                                                                                                                  -0.56  -0.63  -0.52  -0.1
18                                                                                                                          0.9    0.99   0.2
19                                                                                                                                 0.9    0.14
20                                                                                                                                        0.19
Table 5.2
Model Estimates Predicting Content Exploration

                                   Model 1:            Model 2:              Model 3:
                                   ARMA model          TF model with only    TF model with
                                                       dependent series      input series
Intercept                          -0.15 (0.004) ***   -0.02 (0.02)          -0.09 (0.02) ***
Endogenous series:
Content exploration - AR(1)         0.93 (0.007) ***
Content exploration - MA(1)        -0.99 (0.02) ***
Content exploration - lag 1                             0.31 (0.02) ***       0.23 (0.07) ***
Content exploration - lag 2                             0.12 (0.08)           0.12 (0.07) *
Content exploration - lag 3                             0.18 (0.07) **        0.16 (0.06) **
Content-based exogenous series:
Content length                                                                0.01 (0.005) ***
Content length - lag 2                                                        0.008 (0.005) ***
Content length - lag 10                                                       0.01 (0.005) ***
Network-based exogenous series:
H3(a) Network embeddedness
Degree - lag 9                                                                1.92 (0.72) ***
Closeness - lag 7                                                             724.56 (174.06) ***
PageRank - lag 2                                                              84.13 (28.76) ***
H3(b) Network connectivity
Density - lag 2                                                               5.90 (2.01) ***
Density - lag 6                                                               -2.51 (1.20) **
H3(c) Network redundancy
Constraint - lag 2                                                            -4.35 (2.10) **
Effective size - lag 7                                                        1.30 (0.42) ***
Adjusted R²                         0.37                0.28                  0.41
RMSE                                4.64                4.50                  3.38
AIC                                 0.88                0.82                  0.50
BIC                                 0.93                0.90                  0.78
Log likelihood                     -68.36              -61.93                -23.32
F-statistic                        48.94 ***           20.14                  6.02 ***

Note. *p < 0.1, **p < 0.05, ***p < 0.01. Standard errors in parentheses.
5.4.1 Modeling Content Exploration as an ARIMA Process: Model 1
In order to fit an ARIMA model for the dependent series, a stationarity check is
necessary. The raw series of content exploration was found to be non-stationary, judging
by the Augmented Dickey-Fuller test (ADF test statistic = 0.36, p = 0.29). The large p-value
indicates that we cannot reject the null hypothesis of non-stationarity, and thus the raw series of
content exploration needs to be differenced first to ensure stationarity. A first-order difference is
enough to make the raw series stationary, judging by the ADF test: the first-differenced series
yields an ADF test statistic of -7.8 (p = 0.01), indicating that the null hypothesis of non-stationarity
is rejected and the differenced series is stationary and can be used for ARIMA modeling
(Demirhan, 2020; Enders, 2014).
Given that a first-order difference is enough to achieve stationarity, we proceed to
construct an ARMA model for the differenced series in the form ARMA(p,q). To identify the
appropriate p and q values, we start with p and q at 1 and then proceed to higher values
until the ARIMA model diagnostics, including the Akaike information criterion (AIC), Bayesian
information criterion (BIC), and Ljung-Box Q tests, are satisfactory. The process of testing
different p and q values is presented in Table 5.3. The table shows that the smallest AIC and BIC
values were achieved for the ARMA(1,1) model, with AIC = 0.88 and BIC = 0.93. This model also
passes the Ljung-Box Q test (p = 0.67); the large p-value indicates that the null hypothesis of
no autocorrelation cannot be rejected, and thus the residuals are similar to white noise. We
conclude that the ARMA(1,1) model is the best model. This model generated an adjusted R² value
of 0.37.
Table 5.3
ARMA Model Comparisons

p   q   Adjusted R²   AIC    BIC    Q-test p-value
0   1   0.18          1.13   1.19   0.29
0   2   0.22          1.07   1.15   0.97
1   0   0.26          1.01   1.07   0.04
1   1   0.37          0.88   0.93   0.67
1   2   0.36          0.88   0.98   0.58
2   0   0.29          0.98   1.06   0.24
2   1   0.36          0.89   0.98   0.69
2   2   0.37          0.89   1.01   0.59
5.4.2 Modeling Content Exploration with Network Configurations as Exogenous Variables:
Models 2 and 3
The remaining input series were entered into the transfer function models to evaluate their
specific effects on the dependent series (Demirhan, 2020; Enders, 2014; Pesaran & Shin, 1998).
The estimation used the statistical software EViews 10.
Specifying this model first requires determining how many past lags of the dependent
variable to include and how many past lags of the exogenous variables to include.
Each exogenous variable can have a different lag order, but a maximum order
needs to be determined for the model to run. For the dependent series, given our knowledge from
estimating its ARIMA model, we can reasonably assume that the most recent one or two lags of the
dependent variable should have some impact on the current value. Thus, the maximum lag length
of the dependent series was set to three (one lag longer than what we had already observed) to
make sure the procedure evaluates all possible candidate lags. If the higher-order lags are not
useful in the model, they are screened out in the automatic model selection process. For the input
series, the maximum lag length was set to ten. This is an arbitrary cutoff set by the researcher to
tell the software when to stop searching for an optimal model. The choice of maximum
lag length should be reasonably meaningful given the observation frequency. For example, in a
review article, Ivanov and Kilian (2005) reported that empirical studies using quarterly data often
choose a lag order of 4, which means researchers include the last four quarters' (i.e., the past
year's) influence on the current period; studies with monthly observations often choose 6, 12, or 18
as maximum lag lengths, which correspond to half a year, a year, or a year and a half of
data. In the current project, a maximum lag length set too long would make the results hard to
interpret meaningfully. For example, it is hard to meaningfully interpret why a network
structure would still affect content exploration after five months. A maximum lag length
set too low (say, only taking into account one or two past lags) would omit important predictors
in the mid-to-long term. This also fits the model specification strategy recommended by Johnston and
Dinardo (1996): it is better to start with a more complicated model specification, both in terms
of included variables and lag structure, and then move backwards to reduce the number of
regressors in the model. Based on the research context of this project, a maximum lag length of
ten means the model evaluates the input series' impact on content exploration over the following
two and a half months.
Second, differenced data rather than raw data are used because Augmented Dickey-Fuller
tests for all the raw series could not reject non-stationarity at the 5% critical value (see Table 5.4).
In contrast, the Augmented Dickey-Fuller tests suggest that all first-differenced series are stationary:
the test statistics are all smaller than the critical values at the 1% level. Thus, first-order differenced
data are used instead of the raw data to reduce correlations among lag terms of the input series.
This helps to better identify their independent effects on the dependent series. In short, these
differenced series can be interpreted as the "weekly change" of a variable from the previous week.
This also makes the TF model results comparable to the ARIMA model results, because both
use first-differenced content exploration as the dependent series.
Table 5.4
Augmented Dickey-Fuller Test Results

Variable                                  ADF test statistic   1% critical value   5% critical value   Test result
Content exploration                        0.36                -3.47               -2.87               Non-stationary
First-differenced content exploration     -7.83                -2.57               -1.94               Stationary
Degree centrality                         -2.78                -4.01               -3.43               Non-stationary
First-differenced degree centrality       -4.39                -3.47               -2.87               Stationary
Closeness centrality                      -1.86                -4.01               -3.43               Non-stationary
First-differenced closeness centrality   -11.42                -3.47               -2.87               Stationary
PageRank                                  -1.81                -4.01               -3.43               Non-stationary
First-differenced PageRank               -12.50                -3.47               -2.87               Stationary
Content length                            -2.62                -4.01               -3.43               Non-stationary
First-differenced content length         -12.40                -3.47               -2.87               Stationary
Constraint                                -3.09                -4.01               -3.43               Non-stationary
First-differenced constraint              -4.35                -3.47               -2.87               Stationary
Effective size                            -2.67                -4.01               -3.43               Non-stationary
First-differenced effective size         -15.40                -3.47               -2.87               Stationary
Density                                   -1.73                -4.01               -3.43               Non-stationary
First-differenced density                -12.28                -3.47               -2.87               Stationary

Note. For brevity, this table only presents variables included in the final model.
Model 2 shows a TF model that takes into account only the endogenous series – content
exploration's own history – as predictors of the current value of content exploration. Content
exploration is a significant predictor of itself at lag 1 (0.31, p < .01) and lag 3 (0.18, p < .05). This
shows that the past history of content exploration matters for its current value – a conclusion
consistent with what we obtained using ARIMA modeling. This serves as a baseline
model for Model 3, which includes all exogenous variables as predictors in addition to the
endogenous series.
A final model is returned by automatically selecting an optimal combination of lag
lengths for each of the significant predictor variables, judged by the AIC information criterion.
Specifically, the selection of specific lag lengths used a backward stepwise selection
algorithm provided by EViews. This automatic process works as follows. The
algorithm starts with the most complicated model (all variables with 10 lag lengths) and removes
the variable with the largest p-value (i.e., the least important in the model) at each step. Each
updated model is evaluated by its AIC value; if the removal of that variable leads to a smaller
AIC value, the algorithm proceeds with the updated version. At each successive removal from
the model, all previously removed variables are automatically checked against the selection
criterion of AIC again, and can potentially be re-added if adding one back improves the AIC value.
The algorithm then proceeds to the next largest p-value, and the procedure continues. The model
with the smallest AIC value is returned as the final model. After this process, the best-fit
model obtained is presented as Model 3 of Table 5.2.
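EViews implements this search internally; a conceptually similar AIC-guided search can be sketched in R with step(), assuming a hypothetical data frame dat whose columns are the differenced response (dy) and a few candidate lagged predictors (placeholder names and data, not the actual model terms):

# hypothetical design matrix with the differenced response and candidate lags
dat <- data.frame(dy = rnorm(150),
                  degree_l9  = rnorm(150),
                  density_l2 = rnorm(150),
                  density_l6 = rnorm(150))

full <- lm(dy ~ ., data = dat)
# AIC-guided search that both drops and re-adds terms, similar in spirit
# to the backward stepwise routine described above
best <- step(full, direction = "both", trace = 0)
summary(best)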
There are seven significant exogenous variables left in the final model and each is
interpreted below.
The variable "content length" is a positive and significant predictor of content exploration
at the current period (0.01, p < .01), at lag 2 (0.008, p < .01), and at lag 10 (0.01, p < .01).
This means that the longer an article's content (higher content length), the more likely the
article is to explore a wide variety of content options during its development (high content
exploration). This is probably because adding length to an article means the editors are devoting
substantial effort to it. Longer length is often a strong signal that many people are interested in
the article and are working on it. When many different editors join the same project, different
opinions can powerfully pull and shape the trajectory of the article's content. Note that this
variable is not part of the hypothesis testing (it is not a network-based metric) but was entered
into the model for the purpose of finding a better model fit and presenting complete information.
H3(a) stated that over time, higher levels of network embeddedness in a knowledge
creation system will cause subsequently higher levels of content exploration. This hypothesis
received support. Three metrics describing network embeddedness are positive and
significant predictors of future content exploration levels. First, degree centrality is a
positive and significant (1.92, p < .01) predictor of content exploration at lag 9. This means that
an article with higher degree centrality will enjoy higher levels of content exploration in the
future. Second, closeness centrality is a positive and significant (724.56, p < .01) predictor of
content exploration at lag 7. This means that the more centrally located an article is, judging by
its closeness centrality in the network, the more likely it is to explore a wide variety of content
options. Third, PageRank centrality is a positive and significant (84.13, p < .01) predictor of
content exploration at lag 2: an article's PageRank centrality positively predicts its content
exploration in the future. PageRank centrality, closeness centrality, and degree centrality are all
related to network embeddedness; thus, H3(a) is supported.
H3(b) stated that over time, higher levels of network connectivity in a knowledge
creation system will cause subsequently higher levels of content exploration. One metric
describing connectivity is a significant predictor of future content exploration levels.
Density positively predicts content exploration at lag 2 (5.90, p < .01) and negatively
predicts content exploration at lag 6 (-2.51, p < .05). Overall, the larger positive coefficient at
lag 2 outweighs the negative coefficient at lag 6, so the overall effect of density on content
exploration remains positive. Density describes the extent to which all possible links are realized in
the network. The positive effect means that articles with densely connected neighborhoods are
more likely to explore a wide variety of options in the process of content development, as
expected. This lends support to H3(b).
H3(c) stated that over time, lower levels of network redundancy in a knowledge creation
system will cause subsequently higher levels of content exploration. This part of the analysis is
also supported. First, as expected, constraint is a negative and significant predictor of content
exploration at lag 2 (-4.35, p < .05). Second, effective size is found to positively predict content
exploration at lag 7 (1.30, p < .01). Recall that effective size and constraint point in
opposite directions: higher effective size reflects rich and diverse (less redundant) network
connections, so this result also supports the hypothesis. Thus, H3(c) was supported. More redundant
network connections are harmful for content exploration in the future.
Overall, Model 2 has an R² value of 0.28, a decrease from Model 1's R² of 0.37, showing
that when the TF modeling technique uses only the history of content exploration to predict its
current value, it does not explain as much of the variance as the ARMA model did. After adding
the exogenous series into the TF model, as shown in Model 3, the TF modeling technique generates
the best model for predicting content exploration. The R² of Model 3 is 0.41, the highest among
all three models. In addition, Root Mean Squared Error (RMSE), AIC, and BIC are three measures
used to describe model fit. RMSE is the square root of the average of squared residuals. Smaller
values of RMSE, AIC, and BIC indicate a smaller discrepancy between actual and fitted values,
and thus a more desirable model fit. Compared to Model 1 and Model 2, the RMSE of Model 3
is the smallest (RMSE = 3.38). Also, the AIC (0.50) and BIC (0.78) values of Model 3 are smaller
than in the prior models, indicating a good model fit.
In addition, two tests were conducted to ensure that the final model generated random
residuals – an indicator that the current model explains away most of the variance and that what
remains is just white noise. First, a Ljung-Box test was conducted to check whether the residuals
are independent of each other (i.e., not autocorrelated). The large p-value (Q statistic
= 1.46, p = 0.23) suggests that these autocorrelations are not significantly different from zero
and the residuals are white noise. Second, an Augmented Dickey-Fuller test of the residuals (test
statistic = -5.14, p = 0.01) shows that the null hypothesis of non-stationarity is rejected,
confirming the stationarity of the residuals. Overall, the model fit is satisfactory.
To further analyze the relative contribution of each set of exogenous variables, three
additional models are presented in Table 5.5, which shows separate models that take into
account one set of network-based exogenous variables at a time. Specifically, the model that adds
only the H3(a)-related variables about network embeddedness (variable terms extracted from the full
model) explains nine percentage points more variance than the baseline model in Table 5.5.
The model that adds only the H3(b)-related variables about network connectivity explains six
percentage points more variance than the baseline model. The model that adds only the H3(c)-related
variables about network redundancy explains one percentage point more variance than the baseline
model. Judging by this analysis, network embeddedness is the set of network-based exogenous
variables that contributes most to explaining future values of content exploration.
Note that the three separate models in Table 5.5 should not be interpreted as stepwise-
entered models. The regression terms of the three separate models were extracted from the full
model purely for the purpose of comparison. The full model, as explained earlier, was obtained by
using an automated backward selection algorithm. For each separate model, the specific terms
and lag lengths were kept exactly as they appear in the full model. For
example, Model 1 represents an evaluation based on the three significant regression terms related
to H3(a) in the full model – degree centrality at lag 9, closeness centrality at lag 7, and PageRank
centrality at lag 2. This does not mean that Model 1 is the best-fit model for the network embeddedness
variables, because that would require reconsidering all possible combinations of
network embeddedness variables and lag structures.
Table 5.5
Examining Each Network-based Exogenous Series for Predicting Content Exploration

(Baseline: with endogenous and content-based series. Model 1: adding only H3(a) input series.
Model 2: adding only H3(b) input series. Model 3: adding only H3(c) input series.)

                                 Baseline           Model 1               Model 2            Model 3            Full model
Intercept                        -0.09 (0.03) **    -0.12 (0.03) ***      -0.11 (0.03) ***   -0.11 (0.03) ***   -0.09 (0.02) ***
Endogenous series:
Content exploration - lag 1       0.24 (0.08) ***    0.16 (0.08) **        0.188 (0.08) **    0.23 (0.08) ***    0.23 (0.07) ***
Content exploration - lag 2       0.08 (0.08)        0.10 (0.07)           0.10 (0.07)        0.10 (0.08)        0.12 (0.07) *
Content exploration - lag 3       0.08 (0.07)        0.12 (0.07)           0.12 (0.07)        0.06 (0.07)        0.16 (0.06) **
Content-based exogenous series:
Content length                    0.01 (0.005) *     0.01 (0.005) **       0.01 (0.005) **    0.009 (0.005) *    0.01 (0.005) ***
Content length - lag 2            0.006 (0.005)      0.006 (0.005)         0.006 (0.005)      0.006 (0.005)      0.008 (0.005) ***
Content length - lag 10           0.01 (0.005) **    0.01 (0.005) **       0.01 (0.005) **    0.01 (0.005) **    0.01 (0.005) ***
Network-based exogenous series:
H3(a) Network embeddedness
Degree - lag 9                                       1.78 (0.77) **                                              1.92 (0.72) ***
Closeness - lag 7                                    592.79 (174.98) ***                                         724.56 (174.06) ***
PageRank - lag 2                                     1.06 (11.70)                                                84.13 (28.76) ***
H3(b) Network connectivity
Density - lag 2                                                            1.93 (1.26)                           5.90 (2.01) ***
Density - lag 6                                                            -3.9 (1.28) ***                       -2.51 (1.20) **
H3(c) Network redundancy
Constraint - lag 2                                                                            -0.24 (1.20)       -4.35 (2.10) **
Effective size - lag 7                                                                        0.54 (0.24) *      1.30 (0.42) ***
Adjusted R²                       0.24               0.33                  0.30               0.25               0.41
ΔR² (compared with baseline)                         0.09                  0.06               0.01               0.17
RMSE                              4.05               3.82                  3.91               4.03               3.38
AIC                               0.70               0.63                  0.66               0.68               0.50
BIC                               0.84               0.83                  0.84               0.84               0.78
Log likelihood                   -46.68             -37.91                -41.46             -42.84             -23.32
F-statistic                       4.04 ***           4.89 ***              4.47 ***           3.22 ***           6.02 ***

Note. *p < 0.1, **p < 0.05, ***p < 0.01. Standard errors in parentheses.
5.5 Summary of Hypothesis Testing Results
The hypothesis testing results are summarized in Table 5.6.
Table 5.6
Hypothesis Testing Results

Hypothesis   Results                              Findings
H1           Not supported                        There is no significant difference between content-based and
                                                  network-based characteristics in driving evolutionary outcomes.
H2           H2 supported; H2(b) supported;       Network connectivity identifies the strongest selection pressure,
             H2(a) and H2(c) not supported        compared to the other two network-based characteristics
                                                  (network embeddedness and network redundancy).
H3           H3(a), H3(b), and H3(c) supported    Higher levels of network embeddedness cause content exploration
                                                  to increase in the future, which supports H3(a); higher levels of
                                                  network connectivity cause content exploration to increase in the
                                                  future, which supports H3(b); higher levels of network redundancy
                                                  cause content exploration to decrease in the future, which
                                                  supports H3(c).
CHAPTER 6: DISCUSSION
This chapter first outlines how the empirical findings relate back to the theory and how
these findings matter in practical terms. The next section discusses the limitations of this study.
The third section focuses on future work that can address these limitations. The final section
summarizes and concludes the study.
Knowledge collaboration networks have a set of new characteristics that challenge the
traditional knowledge creation model (Lee & Cole, 2003). Applying evolutionary theory to
knowledge creation processes in online environments, this dissertation examines the relationship
between the network-based characteristics of knowledge artifacts and the evolutionary outcomes of
these artifacts. Interestingly, the network structures describe the connections between human
actors (the editors of Wikipedia) and the information products they created (the articles of Wikipedia).
This study thus points to the importance of understanding networks as channels of information
flow, and it empirically tests whether the configuration of these networks matters for the
evolutionary outcomes.
The first hypothesis is a replication of a recent theoretical development in organizational
evolutionary theory proposed by Hilbert et al. (2016). Their theory suggests that, in addition to
content-based characteristics, network-based characteristics are also important in driving
evolutionary success. In fact, seven out of the eight network populations examined in their study
showed that network-based characteristics were more influential than content-based ones. The
current study did not find support for this hypothesis, as the results show no significant
difference between content-based and network-based characteristics in driving evolutionary
outcomes. The results suggest that the network-based characteristics are just as important as the
content-based ones, which emphasizes the importance of considering both types of
characteristics when analyzing content evolutionary processes.
The second hypothesis elaborates on the first hypothesis and asks, which specific
network-based characteristics matter more in terms of driving evolutionary success? This study
examined a total of ten network configurations along three theoretical dimensions – network
embeddedness, network connectivity, and network redundancy. The results show that overall,
different network configurations influence the evolutionary outcomes at different rates. Network
connectivity identifies the strongest selection pressure, compared to the other two types of
network configurations (network embeddedness and network redundancy).
The third hypothesis tries to unpack the underlying mechanism of the relationship
between network configurations and content creation results. This hypothesis does not directly
use evolutionary outcomes as the response variable of interest. Instead, it focuses on a particular
metric that describes editors' strategies in the content development process. Content
exploration was used to measure the extent to which editors explore a wide variety of content options
before settling on a conclusion or a final product: the more widely people explore, the higher the
content exploration level. Time series models show that network configurations have significant
predictive value for future content exploration levels. This means that network
configurations causally influence the way people explore different options when they
collectively create an information product. The structure of an article's
network will, to some extent, determine the content creation outcomes in the future.
6.1 Theoretical Implications
The current study provides several important theoretical implications for the stream of
literature on ecological evolution and natural selection. First, it builds on a recent development of
natural selection theory in the realm of socio-cultural evolution, which suggests that there are
analytical gains to be made by taking into account how network-based characteristics
drive evolutionary changes, in addition to content-based characteristics. The current research
attempted to provide a direct application of this novel theoretical development (Hilbert et al., 2016)
in the context of an online knowledge creation system but found no evidence to support the
previously established finding that network-driven evolution is stronger than content-driven
evolution. However, this does not mean that network-driven evolution is unimportant; instead,
based on what was found in this study, network-driven evolution is just as important as
content-driven evolution.
This seemingly unexpected result can be explained partly by the fact that the dataset used
here is quite limited. The current study used only one WikiProject, and there is no evidence to
support a more conclusive finding about whether this might be the case for other WikiProjects.
In the prior investigation (Hilbert et al., 2016) that reported network configurations as
stronger influences on natural selection than content-based ones, a careful examination of the
findings shows that among the eight datasets used, at least one dataset (the PBS
documentary videos network) did not support the general conclusion that content-based
characteristics lead to less extreme natural selection. A comparison across all datasets (that
is, combining the eight datasets into one) reported that in about 20% of all cases, network-
based characteristics produced natural selection significantly higher than the content-based
group, and in around 3-4% of cases content-based characteristics were more
powerful than network-based characteristics. This leaves about 76-77% of cases generating a
statistically insignificant result between the two groups. In other words, for around 76-77% of
the cases, it is not possible to observe a statistically significant difference between content-based
and network-based characteristics. So, the seemingly inconsistent findings between the current
project and prior research might result from the fact that the current data sample happens to fall
into that 76-77% of cases, making it difficult to observe a difference between groups.
Another possible explanation for this result is that network-based characteristics are just
as important as the content-based characteristics. The datasets used by Hilbert et al. (2016) contained
one exception to their general claim that network-driven evolution is more influential than
content-driven evolution. That exception is also an online information sharing community (the YouTube
video network), which may share some similarities with Wikipedia, an online information
creation network. It is thus possible to speculate that networks of informational artifacts may indeed
carry some characteristics that are distinct from other organizational networks made up of
human actors or international trade networks made up of countries. The information contained in
the artifacts is not overpowered by the information contained in the network structural configurations.
Second, the current study also contributes to the network research literature by comparing
which group of network configurations is relatively more influential
in driving evolutionary change. To the best of our knowledge, this is the first study to make such
a comparison. While many network signals have been considered important and
commonly used in prior research on Wikipedia networks or co-creation networks in general,
most of these studies examine each network signal in isolation. Thus, it is difficult to
know whether any of these network metrics are actually more useful than others. This study
provided a fuller examination of these network metrics by comparing ten commonly used
network metrics in terms of their relative influence in driving evolutionary change (H2). The results of
H2 showed that network connectivity identifies the strongest selection pressure, compared to the
other two network-based characteristics (network embeddedness and network redundancy). This is
somewhat surprising, because network embeddedness is a widely used choice when discussing
network structural signals (Brandes et al., 2016) in association with the production outcomes of
knowledge creation systems.
Natural selection most prominently picks up signals generated by variance in the network
connectivity of an article's local neighborhood, more than by other ways of measuring network
activities. Specifically, density and transitivity were the two metrics representing network
connectivity in a knowledge creation system. They were important in predicting evolutionary
changes because they reflect how much an article is connected with other articles in the co-editor
network. The links in such a network indicate information flow, knowledge flow, and the possible
exchange of good practices and community management policies among the editor community.
Network connectivity matters because a more connected network means that more
links are established in the focal article's local neighborhood, thus giving the article more potential
resources (both duplicate and novel). Network embeddedness, on the other hand,
mostly reflects how much bridging power an article possesses compared to other articles. It was
considered an important network configuration because, in a typical human-to-human network, a
centrally positioned node (high network embeddedness) has a lot of power to "manipulate"
others by dominating the information flow and deciding what to share or not share. The network
constructed here is not a human-to-human network; instead, it is an article-to-article
network where the articles themselves are not active agents that can wield power and
dominate the network in active ways. Articles passively depend on the network links that
human editors activate to bring in information as resources. Thus, network embeddedness may
not matter as much as network connectivity in this context.
In addition, the results also highlight that capturing a local neighborhood is a more
meaningful way to quantify an information flow network than a complete network in an online
knowledge creation system. Relatively speaking, the structure of a local network matters more
than the global network in this context. This is because an article does not really exchange
information and resources with an article as far as, say, 10 degrees away in the network. It is hard to
interpret how information flows between two articles that are, for example, 10 degrees away
from each other. They are "linked" in network terminology because an editor who worked on
article #1 then worked on article #2 and then #3, and so on; after 10 moves, the editor arrived at
article #10. It does not make much sense to say that information can be carried freely
between these nodes. On the other hand, a smaller local neighborhood, such as the two-degree
neighborhood in this study, makes more sense in explaining how information and resources
actually travel through the network. Articles are connected in a local neighborhood network
because they have mostly been edited by the same group of people – people who work on several
articles that are interrelated in topic. These editors have read useful information, content-wise or
writing-style-wise, that can be directly applied to the focal article. Beyond a small local
environment, it is unclear to what extent this transferability of information still exists.
These results should not be interpreted to mean that network embeddedness and network
redundancy do not matter. In fact, the NS values for all three types of network configurations
range between 0.4 and 0.5, and none of them is distinctively larger or smaller than the others. This
finding thus provides a more nuanced interpretation of several network metrics that have all
been considered important in the prior literature. This study points out that researchers should be
quite careful when applying network embeddedness measures to understand a non-human
network. The results also call for more thinking about context-specific explanations of what
these network metrics actually represent under different network construction methods.
Third, this study aimed to identify the causal mechanism explaining why network configurations
can be associated with evolutionary outcomes. The first two hypotheses in this project
tested whether different network configurations attract different levels of natural selection pressure in
an online knowledge creation system, and the results were positive. The third hypothesis thus tried
to understand why network configurations matter in terms of content creation: is there a causal
relationship between network configurations and content development processes?
In the literature, there is ample evidence showing correlational connections between
network configurations and knowledge creation outcomes (Lazer & Friedman, 2007; Qin &
Cunningham, 2012; Ren & Yan, 2017), but few studies use a longitudinal design to directly
make causal inferences (Zhang et al., 2017). This study suggests that network configurations
matter because they directly cause the level of content exploration to vary. Content
exploration is a key variable that has been shown to influence the outcomes of knowledge
creation. In general, content exploration reflects a search strategy often used to find
optimal solutions in a feature space. A problem solver generally has two options: to widely
explore many different paths before finalizing on something, or to explore fewer paths and focus
on the prior path to quickly reach an end. The choice between these two strategies is summarized
as a choice between exploration (high content exploration) and exploitation (low content
exploration). Adopting a novel approach to quantifying text-based content exploration, this study
captures how much content exploration was directly caused by changes in network
configurations.
Three network embeddedness measures were found to be particularly important based on
the results of time series transfer function modeling: degree centrality, closeness centrality, and
PageRank centrality. Network configurations reflect the patterns of editors' activities in a
community; thus, it was hypothesized that these activity patterns directly leave traces in the
content developed by these editors. These three centrality measures show the extent to which an
article is connected to other articles via the editors' co-editing behaviors. Higher degree
centrality means that the editors who worked on the focal article also worked on many other articles.
Higher closeness centrality indicates that an article is at a central position, with easy access
(short distances) to many other articles in the same network. High PageRank centrality
reflects the importance of the article's neighbors (other articles that its editors also worked
on). If the editors worked on many important articles in the same domain area, the focal article is
also assumed to enjoy the rich resources and information brought by these editors.
These metrics are commonly interpreted as measures of the level of embeddedness of
nodes in a network. They were shown to have positive and significant influences on future content
exploration levels. This confirmed the hypothesis that for a collaborative knowledge
production project like Wikipedia, the primary resources are information and knowledge.
Network embeddedness is desirable because it facilitates the flow of information via editors
working on different projects, and ultimately helps with the combination and exchange of
information. The editors accumulated experience and knowledge by working in this topic area,
and they knew how to create content on Wikipedia by engaging with the community to learn
best practices. They were also likely to be knowledgeable in this specific topic area. So high
embeddedness means the article attracts many editors to work on it, and these people bring diverse
knowledge and resources to the focal article because of their past experiences working on other
articles. The result adds support to the general notion that diversity of information leads to higher
exploration.
Network connectivity, measured by density, was also found to have a positive influence
on the level of content exploration. Articles with a well-connected local neighborhood are more
likely to have access to diverse and useful information from neighboring articles. Editors have
learned from their past experiences working on similar articles in the same topic area, and their
knowledge and experience may thus be transferable to the focal article. This again confirmed the
general idea that editors bring in information, knowledge, and experience by working on
different projects. The rich experience of editors in a topic area, manifested as a well-connected
local neighborhood, can help an article explore more content options.
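As a minimal illustration of this connectivity measure, the sketch below computes the density of
an article's ego network (the focal article, its co-editing neighbors, and the ties among them)
with networkx; the co-editing data are again hypothetical.

import networkx as nx

# Hypothetical co-editing network, as in the earlier sketch.
G = nx.Graph([("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")])

# Density of article A's local neighborhood: the share of possible ties
# among A and its co-editing neighbors that are actually present.
ego = nx.ego_graph(G, "A")
print(nx.density(ego))  # 1.0 would mean a fully connected neighborhood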
Hypothesis 3(c), concerning network redundancy’s negative influence on content
exploration, also received support. Two indicators of network redundancy, constraint and
effective size, were found to negatively predict content exploration, as expected. High network
redundancy means that the network contains much repeated information, which is not as useful
in facilitating higher levels of content exploration. Editors working on these articles are unable
to bring in novel and diverse information because they are all similarly limited by their
experiences in the same set of projects. What is known by one editor is also known by another,
so they lack good ways to find the new resources needed to explore other possibilities in content
creation. This again confirmed the general belief that diversity of network connections helps
increase exploration levels, while repetitive information may harm a network’s ability to
explore.
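Similarly, the sketch below computes the two redundancy indicators with networkx's
structural-holes measures (Burt, 1992) on the same kind of hypothetical co-editing data.

import networkx as nx

# Hypothetical co-editing network, as in the earlier sketches.
G = nx.Graph([("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")])

constraint = nx.constraint(G)          # Burt's constraint: higher values
                                       # indicate more redundant ties
effective_size = nx.effective_size(G)  # number of non-redundant contacts

for article in sorted(G):
    print(article, round(constraint[article], 2),
          round(effective_size[article], 2))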
6.2 Methodological Strengths
The research design employed in this study has at least three distinct advantages. First,
the present research represents a unique approach to the study of evolutionary change in an
online knowledge creation community. The Price equation was introduced as a tool that can
directly capture the amount of evolutionary change based on different ways of measuring the
population’s characteristics. The study further used permutation-based methods to meaningfully
compare the magnitude of evolutionary forces across multiple characteristics, as suggested by
Hilbert et al. (2016). This approach has only recently been introduced to the study of
communication networks, and this study represents one of the few empirical examples that have
directly applied the framework so far.
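To make this concrete, the sketch below computes the selection component of the Price
equation (the covariance of fitness and a characteristic, divided by mean fitness) on hypothetical
data and gauges its magnitude with a simple permutation test in the spirit of this procedure. It is
an illustration, not the study's exact analysis pipeline; the full Price equation also includes a
transmission term, omitted here for brevity.

import numpy as np

rng = np.random.default_rng(0)

def price_selection_term(w, z):
    # Selection component of the Price equation: Cov(w, z) / mean(w), where
    # w is fitness (e.g., article survival or growth) and z is a population
    # characteristic (e.g., a network metric).
    return float(np.cov(w, z, bias=True)[0, 1] / w.mean())

# Hypothetical population of 50 articles.
w = rng.gamma(2.0, 1.0, size=50)              # fitness
z = 0.5 * w + rng.normal(0.0, 1.0, size=50)   # metric correlated with fitness

observed = price_selection_term(w, z)

# Permutation null: shuffling fitness breaks any w-z association, so the
# spread of reshuffled selection terms shows what "no selection" looks like.
null = np.array([price_selection_term(rng.permutation(w), z)
                 for _ in range(2000)])
p_value = float((np.abs(null) >= abs(observed)).mean())
print(f"selection term = {observed:.3f}, permutation p = {p_value:.3f}")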
Second, the longitudinal research design allows for a much-needed examination of the
causal mechanism linking network configurations and content exploration. Researchers share a
collective interest in studying Wikipedia networks and their relationship with knowledge
creation outcomes, but few empirical studies have used longitudinal data to identify how and
why such networks affect those outcomes. The longitudinal design of this study enables time
series modeling techniques that handle complex temporal dynamics between several exogenous
variables and the response series. The transfer function modeling directly estimates the degree to
which the dependent variable, content exploration, is caused by several network-based input
variables. It establishes causal relationships after controlling for other covariates’ effects and the
dependent variable’s temporal effects on itself. In the present study, the addition of the
hypothesized network configurations (H3) produced incremental increases in the predictive
power of the models. Thus, it provides convincing evidence of time-ordered causality because it
eliminates the ambiguity of interpreting cross-sectional correlations as causal. The existence of
lagged effects calls for future exploration of how an online community operates in a temporal
framework.
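As a minimal sketch of this modeling logic, the snippet below fits an autoregressive distributed
lag (ARDL) specification to simulated series using Python's statsmodels. This is an assumed
stand-in for the transfer function models reported here, chosen for illustration; the data are
simulated rather than drawn from the study.

import numpy as np
import pandas as pd
from statsmodels.tsa.ardl import ARDL

rng = np.random.default_rng(1)
T = 200

# Simulated series: a network metric (exogenous input) whose lagged values
# drive content exploration (response), plus the response's own inertia.
network_metric = rng.normal(size=T)
exploration = np.zeros(T)
for t in range(1, T):
    exploration[t] = (0.4 * exploration[t - 1]        # temporal self-effect
                      + 0.3 * network_metric[t - 1]   # lag-1 network effect
                      + rng.normal(scale=0.5))

df = pd.DataFrame({"exploration": exploration,
                   "network_metric": network_metric})

# One autoregressive lag of the response; up to two lags of the input.
res = ARDL(df["exploration"], lags=1,
           exog=df[["network_metric"]], order=2).fit()
print(res.summary())  # the lagged network_metric coefficients estimate the
                      # time-ordered effect on exploration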
Third, this study adopts a new way to quantify the level of content exploration, using a
novel method proposed by Arazy et al. (2020) and, for the first time, applying that method in
association with network configurations. Researchers have long been keen to analyze how
different content creation strategies are adopted by different knowledge creation communities or
solution-seeking activities (Lazer & Friedman, 2007; Mason, 2013; Navinchandra & Riitahuhta,
2011). This text-based measurement of content exploration is one of the new developments in
this direction, and it proved useful in this project: it indeed captures changes in content creation
caused by changes in network configurations. It promises many empirical applications in future
research on communication networks and knowledge creation systems.
6.3 Practical Implications
A number of managerial implications can be gleaned from the results of this study. First,
by showing that certain network configurations and content characteristics are particularly useful
for identifying large evolutionary forces, the study offers a retrospective lens for understanding
why some Wikipedia articles are more successful than others. Assuming that platform managers
can develop effective ways to manipulate the characteristics shown to be important, they will be
able to guide the community more easily toward developing higher-quality articles that meet the
needs of their audiences. For example, community administrators can periodically highlight
certain articles as high-priority tasks, so that more editors will be invited to contribute to them.
As a result, more network links can be activated by encouraging editors to work on different
projects.
Second, this study established a causal link between editors’ activities and the future
content of articles. It thus clearly shows the possibility of directing WikiProjects’ editors to
focus on certain articles, with this guiding effort producing different development trajectories for
article content in the future. Many intentional efforts in Wikipedia already try to guide editors
about where to go and which articles are most important to work on. Beyond basing such
choices on editors’ personal interests alone, there is room to explain why and how editors can
help develop key articles, which may in turn naturally bring changes to the “neighboring”
articles of the focal articles. The way editors work together and learn from each other translates
into substantive changes in article content. The community overall will have better knowledge
about when and where to direct human resources to the projects that need them most.
6.4 Limitations and Future Work
This study has several limitations that are worth noting. First, as discussed earlier, it used
only one WikiProject dataset, which limits the generalizability of the current findings. Although
there were good reasons to choose this particular WikiProject, it cannot represent Wikipedia as a
whole. In fact, a challenge for future research is to identify several representative datasets that
can meaningfully represent Wikipedia in general. Alternatively, it might be possible to collect
the complete Wikipedia dataset and process it on more powerful computers. Sampling only a
few projects from the site, on the other hand, requires careful consideration of the size, activity
level, popularity, susceptibility to external shocks, and many other factors of the chosen datasets,
as these factors all have implications for the generalizability of the empirical findings.
Second, this study considered only one type of network among the articles, the
editor-to-article network, while there are other plausible ways of constructing networks, as
detailed in Chapter 3. It would be interesting to collect more data and construct other types of
networks, such as article hyperlink networks, which may offer more insight into information
flow through the channels of hyperlinks (Agirre et al., 2015; Pilny & Shumate, 2012). Also,
testing the conclusions of the current study on a new type of network would provide more
evidence about the generalizability of the current results.
Third, the current study did not cast a wide net in terms of which network metrics to
include. The prior investigation by Hilbert et al. (2016) used nearly 20 different network metrics,
and this research used only half of them. Hilbert et al. cast a wider net for the purpose of theory
development, when there was little evidence for choosing some metrics over others, and they
were also dealing with quite distinct research contexts, ranging from international trade networks
to YouTube video networks. It was therefore proper for them to include as many measurements
as possible. For this research, however, the purpose was to identify which network metrics
matter most for identifying evolutionary forces and why; thus, the focus was on a much smaller
selection of network metrics. Even though justifications, based on a review of the Wikipedia
research literature, are provided for why the chosen characteristics may be meaningful in this
context, the empirical ramifications of this selection of network metrics remain unclear.
Precisely because the findings reported here highlight how each metric may reveal different
aspects of a network’s characteristics, more care is needed in deciding which network metrics
matter most in a particular situation and why. It is possible that testing the hypotheses with
different network metrics would yield different conclusions.
Fourth, the conclusion that network configurations lead to consequential changes in
content exploration was built on an untested assumption, namely, that content exploration leads
to changes in evolutionary outcomes. Ideally, all three variables should be included in the same
model, with content exploration tested as the mediator (Zhao & Luo, 2019). A theoretical model
linking network configurations to content development to the performance of content (quality or
quantity) could then be formally examined. This is an important future direction that can
potentially lend more support to the current conclusions.
6.5 Conclusion
This research can be understood from two broad perspectives: the evolutionary dynamics
exhibited in communication networks, and the networked nature of evolution (Hilbert et al.,
2016). The networked nature of evolution was explored in Hypotheses 1 and 2, where different
kinds of network metrics were examined to find out which ones are particularly important in
understanding natural selection processes in a knowledge creation system. The results suggest
that network-based characteristics are indeed prominent drivers of evolutionary change and that
network connectivity may be the most influential in driving evolution. Understanding
evolutionary processes cannot ignore the critical roles played by network structural properties.
The evolutionary dynamics exhibited in communication networks are mostly presented in the
study of Hypothesis 3, where time series regression techniques were used to model how
temporal changes in communication networks lead to consequential changes in content
development trajectories in a knowledge creation system. The communication networks were
constructed and analyzed as having exogenous effects on how the content of knowledge
products evolves over time.
References
Adler, B. T., & De Alfaro, L. (2007). A content-driven reputation system for the Wikipedia.
Proceedings of the 16th International Conference on World Wide Web, 261–270.
Agirre, E., Barrena, A., & Soroa, A. (2015). Studying the wikipedia hyperlink graph for
relatedness and disambiguation. ArXiv Preprint ArXiv:1503.01655.
Aldrich, H. E., & Ruef, M. (2006). Organizations Evolving. SAGE.
Allchin, D. (1999). Do We See through a Social Microscope?: Credibility as a Vicarious
Selector. Philosophy of Science, 66, S287–S298.
Arazy, O., & Nov, O. (2010). Determinants of wikipedia quality: The roles of global and local
contribution inequality. Proceedings of the 2010 ACM Conference on Computer
Supported Cooperative Work, 233–236.
Arazy, O., Lindberg, A., Rezaei, M., & Samorani, M. (2020). The evolutionary trajectories of
peer-produced artifacts: Group composition, the trajectories’ exploration, and the quality
of artifacts. Management Information Systems Quarterly (pre-print).
Astley, W. G. (1985). The Two Ecologies: Population and Community Perspectives on
Organizational Evolution. Administrative Science Quarterly, 30(2), 224–241. JSTOR.
https://doi.org/10.2307/2393106
Aunger, R. (Ed.). (2000). Darwinizing culture: The status of memetics as a science. Oxford
University Press.
Baum, J. A., & Rao, H. (2004). Evolutionary dynamics of organizational populations and
communities. Handbook of Organizational Change and Innovation, 212–258.
Benkler, Y. (2006). The wealth of networks: How social production transforms markets and
freedom. Yale University Press.
Berry, K. J., Mielke, P. W., & Mielke, H. W. (2002). The Fisher-Pitman Permutation Test: An
Attractive Alternative to the F Test. Psychological Reports, 90(2), 495–502.
https://doi.org/10.2466/pr0.2002.90.2.495
Blumenstock, J. E. (2008). Size matters: Word count as a measure of quality on wikipedia.
Proceedings of the 17th International Conference on World Wide Web, 1095–1096.
Bonacich, P. (1972). Factoring and weighting approaches to status scores and clique
identification. Journal of Mathematical Sociology, 2(1), 113–120.
Borgatti, S. P., Jones, C., & Everett, M. G. (1998). Network measures of social capital.
Connections, 21(2), 27–36.
Box, G. E. P., Jenkins, G. M., & Reinsel, G. C. (2008). Time Series Analysis: Forecasting and
Control (4th ed.). Wiley.
Boyd, R., & Richerson, P. J. (2007). Culture, Adaptation, and Innateness. Oxford University
Press.
Bradie, M. (1986). Assessing evolutionary epistemology. Biology and Philosophy, 1(4), 401–
459.
Brandes, U., Borgatti, S. P., & Freeman, L. C. (2016). Maintaining the duality of closeness and
betweenness centrality. Social Networks, 44, 153–159.
Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine.
Computer Networks and ISDN Systems, 30(1–7), 107–117.
Bryant, J. A., & Monge, P. R. (2008). The evolution of the children’s television community,
1953-2003. International Journal of Communication, 2, 33.
Burt, R. S. (1992). Structural holes: The social structure of competition. Harvard Univ. Press.
Burt, R. S. (2001). Structural holes the social structure of competition. In Social capital: Theory
and research (pp. 31–56). Aldine de Gruyter.
Callebaut, W., & Pinxten, R. (2012). Evolutionary epistemology: A multiparadigm program
(Vol. 190). Springer Science & Business Media.
Campbell, D. T. (1965). Variation and selective retention in socio-cultural evolution. Social
Change in Developing Areas: A Reinterpretation of Evolutionary Theory, 19, 26–27.
Campbell, D. T. (1974). Evolutionary epistemology. In Schilpp, Paul Arthur (Ed.), The
Philosophy of Karl Popper (Vol. 1, pp. 413–463). Open Court.
Candelario, D. M., Vazquez, V., Jackson, W., & Reilly, T. (2017). Completeness, accuracy, and
readability of Wikipedia as a reference for patient medication information. Journal of the
American Pharmacists Association, 57(2), 197–200.e1.
Castelfranchi, C. (2001). Towards a Cognitive Memetics: Socio-Cognitive Mechanisms for
Memes Selection and Spreading. Journal of Memetics-Evolutionary Models of
Information Transmission, 5(1).
Cavalli-Sforza, L. L. (1981). Cultural transmission and evolution: A quantitative approach.
Princeton University Press.
Chall, J. S., & Dale, E. (1995). Readability revisited: The new Dale-Chall readability formula.
Brookline Books.
Chesterman, A. (2005). The memetics of knowledge. In Knowledge Systems and Translation
(pp. 17–30). De Gruyter, Inc.
http://ebookcentral.proquest.com/lib/socal/detail.action?docID=3041559
Coleman, M., & Liau, T. L. (1975). A computer readability formula designed for machine
scoring. Journal of Applied Psychology, 60(2), 283.
Dang, Q., & Ignat, C. (2016). Measuring Quality of Collaboratively Edited Documents: The
Case of Wikipedia. 2016 IEEE 2nd International Conference on Collaboration and
Internet Computing (CIC), 266–275. https://doi.org/10.1109/CIC.2016.044
Dawkins, R. (1976). The selfish gene. Oxford University Press.
Dawkins, R. (1982). The extended phenotype: The long reach of the gene. Oxford University
Press.
de La Robertie, B., Pitarch, Y., & Teste, O. (2015). Measuring article quality in wikipedia using
the collaboration network. Proceedings of the 2015 IEEE/ACM International Conference
on Advances in Social Networks Analysis and Mining 2015, 464–471.
Demirhan, H. (2020). dLagM: An R package for distributed lag models and ARDL bounds
testing. PLOS ONE, 15(2), e0228812. https://doi.org/10.1371/journal.pone.0228812
Deng, X., Joshi, K. D., & Galliers, R. D. (2016). The duality of empowerment and
marginalization in microtask crowdsourcing: Giving voice to the less powerful through
value sensitive design. Mis Quarterly, 40(2), 279–302.
Dmitrienko, A., Tamhane, A. C., & Bretz, F. (2009). Multiple testing problems in
pharmaceutical statistics. CRC Press.
https://www.scholars.northwestern.edu/en/publications/multiple-testing-problems-in-pharmaceutical-statistics-2
Edmonds, B. (2002). Three challenges for the survival of memetics. Journal of Memetics-
Evolutionary Models of Information Transmission, 6(2), 45–50.
El Mouden, C., André, J.-B., Morin, O., & Nettle, D. (2014). Cultural transmission and the
evolution of human behaviour: A general approach based on the Price equation. Journal
of Evolutionary Biology, 27(2), 231–241.
Ellis, A. R., Burchett, W. W., Harrar, S. W., & Bathke, A. C. (2017). Nonparametric inference
for multivariate data: The R package npmv. Journal of Statistical Software, 76(4), 1–18.
Enders, W. (2014). Applied Econometric Time Series (4th ed.). Wiley.
Faraj, S., Jarvenpaa, S. L., & Majchrzak, A. (2011). Knowledge Collaboration in Online
Communities. Organization Science, 22(5), 1224–1239.
Fisher, R. A. (1958). The genetical theory of natural selection (2nd ed.). Dover Publications.
Forte, A., Larco, V., & Bruckman, A. (2009). Decentralization in Wikipedia governance. Journal
of Management Information Systems, 26(1), 49–72.
Frank, S. A. (1995). George Price’s contributions to evolutionary genetics. Journal of
Theoretical Biology, 175(3), 373–388. https://doi.org/10.1006/jtbi.1995.0148
Frank, S. A. (1997). The Price equation, Fisher’s fundamental theorem, kin selection, and causal
analysis. Evolution, 51(6), 1712–1729.
Frank, S. A. (1998). Foundations of social evolution. Princeton University Press.
Frank, S. A. (2012a). Natural selection. III. Selection versus transmission and the levels of
selection. Journal of Evolutionary Biology, 25(2), 227–243.
Frank, S. A. (2012b). Natural selection. IV. The price equation. Journal of Evolutionary Biology,
25(6), 1002–1019.
Frank, S. A. (2017). Universal expressions of population change by the Price equation: Natural
selection, information, and maximum entropy production. Ecology and Evolution, 7(10),
3381–3396.
Gardner, A. (2008). The price equation. Current Biology, 18(5), R198–R202.
Gatherer, D. (1998). Why the thought contagion metaphor is retarding the progress of memetics.
Journal of Memetics-Evolutionary Models of Information Transmission, 2(2), 135–158.
Glenn, J. C. (2016). Collective intelligence systems. Handbook of Science and Technology
Convergence, 53–64.
Grewal, R., Lilien, G. L., & Mallapragada, G. (2006). Location, Location, Location: How
Network Embeddedness Affects Project Success in Open Source Systems. Management
Science, 52(7), 1043–1056. https://doi.org/10.1287/mnsc.1060.0550
Guglielmino, C. R., Viganotti, C., Hewlett, B., & Cavalli-Sforza, L. L. (1995). Cultural variation
in Africa: Role of mechanisms of transmission and adaptation. Proceedings of the
National Academy of Sciences, 92(16), 7585–7589.
Gupta, Y., Saxena, A., Das, D., & Iyengar, S. R. S. (2016). Modeling Memetics using Edge
Diversity. In Complex Networks VII (pp. 187–198). Springer.
Hahlweg, K., & Hooker, C. A. (Eds.). (1989). Issues in evolutionary epistemology. State
University of New York Press.
Halfaker, A., Geiger, R. S., Morgan, J. T., & Riedl, J. (2012). The rise and decline of an open
collaboration system: How Wikipedia’s reaction to popularity is causing its decline.
American Behavioral Scientist, 0002764212469365.
Halfaker, A., Kittur, A., Kraut, R., & Riedl, J. (2009). A jury of your peers: Quality, experience
and ownership in Wikipedia. Proceedings of the 5th International Symposium on Wikis
and Open Collaboration, 15.
Halfaker, A., & Warncke-Wang, M. (2019). Articlequality [Python]. Wikimedia.
https://github.com/wikimedia/articlequality (Original work published 2014)
Hamming, R. W. (1950). Error detecting and error correcting codes. The Bell System Technical
Journal, 29(2), 147–160. https://doi.org/10.1002/j.1538-7305.1950.tb00463.x
Hannan, M. T., & Freeman, J. (1977). The population ecology of organizations. American
Journal of Sociology, 82(5), 929–964.
Heath, C., Bell, C., & Sternberg, E. (2001). Emotional selection in memes: The case of urban
legends. Journal of Personality and Social Psychology, 81(6), 1028.
Heyes, C. M., & Hull, D. L. (Eds.). (2001). Selection theory and social construction: The
evolutionary naturalistic epistemology of Donald T. Campbell. State University of New
York Press.
Heylighen, F. (1997). Objective, Subjective and Intersubjective Selectors of Knowledge.
Evolution and Cognition, 3, 63–67.
Heylighen, F., & Chielens, K. (2009). Evolution of Culture, Memetics. In R. A. Meyers (Ed.),
Encyclopedia of Complexity and Systems Science (pp. 3205–3220). Springer New York.
http://link.springer.com/10.1007/978-0-387-30440-3_189
Hilbert, M., Oh, P., & Monge, P. (2016). Evolution of what? A network approach for the
detection of evolutionary forces. Social Networks, 47, 38–46.
Hull, D. L. (2001). Science and selection: Essays on biological evolution and the philosophy of
science. Cambridge University Press.
Ivanov, V., & Kilian, L. (2005). A Practitioner’s Guide to Lag Order Selection For VAR Impulse
Response Analysis. Studies in Nonlinear Dynamics & Econometrics, 9(1).
https://doi.org/10.2202/1558-3708.1219
Jenkins, H. (2009). If It Doesn’t Spread, It’s Dead (Part One): Media Viruses and Memes. Henry
Jenkins. http://henryjenkins.org/blog/2009/02/if_it_doesnt_spread_its_dead_p.html
Johnston, J., & Dinardo, J. (1996). Econometric Methods (4th ed.). McGraw-Hill/Irwin.
Kane, G. C., Johnson, J., & Majchrzak, A. (2014). Emergent Life Cycle: The Tension Between
Knowledge Change and Knowledge Retention in Open Online Coproduction
Communities. Management Science, 60(12), 3026–3048.
https://doi.org/10.1287/mnsc.2013.1855
Kane, G. C., & Ransbotham, S. (2016). Content and collaboration: An affiliation network
approach to information quality in online peer production communities. Information
Systems Research, 27(2), 424–439.
Kanefsky, B., Barlow, N. G., Gulick, V., & Norvig, P. (2000, January 1). Can
Distributed Volunteers Accomplish Massive Data Analysis Tasks? Lunar and Planetary
Science Conference, Houston, TX, United States.
https://ntrs.nasa.gov/search.jsp?R=20010048412
Keegan, B., Gergle, D., & Contractor, N. (2013). Hot off the wiki: Structures and dynamics of
Wikipedia’s coverage of breaking news events. American Behavioral Scientist, 57(5),
595–622.
Kim, K.-M. (2001). Nested hierarchies of vicarious selectors. In Selection theory and social
construction (pp. 101–118).
Kincaid, J. P., Fishburne Jr, R. P., Rogers, R. L., & Chissom, B. S. (1975). Derivation of new
readability formulas (automated readability index, fog count and flesch reading ease
formula) for navy enlisted personnel (No. 8–75; Naval Technical Training Command
Research Branch Report).
Kittur, A., & Kraut, R. E. (2008). Harnessing the wisdom of crowds in wikipedia: Quality
through coordination. Proceedings of the 2008 ACM Conference on Computer Supported
Cooperative Work, 37–46.
Kleinberg, J. M. (1998). Authoritative sources in a hyperlinked environment. Proceedings of the
Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, 668–677.
Knudsen, T. (2004). General selection theory and economic evolution: The price equation and
the replicator/interactor distinction. Journal of Economic Methodology, 11(2), 147–173.
Kräenbring, J., Penza, T. M., Gutmann, J., Muehlich, S., Zolk, O., Wojnowski, L., Maas, R.,
Engelhardt, S., & Sarikas, A. (2014). Accuracy and Completeness of Drug Information in
Wikipedia: A Comparison with Standard Textbooks of Pharmacology. PLOS ONE, 9(9),
e106930. https://doi.org/10.1371/journal.pone.0106930
Lazer, D., & Friedman, A. (2007). The network structure of exploration and exploitation.
Administrative Science Quarterly, 52(4), 667–694.
Lee, G. K., & Cole, R. E. (2003). From a Firm-Based to a Community-Based Model of
Knowledge Creation: The Case of the Linux Kernel Development. Organization Science,
14(6), 633–649.
Lee, S., & Monge, P. (2011). The Coevolution of Multiplex Communication Networks in
Organizational Communities. Journal of Communication, 61(4), 758–779.
https://doi.org/10.1111/j.1460-2466.2011.01566.x
Lewandowski, D., & Spree, U. (2011). Ranking of Wikipedia articles in search engines revisited:
Fair ranking for reasonable quality? Journal of the American Society for Information
Science and Technology, 62(1), 117–132. https://doi.org/10.1002/asi.21423
Lewens, T. (2015). Cultural evolution: Conceptual challenges. OUP Oxford.
Lewoniewski, W., Węcel, K., & Abramowicz, W. (2017). Analysis of References Across
Wikipedia Languages. International Conference on Information and Software
Technologies, 561–573.
Li, X., Tang, J., Wang, T., Luo, Z., & de Rijke, M. (2015). Automatically Assessing Wikipedia
Article Quality by Exploiting Article–Editor Networks. In A. Hanbury, G. Kazai, A.
Rauber, & N. Fuhr (Eds.), Advances in Information Retrieval (pp. 574–580). Springer
International Publishing.
Lovas, B., & Ghoshal, S. (2000). Strategy as guided evolution. Strategic Management Journal,
21(9), 875–896.
Lumsden, C. J., & Wilson, E. O. (1985). The relation between biological and cultural evolution.
Journal of Social and Biological Structures, 8(4), 343–359.
Lynch, A. (1998). Units, events and dynamics in memetic evolution. Journal of Memetics-
Evolutionary Models of Information Transmission, 2(1), 5–43.
Malik, F., & Probst, G. J. (1982). Evolutionary management. Cybernetics and System, 13(2),
153–174.
March, J. G. (1991). Exploration and exploitation in organizational learning. Organization
Science, 2(1), 71–87.
Mason, W. (2013). Collective Search as Human Computation. In P. Michelucci (Ed.), Handbook
of Human Computation (pp. 463–474). Springer. https://doi.org/10.1007/978-1-4614-
8806-4_35
McKelvey, B. (1982). Organizational systematics—Taxonomy, evolution, classification.
University of California Press.
Miller, C. C. (2009, May 25). Ad Revenue On the Web? No Sure Bet. New York Times, B.1.
Moghadas, S. M., Fitzpatrick, M. C., Sah, P., Pandey, A., Shoukat, A., Singer, B. H., & Galvani,
A. P. (2020). The implications of silent transmission for the control of COVID-19
outbreaks. Proceedings of the National Academy of Sciences.
https://doi.org/10.1073/pnas.2008373117
Monge, P., Heiss, B. M., & Margolin, D. B. (2008). Communication network evolution in
organizational communities. Communication Theory, 18(4), 449–477.
Monge, P., & Poole, M. S. (2008). The Evolution of Organizational Communication. Journal of
Communication, 58(4), 679–692. https://doi.org/10.1111/j.1460-2466.2008.00408.x
Navinchandra, D., & Riitahuhta, A. (2011). Exploration and Innovation in Design: Towards a
Computational Model (1991 ed.). Springer.
Nelson, R. R., & Winter, S. G. (1982). An Evolutionary Theory of Economic Change. Harvard
University Press. http://books.google.com/books?id=6Kx7s_HXxrkC
Nishiura, H., Kobayashi, T., Miyama, T., Suzuki, A., Jung, S., Hayashi, K., Kinoshita, R., Yang,
Y., Yuan, B., Akhmetzhanov, A. R., & Linton, N. M. (2020). Estimation of the
asymptomatic ratio of novel coronavirus infections (COVID-19). International Journal of
Infectious Diseases, 94, 154–155. https://doi.org/10.1016/j.ijid.2020.03.020
Nowak, M. A. (2006). Evolutionary dynamics: Exploring the equations of life. Belknap Press of
Harvard University Press.
Okasha, S. (2008). Fisher’s Fundamental Theorem of Natural Selection—A Philosophical
Analysis. The British Journal for the Philosophy of Science, 59(3), 319–351.
https://doi.org/10.1093/bjps/axn010
Pesaran, M. H., & Shin, Y. (1998). An autoregressive distributed-lag modelling approach to
cointegration analysis. Econometric Society Monographs, 31, 371–413.
Pesarin, F. (2001). Multivariate Permutation Tests: With Applications in Biostatistics (1st ed.).
Wiley.
Pilny, A., & Shumate, M. (2012). Hyperlinks as extensions of offline instrumental collective
action. Information, Communication & Society, 15(2), 260–286.
Plotkin, H. C. (1982). Learning, development, and culture: Essays in evolutionary epistemology.
Wiley.
Pocklington, R., & Best, M. L. (1997). Cultural Evolution and Units of Selection in Replicating
Text. Journal of Theoretical Biology, 188, 79–87.
Popper, K. (1959). The logic of scientific discovery. Hutchinson.
Popper, K. (1972). Objective knowledge: An evolutionary approach. Oxford.
Powell, W. W., White, D. R., Koput, K. W., & Owen-Smith, J. (2005). Network dynamics and
field evolution: The growth of interorganizational collaboration in the life sciences.
American Journal of Sociology, 110(4), 1132–1205.
Price, G. R. (1972). Fisher’s ‘fundamental theorem’ made clear. Annals of Human Genetics,
36(2), 129–140. https://doi.org/10.1111/j.1469-1809.1972.tb00764.x
Price, G. R. (1995). The nature of selection. Journal of Theoretical Biology, 175(3), 389–396.
Qin, X., & Cunningham, P. (2012). Assessing the quality of wikipedia pages using edit longevity
and contributor centrality. ArXiv Preprint ArXiv:1206.2517.
Ransbotham, S., Kane, G. C., & Lurie, N. H. (2012). Network Characteristics and the Value of
Collaborative User-Generated Content. Marketing Science, 31(3), 387–405.
https://doi.org/10.1287/mksc.1110.0684
Ren, R., & Yan, B. (2017). Crowd Diversity and Performance in Wikipedia: The Mediating
Effects of Task Conflict and Communication. Proceedings of the 2017 CHI Conference
on Human Factors in Computing Systems, 6342–6351.
Richerson, P. J., & Boyd, R. (1978). A dual inheritance model of the human evolutionary process
I: Basic postulates and a simple model. Journal of Social and Biological Structures, 1(2),
127–154.
Richerson, P. J., & Christiansen, M. H. (Eds.). (2013). Cultural Evolution: Society, Technology,
Language, and Religion. The MIT Press.
Rose, H., & Rose, S. P. R. (2000). Alas, poor Darwin: Arguments against evolutionary
psychology (1st American ed.). Harmony Books.
Schaden, G., & Patin, C. (2018). Semiotic systems with duality of patterning and the issue of
cultural replicators. History and Philosophy of the Life Sciences, 40(1), 4.
Schaller, M., Conway III, L. G., & Tanchuk, T. L. (2002). Selective pressures on the once and
future contents of ethnic stereotypes: Effects of the communicability of traits. Journal of
Personality and Social Psychology, 82(6), 861.
Sereno, M. I. (1991). Four analogies between biological and cultural/linguistic evolution.
Journal of Theoretical Biology, 151(4), 467–507.
Shen, A., Qi, J., & Baldwin, T. (2017). A hybrid model for quality assessment of Wikipedia
articles. Proceedings of the Australasian Language Technology Association Workshop
2017, 43–52.
Shumate, M., Fulk, J., & Monge, P. (2006). Predictors of the International HIV-AIDS INGO
Network over Time. Human Communication Research, 31(4), 482–510.
https://doi.org/10.1111/j.1468-2958.2005.tb00880.x
Suzuki, Y. (2015). Quality Assessment of Wikipedia Articles Using h-index. Journal of
Information Processing, 23(1), 22–30. https://doi.org/10.2197/ipsjjip.23.22
Toulmin, S. (1967). The evolutionary development of natural science. American Scientist, 456–
471.
Toulmin, S. (1972). Human understanding: Vol. 1. The collective use and development of
concepts. Princeton University Press.
Usher, J. M., & Evans, M. G. (1996). Life and Death Along Gasoline Alley: Darwinian and
Lamarckian Processes in a Differentiating Population. Academy of Management Journal,
39(5), 1428–1466. https://doi.org/10.5465/257004
Warncke-Wang, M., Ayukaev, V. R., Hecht, B., & Terveen, L. G. (2015). The success and
failure of quality improvement projects in peer production communities. Proceedings of
the 18th ACM Conference on Computer Supported Cooperative Work & Social
Computing, 743–756.
Wasserman, S., & Faust, K. (1994). Social network analysis: Methods and applications (Vol. 8).
Cambridge University Press.
Watts, D. J., & Strogatz, S. H. (1998). Collective dynamics of ‘small-world’ networks. Nature,
393(6684), 440.
Weidema, E. R., López, C., Nayebaziz, S., Spanghero, F., & van der Hoek, A. (2016). Toward
microtask crowdsourcing software design work. CrowdSourcing in Software Engineering
(CSI-SE), 2016 IEEE/ACM 3rd International Workshop On, 41–44.
Wheeler, B., Torchiano, M., & Torchiano, M. M. (2016). Package ‘lmPerm.’ R Package Version,
1–1.
Wiggins, B. E., & Bowers, G. B. (2015). Memes as genre: A structurational analysis of the
memescape. New Media & Society, 17(11), 1886–1906.
https://doi.org/10.1177/1461444814535194
Wikipedia:Reliable sources. (2019). In Wikipedia.
https://en.wikipedia.org/w/index.php?title=Wikipedia:Reliable_sources&oldid=926389688
Wilden, R., Hohberger, J., Devinney, T. M., & Lavie, D. (2018). Revisiting James March (1991):
Whither exploration and exploitation? Strategic Organization, 16(3), 352–369.
Wong, M. (2011). Encyclopedias. In R. E. Bopp & L. C. Smith (Eds.), Reference and
information services: An introduction (4th ed., pp. 433–459). Libraries Unlimited.
Zeileis, A., Wiel, M. A. van de, Hornik, K., & Hothorn, T. (2008). Implementing a Class of
Permutation Tests: The coin Package. Journal of Statistical Software, 28(8), 1–23.
Zhang, A. F., Livneh, D., Budak, C., Robert, L. P., Jr., & Romero, D. M. (2017). Crowd
Development: The Interplay Between Crowd Evaluation and Collaborative Dynamics in
Wikipedia. Proc. ACM Hum.-Comput. Interact., 1(CSCW), 119:1–119:21.
https://doi.org/10.1145/3134754
Zhao, Y., & Luo, X. (2019). Granger mediation analysis of multiple time series with an
application to functional magnetic resonance imaging. Biometrics, 75(3), 788–798.
https://doi.org/10.1111/biom.13056