Sustaining open source software production: an empirical analysis through the lens of microeconomics
(USC Thesis Other)
SUSTAINING OPEN SOURCE SOFTWARE PRODUCTION: AN EMPIRICAL ANALYSIS THROUGH THE LENS OF MICROECONOMICS

by
Samuel Jospeh Boysel

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ECONOMICS)

December 2022

Copyright 2023 Samuel Jospeh Boysel

Dedication

To my mother, father, Claire, and Jake. You give my work meaning.

Acknowledgements

First and foremost, I am deeply indebted to my advisor Matthew Kahn, whose support and guidance throughout my academic career have been second to none. I am also grateful for the detailed feedback on this work provided by my dissertation committee members David Kempe, Paulina Oliva, and Robert Metcalfe. I appreciate the work my committee has done in providing high-level insight, catching errors or problems, and suggesting strategies to achieve my research objectives. Additionally, I thank Shane Greenstein for his thorough review of early iterations of my work. Moreover, conversations with Professors Cheng Hsiao, Jeff Weaver, Vittorio Bassi, Monica Morlacco, Michael Leung, and Geert Ridder helped refine my thought process and methodology. I would also like to thank Young Miller and Annie Le for their unyielding support; their tireless efforts within the USC economics department too often go without proper acknowledgement. Finally, comments from colleagues Rajat Kochhar, Ruozi Song, Nicolas Roig, Karim Fajury, Thomas Ash, Islamul Haque, Amy Mahler, Yue Fang, Liying Yang, Xiongfei Li, Taraq Khan, and Amanda Ang have all helped shape my research in positive directions. More importantly, I cherish their friendship in ways that transcend research.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Motivation
    1.1.1 Economic Principles embedded in OSS Production
    1.1.2 Related Work
  1.2 Empirical Setting
  1.3 Contribution
Chapter 2: Quid Pro Code: Peer Effects and Productivity in Open Source Software
  2.1 Introduction
  2.2 Background
  2.3 Literature
    2.3.1 Why Contribute to OSS?
    2.3.2 Private Public Good Provision
    2.3.3 Peer Effects
  2.4 Data
  2.5 Reduced Form
    2.5.1 Peer Effects on Individual Contribution
    2.5.2 Identification
    2.5.3 Results
    2.5.4 Detailed Analysis and Robustness
  2.6 Structural Model
    2.6.1 Setup
    2.6.2 Equilibrium
    2.6.3 Peer Effects
    2.6.4 Estimation
    2.6.5 Structural Estimates
    2.6.6 Counterfactual Analysis
  2.7 Discussion
  Appendices
    2.A Tables
    2.B Figures
    2.C Data Details
    2.D Additional Reduced Form Results
    2.E Structural Estimation Details
Chapter 3: No Free Lunch For Programmers: Digital Supply Chains and the Economics of Software Dependency Management
  3.1 Introduction
  3.2 Literature
  3.3 Framework
    3.3.1 Setting
    3.3.2 Risk Embedded in Dependency Network Structure
    3.3.3 A Maintainer’s Choice Between Risky Alternatives
    3.3.4 Fragile Dependency Networks
  3.4 Data
    3.4.1 Sampling Procedure
    3.4.2 Measuring Software Quality
    3.4.3 Descriptive Statistics
  3.5 Reduced Form
    3.5.1 Contribution Levels
    3.5.2 Project Quality
    3.5.3 Dependency Formation
    3.5.4 Robustness
    3.5.5 Summary
  3.6 Structural Approach
    3.6.1 Setup
      3.6.1.1 Project Quality
      3.6.1.2 Contribution costs
      3.6.1.3 Preferences
      3.6.1.4 Information Sets
    3.6.2 Equilibrium
      3.6.2.1 Timing
      3.6.2.2 Optimal contribution decision (x_i)
      3.6.2.3 A Maintainer’s Utility over Expected Project Quality
      3.6.2.4 Optimal dependency formation decision (G_ij)
      3.6.2.5 Comparative Statics
    3.6.3 Estimation
      3.6.3.1 Dimensionality Reduction
  3.7 Counterfactual Analysis
    3.7.1 Reducing Fluctuations in Project Quality
    3.7.2 Increasing Developer Risk Aversion
    3.7.3 Key Projects
    3.7.4 Summary
  3.8 Discussion
  Appendices
    3.A Figures
    3.B Tables
    3.C Mathematical Details
      3.C.1 Alternative Representations for the Maintainer’s Problem
      3.C.2 Expected Project Quality
      3.C.3 Optimal Dependency Formation
    3.D Estimation Details
      3.D.1 Additional simplifications to reduce computational burden
Bibliography

List of Tables

2.A.1 Descriptive Statistics – Primary Measures in Empirical Sample
2.A.2 Reduced Form – Individual Level Peer Effects (Baseline)
2.A.3 Reduced Form – Individual Level Peer Effects (Interactions)
2.A.4 Reduced Form – Temporal Heterogeneity
2.A.5 Reduced Form – Beyond Contemporaneous Effects
2.A.6 Reduced Form – Project-Level Estimates (Contribution Levels)
2.A.7 Reduced Form – Project-Level Estimates (Number of Contributors)
2.A.8 Structural Model Notation
3.B.1 Descriptive Statistics – Node.js Dependency Network Sample
3.B.2 Reduced Form Estimates – Project Contribution
3.B.3 Reduced Form Estimates – Project Quality
3.B.4 Reduced Form Estimates – Dependency Formation
3.B.5 Counterfactual Analysis

List of Figures

2.1 Reduced Form Identification Strategy
2.B.1 Example GitHub Repository Page – twbs/bootstrap
2.B.2 Descriptive Statistics – Project Creation Dates and Earliest Commits
2.B.3 Descriptive Statistics – Distribution of Project-level Contribution Shares
2.B.4 Descriptive Statistics – Aggregate contribution in sample
2.B.5 Descriptive Statistics – Distinct contributors in sample
2.B.6 Descriptive Statistics – Mean individual and peer contribution per project
2.B.7 Reduced Form – Project Heterogeneity
2.B.8 Reduced Form – Temporal Heterogeneity
2.B.9 Reduced Form – Insider Contribution and Crowding Out
2.B.10 Structural Model – Recovered Benefit and Productivity Shocks
2.B.11 Structural Model – Correlation between Benefit and Productivity Shocks
2.B.12 Structural Model – Extensive Margin Peer Effects
2.B.13 Structural Model – Intensive Margin Peer Effects
2.B.14 Structural Model – Counterfactual Growth in Aggregate Contribution without Peer Influence
3.3.1 An Illustration of Software Dependency Network Basics
3.3.2 Risk Embedded in Network Structure
3.3.3 Risk Aversion
3.3.4 Fragility of Dependency Networks
3.A.1 Empirical Dependency Network
3.A.2 Dependency Network Size
3.A.3 Dependency Network Centrality
3.A.4 Reduced Form – Project Heterogeneity
3.A.5 Reduced Form – Temporal Variation
3.A.6 Comparative Statics – Probability of Dependency Formation

Abstract

In this manuscript, we explore microeconomic behavior shaping the production of open source software (OSS).
We fortify our analysis with economic structure to guide our narrative and assess our hypotheses empirically, filling important gaps in the literature on the supply side of markets for OSS goods. Our motivation is rooted in a desire to better understand how various microeconomic phenomena influence sustained development of widely used OSS infrastructure. Following an introduction in Chapter 1, our contribution is divided into two distinct chapters.

In Chapter 2, we examine the extent to which peer effects influence the private provision of public goods. In the case of public information goods, peer contribution may facilitate or otherwise incentivize further contribution from others, effectively subsidizing private provision. We first utilize a reduced form approach to derive causal estimates of net peer effects in public goods contribution by exploiting a peers-of-peers identification strategy. Next, we develop a structural model of peer-influenced public good provision that both (1) separates extensive and intensive margin contribution decisions and (2) decomposes contribution into marginal private benefits and costs. We apply these methodologies using a sample of peer contribution histories for 2,287 OSS projects hosted on the GitHub collaboration platform. Both reduced form and structural approaches suggest peer effects are much stronger along the extensive margin than the intensive margin. Contemporaneous intensive margin effects, while heterogeneous across time and projects, are small and centered around zero, suggesting that strategic complementarity and substitution in peer contribution likely offset one another. Our counterfactual analysis suggests (extensive margin) peer effects account for nearly 56% of cumulative aggregate contribution for our sample, which translates to a value-added of 1–1.5 million software developer labor hours. These results support the notion that OSS is largely developed by disproportionate efforts from smaller groups of dedicated core maintainers, who integrate incremental contributions from the wider community, and cast doubt on the potential for peer effects alone to deliver sustained maintenance labor to individual projects.

In Chapter 3, we turn our attention to the formation of software dependency networks. Developers of software projects can leverage the functionality of existing open source projects. This practice can potentially lower the cost of development, albeit at the inherent risk of relying on external components. A “downstream” project maintainer can choose to “import” elements of an “upstream” project to outsource functionality, but is uncertain how future changes in this dependency project may expose her own project to software faults or vulnerabilities. Software dependency networks therefore represent a “digital supply chain”, an ecosystem of interdependent public goods that confer an intricate set of both positive and negative externalities for project maintainers and end users. Focusing on microeconomic fundamentals of the dependency management problem faced by the risk averse project maintainer, we use both reduced form and structural approaches to study how dependency networks create value, what forces shape their formation, and how individual behavior can influence the robustness of equilibrium network structure. We use a sample of open source software projects from the Node.js JavaScript packaging ecosystem for which contribution and dependency formation decisions are observed in real time. Finally, we consider several policy interventions that can improve equilibrium welfare. In particular, we find that removing less than 1% of core projects can reduce aggregate project quality by more than 5% for the remaining peers.
Chapter 1: Introduction

1.1 Motivation

The production of Open Source Software (OSS), a special class of public information goods (H. R. Varian, 2000), is both intriguing and puzzling from an economic perspective.[1] Most typically, OSS is incrementally and collaboratively developed by software developers willing to forgo property rights over their contributions.[2] In doing so, contributors provide a rich and dynamic space of open technology available to the wider public.[3] In theory, allowing open access to a project’s[4] source code and encouraging contributions from the wider community has attractive benefits for the individual developer: it can redistribute the maintenance workload, encourage more diverse perspectives over design input, and facilitate a public audit of the codebase.[5] A form of “digital capital” or infrastructure,[6] OSS allows developers to integrate subsets of functionality into more complex applications and greatly mitigates the need to “reinvent the wheel” in subsequent software development.[7]

[1] Throughout this manuscript, we will refer to open source software by the acronym OSS. By software, we mean a set of digital instructions for a machine to perform a task.
[2] Invoking a prevailing definition, we say that software is open source if users can (1) freely access, (2) copy or distribute, and (3) make modifications to the software’s underlying codebase (Perens et al., 1999). See Laurent (2004) for an overview of the specifics of OSS licensing.
[3] As a subset of digital technology, software goods can be credited for a considerable share of innovation and economic development over the past half-century (Goldfarb, S. M. Greenstein, and Tucker, 2015).
[4] Throughout this study, we will reference the atomistic unit for an OSS codebase as a project, a collection of code typically organized in a repository designed for a specific purpose. For the purposes of this work, we will occasionally use the terms codebase, package, library, or repository as synonyms for an OSS project.
[5] To paraphrase the oft-used adage of Raymond (1999), “with enough eyes, all bugs are shallow.”
[6] See Eghbal (2016) for a description of how OSS has become inextricably linked with and depended upon by contemporary software applications.
[7] Creating software requires human capital and labor, but software has the distinct advantage that it can be replicated at a marginal cost of zero.

Both ubiquitous and characterized by widespread adoption, OSS has become an integral component of software development and technological innovation. It is estimated that modern software projects derive 70–90% of their functionality from open source code (Nagle et al., 2022).[8] In this way, open software ecosystems operate as modular public goods production networks, characterized by intricate and dynamic sociotechnical interactions both within and across projects and individual contributors.

[8] We provide further details on the prevalence of OSS usage and dependence in Chapter 2, Sections 2.1 and 2.2 and Chapter 3, Section 3.1.

Despite the attractive properties of OSS goods, there remains concern over the sustained maintenance of widely used OSS as public good infrastructure (Eghbal, 2016; Eghbal, 2020). Case studies[9] of problems arising from the under-maintenance of such OSS projects motivate our interest in exploring how economic behavior shapes production patterns. In an empirical analysis grounded in microeconomic theory, we seek to understand how peer effects create value in OSS production networks and what these forces imply for sustainability.

[9] See, for example, Mutton (2014), Schlueter (2016), Carey (2017), US CFPB (2022), and US FTC (2022).

1.1.1 Economic Principles embedded in OSS Production

Although the benefits and importance of OSS goods appear to be significant, open software suffers from problems common to most classes of public goods: the presence of positive externalities, a non-excludable nature, and the absence of narrowly defined property rights preclude the establishment of a private market. In terms of production, OSS can clearly be subject to free-ridership[10] and may therefore suffer from under-provision from private contributors in the classic sense (Samuelson, 1954). However, the ubiquitous nature and widespread uptake of OSS in the present day seemingly assuage concerns of under-provision in aggregate, suggesting that strong individual benefits for motivated contributors are at work (Bergstrom, Blume, and H. Varian, 1986; Lerner and Tirole, 2002). In this work, we therefore instead focus on more nuanced problems surrounding under-maintenance: what happens when an integral OSS component, widely used and depended upon by many, lacks an adequate level of maintenance bandwidth to ensure security and proper functionality?[11] By encouraging additional contribution from peers or improving their productivity, can maintainers structure self-sustaining projects?

[10] By “free-riding”, we mean that users of a software project can enjoy the benefits of development efforts made by their peers while contributing relatively little, or not at all, themselves. In a broader sense, the wider population of technology consumers often use OSS at least indirectly and do not contribute whatsoever. Throughout this study, we will emphasize a narrower scope for potential free-riders by looking within the subpopulation of developers who make at least a single contribution to OSS projects in our empirical sample.
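To make "under-provision in the classic sense" concrete, the following is a textbook statement of the Samuelson (1954) condition, offered as an illustrative sketch rather than a model taken from this manuscript. The notation is assumed for the illustration only: n contributors, utility u_i(x_i, G) over private consumption x_i and the public good G (the sum of individual contributions g_j), and a constant marginal cost c of G measured in units of the private good.

```latex
% Textbook illustration only; all symbols are assumptions introduced here.
% Left: efficient provision equates the SUM of marginal rates of substitution to c.
% Right: a privately optimizing contributor with g_i > 0 equates only her OWN rate to c.
\[
  \underbrace{\sum_{i=1}^{n}
    \frac{\partial u_i / \partial G}{\partial u_i / \partial x_i} = c}_{\text{efficient provision (Samuelson, 1954)}}
  \qquad \text{versus} \qquad
  \underbrace{\frac{\partial u_i / \partial G}{\partial u_i / \partial x_i} = c}_{\text{private optimum of contributor } i}
\]
```

Because each contributor weighs only her own marginal benefit, the sum of marginal benefits at the private outcome exceeds c whenever at least one other person values the good. Under standard concavity assumptions, the privately provided level of G therefore falls short of the efficient level.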
To understand the mechanisms at the heart of production and maintenance concerns, we take a deeper look at contributor behavior within OSS ecosystems, which reveals complex interactions that can be explained by economic theory. Beyond any intrinsic desire for pure altruism, a common motivation for releasing software code to the public under a permissive license is to encourage the use and adoption of the codebase.[12] Project maintainers benefit from putting the code into the hands of many by opening a pathway for community contributions, ranging from the addition of new features to the identification of software faults, which ultimately benefit the project’s quality.

A common contribution pattern can be described as follows. A maintainer of an OSS project first releases their code, making it publicly accessible and distributing it under a permissive OSS license. Interested parties may then inspect the source code, use and modify it, and potentially contribute changes. To contribute, the developer formally requests that the original maintainer integrate their proposed changeset into the codebase of the original project.[13][14]

[11] Eghbal (2020) provides an excellent overview of modern OSS development and places particular emphasis on the problems faced by oversubscribed projects.
[12] This can be seen as a special case of “impure” altruism (Andreoni, 1990).
[13] This is known as either a merge or, in the development terminology popularized by software forges such as GitHub, a pull request.
[14] Open source theoretically allows any given codebase to be forked into any number of alternative versions. It must therefore be the case that attractive benefits exist for users to center their development efforts around a canonical version of a project.

Individuals who directly engage in the OSS space are confronted with many choices: which projects to use, which to contribute to, and how much to contribute. By choosing which external projects to use as dependencies, maintainers trade off faster development of new features against the risk of relying on codebases outside their direct control.[15] Even after a maintainer decides to utilize software dependencies, she must choose which in particular to source based on their observable features. Users of software projects will contribute to projects if they find a net benefit in doing so. They must then determine an optimal level of contribution effort to allocate, conditional on the choices made by their peers. A common theme we explore throughout this work is the complex substitution patterns that arise as consequences of these choices. It may be the case that some OSS participants opt to “free-ride” on the outsized efforts of dominant contributors. Alternatively, some projects used as dependencies may greatly reduce the amount of work required in a dependent project while others may actually facilitate further development.

Externalities proliferate within OSS ecosystems, both between projects and between individual contributors. The core inquiry of our analysis in Chapter 2 assesses the extent to which contributors influence both the contribution decisions and the productivity levels of their peers. Perhaps it is the case that some contributions induce peers to make additional follow-on contributions. This may arise when certain contributions lower the cost of subsequent development efforts or add critical features that encourage additional work. On the other hand, outsized contributions by dominant insiders may dissuade or “crowd out” contributions from outsiders, leading to increased free-ridership. It is unclear ex ante whether contributors find the contributions of their peers to be net substitutes or complements, which determines the distribution of the contribution workload within projects. Our interest in peer effects continues in Chapter 3, where we explore influences between interrelated software projects.
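The distinction between substitutes and complements can be illustrated with a deliberately simple toy model. This is a hypothetical sketch, not the structural model developed in Chapter 2: each contributor's effort is assumed to be a linear best response to the average effort of her peers, and the function name, the `baselines` values, and the `peer_weight` parameter are all invented for the illustration.

```python
def equilibrium_efforts(baselines, peer_weight, rounds=500):
    """Iterate best responses x_i = max(0, b_i + w * mean(x_-i)) to a fixed point.

    Hypothetical toy model. It needs at least two contributors, and the
    iteration is only guaranteed to converge when |peer_weight| < 1.
    """
    n = len(baselines)
    efforts = list(baselines)
    for _ in range(rounds):
        total = sum(efforts)
        # Every contributor responds simultaneously to the mean effort of the others.
        efforts = [
            max(0.0, b + peer_weight * (total - x) / (n - 1))
            for b, x in zip(baselines, efforts)
        ]
    return efforts


# peer_weight > 0: peer effort encourages one's own effort (complements).
complements = equilibrium_efforts([1.0, 1.0, 1.0, 1.0], peer_weight=0.5)
# peer_weight < 0: peer effort crowds out one's own effort (substitutes).
substitutes = equilibrium_efforts([1.0, 1.0, 1.0, 1.0], peer_weight=-0.5)
```

With identical baselines b and weight w, the fixed point of this toy model is b / (1 - w), so equilibrium effort rises above the baseline under complements and falls below it under substitutes. If both forces operate at once, the net effect can be close to zero.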
[15] “Outsourcing” functionality can also be seen as a special case of the firm’s “make versus buy” dilemma addressed in the literature (Coase, 1937; Williamson, 1975; Williamson, 1985; Grossman and Hart, 1986).

As upstream dependencies can serve a portion of functionality to any number of downstream dependents simultaneously, network externalities exist: these dependency projects exert some influence over the quality and contribution costs of their dependents. This arrangement has the advantage of efficiency: organizing functionality into modular software packages reduces duplicated efforts and can lower upfront costs of project development. As public information goods, dependency projects expose their functionality in a non-rival and non-excludable fashion, giving rise to economies of scale. However, relying on external software that is itself under development imposes additional costs, ranging from mild maintenance costs to exposure to exploits or software faults, and ultimately exposes dependent projects to external, network-mediated shocks.

1.1.2 Related Work

Despite OSS serving as a foundational element of modern software development, the microeconomic fundamentals of the supply side have thus far received surprisingly little attention from the empirical economic literature. Some noteworthy exceptions include theoretical contributions by Lerner and Tirole (2002), Lerner and Tirole (2005a), and Athey and Ellison (2014) and the empirical exploration of innovation by Fershtman and Gandal (2011). This absence motivates the focus of the body of work contained in this manuscript. Specifically, we seek to examine microeconomic forces that shape contributor behavior and ultimately drive the production of OSS in equilibrium. Our analysis of peer effects between contributors in Chapter 2 draws on the literature on private production of public goods by heterogeneous agents (Bergstrom, Blume, and H. Varian, 1986; Jacobsen, LaRiviere, and Price, 2017).
Our study of software dependency formation in Chapter 3 is related to descriptive studies of OSS ecosystems (Decan, Mens, and Constantinou, 2018a; Decan and Mens, 2019; Decan, Mens, and Grosjean, 2019), OSS supply chain valuation (Keller et al., 2018; Robbins 5 et al., 2018), and both network innovation (Acemoglu, Akcigit, and Kerr, 2016) and formation un- der risk (Blume et al., 2013; Kovářík and Van der Leij, 2014; Elliott, Golub, and Leduc, 2022). 16 More specifics links between the work in this manuscript and the literature are made in Chapter 2 Section 2.3 and Chapter 3 Section 3.2. 1.2 Empirical Setting A distinct advantage for the empirical researcher studying OSS production in the modern era is the wealth of available observable data on both contributor behavior and project characteristics. Soft- ware forges, collaborative online platforms that serve as hubs for OSS development, have gained im- mense popularity. 1718 Maintainers can host project codebases, share documentation, communicate or coordinate design decisions, and easily integrate contribution requests from the wider community in a unified public forum. 19 Most pertinent to the present study, these platforms chronicle publicly available usage data at precise temporal granularity 20 and distribute them in digestible formats (Gousios, 2013). Furthermore, software development tools facilitate empirical analysis by preserv- ing a timestamped record of project development for each line of code authored by a contributor. Software version control 21 systems allow the researcher to “rewind” the state of a software project to a specific moment in time. 
By exploiting these advantageous sources of data, researchers can derive 16 Additionally, we refer the interested reader to broader discussions of OSS development and history covered by previous authors: the transition to stable OSS adoption (Fitzgerald, 2006), a case study of the development of the Apache Server (Mockus, Fielding, and Herbsleb, 2000), general OSS history (Bretthauer, 2001; Lerner and Tirole, 2005a), understanding OSS development (Feller and Fitzgerald, 2002; Von Krogh and Von Hippel, 2003; Fogel, 2005), developer motivations (Lerner and Tirole, 2002; Lakhani and Wolf, 2003), general discussions of open innovation (Chesbrough, 2003; Von Hippel, 2006) and commons-based peer production (Benkler, 2002; Benkler and Nissenbaum, 2006; Benkler, 2006). 17 Noteworthy examples include GitHub, SourceForge, Bitbucket, and GitLab. See Squire and Williams (2012) for an overview of the software forge ecosystem. 18 As of October 2022, GitHub plays host to over 98 million users contributing across 43 million public repositories (GitHub, Inc., 2022b). This can be seen as a lower bound estimate of platform activity as some share of GitHub activity occurs in private, unlisted repositories. 19 See Figure 2.B.1 for an example of the GitHub repository page for Twitter’s bootstrap library. 20 Since observable actions by software developers are recorded and timestamped on platforms like software forges, OSS characteristics can be observed down to seconds or even milliseconds. 21 For example, git, svn, mercurial, and bazaar are examples of popular version control systems. Among these, git is the most widely used by far (Stack Exchange, Inc., 2022). 6 refined and disaggregated sociotechnical measures for the evolution of project development activ- ity within the OSS ecosystem across time: contribution levels, the relative popularity of projects, dependency relationships, technical features of individual projects, and other characteristics. 
22 While the sheer volume and granularity of public OSS data facilitate much of the methodology used in this study, it is wise to temper expectations by noting some key unobservables from the perspective of the econometrician. First and foremost, measuring uptake, usage, or valuation of OSS projects in the broadest sense remains elusive. In the absence of a well-defined market, it is at the very least challenging to observe a comprehensive measure of the public’s willingness to pay for OSS public goods. Some simple yet imperfect proxies for OSS uptake exist. The researcher can gauge popular interest in a project via download counts (Fershtman and Gandal, 2011), contribution activity (Kalliamvakou et al., 2014), search term frequency 23, or by measuring the extent to which other projects rely on certain OSS packages (Robbins et al., 2018). 24 Another important set of unobservables are characteristics of individual contributors. By nature of version control systems, developers at a minimum contribute under a chosen email address and can potentially disclose some information about themselves on collaboration platforms. 25 Software forges can observe a contributor’s location when they interact with the platform, but this data is typically not publicly available. 26 Despite these disadvantages, the observable data gives the researcher an incredibly detailed level of insight into the OSS production process and allows us to place these dynamics into the context of economic theory.

22 See Chapter 2 Section 2.4 and Chapter 3 Section 3.4 for more specifics on OSS data and observable metrics utilized in this study.
23 For example, how frequently individuals search for, discuss, or mention certain OSS technologies across various internet forums.
24 Note that none of these approaches truly captures a willingness to pay for OSS goods by end users.
By using the revealed preferences of OSS contributors, a key contribution of Chapter 2 is to characterize peer effects in terms of labor development costs. See Section 1.3.
25 Such as their true name, geographic location, firm affiliation, and links to digital identities.
26 This is usually due to privacy concerns or compliance with public policy, such as California’s Consumer Privacy Act (2018) or the European Union’s General Data Protection Regulation (GDPR). For further details on privacy policy and compliance on the GitHub platform, see https://docs.github.com/en/site-policy/privacy-policies.

1.3 Contribution

In this manuscript, we develop microeconomic structure for two distinct phenomena within OSS production and discuss the implications our empirical analysis suggests for OSS sustainability. We exploit both disaggregated data on OSS contributor behavior and the structure of sociotechnical networks 27 to develop novel empirical contributions, filling a gap in the microeconomic literature on OSS. To the best of our knowledge, the work in this manuscript represents the first empirical tests of key theoretical hypotheses on OSS production dynamics (Athey and Ellison, 2014).

Our contribution is organized into two distinct chapters. 28 In Chapter 2, we study the role of peer effects in shaping equilibrium contribution levels to OSS. We explore the extent to which the contributions of peers can influence either an individual’s contribution level or productivity. By observing the revealed preferences of contributors, we provide the first empirical estimates of these peer effects in terms of development labor costs, which we subsequently argue represent a lower bound for the willingness to pay for OSS goods. In Chapter 3, we turn our attention to the formation of software dependency networks.
We discuss the maintainer’s decision-making framework for optimally balancing a level of internal development with external dependency usage and the implications this aggregate behavior has for dependency network structure in equilibrium. Extending the literature on strategic network formation, our structural approach characterizes the formation of software development networks under uncertainty and exploits highly disaggregated data to simplify estimation.

Both studies fortify a theoretical framework with empirical analysis, each pairing reduced form approaches with fully specified structural models. This methodology has several advantages. First, we can capture descriptive statistics and build intuition for complex OSS production processes via reduced form exploration. Second, a structural methodology allows us to rigorously specify and micro-found the mechanisms through which our hypothesized production effects operate. Finally, we can take our fully specified structural model to data and conduct policy counterfactuals. As our structural approach seeks to characterize the data-generating process for each production phenomenon, we can modify the underlying fundamentals of each process and assess welfare implications under the new equilibria.

27 For example, we employ a “peers-of-peers” identification strategy in Section 2.5.2 to overcome endogeneity concerns (Manski, 1993) and recover unbiased estimates of peer effects in individual contribution levels.
28 These two chapters, while both concerned with the microeconomic details of OSS production, are written to serve as independent research contributions.

The key insights from our analysis can be summarized at a high level. First, we empirically confirm an often-cited dilemma in OSS: simple contribution does not immediately imply sustained maintenance (Eghbal, 2020).
Under our counterfactual analysis of contributor peer effects, we find that extensive margin peer effects account for approximately 56% of aggregate contribution (Section 2.6). We also find little evidence that intensive margin peer effects have any influence on contemporaneous individual contribution levels, on average. 29 Moreover, when individuals make larger contributions, they tend to do so at greater marginal cost. Hence, while we can say that peer contribution activity may lead individuals to make small contributions, we cannot say that peers make each other more productive. Second, and related to the previous point, externalities are quite prevalent in the OSS setting in general. When studying digital supply chains, we find that removing less than 1% of core dependency packages reduces quality for the remaining peers by more than 5% (Section 3.7). Third, the role of maintainer risk aversion in delivering robust software dependency networks seems to be second order (Section 3.7). More often than not, attractive features of particular projects seem to outweigh considerations for package security from the perspective of the project maintainer choosing which external packages to import as dependencies. 30 Finally, it is abundantly clear that complex substitution patterns exist in this space. We find considerable heterogeneity in estimates of externalities across both projects and time.

29 We find stronger intensive margin peer effects when relaxing the assumption of contemporaneous influence. See Section 2.5.4 and Table 2.A.5.
30 As we discuss in Section 3.8, this may be a consequence of our chosen package quality metric.

For example, we find a higher level of complementarity in contribution levels between both (1) individuals and (2) dependent projects in the early days of GitHub (Sections 2.5 and 3.5). We also find that free-ridership is more prevalent in larger projects with dominant “insider” contributors (Section 2.5).
31 This study motivates and demonstrates the potential for research on OSS dynamics by empirical microeconomists. An abundance of rich data now exists to tackle fundamental questions regarding the nature of OSS production. Our findings have implications both for the management of OSS projects and future research in this space. First, there is some rationale for public support of critical yet oversubscribed 32 OSS projects from a public goods perspective. Peer effects alone cannot generate sustainable maintenance effort for key digital infrastructure. Second, a few core dependencies provide functionality for large segments of the OSS ecosystem. It would be prudent to prioritize support towards these cornerstone projects.

31 See Section 2.5.
32 Here meaning that the available maintenance bandwidth is insufficient to reliably ensure software quality for a widely used OSS codebase.

Chapter 2
Quid Pro Code: Peer Effects and Productivity in Open Source Software

2.1 Introduction

Open Source Software (OSS) projects are public information goods produced through incremental efforts of individual contributors. 1 2 Interested parties can freely download software code for their own use and can also propose contributions to the original maintainer of the project. 3 The very existence of OSS rebukes conventional wisdom on privately produced public goods 4 and various explanations have been offered to rationalize their provision, including signalling (Lerner and Tirole, 2002), (impure) altruism (Andreoni, 1990), need satiation (Athey and Ellison, 2014), and institutional structures imposed by self-organizing local communities (Ostrom, 1990; Benkler, 2002). In this study we examine an alternative channel through which widespread contribution to public OSS projects may be achieved: peer effects. Peer behavior can potentially affect the net returns to public good contribution through various channels, improving returns and ameliorating contribution costs. Can peer influence drive heterogeneity in preferences and contribution costs, effectively subsidizing the private provision of public goods?

1 Our use of the term “open source” requires some definition. In a general sense, OSS is a computer technology for which the underlying source code is made publicly available under a license permitting use, modification, and subsequent redistribution of derived products (Open Source Initiative, 2007). While there are many variations on the specifics of this definition, the most important features of software projects considered in this study are that (1) they are distributed under some permissive OSS license (GitHub, Inc., 2022a) and (2) they are collaborative projects that allow for modifications to be submitted from a contributor base wider than the original developer.
2 Throughout this chapter, we will use the terms “contributor”, “developer”, “individual”, and “agent” interchangeably in reference to the population of study.
3 For example, a user may wish to propose a new feature or fix a software fault (i.e., a “bug”).
4 Since contribution is costly, agents choose their contribution levels both with respect to private benefits of contribution and the level of the OSS public good delivered by the efforts of their peers. If the net benefit of contribution is negative, an individual may simply opt to free-ride on the efforts of others, leading to misallocation of contribution away from an efficient equilibrium.

Consider the quandary faced by maintainers of OSS projects. 5 Sindre Sorhus is a superstar OSS contributor. As of December 2021, he works on OSS full-time and is the author and primary maintainer of over 1,000 OSS projects (Sorhus, 2021).
As a prolific maintainer, Sorhus interacts with the wider community of OSS contributors and has personally reviewed tens of thousands of proposed contributions to his projects. Sorhus once reflected that “∼80% of contributors doesn’t [sic] know how to resolve a merge conflict, almost no one writes a good pull request titles, ∼30% don’t run tests locally before submitting a [pull request]”, and “∼40% don’t include docs/tests” (Sorhus, 2019). In essence, Sorhus’s concern centers around the lack of quality project contributions from his peers. Software development in general is a complex, ever-changing process and many potential contributors simply may lack the skills to contribute effectively. As opposed to shouldering the entire burden of OSS project development 6, to what extent can the contribution efforts of skilled contributors like Sorhus actually improve the productivity of their peers? 7

5 In this chapter, we will at times classify agents in the OSS public goods setting according to their level of participation in what is known as the “contributor funnel” (McQuaid, 2018). Users of an OSS project may utilize a software product but do not contribute to it. A subset of users are contributors and allocate some contribution effort to developing the project. A subset of contributors are maintainers, who are typically responsible for a large share of project contribution and may also have decision-making power over which proposed contributions are integrated into the project. We will also sometimes refer to these agents as developers.
6 While the use of OSS is itself non-rival, the contribution bandwidth of project maintainers is not (Brown, 2018).
7 In other words, how do maintainers induce project users down the “contributor funnel” into becoming productive, recurring contributors?

A key difference between OSS projects and other public goods is that production of OSS generates both a community of contributors and a set of auxiliary information goods around the project that can potentially reduce subsequent costs of contribution. For example, OSS project maintainers provide assistance and guidance to new contributors by responding to inquiries via mailing lists, message boards, or real-time chat channels. 8 Moreover, OSS communities typically archive the history of such project-related interactions between contributors, creating a publicly accessible knowledge base for project development. 9 OSS projects typically feature documentation 10 that gives a broad overview of the project, 11 provides detailed information on how the software operates at a technical level, and suggests how to properly propose new contributions. 12 Popular OSS projects can also generate a significant amount of buzz outside the contribution platform itself, from community-authored articles demonstrating usage to external forums 13 where users can request help for various programming and software tasks. The combination of these features forms the basis for peer effects on contributor productivity. Contribution activity itself can generate a form of “digital capital” 14 for subsequent OSS production, working to both lower the initial fixed cost of contribution for potential contributors and to make current contributors more productive. Hence, in contrast with many conventional public goods settings, there is scope for individual and peer contribution to become strategic complements.

8 Users who receive feedback on their contribution from project maintainers are far more likely to return to contribute in the future (Sholler et al., 2019).
9 Similarly, OSS projects are overwhelmingly managed using a version control system, making the entire project’s incremental development history a public record.
10 Note that documentation is generated by developer labor and is a contribution to the project itself.
11 Examples of high-level documentation include project README files bundled with the project source code, “wiki” pages, and long-form vignettes on project usage. For an example of best practices on how these are actually integrated into an OSS project, see Sections 8, 10, and 11 of Wickham (2015).
12 For example, a project maintainer may include a “contribution template” so that novice contributors avoid common pitfalls for new project contributions. Referring back to the example of Sindre Sorhus, this improves the quality of the proposed change and reduces the “back-and-forth” between maintainer and contributor.
13 A relevant example is the programming-focused question and answer website Stack Overflow, which has been described as a sister community to OSS collaboration platforms such as GitHub (Eghbal, 2020).
14 Or more accurately, human capital that is recorded or codified as a public information good and then used as an input in the production of additional public goods.

Salient examples of OSS begin to illustrate the scale at which developers have contributed labor towards the production of complex public information goods. As each OSS developer’s “contribution bandwidth” is both scarce and costly, the significance of peer effects that drive contributor labor can be measured naturally in terms of the opportunity cost of a developer’s time: what is the equivalent private market labor expenditure to finance the development of large OSS projects? Consider the case of the Linux Kernel. Regarded as the largest collaborative OSS project in history, the Linux Kernel was first released in 1991 by Linus Torvalds and has become the most widely used operating system basis for web servers, mobile devices, and high performance computing infrastructure. As of September 2021, the Linux Kernel has amassed over 31.3 million single lines of code from 23,927 distinct contributors over the past three decades.
Using standard methods from software engineering cost estimation, it would take nearly 70 million person-hours to rewrite the entire kernel from scratch, which would cost over $1.05 billion today. 15 16 While estimating the use-valuation of OSS is an important ongoing area of research (S. Greenstein and Nagle, 2014; Nagle, 2019), in this study we seek to characterize the extent to which peer effects can mitigate production costs of OSS public goods.

15 Estimated (conservatively) using the COCOMO model of software development cost estimation developed by Boehm (1981) and the software utility scc (Source: https://github.com/boyter/scc).
16 The median annual salary for software developers in the United States for 2020 was $110,140 ($52.95 per hour) (U.S. Bureau of Labor Statistics, 2021).

We seek to empirically assess peer effects on public good production using the context of OSS. Our methodology is organized into two phases. In the first phase, we build intuition on the magnitude of net peer effects in OSS contribution using a reduced form approach. To address concerns over endogeneity, we develop an identification strategy to determine to what extent individual effort levels are influenced by the contribution levels of their peers. Specifically, we instrument the likely endogenous contribution effort of an agent’s peers in a given project with the effort levels of the agent’s “peers-of-peers” defined by common contribution in outside projects. 17 The instrument operates by changing the relative incentives for peers to contribute to a given project by varying the incentives in external projects. This approach allows us to determine whether individual and peer contribution are strategic complements or substitutes on net, conditional on the set of developers that contribute at all. In the second phase, we develop a structural model of OSS contribution to pin down the microeconomic foundations for contributor behavior. We seek to place emphasis on disentangling contribution decisions along the extensive versus intensive margin and integrate peer influence into both decisions. To this end, we embed a micro-founded model of private public good provision (Bergstrom, Blume, and H. Varian, 1986) into the selection model of Heckman (1979). The structural approach facilitates the recovery of individual productivity parameters, allowing us to characterize the welfare of particular contribution profiles and conduct counterfactual analysis. Our main counterfactual of interest estimates the value of aggregate contribution added by peer effects.

17 Details for this identification strategy are given in Section 2.5 and Figure 2.1.

We apply this methodological framework in an empirical analysis, focusing on the context of Open Source Software contribution. We use individual-level contribution data for a random sample of 2,287 highly collaborative OSS projects hosted on the GitHub collaboration platform.

The remainder of this chapter is organized as follows. We first provide additional background on OSS development in Section 2.2. Next, we survey segments of related literature in Section 2.3. We introduce the empirical setting in Section 2.4, describing OSS contribution activity on the GitHub platform and giving an overview of data included in the empirical sample. We then develop a reduced form strategy to estimate peer effects in Section 2.5. With high-level insight on net peer effects in hand, we next develop a structural model of public good contribution with extensive and intensive margin peer effects in Section 2.6. We outline an estimation strategy, present estimation results, and conduct counterfactual analysis to measure the value of contribution generated by distinct peer effects channels. Finally, we summarize and interpret our findings in Section 2.7 and discuss promising directions for subsequent research.
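To make the replacement-cost reasoning in the Linux Kernel example from this introduction concrete, the following is a minimal sketch of a Basic COCOMO calculation (Boehm, 1981). It is an illustration under stated assumptions and not the procedure used to produce the figures reported in this chapter, which come from the scc utility with its own settings. The "organic" coefficients (2.4, 1.05) are the textbook values; the 152 hours per person-month and the 2,080-hour work year are assumptions introduced here, and all function names are hypothetical.

```python
# Basic COCOMO, "organic" mode: effort (person-months) = a * KLOC ** b.
# The coefficients are the textbook organic-mode values (Boehm, 1981).
COCOMO_A = 2.4
COCOMO_B = 1.05

# Assumptions made for this sketch only (not taken from the chapter):
HOURS_PER_PERSON_MONTH = 152   # conventional COCOMO person-month
HOURS_PER_WORK_YEAR = 2080     # 52 weeks x 40 hours


def cocomo_effort_person_months(lines_of_code, a=COCOMO_A, b=COCOMO_B):
    """Return estimated development effort in person-months."""
    kloc = lines_of_code / 1000.0
    return a * kloc ** b


def replacement_cost(lines_of_code, annual_salary):
    """Convert a COCOMO effort estimate into hours and a labor cost."""
    hourly_wage = annual_salary / HOURS_PER_WORK_YEAR
    person_months = cocomo_effort_person_months(lines_of_code)
    person_hours = person_months * HOURS_PER_PERSON_MONTH
    return {
        "person_months": person_months,
        "person_hours": person_hours,
        "hourly_wage": hourly_wage,
        "cost_usd": person_hours * hourly_wage,
    }


# Inputs quoted in the text: 31.3 million lines of code, $110,140 median salary.
estimate = replacement_cost(31_300_000, 110_140)
for name, value in estimate.items():
    print(f"{name}: {value:,.2f}")
```

The hourly wage computed here agrees with the $52.95 figure in footnote 16. The effort and cost outputs depend heavily on the chosen mode and hours convention, so they should be read as an order-of-magnitude illustration only.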
2.2 Background

OSS projects are typically organized around software code repositories, publicly accessible websites that host the project’s source code and provide collaboration functionality. 18 Users can view and download the source code of OSS projects for their own use. They can also contribute to the OSS project’s codebase. A typical contribution pattern works as follows: (1) a user downloads a copy of the source code, (2) makes a series of incremental changes to the codebase, and (3) submits a request to the owner of the original OSS repository to integrate their changes. Due to the open nature of the code and the permissiveness of OSS licenses in general, there is little to prevent a user from simply copying the codebase of an existing project into a new OSS good. 19 However, users can distribute contribution costs and share knowledge by working collaboratively with peers. It is therefore reasonable to assume there exist strong motivations for distributed users to rally around and contribute to particular OSS projects instead of splintering off into isolated endeavors, creating “digital communities” around OSS projects characterized by social norms and stocks of project-specific information capital.

Peer effects have long been discussed as a driving force behind the “success” of particular OSS projects. An early discussion on the net effect of peer influence on open software contribution began with the conjecture by Brooks Jr (1995), who observed that the addition of developers to a software project slows down the pace of development.
In a response to the so-called “Brooks’s Law”, Raymond (1999) countered this postulate with the example of OSS collaboration and “Linus’s Law”, which roughly states that the likelihood that faults in a software’s codebase will be identified and fixed rises with the number of users and contributors working with it. Raymond (1999) argues that the proliferation of highly collaborative, decentralized OSS projects is itself a rebuke of Brooks’s Law.

18 For example, Figure 2.B.1 depicts a snapshot of the web user interface for the bootstrap project’s GitHub repository, a popular JavaScript framework for web development: https://github.com/twbs/bootstrap. The history of all user contributions to the bootstrap codebase can be viewed here: https://github.com/twbs/bootstrap/commits/main. The repository contains a README with the source code, a document that contains links to detailed documentation, installation and usage notes, and guidance for prospective contributors.
19 This process is known as “forking” in the OSS community. Forks of original projects can also become active contribution communities in their own right. This typically happens when there is a sufficient number of contributors interested in pursuing a different direction of development.

As the production of OSS can clearly be subject to peer influence, a core impetus for this study is to disentangle the various channels through which peer effects operate and estimate the empirical implications for these effects on equilibrium contribution. In theory, peer effects can have both negative and positive impacts on the level of privately provided OSS. What anecdotal evidence do we have for either (1) free-riding or (2) productivity externalities in OSS development? With the rise of OSS use, a common concern amongst OSS project maintainers is over-subscription of their projects: users who flood communication channels with support requests without contributing the fix themselves (Eghbal, 2020).
A related concern is that many OSS projects originating from small groups of contributors are widely used as part of the “digital infrastructure” (Eghbal, 2016) that underpins modern information and communication technologies. Consider the case of OpenSSL, an encryption library that by some estimates is used by two-thirds of public-facing web servers to secure private information (The OpenSSL Project Authors, 2021). In 2011 a bug, now known as Heartbleed, was introduced into the OpenSSL codebase and was not discovered until 2014, 20 exposing a vast swath of internet communications that were previously thought to be secure. The cost to simply limit the extent of this vulnerability was estimated to be over $500 million USD (Kerner, 2014) and does not consider the cost of any secure data lost through the exploit. The OpenSSL team at the time “never had more than three to four core developers” overseeing more than a half a million lines of code on an annual donation budget of $2,000 USD (Oberhaus, 2019). Whether it was the sheer size and complexity of the OpenSSL codebase or the preferences of the maintenance team that deterred potential contributors, the fact that an OSS project serving as a critical component of internet infrastructure did not receive more attention from the wider community of users who rely on it ought to be cause for concern for OSS sustainability.

20 Consequently, some have pointed to the Heartbleed exploit as a repudiation of Linus’ Law (Meneely et al., 2014).

While free-riding on OSS contribution is likely prevalent and perhaps inescapable when considering a project’s user-base in the broadest sense 21, it may also be the case that increased participation in OSS distributes the joint cost of contribution and improves individual productivity. How can the development of OSS itself either make subsequent contribution less costly or induce the marginal free-rider to contribute?
Recommended practices in software engineering encourage developers to include documentation, testing frameworks, and use automated processes whenever possible (Fogel, 2005). Documentation explains the functionality and inner workings of software code in plain language, making it easier for both users and potential contributors to work with the software. Testing frameworks ensure the code functions as intended and are essential for a large collaborative OSS project. Continuous Integration (CI), a form of automation in the integration and testing of changes to software projects, facilitates a greater volume of contribution and has been shown to allow software projects to release 22 more frequently (Hilton et al., 2016a). Investments in these features lower the cost burden of maintenance and lower the barriers to entry for new contributors. Moreover, active contributors in OSS communities often provide “non-code” contribution services to the project, answering user inquiries, reviewing and integrating proposed changes, establishing design principles and community guidelines, and other functions peripheral to contributing code. It’s natural to imagine that all else equal, a potential contributor would prefer allocating their contribution bandwidth to an OSS project with sociotechnical infrastructure that makes it easier to work with.

21 Modern software projects, both proprietary and open source, typically borrow 70 to 90% of their functionality from OSS components (Nagle et al., 2022).
22 In software development, a “release” is a particular version of the project distributed to users. In an appeal to Linus’ Law, OSS proponents such as Raymond (1999) and Fogel (2005) encourage frequent releases.

The collaborative and decentralized nature of OSS development suggests a setting rife with intricate peer effects.
The wider population of OSS users may lack the skills or resources needed to contribute to OSS codebases and may simply free-ride on the contributions of more prolific developers. At the same time, OSS contribution itself generates an abundance of features that reduce the cost of and further incentivize wider OSS participation. We use this study as an opportunity to develop a microeconomic framework decomposing these forces and to empirically estimate their implications.

2.3 Literature

We review a subset of academic literature that can be divided into several distinct strands: (1) motivations for OSS contribution, (2) the private provision of public goods, and (3) peer effects.

2.3.1 Why Contribute to OSS?

Although initially puzzling, the existence and proliferation of OSS goods have been studied through an economic lens for over two decades (Lerner and Tirole, 2002). A common interest in early research on the economics of OSS focuses on the incentives for participation in public good production by both individuals and profit-maximizing firms. Different hypotheses have been offered to explain OSS provision and contribution behavior:

• Individual private benefits: intrinsic motivation (Lakhani and Wolf, 2003), need satiation (Bessen, 2006; Athey and Ellison, 2014), signalling and status (Glazer and Konrad, 1996; Lerner and Tirole, 2002; Roberts, Hann, and Slaughter, 2006), “warm glow” (Andreoni, 1990), option value of modular codebases (Baldwin and Clark, 2006), permissive licensing (Fershtman and Gandal, 2004; Lerner and Tirole, 2005b; Fershtman and Gandal, 2007).
• Social effects: pure altruism (Bonaccorsi and Rossi Lamastra, 2003), social norms and reciprocal altruism (Raymond, 1999; Bergquist and Ljungberg, 2001; Benkler, 2002), project productivity (Fershtman and Gandal, 2011)
• Strategic motivations for firms: innovation, market power (Bonaccorsi, Giannangeli, and Rossi, 2006), labor search 23, cost reduction (Andersen-Gott, Ghinea, and Bygstad, 2012)

Some closely related work empirically examines contribution to OSS and open source content in general. Fershtman and Gandal (2004) find that permissive software licenses induce greater levels of contribution. Hahn, Moon, and C. Zhang (2008) find that OSS developers are more likely to join projects with past collaborators. Fershtman and Gandal (2011) demonstrate an empirical relationship between the success of an OSS project, measured in downloads, and the extent to which its contributors work in other common projects, suggesting the existence of both direct and indirect project knowledge spillovers. In contrast, the present study uses microdata to measure peer effects on contribution at the individual level. Several authors have used the context of Wikipedia to study peer effects within collaborative production of open content. Exploiting blockages of Chinese-language Wikipedia in mainland China, X. M. Zhang and Zhu (2011) find that pro-social peer effects are increasing in the number of peers: individuals contribute more to Wikipedia when they have more peers. Slivko (2014) uses an indirect peers strategy to find modest evidence for positive, intensive margin peer effects amongst frequent contributors.

2.3.2 Private Public Good Provision

Seminal work seeks to rationalize private provision of public goods.
While the canonical public goods model of Samuelson (1954) suggests strong incentives to free-ride on the contributions of others, heterogeneity in both preferences and the marginal cost of provision can explain positive levels of private provision in many contexts (Tiebout, 1956; Stiglitz, 1981; Stiglitz, 1982; Bergstrom, Blume, and H. Varian, 1986; Cornes and Sandler, 1985; Andreoni, 1990; Fischbacher and Gächter, 2006; Kotchen, 2009; Jacobsen, LaRiviere, and Price, 2017). In the case of OSS, online collaboration dramatically reduces transaction costs inherent to the production of other types of public goods (Coase, 1937; Nitzan and Romano, 1990). Social norms develop around projects in order to efficiently manage the needs of the community and the time constraints faced by contributors (Holländer, 1990; Ostrom, 1990). Moreover, agents are subject to contribution externalities and can confer productivity benefits on peers, which in turn confer additional benefits to the original agent (Elliott and Golub, 2019). In this sense, agents “pass through” benefits of increased contribution and can be compensated for these investments.

23 See https://github.com/t9tio/open-source-jobs for a list of job listings for private firms with primary products centered around GitHub OSS repositories.

Several authors have focused on public good provision specifically within the context of OSS. Johnson (2002) analyzes a model of OSS public good contribution. As expected, the assumption of fixed costs of contribution precludes the efficiency of the decentralized equilibrium. Baldwin and Clark (2006) find that highly “modular” codebases provide contributors with option value and ultimately attract more contribution.

2.3.3 Peer Effects

Productivity Spillovers

Particularly of concern to our reduced form analysis, we link this work to an expansive body of literature concerning peer effects and their estimation.
Experimental evidence suggests that peer effects in public goods settings can be driven by punishment (Fehr and Gächter, 2000) and cooperation (Falk and Ichino, 2006), and can ultimately increase voluntary contribution to public projects (Archambault, Chemin, and Laat, 2016). Several empirical studies find evidence of labor productivity “spillovers” when high ability peers are introduced (Mas and Moretti, 2009; Lindquist, Sauermann, and Zenou, 2015). There is mixed evidence for peer effect heterogeneity across individuals (Arcidiacono and Nicholson, 2005; Cornelissen, Dustmann, and Schönberg, 2017), suggesting the context and estimation strategy matter. A related literature investigates the importance of group sizes on treatment and peer effects (Angrist and Lavy, 1999; Krueger, 2003).

Identification

Identification of peer effects in non-experimental settings is of great concern to this literature. Manski (1993) posits a “reflection problem” which Bramoullé, Djebbari, and Fortin (2009) suggest can be solved by using instruments generated by the network structure itself: the behavior of indirectly linked agents can generate the quasi-random variation needed to address endogeneity concerns with estimating peer effects in observational data. It should be noted that the identification strategy of Bramoullé, Djebbari, and Fortin (2009) relies purely on characteristics of the network structure between agents and, without qualification, can be devoid of microeconomic foundations or even lack appealing quasi-random variation for causal identification (Angrist, 2014). Other authors have used alternative strategies, such as true random assignment of peers (B. Sacerdote, 2001; Guryan, Kroft, and Notowidigdo, 2009; Carrell, B. I.
Sacerdote, and West, 2011), exploiting quasi-experimental designs (Dahl, Løken, and Mogstad, 2014), overlapping peer groups (De Giorgi, Pellizzari, and Redaelli, 2010), directly modelling endogenous peer networks (Goldsmith-Pinkham and Imbens, 2013), the use of panel data (Patnam, 2011), and explicit structural approaches (Ciliberto et al., 2016).

Our study draws several techniques from this literature, including social connections, changes in peer groups, and individual fixed effects, to develop a unique, micro-founded “peers-of-peers” identification strategy in Section 2.5. To the best of our knowledge, the closest use of the peers-of-peers identification strategy in public good contribution is Slivko (2014), who uses the number and average contribution level of indirect peers to instrument for peer contribution.

2.4 Data

We use observational data to measure individual contribution levels over time for a sample of OSS projects. We draw this sample from projects hosted on GitHub, the world’s largest collaborative software development platform. 24 For each project, we observe agent-level contribution efforts in continuous time 25, measured in “commits”, or atomistic modifications to the codebases of OSS projects. 26 For the purposes of this study, we will define an agent’s peer group in a given project as the set of other developers contributing code to that project. Individual and peer contribution levels across projects and time will form the core of the reduced form and structural analysis. Additional details on the data used in this paper can be found in Section 2.C of the appendix.

We begin by describing the dataset in broad strokes. Since the universe of OSS repositories on the GitHub platform is incredibly vast 27, we restrict our empirical sample to a randomly selected subset of popular and highly collaborative projects.
Specifically, we take a 10% random sample of GitHub projects with 15 or more distinct contributors and 100 or more “stars” 28 as of June 2019. This results in an empirical sample containing 2,287 projects and 107,921 distinct contributors observed from the launch of GitHub in April 2008 through June 2019. 29 We aggregate individual contribution to a monthly frequency and therefore the unit of analysis is individual-project-time. 30 The most commonly represented programming languages for these projects are JavaScript (31%), Python (11%), and Java (9%).

24 Launched in April 2008, GitHub has become the world’s largest source code host and de facto collaboration platform for OSS projects.
25 Each commit to a project is recorded with a timestamp (e.g. 2009-10-31 01:48:52).
26 Note that a commit can encompass changes to any number of lines across any number of files. A natural concern may be that variation in the size of individual commits makes it difficult to compare them as equivalent units of contribution effort. For example, a single commit might be a simple typo correction requiring little effort or a complicated “bug” fix that took many hours to address. Some have argued for simpler measures to estimate labor commitment to software development, such as the number of days a developer makes at least one contribution to a project in a given time period (Sherwood, 2015).
27 As of January 2020, GitHub has over 40 million users and hosts more than 190 million software repositories (GitHub, Inc., 2020).
28 On the GitHub platform, users can mark interesting projects by “starring” them, which subscribes the user to a newsfeed covering project development. For the purposes of this study, we use project stars as a proxy for user interest or quality of the project. Stars also distinguish highly collaborative, “engineered” software projects from small, single-user projects (e.g., abandoned forks or repositories containing personal files like notes or school projects) (Munaiah et al., 2017b).
Of the contributors represented in the sample, 3.7% are members of the projects they contribute to and only 0.57% are project owners. 31 The average project in the empirical sample is 5 years old and is the product of 2,490 cumulative commits made by 56 distinct contributors. The average individual in the sample contributes 13 commits to a particular project a month and 53 commits across all projects over the sample period. It is critical to note the (right) skewness of contribution, both between and within projects: the median project has 829 cumulative commits made by 29 distinct contributors while the median agent makes only 3 commits to a single project each month. Furthermore, the share of individual contribution within projects is bimodal (see Figure 2.B.3). Roughly 45% of contributions in our sample are made by agents who represent 5% or less of total project contribution for that month. On the other end of the spectrum, about 8% of observations in the sample represent individual contributions that account for over 95% of total project commits for that month. In simpler terms, the most common contribution pattern within projects involves many individuals contributing a small share 32 relative to a dominant core contributor in each period. The sample provides evidence that even though both aggregate contribution and the number of distinct contributors have grown over time (see Figure 2.B.4 and Figure 2.B.5), average individual contribution levels have remained quite stable (see Figure 2.B.6). Consistent with anecdotal evidence from the OSS literature (Eghbal, 2020) and theory on contributor behavior (Athey and Ellison, 2014), these characteristics suggest the growth of an OSS project is a combination of (1) a small number of dominant core contributors and (2) the aggregate effect of small contributions from a wider population of software developers.

29 Since GitHub is simply the web platform hosting the project, some projects in the sample have contributions made prior to either the existence of GitHub or the project’s arrival on the GitHub platform. Projects are managed using version control systems (VCS) that record a complete history of changes in the project since its inception. GitHub takes its name from the VCS tool used by the projects it hosts: git. Figure 2.B.2 overlays histograms for (1) the earliest recorded commit in each project and (2) the date the project was created or moved to the GitHub platform.
30 In other words, each observation is the level of contribution by an individual to a particular project for the given month.
31 For the projects in this sample, “ownership” does not imply property rights over the software code itself. Project “ownership” and membership on the GitHub platform simply mean the user has certain administrative privileges within the repository, the most important of which is the ability to merge proposed contributions of outsiders into the main project codebase. It should be noted that many projects may feature core contributors with a considerable amount of influence on project design decisions who are not officially project owners or members in the GitHub system.
32 This is known as “drive-by” or “casual” contribution (Fogel, 2005; Eghbal, 2020).

With a general understanding of the GitHub contribution sample used in this chapter, we now direct attention towards measures of peer and individual contribution germane to both reduced form and structural analysis. We present key descriptive statistics for this empirical sample of contribution measures in Table 2.A.1. The average agent contributes 13 commits to a project each month and has an average of 17 peers contributing 188 commits in aggregate. As noted before, individual contribution is highly right-skewed. The median agent contributes just 3 commits per month and has 7 peers who contribute 59 total commits.
Approximately 6.8% of observations in the empirical sample involve a sole contributor with no peers in that time period. Since the mean and median individual contribution levels coincide with project-specific contribution, these data suggest that most agents contribute to a single project in a month. 33 The average agent’s mean cumulative contribution to a particular project is 256 commits (median 23), a pattern that is naturally similar among peers. Together, insights from the empirical sample suggest agents form affinities with a particular project and continue to contribute to it over time. 34

33 It should be noted that the apparent lack of contributors contributing to multiple projects may simply be an artifact of sample construction. We simply take a random sample of projects and observe contribution activity of agents within those particular projects. Therefore, individuals in the sample may be contributing to other OSS projects not recorded in the present sample. We at least partially account for this deficiency when constructing our instrument for peer contribution in Section 2.5.2, a measure that sums contribution levels of “peers-of-peers” across all projects recorded in the GHTorrent sample of Gousios (2013), some of which are not contained in our empirical sample.
34 This may be explained by many alternative mechanisms, including individual need (Bergstrom, Blume, and H. Varian, 1986; Lerner and Tirole, 2002; Lakhani and Wolf, 2003), the discipline of social norms (Ostrom, 1990), or an accumulated expertise within a project.

Finally, we collect two additional measures most relevant for our structural approach described in Section 2.6. First, for each project and time period, we measure the number of “stars” associated with the project. This is a rough proxy for an OSS repository’s popularity and is used to measure the level of public good quality.
Similar to contribution levels, project quality is highly skewed: the mean (median) project has 910 (161) stars in a given month. Second, we observe individual time allocated on the platform. Specifically, for each individual and time period, we measure how many days they spend contributing to any project on the GitHub platform (Sherwood, 2015). We use this measure of time allocation to proxy for numéraire good consumption, which in turn facilitates the estimation of time and project-varying productivity shocks for each agent. In a given month, the average (median) agent makes commits over 3.88 (2) days to projects in the sample. Compared with the skewness in contribution levels, this descriptive statistic suggests a stark difference between the extensive and intensive margin contribution decisions.

2.5 Reduced Form

Before developing a structural model, we build intuition for net peer effects in public goods contribution using a simple reduced form framework. We seek to understand how an agent’s individual contribution level is influenced by the contribution level of her peers. This section is organized as follows. We first outline a baseline econometric specification to assess peer effects in public good contribution. Next, in an effort to address endogeneity concerns and give a causal interpretation to the peer effect estimates, we propose an instrumental variable for peer contribution, define its measurement, and discuss various possible threats to identification. The final subsection discusses the empirical results.

2.5.1 Peer Effects on Individual Contribution

Consider a setting in which individuals i ∈ N contribute to OSS projects p ∈ P in each period t ∈ T. The outcome of interest, a_{ipt} ≥ 0, is the contribution level for agent i to project p at time t. The aggregate contribution level of agent i’s peers to project p at time t, denoted by a_{-ipt} ≡ Σ_{j≠i} a_{jpt}, is the regressor of interest.
We present a baseline specification 35 for contribution peer effects in Equation (2.1):

a_{ipt} = \delta a_{-ipt} + \beta' X_{ipt} + \epsilon_{ipt}. \qquad (2.1)

Here the vector X_{ipt} is a set of observable exogenous factors driving agent i’s level of contribution to project p at time t. The term ϵ_{ipt} represents unobservable factors driving contribution. The coefficient of interest in Equation (2.1) is δ, which captures the (average) effect of aggregate peer contribution on the level of individual contribution. 36 We will refer to the coefficient δ as the reduced form peer effect in contribution. This term is sometimes referred to in the literature as the “endogenous effect” (Manski, 1993) or “social multiplier” (Glaeser, B. I. Sacerdote, and Scheinkman, 2003).

Our empirical analysis seeks to test the null hypothesis of no peer effects (δ = 0) against an alternative that there exists some relationship between contribution levels of peers (δ ≠ 0). If there is evidence of peer effects, we are also interested in the net effect of the opposing externalities. The core premise of this study is that peer influence is the net effect of two distinct externalities in contribution. In the canonical public goods model, individual and peer contribution to public goods are strategic (gross) substitutes and therefore voluntary provision is vulnerable to free-riding. If incentives to free-ride dominate, we should expect δ < 0 37 in equilibrium. On the other hand, if an increased level of peer contributions also leads to an increase in agent i’s contribution ceteris paribus, it is likely the case that some other peer effect (e.g., externalities in productivity and contribution costs, pro-social behavior) dominates incentives to free-ride. This would imply δ > 0.

35 Alternative specifications similar to Equation 2.1 are presented in Section 2.5.4, serving to both provide robustness checks and to consider different characterizations of peer influence.
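The leave-out peer aggregate a_{-ipt} that enters Equation (2.1) can be illustrated with a minimal sketch. It assumes commit records arrive as (agent, project, timestamp) tuples with timestamps formatted as in the Data section; the function names, agent names, and toy records are hypothetical and are not taken from the original data pipeline.

```python
from collections import defaultdict


def monthly_panel(commits):
    """Aggregate (agent, project, timestamp) commit records into monthly
    counts a[(agent, project, 'YYYY-MM')]."""
    a = defaultdict(int)
    for agent, project, timestamp in commits:
        month = timestamp[:7]  # 'YYYY-MM-DD HH:MM:SS' -> 'YYYY-MM'
        a[(agent, project, month)] += 1
    return dict(a)


def leave_out_sums(a):
    """Return a_minus[(i, p, t)]: total commits to project p in month t
    made by every observed contributor other than i."""
    totals = defaultdict(int)
    for (_, project, month), n in a.items():
        totals[(project, month)] += n
    return {(i, p, t): totals[(p, t)] - n for (i, p, t), n in a.items()}


# Hypothetical toy records, not drawn from the empirical sample.
commits = [
    ("alice", "proj", "2009-10-31 01:48:52"),
    ("alice", "proj", "2009-10-31 09:00:00"),
    ("bob", "proj", "2009-10-02 12:00:00"),
    ("carol", "proj", "2009-11-05 08:30:00"),
]
a = monthly_panel(commits)
a_minus = leave_out_sums(a)
print(a_minus[("alice", "proj", "2009-10")])  # bob's single commit
print(a_minus[("carol", "proj", "2009-11")])  # sole contributor, no peers
```

Subtracting an agent's own count from the project-month total avoids an explicit loop over peers; an observation whose leave-out sum is zero corresponds to a sole contributor in that period.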
36 In other words, the effect on individual contribution when peer contributions increase by 1 commit, on average and ceteris paribus.
37 Note that this assumes that β ≠ 0 in the true model for individual contribution. If the population model in Equation (2.1) is such that β = 0 (i.e., a model without covariates or an intercept), then δ ∈ [0,1] by construction. As δ → 0, a single contributor dominates and all others free-ride. As δ → 1, contribution is uniform across peers.

Other observable factors that influence agent contribution are captured in a vector X_{ipt} and may potentially vary across agents, OSS projects, and time. Examples of these influences may include individual and peer contribution history 38, observable quality or popularity of the OSS project, the size of the contribution peer group, technical characteristics of the projects 39, and other agent characteristics. 40 41 42 In terms of the specification in Equation (2.1), we can also include a battery of individual, project, or time fixed effects 43 in X_{ipt}.

2.5.2 Identification

The specification in Equation (2.1) describes a model in which peer groups are defined as the set of agents contributing to a particular OSS project at a given point in time: individual contribution is a function of contemporaneous, aggregate peer contribution. The specification is a simplified linear-in-sums 44 formulation similar to reduced form models studied widely in the peer effects literature (Manski, 1993; Bramoullé, Djebbari, and Fortin, 2009; Goldsmith-Pinkham and Imbens, 2013). Point identification of the parameter δ in Equation (2.1) is demonstrated by Lee (2007) by exploiting “leave out” sums and variation in peer group sizes, overcoming the well known non-identification result of Manski (1993). In our setting, we point to the descriptive statistics in Table 2.A.1 as evidence that contributors are likely to have different groups of contributing peers in each period for any particular project. Since peer groups in the current empirical setting are naturally quite dynamic, we argue that point identification is established.

Therefore, we wish to go one step further and establish causal identification for the parameter δ in Equation (2.1). Under what conditions can we interpret an estimate of δ as the local average treatment effect (LATE) of peer contribution on the level of individual contribution? The immediate challenge is that since individual and peer contribution are both observed choice variables, a naive estimate of δ likely suffers from endogeneity bias (Angrist, 2014; Lewbel, 2019).

38 Such as an agent’s cumulative contribution to a project at time t or their contribution in previous periods. Cumulative and temporal lags of contribution can capture an agent’s accumulated experience or affinity with a particular project.
39 Such as project age and programming language used.
40 In the context of GitHub data available, agent characteristics may include whether the agent is the owner or member of the project or if they identify with a particular employer.
41 Agents can voluntarily include the name of their employer in their GitHub profile and can make contributions with a company email address.
42 On the GitHub platform, an agent can be added as a member to a project, potentially giving them more discretion over what proposed changes by the wider community are integrated. It also is plausibly a signal of an agent’s affinity with a particular project.
43 Inclusion of individual level fixed effects in X_{ipt} accounts for an agent’s intrinsic proclivity to contribute to OSS goods, independent of other factors (Andreoni, 1990).
44 Technically, the specification is linear in “leave out” sums by definition of a_{-ipt}.
An experimental ideal to causally identify the net peer effects parameter δ would involve first randomly assigning agents to projects and then allowing them to decide contribution levels, ensuring random peer groups in which the choice of contribution levels ought to be uncorrelated with unobservables, i.e., Cov(a_{-ipt}, ϵ_{ipt}) = 0. In reality, agents select into projects and choose contribution levels on the basis of potentially unobservable influences, such as personal need, technical ability, and their endowment of time to work on OSS projects. For example, high ability agents might select into and make above-average contributions to common projects, generating positive bias in the ordinary least squares estimate of δ in Equation (2.1) (i.e., Cov(a_{-ipt}, ϵ_{ipt}) > 0). On the other hand, low ability agents may also select into projects with highly skilled developers and make minimal contributions. Taken to the extreme, agents who free-ride completely do not contribute at all and therefore do not appear in the sample whatsoever. Since we cannot completely account for all free-riders in OSS 45, we must acknowledge that any interpretation of the estimated effects in this analysis is conditional on the population of individuals who contribute at all.

45 Given its wide-reaching prevalence, it’s difficult to imagine there exist consumers of information technology who have not used OSS at some point.

In the absence of purely random assignment of contributors to projects, we address the concern over endogeneity in peer contribution a_{-ipt} by use of an instrumental variables strategy. We seek a valid instrument for peer contribution that, conditional on other observable and exogenous factors X_{ipt}, (1) exerts some influence on the contribution levels of Agent i’s peers j ≠ i and (2) only influences Agent i’s contribution to project p through its effect on peer contribution a_{-ipt}.
46 In other words, in lieu of random assignment to projects, we must opt for an instrument that generates quasi-random variation in peer contribution levels, conditional on the set of agents that contribute at all to a given project. Furthermore, we combine this instrumental variables approach with a battery of both control variables that plausibly explain OSS contribution and fixed effects at the individual, project, and time period to account for common but unobservable shocks across each unit. 47

Contribution by peers-of-peers

Consider an agent i who contributes to OSS project p. If for some reason i’s peers suddenly find contribution to other projects relatively more (less) attractive, they may allocate efforts away from (towards) project p through a channel with no direct influence over Agent i’s contribution to p. This strategy is facilitated by the project-mediated “social network” of individual developers in which connections are defined by the projects they commonly contribute to. 48 Agent i has peers j ≠ i in project p, who in turn also have peers k ≠ i, j in other projects q ≠ p they also contribute to. Hence, we argue we can use the contribution network structure itself in a “peers-of-peers” identification strategy in the spirit of Bramoullé, Djebbari, and Fortin (2009) to recover the effect of peer effort levels on equilibrium contribution levels. 49 An important departure is that while the strategy of Bramoullé, Djebbari, and Fortin (2009) is designed to exploit general characteristics of the peer social network, the identification used in this study is based on microeconomic principles of substitution.

46 In the language of instrumental variable estimation, an instrument that satisfies both the (1) relevance and (2) exclusion conditions.
47 In the context of our notation, across each i, p, and t.
48 In a similar effort, Fershtman and Gandal (2011) use a bipartite graph to model connections between OSS projects and contributors.
49 As noted when surveying related literature, Slivko (2014) uses a similar “peers-of-peers” identification strategy by using a network of Wikipedia editors mediated by articles commonly contributed to.

Figure 2.1: Identification Strategy (“Peers-of-peers” Contribution). Agents {i, j, k} contribute to Projects {p, q}. Assume i and j contribute to p while j and k contribute to q. Panels: (a) Initial Setting; (b) Suppose Agent k increases contribution to Project q; (c) Case 1: Agent j substitutes towards Project p (contribution is a strategic substitute); (d) Case 2: Agent j substitutes away from Project p (contribution is a strategic complement).

We sketch out the identification strategy graphically in Figure 2.1. To guide the graphical intuition, consider the following hypothetical scenario. Suppose there are three contributors N = {i, j, k} and two OSS projects P = {p, q}. Assume that at the beginning of period t, contribution profiles are a_{ipt}, a_{jpt}, a_{jqt}, a_{kqt} > 0. Hence, Agents i and j contribute positive amounts to Project p while Agents j and k contribute positive amounts to Project q. Agent i’s direct peer is Agent j and indirect or “peer-of-peer” is Agent k. In this sense, Agents i and k are connected only indirectly through the contribution patterns of Agent j. 50 This initial setting is represented in Panel (a) of Figure 2.1. Next, suppose Agent k increases their contribution to Project q (e.g., Panel (b) of Figure 2.1). If changes in Agent k’s contribution to Project q influence the time-constrained contribution behavior of Agent j, then Agent j may have incentives to change her contribution levels to Project p. The case in which Agent j finds her contribution to Project p a strategic complement with Agent k’s contribution to Project q (equivalently, contribution within Project q is a strategic substitute, so Agent j shifts effort towards Project p) is depicted in Panel (c) of Figure 2.1. An example in the OSS setting may occur when Agent k contributes a fix for an issue in Project q that was consuming Agent j’s contribution bandwidth. Conversely, the case in which Agent j finds her contribution to Project p a strategic substitute with Agent k’s contribution to Project q (equivalently, contribution within Project q is a strategic complement, so Agent j shifts effort away from Project p) is depicted in Panel (d) of Figure 2.1. This may arise if Agent k contributes an attractive fix or feature to Project q that encourages additional contribution from Agent j. In either case, the contribution pattern of Agent i’s indirect peer Agent k influences Agent i only through changes in the behavior of Agent j.

50 That is to say, Agents i and k are not directly connected through the contribution networks. Any influences Agent k’s contribution has on that of Agent i operate only through changes in Agent j’s behavior.

In summary, we propose the use of aggregate contribution of peers-of-peers effort to instrument for peer effort. The instrument operates by inducing substitution of contribution effort across projects, generating quasi-random variation in aggregate peer effort from the perspective of individual developers. In the following two subsections, we define the measurement of the peers-of-peers instrument and provide a set of assumptions for its validity.

Instrument Measurement

Denote the “peers-of-peers” instrument for peer contributions a_{-ipt} as z_{ipt}. Roughly speaking, we choose to define z_{ipt} as the aggregate contribution of peers of i’s peers in project p at time t − 1. 51

51 We consider the contribution of peers-of-peers in the previous period to mitigate concerns over reverse causality.

Formally, we measure z_{ipt} as:

z_{ipt} = \sum_{j \neq i} \sum_{q \neq p} \sum_{k \neq i,j} \mathbb{1}\{a_{jq,t-1} > 0\} \, \mathbb{1}\{a_{jp,t-1} > 0\} \, a_{kq,t-1}. \qquad (2.2)

Hence, z_{ipt} represents “aggregate contribution by i’s peers-of-peers defined by project p in month t − 1”.
52 To avoid concerns of reverse causality, we use peers-of-peers contribution in the previous month t − 1 to instrument for peer contribution in month t. Since Equation (2.1) postulates that agents respond to aggregate as opposed to average peer contribution, we construct the peers-of-peers instrument similarly.

Threats to Identification

The validity and strength of the peers-of-peers instrument z_{ipt} for peer contribution a_{-ipt} rest on several assumptions.

Assumption 1. No isolated contributors.

Most obviously, contributors need to have peers in order to assess the influence of peer effects. Moreover, their peers must also have peers. While by construction of the sample each project has at minimum 15 distinct contributors over its lifespan, we acknowledge that the empirical sample includes a small share of observations in which only a single agent makes a contribution in that time period. 53 We should reasonably expect such observations to both weaken the relationship between the instrument z_{ipt} and peer effort a_{-ipt} and introduce downward bias to estimates of the peer effect coefficient δ.

52 Since the sample is constructed by measuring all contribution around a set of core collaborative projects, it is important to note that agents in the empirical sample also contribute to outside projects. Therefore, projects q ≠ p and contribution levels {a_{jqt}, a_{kqt}} may not be present in the sample.
53 As noted in Section 2.4, these observations comprise about 6.8% of the empirical sample.

Assumption 2. For each agent i, there exists a set of projects i will never contribute to, independent of the cost of contribution.

Agents cannot be peers with everyone. For the exclusion restriction to hold, it is necessary that peers-of-peers contribution influences individual contribution only through peer contribution. Hence, we need a sufficient level of contribution behavior where agents are connected indirectly through peers.
54 Consider a setting in which all agents contribute to all projects. In this setting, agent i’s contribution level is directly influenced by other agents since the “peers of i’s peers” are really just i’s peers.

Assumption 3. Conditional on observable influences, agents substitute contribution effort between projects.

In other words, the peers-of-peers effort z_{ipt} is conditionally correlated with aggregate peer contribution a_{-ipt} for the relevance condition to hold: Cov(a_{-ipt}, z_{ipt} | X_{ipt}) ≠ 0. This is our most critical assumption. For the instrument to be relevant, there needs to exist some degree of influence of peers-of-peers contribution on peer contribution in aggregate. The immediate concern with using “peers-of-peers” contribution to instrument for peer effort in Equation (2.1) is that the objective of reduced form analysis is to test the null hypothesis of δ = 0: no influence of peer contribution on individual contribution within projects. However, it is important to note that the peers-of-peers instrument operates through substitution with peer contribution between projects while the null hypothesis for Equation (2.1) only accounts for substitution with peer effort within projects. There is no reason ex ante that one substitution pattern precludes the other. Additionally, we argue that the peers-of-peers contribution levels are relevant conditional on other exogenous or predetermined factors that drive peer contribution, such as cumulative contribution in a project (i.e., Cov(a_{-ipt}, z_{ipt} | X_{ipt}) ≠ 0). 55

Combining these arguments, we assert that if these assumptions hold then the peers-of-peers instrument z_{ipt} drives some degree of meaningful variation in peer contribution a_{-ipt} that is quasi-random and therefore exogenous from the perspective of the individual i.
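The measurement of the peers-of-peers instrument in Equation (2.2) can be sketched in a few lines. This is an illustration under stated assumptions rather than the construction used for the estimates: months are indexed by integers, the contribution panel is a dictionary keyed by (agent, project, month), and the two indicators in Equation (2.2) are read as "peer j contributed to project p" and "peer j contributed to project q" in month t − 1. The function name and toy data are hypothetical.

```python
def peers_of_peers(a, i, p, t):
    """Sketch of z_ipt: a maps (agent, project, month_index) -> commits."""
    prev = t - 1
    # All strictly positive contributions observed in month t-1.
    active = [(agent, proj, n) for (agent, proj, month), n in a.items()
              if month == prev and n > 0]
    # Direct peers j != i: agents active in project p in month t-1.
    peers = {agent for agent, proj, _ in active if proj == p and agent != i}
    z = 0
    for j in peers:
        # Other projects q != p that peer j contributed to in month t-1.
        other_projects = {proj for agent, proj, _ in active
                          if agent == j and proj != p}
        for q in other_projects:
            # Contributions to q by agents k other than i and j.
            z += sum(n for agent, proj, n in active
                     if proj == q and agent not in (i, j))
    return z


# Hypothetical panel mirroring Figure 2.1: i and j contribute to p,
# while j and k contribute to q, all in month 0.
a = {("i", "p", 0): 2, ("j", "p", 0): 3, ("j", "q", 0): 1, ("k", "q", 0): 5}
z_i = peers_of_peers(a, "i", "p", 1)  # k's commits to q, reached via j
z_k = peers_of_peers(a, "k", "q", 1)  # i's commits to p, reached via j
z_j = peers_of_peers(a, "j", "p", 1)  # j's only peer i has no other project
print(z_i, z_k, z_j)
```

As in the triple sum of Equation (2.2), an indirect peer k is counted once for each direct peer j through whom k is reached, so the measure is an aggregate rather than a count of distinct peers-of-peers.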
54 In other words, the social network of contribution cannot be a complete graph and a sufficient number of “intransitive triads” must exist in the contribution network (Bramoullé, Djebbari, and Fortin, 2009; De Giorgi, Pellizzari, and Redaelli, 2010; B. S. Graham, 2015).
55 Furthermore, we note there is some degree of mechanical correlation between a_{-ipt} and z_{ipt}: a greater number of peers to individual i is likely to subsequently generate a greater number of peers-of-peers.

2.5.3 Results

We present baseline estimates for the peer effects parameter δ of the reduced form model from Equation (2.1) in Table 2.A.2. Columns (1) through (3) of Table 2.A.2 present ordinary least squares (OLS) estimates while Columns (4) through (6) present instrumental variables estimates using two-stage least squares (IV 2SLS). We use peers-of-peers contribution z_{ipt} to instrument for peer effort a_{-ipt}. Columns (1) and (4) estimate a specification of Equation (2.1) with only an intercept and the endogenous regressor a_{-ipt}. Columns (2) and (5) add covariate controls 56 and Columns (3) and (6) add covariate controls alongside project and year-month fixed effects.

The estimates in Table 2.A.2 suggest little evidence of peer effects in contribution on average for the full sample. The OLS estimates in Column (3) are not statistically different from zero after accounting for project fixed effects and observables. The same is true for the corresponding 2SLS estimates. These specifications explain roughly 18% of the variation in individual contribution for the full sample. We note the F statistic from the first stage of the 2SLS estimate in Column (6) of Table 2.A.2 is 64.37. 57 Hence, given the model in Equation (2.1) and the sample at hand, we cannot reject the null hypothesis of no peer influence on individual contribution on average (δ = 0) once we account for project fixed effects and covariate controls.
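To make the contrast between the OLS and IV columns concrete, the following sketch simulates a just-identified setting with one endogenous regressor and one instrument, where the 2SLS estimator reduces to the ratio Cov(z, y) / Cov(z, x). The data-generating process, the parameter values, and the variable names are hypothetical and chosen only for illustration; they are not calibrated to the GitHub sample, and the sketch omits covariates and fixed effects.

```python
import random


def cov(x, y):
    """Sample covariance of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / len(x)


rng = random.Random(0)
n, delta = 20000, 0.5  # hypothetical sample size and true peer effect
z, x, y = [], [], []
for _ in range(n):
    u = rng.gauss(0, 1)                         # unobserved confounder (e.g., ability)
    zi = rng.gauss(0, 1)                        # instrument, independent of u
    xi = 0.8 * zi + u + rng.gauss(0, 1)         # endogenous regressor (peer contribution)
    yi = delta * xi + 2 * u + rng.gauss(0, 1)   # outcome (individual contribution)
    z.append(zi)
    x.append(xi)
    y.append(yi)

ols = cov(x, y) / cov(x, x)  # biased upward here because Cov(x, u) > 0
iv = cov(z, y) / cov(z, x)   # just-identified IV, equal to 2SLS in this case
print(f"OLS: {ols:.2f}, IV: {iv:.2f}, true delta: {delta}")
```

In this simulated design the confounder enters both the regressor and the outcome, so OLS overstates δ while the IV ratio recovers it in large samples, provided the instrument is relevant and excluded as in the assumptions of Section 2.5.2.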
Footnote 56: Control variables include three temporal lags of individual and peer commits to the project, cumulative individual and peer project commits, project quality measured in GitHub stars, quadratic terms for project age, and dummy variables indicating if the individual is a project owner, project member, or if they are affiliated with a firm.

Footnote 57: Since the model is just-identified, we report the heteroskedasticity-robust F statistic proposed by Olea and Pflueger (2013) and recommended by Andrews, Stock, and Sun (2019).

2.5.4 Detailed Analysis and Robustness

Estimates for the population average $\delta$ in Equation (2.1) mask considerable heterogeneity in peer influence on individual-level contribution. To explore heterogeneity and provide additional robustness for the reduced form peer effect estimates, we estimate a series of alternative specifications and present the results in Appendix 2.D. Most notably, we find evidence that although the number of contributors has grown over time, contemporaneous peer effects have diminished over time (see Figure 2.B.8). It is reasonable to suspect that peer effects were stronger in the earlier days of GitHub, as most projects in the sample were still in their infancy. Peer groups were smaller and there were simply fewer developers active on the platform. We also find considerable heterogeneity in peer effects at the project level (see Figure 2.B.7). We interpret this result as heterogeneity in net complementarity of contribution effort that likely varies across projects.⁵⁸ Moving beyond contemporaneous peer effects, we find that peer effects are stronger when regressing individual contribution in the 3 months after a given period $t$ on peer contribution in the 3 months preceding it (see Table 2.A.5). It is likely the case that peer influence takes some time to operate and individuals are induced to contribute on the basis of relatively recent development activity, not necessarily occurring in the same month.
This result is important as it suggests that intensive margin peer effects are likely stronger after relaxing our rather restrictive assumption of contemporaneous influence.⁵⁹ Finally, we consider the effect of contribution by project “insiders” on the level of contribution by project “outsiders” (see Figure 2.B.9), and find evidence that increased contribution from project insiders “crowds out” contribution from outsiders.

Our results together suggest that while contemporaneous intensive margin peer effects in contribution are limited on average for the entire sample, (1) there exists significant heterogeneity in peer effects across time and projects, (2) contemporaneous peer effects may too narrowly restrict the scope of peer influence, and (3) free-ridership is likely prevalent if dominant core contributors and project insiders carry out the bulk of OSS development. We discuss these results in more detail alongside the findings of the structural analysis in Section 2.7.

Footnote 58: Moreover, our data and econometric specification treat each individual commit as an equivalent contribution. In reality, some contributions might be more important than others. Consider the difference between a typo fix and the introduction of a new feature set. Since the specific lines of code changed by each commit can be observed, a worthwhile continuation of this work ought to examine the interaction between specific types of contribution and peer influence. This approach can give context to the heterogeneity observed in peer effect estimates and guide recommendations for OSS sustainability policy.

Footnote 59: We return to this assumption when interpreting our structural estimates of intensive margin peer effects in Section 2.6.5.

2.6 Structural Model

While the reduced form analysis begins to reveal patterns of peer influence in OSS contribution, a more refined approach is needed to operationalize the various channels through which peer contribution can influence equilibrium behavior.
Importantly, the reduced form specification in Equation (2.1) conflates peer influence along both the extensive and intensive margin into a single parameter. Therefore, estimates of contribution peer effects $\delta$ are conditional on agents who contribute positive amounts and do not separately account for why agents decide to contribute to particular projects. A structural approach allows us to rigorously define micro-founded channels for both equilibrium contribution decisions and peer influence.

There are several key features of our structural model. First, we seek to separately identify marginal private benefits and costs of contribution for each agent. Second, we can characterize each agent’s equilibrium contribution decision along both the extensive (i.e., whether to contribute) and intensive (i.e., how much to contribute) margins. Third, we can integrate peer influence into both of these features. Peer effects can potentially influence contribution benefits and productivity as well as intensive and extensive margin contribution in equilibrium. Finally, a fully specified structural approach permits counterfactual analysis. This will allow us to place a value-added estimate on both intensive and extensive margin peer effects in terms of changes to equilibrium contribution. Specifically, we can compare contribution from the observed equilibrium to a counterfactual under which peer effects are absent.⁶⁰

Footnote 60: Or, as if software developers contributed to projects in isolation from one another. We operationalize this by setting intensive or extensive margin peer effects equal to zero.

The remainder of this section is organized as follows. First, we set up the model of OSS contribution and introduce its various elements in Section 2.6.1. Our approach combines a model of private provision of public goods (Bergstrom, Blume, and H. Varian, 1986) with a selection model (Heckman, 1979). Second, we define an equilibrium in Section 2.6.2.
Third, in Section 2.6.3 we specify how peer effects enter into the structural framework. Fourth, we detail our estimation strategy in Section 2.6.4. Fifth, we describe the structural estimates in Section 2.6.5. Finally, we conduct a counterfactual analysis to estimate “value-added” by peer effects in Section 2.6.6.

2.6.1 Setup

Individual agents (i.e., OSS developers) are indexed $i \in \mathcal{N} = \{1, \dots, N\}$. In each period $t \in \mathcal{T} = \{1, \dots, T\}$, agents choose contribution levels $a_{ipt} \geq 0$ across a set of OSS projects indexed $p \in \mathcal{P} = \{1, \dots, P\}$ to maximize incremental contribution utility in each period. To summarize what follows, Table 2.A.8 collects notation for the structural model.

Project Quality

Projects are indexed by their quality $y_{pt}$ at time $t$. We assume project quality $y_{pt}$ is a simple linear function $y$ of cumulative contribution through $t$, $a_{pt} \equiv \{a_{ips}\}_{i \in \mathcal{N},\, s \leq t}$, and parameters $b_{pt}$:

$$ y_{pt} = y(a_{pt}, b_{pt}) = b_{pt} \sum_{i \in \mathcal{N}} \sum_{s \leq t} a_{ips}. \tag{2.3} $$

Note that this specification implies that the parameter $b_{pt}$ represents the marginal product of contribution labor in terms of the quality of project $p$ at time $t$.⁶¹

Footnote 61: While the project quality specification in Equation (2.3) may give rise to concerns over “over-fitting” parameter estimates to the data, we choose this specification purposefully to capture the reality that the marginal product of contribution labor is arguably higher when the project is in early development stages.

Preferences

Agent preferences are styled after the model of private public good provision in Bergstrom, Blume, and H. Varian (1986). We extend this framework to include multiple public goods and time periods. In each period $t$ and for each project $p$, agents derive utility over (1) direct contribution benefits, (2) project quality, and (3) a numéraire consumption good $x_{it}$ (e.g., time).
Specifically,

$$ u_{it} = u(a_{it}, y_t, x_{it}) = \sum_{p \in \mathcal{P}} \left[ v_{ipt} a_{ipt} - \tfrac{1}{2} (a_{ipt})^2 + y_{pt} \right] + x_{it}, \tag{2.4} $$

where $a_{it} \equiv \{a_{ipt}\}_{p \in \mathcal{P}}$ and $y_t \equiv \{y_{pt}\}_{p \in \mathcal{P}}$ respectively collect agent $i$’s contributions and project quality for all $p \in \mathcal{P}$ at time $t$. Following Bergstrom, Blume, and H. Varian (1986), we assume linear preferences: contribution, public good quality, and private good consumption are perfect substitutes.⁶² This simplifies the utility maximization problem into independent choices of optimal contribution between projects, subject only to a budget constraint. Agent preferences are shaped by private contribution benefit shocks $v_{ipt} \in \mathbb{R}$, which partially determine the optimal level of contribution in equilibrium. It is critical to note that a realization of $v_{ipt}$ may be such that the agent decides not to contribute to project $p$ at all. Individual project-specific benefit shocks are similar to the “arrival of needs” model of OSS dynamics at a macro-level in Athey and Ellison (2014).

Footnote 62: More specifically, preferences are quasilinear in $x_{it}$ and therefore increasing an agent’s endowment of the numéraire good does not influence demand for contribution.

Contribution Constraint

We assume that agent contribution $a_{ipt}$ and consumption of the private good $x_{it}$ are constrained by (1) productivity shocks⁶³ $c_{ipt} > 0$ and (2) endowments $\omega_{it}$:

$$ x_{it} + \sum_{p \in \mathcal{P}} c_{ipt} a_{ipt} \leq \omega_{it}. \tag{2.5} $$

In our empirical application, $\omega_{it}$ is the agent’s endowment of time in period $t$ (i.e., 1 month) and the private good $x_{it}$ is the amount of time spent not contributing.⁶⁴ As in the reduced form analysis, we measure $a_{ipt}$ as the number of commits agent $i$ makes to project $p$ at time $t$. This implies that the (inverse) productivity parameters $c_{ipt}$ measure the time cost incurred by agent $i$ making $a_{ipt}$ commits to project $p$.⁶⁵ If $c_{ipt} > c_{jpt}$, agent $j$ is more productive contributing to project $p$ at time $t$. Finally, we naturally normalize $\omega_{it} = 1$ for all $i$ and $t$ given its interpretation.
Since $a_{ipt} \geq 0$, this will in turn imply $0 \leq x_{it} < 1$ and $0 < c_{ipt} < 1$.⁶⁶

Footnote 63: As specified in the contribution constraint in Equation (2.5), $c_{ipt}$ technically represents agent $i$’s “cost” of contribution to project $p$ at time $t$. The inverse of $c_{ipt}$ is therefore a measure of contribution productivity. Throughout the structural analysis, we will refer to $c_{ipt}$ as both “productivity” and “cost” interchangeably.

Footnote 64: In other words, the number of days in month $t$ in which agent $i$ authored no commits.

Footnote 65: When an agent’s endowment $\omega_{it}$ is measured as the number of days in period $t$ and $x_{it}$ is measured as the number of days in the period $i$ was not active on the GitHub platform, the shock $c_{ipt}$ can be interpreted as the number of commits $i$ makes to project $p$ per day $i$ was active on GitHub.

Footnote 66: In general, we only bound productivity shocks such that $c_{ipt} > 0$. However, in the data the smallest value of positive contribution is $a^{\star}_{ipt} = 1$. It can therefore be shown that this normalization also implies $c_{ipt} < 1$ for all $a^{\star}_{ipt} > 0$.

Footnote 67: We acknowledge that this selection mechanism could also be interpreted as a latent benefit threshold. See the discussion of structural estimates of extensive margin peer effects in Section 2.6.5.

Footnote 68: See Hsieh, Konig, et al. (2018) and Hsieh, Konig, et al. (2020) for examples of similar selection mechanisms used in models of public good contribution.

Selection Mechanism

Obviously, agents can elect to contribute nothing to certain projects. We therefore introduce a selection mechanism in the spirit of Heckman (1979). We assume that projects feature fixed costs of contribution, modelled as a latent productivity threshold $z_p$.⁶⁷ ⁶⁸ Agent $i$ will contribute $a^{\star}_{ipt} > 0$ to project $p$ at time $t$ if their private project-specific ability $z_{ipt}$ exceeds $z_p$. Furthermore, we assume that $z_{ipt}$ is a linear function of observables $W_{ipt}$, $z_{ipt} = \gamma' W_{ipt} + \epsilon^{z}_{ipt}$, where $\epsilon^{z}_{ipt} \sim N(0, 1)$.
Therefore, the probability that $i$ contributes to project $p$ in period $t$ is

$$ \Pr(a^{\star}_{ipt} > 0) = \Pr(z_{ipt} \geq z_p) = \Phi(\gamma' W_{ipt}), \tag{2.6} $$

where $\Phi(z)$ is the standard normal cumulative distribution function. In applications, we normalize the contribution threshold $z_p = 0$ for all projects.⁶⁹ The vector $W_{ipt}$ contains a set of characteristics that influence $i$’s decision to contribute to project $p$ at time $t$: the number of peers contributing to project $p$ and both cumulative and lagged contribution for individual $i$ as well as for all agents $j \neq i$ (e.g., historical peer contribution). These factors give important signals to prospective contributors deciding which projects to participate in and will serve as the basis for our extensive margin peer effects discussed in more detail in Section 2.6.3. For example, an established project featuring many active contributors can provide a useful signal to newcomers uncertain about its quality and maturity, who may be more inclined to contribute under a belief that their efforts will go towards a worthwhile endeavor.

2.6.2 Equilibrium

Timing and Information

At the beginning of each period $t$, each agent first learns their extensive margin shock $\epsilon^{z}_{ipt}$ for each project. Next, the set of agents who meet the productivity threshold $z_{ipt} \geq z_p$ and decide to contribute to project $p$ learn their benefit and productivity shocks $(v_{ipt}, c_{ipt})$. We assume that all shocks are public information: agents know who will contribute to which project and how much they will contribute. In the following subsections, we characterize both extensive and intensive margin decisions and the resulting equilibrium.

Footnote 69: To rationalize this normalization, we detail project-specific estimation of $\gamma$ for each $p$ in Section 2.6.3.

Extensive Margin Decision

Following the selection mechanism described in Equation (2.6), agent $i$ will contribute $a^{\star}_{ipt} > 0$ if and only if $z_{ipt} \geq z_p$ upon learning $\epsilon^{z}_{ipt}$.
Otherwise, if an agent does not meet the productivity threshold, they will decide not to contribute to project $p$ at all: $z_{ipt} < z_p \iff a^{\star}_{ipt} = 0$.

Intensive Margin Decision

Agents with $z_{ipt} \geq z_p$ next determine an optimal, positive contribution level $a^{\star}_{ipt} > 0$. Taking marginal private benefit and productivity shocks $(v_{ipt}, c_{ipt})$ as given, each agent $i$ chooses an allocation $(a_{ipt}, y_{pt}, x_{it})$ to maximize incremental utility $u_{it}$:

$$
\begin{aligned}
\max_{a_{ipt} > 0,\; y_{pt},\; x_{it} \in [0,1)} \quad & \sum_{p \in \mathcal{P}} \left[ v_{ipt} a_{ipt} - \tfrac{1}{2} (a_{ipt})^2 + y_{pt} \right] + x_{it} \\
\text{s.t.} \quad & x_{it} + \sum_{p \in \mathcal{P}} c_{ipt} a_{ipt} \leq 1, \\
& y_{pt} = b_{pt} \sum_{j} \sum_{s \leq t} a_{jps}.
\end{aligned}
\tag{2.7}
$$

Under the intensive margin decision characterized by System (2.7), each agent $i$ explicitly takes into account (1) shocks $(v_{ipt}, c_{ipt})$ and (2) cumulative contribution to project $p$. To account for affinities and experience formed in particular projects, we allow an agent’s cumulative and lagged contribution history to influence their benefit and productivity shocks in Section 2.6.3.

To characterize each agent’s intensive margin contribution behavior in equilibrium, we observe that the first order necessary conditions for optimal, non-zero contribution $a^{\star}_{ipt} > 0$ imply

$$ a^{\star}_{ipt} = b_{pt} + v_{ipt} - c_{ipt}. \tag{2.8} $$

In other words, should agent $i$ decide to contribute to project $p$ at time $t$, her optimal level of contribution equals the marginal product of labor in terms of public good quality $b_{pt}$ plus the marginal private benefit of contribution $v_{ipt}$, less the marginal private cost of contribution $c_{ipt}$. All else equal, agents contribute more when either their marginal product of labor or marginal private benefits of contribution are higher and less when the marginal cost of contribution (i.e., inverse productivity) is higher.
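The step from System (2.7) to Equation (2.8) can be sketched as follows. This is a worked illustration rather than a reproduction of the derivation in the appendix, and it rests on two assumptions that are labeled here explicitly: the budget constraint binds at an interior level of $x_{it}$ (consistent with quasilinearity in $x_{it}$), and agent $i$ takes all other agents' contributions as given.

```latex
% Assumption: budget constraint binds, so x_it = 1 - sum_p c_ipt a_ipt.
% Assumption: contributions a_jps of agents j != i are taken as given.
\begin{align*}
u_{it} &= \sum_{p \in \mathcal{P}} \Big[ v_{ipt} a_{ipt} - \tfrac{1}{2} (a_{ipt})^2
          + b_{pt} \sum_{j} \sum_{s \leq t} a_{jps} \Big]
          + 1 - \sum_{p \in \mathcal{P}} c_{ipt} a_{ipt}, \\
\frac{\partial u_{it}}{\partial a_{ipt}}
       &= v_{ipt} - a_{ipt} + b_{pt} - c_{ipt} = 0
          \quad \Longrightarrow \quad
          a^{\star}_{ipt} = b_{pt} + v_{ipt} - c_{ipt}, \\
\frac{\partial^2 u_{it}}{\partial a_{ipt}^2} &= -1 < 0 .
\end{align*}
```

Because the cross-partial derivatives between projects are zero under these assumptions, the choice for each project separates from the others. As a purely hypothetical numerical example, $b_{pt} = 0.5$, $v_{ipt} = 3$, and $c_{ipt} = 0.25$ would give $a^{\star}_{ipt} = 3.25$.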
⁷⁰ Combining the optimal intensive margin choice of contribution in Equation (2.8) and the extensive margin decision (i.e., selection mechanism) in Equation (2.6), a given agent $i$’s equilibrium contribution strategy for project $p$ at period $t$ can be summarized as

$$
a^{\star}_{ipt} =
\begin{cases}
b_{pt} + v_{ipt} - c_{ipt} & \text{if } \gamma' W_{ipt} \geq \epsilon^{z}_{ipt}, \\
0 & \text{if } \gamma' W_{ipt} < \epsilon^{z}_{ipt}.
\end{cases}
\tag{2.9}
$$

Footnote 70: Equation (2.8) is a linear form of the optimal public good contribution level derived by Bergstrom, Blume, and H. Varian (1986), reflecting that private public good contribution is driven by heterogeneity in both benefits and costs.

2.6.3 Peer Effects

We allow peers to influence equilibrium contribution decisions along both the extensive and intensive margins for equilibrium contribution behavior described in Equation (2.9). To disentangle these margins, we will assume separate channels of influence for each mechanism. Historical peer contribution will form the basis for peer effects along the extensive margin. Conditional on the set of agents who contribute a strictly positive level, correlation between the realized benefit and productivity shocks, $(v_{ipt}, c_{ipt})$, of an individual and her peers will form the basis for peer effects for the intensive margin choice. We formalize these peer effect channels in the following two subsections.

Extensive Margin

To integrate peer influence into the extensive margin contribution decision, we disaggregate influences over agent $i$’s latent ability threshold for project $p$ at period $t$, $z_{ipt} = \gamma' W_{ipt} + \epsilon^{z}_{ipt}$, into characteristics specific to $i$ or project $p$, $\beta_z' X_{ipt}$ (individual controls), and those related to peers $j \neq i$, $\gamma' W_{ipt}$ (peer influences).⁷¹ Specifically, we include (1) the number of agents contributing to project $p$ in period $t - 1$ as well as (2) cumulative and lagged peer contribution to project $p$ in the vector $W_{ipt}$.
This is designed to capture the fact that past contribution to OSS projects by peers is a public information good itself and may lay the foundation for subsequent contribution.⁷² On the other hand, agents may also choose to free-ride should cumulative project contribution reach a particular level. The vector $X_{ipt}$ contains measures such as individual $i$’s cumulative and lagged contribution, and therefore accounts for $i$’s own accumulated experience with project $p$. Furthermore, we allow parameter vectors $\gamma$ and $\beta_z$ to vary by project and period. In addition to simplifying estimation,⁷³ estimating separate parameters for each project implicitly accounts for project-varying characteristics that may influence selection beyond contribution history.⁷⁴ For each project, the selection mechanism in Equation (2.6) becomes

$$ \Pr(a^{\star}_{ipt} > 0) = \Phi(\gamma' W_{ipt} + \beta_z' X_{ipt}). \tag{2.10} $$

The parameter vector $\gamma$ captures project-specific peer effects along the extensive margin as a function of historical peer contribution activity. If $\gamma > 0$, the likelihood of contribution is increasing in past peer contribution $W_{ipt}$.⁷⁵

Footnote 71: With a slight abuse of notation for simplicity.

Footnote 72: The influence of past actions by peers is also considered in a similar fashion by Bollinger and Gillingham (2012), who use cumulative solar panel installations in a neighborhood to predict current period adoptions.

Footnote 73: Estimating an analogous model $\Pr(a^{\star}_{ipt} > 0) = \Phi(\gamma' W_{ipt} + \beta_z' X_{ipt})$ with a single coefficient $\gamma$ would entail a single regression with $N \cdot P \cdot T = 107{,}250 \cdot 2{,}287 \cdot 134 = 32{,}867{,}620{,}500$ observations at the individual level.

Footnote 74: This amounts to including a distinct constant term in each $N$-length vector $X_{ipt}$ for each $p$.

Intensive Margin

For each $a^{\star}_{ipt} > 0$, we can separately recover the shocks $v_{ipt}$ and $c_{ipt}$ by using the equilibrium contribution level in Equation (2.8), the budget constraint in Equation (2.5), and the project quality function in Equation (2.3).
⁷⁶ Therefore, we can develop a framework for assessing contemporaneous peer influence for both individual private benefits and productivity along the intensive margin, conditional on the set of agents with strictly positive contribution levels. In the context of our model, this can be measured by the degree to which shocks $(v_{ipt}, c_{ipt})$ are correlated between peers in a given project and period.

We separate peer effects in contribution productivity, $c_{ipt}$, from peer effects in private contribution benefits, $v_{ipt}$, by using distinct peer effect specifications similar in structure to the reduced form peer effects specification in Equation (2.1). First, we assume that agent productivity is at least partially determined by peer effects:

$$ c_{ipt} = \delta_c c_{-ipt} + \beta_c' X_{ipt} + \epsilon^{c}_{ipt}, \tag{2.11} $$

where $c_{-ipt} \equiv \frac{1}{n_{pt} - 1} \sum_{j \neq i} \mathbb{1}\{a^{\star}_{jpt} > 0\}\, c_{jpt}$ and $n_{pt} \equiv \sum_{i \in \mathcal{N}} \mathbb{1}\{a^{\star}_{ipt} > 0\}$ define the mean of productivity shocks for $i$’s contemporaneous peers in project $p$.⁷⁷ Like the extensive margin specification in Equation (2.10), $X_{ipt}$ is a vector of observables and fixed effects such as lagged and cumulative contribution. Conditional on covariates, $\delta_c$ captures the average correlation in productivity shocks amongst peers for a given project and time period. When $\delta_c < 0$, individual costs of contribution are negatively correlated with peer costs, suggesting positive peer effects in terms of productivity.

Similarly, private benefit shocks (e.g., private “needs” or returns to contribution) are modelled as follows:

$$ v_{ipt} = \delta_v v_{-ipt} + \beta_v' X_{ipt} + \epsilon^{v}_{ipt}. \tag{2.12} $$

When $\delta_v > 0$, individual private benefits are positively correlated with those of their peers, suggesting pro-social peer effects.

Footnote 75: Notice that $\partial \Pr(a^{\star}_{ipt} > 0) / \partial W_{ipt} = \gamma\, \phi(\cdot) > 0$.

Footnote 76: Estimation is covered in detail in Section 2.6.4 as well as Section 2.E of the appendix.

Footnote 77: When there is only a single agent contributing to a project, $c_{-ipt} = 0$.

To summarize, extensive margin peer effects are parameterized by $\gamma$.
Conditional on the set of agents who do contribute, intensive margin (contemporaneous) peer effects are parameterized by $(\delta_c, \delta_v)$.⁷⁸

The framework for extensive and intensive margin peer effects in this structural approach has several desirable properties. First, we model each margin independently, allowing us to estimate them separately. The source of extensive margin effects is historical peer contribution and the source for intensive margin effects is contemporaneous correlation with contributing peers. Second, given that we observe each agent’s extensive margin decision for every project and period, we can estimate $\gamma$ separately for each $p$. Motivated in part by the considerable project-level heterogeneity revealed in the reduced form analysis, this parameterization is more flexible than estimating a single parameter and can account for a range of project-varying extensive margin influences.⁷⁹ Finally, we use the OSS contributor’s time-constrained utility maximization problem to separately recover benefit and productivity shocks. Unpacking net benefits allows us to further isolate the channels of peer influence in intensive margin contribution. In the next section, we turn our attention to estimating parameters of interest.

Footnote 78: In a more general sense, elements in the vectors $(\beta_v, \beta_c)$ may include terms related to historical (i.e., lagged or cumulative) peer contribution that, similar to the extensive margin parameterization, may also plausibly influence positive contribution levels. With respect to the counterfactual analysis in Section 2.6.6, we are more broadly interested in parameters related to both types of peer influence, contemporaneous and accumulated, on private benefits and productivity.

Footnote 79: Note that given data limitations, we cannot estimate intensive margin peer effects $(\delta_v, \delta_c)$ separately for each project and period. In many cases, our empirical sample contains only a single contribution for the month to a given project.
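The peer regressor in Equation (2.11) is a leave-one-out mean over contemporaneous contributors, which is easy to get subtly wrong. The following is a minimal sketch for a single project and period; the function name `leave_one_out_peer_mean` and the toy inputs are hypothetical. Returning NaN for agents who do not contribute is an assumption made here, on the grounds that the intensive margin specification is conditional on positive contribution.

```python
import numpy as np


def leave_one_out_peer_mean(shocks, contributes):
    """Mean shock among an agent's contributing peers in one project-period.

    shocks: shock values (e.g., c_jpt) for every agent.
    contributes: boolean flags, True where a*_jpt > 0.
    Returns NaN for non-contributors (an assumption of this sketch) and 0
    for a lone contributor, who has no contemporaneous peers.
    """
    shocks = np.asarray(shocks, dtype=float)
    active = np.asarray(contributes, dtype=bool)
    n_active = int(active.sum())

    peer_mean = np.full(shocks.shape, np.nan)
    if n_active == 1:
        peer_mean[active] = 0.0
    elif n_active > 1:
        total = shocks[active].sum()
        # Subtract the agent's own shock before averaging over the others.
        peer_mean[active] = (total - shocks[active]) / (n_active - 1)
    return peer_mean


example = leave_one_out_peer_mean([0.2, 0.4, 0.6, 0.9], [True, True, True, False])
solo = leave_one_out_peer_mean([0.3], [True])
print(example, solo)
```

The same function applies unchanged to benefit shocks $v_{jpt}$ for the peer mean in Equation (2.12).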
2.6.4 Estimation

In this section, we provide a high-level overview of our structural estimation strategy and objectives. A more thorough and detailed treatment is provided in Section 2.E of the appendix. Given data $(a_{ipt}, y_{pt}, x_{it})$ for all $i \in \mathcal{N}$, $p \in \mathcal{P}$, and $t \in \mathcal{T}$, we develop an estimation strategy to recover the following:

1. Marginal product of labor parameters $b = (b_{pt})$ from the project quality function in Equation (2.3).
2. Private benefit and productivity shocks $s = (v_{ipt}, c_{ipt})$ for all $a^{\star}_{ipt} > 0$ from the equilibrium contribution level in Equation (2.8).
3. (Extensive margin peer effects) Parameters $(\gamma, \beta_z)$ from Equation (2.10).
4. (Intensive margin peer effects) Parameters $(\delta_c, \delta_v, \beta_c, \beta_v)$ from Equations (2.11) and (2.12).

The parameters of interest are $\delta = (\delta_c, \delta_v)$, which drive intensive margin peer effects, and $\gamma$, which drives extensive margin peer effects. For each project $p \in \mathcal{P}$, our estimation strategy is as follows:

1. Assume disturbances are jointly normally distributed, $(\epsilon^{z}_{ipt}, \epsilon^{v}_{ipt}, \epsilon^{c}_{ipt}) \sim N(0, \Sigma)$, independent and identically distributed across agents and time. Within the variance-covariance matrix $\Sigma$, assume that $\sigma^2_z = 1$.
2. Given data $(a_{ipt}, y_{pt})$, recover $b_{pt}$ using Equation (2.3).
3. Given data $(a_{ipt}, y_{pt}, x_{it})$ and $b_{pt}$, recover shocks $(v_{ipt}, c_{ipt})$ using Equations (2.9), (2.5), and (2.3) by way of generalized method of moments (GMM) estimation.⁸⁰
4. Given data $(\mathbb{1}\{a_{ipt} > 0\}, W_{ipt}, X_{ipt})$ and shocks $(v_{ipt}, c_{ipt})$, recover $(\gamma, \delta, \beta, \Sigma)$, where $\delta = (\delta_v, \delta_c)$ and $\beta = (\beta_z, \beta_v, \beta_c)$, via maximum likelihood estimation (MLE) (J. Zhao, H.-J. Kim, and H.-M. Kim, 2020).

Footnote 80: For each $i$ and $t$, there are $2P$ unknowns: $v_{ipt}$ and $c_{ipt}$ for each $a_{ipt} > 0$. There are $P$ first order conditions from Equation (2.9), $P$ equations for project quality from Equation (2.3), and one budget constraint. Overall, this implies $NT(2P + 1)$ moment conditions and $2NPT$ unknowns.
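Step 2 of the project-level strategy above can be read as a direct inversion of Equation (2.3): dividing observed quality by cumulative commits through period $t$. The sketch below takes that reading for a single project. It is an assumption about how the step could be carried out, not the procedure documented in Section 2.E of the appendix, and the function name `recover_marginal_product` and the toy numbers are hypothetical.

```python
import numpy as np


def recover_marginal_product(commits, quality):
    """Invert y_pt = b_pt * (cumulative commits through t) for one project.

    commits: array of shape (N, T) with commits a_ipt by agent i in period t.
    quality: array of shape (T,) with observed project quality y_pt.
    Returns b_pt of shape (T,), NaN in periods before any commit exists.
    """
    commits = np.asarray(commits, dtype=float)
    quality = np.asarray(quality, dtype=float)

    # Sum over agents within each period, then accumulate over time.
    cumulative = commits.sum(axis=0).cumsum()

    b = np.full(quality.shape, np.nan)
    has_commits = cumulative > 0
    b[has_commits] = quality[has_commits] / cumulative[has_commits]
    return b


# Two agents, three periods (hypothetical numbers).
commits = np.array([[1, 0, 2],
                    [0, 3, 1]])
quality = np.array([10.0, 20.0, 21.0])
b_pt = recover_marginal_product(commits, quality)
print(b_pt)
```

In this toy case quality grows more slowly than cumulative commits, so the recovered $b_{pt}$ declines over the project's life.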
Parameters $\theta = (b, \gamma, \delta, \beta, \Sigma)$ allow us to completely characterize the data generating process for the structural model, a necessary prerequisite for simulating policy counterfactuals.

2.6.5 Structural Estimates

Benefit and Productivity Shocks

We present the recovered values for marginal product of labor parameters $b_{pt}$ and shocks $(v_{ipt}, c_{ipt})$ for all $a^{\star}_{ipt} > 0$ in Figure 2.B.10. The first panel of Figure 2.B.10 contains distributions of marginal private benefit shocks $v_{ipt}$ grouped by year. Similarly, productivity (inverse marginal cost) shocks are presented in the second panel. Several patterns emerge from the recovered shocks. First, these distributions are relatively stable over time. Second, when considering the entire sample, benefit and productivity shocks are relatively uncorrelated with one another at the individual level ($\operatorname{Corr}(v_{ipt}, c_{ipt}) = -0.081$). There is, however, evidence of a temporal trend in shock correlation over the sample period: Figure 2.B.11 reveals that benefits $v_{ipt}$ demonstrate a strong negative correlation with productivity $c_{ipt}$ ($\operatorname{Corr}(v_{ipt}, c_{ipt}) \approx -0.6$ to $-0.5$) in early periods of GitHub that trends towards 0 nearer the end of the sample period. Recall that $\operatorname{Cov}(v_{ipt}, c_{ipt}) < 0$ implies that greater marginal private benefits are associated with lower private marginal costs of contribution. Together, these data seem to suggest that early stages of GitHub OSS collaboration featured highly productive individuals with greater net benefits of contribution relative to later entrants. In later periods, incentives to become more productive may be weaker given greater peer participation. Corroborating the findings of the reduced form analysis, this structural evidence further supports the notion that the prevalence of free-ridership has likely increased on average as the GitHub platform has grown in size.

The third panel of Figure 2.B.10 contains estimates of marginal product of labor parameters $b_{pt}$ from Equation (2.3).
By virtue of the functional form assumption for project quality, $b_{pt}$ tends to be largest in the early stages of project development: the initial commits tend to be the most important in determining project quality. Since $b_{pt}$ tends to decline over a project’s lifespan, productivity and benefit shocks explain sustained contribution.

Extensive Margin Peer Effects

Figure 2.B.12 contains estimates for extensive margin peer effects captured by the parameter $\gamma$ of Equation (2.10). Two key patterns emerge. First, the likelihood of contribution is increasing in the number of peers who contributed in the previous period while decreasing in lagged and cumulative contribution levels. Second, the coefficients for lagged number of peers are much larger in magnitude compared with lagged and cumulative contribution. Taken together, these estimates underscore an intuitive if not trivial fact: agents are more likely to join projects growing in the number of contributors. To a lesser extent, the likelihood of contribution declines as projects grow larger in terms of the size of the codebase. We can interpret this finding in several ways. On one hand, actively developed projects provide positive peer effects that incentivize contribution from outsiders. On the other, it may simply be the case that increased development activity in the early stages of a project signals a project’s promise or quality to prospective contributors. To rule out this signalling mechanism, we estimate extensive margin peer effects at the project level and control for observable project quality. Moreover, it appears that contribution incentives lessen as a project matures into a stable state,⁸¹ as it is likely that less contribution is required.

Footnote 81: This phase of project development is sometimes referred to as “maintenance mode” as opposed to “active development”.

In Equation (2.6) of Section 2.6.1, we model extensive margin selection into projects as a latent productivity threshold.
We acknowledge that the largest driver of project participation, the number of peers contributing, can influence both benefits and costs of contribution. Given that $z_{ipt}$ is unobserved and a function of both individual and peer historical contribution, we could just as easily have modelled $z_p$ as a latent benefit threshold for project $p$. At best, we can only say our structural approach finds evidence that projects with many actively contributing members increase an individual’s net benefit of contribution and therefore positively impact extensive margin participation.

Intensive Margin Peer Effects

Project-level estimates of the intensive margin peer effects $\delta_v$ and $\delta_c$ are summarized in Figure 2.B.13. Much like the project-level reduced form estimates displayed in Figure 2.B.7, both benefit and productivity peer effects are distributed relatively symmetrically around 0. A relatively strong positive correlation between $\delta_v$ and $\delta_c$, $\operatorname{Corr}(\delta_v, \delta_c) = 0.843$, implies that greater benefit correlation between peers within projects is also associated with greater marginal cost correlation between peers. Ultimately, this suggests an inverse relation between benefit and productivity shock correlation: the net effect of peer influence along the intensive margin leads developers to contribute more at greater marginal cost. The lack of correlation between $v_{ipt}$ and $c_{ipt}$ at the individual level further supports this finding. We interpret this positive correlation between $\delta_v$ and $\delta_c$ as evidence that pro-social peer effects dominate productivity peer effects. Consistent with the reduced form analysis, there is no strong evidence that contemporaneous peer effects improve intensive margin productivity across projects on average. In other words, we cannot say that OSS contributors make each other more productive along the intensive margin when considering contemporaneous influence.
Summary

To summarize, structural estimation of benefit and cost shocks along with extensive and intensive margin peer effects seems to corroborate evidence from our reduced form approach and descriptive statistics from the empirical sample. First, extensive margin peer effects are a much more important driver of project growth relative to intensive margin effects. Consistent with the “casual contributor” phenomenon described anecdotally by OSS maintainers, projects with many contributors are more likely to attract incremental contributions from outsiders than they are to attract dedicated maintainers. Second, in terms of the ratio of private contribution benefits to costs, early OSS contributors on the GitHub platform enjoyed greater net benefits of contribution relative to later entrants. Finally, pro-social forces seem to trump peer effects with respect to intensive margin productivity. There is little evidence to suggest peers reduce marginal costs of contribution along the intensive margin.

It is important to note that these results and their subsequent interpretation rest on some assumptions made in our modelling approach. First, as in the reduced form analysis, we impose the restrictive assumption that intensive margin peer influence operates contemporaneously. As shown in Section 2.5.4, relaxing this assumption will likely lead to larger estimates of peer effects along this margin. Second, the functional form assumptions made in our structural approach may simplify estimation at the expense of some flexibility. Specifically, Equations (2.3) (project quality) and (2.4) (agent preferences) omit certain terms such that benefit and productivity shocks can be point identified. These assumptions may bias our parameter estimates away from their true values. Subsequent work would do well to relax these assumptions through additional structure, additional data, or a more flexible estimation strategy.
2.6.6 Counterfactual Analysis

Value of Peer Effects

While the presence of positive[82] peer effects precludes a socially optimal level of contribution under private provision, they may increase equilibrium contribution beyond what would be provided in a world without peer influence. In this sense, peer effects have the potential to effectively subsidize the cost of private provision. Indeed, the preliminary analysis of the structural estimates in the preceding subsection gives reason to believe that peer behavior can drive preference and cost heterogeneity along both the extensive and intensive margins, albeit to differing degrees. To gauge the “value-added” by highly nuanced peer effects in terms of aggregate contribution labor, we use the estimates of the structural model to derive a counterfactual equilibrium in which peer effects are absent.

We consider the following policy counterfactual: suppose peer effects do not exist. In other words, past peer contribution does not influence an individual’s likelihood of contribution, and private benefit and productivity shocks are uncorrelated for individuals who decide to contribute. This scenario roughly corresponds to “siloed” development: agents independently contribute to a public good but do so without interacting with peers or observing the contribution levels of peers. What is the resulting level of contribution? Specifically, we begin by setting the extensive and intensive margin peer effect parameters to zero: $\gamma = \delta = 0$. We then use the remaining parameter estimates $(\beta, \Sigma, b_{pt})$ to re-simulate the data generating process described by the structural model for the entire sample period.[83] To compare the relative impact of extensive and intensive margin effects, we also simulate a counterfactual under which extensive margin peer effects exist while only intensive margin peer effects are absent.
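The logic of "switch the peer effect off and re-simulate" can be illustrated with a deliberately small toy model. The sketch below is not the structural model of this chapter: it assumes a single project, a probit participation rule whose index is `beta0 + gamma * (share of agents active last period)`, and a fixed number of commits per active agent. All names and parameter values (`simulate_total_commits`, `beta0`, `gamma`, `commits_if_active`) are hypothetical and chosen only for illustration. Both scenarios reuse the same random draws so that the comparison isolates the effect of `gamma`.

```python
import math
import random


def normal_cdf(x: float) -> float:
    """Standard normal distribution function Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def simulate_total_commits(gamma: float, beta0: float = -1.0,
                           n_agents: int = 200, n_periods: int = 24,
                           commits_if_active: int = 3, seed: int = 0) -> int:
    """Toy extensive-margin simulation (illustrative assumptions only).

    Each period, every agent participates with probability
    Phi(beta0 + gamma * share_active_last_period) and, if active,
    contributes a fixed number of commits.
    """
    rng = random.Random(seed)
    active_share = 0.0
    total = 0
    for _ in range(n_periods):
        p_join = normal_cdf(beta0 + gamma * active_share)
        # One uniform draw per agent in a fixed order, so runs that share
        # a seed also share their underlying random draws.
        n_active = sum(1 for _ in range(n_agents) if rng.random() < p_join)
        total += n_active * commits_if_active
        active_share = n_active / n_agents
    return total


with_peer_effects = simulate_total_commits(gamma=2.0)
without_peer_effects = simulate_total_commits(gamma=0.0)  # counterfactual
relative_shortfall = 1.0 - without_peer_effects / with_peer_effects
print(f"with peers: {with_peer_effects}, without: {without_peer_effects}, "
      f"shortfall: {relative_shortfall:.1%}")
```

Because the participation probability with `gamma > 0` is never below the probability with `gamma = 0`, and the draws are shared, the counterfactual total cannot exceed the total with peer effects in this toy setting.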
[82] Note that in the canonical public goods model of Samuelson (1954), negative peer effects (e.g., a congestion externality) could potentially offset the classic positive externality that drives free-riding and under-contribution relative to the social optimum.
[83] We use data $(a_{ipt}, y_{pt}, x_{it})$ and recovered shocks $(v_{ipt}, c_{ipt})$ to “seed” initial conditions for period $t$ = 2008–04–01.

The results of these counterfactuals, in terms of aggregate contribution across all projects, are summarized alongside the observed data in Figure 2.B.14. Two key patterns emerge. First, the counterfactual without peer effects results in aggregate contribution approximately 55.6% lower than in the observed equilibrium. By June 2019, aggregate contribution across all projects in the observed sample totals in excess of 5.519 million commits. Under the counterfactual scenario with no peer effects, aggregate contribution is reduced to approximately 2.452 million commits. If contributors make 2–3 commits per labor hour on average, a back-of-the-envelope calculation for this shortfall of 3.067 million commits implies a loss of 1–1.5 million OSS labor hours relative to the observed equilibrium. At the median hourly wage of USD 52.95 for software developers in the U.S., this puts the value added by OSS peer effects in our sample at USD 54.132 million to USD 81.199 million. Second, extensive margin peer effects constitute the overwhelming majority of the value added. Figure 2.B.14 shows that the counterfactual scenario in which only extensive margin peer effects exist (i.e., $\delta = 0$) closely matches the observed equilibrium under which both extensive and intensive margin effects are active. As discussed in Sections 2.5.4 and 2.6.5, the diminished role of intensive margin effects may be a result of our narrowly tailored definition of peer influence.

2.7 Discussion

Using the context of OSS, we have studied the influence of peer effects on the private provision of public goods in detail.
We use both reduced form and structural approaches to (1) address non-random selection, (2) distinctly model intensive and extensive margin peer effects, and (3) disentangle marginal private benefits and costs of contribution as distinct channels of influence. Our findings in both approaches are consistent with anecdotal evidence: OSS project growth is largely driven by some combination of dedicated large-share core contributors and the arrival of many small-share contributors. We find little evidence that peers make each other more productive on average: contemporaneous intensive margin peer effects are heterogeneous across projects but do appear to have been larger in the early days of GitHub. Moreover, structural estimates suggest that the average effect of peer influence is that agents contribute more when their peers do, but at greater marginal cost. Our counterfactual analysis seeks to estimate the value added by peer effects in terms of private public good provision. Driven almost exclusively by extensive margin peer effects, cumulative contribution is approximately 56% lower under the scenario in which peer effects are not present.

These findings highlight some limits on the potential of peer effects to foster the production of public information goods. We find that while extensive margin effects can drive a significant share of contribution, these effects are decreasing in the size of the peer group. This may arise if small-share contributors either (1) free-ride on the efforts of dominant core contributors or (2) become less likely to contribute once a project matures in size. Moreover, the lack of strong, positive peer influence on contribution productivity along the intensive margin suggests that any strategic complementarity or substitution in contribution may simply offset on net.
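The back-of-the-envelope valuation reported in Section 2.6.6 can be retraced with a few lines of arithmetic. The sketch below uses only figures stated in the text (5.519 million observed commits, 2.452 million counterfactual commits, 2–3 commits per labor hour, and a median wage of USD 52.95); the variable and function names are illustrative.

```python
# Figures reported in Section 2.6.6.
observed_commits = 5.519e6        # observed aggregate commits by June 2019
counterfactual_commits = 2.452e6  # aggregate commits with no peer effects
median_hourly_wage = 52.95        # USD, U.S. software developers

shortfall = observed_commits - counterfactual_commits
relative_drop = shortfall / observed_commits


def labor_value(commits: float, commits_per_hour: float, wage: float):
    """Convert a number of commits into labor hours and a dollar value."""
    hours = commits / commits_per_hour
    return hours, hours * wage


# 3 commits per hour gives the lower bound, 2 commits per hour the upper bound.
hours_low, value_low = labor_value(shortfall, 3, median_hourly_wage)
hours_high, value_high = labor_value(shortfall, 2, median_hourly_wage)

print(f"shortfall: {shortfall / 1e6:.3f} million commits ({relative_drop:.1%})")
print(f"labor hours: {hours_low / 1e6:.2f} to {hours_high / 1e6:.2f} million")
print(f"value: USD {value_low / 1e6:.1f} to {value_high / 1e6:.1f} million")
```

The bounds are sensitive mainly to the assumed commits-per-hour range: halving productivity doubles the implied labor hours and therefore the implied value.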
Compared with previous studies that document strong pro-social effects in collaboratively produced public goods (X. M. Zhang and Zhu, 2011; Slivko, 2014), peer effects in the production of more complex information goods like OSS may be significantly more nuanced.

As noted above, a key takeaway of this study is that agents differ in their willingness to contribute their labor towards the sustained maintenance of OSS used by a larger community. The extent to which peer effects matter for sustaining the quality of OSS public goods likely depends on both (1) the project’s use valuation from the wider community and (2) the project’s position as a component of OSS infrastructure (Eghbal, 2016).[84] Whereas the study of peer effects in this chapter focuses purely on the production side of public goods, a promising direction for future research is to explore the welfare implications of the behavioral patterns uncovered thus far.[85] Better characterizations of optimal contribution patterns that consider the wider set of beneficiaries of OSS quality would allow researchers to better discern the extent to which peer influences on collaboration truly matter. Such efforts can continue to place the economic significance of peer effects, externalities, and public good production into context for OSS.

[84] For example, if OSS production fits a combinatoric production model in which developers make small, specialized contributions and move on to other projects, peer effects may be of less importance to delivering an efficient equilibrium. On the other hand, network externalities might amplify the importance of maintenance labor and positive peer effects. Consider the example of OpenSSL and the Heartbleed Bug. OSS infrastructure that is widely depended upon but maintained by a small group could likely benefit from additional contribution labor that can at least partially be generated through peer effects.
[85] In other words, subsequent studies would do well to distinguish between the welfare implications of production versus sustained maintenance for complex public information goods like OSS.

Appendices

2.A Tables

Table 2.A.1: Descriptive Statistics – Primary Measures in Empirical Sample

| Measure | Notation | Count | Mean | SD | Min | Median | Max |
|---|---|---|---|---|---|---|---|
| Project commits (total) | $a_p$ | 2,287 | 2,490 | 7,720 | 23 | 825 | 188,292 |
| Individual commits (total) | $a_i$ | 107,921 | 53 | 669 | 1 | 2 | 186,464 |
| Project commits (monthly) | $a_{pt}$ | 96,453 | 59 | 294 | 1 | 16 | 73,161 |
| Individual commits (monthly) | $a_{it}$ | 421,879 | 13 | 129 | 1 | 3 | 73,145 |
| Cumulative individual commits (monthly) | $\tilde{a}_{it}$ | 421,879 | 278 | 1,109 | 1 | 28 | 186,464 |
| Cumulative project commits (monthly) | $\tilde{a}_{pt}$ | 96,453 | 1,989 | 6,295 | 1 | 532 | 188,292 |
| Individual commits (project-month) | $a_{ipt}$ | 440,111 | 13 | 126 | 1 | 3 | 73,145 |
| Peer commits (project-month) | $a_{-ipt}$ | 440,111 | 188 | 398 | 0 | 59 | 73,160 |
| Cumulative individual commits (project-month) | $\tilde{a}_{ipt}$ | 440,111 | 256 | 1,076 | 1 | 23 | 186,447 |
| Cumulative peer commits (project-month) | $\tilde{a}_{-ipt}$ | 440,111 | 2,096 | 6,630 | 0 | 262 | 124,932 |
| Number of peers (project-month) | $n_{ipt}$ | 440,111 | 17 | 29 | 0 | 7 | 310 |
| Cumulative GitHub Stars (project-month) | $y_{pt}$ | 96,294 | 910 | 2,924 | 0 | 161 | 81,817 |
| GitHub active days (monthly) | $gad_{it}$ | 411,427 | 3.88 | 4.67 | 1 | 2 | 31 |

Table 2.A.2: Reduced Form – Individual Level Peer Effects Estimates (Baseline Estimates for peer effect $\delta$ from Equation (2.1))

Dependent variable: Individual Commits. Standard errors in parentheses.

| | OLS (1) | OLS (2) | OLS (3) | IV 2SLS (4) | IV 2SLS (5) | IV 2SLS (6) |
|---|---|---|---|---|---|---|
| Peer Commits | 0.0078 (0.0011) | 0.0065 (0.0018) | 0.0035 (0.0034) | -0.0035 (0.0015) | 0.0089 (0.0037) | 0.0102 (0.0251) |
| Individual Commits (cumulative) | - | 0.0332 (0.0134) | 0.0529 (0.0434) | - | 0.0333 (0.0134) | 0.0529 (0.0435) |
| Individual Commits (previous month) | - | 0.2604 (0.1763) | 0.1853 (0.0902) | - | 0.2604 (0.1763) | 0.1853 (0.0902) |
| Peer Commits (cumulative) | - | -0.0020 (0.0009) | -0.0024 (0.0019) | - | -0.0020 (0.0009) | -0.0024 (0.0019) |
| Peer Commits (previous month) | - | 0.0051 (0.0022) | 0.0036 (0.0032) | - | 0.0044 (0.0024) | 0.0039 (0.0046) |
| Peer Group Size | - | 0.0017 (0.0662) | 0.0898 (0.1457) | - | -0.0093 (0.0553) | 0.0158 (0.3760) |
| Controls | No | Yes | Yes | No | Yes | Yes |
| Fixed Effects | No | No | Yes | No | No | Yes |
| N | 440,111 | 433,867 | 433,867 | 436,287 | 433,867 | 433,867 |
| $R^2$ | 0.0006 | 0.1802 | 0.2268 | -0.0007 | 0.1802 | 0.2267 |
| First stage F statistic | | | | 6,520 | 1,151 | 64.37 |

Note: Columns (1)–(6) present the coefficient estimate $\hat{\delta}$ from Equation (2.1), in which individual commits are regressed on aggregate peer commits. Standard errors appear in parentheses beside the coefficient estimate. Columns (1), (2), (4), and (5) use heteroskedasticity-robust standard errors, while Columns (3) and (6) cluster standard errors by project. Columns (4) through (6) additionally report the cluster-robust F-statistic from the first stage of the two-stage least squares procedure. Control variables include three lags of individual and peer commits, cumulative individual and peer commits, project quality measured in GitHub stars, quadratic terms for project age and peer group size, and dummy variables indicating whether the individual is a project owner, a project member, or affiliated with a firm. Fixed effects included are individual, project, and year-month.
Table 2.A.3: Reduced Form – Individual Level Peer Effects (Estimates of peer effect $\delta$ from Equation (2.1) with covariate interaction terms)

Standard errors in parentheses.

| | OLS (1) | OLS (2) | OLS (3) | IV 2SLS (4) | IV 2SLS (5) | IV 2SLS (6) |
|---|---|---|---|---|---|---|
| Peer Commits | 0.0078 (0.0011) | 0.0065 (0.0020) | 0.0167 (0.0105) | -0.0035 (0.0015) | 0.0497 (0.0354) | -0.0721 (0.1562) |
| Peer Commits × Peer Group Size | - | -1.47e-5 (2.59e-5) | -0.0002 (0.0001) | - | -0.0002 (0.0001) | 0.0003 (0.0009) |
| Peer Commits × Lagged Individual Commits | - | 0.0004 (0.0001) | 0.0004 (0.0002) | - | 0.0004 (0.0001) | 0.0004 (0.0002) |
| Peer Commits × Lagged Peer Commits | - | -3.09e-6 (1.54e-6) | -2.32e-6 (1.13e-6) | - | -2.54e-6 (1.16e-6) | -2.6e-6 (1.5e-6) |
| Peer Commits × Cumulative Individual Commits | - | -3.82e-5 (1.86e-5) | -4.9e-5 (3.39e-5) | - | -3.74e-5 (1.81e-5) | -4.89e-5 (3.36e-5) |
| Peer Commits × Cumulative Peer Commits | - | 1.81e-6 (1.07e-6) | 1.93e-6 (6.3e-7) | - | 1.83e-6 (1.08e-6) | 2.03e-6 (7.57e-7) |
| Peer Commits × Project Quality | - | 6.18e-8 (1.87e-8) | 2.82e-7 (3.5e-7) | - | 5.24e-8 (2.42e-8) | -3.03e-7 (9.98e-7) |
| Peer Commits × Project Owner | - | 0.0451 (0.0753) | 0.3513 (0.2380) | - | 0.0329 (0.0848) | 0.3732 (0.2455) |
| Peer Commits × Project Member | - | 0.0093 (0.0065) | 0.0070 (0.0066) | - | 0.0001 (0.0039) | 0.0311 (0.0412) |
| Peer Commits × Project Age | - | -3.55e-6 (1.54e-6) | 7.89e-6 (7.89e-6) | - | -1.23e-5 (8.59e-6) | 3.17e-5 (4.73e-5) |
| Controls | No | Yes | Yes | No | Yes | Yes |
| Fixed Effects | No | No | Yes | No | No | Yes |
| N | 440,111 | 433,867 | 433,867 | 436,287 | 433,867 | 433,867 |
| $R^2$ | 0.0006 | 0.1910 | 0.2372 | -0.0007 | 0.1891 | 0.2351 |
| First stage F statistic | | | | 6,520 | 389.5 | 57.177 |

Note: Columns (1)–(6) estimate specifications corresponding to those presented in Table 2.A.2 with the inclusion of terms interacted with aggregate peer effort (RHS endogenous term).
Table 2.A.4: Reduced Form – Temporal Heterogeneity in Individual Level Peer Effects (Estimates of peer effect $\delta$ from Equation (2.1) for subsamples disaggregated by time period)

Standard errors in parentheses.

| | OLS (1) | OLS (2) | OLS (3) | IV 2SLS (4) | IV 2SLS (5) | IV 2SLS (6) |
|---|---|---|---|---|---|---|
| 2008 - 2012 | 0.0119 (0.0011) | 0.0271 (0.0017) | 0.0267 (0.0077) | -0.0104 (0.0020) | -0.0270 (0.0235) | -0.0359 (0.0487) |
| N | 20,399 | 20,209 | 20,209 | 20,262 | 20,262 | 20,209 |
| $R^2$ | 0.0077 | 0.5158 | 0.6288 | -0.0196 | 0.4807 | 0.6010 |
| First stage F statistic | | | | 3,677 | 57.00 | 85.29 |
| 2012 - 2016 | 0.0178 (0.0005) | 0.0155 (0.0006) | 0.0228 (0.0028) | -0.0237 (0.0008) | 0.0037 (0.0016) | -0.1262 (0.2851) |
| N | 146,778 | 145,952 | 145,952 | 146,256 | 145,952 | 145,952 |
| $R^2$ | 0.0279 | 0.4975 | 0.5837 | -0.1244 | 0.4937 | 0.3716 |
| First stage F statistic | | | | 6,637.1 | 1,543 | 8.405 |
| 2016 - 2019 | 0.0056 (0.0010) | 0.0052 (0.0017) | 0.0032 (0.0026) | -0.0013 (0.0020) | 0.0165 (0.0076) | -0.0009 (0.0385) |
| N | 272,934 | 267,706 | 267,706 | 269,769 | 267,706 | 269,706 |
| $R^2$ | 0.0003 | 0.1835 | 0.2696 | -0.0001 | 0.1829 | 0.2696 |
| First stage F statistic | | | | 3,491.0 | 607.9 | 35.63 |

Note: Columns (1)–(6) estimate specifications corresponding to those presented in Table 2.A.2 distinctly for sub-samples disaggregated by time period.
Table 2.A.5: Reduced Form – Beyond Contemporaneous Individual Level Peer Effect Estimates (Estimates for peer effect $\delta$ from Equation (2.13))

Dependent variable: Individual Commits. Standard errors in parentheses.

| | OLS (1) | OLS (2) | OLS (3) | IV 2SLS (4) | IV 2SLS (5) | IV 2SLS (6) |
|---|---|---|---|---|---|---|
| Peer Commits | 0.0139 (0.0009) | 0.0212 (0.0042) | 0.0068 (0.0070) | -0.0145 (0.0018) | 0.0575 (0.0110) | 0.0881 (0.0914) |
| Individual Commits (cumulative) | - | -0.0051 (0.0020) | -0.0063 (0.0047) | - | -0.0074 (0.0021) | -0.077 (0.0056) |
| Individual Commits (previous month) | - | 0.3158 (0.2402) | 0.1036 (0.2433) | - | 0.3014 (0.2355) | 0.0949 (0.2410) |
| Peer Group Size | - | -0.0175 (0.1270) | -0.0071 (0.4223) | - | -0.7290 (0.2148) | -1.835 (2.068) |
| Controls | No | Yes | Yes | No | Yes | Yes |
| Fixed Effects | No | No | Yes | No | No | Yes |
| N | 440,111 | 433,867 | 433,867 | 436,287 | 433,867 | 433,867 |
| $R^2$ | 0.0021 | 0.1802 | 0.2274 | -0.0066 | 0.2200 | 0.3100 |
| First stage F statistic | | | | 4,376 | 124.9 | 32.16 |

Note: Columns (1)–(6) present the coefficient estimate $\hat{\delta}$ from Equation (2.13), in which individual commits over the subsequent 3 months are regressed on aggregate peer commits from the preceding 3 months. Covariate controls and fixed effects correspond to the estimates in Table 2.A.2.
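The windowed quantities on each side of Equation (2.13) can be built from monthly series as in the minimal sketch below. It assumes two equal-length lists of monthly commits for one agent-project pair, indexed by month; the function name `window_sums` and the example data are hypothetical. Following the summation limits in Equation (2.13), both windows include the current month $t$.

```python
from typing import List, Optional, Tuple


def window_sums(own: List[int], peer: List[int], t: int,
                width: int = 3) -> Optional[Tuple[int, int]]:
    """Return (forward sum of own commits, backward sum of peer commits).

    forward:  own[t] + own[t+1] + ... + own[t+width-1]
    backward: peer[t] + peer[t-1] + ... + peer[t-width+1]
    Returns None when either window does not fit inside the series.
    """
    if t - (width - 1) < 0 or t + (width - 1) >= len(own):
        return None
    forward_own = sum(own[t + tau] for tau in range(width))
    backward_peer = sum(peer[t - tau] for tau in range(width))
    return forward_own, backward_peer


# Illustrative monthly series (not data from the sample).
own_commits = [1, 0, 2, 5, 3, 0]
peer_commits = [10, 20, 30, 40, 50, 60]

pairs = [window_sums(own_commits, peer_commits, t)
         for t in range(len(own_commits))]
print(pairs)
```

Months near either end of the series are dropped because one of the two windows would be truncated.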
Table 2.A.6: Reduced Form – Project-Level Estimates (Historical Project Contribution regressed on Contemporaneous Project Contribution)

OLS. Dependent variable: Project Commits. Standard errors in parentheses.

| | (1) | (2) | (3) | (4) | (5) | (6) | (7) |
|---|---|---|---|---|---|---|---|
| Project Commits (1 month prior) | 0.4502 (0.2216) | 0.3188 (0.1998) | 0.3188 (0.2000) | 0.3186 (0.0387) | 0.2804 (0.0449) | 0.3186 (0.0386) | 0.2803 (0.0450) |
| Project Commits (2 months prior) | - | -0.0383 (0.0666) | -0.0383 (0.0659) | -0.0384 (0.0289) | -0.0632 (0.0349) | -0.0384 (0.0288) | -0.0633 (0.0350) |
| Project Commits (3 months prior) | - | 0.1224 (0.0391) | 0.1223 (0.0384) | 0.1222 (0.0311) | 0.0882 (0.0394) | 0.1221 (0.0309) | 0.0880 (0.0396) |
| Project Commits (cumulative) | - | 0.0156 (0.0039) | 0.0156 (0.0033) | 0.0157 (0.0067) | 0.0159 (0.0148) | 0.0157 (0.0067) | 0.0160 (0.0149) |
| Controls | No | Yes | Yes | Yes | Yes | Yes | Yes |
| Fixed Effects: Time | No | No | Yes | No | No | Yes | Yes |
| Fixed Effects: Project | No | No | No | No | Yes | No | Yes |
| Fixed Effects: Language | No | No | No | Yes | No | Yes | No |
| N | 96,453 | 96,294 | 96,294 | 96,294 | 96,294 | 96,294 | 96,294 |
| $R^2$ | 0.19555 | 0.27487 | 0.27567 | 0.27494 | 0.30079 | 0.27575 | 0.30159 |

Note: Columns (1)–(7) contain coefficient estimates for current month total project contribution regressed on project contribution in previous months and the cumulative project contribution. Other controls include lagged and cumulative numbers of project contributors, project quality, and quadratic terms for project age. Columns (1) and (2) report heteroskedasticity-robust standard errors, Column (3) clusters standard errors by month, Columns (4) and (6) by project language, and Columns (5) and (7) by project.
Table 2.A.7: Reduced Form – Project-Level Estimates (Historical Number of Contributors regressed on Contemporaneous Number of Contributors)

OLS. Dependent variable: Number of Project Contributors. Standard errors in parentheses.

| | (1) | (2) | (3) | (4) | (5) | (6) | (7) |
|---|---|---|---|---|---|---|---|
| Number of contributors (1 month prior) | 0.8822 (0.0166) | 0.6082 (0.0430) | 0.6085 (0.0459) | 0.6080 (0.0422) | 0.5187 (0.0380) | 0.6082 (0.0423) | 0.5188 (0.0382) |
| Number of contributors (2 months prior) | - | 0.1175 (0.0408) | 0.1167 (0.0477) | 0.1173 (0.0248) | 0.0778 (0.0346) | 0.1166 (0.0251) | 0.0774 (0.0348) |
| Number of contributors (3 months prior) | - | 0.1772 (0.0311) | 0.1769 (0.0336) | 0.1769 (0.0266) | 0.1307 (0.0196) | 0.1767 (0.0263) | 0.1306 (0.0194) |
| Number of contributors (cumulative) | - | 0.0007 (0.0002) | 0.0007 (0.0002) | 0.0007 (0.0002) | -0.0008 (0.0004) | 0.0007 (0.0003) | -0.0008 (0.0008) |
| Project Commits (cumulative) | - | 1.01e-5 (8.09e-6) | 1.04e-5 (8.07e-6) | 1.01e-5 (8.1e-6) | 0.0001 (1.69e-5) | 1.05e-5 (1.13e-5) | 0.0001 (4.33e-5) |
| Controls | No | Yes | Yes | Yes | Yes | Yes | Yes |
| Fixed Effects: Time | No | No | Yes | No | No | Yes | Yes |
| Fixed Effects: Project | No | No | No | No | Yes | No | Yes |
| Fixed Effects: Language | No | No | No | Yes | No | Yes | No |
| N | 96,453 | 96,294 | 96,294 | 96,294 | 96,294 | 96,294 | 96,294 |
| $R^2$ | 0.77107 | 0.79349 | 0.79494 | 0.79354 | 0.81159 | 0.79499 | 0.81299 |

Note: Columns (1)–(7) contain coefficient estimates for the number of contributors for a project in the current month regressed on the number of contributors in previous months and the cumulative number of project contributors. Other controls include lagged and cumulative project contribution, project quality, and quadratic terms for project age. Columns (1) and (2) report heteroskedasticity-robust standard errors, Column (3) clusters standard errors by month, Columns (4) and (6) by project language, and Columns (5) and (7) by project.
Table 2.A.8: Structural Model Notation

| Notation | Description |
|---|---|
| $i, j \in \mathcal{N}$ | agents, where $|\mathcal{N}| = N$ |
| $p \in \mathcal{P}$ | OSS projects, where $|\mathcal{P}| = P$ |
| $t \in \mathcal{T}$ | time periods, where $|\mathcal{T}| = T$ |
| $a_{ipt} \in \mathbb{R}_+$ | agent $i$’s contribution to project $p$ in period $t$ |
| $y_{pt} \in \mathbb{R}$ | quality of project $p$ in $t$ |
| $b_{pt} = \partial y_{pt} / \partial a_{ipt}$ | marginal product of labor in terms of public good quality |
| $x_{it} \in \mathbb{R}_+$ | agent $i$’s consumption of numeraire good (e.g. time) |
| $\omega_{it} \in \mathbb{R}_+$ | agent $i$’s numeraire endowment |
| $z_{ipt}$ | agent $i$’s latent “ability” in project $p$ at time $t$ (see Equation (2.6)) |
| $z_p$ | project $p$’s latent “ability threshold” (see Equation (2.6)) |
| $v_{ipt}$ | private contribution benefit for agent $i$ in project $p$ at time $t$ |
| $c_{ipt}$ | contribution cost (inverse productivity) for agent $i$ in project $p$ at time $t$ |
| $\gamma$ | extensive margin peer effects (see Equation (2.10)) |
| $\delta_v$ | intensive margin peer effects for marginal private benefits of contribution |
| $\delta_c$ | intensive margin peer effects for marginal private costs of contribution (see Equation (2.11)) |
| $\beta_z$ | control variable parameters for latent agent productivity $z_{ipt}$ in extensive margin decision (see Equation (2.10)) |
| $\beta_v$ | control variable parameters for marginal private benefits of contribution (see Equation (2.12)) |
| $\beta_c$ | control variable parameters for marginal private cost of contribution (see Equation (2.12)) |
| $\epsilon^z_{ipt}$ | unobserved factors influencing extensive margin decision (see Equation (2.10)) |
| $\epsilon^v_{ipt}$ | unobserved factors influencing marginal private benefit shock $v_{ipt}$ (see Equation (2.12)) |
| $\epsilon^c_{ipt}$ | unobserved factors influencing marginal private cost shock $c_{ipt}$ (see Equation (2.11)) |

2.B Figures

Figure 2.B.1: Example GitHub Repository Page – twbs/bootstrap

Figure 2.B.2: Descriptive Statistics – Project Creation Dates and Earliest Commits. [x-axis: Year; y-axis: Number of projects; series: Date Created on GitHub, Earliest Commit]

Figure 2.B.3: Descriptive Statistics – Distribution of Project-level Contribution Shares. [x-axis: Individual share of total project contribution per period; y-axis: Proportion]

Figure 2.B.4: Descriptive Statistics – Aggregate contribution in sample. [x-axis: Year; y-axis: Total Commits (thousands)]

Figure 2.B.5: Descriptive Statistics – Distinct contributors in sample. [x-axis: Year; y-axis: Distinct contributors]

Figure 2.B.6: Descriptive Statistics – Mean individual and peer contribution per project. [x-axis: Year; y-axis: Mean Commits; series: Individual, Peer]

Figure 2.B.7: Reduced Form – Project Heterogeneity (Distribution of project-level estimates of $\hat{\delta}$ for Equation (2.1)). [x-axis: Coefficient estimate from Equation (1); y-axis: Count; series: OLS, 2SLS]

Figure 2.B.8: Reduced Form – Temporal Heterogeneity (Estimates for Equation (2.1) over sample period. The top subplot includes estimates for the peer effect coefficient $\hat{\delta}$ of Equation (2.1) within annual cross-sectional sub-samples. The bottom subplot includes estimates for the peer effect coefficient within cumulative sub-samples (i.e., all observations $\le t$).) [panels: Cross Section, Cumulative; x-axis: Year; series: OLS, 2SLS]

Figure 2.B.9: Reduced Form – Insider Contribution and Crowding Out (Estimates for $\delta$ in Equation (2.14)). [x-axis: Year; series: Member, Owner]

Figure 2.B.10: Structural Model – Recovered Benefit and Productivity Shocks $v_{ipt}, c_{ipt}$ and marginal product of labor parameters $b_{pt}$ for all observed contribution $a^\star_{ipt} > 0$ (Equation (2.8)). [panels: Marginal Private Benefits ($v_{ipt}$), (Inverse) Marginal Cost ($1/c_{ipt}$), Marginal Product of Labor ($b_{pt}$); x-axis: Year]

Figure 2.B.11: Structural Model – Correlation between Benefit and Productivity Shocks $v_{ipt}, c_{ipt}$ over sample period. [x-axis: Year; y-axis: Cov($v_{ipt}$, $c_{ipt}$)]

Figure 2.B.12: Structural Model – Extensive Margin Peer Effects (Project-level estimates for $\gamma$ from Equation (2.10)). [panels: Number of Peers, Lagged Peer Contribution, Cumulative Peer Contribution; y-axis: Percent; groups by Total Contributors: (15, 29], (29, 2707]]

Figure 2.B.13: Structural Model – Intensive Margin Peer Effects (Project-level estimates for $\delta_v$ from Equation (2.12) and $\delta_c$ from Equation (2.11)). [panels: $\delta_v$, $\delta_c$; y-axis: Percent; groups by Total Contributors: (15, 29], (29, 2707]]

Figure 2.B.14: Structural Model – Counterfactual Growth in Aggregate Contribution without Peer Influence. [x-axis: Year; y-axis: Cumulative aggregate commits (log); series: Data, Counterfactual (no peer effects), Counterfactual (extensive margin peer effects only)]

2.C Data Details

Sources

We use several data sources for our empirical sample:

• GHTorrent,[86] an archive that seeks to provide an offline, historical record of all public activity on the GitHub platform. The data is very large (the June 2019 archive is 104 GB compressed) but includes scripts so that it can be loaded into a relational database management system for out-of-core analysis. As an alternative, the data is also hosted on Google BigQuery.

• Project source code hosted on the GitHub platform. Projects are usually managed by a version control system (VCS) that, among many other technical features, records a chronological history of changes to the project’s codebase. This allows us to create measures of project characteristics at each point in the project’s history. On GitHub, the VCS tool used is git.

[86] Source: https://ghtorrent.org/

Sample Selection

As the number of projects recorded in the GHTorrent dataset is rather unwieldy for analysis by conventional means, we resort to sampling. We use the following procedure to develop a sample of popular, collaborative OSS projects hosted on GitHub:

1. From the set of public GitHub projects created before 2019–06–01, select the subset with (1) 15 or more distinct contributors and (2) 100 cumulative “stars”. This subset is the set of top projects.

2. Take a 10% random sample $\mathcal{P}$ from the set of top projects. This set of core projects forms the basis of the projects considered in both the reduced form and structural analysis.

3. For all projects $p \in \mathcal{P}$, determine the set of agents $\mathcal{N}_p \equiv \{ i \mid a_{ipt} > 0 \ \forall t \in \mathcal{T} \}$ that contribute to $p$. For core projects $\mathcal{P}$, contribution is observed over time periods $t \in \mathcal{T}$ where $\inf(\mathcal{T})$ = 2009–11–01 and $\sup(\mathcal{T})$ = 2019–05–01. Collect all core agents into the set $\mathcal{N} \equiv \cup_{p \in \mathcal{P}} \mathcal{N}_p$.

4. With $\mathcal{N}$, $\mathcal{P}$, and $\mathcal{T}$ defined, we can proceed to collect measures of contribution levels, project characteristics, and agent characteristics.
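The selection steps above can be sketched in a few lines. This is a minimal illustration rather than the pipeline actually run against GHTorrent: the `Project` record, its field names, and the example data are hypothetical, and the star threshold is read here as "at least 100", which is an assumption.

```python
import random
from dataclasses import dataclass
from typing import FrozenSet, List, Set, Tuple


@dataclass(frozen=True)
class Project:
    """Hypothetical project record; field names are illustrative."""
    name: str
    created: str                  # ISO date, e.g. "2015-01-01"
    contributors: FrozenSet[str]  # distinct contributor identifiers
    stars: int                    # cumulative GitHub stars


def select_core_projects(projects: List[Project], cutoff: str = "2019-06-01",
                         min_contributors: int = 15, min_stars: int = 100,
                         share: float = 0.10, seed: int = 0
                         ) -> Tuple[List[Project], List[Project], Set[str]]:
    """Steps 1-3: filter top projects, draw a random sample, collect agents."""
    # Step 1: top projects (ISO date strings compare chronologically).
    top = [p for p in projects
           if p.created < cutoff
           and len(p.contributors) >= min_contributors
           and p.stars >= min_stars]
    # Step 2: 10% random sample of the top projects (seeded for repeatability).
    k = round(share * len(top))
    core = random.Random(seed).sample(top, k)
    # Step 3: union of contributors across the sampled core projects.
    agents: Set[str] = set()
    for p in core:
        agents |= p.contributors
    return top, core, agents


# Illustrative data: project i has 10 + i contributors and 50 * i stars.
projects = [
    Project(f"proj{i}", "2015-01-01",
            frozenset(f"dev{i}_{j}" for j in range(10 + i)), stars=50 * i)
    for i in range(45)
]
top, core, agents = select_core_projects(projects)
print(len(top), len(core), len(agents))
```

Seeding the sampler keeps the core set reproducible, which matters because every later measure is collected only for the sampled projects and their contributors.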
2.D Additional Reduced Form Results

We provide some deeper analysis of the reduced form peer effects estimates in an effort to (1) provide robust support for the baseline peer effects estimates in Table 2.A.2 and (2) disentangle the various forces embedded in the full sample estimates.

Interactions

Various observable factors may be associated with different levels of peer effect on contribution. For example, agents in larger or higher-quality projects may respond differently to the contribution levels of their peers. Peer effects might also vary with the size of the project or of the peer group itself. We investigate these effects by estimating a version of Equation (2.1) that includes interactions between peer effort and various observables. We present interaction term coefficient estimates in Table 2.A.3.

At first glance, it is apparent that the interactions between peer effort and these various factors are second order compared to the primary peer effect. Moreover, most are statistically inconsequential at conventional levels. It is interesting to note, however, that peer effects are strongest when an agent has some cumulative history of contribution with the project. This effect is stronger than the influence of peer group size and project quality. We interpret this effect as evidence for the notion that agents form strong affinities to OSS projects and contribute to them with limited concern for other exogenous factors. There is weak evidence that peer effects are stronger for agents invested in the project, such as owners and members, but these effects are not statistically different from zero across specifications.

Temporal Heterogeneity

Given the dramatic growth of OSS participation on GitHub, it is likely that peer effects in the early days of the platform differ from those in later years. Table 2.A.4 collects estimates of the specifications in Table 2.A.2 for different time periods. Two patterns emerge.
First, positive peer effects are stronger in earlier periods. The Column (3) OLS estimates for years 2008 through 2012 are 0.0208 and statistically different from zero at conventional significance levels (compared with -0.0014 for the full sample). This estimate falls to -0.0088 for years 2016 through 2019. It should be noted that the numbers of observations in these subsamples are 20,209 and 267,706, respectively. Second, comparing Columns (3) and (6) across time periods in Table 2.A.4 provides some evidence that the OLS estimates are positively biased.

Finally, we further disaggregate the sample by time to estimate Equation (2.1) for (1) annual cross-sectional sub-samples and (2) cumulative sub-samples (i.e., for all observations $\le t$ for $t \in \{2008, 2009, \ldots, 2019\}$). We plot these coefficient estimates in Figure 2.B.8. Both Table 2.A.2 and Figure 2.B.8 suggest that peer effects were more likely to be positive in the early days of OSS on GitHub. This is consistent with Eghbal (2020)’s observation that “platforms broke the commons”: early OSS collaboration likely featured smaller, more cohesive project communities in which work was distributed evenly. As GitHub grew in size, the arrival of many small-share contributors helped grow projects in aggregate but coincided with diminished estimates of the peer effect.

Project Heterogeneity

There is also considerable project-level heterogeneity in peer effects. Figure 2.B.7 plots the distribution of peer effects obtained by estimating Equation (2.1) for each project individually. A key takeaway from Figure 2.B.7 is that, after accounting for covariates, peer effect estimates are surprisingly symmetric around the null hypothesis of $\delta = 0$. The share of projects in which peer and individual effort are substitutes and the share in which they are complements are relatively well balanced within the sample.
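The project-by-project exercise behind Figure 2.B.7 can be illustrated with a stripped-down sketch. It estimates a separate slope of individual commits on peer commits for each project and then counts how many projects fall on each side of zero. Unlike Equation (2.1), this sketch omits all covariates, fixed effects, and instruments, so it is only a simplified stand-in; the function names and example rows are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def ols_slope(x: List[float], y: List[float]) -> Optional[float]:
    """Slope from a bivariate OLS regression of y on x (with intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        return None  # slope is not identified without variation in x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def project_level_slopes(rows: Iterable[Tuple[str, float, float]]
                         ) -> Dict[str, Optional[float]]:
    """rows are (project, peer_commits, individual_commits) observations."""
    peer = defaultdict(list)
    own = defaultdict(list)
    for project, peer_commits, individual_commits in rows:
        peer[project].append(peer_commits)
        own[project].append(individual_commits)
    return {p: ols_slope(peer[p], own[p]) for p in peer}


# Illustrative observations (not data from the sample).
rows = [
    ("A", 10, 1), ("A", 20, 2), ("A", 30, 3),  # effort moves with peers
    ("B", 10, 3), ("B", 20, 2), ("B", 30, 1),  # effort moves against peers
]
slopes = project_level_slopes(rows)
complements = sum(1 for s in slopes.values() if s is not None and s > 0)
substitutes = sum(1 for s in slopes.values() if s is not None and s < 0)
print(slopes, complements, substitutes)
```

A roughly equal count of positive and negative slopes is what a distribution centered on zero would look like in this simplified setting.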
Beyond Contemporaneous Peer Effects

The contemporaneous specification in Equation (2.1) is likely too narrowly defined to capture peer effects that develop over a span longer than a single month. Since OSS contribution is public record, peer effects in a general sense need not be strictly contemporaneous. We estimate a version of this specification that seeks to estimate the effect of recent peer contribution (e.g., the previous three months) on subsequent individual contribution (e.g., the following three months):

$$\sum_{\tau=0}^{2} a_{ip,t+\tau} = \delta \sum_{\tau=0}^{2} a_{-ip,t-\tau} + \beta' X_{ipt} + \epsilon_{ipt}. \tag{2.13}$$

We present estimates of the specification above in Table 2.A.5. The estimates in Table 2.A.5 are larger in magnitude than the baseline results in Table 2.A.2, suggesting that peer effects are stronger when considered under a wider temporal bandwidth.

Project Level Effects

To begin to see how contribution patterns manifest along the extensive margin, we aggregate contribution to the project-month level and regress (1) aggregate project contribution on cumulative and lagged contribution (Table 2.A.6) and (2) the number of contributors on cumulative and lagged contributor groups (Table 2.A.7). The estimates in columns (6) and (7) of Table 2.A.6 imply that aggregate project contribution is greater, on average, when lagged project contribution is greater. Similarly, columns (6) and (7) in Table 2.A.7 demonstrate that the size of contributor peer groups is autocorrelated on average. Both of these results suggest that past contribution behavior predicts future participation along the extensive margin. We explore the potential for extensive margin peer influence more thoroughly in Section 2.6.3.

Insider Contribution and Crowding Out

An alternative way to look at peer groups within OSS projects is to distinguish between project “insiders” and contributors from the wider community. A natural question is whether contribution from project insiders crowds out contributions from project outsiders.
The wider community may have strong incentives to free-ride on disproportionate contributions from dominant core contrib- utors. We define a project insider as an individual who is either the nominal project owner or a member of the project. We aggregate individual contribution to the project level and split it into insider contribution a in pt and outsider contribution a out pt . Our “crowding-out” specification is a regression of outsider contribution on insider contribution and project level controls: a out pt =δa in pt +β ′ X pt +ϵ pt . (2.14) We estimate Equation (2.14) for each period and plot the coefficient estimates for ˆ δ in Figure 2.B.9. Estimates for ˆ δ are consistently negative and statistically significant, giving strong evidence for crowding out by project insiders. It is also worthy to note that crowding out appears to increase in later periods, coinciding with diminishing peer effects over time in Figure 2.B.8. 79 2.E Structural Estimation Details Given data (a ipt ,y pt ,x it ) for all i ∈ N,p ∈ P, and t ∈ T , we develop an estimation strategy to recover 1. Marginal product of labor parameters b = (b pt ) from the project quality function in Equa- tion (2.3). 2. Private benefit and productivity shocks s = (v ipt ,c ipt ) for all a ⋆ ipt > 0 from the equilibrium contribution level in Equation (2.8). 3. (Extensive margin peer effects) Parameters (γ ,β z ) from Equation (2.10). 4. (Intensive margin peer effects) Parameters (δ c ,δ v ,β c ,β v ) from Equations (2.11) and (2.12). The parameters of interest areδ =(δ c ,δ v ), which drive intensive margin peer effects, and γ , which drive extensive margin peer effects. For each project p∈P, our estimation strategy is as follows: 1. Assume disturbances are jointly normally distributed (ϵ z ipt ,ϵ v ipt ,ϵ c ipt )∼N (0,Σ ), independent and identically distributed between agents and time. Within the variance-covariance matrix Σ , assume that σ 2 z =1. 
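This implies

\[
\begin{pmatrix} \epsilon^z_{ipt} \\ \epsilon^v_{ipt} \\ \epsilon^c_{ipt} \end{pmatrix}
\sim N\left(
\begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix},
\begin{pmatrix}
1 & \sigma_{zv} & \sigma_{zc} \\
\sigma_{zv} & \sigma_v^2 & \sigma_{vc} \\
\sigma_{zc} & \sigma_{vc} & \sigma_c^2
\end{pmatrix}
\right).
\]

Notice also that $\sigma_{zv} = \rho_{zv}\sigma_v$ and $\sigma_{zc} = \rho_{zc}\sigma_c$.

2. Given data $(a_{ipt}, y_{pt})$, recover $\mathbf{b}$ using Equation (2.3).

3. Given data $(a_{ipt}, y_{pt}, x_{it})$ and $\mathbf{b}$, recover shocks $\mathbf{s}$ using Equations (2.9), (2.5), and (2.3) by means of GMM. Let $\mathcal{P}_{it} \equiv \{p \in \mathcal{P} \mid a^{\star}_{ipt} > 0\}$ be the subset of projects $i$ contributes to in time $t$, and let $P_{it} = |\mathcal{P}_{it}|$. For each $i$ and $t$, there are $2P_{it}$ unknowns: $v_{ipt}$ and $c_{ipt}$ for each $a^{\star}_{ipt} > 0$. There are $P_{it}$ first order conditions from Equation (2.9), $P_{it}$ equations for project quality from Equation (2.3), and one budget constraint (Equation (2.5)):

\[
\begin{aligned}
a_{ipt} &= b_{pt} + v_{ipt} - c_{ipt} > 0 && \forall p \in \mathcal{P}_{it} \\
y_{pt} &\le b_{pt} \sum_{j} \sum_{s \le t} a_{jps} && \forall p \in \mathcal{P}_{it} \\
x_{it} + \sum_{p} c_{ipt} a_{ipt} &\le 1
\end{aligned}
\tag{2.15}
\]

Combining the moment conditions in (2.15), the GMM formulation to recover $(v_{ipt}, c_{ipt})$ for each agent $i$ and period $t$, given data $(a_{ipt}, y_{pt}, x_{it})$ and parameters $b_{pt}$, becomes

\[
(v_{ipt}, c_{ipt}) = \arg\min_{v_{ipt},\, c_{ipt}} \frac{1}{P_{it}} \sum_{p \in \mathcal{P}_{it}} \left(a_{ipt} - b_{pt} - v_{ipt} + c_{ipt}\right)^2 \quad \text{s.t. } 0 \ [\ldots]
\]

[…] $0\}, W_{ipt}, X_{ipt})$ and shocks $\mathbf{s}$, recover $(\gamma, \delta, \beta, \Sigma)$, where $\delta = (\delta_v, \delta_c)$ and $\beta = (\beta_z, \beta_v, \beta_c)$, using the maximum likelihood estimation (MLE) framework for the Heckman Selection model described by J. Zhao, H.-J. Kim, and H.-M. Kim (2020). Collect quantities either observed as data or recovered in the previous stages of estimation into a vector $D = (b_{pt}, d_{ipt}, v_{ipt}, c_{ipt}, W_{ipt}, X_{ipt})$. Collect remaining unknown parameters in a vector $\theta = (\gamma, \delta, \beta, \Sigma)$.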
For each project $p \in \mathcal{P}$, the MLE optimization problem becomes

\[
\begin{aligned}
\max_{\theta \in \Theta}\ & L(\theta \mid D) = \prod_{i} \prod_{t} \left\{ f(v_{ipt} - c_{ipt} \mid d_{ipt} = 1) \Pr(d_{ipt} = 1) \right\}^{d_{ipt}} \Pr(d_{ipt} = 0)^{1 - d_{ipt}} \\
\text{s.t. }\ & f(v_{ipt} - c_{ipt} \mid d_{ipt} = 1) = \frac{ \dfrac{1}{\sigma} \phi\!\left( \dfrac{\epsilon^v_{ipt} - \epsilon^c_{ipt}}{\sigma} \right) \Phi\!\left( \dfrac{\rho}{\sqrt{1 - \rho^2}} \dfrac{\epsilon^v_{ipt} - \epsilon^c_{ipt}}{\sigma} + \dfrac{\gamma' W_{ipt} + \beta_z' X_{ipt}}{\sqrt{1 - \rho^2}} \right) }{ \Phi(\gamma' W_{ipt} + \beta_z' X_{ipt}) } \\
& \Pr(d_{ipt} = d) = \Phi(\gamma' W_{ipt} + \beta_z' X_{ipt})^{d_{ipt}}\, \Phi(-\gamma' W_{ipt} - \beta_z' X_{ipt})^{1 - d_{ipt}} \\
& v_{ipt} = \delta_v v_{-ipt} + \beta_v' X_{ipt} + \epsilon^v_{ipt} \\
& c_{ipt} = \delta_c c_{-ipt} + \beta_c' X_{ipt} + \epsilon^c_{ipt} \\
& \sigma^2 = \sigma_v^2 + \sigma_c^2 - 2 \rho_{vc} \sigma_v \sigma_c \\
& \rho \sigma = \sigma_{zv} - \sigma_{zc}
\end{aligned}
\tag{2.17}
\]

where $\phi$ and $\Phi$ are the standard normal density and distribution functions, respectively. For computational convenience, we solve the MLE problem by instead minimizing the negative log-transformation of $L$.

Chapter 3

No Free Lunch For Programmers: Digital Supply Chains and the Economics of Software Dependency Management

3.1 Introduction

Modern software typically borrows 70 to 90% of its functionality from free and open source software (FOSS) projects (Nagle et al., 2022). The use of external software components can significantly lower development costs, reduce the need to “reinvent the wheel”, and allow specialized code to be organized into modular packages. [1] Relationships between projects within this ecosystem are known as software dependency networks, structures akin to “digital supply chains” in which any number of downstream dependents can costlessly share functionality served from an upstream dependency (Eghbal, 2016). [2] Given the way in which modern software services are produced and deployed, it is important to note that dependent software components can be affected contemporaneously by the dependencies they borrow from. [3] For example, the maintainer of an upstream dependency project may introduce a change that is backwards incompatible for downstream dependents, forcing the downstream maintainer to expend development effort to maintain functionality. In more serious cases, upstream changes may even introduce faults or exploits that can affect the operation and security of downstream dependents (Ohm et al., 2020). Therefore, a central economic question in this setting is how the maintainer of a given software project must balance the benefits of development expedience offered by using existing codebases against the risk introduced by relying on a network of potentially problematic external dependencies. [4]

[1] To paraphrase the Unix philosophy espoused by Ken Thompson, “Make each program do one thing well.” (McIlroy, Pinson, and Tague, 1978). See Lerner and Tirole (2002) and Baldwin and Clark (2006) for discussions of the economics of software modularity.
[2] For example, a visual representation of the dependency network for a sample of Node.js JavaScript projects can be seen at https://graphcommons.com/graphs/a7ec343d-2a0c-47bb-9658-bb8315e8a096.
[3] An overview of how services in modern software ecosystems are deployed and maintained can be found in Boldi and Gousios (2020). Importantly, increased uptake of OSS components by users makes them attractive targets for malicious actors (Ladisa et al., 2022).

While attractive for efficiently building complex projects, the hidden costs of using software dependencies can range from mild maintenance costs [5] to catastrophic risk for downstream applications and end users. In practice, software developers are said to spend roughly as much time managing their code and dependencies as they do writing new features (Grams, 2019).

Digital supply chain risk is not just a problem for the maintainers of software projects. Some famous cases demonstrate the intricate and pervasive nature of open software components and how faults or changes can have widespread and costly impacts across dependent userbases. [6] In March 2016, the maintainer of the Node.js package left-pad abruptly removed the package from the npm package registry [7], making the software unavailable to thousands of downstream dependents (Schlueter, 2016).
This led to over 2% of all npm packages failing to operate properly until maintainers could replace the missing functionality. While the package itself was only 17 lines of code and was replaced immediately, the aftermath of this abrupt removal begins to highlight the extent to which developers have come to rely on the availability of open code.

[4] To paraphrase DeVault (2021), there is “no free lunch” for the maintainers of software.
[5] A quote from an anonymous developer of the Eclipse integrated development environment (IDE) for Java: “I only depend on things that are really worthwhile. Because basically everything that you depend on is going to give you pain every so often. And that’s inevitable.” (Decan, Mens, and Grosjean, 2019).
[6] By “userbase”, we mean the incredibly broad set of stakeholders that have come to rely on the functionality and security of a given software component: developers who use software as intermediate inputs, individual end users, private firms, public institutions, etc.
[7] npm, the Node Package Manager, is the de facto standard for developing and distributing Node.js packages. See https://www.npmjs.com/.
[8] See https://heartbleed.com/.

In April 2014, a Google engineer reported an exploit that became known as the Heartbleed [8] bug in the source code of the OpenSSL library used for encryption, potentially exposing sensitive user information across an estimated 17% of public web servers (Mutton, 2014). Similarly, a fix was issued the same day the exploit was reported, but hundreds of thousands of unpatched servers remained vulnerable as late as 2017, five years after the vulnerability had been introduced into the codebase (Carey, 2017). [9][10] In September 2017, Equifax publicly announced a vulnerability stemming from their use of the Apache Struts website framework beginning in May 2017, exposing private records of over 147 million users (US CFPB, 2022).
The company agreed to a settlement with the Federal Trade Commission and the Consumer Financial Protection Bureau that entitled compromised users of the service to up to $425 million USD in restitution (US FTC, 2022). In general, the average cost of a data breach in 2021 was estimated to be $4.24 million USD (IBM, 2021). Together, these case studies illustrate the scope of vulnerability under which technological services reliant upon OSS ecosystems operate.

Software dependency networks share features with other networked settings commonly studied in the economic literature: joint research and development efforts between firms (Goyal and Moraga-Gonzalez, 2001), innovation and patents (A. B. Jaffe, Trajtenberg, and Henderson, 1993; Hall, A. Jaffe, and Trajtenberg, 2005; Acemoglu, Akcigit, and Kerr, 2016), academic publications (Hsieh, Konig, et al., 2018), linkages between financial institutions (Elliott, Golub, and M. O. Jackson, 2014; Acemoglu, Ozdaglar, and Tahbaz-Salehi, 2015), risk sharing (Fafchamps and Gubert, 2007), and inter-firm trade (Elliott, Golub, and Leduc, 2022). OSS projects are collaboratively developed, interdependent network public goods that generate value as both intermediate and final goods. [11] The setting embeds both positive and negative network externalities in complex ways. Prudent or risk averse maintainers create value for downstream dependents by freely sharing software functionality with minimal fluctuations in dependency project quality. Linkages can also directly or indirectly transmit contagion between projects in the form of lapses in functionality, technical debt [12], and even software faults and vulnerabilities.

[9] It is thought that the severity of an exploit is amplified if malicious actors are aware that valuable targets remain exposed to the vulnerability even after its disclosure.
[10] Yet another recent example of the wide-ranging impact of software faults is the Log4Shell vulnerability, introduced in 2013 and disclosed in December 2021, which allows an attacker to leak sensitive information passing between network connected devices (WIRED, 2021). It is estimated that the exploit exposed hundreds of millions of devices, or 93% of enterprise cloud environments (Wiz, 2021).
[11] For example, a software engineer seeking to develop a project for consumers may opt to use an external dependency as a production input.

With these features in mind, we seek to study the evolution of these networks and the implications of equilibrium structure by focusing on the microeconomic behavior of software project maintainers. Specifically, we develop a framework in which each maintainer will make decisions over (1) a level of development resources to invest in their own project and (2) which external projects to use as dependencies in an effort to minimize development costs and maintain a preferred level of expected project quality. [13] In doing so, we can learn which factors influence both (1) the level of overall welfare induced by the dependency ecosystem and (2) the robustness of equilibrium dependency structure to cascading failures. After developing some intuition for these mechanics, we can then consider a set of potential policy interventions that can improve equilibrium welfare, allowing maintainers to make project development decisions more efficiently. [14]

The chapter is organized as follows. In Section 3.2, we survey the literature. In Section 3.3, we introduce a framework to ground our study of the sociotechnical software dependency ecosystem, which gives focus to the behavior of cost-minimizing yet risk averse maintainers making development decisions for interrelated projects under uncertainty. We illustrate the key features of this setting with simple examples.
In Section 3.4, we introduce the data used in both the reduced form and structural empirical analyses while illustrating several descriptive patterns that guide our methodologies. With the empirical setting in place, we build intuition over equilibrium outcomes with a reduced form methodology in Section 3.5. In Section 3.6, we develop a complete structural model of software dependency network formation, assess its equilibrium properties, and discuss the specifics of estimation. In Section 3.7, we use the estimates of structural parameters to conduct counterfactual analysis of potential policy interventions. We conclude with final remarks in Section 3.8.

[12] In engineering and software development, technical or design debt occurs when a short term solution incurs larger costs over the long run (Techopedia, 2017). Proponents of efficient software design patterns argue that excessive dependency reliance contributes to technical debt (J. Jackson, 2019).
[13] In this sense, transaction costs driven by information asymmetry confront maintainers with a “make-versus-buy” decision when developing the functionality of their project (Coase, 1937; Williamson, 1975; Williamson, 1985). Under these conditions, some maintainers may prefer to invest more development effort in order to avoid external dependencies and provide a greater degree of “vertical integration” within their project (Grossman and Hart, 1986).
[14] In other words, either at lower cost or lower uncertainty over project quality, or both.

3.2 Literature

We begin our study by reviewing relevant strands of literature to illustrate empirical patterns in software dependency management, place the current study into context, and identify existing methodological approaches that can inform our analysis.

The empirical software engineering literature has established several salient stylized facts to characterize the state of software dependency networks in the wild. Kikas et al.
(2017) and Decan, Mens, and Grosjean (2019) find substantial indirect dependency between projects in software networks also characterized by limited direct dependency. Hence, empirical evidence suggests the observed behavior of project maintainers results in fragile dependency networks, vulnerable to contagion. [15] Common types of vulnerabilities can “break” functionality or expose sensitive user information for a package and its dependents (Prana et al., 2021). Decan, Mens, and Constantinou (2018b) find that it takes on average 24 months to find 50% of all vulnerabilities [16], that vulnerabilities are prevalent across releases, and that downstream dependents often remain unpatched even after a vulnerability is fixed upstream. Vulnerabilities can be further exacerbated by the reuse of software code, both in the form of reused code within packages and by reusing outdated dependencies (Pham et al., 2010). In some cases, up to 40% of errors in packages can be traced to changes in upstream projects (Decan, Mens, Claes, et al., 2016). Up to 80% of maintainers do not keep their dependencies up to date, while almost 70% are simply unaware of upstream version changes (Kula et al., 2017). Vigilant maintainers must not only manage development decisions within their own codebase, but also track changes upstream.

[15] As put by Zimmermann et al. (2019), such fragile networks can be described as “small world, high risk”.
[16] The delay between the identification of a vulnerability and the distribution of a patch fixing it is known as “technical lag” (Decan, Mens, and Constantinou, 2018a; Zerouali et al., 2018).

A significant body of research has sought to better understand the mechanics driving these observed forces in software networks as well as their implications.
One strand of literature has explored the costs of software vulnerabilities, including general inquiries into dependency risk (Schueller and Wachs, 2022), the link between vulnerability disclosure and firm valuation (Acquisti, Friedman, and Telang, 2006; Telang and Wattal, 2007; Anwar et al., 2018), the efficacy of vulnerability rewards programs (Finifter, Akhawe, and Wagner, 2013; M. Zhao, Grossklags, and P. Liu, 2015; Y. Roumani, Nwankpa, and Y. F. Roumani, 2016), and the economic theory behind optimal patch management (Huseyin Cavusoglu, Hasan Cavusoglu, and J. Zhang, 2006; Finifter, Akhawe, and Wagner, 2013). Several ambitious empirical studies have even endeavored to estimate the value of the entire OSS digital supply chain itself (Keller et al., 2018; Robbins et al., 2018). Despite these efforts, the present study fills a gap in the literature by studying empirically the decision of the individual maintainer to outsource functionality for their project and how this behavior influences the resulting equilibrium.

Our preferred modelling approach attempts to explain the formation of software dependency networks based on the micro-founded behavior of individual maintainers and therefore draws from several distinct efforts within the economic literature. A starting point for the complementarities in production and the formation of fragile supply chains under risk was outlined by Kremer (1993). [17] More general theoretical treatments have developed theory for the micro-foundations of network formation under risk aversion (Kovářík and Van der Leij, 2009; Blume et al., 2013; Kovářík and Van der Leij, 2014). [18] The percolation of supply chain disruptions downstream has also been studied empirically through the use of natural experiments (Bernard, Moxnes, and Saito, 2019; Carvalho et al., 2021).

[17] For overviews on production networks and input-output shock propagation, see Carvalho (2014) and Carvalho et al. (2021).
Motivated by linkages between financial institutions, a considerable number of studies pay particular attention to conditions under which network structure is susceptible to contagion (Elliott, Golub, and M. O. Jackson, 2014; Acemoglu, Ozdaglar, and Tahbaz-Salehi, 2015; Erol and Vohra, 2018; Marbukh, 2018).

The current study is most closely related to recent work by Elliott, Golub, and Leduc (2022), who consider the formation and robustness of supply chains in the presence of risk. The authors model complex production networks as the multi-sourcing strategies of individual firms, each subject to idiosyncratic disruptions, which exposes the entire network to contagious risk. [19] They find that even when firms can hedge against supply chain risk through multi-sourcing strategies, aggregate production remains quite sensitive to shocks in equilibrium. As anecdotal evidence suggests similar patterns may also be present within software dependency networks, the present study represents an empirical application continuing this strand of research in the domain of open source software and public good production.

For the purposes of structural estimation, we also look to a more general literature on strategic network formation (Bloch and M. O. Jackson, 2006; Galeotti and Goyal, 2010; Choi, Goyal, and Moisan, 2019; Christakis et al., 2020), as our objective in Section 3.6 is to simultaneously model the coevolution of agent choice of actions and link formation by following the work of Hsieh, König, and X. Liu (2022). In the spirit of Ballester, Calvó-Armengol, and Zenou (2006), the authors also integrate a counterfactual analysis that considers the welfare impact of removing critical agents or “key players” from the network.
In Section 3.7, we adapt this methodology by simulating dependency formation under the absence of “key dependency projects”. We also draw from work defining graph-theoretic measures to characterize properties of social networks, such as node centrality (Bloch, M. O. Jackson, and Tebaldi, 2019; Everett and Schoch, 2022) and network fragility (Doyle et al., 2005; Wan et al., 2021).

[18] Yet another adjacent strand of research has focused on incentives for agents to form networks to insure against risk (De Weerdt, 2002; Fafchamps and Lund, 2003; De Weerdt and Dercon, 2006; Fafchamps and Gubert, 2007; Bramoullé and Kranton, 2007; Ambrus, Mobius, and Szeidl, 2014).
[19] Elliott, Golub, and Leduc (2022)’s consideration of endogenously chosen “relationship strengths” in inter-firm trade is analogous to our focus on the maintainer’s choice between different dependency projects.

[Figure 3.3.1 diagram: projects i, j, k ordered from downstream (left) to upstream (right), with an arrow labeled “Inherited functionality” pointing from upstream to downstream.]
Figure 3.3.1: Project $i$ depends directly on project $j$ and project $j$ depends directly on project $k$. Hence $G_{ij} = G_{jk} = 1$ are the only non-zero elements of the adjacency matrix $G$. We say project $k$ is an indirect dependency of project $i$. Additional terminology: Project $k$ is upstream of both projects $i$ and $j$. Project $i$ is downstream of both projects $j$ and $k$. It is important to reiterate that $G_{ij} = 1$ implies that $i$ inherits some functionality from $j$. In other words, the dependence relationship runs in the opposite direction of the flow of inherited functionality.

3.3 Framework

To build intuition for our setting and methodology, we next sketch out a framework for the evolution of software dependency networks, centering our attention on the problem of an individual project maintainer who seeks to efficiently develop a level of software quality under uncertainty.
We illustrate, in turn, the general setting of software dependency networks, how indirect risk accumulates across interdependent projects, the maintainer’s choice over dependencies, and factors that influence overall network robustness to perturbations. The discussion in this section is merely meant to fix ideas and serves as a primer to Section 3.6, in which we develop a complete micro-founded structural model to more formally characterize this behavior.

3.3.1 Setting

Consider a set of software projects $i \in \mathcal{N} = \{1, \dots, N\}$. Each project can be indexed by a measure of its quality, $y = (y_i)_{i \in \mathcal{N}}$. Software projects can depend on one another, in which case the dependent (downstream) project borrows a subset of functionality from the dependency (upstream) project at a nominal price of zero [20], assuming the upstream project is publicly available and released under a permissive license. These unilateral dependency relationships between projects can be collected into a directed graph $G$, which in turn can be represented by the $N \times N$ adjacency matrix $G = [G_{ij}]_{i,j \in \mathcal{N}}$ with elements $G_{ij} \in \{0, 1\}$ for all $i, j \in \mathcal{N}$. [21] If $G_{ij} = 1$, then project $i$ imports some functionality from project $j$ and therefore depends on the quality of project $j$ to some extent:

\[
G_{ij} = \mathbb{1}\{\text{project } i \text{ depends on project } j\}.
\]

Otherwise, $G_{ij} = 0$. We say package $j$ is a direct dependency of package $i$ if $G_{ij} = 1$. [22] Package $k$ is an indirect [23] dependency of package $i$ if there exists a directed path from package $i$ to package $k$. [24] In the parlance of software dependencies, we can also say that package $i$ is downstream of package $j$ and package $k$ is upstream of packages $i$ and $j$. We illustrate the basics of this networked setting in Figure 3.3.1.

[20] Note the “nominal” aspect of free and open source software. The very spirit of this chapter is to highlight sources of hidden costs associated with relying on public software infrastructure.
[21] Following convention, $G_{ij} = 0$ when $i = j$.
[22] Alternatively, the graph $G$ can be represented as a tuple $G = (\mathcal{N}, E)$ where $E = \{(i, j) \mid G_{ij} = 1\}$.
[23] In the software development community, indirect dependencies are sometimes known as transitive dependencies. As “transitive relationship” has a different meaning in social networks literatures, we opt to use the term “indirect dependency”.
[24] That is, there exists a directed sequence of distinct dependency relationships $\{G_{ij} \mid i, j \in S \subset \mathcal{N}\}$ such that $G_{ij} = 1$ for each $i, j \in S$.
[25] We will use the terms project manager and maintainer interchangeably.
[26] In reality, contribution and design decisions in large OSS projects are often shaped by the consensus of many distinct developers. We simplify our modeling framework by assuming that any potentially collective decisions made in equilibrium are ultimately made by a single “project maintainer”.

Each project $i \in \mathcal{N}$ has a corresponding decision-making agent whom we will refer to as the project maintainer. [25][26] The objective of each maintainer is to efficiently develop their project while maintaining its expected quality above a given threshold. [27] The action space for each maintainer $i$ consists of choices over (1) the level of costly development effort devoted to their project, $x_i > 0$, and (2) the subset of projects to import as dependencies, $\{G_{ij}\}_{j \neq i}$. We assume that the cost of maintainer $i$’s development effort is given by $c_i(x, G)$ but that importing software dependencies has an upfront cost of zero: [28][29]

\[
c_i(x, G) = \frac{1}{2} x_i^2 - \left( a_i + \alpha \sum_{j \neq i} G_{ij} x_j \right) x_i. \tag{3.1}
\]

We further assume that project quality $y_i$ is a linear function of the dependency network $G$, the quality of external projects $y_{-i} \equiv (y_j)_{j \neq i}$, and aggregate effort $x \equiv (x_j)_{j \in \mathcal{N}}$:

\[
y_i(x, y_{-i}, G) = b_i x_i + \beta \sum_{j \neq i} G_{ij} y_j + \xi_i. \tag{3.2}
\]

To introduce risk and uncertainty in this framework, we assume that some share of project quality $\xi_i$ is stochastic, unobservable, and known only in distribution by maintainers.
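As a rough numerical companion to Equations (3.1) and (3.2), the sketch below evaluates both on the three-project chain of Figure 3.3.1. It assumes the cost function groups as $\frac{1}{2}x_i^2 - (a_i + \alpha \sum_{j \neq i} G_{ij} x_j) x_i$, and every parameter value, along with the function names `effort_cost` and `project_quality`, is illustrative rather than taken from the chapter. Stacking Equation (3.2) over projects gives $y = b \circ x + \beta G y + \xi$, which can be solved for $y$ whenever $I - \beta G$ is invertible (always the case for an acyclic dependency graph).

```python
import numpy as np

def effort_cost(i, x, G, a, alpha):
    """Maintainer i's effort cost under the assumed grouping of Equation (3.1):
    0.5 * x_i^2 - (a_i + alpha * sum_{j != i} G_ij * x_j) * x_i."""
    spillover = sum(G[i, j] * x[j] for j in range(len(x)) if j != i)
    return 0.5 * x[i] ** 2 - (a[i] + alpha * spillover) * x[i]

def project_quality(x, G, b, beta, xi):
    """Stack Equation (3.2) as y = b*x + beta*G@y + xi and solve for y."""
    n = len(x)
    return np.linalg.solve(np.eye(n) - beta * G, b * x + xi)

# Chain from Figure 3.3.1 with index order (i, j, k): i depends on j, j on k.
G = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])
x = np.ones(3)          # illustrative effort levels
b = np.ones(3)          # illustrative productivity parameters
a = np.zeros(3)
alpha, beta = 0.2, 0.5  # illustrative values, not estimates

baseline = project_quality(x, G, b, beta, np.zeros(3))
shocked = project_quality(x, G, b, beta, np.array([0.0, 0.0, -1.0]))
print(baseline)            # quality of i, j, k with no shocks
print(shocked - baseline)  # a unit negative shock to k, passed downstream
print(effort_cost(0, x, G, a, alpha))
```

In this toy case the shock to the most upstream project reaches $j$ scaled by $\beta$ and reaches $i$ scaled by $\beta^2$, which is the sense in which risk is inherited along dependency paths.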
Finally, we assume that maintainers are heterogenous in their relative level of risk aversion and define maintainer $i$’s preferences over the quality of their project as $u_i(x, y, G) = E[v_i(y_i)]$, where $v_i(\cdot)$ is a Bernoulli function. For example,

\[
u_i(x, y, G) = E\left[ -e^{-r_i y_i} \right]. \tag{3.3}
\]

[27] In Section 3.6, we will treat this threshold as exogenously given, unique to each project maintainer, and unobservable to the econometrician.
[28] Despite this assumption, using external software likely entails some fixed cost, as the downstream developer needs to understand and integrate the upstream package into their project. When a rational developer imports a dependency, we infer that the expected benefit of using the external package outweighs both fixed and marginal costs of working with the dependencies as well as the perceived risk of the dependency. In our modeling approach, dependency fixed costs are subsumed into the maintainer’s choice across risky alternatives.
[29] The cost function in Equation (3.1) requires some explanation. Without parameter restrictions, aggregate costs may behave bizarrely as contribution effort increases. If $a_i > 0$, the marginal cost of effort increases with effort up to a point and then begins to fall. This may capture a scenario in which a developer “learns by doing” and becomes more efficient as she authors more code. A more problematic scenario may arise if $a_i > 0$ is such that costs are actually negative. We discuss how these complications influence our estimation in Appendices 3.C and 3.D.

[Figure 3.3.2 diagram: two panels over projects i, j, k, l, m, n: (a) Indirect Network, (b) Hub Network.]
Figure 3.3.2: The network in Panel 3.3.2a is subject to more indirect risk than the network in Panel 3.3.2b. When considering the extent of both direct and indirect dependence, project $n$ is the most critical project in both networks.
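As a worked illustration of Equation (3.3), suppose, purely as an added assumption that the text does not make, that project quality is normally distributed, $y_i \sim N(\mu_i, \sigma_i^2)$, where $\mu_i$ and $\sigma_i^2$ are hypothetical symbols for its mean and variance. The moment generating function of the normal distribution then gives a closed form for expected utility and for the certainty equivalent $\mathrm{CE}_i$:

```latex
\[
E\left[-e^{-r_i y_i}\right] = -\exp\left(-r_i \mu_i + \tfrac{1}{2} r_i^2 \sigma_i^2\right),
\qquad
\mathrm{CE}_i = \mu_i - \tfrac{1}{2} r_i \sigma_i^2 .
\]
```

Under this assumption the certainty equivalent falls with the variance of quality at a rate proportional to $r_i$, so a maintainer with a larger $r_i$ would give up more expected quality to avoid the same fluctuation.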
Here $r_i > 0$ captures maintainer $i$’s degree of risk aversion: as $r_i$ increases, $i$ becomes more risk averse and enjoys less utility under a network $G$ in which the quality of her project is subject to greater fluctuations. Putting all these elements together, we assume that each maintainer chooses $(x_i^{\star}, \{G_{ij}^{\star}\}_{j \neq i})$ to (1) minimize development costs while (2) keeping expected utility over project quality above a threshold $\bar{u}_i$: [30]

\[
(x_i^{\star}, \{G_{ij}^{\star}\}_{j \neq i}) = \arg\min_{x_i > 0,\ \{G_{ij}\}_{j \neq i}} c_i(x, G) \quad \text{s.t. } u_i(x, y, G) \ge \bar{u}_i. \tag{3.4}
\]

In the remaining subsections we discuss how dependency network structure embeds risk for individual projects, the decision of a project maintainer to rely on external projects, and how network structure can be prone to fragility.

3.3.2 Risk Embedded in Dependency Network Structure

We use Example 3.3.1 and Figure 3.3.2 to show how network structure exposes projects to risk in the form of quality shocks to direct and indirect dependencies.

Example 3.3.1 (Risk Embedded in Dependency Network Structure). Consider the example networks in Figure 3.3.2. The set of projects is $\mathcal{N} = \{i, j, k, l, m, n\}$. In Panel 3.3.2a, $G_{ij} = G_{jk} = G_{kl} = G_{lm} = G_{mn} = 1$. In Panel 3.3.2b, $G_{in} = G_{jn} = G_{kn} = G_{ln} = G_{mn} = 1$. Assume that quality for each project takes the form given by Equation (3.2). Notice that in both networks, the removal of project $n$ has the greatest effect on downstream projects. When considering the extent of both direct and indirect dependence, project $n$ is the most critical project in both networks. In Panel 3.3.2a, the profile of project quality can be represented by the following system:

\[
\begin{aligned}
y_i &= b_i x_i + \beta y_j + \xi_i \\
y_j &= b_j x_j + \beta y_k + \xi_j \\
y_k &= b_k x_k + \beta y_l + \xi_k \\
y_l &= b_l x_l + \beta y_m + \xi_l \\
y_m &= b_m x_m + \beta y_n + \xi_m \\
y_n &= b_n x_n + \xi_n
\end{aligned}
\]

Recursive substitution shows that while project $n$ is subject only to fluctuations in $\xi_n$, the remaining projects inherit some risk from indirect upstream dependencies.
The unobservable portions of quality in projects $m$, $l$, $k$, $j$, and $i$ are $\beta \xi_n + \xi_m$, $\beta^2 \xi_n + \beta \xi_m + \xi_l$, $\beta^3 \xi_n + \beta^2 \xi_m + \beta \xi_l + \xi_k$, and so on.

In Panel 3.3.2b, consider the same set of projects under a different (hub) dependency network. Projects $i$, $j$, $k$, $l$, and $m$ each depend on project $n$, which itself has no dependencies. Therefore, relative to the network in Panel 3.3.2a, projects $i$ and $j$ are subject to less indirect risk. As we have seen in real world examples in Section 3.1, faults and vulnerabilities in upstream applications can have consequences in downstream dependents. Maintainers understand this risk and make development decisions conditional on the current state of the dependency network.

[30] We should acknowledge that, given the way in which this modeling framework is written, project maintainers choose dependency relationships on a rather ambiguous basis. A more realistic approach would account for the fact that certain projects require very specific inputs and place little to no value on dependencies that are not relevant to them. For example, a text-based application would have little need for dependencies providing graphical processing functionality. This oversimplification stems from both modeling convenience and the lack of available data classifying OSS projects by functionality. Future work in this space would do well to measure the match quality between various software projects.

[Figure 3.3.3 diagram: two panels, (a) and (b), over projects i, j, k, l, m, n.]
Figure 3.3.3: In Panel 3.3.3a, maintainer $i$ prefers depending on project $j$ over project $l$ and avoids a greater level of indirect risk embedded in project $l$. In Panel 3.3.3b, maintainer $i$ prefers $l$ to $j$ despite a greater level of indirect risk embedded by project $l$.

3.3.3 A Maintainer’s Choice Between Risky Alternatives

Using external software in a project can lower development costs and improve quality but also entails risk for maintainers.
We seek to model dependency formation as a choice conditional on a maintainer’s private level of risk aversion: a maintainer ought to only use an upstream project as a dependency if they find it beneficial to their project’s quality net of any risk the dependency introduces. Therefore, conditional on project quality and effort choices $(y, x)$, the dependency selection elements of the maintainer’s decision in Equation (3.4) can roughly be summarized as follows: [31]

\[
\text{Maintainer } i \text{ uses project } j \text{ as a dependency} \iff y_i(x, y_{-i}, G + ij) \succsim_i y_i(x, y_{-i}, G - ij).
\]

We formalize this choice by specifying a utility function $u_i$ for the preference relation $\succsim_i$ that reflects maintainer $i$’s individual level of risk aversion in Section 3.6. Since some portion of project quality is stochastic and unobservable ($\xi_i$), preferences can be represented in terms of expected utility à la von Neumann-Morgenstern. [32] Maintainer preference heterogeneity is a result of variation across $v_i(\cdot\,; r_i)$, a Bernoulli value function parameterized by a measure of risk aversion $r_i > 0$. [33] We illustrate the maintainer’s choice amongst dependencies with a simple example.

Example 3.3.2 (Maintainer Risk Aversion and Dependency Choice). Consider maintainer $i$’s choice between two candidate dependency projects, $j$ and $l$, represented in Figure 3.3.3. In both panels, project $j$ depends on project $k$ while project $l$ depends on project $m$ and project $n$. Conditional on maintainer $i$’s preferences for risk, she will choose to depend on a particular project if it improves the expected quality of her own project. In Panel 3.3.3a, maintainer $i$ imports project $j$ as a dependency over project $l$, indicating that she prefers the quality improvement and lower level of indirect risk introduced by relying on project $j$ over that offered by project $l$. In Panel 3.3.3b, maintainer $i$ instead prefers to use project $l$ as a dependency.
This indicates that although project $l$ embeds more indirect risk than project $j$, maintainer $i$ finds that the benefits of using $l$ outweigh the costs.

[31] Some notation for modifying a single relationship in a given dependency graph $G$: Let $G + ij$ denote the dependency graph that differs from $G$ only in that $G_{ij} = 1$, so that project manager $i$ imports functionality from project $j$. Similarly, let $G - ij$ denote a dependency network where the only difference from $G$ is that $G_{ij} = 0$.
[32] In other words, $\succsim_i$ and $v_i$ are such that $y_i(x, y_{-i}, G + ij) \succsim_i y_i(x, y_{-i}, G - ij) \iff u_i(x, y, G + ij) \ge u_i(x, y, G - ij)$.
[33] Therefore, preferences are represented by $u_i(x, y, G) = E[v_i(x, y, G; r_i)]$.

[Figure 3.3.4 diagram: four panels over projects i, j, k, l, m, n: (a) Central project is low risk, (b) Central project is high risk, (c) Structure isolates risk, (d) Structure amplifies risk.]
Figure 3.3.4: In Panels 3.3.4a and 3.3.4b, different project characteristics can influence system-wide fragility for networks with identical structure. In Panels 3.3.4c and 3.3.4d, different network structures can influence fragility when project characteristics are held constant.

Maintainer preferences for dependency stability are a driving force that determines the equilibrium structure of the network. As we discuss in the following section, this behavior has implications for the relative robustness or fragility of the entire ecosystem.

3.3.4 Fragile Dependency Networks

Both individual characteristics and the structure of the dependency network combine to expose individual projects to varying levels of risk, with implications for the overall value or health of the software ecosystem. We present two examples to illustrate these different channels.

Example 3.3.3 (Fragile Dependency Networks). Consider two alternative dependency networks in Panel 3.3.4a and Panel 3.3.4b of Figure 3.3.4. Assume that the only difference between these networks is that the variability in quality for project $n$ is greater in Panel 3.3.4b than it is in Panel 3.3.4a.
Notice that given the structure of the dependency networks in both settings, all projects are exposed to disturbances stemming from project n. In this sense, the network in Panel 3.3.4b is relatively more fragile than the network in Panel 3.3.4a since the central or hub project n is riskier. Next, consider the networks in Panel 3.3.4c and Panel 3.3.4d. In this case, differences in network structure can lead to increased fragility. The removal of project n in Panel 3.3.4c impacts only project m since the network is the union of three disconnected components. In Panel 3.3.4d, project n is a dependency, either direct or indirect, for all of its peers. Hence, we can say that the network structure in Panel 3.3.4d is relatively more fragile than in Panel 3.3.4c, since the removal of the same project is more disruptive to overall project quality.

It is useful to consider measures with the potential to characterize a given network $G$ in terms of fragility. One approach is to consider measures of network centrality. Specifically, consider Katz-Bonacich centrality for the nodes of the graph $G$. Roughly speaking, a node has greater Katz-Bonacich centrality when it is a hub for many other high in-degree nodes. 34 In the world of software networks, central hub dependencies lie at the core of the dependency network and serve, both directly and indirectly, as the foundation for many other projects. Using the logic outlined at the beginning of this subsection, software networks characterized by highly centralized projects may efficiently serve functionality to many dependents, but do so at the cost of increased network fragility.

34 Formally, denote the Katz-Bonacich centrality for project node i in graph $G$ as $k_i$. Then for a decay factor $\rho \in (0,1)$, Katz-Bonacich centrality is defined as $k_i(\rho, G) = \sum_{\ell=1}^{\infty} \rho^{\ell} \sum_j (G^{\ell})_{ij}$, where $\ell$ is the length of a walk between nodes i and j (Bloch, M. O. Jackson, and Tebaldi, 2019). In matrix notation, this becomes $(I - \rho G)^{-1} \rho G \mathbf{1}$.
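The series definition in footnote 34 can be evaluated directly. The following is a minimal sketch rather than the procedure used in this chapter: the function name `katz_bonacich`, the toy four-project network, and the value $\rho = 0.5$ are illustrative assumptions. It assumes the convention that $G_{ij} = 1$ when project i depends on project j, so applying the function to the transpose of $G$ scores a project by the walks arriving from its direct and indirect dependents.

```python
def katz_bonacich(G, rho, tol=1e-12, max_iter=10_000):
    """Sum_{l>=1} rho^l * (G^l @ 1), accumulated term by term.

    G is a square adjacency matrix given as a list of lists.
    """
    n = len(G)
    term = [1.0] * n          # starts as the vector of ones
    k = [0.0] * n
    for _ in range(max_iter):
        # term <- rho * G @ term, i.e. the next power in the series
        term = [rho * sum(G[i][j] * term[j] for j in range(n)) for i in range(n)]
        k = [k_i + t_i for k_i, t_i in zip(k, term)]
        if max(abs(t) for t in term) < tol:
            return k
    raise ValueError("series did not converge; try a smaller rho")


# Hypothetical network: projects 0, 1, and 2 each depend on hub project 3.
G = [
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
G_T = [list(row) for row in zip(*G)]  # transpose: edges point to dependents

k_out = katz_bonacich(G, rho=0.5)    # walks leaving each project (upstream reach)
k_in = katz_bonacich(G_T, rho=0.5)   # walks arriving at each project (hub score)
```

For an acyclic dependency graph the powers of $G$ eventually vanish, so the loop stops after finitely many terms for any decay factor; with cycles, convergence requires $\rho$ to be small relative to the spectral radius of $G$.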
3.4 Data

The data used in both our reduced form and structural approaches seeks to characterize (1) the features and outcomes within software projects and (2) the dependency relationships between them. How do upstream dependencies influence project contribution and quality? What social or technical characteristics of projects are associated with many upstream or downstream dependency relationships? Can the equilibrium structure of software dependency networks result from or contribute to these dynamics? What is the economic significance of these outcomes?

To address these questions empirically, we develop a dataset of sociotechnical measures for a sample of interrelated OSS projects. We choose to focus on projects from the Node.js JavaScript ecosystem. 35 The dependency relationships between widely used open source Node.js projects are tracked over time by the npm (Node.js package manager) registry. Most critically, the npm registry records (1) timestamps for when specific versions of packages are published and (2) the set of external dependencies, along with their respective versions, declared by the parent package. Hence, by knowing what external components a package relies on at a given point in time, we can observe the evolution of a software ecosystem and its dependency graph as a network panel. A notable advantage of this data is that it captures more information about the exact timing of package publication dates and dependency formation compared to single-network observations prevalent in the literature. 36 In addition to simplifying structural estimation, this data enables our structural approach to consider the interrelated decisions of a project manager over internal project development and dependency formation with external projects. 37

35 The rationale for this choice is discussed in sections below. Simply put, npm is the largest open source package ecosystem in terms of number of packages (2.61 million packages as of January 2020 (Katz, 2020)).
Moreover, we focus on a single programming language ecosystem to make more appropriate comparisons between packages.
36 Previous authors have developed estimation strategies to exploit repeated observations of networks (Snijders, Koskinen, and Schweinberger, 2010).
37 Critically, we can capture the initial conditions of the sample network to overcome any bias that might arise in characterizing the data generating process in our structural approach.

Another attractive property of this empirical setting is that it is possible to observe how sociotechnical features for each individual project evolve over time. The npm registry documents the repository URL for the package’s source code. Furthermore, a project’s source code is typically managed using a version control system (VCS), such as git, which has the benefit of chronicling development of the project at high granularity: one can use the version control log to know which developer contributed which lines of code to the project at specific moments in time. We use both the dependency relationships between packages 38 and the technical features of the project recorded in the VCS log. 39 In the following sections, we discuss the procedure we use to develop our empirical sample, describe the measurement of software quality, and illustrate the dataset with a selection of descriptive statistics.

3.4.1 Sampling Procedure

Software dependency networks can grow incredibly large and can be observed at a high temporal granularity. We must therefore resort to sampling a set of representative projects with the potential to capture the essence of dependency management dynamics. We focus on a single packaging ecosystem 40 to minimize irregularities that may arise from cross-language comparisons. Motivated by the case studies of major disruptions caused by widely used software projects mentioned in Section 3.1, we choose to focus on the largest packaging ecosystem tracked by the Libraries.io service: npm JavaScript packages.
38 Dependency relationships are tracked by the npm registry, a publishing platform for Node.js packages. Once published on the registry, users and developers can install these packages using the npm (Node.js Package Manager) tool.
39 Technical project features can be observed in the source code of the repository. Some technical details: the granularity of the VCS log allows us to download the source code of a project and “rewind” it to its state at a specific point in time.
40 In other words, a set of OSS projects written in a common language.

We begin by describing how we obtain a set of OSS projects and record their dependency relationships over time. Our sampling procedure can be summarized by the following steps. In Step 1, we sample the ten most widely depended upon Node.js packages in the npm registry as of September 2022. In Step 2, for each of these packages, we record a sequence of timestamps associated with minor version releases. 41 In Step 3, for each package at a specific timestamped version, we record the set of upstream runtime 42 dependencies the package depends upon at that point in time. We add this set of dependencies to the running list of sampled projects and return to Step 2. To reduce the size of the resulting sample, we restrict the set of timestamps sampled to minor package versions and limit the depth of upstream dependency projects sampled to 5th degree neighbors of the initial set of 10 seed projects. We refer to the set of versioned timestamps at which a sampled package and its dependencies are observed as our set of sample moments: points in time at which the dependency network potentially changes. 43 Throughout this analysis, we will refer to this specific recursive network sampling procedure as an upstream sample that captures the most central projects of the Node.js ecosystem.
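The recursive procedure in Steps 1 through 3 amounts to a depth-limited breadth-first traversal of upstream dependencies. The sketch below is only an illustration under stated assumptions: `registry` is a hypothetical in-memory mapping from a package name to its minor releases, each given as a (timestamp, runtime dependency list) pair; it is not the npm registry API, and `upstream_sample` is a name introduced here.

```python
from collections import deque


def upstream_sample(registry, seeds, max_depth=5):
    """Depth-limited upstream traversal from a set of seed packages.

    registry: hypothetical dict mapping a package name to a list of
              (timestamp, runtime_dependencies) pairs, one per minor release.
    Returns the set of sampled packages and the sorted sample moments.
    """
    depth = {seed: 0 for seed in seeds}   # Step 1: start from the seed packages
    queue = deque(seeds)
    moments = []                          # (timestamp, package, dependency set)
    while queue:
        pkg = queue.popleft()
        for timestamp, deps in registry.get(pkg, []):   # Step 2: minor releases
            moments.append((timestamp, pkg, frozenset(deps)))
            if depth[pkg] >= max_depth:
                continue                  # record the package but do not expand it
            for dep in deps:              # Step 3: add new upstream dependencies
                if dep not in depth:
                    depth[dep] = depth[pkg] + 1
                    queue.append(dep)
    moments.sort(key=lambda m: (m[0], m[1]))
    return set(depth), moments


# Hypothetical toy registry with integer timestamps.
registry = {
    "app": [(1, ["lib"]), (2, ["lib", "util"])],
    "lib": [(1, ["core"])],
    "util": [],
    "core": [(0, [])],
}
packages, moments = upstream_sample(registry, seeds=["app"])
```

Packages exactly `max_depth` steps from a seed are kept in the sample, but their own dependencies are not followed, which is one reading of the restriction to 5th degree neighbors described above.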
While this choice of sampling procedure naturally biases the selection of projects towards core libraries used in the development of larger, user-oriented projects, it is deliberate. Beyond keeping the size of the sample within reason, 44 the choice reflects the fact that any dynamics affecting this set of core packages will have widespread influence on downstream packages outside the sample. Hence, any welfare effects estimated within this core sample can be viewed as a lower-bound estimate for the npm ecosystem at large. The resulting dependency subnetwork contains 1,263 Node.js projects observed at 40,440 distinct sample moments from October 2010 through September 2022. A snapshot of the network sample as it was observed in September 2022 is depicted graphically in Figure 3.A.1. Evolution of the sample dependency graph over time can be seen in Figure 3.A.2.

41 Best practices in software development encourage the use of semantic versioning, a labelling system for published releases of software to indicate the degree to which the project has changed. Among other reasons, this is done to improve downstream compatibility, as managers of dependent projects can use the semantic version to determine if their dependency is likely to have any breaking or backwards-incompatible changes. See https://semver.org/ for more details.
42 Meaning the dependency is required by the dependent for basic functionality. Maintainers can also declare dependencies needed only for project development or extended functionality.
43 It’s important to note the use of the term “potentially”. A new version release of a software package may well contain the exact same set of dependencies as the previous version.
44 If we had conversely sampled downstream from the top ten most widely depended upon packages, the resulting sample may include hundreds of thousands of packages.

Once we have obtained a panel of dependency relationships, we next use the source code of each project to derive measures of sociotechnical outcomes.
For each package, we observe these outcomes for the set of project-specific moments, defined as when either (1) a minor version of the package is published or (2) a minor version of a package is declared as a dependency of another project. 45 We use the repository URL of the project to download a copy of its source code and version control history. 46 For each project moment, we use the VCS log to observe social features such as the cumulative level of commits to the project (i.e., contribution), the cumulative number of contributors, the number of core contributors to the project and its associated “bus factor”, 47 the project’s age, and an estimate of the number of hours spent 48 on project development.

45 Note that to keep the number of observations in the empirical dataset manageable for the purposes of structural estimation, we do not observe every single package for each moment, only the packages specified in a version’s changeset. We can get a sense of this technicality from the reduced form estimates in Tables 3.B.2 and 3.B.3, where the number of observations ranges from 196,894 to 206,598, depending on the availability of each covariate measure.
46 A forensic analysis of software source code revision history for sociotechnical measures falls under a branch of research in the computer science literature known as mining software repositories (MSR). Notable tools in this space include reaper (Munaiah et al., 2017a), pydriller (Spadini, Aniche, and Bacchelli, 2018), augur (CHAOSS, 2017), and grimoirelab (Dueñas et al., 2021).

We also use the source code of the project itself to measure 49 technical features of the codebase such as the number of source lines of code (SLOC), the cumulative size of the codebase in megabytes, the number of files, the number of distinct languages used, and the number of lines in the codebase that are considered documentation, and other derived measures such as “modularity”, defined as
the number of lines per file in the codebase, and “churn”, the ratio of cumulative commits to SLOC in the codebase. Projects in which the same sections of code are constantly under revision will, all else equal, have larger values for the churn measure. Finally, and most importantly for our measure of software project quality, we can derive a measure of sophistication for the codebase known as cyclomatic complexity.

47 We define the number of core contributors as the smallest number of contributors who together have contributed at least 80% of aggregate commits to the project. This measure is related to the so-called “bus factor” commonly discussed in the literature, which is used as an estimate of how susceptible a project is to the loss of key contributors. In our study, we define the bus factor for a project as the count of total cumulative contributors divided by the count of cumulative core contributors: the greater the bus factor (i.e., closer to 1), the more the project relies on a smaller set of core contributors. See https://chaoss.community/metric-bus-factor/ for more details.
48 We use an algorithm developed by Brunfeldt (2014). See https://github.com/kimmobrunfeldt/git-hours. The algorithm takes the revision history of the project (i.e., the git log) and identifies distinct “coding sessions”, defined by sequences of commits made less than 2 hours apart from each other. For each coding session, time allocation is estimated by the duration as measured by the time between the first commit timestamp of the session and the last. Finally, the sum of all session durations, from the initial commit at $t_0$ to the final commit before observation at time $t$, is the estimated time allocation for the entire codebase observed at $t$.
49 To generate these technical metrics, we use the static code analysis tool Succinct Code Counter, scc (Boyter, 2018). See https://github.com/boyter/scc for more information.
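Two of the social measures above, the session-based estimate of hours in footnote 48 and the core contributor count in footnote 47, can be sketched as follows. This is a simplified illustration that follows the verbal descriptions in those footnotes; it is not the git-hours tool itself, which may treat sessions differently, and the function names and inputs (commit timestamps in seconds, one author label per commit) are assumptions made for the example.

```python
from collections import Counter


def estimate_hours(commit_times, max_gap_hours=2.0):
    """Sum of coding-session durations, in hours.

    commit_times: commit timestamps in seconds. Consecutive commits made
    less than max_gap_hours apart belong to the same session; a session's
    duration is the time between its first and last commit.
    """
    times = sorted(commit_times)
    if not times:
        return 0.0
    total = 0.0
    session_start = prev = times[0]
    for t in times[1:]:
        if t - prev >= max_gap_hours * 3600:
            total += prev - session_start   # close the current session
            session_start = t               # open a new one
        prev = t
    total += prev - session_start           # close the final session
    return total / 3600


def core_contributors(authors, share=0.8):
    """Smallest number of contributors who account for at least `share`
    of all commits. authors: one author label per commit."""
    counts = sorted(Counter(authors).values(), reverse=True)
    needed = share * sum(counts)
    covered = 0
    for n_core, commits in enumerate(counts, start=1):
        covered += commits
        if covered >= needed:
            return n_core
    return 0


# Hypothetical commit log: two sessions lasting 1.5 and 0.5 hours.
hours = estimate_hours([0, 3600, 5400, 18000, 19800])
n_core = core_contributors(["a"] * 9 + ["b"])
```

Under this description a session containing a single commit contributes zero hours, so the estimate is best read as a conservative proxy for time spent.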
50 Cyclomatic complexity measures the number of linearly independent paths through the control flow of a software’s functionality. Simply put, smaller and simpler software projects will likely have lower measures of cyclomatic complexity. See https://www.ibm.com/docs/en/raa/6.1?topic=metrics-cyclomatic-complexity

3.4.2 Measuring Software Quality

The notion of a software’s quality is a nebulous concept. 51 In the simplest sense, software code is a collection of instructions for a machine to perform a specific task. Developers and users of a particular project may derive value from it in different ways. For example, a user may consider a piece of software to be of high quality if it can perform its stated purpose successfully, perform efficiently, and do so with minimal errors. Developers, on the other hand, may understandably place more emphasis on the “maintainability” of the software’s codebase. 52

Even after settling on a particular definition of software quality, how can it be measured? The use quality of a project may be proxied by the extent of popular uptake. How many “followers” have indicated interest in the project on software development platforms like GitHub? How frequently is the software discussed by users in external communities? 53 Perhaps most pertinent to the present study, how many external projects depend upon a particular software package? The technical quality of the project can be measured in yet other ways. For example, static code analysis tools and “linters” can scan the project’s codebase for potential vulnerabilities, poorly written or documented code, or other bad software development practices.

51 See Spinellis et al. (2009) for an overview of methods for evaluating the quality of OSS.
52 In economic terms, maintenance costs.
53 For example, we can measure this using the relative frequency of the project’s name in search engine trends or in software-specific Q&A forums such as StackOverflow.

It is important, however, to acknowledge that each of
these measures has relative strengths and weaknesses. The exact definition of software quality is likely best defined contingent on the context of its application.

For the purposes of the reduced form and structural analyses, we opt for a rudimentary measure of software quality designed to reflect two distinct notions of a codebase’s overall value. We say a project is of high quality if it (1) is complex and (2) attracts numerous contributors. 54 This measure captures both developer interest in contributing to the project along with a rough proxy for the level of engineering sophistication it entails. 55 In some reduced form specifications in Section 3.5, we will also argue that the number of downstream dependents a project serves can be used as a measure of the value or quality of the software project.

3.4.3 Descriptive Statistics

The sample gives insight into the (1) actions, (2) outcomes, and (3) structure that characterize a software dependency network. We briefly provide some descriptive statistics for this empirical sample.

Figure 3.A.1 presents a snapshot of the sample dependency network as it is observed in September 2022. At first glance, this snapshot reveals a tendency towards a hub structure for our empirical sample: a relatively small group of central nodes support their remaining peers both directly and indirectly. Figure 3.A.2 shows the growth in the network over time, revealing that as new packages enter the ecosystem, the dependency network becomes less dense. 56 Lower density networks may involve additional software development expenditures but at the same time can satisfy a wider range of computing application needs and can also serve to isolate dependency risk. Another way

54 Specifically, we will define quality as the sum of log cyclomatic complexity and the log of the number of cumulative contributors to the project. We then scale the resulting sum to reside within the interval [0,1].
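The quality measure in footnote 54 can be illustrated with a short computation. The text does not state how the sum is scaled to [0,1]; the sketch below assumes min-max scaling across the sample, and it assumes strictly positive inputs so that the logarithms are defined. The function name `quality_scores` and the example values are hypothetical.

```python
import math


def quality_scores(complexity, contributors):
    """Log cyclomatic complexity plus log cumulative contributors,
    rescaled to [0, 1] by min-max scaling (an assumed scaling rule).

    Both inputs are equal-length sequences of strictly positive numbers,
    one entry per project observation.
    """
    raw = [math.log(c) + math.log(n) for c, n in zip(complexity, contributors)]
    lo, hi = min(raw), max(raw)
    if hi == lo:
        return [0.0 for _ in raw]   # no variation: scale is undefined, return zeros
    return [(r - lo) / (hi - lo) for r in raw]


# Hypothetical sample of three projects.
scores = quality_scores(complexity=[1, 10, 100], contributors=[1, 10, 100])
```

Because both components enter in logs, a project that is ten times more complex and has ten times as many contributors moves the same distance along the scale regardless of its starting point.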
55 Similar measures of OSS codebase quality are used in Libraries.io’s SourceRank metric (Katz, 2020). For additional information on the SourceRank measure, see https://docs.libraries.io/overview.html#sourcerank.
56 We acknowledge this phenomenon may simply be an artifact of our sampling methods. However, Decan, Mens, and Constantinou (2018a) document growth in dependency networks by observing the entire population of packages for several ecosystems. In particular, the authors find that package growth in the Node.js ecosystem is exponential over the observation period, roughly similar to the finding for our empirical sample in Figure 3.A.2.

to characterize the extent to which certain packages are relied upon in the dependency network is through measures of the package’s centrality. For the purposes of illustration, we observe the network in several annual snapshots and calculate each node’s (1) Katz-Bonacich centrality and (2) betweenness centrality. 57 We present a bivariate scatter plot of each node’s centrality measures in Figure 3.A.3. Naturally, we can see that smaller networks feature nodes with greater centrality. However, we can also see that as the network grows larger in later periods, small groups of outlying nodes have exceptionally greater measures of relative centrality. In a sense, the larger dependency networks diversify some risk away with the introduction of new packages, but a few dependency “hubs” serve a larger number of dependents. Despite these insights, the overall effect of such network structures on maintainer welfare remains unclear.

Table 3.B.1 in Appendix 3.B contains summary statistics, notation, and brief descriptions of the key sociotechnical project-level measures used in both the reduced form and structural analysis. We highlight the key insights from these features.
57 Greater levels of either centrality metric open up the network to additional risk, all else being held equal. See Bloch, M. O. Jackson, and Tebaldi (2019) and Everett and Schoch (2022) for deeper discussions of network centrality metrics and their implications for social networks.

Most importantly, the vast majority of these project-level measures convey a common pattern of (right) skewness across projects that ought to have bearing on the interpretation of any sample-wide estimates. For example, the median project in the sample is a terminal dependency with no dependencies of its own, and hence dependency quality and contribution are both absent (i.e., zero). 58 Another important feature of this sample is that it is skewed towards upstream dependencies: the average (median) package has 2 (1) upstream dependencies but 5 (2) downstream dependents. Moreover, the average (median) package in the sample consists of 2,703 (195) cumulative commits and features dependencies with 1,451 (0) cumulative commits. 59 Finally, the average (median) package observation consists of 233 (20) cumulative contributors and 20 (1) core contributors, highlighting the skewed distribution of work in these ecosystems. A significant share of core dependencies rely on maintenance efforts by a small group of dedicated individuals.

Guided by the framework discussed in Section 3.3, we give deeper consideration to both (1) the relationships between these various features and (2) network structure itself throughout the reduced form analysis in Section 3.5.

58 The prevalence of skewness is a pattern reminiscent of the contribution behavior observed in Chapter 2.
59 This is also likely an artifact of sampling, as many of our observations consist of early period core dependencies with no observed dependents.
3.5 Reduced Form

Before developing a fully structural model of software dependency management, we begin our analysis with a reduced form approach to build intuition about empirical patterns. Our objectives in this section are two-fold. First, we begin to explore the extent to which upstream dependencies influence downstream dependent projects, by lowering contribution costs or improving project quality. Second, we estimate several linear specifications in which (1) the number of upstream projects a maintainer depends upon and (2) the number of downstream dependents a project supports are regressed in turn on a set of observables such as features of the project itself and the current state of the dependency network as a whole.

We assume that panel data $D_t \equiv (y_t, x_t, G_t, W_t)$ is observable by both maintainers and the econometrician, where $t \in T$ indexes a sequence of observations. 60 To be consistent with the notation of our framework outlined in Section 3.3, here the vector $x_t \equiv (x_{it})_{i \in N}$ contains all project contribution levels measured in number of commits, the vector $y_t \equiv (y_{it})_{i \in N}$ contains measures of project quality, $G_t$ captures the dependency network structure, 61 and $W_t \equiv (W_{it})_{i \in N}$ collects node (project) characteristics at the sample moment $t \in T$. We assume that for the equilibrium captured in these observables, behavior for maintainer i in each period t is a function of peer actions $j \neq i$, the state of the world $D_{t-1}$, and stochastic shocks. Our discussion outlines a set of econometric specifications, provides economic intuition for estimated parameters, and addresses issues pertaining to identification. 62

60 We will assume that time is discrete and therefore, without loss of generality, let $T \subseteq \mathbb{N} = \{1, 2, \dots\}$.
61 In other words, the adjacency matrix for the empirical sample network at moment t.

3.5.1 Contribution Levels

Suppose we are first interested in the relationship between upstream and downstream contribution.
Our preferred econometric specification, given in Equation (3.5), mirrors the first order necessary condition from the maintainer’s choice over contribution effort: 63

$$x_{it} = a_i + \alpha \sum_{j \neq i} G_{ijt} x_{jt} + \delta' W_{it} + \epsilon_{it} \tag{3.5}$$

62 Chandrasekhar (2016), Bramoullé, Djebbari, and Fortin (2020), Á. De Paula (2020), B. S. Graham (2020), and B. Graham and A. De Paula (2020) provide excellent surveys of empirical methods in social network analysis.
63 Recall the maintainer’s problem given in System (3.4). We will fully specify the maintainer’s optimization problem in Section 3.6.

In this specification, the fixed effect $a_i$ captures a time invariant propensity for contribution to project i. The term $\sum_{j \neq i} G_{ijt} x_{jt}$ in Equation (3.5) is the sum of contribution activity in project i’s dependencies. Therefore the parameter of interest in this specification, $\alpha \in \mathbb{R}$, measures the relative influence of upstream contribution on (downstream) contribution to project i. If $\alpha < 0$, then increased upstream contribution is associated with less downstream contribution on average. Conversely, $\alpha > 0$ implies that downstream contribution increases with the level of upstream dependency development. The direction of this net effect therefore maps into a pattern of substitution: if $\alpha > 0$, upstream and downstream contribution are gross complements. We can interpret this in two ways. First, the level of upstream development lowers the marginal cost of downstream contribution, generating a positive productivity effect. Second, large dependency trees require the maintainer of the downstream dependent to exert considerable effort to integrate and maintain. If $\alpha < 0$, upstream and downstream contribution are gross substitutes. This implies that larger dependencies allow downstream dependents to exert less development effort. Any one of these mechanisms seems plausible and none can be ruled out ex ante.
Moreover, while the specification assumes a common net pattern of substitution across projects and time, heterogeneous effects are likely more realistic.

The vector of controls $W_{it}$ includes other observable characteristics for project i that might conceivably influence contribution levels. 64 Specifically, we include controls such as a measure of project quality $y_{it}$, a quadratic term in project age, the total number of contributors as well as the number of core contributors, technical characteristics of the project such as source lines of code and cyclomatic complexity, and temporal lags of both contribution to project i and upstream contribution. 65 The term $\epsilon_{it}$ represents project contribution influences that are unobserved by the econometrician, independent and identically distributed, 66 and mean zero in expectation. In a setting in which (1) dependencies are formed unilaterally, (2) project managers are distinct across projects, and (3) the dependency network is acyclical, the terms on the right-hand side of Equation (3.5) are plausibly exogenous. 67 68 Therefore, in lieu of more rigorous argumentation, we are reasonably comfortable interpreting $\alpha$ as the causal effect of upstream contribution on downstream contribution in our fixed effects specifications.

64 Therefore $\delta$ is a vector of coefficients corresponding to these covariate controls.
65 That is to say, we include multiple lags of both the left-hand side endogenous and key right-hand side exogenous variables.
66 $E[\epsilon_{it} \epsilon_{js}] = 0$ for each $j \neq i \in N$ and $t \neq s \in T$.

We summarize coefficient estimates for the specification in Equation (3.5) in Table 3.B.2. To reiterate, the interpretation for the coefficient estimates for $\alpha$ is the effect of increased dependency contribution on the level of contribution in downstream projects. 69 The main takeaway from these results is that while in some specifications it would appear that there is a small positive productivity
effect from upstream dependencies ($\hat{\alpha} = 0.034$ in Model 1 of Table 3.B.2), this effect seems to diminish or vanish completely on average after controlling for project-specific fixed effects ($\hat{\alpha} = 0.003$ in Model 3) and/or covariate controls ($\hat{\alpha} = 0.000$ in Model 2). Therefore we cannot say that on average a downstream project with many large dependencies is necessarily larger in terms of commits once individual project characteristics are accounted for.

67 That is, $E[\epsilon_{it} \sum_{j \neq i} G_{ijt} x_{jt}] = E[\epsilon_{it} W^k_{it}] = 0$ for $i \in N$, $t \in T$, and all covariates $W^k_{it}$ in the vector $W_{it}$.
68 Consider the case in which the same set of developers contribute to a project i and its dependency j. In this case, the potential for simultaneity or reverse causality threatens naive estimates of $\alpha$ with endogeneity bias. A similar form of endogeneity may arise whenever a set of package dependencies form a cycle, for example if $G_{ij} = G_{jk} = G_{ki} = 1$ for the set of projects i, j, and k. We assume away any pervasive threats from the former case and argue that software engineering best practices mitigate the latter.
69 On average, ceteris paribus.

There are several ways to interpret this pooled estimate. First, we must acknowledge that this particular reduced form model captures only intensive margin productivity effects. One may argue that the downstream dependent exists at all only because the upstream dependency sufficiently lowers some fixed cost of development. Second, a contemporaneous link between the project development level across dependencies simply may not exist. This would arise if a developer imports a dependency once and does not change her own contribution patterns in light of upstream changes. This certainly can be the case if the dependency is small and not undergoing significant development. Finally, the observed equilibrium may describe a situation where large, general-purpose dependencies enable the efficient development of smaller, more specialized dependent projects.
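To make the role of the project fixed effect $a_i$ in Equation (3.5) concrete, the following sketch computes the within (fixed effects) estimate of $\alpha$ for a stripped-down version of the specification that omits the controls $W_{it}$. It is an illustration under stated assumptions, not the estimation code behind Table 3.B.2: the function name `within_alpha` and the toy panel are hypothetical, and each observation is assumed to already carry the upstream term $z_{it} = \sum_{j \neq i} G_{ijt} x_{jt}$.

```python
from collections import defaultdict


def within_alpha(panel):
    """Within estimator of alpha in x_it = a_i + alpha * z_it + e_it.

    panel: iterable of (project, x_it, z_it) tuples, where z_it is the
    summed contribution of project i's dependencies at moment t.
    """
    groups = defaultdict(list)
    for project, x, z in panel:
        groups[project].append((x, z))

    num = den = 0.0
    for obs in groups.values():
        # Demeaning by project removes the time invariant effect a_i.
        x_bar = sum(x for x, _ in obs) / len(obs)
        z_bar = sum(z for _, z in obs) / len(obs)
        for x, z in obs:
            num += (x - x_bar) * (z - z_bar)
            den += (z - z_bar) ** 2
    if den == 0.0:
        raise ValueError("no within-project variation in the upstream term")
    return num / den


# Hypothetical panel: both projects share alpha = 0.5 but have different a_i.
panel = [
    ("A", 10.5, 1.0), ("A", 11.0, 2.0), ("A", 12.0, 4.0),
    ("B", 3.0, 0.0), ("B", 4.0, 2.0),
]
alpha_hat = within_alpha(panel)
```

In the toy panel, project A has far more commits than project B at every level of upstream contribution, so a pooled regression would attribute part of that level difference to the upstream term; demeaning within project isolates the common slope.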
70 An apt analogy in this case may be a parallel between basic (i.e., generic dependencies) versus applied (i.e., dependents) research studied in the innovation literature.

3.5.2 Project Quality

In a similar fashion, we can next turn our attention to the relationship between the quality of a project and the quality of its dependencies. We use the specification in Equation (3.6):

$$y_{it} = b_{0i} + b_{1i} x_{it} + \beta \sum_{j \neq i} G_{ijt} y_{jt} + \delta' W_{it} + \epsilon_{it} \tag{3.6}$$

The project quality fixed effect $b_{0i}$ captures an intrinsic level of quality independent of upstream dependencies, controls, or temporal fluctuations. The term $b_{1i}$ captures the marginal product of contribution in terms of improving the quality of project i. The aggregate quality of upstream dependencies at time t is $\sum_{j \neq i} G_{ijt} y_{jt}$, and therefore the parameter of interest $\beta \in \mathbb{R}$ represents an attenuation factor with respect to quality influences transmitted through the dependency network. Similar to Equation (3.5), a vector of controls $W_{it}$ includes other observables that can potentially influence quality: cumulative contribution, project age, the size of the contributor base, maintainer characteristics, and lags of project quality, contribution, and upstream quality. We make assumptions for the unobserved component $\epsilon_{it}$ in Equation (3.6) analogous to those made for Equation (3.5). 71

We summarize coefficient estimates for the specification in Equation (3.6) in Table 3.B.3. Similar to our analysis of contribution productivity in the previous section, the effect that upstream dependency projects have on downstream quality is small and largely determined by individual project characteristics. Moreover, it is interesting to note that the number of commits in a project has little effect on its quality (i.e., $\hat{b}_{1i} \approx 0$ in all specifications). We cannot say, given our chosen project quality metric and conditional on individual project features, that upstream dependencies significantly improve the quality of downstream dependents.
3.5.3 Dependency Formation

Up until now, our reduced form analysis has focused exclusively on intensive margin effects from upstream dependencies. We have not addressed factors that influence the likelihood of dependency formation and therefore know very little about the extent to which project characteristics and maintainer preferences can drive equilibrium dependency structure. In this section, we investigate features of OSS projects that either (1) form many upstream dependencies or (2) serve many downstream dependents. We operationalize this by regressing, in turn, the number of upstream dependencies and the number of downstream dependents a package has on covariate controls. 72

71 Note that the residuals of the regression from estimating the specification in Equation (3.6) can be used to proxy for volatility or uncertainty in project quality. More details can be found in our discussion of structural estimation in Section 3.6.3.

Let $d^{out}_{it} \equiv \sum_{j \neq i} G_{ijt}$ denote the number of external projects that package i has declared as (upstream) dependencies at time t. 73 We study factors that drive a package to form many upstream dependency relationships using the specification described in Equation (3.7):

$$d^{out}_{it} = \delta' W_{it} + \epsilon_{it} \tag{3.7}$$

where $W_{it}$ is a vector of observables for project i at time t drawn from Table 3.B.1. Similarly, let $d^{in}_{it} \equiv \sum_{j \neq i} G_{jit}$ denote the number of external (downstream) projects that declare package i as a dependency at t. 74 Factors that are associated with the attractiveness of package i as a dependency can be studied using the specification in Equation (3.8):

$$d^{in}_{it} = \delta' W_{it} + \epsilon_{it} \tag{3.8}$$

As the number of downstream dependents is one way to measure a project’s importance or quality, the specification in Equation (3.8) is an alternative to the specification in Equation (3.6) to reveal which observable features contribute to package quality.
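The two dependent variables in Equations (3.7) and (3.8) are row and column sums of the adjacency matrix at a given sample moment. A minimal sketch, assuming $G_{ij} = 1$ when package i declares package j as a dependency; the function name `degrees` and the toy matrix are illustrative only.

```python
def degrees(G):
    """Out-degree (upstream dependencies) and in-degree (downstream
    dependents) for each package, given an adjacency matrix G as a
    list of lists with G[i][j] = 1 when package i depends on package j."""
    n = len(G)
    d_out = [sum(G[i][j] for j in range(n) if j != i) for i in range(n)]
    d_in = [sum(G[j][i] for j in range(n) if j != i) for i in range(n)]
    return d_out, d_in


# Hypothetical snapshot: packages 0, 1, and 2 each depend on package 3.
G_t = [
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
d_out, d_in = degrees(G_t)
```

Here package 3 has no upstream dependencies and three downstream dependents, the pattern the text associates with a hub or terminal dependency.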
72 We acknowledge that there are alternatives to count regression to study characteristics of dependency formation. For example, we could assess the effect of observables on dependency formation using dyadic regression (Helmers, Patnam, and Rau, 2017; Bramoullé, Djebbari, and Fortin, 2020): $G_{ijt} = 1\{r_i + \gamma_j + \delta' W_{ijt} + \epsilon_{ijt} \geq 0\}$. Without sub-sampling, estimating such a specification on the entire sample entails an onerous computational burden, and hence we opt for the simpler and arguably more interpretable approach of count regression.

73 In graph terminology, the out-degree of node $i$ at time $t$ for the graph $G_t$.

74 In other words, package $i$’s in-degree.

We present coefficient estimates for the specifications in Equations (3.7) and (3.8) in Table 3.B.4. Several patterns emerge. First, models (1) through (4) of Equation (3.7) suggest that higher quality packages declare a larger number of upstream dependencies. This pattern is notably stronger than the intensive margin quality effects collected in Table 3.B.3 and underscores the notion that popular, complex projects likely outsource much of their functionality to external packages. Second, as expected from a somewhat mechanical correlation, packages with more dependencies have higher dependency quality. On the other hand, packages with many downstream dependents feature fewer upstream dependencies and therefore enjoy smaller quality effects from their own dependencies. 75 Third, hub dependencies tend to be well documented while packages with many dependencies are not. It is likely that well documented software is easier to work with and therefore more attractive to use. Finally, a project’s lines of code, number of contributors, and age are not strong predictors of either upstream or downstream dependency. 76

3.5.4 Robustness

Pooled estimates of the effect of dependencies on downstream contribution and quality may mask effects present in various sub-samples.
To this end, we also estimate the dependency effects $\alpha$ of Equation (3.5) and $\beta$ of Equation (3.6) (1) at the project level, (2) over time, and (3) for the sub-sample of projects with at least one dependency.

75 One potential explanation for this pattern is modularity: larger and more complex packages import more dependencies and are more likely located downstream in the network.

76 We acknowledge that our measure of quality correlates somewhat strongly with SLOC and therefore may simply not add much predictive power with respect to dependency formation.

In Figure 3.A.4, we can see that the project-level estimates for the impact of dependencies on downstream contribution $\alpha$ are somewhat symmetrically centered around zero. The same is true for upstream quality effects $\beta$ at the project level. In Figure 3.A.5, we estimate $\alpha$ by annual sub-sample. Interestingly, the effect of dependencies on downstream productivity is much greater in earlier sample years, when both projects and the dependency network itself were much smaller. This would suggest that earlier periods of the sample dependency network featured a stronger degree of complementarity between upstream and downstream contribution for core Node.js packages. On the other hand, upstream quality effects are not markedly different in earlier sample periods compared to later years or the fully pooled sample.

Finally, we estimate these specifications for the sub-sample of projects with at least one dependency declared. These estimates ought to reflect dependency influences on the projects that actually rely on external software for some functionality. However, we find that these estimates are quite similar to those for the pooled sample, especially after controlling for project-specific fixed effects.

3.5.5 Summary

Overall, our reduced form methodology finds a limited impact of upstream contribution and quality on downstream project contribution or quality, respectively.
The notable exception is that dependency contribution seems to have had a stronger impact on downstream contribution productivity in earlier periods of the sample (Figure 3.A.5). An obvious potential explanation for this effect is that a considerable level of software functionality was absent in earlier periods of the sample and therefore increased project development effort was required on average. As the space of available functionality grows with the arrival of new dependencies, less “glue code” is required to integrate various functional components.

These insights guide our structural approach. First, reduced form analysis emphasizes the complexity of dynamics within the empirical setting of software dependency management. Without a well-specified structural model, it is unclear to what extent any of these estimated effects impact equilibrium welfare. Second, project-level fixed effects (i.e., $a_i$, $b_{0i}$, $b_{1i}$) seem to matter much more than average, intensive margin effects (e.g., $\alpha$, $\beta$) when considering the influence of upstream dependencies on downstream outcomes. This result further motivates a structural approach that permits counterfactual analysis in which key central projects are removed. Third, the reduced form approach does not attempt to address factors such as project development costs, uncertainty over dependency quality, or maintainer risk aversion. We place these considerations at the forefront of our structural model. Finally, sample evidence suggests that (1) higher quality packages import more dependencies and (2) well documented packages are more likely to serve as dependencies.

3.6 Structural Approach

Reduced form analysis serves as a starting point for characterizing key empirical patterns that begin to illustrate the framework outlined in Section 3.3. We next seek to formalize the microeconomic behavior of software project maintainers in an effort to explain how dependency networks evolve over time and deliver benefits to users.
Using the network formation model suggested by Hsieh, König, and X. Liu (2022) as a basis, our structural approach models the coevolution of both individual software projects and the dependency network. The structural model allows us to conduct two distinct types of counterfactual policy analysis and assess changes to equilibrium welfare, which we measure as the aggregate time cost for software developers. First, we can perturb structural parameters such as the distribution of maintainer risk aversion or variation in project quality. Second, we can simulate the removal of “key projects” (Ballester, Calvó-Armengol, and Zenou, 2006).

3.6.1 Setup

The setup of the structural model follows the framework from Section 3.3. We specify a project quality relation, contribution costs, preferences, and information available to each maintainer. 77

77 While the model structure captures a sequence of static equilibria over a number of periods, we suppress the time subscript throughout most of Sections 3.6.1 and 3.6.2 for the sake of streamlined notation. This should not affect any implications of the model. Assumption 6 in Section 3.6.2.1 discusses the specific sequence which these static equilibria follow.

3.6.1.1 Project Quality

Project quality $y_i$ is a function of (1) contribution effort $x_i > 0$ and (2) dependency relationships summarized by the directed network:

$$y_i(x_i, y_{-i}, G) = b_i x_i + \beta \sum_{j \neq i} G_{ij} y_j + \xi_i \quad \forall i \in N. \tag{3.9}$$

Here, $b_i$ is the marginal product of manager $i$’s contribution and $\beta$ captures the attenuation factor over quality derived from project $i$’s upstream dependencies. 78 The term $\xi_i$ captures unobservable influences that partially determine project quality.

We opt for a linear quality specification in Equation (3.9) to simplify the mathematics of strategic network formation under conventional methods (Mele, 2017; Á. De Paula, 2020; Badev, 2021; Hsieh, König, and X. Liu, 2022). This assumption is not without perils.
For example, this functional form implies that dependencies linearly and continuously influence dependent quality as a function of their size. In reality, the addition or removal of a key dependency may make or break a package, suggesting that quality effects are non-linear. In its current form, the best we can do is adapt our data such that the specification in Equation (3.9) is log-linear. Either innovations in strategic network formation modeling or a completely different methodological approach are required to account for more arbitrary kinds of non-linearity.

78 During estimation, we include a constant term in Equation (3.9), similar to the reduced form analog in Equation (3.6), which we omit here to simplify notation.

3.6.1.2 Contribution Costs

Contribution costs are assumed to be a convex function of effort level $x_i > 0$:

$$c_i(x, G) = \frac{1}{2} x_i^2 - \left( a_i + \alpha \sum_{j \neq i} G_{ij} x_j \right) x_i. \tag{3.10}$$

Notice that $c_i$ is decreasing in $a_i$ and $\alpha$, which capture manager $i$’s own productivity and any productivity spillovers from contribution in upstream dependencies, respectively. 79 Compared with the parameters $(b_i, \beta)$ in Equation (3.9), which capture the marginal productivity of contribution effort in terms of project quality, the parameters $(a_i, \alpha)$ in Equation (3.10) allow us to distinctly characterize contribution productivity in terms of marginal costs. As discussed in Section 3.5, however, if upstream and downstream contribution are gross complements, then $\alpha < 0$ and dependency usage imposes net costs on the maintainer.

3.6.1.3 Preferences

Project managers derive utility from their expected private valuation of their project:

$$u_i(x, y, G) = E[v_i(y_i)]. \tag{3.11}$$

By allowing variation in the Bernoulli function $v_i(\cdot)$, we capture the idea that maintainers will differ with respect to how much dependency risk they are willing to take on. In particular, we will assume some level of concavity in the function $v_i(\cdot)$:

Assumption 4 (Exponential Utility).
$v_i(z; r_i)$ is an exponential or constant absolute risk aversion (CARA) utility function

$$v_i(z; r_i) = -e^{-r_i z}, \tag{3.12}$$

where the absolute risk aversion parameter, $r_i > 0$, varies across maintainers $i \in N$.

79 While contribution costs in Equation (3.10) are expressed in rather arbitrary terms, we can derive a mapping between contribution costs implied by the structural model and time allocation (hours) to project development, $\omega_i$, observed in the empirical sample. Given estimates for $a_i, \alpha$ and data $x, G$, we can estimate parameters $\gamma_0, \gamma_1$ from a simple linear specification: $c_i(x, G) = \gamma_0 + \gamma_1 \omega_i + \epsilon_i$.

Under Assumption 4, the parameter $r_i \in \mathbb{R}$ captures project manager $i$’s relative level of risk aversion: maintainer $i$ is said to be more risk averse as $r_i \to \infty$. Since a portion of project quality in Equation (3.9) is uncertain and unobservable, the expected quality preferences under Equation (3.11) and Assumption 4 together imply that as a project manager becomes more risk averse, she suffers greater disutility with increased volatility of both her own package and the inherited volatility of upstream dependencies. We make an assumption over the specific form of this uncertainty in the following section.

3.6.1.4 Information Sets

To introduce uncertainty over project quality, we assume that stochastic quality disturbances $\xi_i$ are unobservable to maintainers ex ante.

Assumption 5 (Uncertainty in Package Quality). Assume the following:

1. $\xi \overset{iid}{\sim} N(0, \Sigma)$, where $\Sigma = I\sigma^2$ and $\sigma^2 = (\sigma_i^2)_{i \in N}$, are known only in distribution by project maintainers.

2. $\xi$ is independent and identically distributed across time periods.

3. Observables $(x, y, G)$ and parameters $\theta = (a, \alpha, b, \beta, r, \Sigma)$ are public information to all maintainers $i \in N$.

Therefore, maintainers are uncertain about the quality of all projects, including their own. The risk averse maintainer will enjoy greater utility when she takes actions to minimize exposure of her project to any sources of quality risk.
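The interaction between CARA preferences (Assumption 4) and normally distributed quality (Assumption 5) can be illustrated with a standard closed-form result: if $y \sim N(\mu, s^2)$, then $E[-e^{-ry}] = -\exp(-r\mu + \tfrac{1}{2} r^2 s^2)$, so the certainty equivalent is $\mu - \tfrac{1}{2} r s^2$. The following minimal Python sketch evaluates both quantities. The function names and the numerical values are hypothetical and chosen only for illustration; they are not estimates from the sample.

```python
import math


def cara_expected_utility(mu: float, var: float, r: float) -> float:
    """Closed form of E[-exp(-r * y)] when y ~ N(mu, var) and r > 0."""
    return -math.exp(-r * mu + 0.5 * r * r * var)


def certainty_equivalent(mu: float, var: float, r: float) -> float:
    """Sure quality level giving the same CARA utility as the risky project."""
    return mu - 0.5 * r * var


# Hypothetical values: mean quality 1.0, risk aversion 0.5, two variance levels.
eu_low_var = cara_expected_utility(mu=1.0, var=0.25, r=0.5)
eu_high_var = cara_expected_utility(mu=1.0, var=1.00, r=0.5)
ce_low_var = certainty_equivalent(mu=1.0, var=0.25, r=0.5)
```

Holding mean quality fixed, the higher-variance project yields lower expected utility, and the gap widens with $r$, which is the sense in which a more risk averse maintainer is penalized more heavily by volatility.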
Since the distribution of $\xi$ is common knowledge, the setting is characterized by a shared level of uncertainty rather than information asymmetries between agents (e.g., Akerlof (1978)).

3.6.2 Equilibrium

With the basic elements of the structural model established, we now discuss how maintainers are expected to behave in equilibrium. Assume that each project maintainer $i \in N$ chooses a tuple $(x_i^\star, \{G_{ij}^\star\}_{j \neq i})$ to minimize development costs $c_i(x, G)$ while keeping their private utility of expected project quality 80 $E[v_i(y_i)]$ above a threshold $\bar{u}_i$. We can express the maintainer’s static cost minimization problem as

$$\min_{x_i \geq 0,\ \{G_{ij}\}_{j \neq i}} c_i(x, G) \quad \text{s.t.} \quad u_i(x, y, G) \geq \bar{u}_i, \tag{3.13}$$

where $c_i$, $y_i$, and $u_i$ are defined in Equation (3.10), Equation (3.9), and Assumption 4, respectively.

We analyze the equilibrium of this system in several distinct phases. 81 First, we must make an assumption over the sequence of project development choices for each individual maintainer. Second, we characterize the equilibrium choices of the continuous quantities $(y^\star, x^\star)$. Third, we derive an expression for the project maintainer’s expected utility over the quality of their project in equilibrium, $E[v_i(y_i^\star)]$. Fourth, we characterize factors influencing the formation and dissolution of software dependencies and derive an expression for $\Pr(G_{ij} = 1)$, the probability that maintainer $i$ imports project $j$. Finally, we conclude with some comparative statics for equilibrium quantities.

3.6.2.1 Timing

Maintainers myopically best respond to solve the development cost minimization problem in Equation (3.13), conditional on both the state of the system at the beginning of the period, $D_{t-1}$

80 Or put more precisely, maintainer $i$’s private valuation of expected project quality.

81 It should be noted that the maintainer’s optimal choice over contribution levels and dependencies can be represented with alternative formulations.
We present some of these alternatives in Appendix 3.C.1 and discuss how we use them to map between the exposition here in Section 3.6.2 and structural estimation covered in Section 3.6.3.

and the optimal strategies of other maintainers. Here the state of the dependency ecosystem $D_t \equiv \{y_t, x_t, G_t, W_t\}$ is defined as it was in Section 3.5. 82

Assumption 6 (Timing). At the beginning of each period $t$, a single maintainer $i \in N$ is presented with the opportunity to change the dependency relationships of their software project. Next, all agents $i, j \in N$ adjust their contribution levels under the new network. These developments unfold according to the following sequence:

(S1) Maintainer $i$ chooses a set of optimal dependencies $\{G_{ijt}^\star\}_{j \neq i}$, conditional on the state of the ecosystem in the last period, $D_{t-1}$. This updates the network to $G_{t-1} \mapsto G_t^\star$.

(S2) In accordance with the response functions derived in Equation (3.13), all agents $i, j \in N$ determine their best response contribution levels $x_{it}^\star$ under the new network $G_t^\star$. This updates the remaining observables in the ecosystem: $x_{t-1} \mapsto x_t^\star$, $y_{t-1} \mapsto y_t^\star$, and $W_{t-1} \mapsto W_t$. Therefore, by the end of (S2), $D_{t-1} \mapsto D_t$.

The purpose of Assumption 6, beyond providing some structure to the game, is to connect the data generating process of the observed data to an estimation strategy that we outline in Section 3.6.3. 83 Observing the disaggregated evolution of software dependency networks over time allows us to model network formation as a sequence of choices made by individual agents in each period, greatly simplifying the estimation procedure compared to situations often found in the literature 84 in which only a single network observation is available. In the following two sections, we derive a characterization of the equilibrium data generating process via backwards induction of the maintainer’s game.
82 Note that $D_t$ tracks other observable features of projects $W_t$ even though they have no bearing on the structural model as specified.

83 In the empirical sample, a software project and its dependencies are observed when a new version is registered with the npm registry. Hence the timing of our model associates each sample moment $t \in T$ with a single agent $i$ who then makes $j \neq i$ linking decisions, $\{G_{ijt}\}_{j \neq i}$. This updates the network $G_{t-1} \mapsto G_t$. Then we allow all agents to optimally adjust their contribution levels conditional on the new network $G_t$ so that $(x_{t-1}, y_{t-1}, W_{t-1}) \mapsto (x_t, y_t, W_t)$.

84 For example, Leung (2015), Mele (2017), Christakis et al. (2020), and Ridder and Sheng (2020), to name a few.

3.6.2.2 Optimal contribution decision ($x_i$)

We first derive equilibrium contribution effort $x^\star$ and project quality $y^\star$, taking the dependency network $G$ as given. The first order necessary conditions for maintainer $i$’s optimal choice of $x_i > 0$ imply 85

$$x_i^\star = a_i + \alpha \sum_{j \neq i} G_{ij} x_j^\star. \tag{3.14}$$

In matrix form, the system described in Equation (3.14) becomes $x^\star = Aa$, where $A \equiv (I - \alpha G)^{-1}$ and $a = (a_i)_{i \in N}$. For $A$ to exist, it must be that $|\alpha| < 1$. Therefore, $x_i^\star = \sum_j A_{ij} a_j$. In equilibrium, this implies project quality will be given by

$$y_i^\star = b_i \left( a_i + \alpha \sum_{j \neq i} G_{ij} x_j^\star \right) + \beta \sum_{j \neq i} G_{ij} y_j^\star + \xi_i. \tag{3.15}$$

Similarly, $y^\star = B(b \circ x^\star + \xi) = B(b \circ (Aa) + \xi)$, where $B \equiv (I - \beta G)^{-1}$, $|\beta| < 1$, and $b \circ x^\star$ denotes the Hadamard product of the vectors $b$ and $x^\star$: $(b \circ x^\star)_i = b_i x_i^\star$ for $i \in N$.

Remark (Leontief Inverse for Software Dependency Networks). The matrices $A$ and $B$ would be known as the “Leontief inverse” in the literature on input-output modelling and capture the extent to which effects, such as increased contribution or fluctuations in quality, percolate through the dependency network $G$ to affect the welfare of dependents. This is the source of network externalities in this framework. Notice that if the spectral radius of $\alpha G$ is less than 1, then $A = I + \sum_{k=1}^{\infty} \alpha^k G^k$.
85 Technically, since the first-order necessary conditions for the maintainer’s problem in System (3.13) imply $\frac{\partial c_i(x^\star, G)}{\partial x_i} = \lambda_i \frac{\partial u_i(x^\star, y^\star, G)}{\partial x_i}$ for $x_i^\star > 0$ and a Lagrange multiplier $\lambda_i \geq 0$, the parameter $a_i$ in the equilibrium condition in Equation (3.14) does not exactly equal the parameter $a_i$ from the maintainer’s cost function in Equation (3.10). We adopt this simpler representation of equilibrium effort choice in Equation (3.14) to streamline both exposition and estimation. We discuss these details in Appendix 3.C.1, providing alternative characterizations for the maintainer’s problem in (3.13) and conditions under which the equilibrium allocations $(x^\star, y^\star)$ across these characterizations coincide. See Proposition 1 in Appendix 3.C.1.

86 This concept of attenuating indirect influence from dependencies further and further upstream is akin to Katz-Bonacich centrality from the perspective of the dependent package. We represent the intuition of this influence in Example 3.3.1.

Roughly speaking, if maintainer $j$ increases her contribution effort by 1%, then maintainer $i$ will be induced to increase her contribution effort by $A_{ij} a_j$%. This is the source of productivity and quality externalities: the use of dependency $j$ influences the marginal cost of contribution for maintainer $i$ in her own project. For project quality, the influence of dependencies can operate in different directions, since the relative influence of dependency $j$ on project $i$, $B_{ij}(b_j x_j + \xi_j)$, combines an upstream contribution effect $b_j x_j$ and unobservable fluctuations or uncertainty $\xi_j$.

Finally, these simplifications together imply that equilibrium project quality for project $i$ in Equation (3.15) can also be expressed as follows:

$$y_i^\star = \sum_j B_{ij} \left( b_j \left( \sum_k A_{jk} a_k \right) + \xi_j \right).$$

Notice that equilibrium project quality is therefore simply a function of equilibrium contribution and fluctuations $\xi$.
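As a numerical illustration of Equations (3.14) and (3.15), the following minimal Python sketch solves $x^\star = (I - \alpha G)^{-1} a$ and $y^\star = (I - \beta G)^{-1}(b \circ x^\star + \xi)$ for a hypothetical three-package chain in which package 0 imports package 1 and package 1 imports package 2. All parameter values and function names are assumptions made for illustration; the small Gauss-Jordan routine merely stands in for whatever linear solver one would use in practice, and it is not the estimation procedure of Section 3.6.3.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def solve(M: Matrix, v: Vector) -> Vector:
    """Solve M z = v by Gauss-Jordan elimination with partial pivoting."""
    n = len(v)
    aug = [[float(e) for e in M[r]] + [float(v[r])] for r in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        if abs(aug[p][c]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [e / pivot for e in aug[c]]
        for r in range(n):
            if r != c and aug[r][c] != 0.0:
                f = aug[r][c]
                aug[r] = [e - f * ec for e, ec in zip(aug[r], aug[c])]
    return [aug[r][n] for r in range(n)]


def identity_minus(scale: float, G: Matrix) -> Matrix:
    """Return the matrix I - scale * G."""
    n = len(G)
    return [[(1.0 if r == c else 0.0) - scale * G[r][c] for c in range(n)]
            for r in range(n)]


def equilibrium(G: Matrix, a: Vector, b: Vector, alpha: float,
                beta: float, xi: Vector) -> Tuple[Vector, Vector]:
    """Equilibrium effort x* and quality y*, taking the network G as given."""
    x = solve(identity_minus(alpha, G), a)              # Equation (3.14)
    rhs = [b[i] * x[i] + xi[i] for i in range(len(a))]  # b ∘ x* + xi
    y = solve(identity_minus(beta, G), rhs)             # Equation (3.15)
    return x, y


# Hypothetical chain: G[i][j] = 1 means package i imports package j.
G = [[0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [0.0, 0.0, 0.0]]
x_star, y_star = equilibrium(G, a=[1.0, 1.0, 1.0], b=[2.0, 1.0, 1.0],
                             alpha=0.5, beta=0.2, xi=[0.0, 0.0, 0.0])
```

Because package 2 has no dependencies, its effort equals its own productivity ($x_2^\star = 1$). Package 1 inherits half of that effort ($x_1^\star = 1.5$), and package 0 inherits half of package 1's ($x_0^\star = 1.75$), which is the attenuated indirect influence captured by the Leontief inverse.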
The advantage of this characterization of project quality implied by Equations (3.14) and (3.15) is that a maintainer’s utility over the expected quality of their project, $u_i(x, y, G)$, can now be expressed as simply a function of the network $G$ and parameters. We derive an expression for maintainer utility in the following section.

3.6.2.3 A Maintainer’s Utility over Expected Project Quality

Before discussing optimal dependency formation behavior, we first seek to simplify the project manager’s expected utility of their project $u_i(x, y, G) = E[v_i(y_i)]$ under the dependency graph $G$ and equilibrium contribution $x^\star$. We will use the derived expression in the subsequent section to evaluate maintainer $i$’s expected incremental utility from forming a dependency relationship with project $j$.

Conditional on optimal choices of contribution effort, $x^\star$, the resulting project quality $y^\star$, and the dependency graph $G$, by the normality of $\xi_i$ stipulated by Assumption 5, $u_i(x^\star, y^\star, G)$ becomes

$$u_i(x^\star, y^\star, G) = -\exp\left( -r_i \sum_j B_{ij} \left( b_j x_j^\star - \frac{r_i}{2} B_{ij} \sigma_j^2 \right) \right) = -\exp\left( -r_i \sum_j B_{ij} \left( b_j \left( \sum_k A_{jk} a_k \right) - \frac{r_i}{2} B_{ij} \sigma_j^2 \right) \right). \tag{3.16}$$

Details for this simplification can be found in Appendix 3.C.2. Equation (3.16) makes clear the notion that maintainer preferences are shaped by both (1) their relative degree of risk tolerance $r_i$ and (2) the extent to which upstream dependencies vary with respect to quality, $\sigma_j^2$ for $j \in N$.

3.6.2.4 Optimal dependency formation decision ($G_{ij}$)

In the previous section, we saw how preferences reflecting risk aversion influence a manager’s expected utility of their project in equilibrium. In this section we explore expected incremental utility changes under the formation of new links.
Intuitively, manager $i$ ought only to form a dependency with project $j$ if, conditional on the realization of linking disturbances, their expected utility $u_{ij}^+ \equiv u_i(x^\star, y^\star, G + ij)$ is greater than the utility they expect without using project $j$, $u_{ij}^- \equiv u_i(x^\star, y^\star, G - ij)$.

First, in order to simplify our modeling of the maintainer’s dependency management decision, we make a rather strong assumption that there is no upfront cost to creating or removing dependency relationships. Assumption 7 will change our interpretation of the risk aversion parameter $r_i$ such that it now implicitly subsumes both (1) a level of maintainer risk aversion and (2) the average net benefit project maintainer $i$ derives from importing dependencies. To introduce this assumption, we temporarily reintroduce the notion of time to this exposition.

Assumption 7 (Costless dependency formation). Project maintainers incur a cost of zero to either import a dependency, $G_{ijt}^\star = 0 \mapsto G_{ij,t+1}^\star = 1$, or remove a dependency, $G_{ijt}^\star = 1 \mapsto G_{ij,t+1}^\star = 0$.

With Assumption 7, we can approach the maintainer’s dependency management using a conventional methodology (M. O. Jackson and Wolinsky, 1996). However, we must also address an additional complication arising from our specification of maintainer preferences. The non-linearity of $u_i(\cdot)$ is designed to reflect the fact that maintainers may vary with respect to the amount of risk they are willing to introduce into their own project by importing dependencies. This is a notable departure from much of the literature on strategic network formation, which typically utilizes linear utility specifications to simplify the calculation of incremental changes to utility under different links (Á. De Paula, 2020). To address this complication, we first make an assumption on the way in which stochastic and unobserved utility shocks, realized when either forming or not forming the dependency, enter into this decision.
This assumption will then allow us to adopt a change-of-variable technique suggested by Fosgerau and Bierlaire (2009) in an effort to facilitate easier estimation of parameters within the maintainer’s discrete choice problem, exploiting the fact that by Assumption 4, $u_i(\cdot) = E[v_i(\cdot)] < 0$ over its domain.

Assume that $\varepsilon_{ij}^+$ and $\varepsilon_{ij}^-$ are stochastic and unobserved utility shocks realized by maintainer $i$ from either forming ($\varepsilon_{ij}^+$) or not forming ($\varepsilon_{ij}^-$) the dependency with $j$.

Assumption 8 (Link Formation with Multiplicative Disturbances). Assume $\varepsilon_{ij}^+, \varepsilon_{ij}^- \in (0, +\infty)$ are independent and identically distributed across project pairs and enter the dependency formation problem of the maintainer multiplicatively. Further, assume that Assumption 7 holds. Upon learning a realization for $(\varepsilon_{ij}^+, \varepsilon_{ij}^-)$, project manager $i$ uses project $j$ as a dependency according to the following rule:

$$G_{ij} = 1 \iff u_{ij}^+ \varepsilon_{ij}^+ \geq u_{ij}^- \varepsilon_{ij}^- \tag{3.17}$$

and $G_{ij} = 0$ otherwise.

Following the linearization transformation of Fosgerau and Bierlaire (2009), we can show that under Assumption 8, the equilibrium probability that maintainer $i$ imports project $j$ as a dependency becomes

$$\Pr(G_{ij} = 1) = F_\epsilon\left( r_i \underbrace{\sum_j \Delta B_{ij} b_j \left( \sum_k \Delta A_{jk} a_k \right)}_{Z_{0ij}} - \frac{1}{2} r_i^2 \underbrace{\sum_j \Delta B_{ij}^2 \sigma_j^2}_{Z_{1ij}} ;\ \theta_\epsilon \right), \tag{3.18}$$

where $\theta_\epsilon$ is a vector of parameters for the random variable $\epsilon_{ij} \equiv \epsilon_{ij}^- - \epsilon_{ij}^+ \sim F_\epsilon$. Complete details for this derivation can be found in Appendix 3.C.3. 87 In Equation (3.18), we define $\Delta B_{ij} \equiv B_{ij}^+ - B_{ij}^-$ and $\Delta A_{jk} \equiv A_{jk}^+ - A_{jk}^-$ as the differences between elements of the Leontief inverse matrices for project quality and contribution effects under $G + ij$ and $G - ij$. 88 Intuitively, $\Delta B_{ij}$ and $\Delta A_{ij}$ will reflect the change in exposure to net quality and contribution cost influences that arise when maintainer $i$ imports project $j$ as a dependency.
For the purposes of exposition, we rearrange the transformed utility difference $\tilde{u}_{ij}$ into a quadratic function of $r_i$ and label the coefficients $Z_{0ij}$ and $Z_{1ij}$ in the last equality of Equation (3.18). These coefficients are functions of the network $G$ and parameters $(\alpha, a, \beta, b, \Sigma)$. In Section 3.6.3 and Appendix 3.D, we show that since $(\alpha, a, \beta, b, \Sigma)$ can be estimated using moment conditions contained in Equations (3.9) and (3.14) via generalized method of moments, the result in Equation (3.18) will form the basis for a likelihood function for the remaining unknown parameters, $r$ and $\theta_\epsilon$. Hence, under the assumptions outlined in this section, $r$ and $\theta_\epsilon$ can be estimated via maximum likelihood.

87 Briefly, this simplification works as follows. First, define $\tilde{u}_{ij}^+ \equiv -\lambda \ln(-u_{ij}^+)$ and $\tilde{u}_{ij}^- \equiv -\lambda \ln(-u_{ij}^-)$, where $\lambda > 0$. Second, define $\epsilon_{ij}^+ \equiv -\lambda \ln(\varepsilon_{ij}^+)$ and $\epsilon_{ij}^- \equiv -\lambda \ln(\varepsilon_{ij}^-)$. Third, define $\tilde{u}_{ij} \equiv \tilde{u}_{ij}^+ - \tilde{u}_{ij}^-$ and $\epsilon_{ij} \equiv \epsilon_{ij}^- - \epsilon_{ij}^+ \overset{iid}{\sim} F_\epsilon(z; \theta_\epsilon)$. Ultimately, $\Pr(G_{ij} = 1) = \Pr(\tilde{u}_{ij} \geq \epsilon_{ij}) = F_\epsilon(\tilde{u}_{ij}; \theta_\epsilon / \lambda)$ connects equilibrium linking behavior in Assumption 8 with the empirical likelihood of observing a dependency relationship in Equation (3.18).

88 Let $\Delta B_{ij} \equiv B_{ij}^+ - B_{ij}^-$, where $B^{+ij} = [B_{ij}^+]_{i,j \in N} = (I - \beta(G + ij))^{-1}$ and $B^{-ij} = [B_{ij}^-]_{i,j \in N} = (I - \beta(G - ij))^{-1}$. Similarly, let $\Delta A_{jk} \equiv A_{jk}^+ - A_{jk}^-$, where $A^{+ij} = [A_{jk}^+]_{j,k \in N} = (I - \alpha(G + ij))^{-1}$ and $A^{-ij} = [A_{jk}^-]_{j,k \in N} = (I - \alpha(G - ij))^{-1}$.

Finally, a distributional assumption over $\epsilon_{ij}$ to refine Assumption 8 helps inform comparative statics in Section 3.6.2.5 and the estimation procedure described in Section 3.6.3.

Assumption 9. $\epsilon_{ij}$ is a logistic random variable, independent and identically distributed across potential links and time.

Note that Assumption 9 can be supported if, conditional on Assumption 8, $\epsilon_{ij}^+$ and $\epsilon_{ij}^-$ are assumed to be mutually (i.e., across both project pairs and time) independent Gumbel random variables.
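To illustrate how Equation (3.18) behaves under Assumption 9, the following minimal Python sketch evaluates the linking probability with a standard logistic CDF (location 0 and scale 1, which is an assumption made here for simplicity, since $\theta_\epsilon$ is estimated in the text). The inputs `dB_i`, `dA`, `a`, `b`, and `sigma2` are hypothetical numbers chosen for illustration, and the function names are not taken from the source.

```python
import math
from typing import List, Tuple


def index_coefficients(dB_i: List[float], dA: List[List[float]],
                       a: List[float], b: List[float],
                       sigma2: List[float]) -> Tuple[float, float]:
    """Coefficients (Z0, Z1) of the quadratic in r_i inside Equation (3.18).

    dB_i[j] is the change in B_ij and dA[j][k] the change in A_jk when the
    candidate link is added; sigma2[j] is the quality variance of project j.
    """
    n = len(a)
    z0 = sum(dB_i[j] * b[j] * sum(dA[j][k] * a[k] for k in range(n))
             for j in range(n))
    z1 = sum(dB_i[j] ** 2 * sigma2[j] for j in range(n))
    return z0, z1


def link_probability(r_i: float, z0: float, z1: float) -> float:
    """Pr(G_ij = 1) under a standard logistic disturbance (assumed scale 1)."""
    index = r_i * z0 - 0.5 * r_i ** 2 * z1
    return 1.0 / (1.0 + math.exp(-index))


# Hypothetical two-project inputs.
z0, z1 = index_coefficients(dB_i=[1.0, 0.5], dA=[[0.0, 0.4], [0.0, 0.0]],
                            a=[1.0, 1.0], b=[1.0, 2.0], sigma2=[0.5, 2.0])
p_low = link_probability(0.2, z0, z1)
p_mid = link_probability(0.4, z0, z1)
p_high = link_probability(0.8, z0, z1)
```

With these inputs $Z_{0ij} = 0.4$ and $Z_{1ij} = 1.0$, so the index $r_i Z_{0ij} - \tfrac{1}{2} r_i^2 Z_{1ij}$ peaks at $r_i = Z_{0ij}/Z_{1ij} = 0.4$. The linking probability therefore rises in risk aversion below that value and falls above it.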
3.6.2.5 Comparative Statics

How do various parameters influence dependency formation in equilibrium? Consider the effect of slight perturbations to parameters on the likelihood of dependency formation (Equation (3.18)):

1. Risk Aversion ($r_i$): Using the likelihood that maintainer $i$ imports project $j$, one can show that

$$\frac{\partial \Pr(G_{ij} = 1)}{\partial r_i} \begin{cases} < 0 & \text{if } Z_{ij} < r_i \\ \geq 0 & \text{if } Z_{ij} \geq r_i \end{cases}$$

since $\frac{\partial F_\epsilon}{\partial z} > 0$, $r_i > 0$, and $Z_{ij} \equiv \frac{Z_{0ij}}{Z_{1ij}}$. Therefore, maintainer $i$ becomes less likely to import dependency $j$ as her risk aversion $r_i$ increases beyond a threshold, $Z_{ij}$. We can interpret $Z_{ij}$ as a net benefit threshold for maintainer $i$. If $Z_{ij} < r_i$, then maintainer $i$ is risk averse to the point that she considers the additional risk of depending upon project $j$ to outweigh the net benefits in terms of (dis)utility.

2. Project Quality Variance ($\sigma_k^2$):

$$\frac{\partial \Pr(G_{ij} = 1)}{\partial \sigma_k^2} \leq 0.$$

Notice also that, since we have ruled out risk seeking preferences by restricting $r_i > 0$, increased project volatility will deter all maintainers to some extent. This effect will be stronger for more risk averse maintainers:

$$\frac{\partial^2 \Pr(G_{ij} = 1)}{\partial r_i\, \partial \sigma_k^2} \leq 0.$$

In other words, conditional on our set of assumptions and chosen functional forms, more risk averse maintainers are less likely to rely on dependencies, ceteris paribus, once they are beyond a certain threshold. 89 Similarly, increased volatility in project quality reduces the likelihood of dependency formation. Overall, while intuitive, these comparative statics reveal a level of sophistication embedded in the dependency management choice that confronts the project maintainer. Predicting dependency formation is a function of a variety of different influences. These complexities ought to guide the design and interpretation of simulated counterfactuals.

3.6.3 Estimation

In general, estimating strategic network formation models is complicated.
Two broad classes of approaches deal with estimating network formation when only a single network observation is available: (1) non-iterative estimation, in which link formation is assumed to take place under incomplete information (Leung, 2015; Ridder and Sheng, 2020), and (2) iterative strategic network formation, in which opportunities to form or dissolve links arrive according to some specified sequence (Mele, 2017; Christakis et al., 2020; Badev, 2021; Hsieh, König, and X. Liu, 2022). While our model falls into the latter class, we can further exploit the fact that the dependency network is effectively observed in continuous time. In this case, the complete sequence of linking decisions is known to the econometrician, an advantage not present in empirical settings where only the final equilibrium network is observed. By Assumption 6, we leverage this feature to model the data generating process as a Markov chain: in each period, a single maintainer makes new or revises existing dependency decisions based on the current state of the network. After this dependency revision is made, all agents can subsequently adjust their contribution levels under the new network, and the process repeats with another maintainer’s link decision.

89 In other words, as $r_i \to \infty$.

Our estimation framework is therefore quite similar to the approach described in Snijders, Koskinen, and Schweinberger (2010), as the empirical setting bears a closer resemblance to a network panel and the model falls into a broad class of “stochastic actor-oriented models”. Consequently, our model of the coevolution of both contribution actions and dependency formation decisions is actually much simpler than similar approaches in which only a single network observation is available (Badev, 2021; Hsieh, König, and X. Liu, 2022). In these cases, a high-dimensional state space of potential networks and action profiles must be tracked 90 in order to explain observed equilibria.
We relegate full details and discussion of our structural estimation strategy (i.e., the moment conditions and likelihood function) to Appendix 3.D. In broad strokes, the procedure can be described as follows. The observed data is $D = (x_t, y_t, G_t, W_t)_{t \in T}$. Structural parameters to be estimated are $\theta = (a, \alpha, b, \beta, \Sigma, r, \theta_\epsilon)$. 91 We break estimation into two phases. In the first phase, described in Steps 1 through 3, we use moment conditions and identities from the structural model to recover estimates for $(a, \alpha, b, \beta, \Sigma)$ using the generalized method of moments (GMM). 92 In the second phase, described in Step 4, we combine observed data with the parameter estimates from the first phase to estimate $(r, \theta_\epsilon)$, using equilibrium dependency formation characterized in Equation (3.18) in maximum likelihood estimation (MLE).

90 Badev (2021) and Hsieh, König, and X. Liu (2022) use an approach in which behavior is determined by an exact potential game, the equilibrium of which can be characterized by a Gibbs measure. Both the presence of externalities and the non-linear nature of our structural model would make the use of a similar method quite complicated.

91 While not a part of the discussion of the structural model in the main body of Section 3.6, we introduce the parameter $\gamma$ in Appendix 3.C.

92 We also describe how to use simpler methods such as ordinary least squares (OLS) to recover these parameters in sequence.

3.6.3.1 Dimensionality Reduction

As discussed previously in the description of the empirical sample, we take steps to reduce dimensionality in order to facilitate and simplify structural estimation. First, we limit the size of the empirical sample by beginning with a “seed” of just 10 core projects, sampling upstream, 93 and restricting the set of observed sample moments to only minor versions of package releases.
Second, we restrict the range of the risk aversion parameter $r$ to the half-open interval $(0,1]$, as it is somewhat of a nuisance parameter which is really only of interest in distribution to characterize the DGP. From a computational perspective, we discuss additional simplifications to ease structural estimation in Appendix 3.D.1.

93 As opposed to sampling downstream. The top 10 most depended-upon packages in the NPM ecosystem each feature tens of thousands of downstream dependent packages.

3.7 Counterfactual Analysis

Developing a structural model allows us to completely characterize the data generating process for the decision-making of project maintainers and the evolution of the software dependency network. Importantly, it allows us to explore the effect of counterfactual interventions on the resulting equilibrium of the system. Specifically, we can either (1) perturb parameters or (2) remove certain key projects, re-simulate the data generating process 94 , and analyze the impact of the counterfactual on contribution, project quality, contribution costs, and project maintainer welfare (i.e., utility). To evaluate the impact of each counterfactual, we first must define a social welfare function for the software dependency graph $G$, conditional on the set of projects $N$ and parameters $\theta$:

\[ u_N(G;\theta) = \sum_{i \in N} u_i(G;\theta)^{1/\lambda_i}. \tag{3.19} \]

Notice several notational simplifications. 95 Importantly, we rescale each maintainer’s utility by a factor $\lambda_i$. For the purposes of our counterfactual analysis, we set $\lambda_i = r_i$ for all $i \in N$, effectively normalizing welfare by the risk aversion profile. This is done so that differences in welfare under a given network are not driven by variation in maintainer risk aversion. We also consider simpler aggregate functions for contribution levels, project quality, and contribution costs. For each counterfactual, we specify a change to parameters or the set of projects and then simulate the data generating process under this new system, starting from the beginning of the sample period. 96 Using the aggregate welfare functions, we compare the counterfactual equilibrium for the final sample period 97 with a baseline based on the observed data.

94 That is, we can use the arrival sequence of projects observed in the sample moments and use the equilibrium conditions of the structural model to simulate contribution decisions, project quality evolution, and dependency formation decisions for the sample period from start to finish.
95 Since each $y_i$ is ultimately just a function of $x$, we can write $u_i(x,G;\theta)$ and $u_N(x,G)$. Furthermore, since in equilibrium $x^\star$ is really just a function of the network $G$ and parameters $\theta$, we could even go one step further to write $u_N(G;\theta)$.
96 We take the average of 10 counterfactual simulations.
97 September 2022.

3.7.1 Reducing Fluctuations in Project Quality

Our discussion of structural model comparative statics in Section 3.6 establishes that risk-averse project maintainers are less likely to use packages highly volatile in quality as dependencies. Moreover, both conventional wisdom and empirical evidence suggest developers are reluctant to import dependencies that are either immature or subject to frequent, backwards-incompatible changes (Zerouali et al., 2018). Innovations in software best practices can promote stability in the quality of OSS projects, such as including testing frameworks to ensure intended functionality (Ellims, Bridges, and Ince, 2006), using automation and continuous integration to efficiently and safely integrate contributions from the wider community (Vasilescu et al., 2015; Hilton et al., 2016b), keeping the design scope of the project focused and succinct 98 , and using systems like semantic versioning to release software often but in a manner respectful of downstream dependents (Raemaekers, Deursen, and Visser, 2017; Decan and Mens, 2019). For the purposes of the counterfactual analysis, we specifically consider alternative levels of project quality volatility $\Sigma \mapsto \Sigma'$ for $\Sigma' \in \{0.5\Sigma, 2\Sigma, 4\Sigma\}$.

98 Recall the UNIX philosophy: “Make each program do one thing well.”

3.7.2 Increasing Developer Risk Aversion

While some theoretical (Walsh and Schneider, 2002) and experimental (Kina et al., 2016) analyses of risk aversion for software developers exist, there is comparatively little empirical evidence of how risk tolerance influences project maintainer decision-making and the resulting equilibrium. In our structural model, the likelihood to import dependencies increases with a maintainer’s level of risk tolerance, which has the potential to improve package quality and reduce development costs. On the other hand, increased risk aversion may prevent risky dependency relationships from being formed, albeit at increased contribution costs. In our counterfactuals, we modify the profile of maintainer risk aversion $r \mapsto r'$ for $r' \in \{r + \sigma_r, 1, \min(r)\}$. 99

3.7.3 Key Projects

In the spirit of the “key player analysis” described by Ballester, Calvó-Armengol, and Zenou (2006), Lee et al. (2021), and Hsieh, König, and X. Liu (2022), we define a key software project $i^\star$ such that

\[ i^\star = \arg\max_{i \in N} \; u_N(x,G;\theta) - u_{N \setminus i}(x,G;\theta), \tag{3.20} \]

conditional on a set of parameters $\theta$. We could determine the key project $i^\star$ by iteratively simulating aggregate welfare as in Equation (3.20). However, simulating 1,263 counterfactuals entails a significant computational burden. We therefore opt for a simpler approach of removing the package with the most downstream dependents in the final period observed in the sample.
The package with the most downstream dependents in our sample is babel, a “tool that helps you write in the latest version of JavaScript” (Babel, 2014). 100 Additionally, we estimate a counterfactual equilibrium after removing the top ten packages ranked by number of downstream dependents in the final sample period. For the key player analysis, we compare the welfare of the remaining packages under the baseline with their outcomes in a world in which the critical set of packages is removed. In this way, we are estimating the value of externalities the key packages generate for the software network as a whole.

99 That is, when risk aversion for each project maintainer is increased by one standard deviation, set equal to 1, and set equal to the minimum value estimated from the sample.
100 As of October 2022, the babel package has over 15,450 commits, 1,033 distinct contributors, 5,500 forks, and 41,000 stars on GitHub.

3.7.4 Summary

We present the results of the counterfactual analysis in Table 3.B.5. Overall, the impact of our counterfactuals is comparatively small in percentage terms. This result is largely driven by the fact that none of our counterfactuals significantly alters the network formation path. We argue this pattern arises because individual project features largely drive project management decisions, which are in turn robust to marginal perturbations to project quality volatility, risk aversion, and the removal of core packages. 101 In particular, contribution is virtually unchanged under different levels of risk aversion. On the other hand, the results in Table 3.B.5 seem to indicate that significantly increasing maintainer risk aversion ($r'_i = 1$) can increase aggregate package quality (2.11%) enough to effectively offset the direct welfare effects 102 (−0.04%). In other words, upstream maintainers can create value for downstream dependents by exercising more discipline when choosing what software to rely on, which in turn offsets the increased disutility from quality uncertainty in aggregate. 103

101 We should also be forthcoming that this lack of influence on the data generating process may arise because our model largely treats packages as perfect substitutes. We could enrich further analysis with better information on each package’s particular functionality.
102 $\partial u_i / \partial r_i < 0$ for $r_i > 0$.
103 One could also interpret this as shifting the burden of risk from dependents to upstream maintainers.

The effect of changing package volatility is slightly more puzzling. Reducing volatility actually produces a small increase in aggregate project quality (0.19%). Increasing volatility has a larger positive influence on aggregate quality. There is virtually no change to contribution patterns or welfare. We argue this is a result of strong package fixed characteristics: maintainers seem less concerned with uncertainty over package unobservables compared with their immediately appreciable benefits.

Finally, our counterfactuals that remove key packages reveal the most interesting results. Core packages with the highest Katz-Bonacich centrality measures create significant value for the dependency network: removing the top 10 packages, which number less than 0.8% of the sample, reduces aggregate package quality by 5.73% for their remaining peers. Moreover, aggregate contribution falls by 1.3%, suggesting that maintainers find contribution in their own packages complementary with these upstream core packages.

3.8 Discussion

In an effort to understand the dynamics and value created by software dependency networks, we have studied micro-founded decision-making from the perspective of the maintainer of an OSS project. We have developed both reduced form and structural methodologies and brought them to bear on an empirical sample of 1,263 Node.js packages observed over time.
Overall, we find that individual project features are largely responsible for driving maintainer decisions. In our reduced form approach, we find that upstream projects have a relatively limited effect on downstream quality and contribution levels on average. Complementarity between upstream and downstream contribution was greatest in the earlier periods of the dependency network. Our structural approach, to the best of our knowledge, is the first attempt to micro-found the cost minimization decision of a risk-averse software maintainer within a strategic network formation model. In doing so, we can characterize aggregate network evolution as the aggregation of individual decision-making over time. Our counterfactuals reveal that while the network formation process is relatively robust to perturbing individual parameters like maintainer risk aversion and project quality volatility, removing highly critical core dependencies can have outsized influences on package quality downstream.

A better understanding of software dependency formation and its impact on downstream users concerns a broad population of stakeholders and has even garnered attention at the public policy level (Executive Order 14028, 2021). Our study develops a framework that would clearly benefit from extension and further inquiry. In particular, while we have endeavored to cast welfare effects of dependency networks in terms of production costs, estimating the consumption value of OSS remains an ongoing challenge. Additional research is needed to connect the implications of software dependency management to value created with respect to labor markets, firm profitability, and innovation.

Finally, while we have attempted to characterize the salient features of this setting in our framework, our results seem to indicate that individual project features, as well as features of their maintainers, are critically important to understand welfare effects in detail.
A major innovation in this line of inquiry would be to integrate individual characteristics of maintainers themselves. These characteristics have historically been difficult to observe in aggregate, but recent efforts are underway to collect more refined information about OSS collaboration communities (CHAOSS, 2017; Dueñas et al., 2021).

Appendices

3.A Figures

Figure 3.A.1: Empirical Node.js Dependency Network Sample (September 2022 snapshot)

Figure 3.A.2: Empirical Dependency Network Sample (growth over time). [Plot of the count of packages and dependencies by year, 2012 to 2022.]

Figure 3.A.3: Empirical Dependency Network Sample (relationship between Katz-Bonacich and Betweenness centrality at the project level). [Plot of Katz-Bonacich centrality and betweenness centrality, by year, 2016 to 2022.]

Figure 3.A.4: Project-level Heterogeneity in Reduced Form Estimates for Equation (3.5) and Equation (3.6). [Relative frequency plot of the estimates.]

Figure 3.A.5: Temporal Heterogeneity in Reduced Form Estimates for Equation (3.5) and Equation (3.6). [Plot of the estimates by year, 2010 to 2022.]

Figure 3.A.6: Comparative Statics – Probability of dependency formation, $\Pr(G_{ij} = 1)$, as risk aversion, $r_i$, varies. Panels: (a) $\Pr(G_{ij} = 1)$ as $\beta$ varies; (b) $\Pr(G_{ij} = 1)$ as $\sigma_j^2$ varies.

3.B Tables

Table 3.B.1: Empirical Node.js Dependency Network Sample Descriptive Statistics

Notation | Measure | Obs. | Mean | SD | Min | Median | Max
$x_{it}$ | Project Contribution: Cumulative total of commits to project i at time t. | 206,598 | 2,703 | 6,573 | 1 | 195 | 81,641
$\sum_{j \neq i} G_{ijt} x_{jt}$ | Dependency Contribution: Cumulative total of commits to project i’s dependencies. | 206,598 | 1,451 | 19,427 | 0 | 0 | 1,098,604
$y_{it}$ | Project Quality: Sum of (log) complexity and cumulative contributors (log), scaled to [0,1]. | 206,598 | 0.363 | 0.199 | 0 | 0.334 | 1
$\sum_{j \neq i} G_{ijt} y_{jt}$ | Dependency Quality: Sum of quality for project i’s dependencies. | 206,598 | 0.225 | 1.236 | 0 | 0 | 55.2
$\omega_{it}$ | Time Allocation: Cumulative project labor hours. | 206,598 | 2,909 | 7,571 | 2 | 202 | 105,504
$W_{it}$ | Contributors: Cumulative number of contributors. | 206,598 | 233 | 983 | 1 | 20 | 17,195
$W_{it}$ | Core Contributors: Smallest number of cumulative contributors with ≥ 80% of total contribution. | 206,598 | 20 | 233 | 1 | 1 | 4,421
$W_{it}$ | Bus Factor: Ratio of core contributors to total contributors. | 206,598 | 0.209 | 0.260 | 0.003 | 0.121 | 1
$W_{it}$ | Files: Number of files in codebase. | 206,598 | 2,692 | 7,109 | 1 | 26 | 65,724
$W_{it}$ | SLOC: Single lines of code in codebase. | 206,598 | 129,556 | 326,412 | 2 | 2,717 | 2,974,829
$W_{it}$ | Modularity: Ratio of file count to SLOC. | 206,598 | 222 | 2,528 | 1.87 | 50.5 | 74,349
$W_{it}$ | Documentation: Ratio of commented lines to SLOC. | 206,598 | 0.110 | 0.172 | 0 | 0.046 | 3.18
$W_{it}$ | Number of languages: Count of distinct programming languages. | 206,598 | 7.87 | 3.34 | 1 | 7 | 29
$W_{it}$ | Project Age: Days since first commit. | 206,598 | 1,240 | 987 | 0 | 1,019 | 4,278
$d^{out}_{it} \equiv \sum_{j \neq i} G_{ijt}$ | Upstream Dependencies: Number of external project dependencies project i declares. | 161,211 | 2 | 3.76 | 0 | 1 | 74
$d^{in}_{it} \equiv \sum_{j \neq i} G_{jit}$ | Downstream Dependents: Number of external projects that depend on project i. | 161,211 | 5 | 7.94 | 0 | 2 | 66

Table 3.B.2: Reduced Form – Effect of Upstream Dependencies on Project Contribution

Dependent variable: Project Commits. Specifications: (1) (2) (3) (4) (5) (6) (7) (8). Each row lists coefficient estimates, with standard errors in parentheses, in the order they appear across the columns.

Constant: 2,653*** (14.51); -68.19*** (9.744)
Dependency Commits: 0.034*** (0.002); 0.000*** (0.000); 0.003*** (0.001); 0.007*** (0.001); 0.002 (0.001); 0.000* (0.000); 0.000 (0.000); -0.000 (0.000)
Project Quality: 156.2*** (22.26); 959.5 (518.5); 126.9*** (19.57); 790.1*** (322.0)
# Contributors: 0.101*** (0.017); 0.546* (0.276); 0.211*** (0.034); 1.542* (0.742)
Bus Factor: 18.96*** (3.851); 68.50* (32.56); 7.401* (2.928); 6.286 (21.80)
SLOC: 0.000*** (0.000); 0.000 (0.000); 0.000*** (0.000); 0.000 (0.000)
Documentation: -6.203 (3.201); 108.5 (67.99); 9.413*** (3.578); 136.5* (61.60)
Modularity: -0.001*** (0.000); -0.001 (0.001); -0.001*** (0.000); -0.002 (0.001)
# Languages: 6.423*** (0.996); 7.306 (11.41); 5.836*** (0.973); 2.513 (11.57)
Age: -0.012*** (0.002); -0.002 (0.026); -0.017*** (0.004); -3.617 (2.415)
Controls: ✓ ✓ ✓ ✓
Project FE: ✓ ✓ ✓ ✓
Time FE: ✓ ✓ ✓ ✓
R²: 0.010; 0.999; 0.964; 0.674; 0.988; 0.999; 0.999; 0.999
Observations: 206,598; 202,905; 206,575; 201,336; 201,308; 202,869; 196,930; 196,894

Note: This table contains coefficient estimates from ordinary least squares (OLS) and fixed effect (FE) estimates for the reduced form relationship between project contribution and upstream dependency contribution in Equation (3.5). Specification variations are reported in columns. Standard errors, heteroskedasticity-robust for OLS and clustered by project for FE models, are reported in parentheses. Additional covariate controls not reported in the table include 3 lags each of project commits and dependency commits and the square of project age. All terms rounded to four significant figures. Statistical significance indicators: *** ⇒ p ≤ 0.001, ** ⇒ p ≤ 0.01, and * ⇒ p ≤ 0.1, where p is the p-value for the coefficient estimate.
Table 3.B.3: Reduced Form Equation (3.6) – Effect of Upstream Dependencies on Project Quality

Dependent variable: Project Quality. Specifications: (1) (2) (3) (4) (5) (6) (7) (8). Each row lists coefficient estimates, with standard errors in parentheses, in the order they appear across the columns.

Constant: 0.302*** (0.001); 0.002*** (0.049)
Dependency Quality: 0.017*** (0.001); 0.000 (0.000); 0.004*** (0.002); 0.009*** (0.001); 0.002 (0.001); 0.001 (0.001); 0.000** (0.000); 0.000 (0.001)
Project Commits: 0.000*** (0.000); 0.000 (0.000); 0.000* (0.000); 0.000*** (0.000); 0.000*** (0.000); 0.000 (0.063); 0.000 (0.047); 0.000 (0.052)
# Contributors: 0.000 (0.000); -0.000 (0.000); 0.000 (0.000); 0.000 (0.002)
Bus Factor: -0.003*** (0.000); -0.007*** (0.001); -0.003*** (0.000); -0.007*** (0.001)
SLOC: -0.000 (0.000); 0.000 (0.000); -0.000 (0.000); -0.000 (0.001)
Documentation: 0.001*** (0.000); 0.002*** (0.000); 0.001*** (0.000); 0.003 (0.003)
Modularity: 0.000 (0.000); 0.000 (0.000); 0.000 (0.000); 0.000 (0.000)
# Languages: 0.000*** (0.000); 0.002*** (0.000); 0.000*** (0.000); 0.002 (0.002)
Age: 0.000 (0.000); 0.000 (0.000); 0.000 (0.000); 0.000 (0.000)
Controls: ✓ ✓ ✓ ✓
Project FE: ✓ ✓ ✓ ✓
Time FE: ✓ ✓ ✓ ✓
R²: 0.514; 0.999; 0.966; 0.764; 0.987; 0.999; 0.999; 0.999
Observations: 206,598; 202,905; 206,575; 201,336; 201,308; 202,869; 196,930; 196,894

Note: This table contains coefficient estimates from ordinary least squares (OLS) and fixed effect (FE) estimates for the reduced form relationship between project quality and upstream dependency quality in Equation (3.6). Specification variations are reported in columns. Standard errors, heteroskedasticity-robust for OLS and clustered by project for FE models, are reported in parentheses. Additional covariate controls not reported in the table include 3 lags each of project quality and dependency quality and the square of project age. All terms rounded to four significant figures. Statistical significance indicators: *** ⇒ p ≤ 0.001, ** ⇒ p ≤ 0.01, and * ⇒ p ≤ 0.1, where p is the p-value for the coefficient estimate.
Table 3.B.4: Reduced Form Equations (3.7) and (3.8) – Effect of Project Features on Dependency Formation

Dependent variables: # of Upstream Dependencies in columns (1) through (4); # of Downstream Dependents in columns (5) through (8). Each row lists coefficient estimates, with standard errors in parentheses, in the order they appear across the columns.

Constant: 0.720*** (0.121); 4.720*** (0.147)
Project Commits: -0.000 (0.027); -0.000 (2.493); 0.000 (0.126); -0.000 (0.000); 0.001 (0.038); 0.000 (0.001); 0.001 (0.210); 0.001 (0.001)
Upstream Commits: -0.000 (0.001); -0.000 (0.205); -0.000 (0.028); -0.000 (0.000); 0.000 (0.017); 0.000** (0.000); 0.000 (0.038); 0.000** (0.000)
Project Quality: 3.073*** (0.029); 4.898*** (0.255); 5.180*** (0.002); 4.306* (1.986); 3.935*** (0.098); 7.444 (4.481); 5.850*** (0.044); 5.350 (3.444)
Upstream Quality: 3.000*** (0.005); 1.227*** (0.037); 2.818*** (0.030); 1.176*** (0.222); -1.683*** (0.012); -0.313*** (0.089); -1.585*** (0.035); -0.327*** (0.080)
# of Contributors: -0.000 (0.001); 0.000 (0.019); -0.001 (0.000); 0.002 (0.001); -0.009*** (0.003); -0.002 (0.003); -0.008 (0.005); -0.004 (0.004)
Bus Factor: -0.181*** (0.010); 0.078*** (0.003); -0.227*** (0.000); 0.215 (0.243); -1.098*** (0.018); 0.595 (0.667); -0.787*** (0.001); 1.040* (0.526)
SLOC: -0.000 (0.004); -0.000 (0.037); -0.000 (0.005); 0.000 (0.000); -0.000 (0.016); -0.000 (0.000); -0.000 (0.001); -0.000* (0.000)
Documentation: -0.941*** (0.058); -0.979*** (0.042); -1.170*** (0.001); -0.873 (0.544); 4.325*** (0.124); 2.607* (1.047); 4.640*** (0.005); 2.513*** (0.697)
Modularity: -0.000 (0.005); -0.000 (0.010); -0.000 (0.002); -0.000 (0.000); 0.000 (0.012); 0.000 (0.000); -0.000 (0.009); 0.000 (0.000)
# of Languages: -0.055*** (0.002); -0.036 (0.693); -0.061 (0.063); 0.002 (0.001); -0.575*** (0.005); -0.272 (0.217); -0.427** (0.156); -0.280 (0.166)
Age: -0.000 (0.002); -0.000 (0.004); -0.000 (0.001); -0.000 (0.000); 0.002 (0.004); 0.002*** (0.000); 0.004 (0.003); -0.056 (0.067)
Project FE: ✓ ✓ ✓ ✓
Time FE: ✓ ✓ ✓ ✓
R²: 0.535; 0.875; 0.622; 0.899; 0.143; 0.946; 0.384; 0.965
Observations: 161,211; 161,183; 161,211; 161,175; 161,211; 161,183; 161,211; 161,275

Note: Columns (1) through (4) report coefficient estimates for the specification in Equation (3.7), a regression of the number of upstream dependencies for a given project (i.e., out-degree) on observable project characteristics. Columns (5) through (8) report coefficient estimates for the specification in Equation (3.8), a regression of the number of downstream dependents for a given project (i.e., in-degree) on observable project characteristics.

Table 3.B.5: Counterfactual Analysis

Counterfactual | Details | Welfare (∆%) | Project Quality (∆%) | Contribution (∆%) | Costs (∆%)
$r \to r'$ | Minimize risk aversion ($r'_i = \min(r_i)\ \forall i \in N$) | 0.22 | 0.40 | 0.00 | 0.00
$r \to r'$ | Increase risk aversion ($r'_i = r_i + \sigma_r\ \forall i \in N$) | -0.07 | -0.09 | 0.00 | 0.00
$r \to r'$ | Maximize risk aversion ($r'_i = 1\ \forall i \in N$) | -0.04 | 2.11 | 0.00 | 0.00
$\Sigma \to \Sigma'$ | Reduce quality volatility ($\Sigma' = 0.5\Sigma$) | 0.00 | 0.19 | 0.00 | 0.00
$\Sigma \to \Sigma'$ | Increase quality volatility ($\Sigma' = 2\Sigma$) | 0.00 | 1.51 | 0.00 | 0.00
$\Sigma \to \Sigma'$ | Increase quality volatility ($\Sigma' = 4\Sigma$) | 0.05 | 1.14 | 0.00 | 0.00
Key Package Analysis | Remove top package | 0.00 | 0.03 | 0.00 | 0.00
Key Package Analysis | Remove top 10 packages | -0.07 | -5.73 | -1.30 | -0.87

3.C Mathematical Details

3.C.1 Alternative Representations for the Maintainer’s Problem

We note that the maintainer’s cost minimization problem presented in Equation 3.13 can have alternative representations that, under certain conditions, deliver an equivalent set of equilibria. Alternative representations can be helpful during estimation. First, the cost minimization problem of Equation 3.13 is presented here as P1, where $\underline{u}_i$ denotes the minimum utility threshold:

\[ \min_{x_i \geq 0,\; y_i,\; \{G_{ij}\}_{j \neq i}} c_i(x,G) \quad \text{s.t.} \quad u_i(x,y,G) \geq \underline{u}_i \tag{P1} \]

Alternatively, a project maintainer could equivalently be modelled as maximizing expected project quality subject to a cost constraint:

\[ \max_{x_i \geq 0,\; y_i,\; \{G_{ij}\}_{j \neq i}} u_i(x,y,G) \quad \text{s.t.} \quad c_i(x,G) \leq \omega_i \tag{P2} \]

Notice that by assumptions over $u_i$ and $c_i$, both P1 and P2 are convex problems in $x_i$. Finally, the decision problem can be reframed to make it more amenable to established network formation estimation procedures (Hsieh, König, and X. Liu, 2022).
Consider P3, in which contribution cost is embedded into the maintainer’s utility function:

\[ \max_{x_i \geq 0,\; y_i,\; \{G_{ij}\}_{j \neq i}} E\left[ v_i\big(\gamma y_i(x,G) - c_i(x,G)\big) \right] \tag{P3} \]

Note that the parameter $\gamma$ can be interpreted as converting project quality into the value the maintainer places on the project in terms of time saved for some computing task. Hence, $\gamma y_i - c_i$ is the net value of project $i$ measured in hours. The following proposition establishes an equivalence between the solutions of P1, P2, and P3.

Proposition 1 (Equivalence between P1, P2, and P3). Assume $x_i^\star > 0$.
1. The solutions to P1 and P3 coincide if the minimum project quality threshold binds: $u_i(x,y,G) = \underline{u}_i$.
2. The solutions to P2 and P3 coincide if the contribution cost constraint binds: $c_i(x,G) = \omega_i$.

Proof. The Lagrangian for P1 is

\[ L_i(x_i, G; \lambda_i) = c_i(x,G) + \lambda_i \big( \underline{u}_i - u_i(x,G) \big). \]

First Order Necessary Conditions (FONCs) for P1:

\[
\begin{aligned}
\frac{\partial c_i}{\partial x_i} - \lambda_i \frac{\partial u_i}{\partial x_i} &\geq 0, &
x_i \left( \frac{\partial c_i}{\partial x_i} - \lambda_i \frac{\partial u_i}{\partial x_i} \right) &= 0, &
x_i &\geq 0, \\
u_i(x,G) &\geq \underline{u}_i, &
\lambda_i \big( \underline{u}_i - u_i(x,G) \big) &= 0, &
\lambda_i &\geq 0.
\end{aligned}
\tag{3.21}
\]

The Lagrangian for P2, written in minimization form, is

\[ L_i(x_i, G; \mu_i) = -u_i(x_i,G) + \mu_i \big( c_i(x_i,G) - \omega_i \big). \]

First Order Necessary Conditions (FONCs) for P2:

\[
\begin{aligned}
-\frac{\partial u_i}{\partial x_i} + \mu_i \frac{\partial c_i}{\partial x_i} &\geq 0, &
x_i \left( -\frac{\partial u_i}{\partial x_i} + \mu_i \frac{\partial c_i}{\partial x_i} \right) &= 0, &
x_i &\geq 0, \\
c_i(x_i,G) &\leq \omega_i, &
\mu_i \big( c_i(x_i,G) - \omega_i \big) &= 0, &
\mu_i &\geq 0.
\end{aligned}
\tag{3.22}
\]

Finally, the FONCs for P3:

\[
-\gamma \frac{\partial y_i}{\partial x_i} + \frac{\partial c_i}{\partial x_i} \geq 0, \qquad
x_i \left( -\gamma \frac{\partial y_i}{\partial x_i} + \frac{\partial c_i}{\partial x_i} \right) = 0, \qquad
x_i \geq 0.
\tag{3.23}
\]

If $x_i > 0$, then $\frac{\partial c_i}{\partial x_i} = \lambda_i \frac{\partial u_i}{\partial x_i}$, $\mu_i \frac{\partial c_i}{\partial x_i} = \frac{\partial u_i}{\partial x_i}$, and $\frac{\partial c_i}{\partial x_i} = \gamma \frac{\partial y_i}{\partial x_i}$ for P1, P2, and P3 respectively.

Case 1: Suppose the minimum utility threshold constraint from P1 binds. Then $\lambda_i > 0$, and since $\frac{\partial u_i}{\partial x_i} > 0$ under the exponential utility Assumption 4, we have $\frac{\partial c_i}{\partial x_i} > 0$. Further assuming that $\gamma \frac{\partial y_i}{\partial x_i} > 0$, the equilibrium solutions to P1 and P3 coincide if and only if, for each $x_i^\star$,

\[ \frac{\partial c_i}{\partial x_i} = \lambda_i \frac{\partial u_i}{\partial x_i} = \gamma \frac{\partial y_i}{\partial x_i} > 0. \]

Case 2: Suppose the contribution cost constraint in P2 binds. Then $\mu_i > 0$. Similar to the argument in Case 1, the equilibria of P2 and P3 coincide if and only if, for each $i \in N$,

\[ \frac{\partial c_i}{\partial x_i} = \frac{1}{\mu_i} \frac{\partial u_i}{\partial x_i} = \gamma \frac{\partial y_i}{\partial x_i} > 0. \]

■

Proposition 1 implies that in the edge case where both the cost and minimum quality constraints bind, all representations of the maintainer’s problem result in the same solution. It also shows how the Lagrange multipliers of P1 and P2 correspond to the parameter $\gamma$ in P3. We use the result of this proposition along with its assumptions to simplify estimation.

Remark (Equilibrium Contribution in Equation 3.14). The exposition in Proposition 1, although tedious and perhaps a bit excessive, makes clear the relationship between observed equilibria (e.g., $x_i > 0$ or $c_i(x,G) = \omega_i$) and the equilibrium conditions implied by each of the FONCs of P1, P2, and P3. The parameter $a_i$ that one would obtain from estimating Equation 3.14, representing maintainer $i$’s intrinsic marginal cost of contribution, does not exactly equal $a_i$ from Equation 3.10. Instead, for the purposes of estimation in Section 3.6.3, we assume that the cost constraint binds. Then by Proposition 1, we replace $a_i$ in Equation 3.14 with $\tilde{a}_i = a_i + \gamma b_i$. This should have no effect on the exposition in Section 3.6.

3.C.2 Expected Project Quality

Under Assumption 5, maintainer utility becomes

\[
\begin{aligned}
u_i(x^\star, y^\star, G) &= E\left[ v_i(y_i^\star) \right] \\
&= E\left[ -\exp\left( -r_i \sum_j B_{ij} \big( b_j x_j^\star + \xi_j \big) \right) \right] \\
&= -\exp\left( -r_i \sum_j B_{ij} b_j x_j^\star \right) E\left[ \exp\left( -r_i \sum_j B_{ij} \xi_j \right) \right] \\
&= -\exp\left( -r_i \sum_j B_{ij} b_j x_j^\star \right) \exp\left( \frac{r_i^2}{2} \sum_j B_{ij}^2 \sigma_j^2 \right) \\
&= -\exp\left( -r_i \sum_j B_{ij} \left( b_j x_j^\star - \frac{r_i}{2} B_{ij} \sigma_j^2 \right) \right)
\end{aligned}
\tag{3.24}
\]

The third line of Equation (3.24) follows from Assumption 5 on maintainer information sets. By the normality of $\xi$ established by Assumption 5, the fourth line of Equation (3.24) uses the moment generating function for a linear combination of the normally distributed random vector $\xi$.
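As an illustration of the moment generating function step in Equation (3.24), the following minimal Python sketch compares the closed form with a simulated average. It is not the dissertation's estimation code: the function names and the numerical inputs are hypothetical, and it assumes the shocks $\xi_j$ are independent across projects with variances $\sigma_j^2$ (i.e., a diagonal $\Sigma$).

```python
import math
import random


def expected_cara_utility(r_i, B_row, b, x_star, sigma2):
    """Closed form of Equation (3.24): -exp(-r * mean + 0.5 * r^2 * variance)."""
    mean_term = sum(B * bj * xj for B, bj, xj in zip(B_row, b, x_star))
    var_term = sum(B ** 2 * s2 for B, s2 in zip(B_row, sigma2))
    return -math.exp(-r_i * mean_term + 0.5 * r_i ** 2 * var_term)


def simulated_cara_utility(r_i, B_row, b, x_star, sigma2, draws=20_000, seed=0):
    """Monte Carlo average of -exp(-r * y) with independent normal shocks (an assumption)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        y = sum(
            B * (bj * xj + rng.gauss(0.0, math.sqrt(s2)))
            for B, bj, xj, s2 in zip(B_row, b, x_star, sigma2)
        )
        total += -math.exp(-r_i * y)
    return total / draws


if __name__ == "__main__":
    # Hypothetical values for one maintainer i and two projects j.
    args = (0.5, [1.0, 0.5], [0.2, 0.1], [3.0, 4.0], [0.04, 0.09])
    print(expected_cara_utility(*args), simulated_cara_utility(*args))
```

The two printed numbers should be close, which is the content of the fourth line of Equation (3.24): higher volatility $\sigma_j^2$ lowers expected utility through the $\tfrac{1}{2} r_i^2 B_{ij}^2 \sigma_j^2$ term.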
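3.C.3 Optimal Dependency Formation

Following Fosgerau and Bierlaire (2009), let $\epsilon^+_{ij} = -\lambda \ln(\varepsilon^+_{ij})$ and $\epsilon^-_{ij} = -\lambda \ln(\varepsilon^-_{ij})$ where $\lambda > 0$. Under Assumption 8, we can show

\[
\begin{aligned}
\Pr(G_{ij} = 1) &= \Pr\left( u^+_{ij} \varepsilon^+_{ij} \geq u^-_{ij} \varepsilon^-_{ij} \right) \\
&= \Pr\left( -\ln(-u^+_{ij}) - \ln(\varepsilon^+_{ij}) \geq -\ln(-u^-_{ij}) - \ln(\varepsilon^-_{ij}) \right) \\
&= \Pr\left( \tilde{u}^+_{ij} + \epsilon^+_{ij} \geq \tilde{u}^-_{ij} + \epsilon^-_{ij} \right) \\
&= \Pr\left( \tilde{u}_{ij} \geq \epsilon_{ij} \right)
\end{aligned}
\tag{3.25}
\]

where, using a tilde to mark the log-transformed utilities, $(\tilde{u}^+_{ij}, \tilde{u}^-_{ij}) \equiv (-\lambda \ln(-u^+_{ij}), -\lambda \ln(-u^-_{ij}))$, $\tilde{u}_{ij} \equiv \tilde{u}^+_{ij} - \tilde{u}^-_{ij}$, and $\epsilon_{ij} \equiv \epsilon^-_{ij} - \epsilon^+_{ij}$. The advantage of this approach is that the equilibrium link formation decision can now be represented as a random utility with additive disturbances that is linear-quadratic in the parameter of interest, $r_i$. If we substitute the expression for expected project quality under both $G{+}ij$ and $G{-}ij$ from Equation (3.24) to form $u^+_{ij}$ and $u^-_{ij}$, the likelihood that maintainer $i$ imports project $j$ in Equation (3.25) becomes

\[
\begin{aligned}
\Pr(G_{ij} = 1) &= \Pr\left( -\ln E\left[ \exp\left( -r_i y^+_{ij} \right) \right] + \ln E\left[ \exp\left( -r_i y^-_{ij} \right) \right] \geq \epsilon_{ij} / \lambda \right) \\
&= \Pr\left( r_i \sum_j \Delta B_{ij} \left( b_j \left( \sum_k \Delta A_{jk} a_k \right) - \frac{r_i}{2} \Delta B_{ij} \sigma_j^2 \right) \geq \epsilon_{ij} / \lambda \right) \\
&= F_\epsilon\Bigg( r_i \underbrace{\sum_j \Delta B_{ij} b_j \left( \sum_k \Delta A_{jk} a_k \right)}_{Z_{0ij}} - \frac{1}{2} r_i^2 \underbrace{\sum_j \Delta B_{ij}^2 \sigma_j^2}_{Z_{1ij}} \,;\, \theta_\epsilon \Bigg)
\end{aligned}
\tag{3.26}
\]

where $y^+_{ij} \equiv y_i(x, y_{-i}, G{+}ij)$ and $y^-_{ij} \equiv y_i(x, y_{-i}, G{-}ij)$. Finally, we define $\Delta B_{ij} \equiv B^+_{ij} - B^-_{ij}$ as the difference between elements of the Leontief inverse matrices for project quality under $G{+}ij$ and $G{-}ij$: $\mathbf{B}^{+ij} = [B^+_{ij}]_{i,j \in N} = (I - \beta (G{+}ij))^{-1}$ and $\mathbf{B}^{-ij} = [B^-_{ij}]_{i,j \in N} = (I - \beta (G{-}ij))^{-1}$. Analogously, we define $\Delta A_{jk} = A^+_{jk} - A^-_{jk}$ for the Leontief inverse matrices for project contribution: $\mathbf{A}^{+ij} = [A^+_{jk}]_{j,k \in N} = (I - \alpha (G{+}ij))^{-1}$ and $\mathbf{A}^{-ij} = [A^-_{jk}]_{j,k \in N} = (I - \alpha (G{-}ij))^{-1}$. For notational convenience, in the last equality of Equation (3.26), we rearrange $\tilde{u}_{ij}$ into a quadratic function of $r_i$ and label the coefficients $Z_{0ij}$ and $Z_{1ij}$. These coefficients are functions of the network $G$ and the parameters $(\alpha, a, \beta, \Sigma)$.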
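To make the Leontief inverse differences concrete, here is a small pure-Python sketch that approximates $(I - \alpha G)^{-1}$ by the truncated series $I + \sum_{k=1}^{K} \alpha^k G^k$ (the truncation with $K = 5$ is the one Appendix 3.D.1 describes) and then differences the result with and without a proposed link. The function names are hypothetical, and the convention that `G[i][j] = 1` means project `i` depends on project `j` is an assumption made for this illustration, not a statement about the author's implementation.

```python
def mat_mul(P, Q):
    """Multiply two square matrices stored as lists of rows."""
    n = len(P)
    return [[sum(P[i][k] * Q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def leontief_neumann(G, coef, K=5):
    """Approximate (I - coef * G)^{-1} by I + sum_{k=1..K} coef^k G^k."""
    n = len(G)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    total = [row[:] for row in identity]
    term = [row[:] for row in identity]  # holds coef^k G^k, starting at k = 0
    for _ in range(K):
        term = [[coef * v for v in row] for row in mat_mul(term, G)]
        total = [[total[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return total


def delta_leontief(G, i, j, coef, K=5):
    """Entrywise difference between the inverses with and without link i -> j."""
    G_plus = [row[:] for row in G]
    G_minus = [row[:] for row in G]
    G_plus[i][j] = 1.0
    G_minus[i][j] = 0.0
    plus = leontief_neumann(G_plus, coef, K)
    minus = leontief_neumann(G_minus, coef, K)
    n = len(G)
    return [[plus[a][b] - minus[a][b] for b in range(n)] for a in range(n)]


if __name__ == "__main__":
    # Hypothetical 3-project chain: 0 depends on 1, and 1 depends on 2.
    chain = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
    print(leontief_neumann(chain, 0.5))
    print(delta_leontief(chain, 0, 2, 0.5))
```

For an acyclic dependency graph the series terminates once $k$ exceeds the longest dependency path, so the truncation is exact in that case; with cycles it is only an approximation whose quality depends on $|\alpha|$ and $K$.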
In Section 3.6.3 and Appendix 3.D, we show that since $(\alpha, a, \beta, \Sigma)$ can be estimated using moment conditions (3.9) and (3.14) via GMM, the result in Equation (3.26) will form the basis for a likelihood function for the remaining unknown parameters, $r$ and $\theta_\epsilon$. Hence, under the assumptions outlined in the body of the paper, $r$ and $\theta_\epsilon$ can be estimated via maximum likelihood.

We will assume that $\epsilon_{ij}$ is a logistic random variable, independent and identically distributed across potential links. This can arise if $\epsilon^+_{ij}$ and $\epsilon^-_{ij}$ are independent Gumbel random variables. Therefore, $F_\epsilon(\cdot)$ is the logistic function, which has the convenient property that $F'_\epsilon(\cdot) = F_\epsilon(\cdot)(1 - F_\epsilon(\cdot))$. Furthermore, $\theta_\epsilon$ is a vector of two parameters for a logistic distribution (i.e., location and scale). We make no specific assumption on the value of the free parameter $\lambda$ other than $\lambda > 0$. Therefore, the value of $\lambda$ will simply influence the estimated scale parameter for the distribution of $\epsilon_{ij}$.

3.D Estimation Details

The data are $D = (x_t, y_t, G_t, W_t)_{t \in T}$. The parameters are $\theta = (a, \alpha, b, \beta, \Sigma, r, \gamma, \theta_\epsilon)$.

1. Estimate $b, \beta$ given $D$ using the project quality specification in Equation (3.9) in separate OLS regressions for each project $i \in N$:

\[ y_{it} = b_i x_{it} + \beta \sum_{j \neq i} G_{ijt} y_{jt} + \tilde{\xi}_{it} \]

where $\tilde{\xi}_{it} = \delta' W_{ijt} + \xi_{it}$ captures other influences, partitioned into observable $\delta' W_{ijt}$ and unobservable $\xi_{it}$ components. Furthermore, we can use the residuals of each OLS regression to estimate $\Sigma$. Therefore, the structural estimation of $b, \beta$ is equivalent to the reduced form estimation of project quality influences in Equation 3.6. Furthermore, our consideration of heterogeneity beyond the framework of the model matches the structural estimation approach outlined by Hsieh, König, and X. Liu (2022). In our setup, this may allow us to control for technical aspects of projects that at least partially determine quality or fixed costs of project contribution that are absent from our structural discussion in Section 3.6.

2. Estimate $a, \alpha, \gamma$ given $D$ and estimates for $b$ using Proposition 1 and a modified version of the project contribution specification in Equation 3.14, using OLS regressions for each project:

\[ x_{it} = \tilde{a}_i + \alpha \sum_{j \neq i} G_{ijt} x_{jt} + \tilde{\nu}_{it} = a_i + \gamma b_i + \alpha \sum_{j \neq i} G_{ijt} x_{jt} + \tilde{\nu}_{it} \]

where, as before, $\tilde{\nu}_{it} = \delta' W_{ijt} + \nu_{it}$ and $\nu_{it}$ is independent and identically distributed with mean zero.

3. (Optional) If we assume that contribution costs $c_{it}(x,G)$ exactly equal, conditional on some noise or measurement error $d_{it}$, estimates of time allocation $\omega_{it}$, we can estimate fixed costs of contribution for each project $i \in N$ using the following specification to back out a residual $d_{it}$:

\[ d_{it} = \omega_{it} - \frac{1}{2} x_{it}^2 + a_i x_{it} + \alpha \sum_{j \neq i} G_{ijt} x_{jt} \]

where $\omega, x, G$ are observed in $D$ and $a, \alpha$ were recovered in previous steps. Taking the average for each project gives an estimate of the fixed costs of contribution: $\bar{d}_i = \frac{1}{|T_i|} \sum_{t \in T_i} d_{it}$, where $|T_i|$ is the number of time periods in which project $i$ appears in the empirical sample. These estimates will help refine the estimation of welfare effects under counterfactual analysis.

While we suggest that the parameters in Steps 1–3 above can be estimated with simple OLS, it might be more prudent to organize analogs of Equations (3.15), (3.14), and (3.10) described above into a set of moment conditions and subsequently estimate $(a, \alpha, b, \beta, \Sigma, \gamma, d)$ using the generalized method of moments (GMM) with constraints $\alpha, \beta \in (-1, 1)$ and $\sigma_i > 0$.

4. Estimate $r, \theta_\epsilon$ by means of MLE, maximizing a likelihood function based on equilibrium link formation described in Equation (3.18):

\[
\begin{aligned}
L(\theta \mid D) &= \prod_{t \in T} \Pr(D_t \mid D_{t-1}, \theta) = \prod_t \prod_{j \neq i} \Pr(G_{ij} = 1 \mid D_{t-1}, \theta) \\
&= \prod_t \prod_{j \neq i} F_\epsilon(\tilde{u}_{ijt})^{G_{ijt}} \big( 1 - F_\epsilon(\tilde{u}_{ijt}) \big)^{1 - G_{ijt}}
\end{aligned}
\tag{3.27}
\]

where $\tilde{u}_{ijt}$ is the log-transformed utility difference of Equation (3.25), evaluated at sample moment $t$. As mentioned previously, this likelihood function forms a Markov chain of likelihoods for the observed sequence of linking decisions. MLE estimates for $r$ and $\theta_\epsilon$ minimize $-\ln L(\theta \mid D)$.
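Step 4 can be sketched in a few lines of Python. The sketch below evaluates the negative log-likelihood implied by Equations (3.26) and (3.27) for given values of $r$ and $\theta_\epsilon$, treating the coefficients $Z_{0ijt}$ and $Z_{1ijt}$ as already computed from the first-phase estimates. The function names, the tuple layout of `observations`, and the clipping constant are all hypothetical choices for illustration; in practice the function would be handed to a numerical optimizer, which is omitted here.

```python
import math


def logistic_cdf(z, loc=0.0, scale=1.0):
    """Logistic CDF with location and scale, written to avoid overflow."""
    t = (z - loc) / scale
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def link_utility(r_i, z0, z1):
    """Quadratic-in-r utility difference: r * Z0 - 0.5 * r^2 * Z1."""
    return r_i * z0 - 0.5 * r_i ** 2 * z1


def neg_log_likelihood(r, theta_eps, observations):
    """Negative log-likelihood of observed link decisions.

    r: dict mapping project id -> risk aversion r_i
    theta_eps: (location, scale) of the logistic disturbance
    observations: iterable of (i, z0, z1, g) with g in {0, 1}
    """
    loc, scale = theta_eps
    total = 0.0
    for i, z0, z1, g in observations:
        p = logistic_cdf(link_utility(r[i], z0, z1), loc, scale)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total -= g * math.log(p) + (1 - g) * math.log(1.0 - p)
    return total


if __name__ == "__main__":
    # Hypothetical data: two proposed links for project 0, one formed and one not.
    obs = [(0, 2.0, 4.0, 1), (0, -1.0, 0.5, 0)]
    print(neg_log_likelihood({0: 0.5}, (0.0, 1.0), obs))
```

Because each term is a Bernoulli log-probability conditional on the previous state of the network, summing over the observed sequence of link decisions gives the log of the product in Equation (3.27).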
3.D.1 Additional simplifications to reduce computational burden

Given the size of our empirical sample, the estimation procedure as specified remains a time-intensive task on available hardware. In conjunction with the dimensionality reduction we discuss in Section 3.6.3.1, we take a few computational shortcuts to calculate the coefficients $Z_{0ijt}$ and $Z_{1ijt}$ for each sample moment $t \in T$, the project $i \in N_t$ associated with that particular moment $t$, and each potential dependency $j \neq i \in N_t$. First, we approximate the true matrix inverse using $A = I + \sum_{k=1}^{K} \alpha^k G^k$, where $K = 5$ provides a decent approximation. Second, following the suggestion of Hsieh, König, and X. Liu (2022), we use the Sherman-Morrison formula to efficiently calculate a new proposal Leontief inverse $A$ or $B$ to calculate $\Delta A_{ij}$ and $\Delta B_{ij}$. Third, instead of considering proposal links for all $j \neq i$ in the current network, we consider a set of randomly selected potential dependencies from $N_t$ that is (1) equal in size to the set of current dependencies for project $i$ and (2) contains potential dependencies not in the current set of dependencies for project $i$.

Bibliography

Acemoglu, Daron, Ufuk Akcigit, and William R Kerr (2016). “Innovation network”. In: Proceedings of the National Academy of Sciences 113.41, pp. 11483–11488. Acemoglu, Daron, Asuman Ozdaglar, and Alireza Tahbaz-Salehi (2015). “Systemic risk and stability in financial networks”. In: American Economic Review 105.2, pp. 564–608. Acquisti, Alessandro, Allan Friedman, and Rahul Telang (2006). “Is there a cost to privacy breaches? An event study”. In: ICIS 2006 Proceedings, p. 94. Akerlof, George A (1978). “The market for “lemons”: Quality uncertainty and the market mechanism”. In: Uncertainty in economics. Elsevier, pp. 235–251. Ambrus, Attila, Markus Mobius, and Adam Szeidl (2014). “Consumption risk-sharing in social networks”. In: American Economic Review 104.1, pp. 149–82.
Andersen-Gott, Morten, Gheorghita Ghinea, and Bendik Bygstad (2012). “Why do commercial companies contribute to open source software?” In: International journal of information management 32.2, pp. 106–117.
Andreoni, James (1990). “Impure altruism and donations to public goods: A theory of warm-glow giving”. In: The economic journal 100.401, pp. 464–477.
Andrews, Isaiah, James H Stock, and Liyang Sun (2019). “Weak instruments in instrumental variables regression: Theory and practice”. In: Annual Review of Economics 11, pp. 727–753.
Angrist, Joshua D (2014). “The perils of peer effects”. In: Labour Economics 30, pp. 98–108.
Angrist, Joshua D and Victor Lavy (1999). “Using Maimonides’ rule to estimate the effect of class size on scholastic achievement”. In: The Quarterly journal of economics 114.2, pp. 533–575.
Anwar, Afsah, Aminollah Khormali, DaeHun Nyang, and Aziz Mohaisen (2018). “Understanding the hidden cost of software vulnerabilities: Measurements and predictions”. In: International Conference on Security and Privacy in Communication Systems. Springer, pp. 377–395.
Archambault, Caroline, Matthieu Chemin, and Joost de Laat (2016). “Can peers increase the voluntary contributions in community driven projects? Evidence from a field experiment”. In: Journal of Economic Behavior & Organization 132, pp. 62–77.
Arcidiacono, Peter and Sean Nicholson (2005). “Peer effects in medical school”. In: Journal of public Economics 89.2-3, pp. 327–350.
Athey, Susan and Glenn Ellison (2014). “Dynamics of open source movements”. In: Journal of Economics & Management Strategy 23.2, pp. 294–316.
Babel (2014). babel. url: https://github.com/babel/babel.
Badev, Anton (2021). “Nash equilibria on (un) stable networks”. In: Econometrica 89.3, pp. 1179–1206.
Baldwin, Carliss Y and Kim B Clark (2006). “The architecture of participation: Does code architecture mitigate free riding in the open source development model?” In: Management science 52.7, pp. 1116–1127.
Ballester, Coralio, Antoni Calvó-Armengol, and Yves Zenou (2006). “Who’s who in networks. Wanted: The key player”. In: Econometrica 74.5, pp. 1403–1417.
Benkler, Yochai (2002). “Coase’s Penguin, or, Linux and “The Nature of the Firm””. In: Yale law journal, pp. 369–446.
— (2006). The wealth of networks. New Haven and London: Yale University Press.
Benkler, Yochai and Helen Nissenbaum (2006). “Commons-based peer production and virtue”. In: Journal of political philosophy 14.4.
Bergquist, Magnus and Jan Ljungberg (2001). “The power of gifts: organizing social relationships in open source communities”. In: Information Systems Journal 11.4, pp. 305–320.
Bergstrom, Theodore, Lawrence Blume, and Hal Varian (1986). “On the private provision of public goods”. In: Journal of public economics 29.1, pp. 25–49.
Bernard, Andrew B, Andreas Moxnes, and Yukiko U Saito (2019). “Production networks, geography, and firm performance”. In: Journal of Political Economy 127.2, pp. 639–688.
Bessen, James (2006). “Open source software: Free provision of complex public goods”. In: The economics of open source software development. Elsevier, pp. 57–81.
Bloch, Francis and Matthew O Jackson (2006). “Definitions of equilibrium in network formation games”. In: International Journal of Game Theory 34.3, pp. 305–318.
Bloch, Francis, Matthew O Jackson, and Pietro Tebaldi (2019). “Centrality measures in networks”. In: Available at SSRN 2749124.
Blume, Lawrence, David Easley, Jon Kleinberg, Robert Kleinberg, and Éva Tardos (2013). “Network formation in the presence of contagious risk”. In: ACM Transactions on Economics and Computation (TEAC) 1.2, pp. 1–20.
Boehm, Barry (1981). Software Engineering Economics. Vol. 197. Prentice-Hall, New York.
Boldi, Paolo and Georgios Gousios (2020). “Fine-grained network analysis for modern software ecosystems”. In: ACM Transactions on Internet Technology (TOIT) 21.1, pp. 1–14.
Bollinger, Bryan and Kenneth Gillingham (2012). “Peer effects in the diffusion of solar photovoltaic panels”. In: Marketing Science 31.6, pp. 900–912.
Bonaccorsi, Andrea, Silvia Giannangeli, and Cristina Rossi (2006). “Entry strategies under competing standards: Hybrid business models in the open source software industry”. In: Management science 52.7, pp. 1085–1098.
Bonaccorsi, Andrea and Cristina Rossi Lamastra (2003). “Altruistic individuals, selfish firms? The structure of motivation in Open Source software”. In: The Structure of Motivation in Open Source Software.
Boyter, Ben (2018). scc. url: https://github.com/boyter/scc.
Bramoullé, Yann, Habiba Djebbari, and Bernard Fortin (2009). “Identification of peer effects through social networks”. In: Journal of econometrics 150.1, pp. 41–55.
— (2020). “Peer effects in networks: A survey”. In: Annual Review of Economics 12, pp. 603–629.
Bramoullé, Yann and Rachel Kranton (2007). “Risk-sharing networks”. In: Journal of Economic Behavior & Organization 64.3-4, pp. 275–294.
Bretthauer, David (2001). “Open source software: A history”. In: Published Works. url: https://opencommons.uconn.edu/libr_pubs/7.
Brooks Jr, Frederick P (1995). The mythical man-month: essays on software engineering. Pearson Education.
Brown, C. Titus (July 2018). A framework for thinking about Open Source Sustainability? Accessed: 2021–12–04. url: http://ivory.idyll.org/blog/2018-oss-framework-cpr.html.
Brunfeldt, Kimmo (2014). git-hours. url: https://github.com/kimmobrunfeldt/git-hours.
Carey, Patrick (July 2017). “Heartbleed’s Heartburn: Why a 5 Year Old Vulnerability Continues to Bite”. In: The Security Ledger. Accessed: 2022–06–01. url: https://securityledger.com/2017/07/heartbleeds-heartburn-why-a-5-year-old-vulnerability-continues-to-bite/.
Carrell, Scott E, Bruce I Sacerdote, and James E West (2011). From natural variation to optimal policy? The Lucas critique meets peer effects. Tech. rep. National Bureau of Economic Research.
Carvalho, Vasco M (2014). “From micro to macro via production networks”. In: Journal of Economic Perspectives 28.4, pp. 23–48.
Carvalho, Vasco M, Makoto Nirei, Yukiko U Saito, and Alireza Tahbaz-Salehi (2021). “Supply chain disruptions: Evidence from the great east japan earthquake”. In: The Quarterly Journal of Economics 136.2, pp. 1255–1321.
Cavusoglu, Huseyin, Hasan Cavusoglu, and Jun Zhang (2006). “Economics of Security Patch Management.” In: WEIS. Citeseer.
Chandrasekhar, Arun (2016). “Econometrics of network formation”. In: The Oxford handbook of the economics of networks, pp. 303–357.
CHAOSS (2017). augur. url: https://github.com/chaoss/augur.
Chesbrough, Henry William (2003). Open innovation: The new imperative for creating and profiting from technology. Harvard Business Press.
Choi, Syngjoo, Sanjeev Goyal, and Frédéric Moisan (2019). Network formation in large groups. Tech. rep.
Christakis, Nicholas, James Fowler, Guido W Imbens, and Karthik Kalyanaraman (2020). “An empirical model for strategic network formation”. In: The Econometric Analysis of Network Data. Elsevier, pp. 123–148.
Ciliberto, Federico, Amalia R Miller, Helena Skyt Nielsen, and Marianne Simonsen (2016). “Playing the fertility game at work: An equilibrium model of peer effects”. In: International Economic Review 57.3, pp. 827–856.
Coase, Ronald Harry (1937). “The nature of the firm”. In: economica 4.16, pp. 386–405.
Cornelissen, Thomas, Christian Dustmann, and Uta Schönberg (2017). “Peer effects in the workplace”. In: American Economic Review 107.2, pp. 425–56.
Cornes, Richard and Todd Sandler (1985). “The simple analytics of pure public good provision”. In: Economica 52.205, pp. 103–116.
Dahl, Gordon B, Katrine V Løken, and Magne Mogstad (2014). “Peer effects in program participation”. In: American Economic Review 104.7, pp. 2049–74.
De Giorgi, Giacomo, Michele Pellizzari, and Silvia Redaelli (2010). “Identification of social interactions through partially overlapping peer groups”. In: American Economic Journal: Applied Economics 2.2, pp. 241–75.
De Paula, Áureo (2020). “Econometric models of network formation”. In: Annual Review of Economics 12, pp. 775–799.
De Weerdt, Joachim (2002). Risk-sharing and endogenous network formation. 2002/57. WIDER Discussion Paper.
De Weerdt, Joachim and Stefan Dercon (2006). “Risk-sharing networks and insurance against illness”. In: Journal of development Economics 81.2, pp. 337–356.
Decan, Alexandre and Tom Mens (2019). “What do package dependencies tell us about semantic versioning?” In: IEEE Transactions on Software Engineering 47.6, pp. 1226–1240.
Decan, Alexandre, Tom Mens, Maëlick Claes, and Philippe Grosjean (2016). “When GitHub meets CRAN: An analysis of inter-repository package dependency problems”. In: 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER). Vol. 1. IEEE, pp. 493–504.
Decan, Alexandre, Tom Mens, and Eleni Constantinou (2018a). “On the evolution of technical lag in the npm package dependency network”. In: 2018 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, pp. 404–414.
— (2018b). “On the impact of security vulnerabilities in the npm package dependency network”. In: Proceedings of the 15th international conference on mining software repositories, pp. 181–191.
Decan, Alexandre, Tom Mens, and Philippe Grosjean (2019). “An empirical comparison of dependency network evolution in seven software packaging ecosystems”. In: Empirical Software Engineering 24.1, pp. 381–416.
DeVault, Drew (Nov. 2021). I will pay you cash to delete your npm module. English. Accessed: 2022–06–01. url: https://drewdevault.com/2021/11/16/Cash-for-leftpad.html.
Doyle, John C, David L Alderson, Lun Li, Steven Low, Matthew Roughan, Stanislav Shalunov, Reiko Tanaka, and Walter Willinger (2005). “The “robust yet fragile” nature of the Internet”. In: Proceedings of the National Academy of Sciences 102.41, pp. 14497–14502.
Dueñas, Santiago, Valerio Cosentino, Jesus M. Gonzalez-Barahona, Alvaro del Castillo San Felix, Daniel Izquierdo-Cortazar, Luis Cañas-Díaz, and Alberto Pérez García-Plaza (July 9, 2021). “GrimoireLab: A toolset for software development analytics”. In: PeerJ Computer Science 7.e601. doi: 10.7717/peerj-cs.601.
Eghbal, Nadia (2016). Roads and bridges: The unseen labor behind our digital infrastructure. Ford Foundation.
— (2020). Working in public: the making and maintenance of open source software. Stripe Press.
Ellims, Michael, James Bridges, and Darrel C Ince (2006). “The economics of unit testing”. In: Empirical Software Engineering 11.1, pp. 5–31.
Elliott, Matthew and Benjamin Golub (2019). “A network approach to public goods”. In: Journal of Political Economy 127.2, pp. 730–776.
Elliott, Matthew, Benjamin Golub, and Matthew O Jackson (2014). “Financial networks and contagion”. In: American Economic Review 104.10, pp. 3115–53.
Elliott, Matthew, Benjamin Golub, and Matthew V Leduc (2022). “Supply network formation and fragility”. In: Available at SSRN 3525459.
Erol, Selman and Rakesh Vohra (2018). “Network formation and systemic risk”. In: Available at SSRN 2546310.
Everett, Martin and David Schoch (2022). “An extended family of measures for directed networks”. In: Social Networks 70, pp. 334–340.
Executive Order 14028 (May 2021). Improving the Nation’s Cybersecurity. url: https://www.whitehouse.gov/briefing-room/presidential-actions/2021/05/12/executive-order-on-improving-the-nations-cybersecurity/.
Fafchamps, Marcel and Flore Gubert (2007). “The formation of risk sharing networks”. In: Journal of development Economics 83.2, pp. 326–350.
Fafchamps, Marcel and Susan Lund (2003). “Risk-sharing networks in rural Philippines”. In: Journal of development Economics 71.2, pp. 261–287.
Falk, Armin and Andrea Ichino (2006). “Clean evidence on peer effects”. In: Journal of labor economics 24.1, pp. 39–57.
Fehr, Ernst and Simon Gächter (2000). “Cooperation and punishment in public goods experiments”. In: American Economic Review 90.4, pp. 980–994.
Feller, Joseph and Brian Fitzgerald (2002). Understanding open source software development. Addison-Wesley Longman Publishing Co., Inc.
Fershtman, Chaim and Neil Gandal (2004). “The determinants of output per contributor in open source projects: An empirical examination”. In: Available at SSRN 515282.
— (2007). “Open source software: Motivation and restrictive licensing”. In: International Economics and Economic Policy 4.2, pp. 209–225.
— (2011). “Direct and indirect knowledge spillovers: the “social network” of open-source projects”. In: The RAND Journal of Economics 42.1, pp. 70–91.
Finifter, Matthew, Devdatta Akhawe, and David Wagner (2013). “An empirical study of vulnerability rewards programs”. In: 22nd USENIX Security Symposium (USENIX Security 13), pp. 273–288.
Fischbacher, Urs and Simon Gächter (2006). “Heterogeneous social preferences and the dynamics of free riding in public goods”. In.
Fitzgerald, Brian (2006). “The transformation of open source software”. In: MIS quarterly, pp. 587–598.
Fogel, Karl (2005). Producing open source software: How to run a successful free software project. O’Reilly Media, Inc.
Fosgerau, Mogens and Michel Bierlaire (2009). “Discrete choice models with multiplicative error terms”. In: Transportation Research Part B: Methodological 43.5, pp. 494–505.
Galeotti, Andrea and Sanjeev Goyal (2010). “The law of the few”. In: American Economic Review 100.4, pp. 1468–92.
GitHub, Inc. (2020). The State of the Octoverse. Accessed: 2021–08–27. url: https://octoverse.github.com/.
— (2022a). Choose an open source license: Licenses. Accessed: 2022–06–16. url: https://choosealicense.com/licenses/.
— (2022b). Code Search. Accessed: 2022–10–20. url: https://github.com/search.
Glaeser, Edward L, Bruce I Sacerdote, and Jose A Scheinkman (2003). “The social multiplier”. In: Journal of the European Economic Association 1.2-3, pp. 345–353.
Glazer, Amihai and Kai A Konrad (1996). “A signaling explanation for charity”. In: The American Economic Review 86.4, pp. 1019–1028.
Goldfarb, Avi, Shane M Greenstein, and Catherine E Tucker (2015). Economic analysis of the digital economy. University of Chicago Press.
Goldsmith-Pinkham, Paul and Guido W Imbens (2013). “Social networks and the identification of peer effects”. In: Journal of Business & Economic Statistics 31.3, pp. 253–264.
Gousios, Georgios (2013). “The GHTorrent dataset and tool suite”. In: Proceedings of the 10th Working Conference on Mining Software Repositories. MSR ’13. San Francisco, CA, USA: IEEE Press, pp. 233–236. isbn: 978-1-4673-2936-1. url: http://dl.acm.org/citation.cfm?id=2487085.2487132.
Goyal, Sanjeev and Jose Luis Moraga-Gonzalez (2001). “R&D networks”. In: Rand Journal of Economics, pp. 686–707.
Graham, Bryan and Aureo De Paula (2020). The Econometric Analysis of Network Data. Academic Press.
Graham, Bryan S (2015). “Methods of identification in social networks”. In: Annu. Rev. Econ. 7.1, pp. 465–485.
— (2020). “Network data”. In: Handbook of Econometrics. Vol. 7. Elsevier, pp. 111–218.
Grams, Chris (Oct. 2019). How much time do developers spend actually writing code? English. Accessed: 2022–06–01. url: https://blog.tidelift.com/how-much-time-do-developers-spend-actually-writing-code.
Greenstein, Shane and Frank Nagle (2014). “Digital dark matter and the economic contribution of Apache”. In: Research Policy 43.4, pp. 623–631.
Grossman, Sanford J and Oliver D Hart (1986). “The costs and benefits of ownership: A theory of vertical and lateral integration”. In: Journal of political economy 94.4, pp. 691–719.
Guryan, Jonathan, Kory Kroft, and Matthew J Notowidigdo (2009). “Peer effects in the workplace: Evidence from random groupings in professional golf tournaments”. In: American Economic Journal: Applied Economics 1.4, pp. 34–68.
Hahn, Jungpil, Jae Yun Moon, and Chen Zhang (2008). “Emergence of new project teams from open source software developer networks: Impact of prior collaboration ties”. In: Information Systems Research 19.3, pp. 369–391.
Hall, Bronwyn H, Adam Jaffe, and Manuel Trajtenberg (2005). “Market value and patent citations”. In: RAND Journal of economics, pp. 16–38.
Heckman, James J (1979). “Sample selection bias as a specification error”. In: Econometrica: Journal of the econometric society, pp. 153–161.
Helmers, Christian, Manasa Patnam, and P Raghavendra Rau (2017). “Do board interlocks increase innovation? Evidence from a corporate governance reform in India”. In: Journal of Banking & Finance 80, pp. 51–70.
Hilton, Michael, Timothy Tunnell, Kai Huang, Darko Marinov, and Danny Dig (2016a). “Usage, costs, and benefits of continuous integration in open-source projects”. In: 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 426–437.
— (2016b). “Usage, costs, and benefits of continuous integration in open-source projects”. In: 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, pp. 426–437.
Holländer, Heinz (1990). “A social exchange approach to voluntary cooperation”. In: The American Economic Review, pp. 1157–1167.
Hsieh, Chih-Sheng, Michael D König, Xiaodong Liu, and Christian Zimmermann (2018). “Superstar Economists: Coauthorship networks and research output”. In: Available at SSRN 3266432.
— (2020). “Collaboration in bipartite networks, with an application to coauthorship networks”. In: Tinbergen Institute Discussion Paper 2020-056/VIII.
Hsieh, Chih-Sheng, Michael D König, and Xiaodong Liu (2022). “A structural model for the coevolution of networks and behavior”. In: Review of Economics and Statistics 104.2, pp. 355–367.
IBM (Sept. 2021). Cost of a Data Breach Report 2021. en-us. Accessed: 2022–06–01. url: https://www.ibm.com/security/data-breach.
Jackson, Joab (Feb. 2019). To Reduce Tech Debt, Eliminate Dependencies (and Refactoring). English. News Website. Accessed: 2022–06–01. url: https://thenewstack.io/to-reduce-tech-debt-eliminate-dependencies-and-refactoring/.
Jackson, Matthew O and Asher Wolinsky (1996). “A Strategic Model of Social and Economic Networks”. In: Journal of Economic Theory 71, pp. 44–74.
Jacobsen, Mark, Jacob LaRiviere, and Michael Price (2017). “Public policy and the private provision of public goods under heterogeneous preferences”. In: Journal of the Association of Environmental and Resource Economists 4.1, pp. 243–280.
Jaffe, Adam B, Manuel Trajtenberg, and Rebecca Henderson (1993). “Geographic localization of knowledge spillovers as evidenced by patent citations”. In: the Quarterly journal of Economics 108.3, pp. 577–598.
Johnson, Justin Pappas (2002). “Open source software: Private provision of a public good”. In: Journal of Economics & Management Strategy 11.4, pp. 637–662.
Kalliamvakou, Eirini, Georgios Gousios, Kelly Blincoe, Leif Singer, Daniel M German, and Daniela Damian (2014). “The promises and perils of mining github”. In: Proceedings of the 11th working conference on mining software repositories, pp. 92–101.
Katz, Jeremy (Jan. 2020). Libraries.io Open Source Repository and Dependency Metadata. Version 1.6.0. doi: 10.5281/zenodo.3626071.
Keller, Sallie, Gizem Korkmaz, Carol Robbins, and Stephanie Shipp (2018). “Opportunities to observe and measure intangible inputs to innovation: Definitions, operationalization, and examples”. In: Proceedings of the National Academy of Sciences 115.50, pp. 12638–12645.
Kerner, Sean Michael (Apr. 2014). Heartbleed SSL Flaw’s True Cost Will Take Time to Tally. Accessed: 2021–12–04. url: https://www.eweek.com/security/heartbleed-ssl-flaw-s-true-cost-will-take-time-to-tally/.
Kikas, Riivo, Georgios Gousios, Marlon Dumas, and Dietmar Pfahl (2017). “Structure and evolution of package dependency networks”. In: 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR). IEEE, pp. 102–112.
Kina, Kanako, Masateru Tsunoda, Hideaki Hata, Haruaki Tamada, and Hiroshi Igaki (2016). “Analyzing the decision criteria of software developers based on prospect theory”. In: 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER). Vol. 1. IEEE, pp. 644–648.
Kotchen, Matthew J (2009). “Voluntary provision of public goods for bads: A theory of environmental offsets”. In: The Economic Journal 119.537, pp. 883–899.
Kovářík, Jaromír and Marco J. Van der Leij (2009). “Risk aversion and networks: Microfoundations for network formation”. In.
— (2014). “Risk aversion and social networks”. In: Review of Network Economics 13.2, pp. 121–155.
Kremer, Michael (1993). “The O-ring theory of economic development”. In: The Quarterly Journal of Economics 108.3, pp. 551–575.
Krueger, Alan B (2003). “Economic considerations and class size”. In: The economic journal 113.485, F34–F63.
Kula, Raula Gaikovina, Daniel M. German, Ali Ouni, Takashi Ishio, and Katsuro Inoue (May 2017). “Do developers update their library dependencies?” In: Empirical Software Engineering 23.1, pp. 384–417. doi: 10.1007/s10664-017-9521-5.
Ladisa, Piergiorgio, Henrik Plate, Matias Martinez, and Olivier Barais (2022). Taxonomy of Attacks on Open-Source Software Supply Chains. doi: 10.48550/ARXIV.2204.04008.
Lakhani, Karim R and Robert G Wolf (2003). “Why hackers do what they do: Understanding motivation and effort in free/open source software projects”. In: Open Source Software Projects (September 2003).
Laurent, Andrew M St (2004). Understanding open source and free software licensing: guide to navigating licensing issues in existing & new software. O’Reilly Media, Inc.
Lee, Lung-Fei (2007). “Identification and estimation of econometric models with group interactions, contextual factors and fixed effects”. In: Journal of Econometrics 140.2, pp. 333–374.
Lee, Lung-Fei, Xiaodong Liu, Eleonora Patacchini, and Yves Zenou (2021). “Who is the key player? A network analysis of juvenile delinquency”. In: Journal of Business & Economic Statistics 39.3, pp. 849–857.
Lerner, Josh and Jean Tirole (2002). “Some simple economics of open source”. In: The journal of industrial economics 50.2, pp. 197–234.
— (2005a). “The economics of technology sharing: Open source and beyond”. In: Journal of Economic Perspectives 19.2, pp. 99–120.
— (2005b). “The scope of open source licensing”. In: Journal of Law, Economics, and Organization 21.1, pp. 20–56.
Leung, Michael P (2015). “Two-step estimation of network-formation models with incomplete information”. In: Journal of Econometrics 188.1, pp. 182–195.
Lewbel, Arthur (2019). “The identification zoo: Meanings of identification in econometrics”. In: Journal of Economic Literature 57.4, pp. 835–903.
Lindquist, Matthew J, Jan Sauermann, and Yves Zenou (2015). “Network effects on worker productivity”. In: CEPR Discussion Paper No. DP10928.
Manski, Charles F (1993). “Identification of endogenous social effects: The reflection problem”. In: The review of economic studies 60.3, pp. 531–542.
Marbukh, Vladimir (2018). “Network formation by contagion averse agents: modeling bounded rationality with logit learning”. In: 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). IEEE, pp. 1232–1233.
Mas, Alexandre and Enrico Moretti (2009). “Peers at work”. In: American Economic Review 99.1, pp. 112–45.
McIlroy, M, EN Pinson, and BA Tague (1978). “UNIX time-sharing system”. In: The Bell system technical journal 57.6, pp. 1899–1904.
McQuaid, Mike (Aug. 2018). The Open Source Contributor Funnel (or: Why People Don’t Contribute To Your Open Source Project). Accessed: 2021–12–04. url: https://mikemcquaid.com/2018/08/14/the-open-source-contributor-funnel-why-people-dont-contribute-to-your-open-source-project/.
Mele, Angelo (2017). “A structural model of dense network formation”. In: Econometrica 85.3, pp. 825–850.
Meneely, Andrew, Alberto C Rodriguez Tejeda, Brian Spates, Shannon Trudeau, Danielle Neuberger, Katherine Whitlock, Christopher Ketant, and Kayla Davis (2014). “An empirical investigation of socio-technical code review metrics and security vulnerabilities”. In: Proceedings of the 6th International Workshop on Social Software Engineering, pp. 37–44.
Mockus, Audris, Roy T Fielding, and James Herbsleb (2000). “A case study of open source software development: the Apache server”. In: Proceedings of the 22nd international conference on Software engineering, pp. 263–272.
Munaiah, Nuthan, Steven Kroh, Craig Cabrey, and Meiyappan Nagappan (2017a). Curating GitHub for engineered software projects.
— (2017b). “Curating github for engineered software projects”. In: Empirical Software Engineering 22.6, pp. 3219–3253.
Mutton, Paul (Apr. 2014). Half a million widely trusted websites vulnerable to Heartbleed bug. English. url: https://news.netcraft.com/archives/2014/04/08/half-a-million-widely-trusted-websites-vulnerable-to-heartbleed-bug.html (visited on 06/01/2022).
Nagle, Frank (2019). “Open source software and firm productivity”. In: Management Science 65.3, pp. 1191–1215.
Nagle, Frank, James Dana, Jennifer Hoffman, Steven Randazzo, and Yanuo Zhou (Mar. 2022). Census II of Free and Open Source Software — Application Libraries. English. Tech. rep. The Linux Foundation and The Laboratory for Innovation Science at Harvard, p. 162. url: https://www.linuxfoundation.org/tools/census-ii-of-free-and-open-source-software-application-libraries/ (visited on 06/01/2022).
Nitzan, Shmuel and Richard E Romano (1990). “Private provision of a discrete public good with uncertain cost”. In: Journal of Public Economics 42.3, pp. 357–370.
Oberhaus, Daniel (Feb. 2019). “The Complicated Economy of Open Source Software”. In: Accessed: 2021–09–23. url: https://www.vice.com/en/article/43zak3/the-internet-was-built-on-the-free-labor-of-open-source-developers-is-that-sustainable.
Ohm, Marc, Henrik Plate, Arnold Sykosch, and Michael Meier (2020). “Backstabber’s knife collection: A review of open source software supply chain attacks”. In: International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Springer, pp. 23–43.
Olea, José Luis Montiel and Carolin Pflueger (2013). “A robust test for weak instruments”. In: Journal of Business & Economic Statistics 31.3, pp. 358–369.
Open Source Initiative (Mar. 2007). The Open Source Definition. Accessed: 2022–06–16. url: https://opensource.org/osd.
Ostrom, Elinor (1990). Governing the commons: The evolution of institutions for collective action. Cambridge university press.
Patnam, Manasa (2011). “Corporate networks and peer effects in firm policies”. In: Emerging Markets Finance Conference, Indira Gandhi Institute of Development Research.
Perens, Bruce et al. (1999). “The open source definition”. In: Open sources: voices from the open source revolution 1, pp. 171–188.
Pham, Nam H, Tung Thanh Nguyen, Hoan Anh Nguyen, and Tien N Nguyen (2010). “Detection of recurring software vulnerabilities”. In: Proceedings of the IEEE/ACM international conference on Automated software engineering, pp. 447–456.
Prana, Gede Artha Azriadi, Abhishek Sharma, Lwin Khin Shar, Darius Foo, Andrew E Santosa, Asankhaya Sharma, and David Lo (2021). “Out of sight, out of mind? How vulnerable dependencies affect open-source projects”. In: Empirical Software Engineering 26.4, pp. 1–34.
Raemaekers, Steven, Arie van Deursen, and Joost Visser (2017). “Semantic versioning and impact of breaking changes in the Maven repository”. In: Journal of Systems and Software 129, pp. 140–158.
Raymond, Eric (1999). “The cathedral and the bazaar”. In: Knowledge, Technology & Policy 12.3, pp. 23–49.
Ridder, Geert and Shuyang Sheng (2020). “Estimation of large network formation games”. In: arXiv preprint arXiv:2001.03838.
Robbins, Carol A, Gizem Korkmaz, José Bayoán Santiago Calderón, Daniel Chen, Claire Kelling, Stephanie Shipp, and Sallie Keller (2018). “Open source software as intangible capital: measuring the cost and impact of free digital tools”. In: Paper from 6th IMF Statistical Forum on Measuring Economic Welfare in the Digital Age: What and How, pp. 19–20.
Roberts, Jeffrey A, Il-Horn Hann, and Sandra A Slaughter (2006). “Understanding the motivations, participation, and performance of open source software developers: A longitudinal study of the Apache projects”. In: Management science 52.7, pp. 984–999.
Roumani, Yaman, Joseph K Nwankpa, and Yazan F Roumani (2016). “Examining the relationship between firm’s financial records and security vulnerabilities”. In: International Journal of Information Management 36.6, pp. 987–994.
Sacerdote, Bruce (2001). “Peer effects with random assignment: Results for Dartmouth roommates”. In: The Quarterly journal of economics 116.2, pp. 681–704.
Samuelson, Paul A (1954). “The pure theory of public expenditure”. In: The review of economics and statistics 36.4, pp. 387–389.
Schlueter, Isaac Z. (Mar. 2016). kik, left-pad, and npm. English. Accessed: 2022–06–01. url: https://blog.npmjs.org/post/141577284765/kik-left-pad-and-npm.
Schueller, William and Johannes Wachs (2022). “Modeling Interconnected Social and Technical Risks in Open Source Software Ecosystems”. In: arXiv preprint arXiv:2205.04268.
Sherwood, Paul (2015). “Estimating the costs of open-source development”. Embedded Linux Conference Europe. Dublin, Ireland. url: https://lwn.net/Articles/659241/.
Sholler, Dan, Igor Steinmacher, Denae Ford, Mara Averick, Mike Hoye, and Greg Wilson (2019). “Ten simple rules for helping newcomers become contributors to open projects”. In: PLoS computational biology 15.9, e1007296.
Slivko, Olga (2014). “Peer effects in collaborative content generation: The evidence from German Wikipedia”. In: ZEW-Centre for European Economic Research Discussion Paper 14-128.
Snijders, Tom AB, Johan Koskinen, and Michael Schweinberger (2010). “Maximum likelihood estimation for social network dynamics”. In: The annals of applied statistics 4.2, p. 567.
Sorhus, Sindre (May 2019). (@sindresorhus) Some observations from having merged thousands of pull requests in the past few years. Accessed: 2021–12–04. url: https://twitter.com/sindresorhus/status/1130791267393163267.
— (Dec. 2021). Sindre Sorhus (personal web page). Accessed: 2021–12–04. url: https://sindresorhus.com/.
Spadini, Davide, Maurício Aniche, and Alberto Bacchelli (2018). “PyDriller: Python framework for mining software repositories”. In: Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering - ESEC/FSE 2018. New York, New York, USA: ACM Press, pp. 908–911. isbn: 9781450355735. doi: 10.1145/3236024.3264598.
Spinellis, Diomidis, Georgios Gousios, Vassilios Karakoidas, Panagiotis Louridas, Paul J Adams, Ioannis Samoladas, and Ioannis Stamelos (2009). “Evaluating the quality of open source software”. In: Electronic Notes in Theoretical Computer Science 233, pp. 5–28.
Squire, Megan and David Williams (2012). “Describing the software forge ecosystem”. In: 2012 45th Hawaii International Conference on System Sciences. IEEE, pp. 3416–3425.
Stack Exchange, Inc. (2022). Stack Overflow Developer Survey: Version Control Systems. Accessed: 2022–10–20. url: https://survey.stackoverflow.co/2022/#section-version-control-version-control-systems.
Stiglitz, Joseph E (1981). Public goods in open economies with heterogeneous individuals.
— (1982). The theory of local public goods twenty-five years after Tiebout: A perspective. Tech. rep. National Bureau of Economic Research.
Techopedia (2017). What is Technical Debt? - Definition from Techopedia. en. Accessed: 2022–06–01. url: http://www.techopedia.com/definition/27913/technical-debt.
Telang, Rahul and Sunil Wattal (2007). “An empirical analysis of the impact of software vulnerability announcements on firm stock price”. In: IEEE Transactions on Software engineering 33.8, pp. 544–557.
The OpenSSL Project Authors (2021). OpenSSL. Copyright 1999–2021.
Tiebout, Charles M (1956). “A pure theory of local expenditures”. In: Journal of political economy 64.5, pp. 416–424.
U.S. Bureau of Labor Statistics (Sept. 2021). Software Developers, Quality Assurance Analysts, and Testers: Occupational Outlook Handbook. url: https://www.bls.gov/ooh/computer-and-information-technology/software-developers.htm.
US CFPB (2022). Equifax Data breach settlement. en. url: https://www.consumerfinance.gov/equifax-settlement/ (visited on 06/01/2022).
US FTC (Feb. 2022). Equifax Data Breach Settlement. en. url: http://www.ftc.gov/enforcement/refunds/equifax-data-breach-settlement (visited on 06/01/2022).
Varian, Hal R (2000). “Buying, sharing and renting information goods”. In: The Journal of Industrial Economics 48.4, pp. 473–488.
Vasilescu, Bogdan, Yue Yu, Huaimin Wang, Premkumar Devanbu, and Vladimir Filkov (2015). “Quality and productivity outcomes relating to continuous integration in GitHub”. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering, pp. 805–816.
Von Hippel, Eric (2006). Democratizing innovation. the MIT Press.
Von Krogh, Georg and Eric Von Hippel (2003). “Special issue on open source software development”. In: Research Policy 32.7, pp. 1149–1157.
Walsh, Kenneth R and Helmut Schneider (2002). “The role of motivation and risk behaviour in software development success”. In: Information research 7.3, pp. 7–3.
Wan, Zelin, Yash Mahajan, Beom Woo Kang, Terrence J Moore, and Jin-Hee Cho (2021). “A Survey on Centrality Metrics and Their Network Resilience Analysis”. In: IEEE Access 9, pp. 104773–104819.
Wickham, Hadley (2015). R packages: organize, test, document, and share your code. O’Reilly Media, Inc.
Williamson, Oliver E (1975). “Markets and hierarchies: analysis and antitrust implications: a study in the economics of internal organization”. In: University of Illinois at Urbana-Champaign’s Academy for Entrepreneurial Leadership Historical Research Reference in Entrepreneurship.
— (1985). “The Economic Institutions of Capitalism: Firms, Markets, Relational Contracting”. In: University of Illinois at Urbana-Champaign’s Academy for Entrepreneurial Leadership Historical Research Reference in Entrepreneurship.
WIRED (Dec. 2021). A Log4J Vulnerability Has Set the Internet ’On Fire’. Accessed: 2022–06–01. url: https://www.wired.com/story/log4j-flaw-hacking-internet/.
Wiz (Dec. 2021). Log4Shell 10 days later: Enterprises halfway through patching. en. Accessed: 2022–06–01. url: https://www.wiz.io/blog/10-days-later-enterprises-halfway-through-patching-log4shell/.
Zerouali, Ahmed, Eleni Constantinou, Tom Mens, Gregorio Robles, and Jesús González-Barahona (2018). “An empirical analysis of technical lag in npm package dependencies”. In: International Conference on Software Reuse. Springer, pp. 95–110.
Zhang, Xiaoquan Michael and Feng Zhu (2011). “Group size and incentives to contribute: A natural experiment at Chinese Wikipedia”. In: American Economic Review 101.4, pp. 1601–15.
Zhao, Jun, Hea-Jung Kim, and Hyoung-Moon Kim (2020). “New EM-type algorithms for the Heckman selection model”. In: Computational Statistics & Data Analysis 146, p. 106930.
Zhao, Mingyi, Jens Grossklags, and Peng Liu (2015). “An empirical study of web vulnerability discovery ecosystems”. In: Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 1105–1117.
169 Zimmermann, Markus, Cristian-Alexandru Staicu, Cam Tenny, and Michael Pradel (2019). “Small world with high risks: A study of security threats in the npm ecosystem”. In: 28th USENIX Security Symposium (USENIX Security 19), pp. 995–1010. 170
Abstract
We explore microeconomic behavior shaping the production of open source software (OSS), fortifying economic structure with empirical analysis. In Chapter 2, we examine the extent to which peer effects influence the private provision of public goods. In the case of OSS, peer contribution may facilitate or otherwise incentivize further contribution from others, effectively subsidizing private provision. We first utilize a reduced form approach to derive causal estimates of net peer effects in public goods contribution by exploiting a peers-of-peers identification strategy. We next develop a structural model of peer-influenced public good provision to decompose contribution decision margins. We apply these methodologies using a sample of collaborative OSS projects hosted on the GitHub platform. Both reduced form and structural approaches suggest peer effects are much stronger along the extensive margin. Our counterfactual analysis suggests that extensive-margin peer effects account for nearly 56% of cumulative aggregate contribution. In Chapter 3, we consider the formation of software dependency networks. Developers of software projects can leverage the functionality of external projects. This practice can potentially lower the cost of development, albeit at the inherent risk of relying on external components. Centering our analysis around the dependency management problem faced by the risk-averse project maintainer, we use both reduced form and structural approaches to examine the implications of strategic network formation. Using a sample of projects from the Node.js JavaScript ecosystem, we find that removing less than 1% of core projects can reduce aggregate project quality by more than 5% for the remaining peers.
Asset Metadata
Creator
Boysel, Samuel Jospeh (author)
Core Title
Sustaining open source software production: an empirical analysis through the lens of microeconomics
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Economics
Degree Conferral Date
2022-12
Publication Date
12/06/2022
Defense Date
10/21/2022
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
digital economics, empirical, industrial organization, labor economics, microeconomics, OAI-PMH Harvest, open source software, peer effects, peer production, production economics, productivity, public goods, risk aversion, sustainability
Format
theses (aat)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Kahn, Matthew (committee chair), Kempe, David (committee member), Metcalfe, Robert (committee member), Oliva, Paulina (committee member)
Creator Email
boysel@usc.edu,sboysel@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC112617620
Unique identifier
UC112617620
Identifier
etd-BoyselSamu-11342.pdf (filename)
Legacy Identifier
etd-BoyselSamu-11342
Document Type
Dissertation
Rights
Boysel, Samuel Jospeh
Internet Media Type
application/pdf
Type
texts
Source
20221207-usctheses-batch-994 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu