COMMERCIALIZATION OF LOGISTICS INFRASTRUCTURE AS AN OFFLINE PLATFORM

by

Rongqing Han

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (BUSINESS ADMINISTRATION)

May 2020

Copyright 2020 Rongqing Han

Dedication

To my wife for your love and support,
And most importantly,
To my mother, who has sacrificed way too much for me,
"Mom, this is for you."

Acknowledgements

This dissertation is supported by the Marshall School of Business at the University of Southern California, for which I am particularly thankful.

My deep gratitude goes first to Professor Leon Yang Chu, who I believe is the best advisor one can ask for, a great mentor, and a dear friend. His guidance and wisdom have helped me explore new territories, overcome difficulties, and gain confidence over the past six years. Yet six years have been too short for all that I have learnt from him, both in academia and in life. I am also grateful for his tolerance of all the mistakes I have made.

I am deeply indebted to my co-authors, without whom this dissertation would not have been possible. I thank Professor Tianshu Sun for the opportunities he has created for me. His strong expertise in field experiments has greatly influenced me as a researcher in the long run. I am also extremely grateful to Professor Vishal Gupta and Professor Song-Hee Kim for their endless support both in research and in life. They hand-held me through the research process in earlier years when I knew very little about modeling, writing, and presenting. They have provided so much care and support in my life, when I was, as usual, struggling. Because of my co-authors, I am who I am today. Words cannot express how much I wholeheartedly appreciate everything they have done for me.

I am also grateful to Professor Sampath Rajagopalan for his strong support and kind help with research, career, and life. I thank Professor Sha Yang for being on my thesis committee, as well as for her encouragement and appreciation of my work. Special thanks to Professor Greys Sosic, our Ph.D. coordinator, who treated us like her own children with great care and love.

I would like to thank many individuals at the Department of Data Sciences and Operations for the extensive professional and personal support. Specifically, I would like to thank Professor Ramandeep Randhawa, Professor Kimon Drakopoulos, Professor Peng Shi, Professor Paat Rusmevichientong, and Professor Hamid Nazerzadeh for their excellent comments on my work and great support during my job search. I thank Julie Phaneuf and Professor Michelle Silver Lee for their contribution to the Marshall Ph.D. program. I also thank my Ph.D. friends in Marshall for an incredibly inspiring environment. Moreover, I would like to thank my industry collaborator Lixia Wu from Alibaba Group and Dr. Hyung Paek from Yale New Haven Hospital for the interesting discussions and generous support.

Finally, I would like to acknowledge with gratitude the support and love of my family, especially my wife Manyao, who is my champion and source of strength, and my mother Huifang, who has been the ultimate role model. I dedicate this thesis to them.
Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
1 Overview
2 Connecting Customers and Merchants Offline: Experimental Evidence From the Commercialization of Last-Mile Stations at Alibaba
  2.1 Introduction
  2.2 Literature Review
  2.3 Causal Effect of Organic Interaction
  2.4 Causal Effect of Induced Interaction
  2.5 Mechanism Underlying the Effectiveness of Induced Interaction: Self-Selection
  2.6 Target-to-Induce the "Right" Customers
  2.7 Conclusion and Discussion
3 Commercializing the Package Flow: Cross-sampling Physical Products Through E-commerce Warehouses
  3.1 Introduction
  3.2 Literature Review
  3.3 Research Settings
  3.4 Causal Impact of Cross-sampling on the Sampling Brand
  3.5 Value of User and Item Information
  3.6 Conclusion
4 Maximizing Intervention Effectiveness
  4.1 Introduction
  4.2 Model Setup
  4.3 Scoring Rules
  4.4 Robust Targeting
  4.5 Case Study: Comparison of Targeting Methods
  4.6 Conclusion
References
Appendices
A Supporting Materials for Chapter 2
  A.1 Free Sample Distribution Inside of An Alibaba Station
  A.2 Two-Stage Matching from Section 3
  A.3 Robustness Check and Long-Term Dynamics from Section 3
  A.4 Long-Term Effect of Induced Sample Claim from Section 4
  A.5 Additional Analysis on Offline Spillover of Organic Sample Claim Following Section 3
B Supporting Materials for Chapter 3
  B.1 Robustness Check for Section 3.4
  B.2 An Example of Affinity Scores
  B.3 Survey Evidence Combined with Affinity Scores
C Supporting Materials for Chapter 4
  C.1 Proofs
  C.2 Extensions of the Base Model
  C.3 Additional Graphs and Numerical Results

List of Tables

2.1 Summary of Free Sample Claims by Brands After Matching
2.2 Causal Effect of Organic Sample Claim
2.3 Summary Statistics and Randomization Check
2.4 Causal Effect of Online Notification on Sample Claims
2.5 Causal Effect of Induced Sample Claim
2.6 Claimer Type
2.7 Characteristic Difference of Induced Claimers and Organic Claimers (Based on Table 2.6)
2.8 Testing for Self-Selection Based on Observed Characteristics
2.9 Testing for Self-Selection Based on Unobserved Characteristics (LATE with Control Variables)
2.10 Testing for Self-Selection Based on Unobserved Characteristics
3.1 Categorization of Business Models by Control and Information
3.2 Number of Treated Customers Before and After Matching
3.3 Summary Statistics Before and After Matching
3.4 Causal Impact of Cross-sampling on the Sampling Brand
3.5 Information Matrix for Free Sampling of Physical Products
3.6 Balance Check for Affinity Scores by Brand
3.7 Influence of User-based Affinity Score on the Causal Impact of Cross-sampling
3.8 Influence of Item-based Affinity Score on the Causal Impact of Cross-sampling
3.9 Influence of 50-50 Compound Affinity Score on the Causal Impact of Cross-sampling
3.10 Percentage Revenue Gain of Different Channels (in Table 3.5) Under Resource Constraint
4.1 Evidence of Causal Effects for Case Management from [145] for the Study Population
4.2 Summary Statistics for the Study and Candidate Populations
4.3 Inclusion Criteria and Summary Statistics for the Candidate Population
4.4 Characteristics and Reward-Weighted Average Covariates for the Targeted Patients by Method
A.1 Main Matching Variables
A.2 Estimated Impact of Organic Sample Claim: Leads/Lags Model
A.3 Causal Effect of Induced Sample Claim (with 10-week Post-Treatment Window)
A.4 Offline Spillover of Organic Sample Claim
B.1 Causal Impact of Cross-sampling on the Sampling Brand (with Control Variables)
B.2 Causal Impact of Cross-sampling on the Sampling Brand (Falsification Test)
B.3 Summary of Both Affinity Scores for Consumption 1 as the Distributing Brand
B.4 Comparing Affinity Scores by Each Survey Answer

List of Figures

2.1 A Follow-up Survey on Usage of Free Samples
2.2 Relative Variable Importance in the Instrumental Forest
2.3 Single-Tree Representation of Instrumental Forest
3.1 Density of the Propensity Scores by Group
3.2 Spending and Item-view Trends by Group
3.3 Estimated Moderating Effect of the Compound Affinity Score Regarding Spending at the Sampling Brand by Different Weighting
4.1 Worst-Case Relative Performance of Reward Scoring (r-Scoring) for Case Management in Our Partner Hospital
4.2 Relative Performance as Both 1 and 2 Vary Uniformly within Their Confidence Intervals
4.3 Relative Performance as Either 1 or 2 Varies
4.4 Worst-Case Performance When CATEs Depend Only on Demographics
4.5 Difference between Robust-Full-Linear and Reward Scoring Varying the Resource Constraint
4.6 Distribution of Rewards r^0_c defined in eq. (4.21)
4.7 Difference between Robust-Full-Linear and Reward Scoring Varying Reward Distribution
A.1 Free Sample Distribution Inside of An Alibaba Station
A.2 Matching Variable Trends Before and After Sample Claims by Group
A.3 Density of Propensity Score by Group
A.4 Causal Effect of Organic Interaction By Week Relative to Sample Claim Week: Leads/Lags Model
C.1 Histogram of Avg. ED Visit Charges By Patient
C.2 Worst-Case Relative Performance of Outcome Scoring (r·y(0)-Scoring)
C.3 Budget Utilization when varying …
C.4 Box-Plot of Rewards by Each Stratum
C.5 Explaining Performance of Robust Methods under the Setting in Section 4.5.5
C.6 Difference between Robust-2 and Reward Scoring Varying the Resource Constraint
C.7 Difference between Robust-2 and Reward Scoring Varying Reward Distribution

Abstract

Many e-commerce platforms have established their own logistics infrastructure, such as last-mile/pickup stations and warehouses, which has largely benefited the online marketplaces as a cost center. While the existing literature has focused on either the operational efficiency of logistics or the role of digital marketplaces as separate issues, my thesis studies how e-commerce platforms can leverage the logistics infrastructure as an offline platform for commercial activities. Our key idea is to integrate the offline capacity with the resources from the online marketplace and transform the logistics infrastructure into also a profit center. Along this line, in the first two essays we empirically investigate how the network of last-mile stations and warehouses, respectively, can serve as a new medium to connect brands and customers with physical promotional items (e.g., free samples).
Since there are no prior examples or observational data, we conducted large-scale field experiments in close collaboration with Alibaba—one of the world's largest e-commerce platforms—to understand the causal impact of the new business practices and explore the underlying mechanisms. Moreover, because the available data from field experiments in the offline setting is often too noisy to accurately learn the heterogeneous treatment effect, decision-makers have to rely on limited information to improve their business practices. In the third essay, we develop a general data-driven optimization framework to link aggregate-level experimental data to decision making and maximize intervention effectiveness under uncertainty. In this way, my thesis demonstrates abundant opportunities and provides actionable insights for e-commerce platforms that manage their own logistics infrastructure.

Chapter 1
Overview

Major e-commerce platforms around the world have invested in an extended network of offline infrastructure to meet their logistics needs. For example, Alibaba—one of the world's largest e-commerce platforms—delivers approximately 80 million packages every day in China through its warehouses. More than 15% of these packages are self-picked up by customers from the large network of more than 30,000 last-mile/pickup stations across China. Amazon has also invested in more than 175 fulfillment centers all over the globe [4] and deployed hub lockers in over 900 cities in the US [5]. Despite the expansion of the logistics infrastructure as a cost center, e-commerce platforms constantly struggle to operate it in a cost-effective fashion [146, 11]. My doctoral thesis proposes that an alternative approach is to commercialize the existing logistics infrastructure through business innovation and transform it into a profit center. Specifically, the last-mile stations are embedded within neighborhoods and make it easy for customers to self-pick up or drop off packages for shipping at their convenience, which makes them an ideal offline touchpoint for commercial activities. For example, the platform can collect physical promotional items such as free samples or offline ads from its online brands and distribute them to the last-mile stations. The spontaneous walk-in customers can then interact with the brands offline, which may, in turn, increase online sales of the focal brands. Moreover, the platform-owned warehouses can also deliver free samples with packages, which allows online brands to reach many customers of other unrelated brands (even in a different category) with physical products. In this way, the logistics infrastructure becomes an offline platform for connecting customers and brands.

As today's online marketplaces for advertising and promotion become increasingly saturated, online brands that sell physical experience goods face unprecedented challenges in reaching customers. This offline platform presents a fresh opportunity for e-commerce platforms to help their online brands reach customers offline. Studying whether and how this new commercial channel can derive additional value would impact how e-commerce platforms should evaluate their logistics infrastructure and how the brands can allocate their budget between the online marketplace and the offline platform.

Since May 2018, I have been working as a Data Scientist at Alibaba Group—one of the world's largest e-commerce platforms.
Along the way, I have identified four distinctive features of the offline infrastructure, which signal tremendous opportunities for commercialization. First (geographic segmentation), the last-mile stations are established within small neighborhoods for self-pickup. Any station-based campaign, such as in-station promotion, naturally segments customers within each neighborhood. Second (screening effect), different from the walk-in customers, customers who are informed by online intervention have to decide whether to travel the last mile to a nearby station. Because of the additional traveling cost, only those who are more interested in the campaigns would be attracted. As a result, the last-mile stations facilitate a unique self-selection mechanism. Third (logistics control), the logistics infrastructure is designed to support the physical flow of packages, which also allows the platform to distribute physical promotional items such as free samples. Moreover, the platform can use the rich data accumulated from its online marketplace to personalize promotion based on customers' online behavior. Fourth (localized information), besides the rich data online, the offline infrastructure is embedded with exclusively localized information. For example, some station owners leverage their knowledge of nearby customers and sell grocery products such as fresh vegetables and fruits through Alibaba's infrastructure.

Recognizing these potentials, I frequently interact with industry partners to propose innovative practices and explore how to integrate the offline capacity and the online resources to generate business value. In particular, my thesis studies how e-commerce platforms can leverage the logistics infrastructure for commercial activities. Previous research has focused on the operational efficiency of logistics, such as order fulfillment [16, 153] and package delivery [119, 137], or on the role of digital platforms from the perspective of information sharing [120], such as advertising and targeting [75, 108, 122, 78] and e-commerce [39, 37, 165], as separate issues. My research contributes to the existing literature by taking a first step to connect the dots between offline logistics and online marketplaces. As such, we suggest an expansion of e-commerce platforms' business models from the digital world to the physical world.

Specifically, the first essay (Chapter 2, coauthored with Tianshu Sun, Leon Yang Chu, and Lixia Wu) investigates how the large network of last-mile stations may serve as an offline platform to connect customers and brands in the physical world and stimulate an offline-to-online effect by leveraging the spontaneous walk-in traffic (organic interaction) and by prompting interested customers through online intervention (induced interaction). Using free sample distribution as an example, we design two large-scale experimental studies in collaboration with Alibaba—a quasi-experiment across 1,032 stations and a randomized field experiment among 189,019 customers—to examine the causal effects of organic and induced interactions, respectively, on customers' subsequent online purchases at the focal brands. We find that induced interaction drives significantly more online sales compared with organic interaction. We further identify a screening mechanism underlying its effectiveness. Because of the additional traveling cost on the customer side, induced interaction tends to attract customers who have a stronger preference over the focal products.
Such advantageous self-selection, in turn, leads to a large increase in customers' subsequent online engagement. Finally, we develop a customized targeting framework using instrumental forest to further enhance the effectiveness of induced interaction at the last-mile stations.

Our first study demonstrates the potential for the commercialization of last-mile stations, which essentially serve as a distributed physical intermediary. For example, the station owners can also host "community group buys", where they gather nearby customers and place orders (e.g., grocery shopping), and the platform would ship the orders to the stations for self-pickup. Therefore, future research can also explore different interventions to motivate participation and leverage the localized information to match the right products with each neighborhood. Moreover, our study also shows that, combined with online resources (specifically, online intervention), we can further improve the effectiveness of the offline commercial activities. In other words, the success of the offline platform lies in the proper integration of offline capacity and online resources, which, in today's omni-channel world, is extremely promising.

Along this line, the second essay (Chapter 3, coauthored with Leon Yang Chu, Tianshu Sun, and Lixia Wu) proposes a novel practice—cross-sampling—for the platform-owned warehouses with direct control over physical packages and shows how it can integrate the offline control and online information to commercialize the package flow. Cross-sampling allows physical free samples provided by one brand to be distributed with the existing packages of other unrelated brands. In close collaboration with Alibaba, we designed and implemented cross-sampling through a large-scale field experiment, in which more than 55,000 free samples were distributed, to provide the first empirical examination of its effectiveness in driving online sales. We show that cross-sampling offers three distinct advantages. First, it can be implemented at a large scale with negligible cost by leveraging the existing package flow. Second, it is highly effective in expanding customer bases and increasing brand impressions and online sales of the sampling brand. Third, it is the only channel to leverage both user information and item information (including real-time purchase information) to personalize free sampling. In particular, we develop an analytical framework to explore the competitive advantage of cross-sampling and characterize the value of such information. Overall, our proposed business practice is proven effective, and the proposed analytical framework can be applied to other opportunities, including cross-selling and bundling across brands and categories.

The two empirical studies exemplify the business value of the offline platform and the necessity of a good intervention design. Moreover, because the available data from field experiments in the offline setting is often too noisy to accurately learn the heterogeneous treatment effect, decision-makers have to rely on limited information to improve their business practices. Therefore, new targeting methods are required to further maximize the effectiveness of different business practices.

In the third essay (Chapter 4, coauthored with Vishal Gupta, Song-Hee Kim, and Hyung Paek), we develop a novel data-driven optimization framework to link limited experimental data to decision making in a more general context.
In particular, policymakers frequently seek to roll out an intervention previously proven effective in a research study, perhaps subject to resource constraints. However, since different sub-populations may respond differently to the same treatment, there is no a priori guarantee that the intervention will be as effective in the targeted population as it was in the study. How then should policymakers target individuals to maximize intervention effectiveness? We propose a novel robust optimization approach that leverages evidence typically available in a published study. Our approach is tractable – real-world instances are easily optimized in minutes with off-the-shelf software – and flexible enough to accommodate a variety of resource and fairness constraints. We compare our approach with current practice by proving tight performance guarantees for both approaches, which emphasize their structural differences. We also prove an intuitive interpretation of our model in terms of regularization, penalizing differences in the demographic distribution between targeted individuals and the study population. Although the precise penalty depends on the choice of uncertainty set, we show for special cases that we can recover classical penalties from the covariate matching literature on causal inference. Finally, using real data from a large teaching hospital, we compare our approach to current practice in the particular context of reducing emergency department utilization by Medicaid patients through case management. We find that our approach can offer significant benefits over current practice, particularly when the heterogeneity in patient response to the treatment is large.

Although this study considers a more general setting, it can be further extended to the offline platform. For instance, the first essay mainly focuses on how to target customers for induced interaction at the individual level. However, the platform can also selectively target particular stations for organic interaction and leverage the geographical segmentation of the stations. And the station-level data can be considered as summary statistics of the individual-level data. It will be interesting for future research to investigate how the customer-level findings can be aggregated to the station level and facilitate station selection in a data-driven manner.

All in all, motivated by the four unique features of the offline platform, I have studied two innovative business practices and developed a data-driven decision tool to maximize intervention effectiveness. In addition, I also discuss several future research directions along this line. As such, my thesis takes the first step in this new direction to understand the fundamental advantage of the offline platform, which can provide actionable insights to guide practice.

Chapter 2
Connecting Customers and Merchants Offline: Experimental Evidence From the Commercialization of Last-Mile Stations at Alibaba

2.1 Introduction

Major e-commerce platforms such as Amazon, Alibaba, and JD.com have invested in an extended network of offline infrastructure at the last mile to meet their logistics needs. For example, Cainiao Network—Alibaba's logistics subsidiary—has established more than 30,000 last-mile stations across China. Amazon has also deployed hub lockers in over 900 cities in the US [5].
These stations are embedded within neighborhoods and make it easy for customers to self-pick up or drop off packages for shipping at their convenience, which has significantly relieved the burden of last-mile logistics.

Beyond its crucial logistics functions, the last-mile infrastructure may also serve as an offline intermediary for commercial activities. For instance, the platform can collect physical promotional items such as free samples or offline ads from its online merchants/brands (we will use "brand" and "merchant" interchangeably throughout the paper) and distribute them to the last-mile stations. The spontaneous walk-in customers can then interact with the merchants offline, which may, in turn, increase online sales of the focal merchants. Moreover, the platform can leverage the online channel, such as mobile notification, to induce nearby customers to pay a visit to the stations, which serve as a local hub. In this way, the last-mile infrastructure becomes an offline platform for connecting customers and merchants.

As today's online marketplace is becoming increasingly saturated (there are many indicators; for instance, merchants' advertising spending on paid exposure, as measured by the percentage of merchants' revenue on Alibaba, has significantly increased in the past few years [62]), this new commercial channel presents a fresh opportunity for e-commerce platforms to help merchants reach customers offline. As such, it is crucial for e-commerce platforms to evaluate the effectiveness of this novel channel in driving online sales. Studying whether and how this channel can derive additional value would impact how e-commerce platforms should evaluate their logistics infrastructure and how the brands (especially those who sell experience goods) can allocate their budget between the online marketplace and the offline platform. Using free sample distribution as an example, we provide the first empirical examination of the business potential of last-mile stations in driving customers' online purchases at the focal brands. We show that the traffic to logistics infrastructure may fundamentally differ in value from the traffic to traditional offline stores, and that customized solutions may be required to achieve a meaningful offline-to-online effect when leveraging the last-mile infrastructure.

Specifically, we anticipate two unique advantages of the last-mile infrastructure in facilitating commercial activities and driving the offline-to-online effect. First, the straightforward approach is to leverage the walk-in traffic at the last-mile stations. The platform can display free samples inside the stations and allow any walk-in customers to claim them. Such organic interaction between customers and merchants may increase online sales because these customers are active online users who just made purchases. However, on balance, it is not clear to what extent the interaction would be effective, as the customers are not visiting for the focal brand. Second, the platform can also use the stations as a local hub and nudge nearby customers to pay a visit through online notification. Consequently, some interested customers would be induced from the online channel to interact with the brands offline, i.e., induced interaction. Importantly, while the platform cannot control customers' organic visits, it can control which customers are informed by the online notification.
Such flexibility allows the platform to leverage the rich data accumulated from the online marketplace to target particular segments of customers online and maximize the effectiveness of the offline interaction.

Although both types of offline interactions are expected to increase online sales, they differ from each other in a nontrivial way. Take the free sample distribution as an example. The organic sample claimers (customers who claim free samples regardless of online notification) were exposed to the events at the stations, so it is easy for them to claim free samples. In contrast, the induced claimers (customers who claim free samples if and only if being targeted by online notification) were informed through their online app when they were away from the stations. Thus, the induced claimers have to make a decision and incur an additional traveling cost. The fact that the induced claimers have self-selected to claim free samples indicates that, on average, they may have different (observed or unobserved) characteristics. Such self-selection behavior could either increase or decrease online sales at the focal brand: either the induced claimers are attracted only by the "freebie" (adverse self-selection), or they are genuinely interested in the product and have a stronger intention to experience the focal brand (advantageous self-selection). If, for instance, the advantageous self-selection outweighs the adverse self-selection, the online notification further highlights the value of the last-mile stations. That is, the stations serve as a screening device that attracts only customers who are more interested and more likely to make purchases after claiming the samples.

Given the uniqueness of this last-mile infrastructure, we are interested in two questions. First, is the organic interaction at the logistics infrastructure effective in driving the offline-to-online effect? Second, compared to the organic traffic at the last-mile station, would the induced traffic be more effective in enabling the offline-to-online effect? These key questions motivate us to make a clear distinction between the two types of offline interactions and study their impact on online sales. In close collaboration with Alibaba—one of the world's largest e-commerce platforms—and several major cosmetics brands, we conducted two large-scale experimental studies (one quasi-experiment and one randomized field experiment). We use the two studies to estimate the causal effect of organic and induced interaction, respectively, on focal brands' online sales. Since every customer has to scan a QR code to claim a free sample, we have detailed customer-level data on station visits. We then use the randomized field experiment to investigate the difference between the two interactions and its underlying mechanism. The insights gained from this study can also be applied to other types of commercial activities, such as offline advertising. Importantly, we also seek to shed some light on a general phenomenon in omnichannel retailing, where online intervention is often used to increase offline traffic.

In the first study of organic interaction, we leveraged a large-scale quasi-experiment of eight free sample distribution campaigns by different brands from March to June 2018. Customers claimed a total of 139,204 free samples across 1,032 last-mile stations. Note that since we cannot control customers' organic walk-in behavior, we cannot randomize customers' organic sample claiming behavior.
So we have to use the observational study to estimate the causal effect for organic claimers. Specifically, we tailor a two-stage matching procedure coupled with difference-in-differences (DID) analysis. The results suggest that organic interaction only slightly increases customers' online purchases at the focal brand. For example, the purchase probability over the next five weeks, i.e., the purchase conversion rate, is lifted by only 0.05%. Unlike the case of offline stores/branches, the organic traffic at the last-mile infrastructure is not as valuable, potentially because of a lack of interest/purpose in the visits. In other words, the insights from the offline-to-online literature in the retailing context cannot be naturally extended to our case; a customized solution is needed to enable the offline-to-online effect.

In the second study of induced interaction, we designed a large-scale randomized field experiment with a major brand across 40 stations in August 2018. We identified all customers (189,019) in the vicinity of these stations and randomly notified 80% of them on their mobile Taobao apps about the free sample distribution. The random assignment created an exogenous increase in customers' awareness of the event, which results in 14% more sample claims. Based on the local average treatment effect (LATE) framework [92, 7], we estimate that the conversion rate for induced claimers is 3.8%, which is much larger than that of the organic claimers.

Moreover, we again use the randomized field experiment to identify the underlying mechanism. Based on the LATE framework, we propose a method to systematically test the existence of self-selection behavior of induced claimers. We find that the advantageous self-selection mechanism outweighs the adverse self-selection mechanism. This provides an essential insight into one key role of the last-mile infrastructure: when online notification is introduced to boost the offline traffic, the physical infrastructure can serve as a screening device and attract a subgroup of more interested customers to engage with the focal brands (i.e., the online-to-offline-to-online effect). Furthermore, the advantageous self-selection cannot be explained by customers' offline characteristics (such as self-pickup, parcel shipping, and location) nor online behavior (such as visits, item-views, and purchases). It suggests that customers from the online channel self-selected to claim free samples offline based on unobserved preference over the focal brand. This particular finding motivates new techniques for the platform to target the "right" customers with online notification to maximize effectiveness.

Finally, we propose a practical targeting framework based on a state-of-the-art machine learning technique—instrumental forest—to illustrate how the platform can target-to-induce the "right" customers and maximize return for a focal brand. Our key idea is that the "right" customers should correspond to the ones who are 1) likely to be induced from online to claim samples and also 2) likely to make a purchase afterward. In this way, although customers are induced through unobserved preference, our framework takes into account the two-stage decision and learns the heterogeneous causal effects of induced interaction from observed data. We find that both customers' offline and online characteristics are important in predicting the heterogeneous causal effect.
Given the predicted return for each customer, the platform can strike a balance between cost (the total number of samples distributed) and return (the predicted total number of purchases).

Before proceeding, we summarize our key managerial insights to facilitate data-driven decision making for the offline platform:

- The effect of organic interaction on online sales of the focal brands is not economically significant. The offline-to-online effect cannot be taken for granted, and a customized solution is needed.
- By introducing online intervention, the platform can effectively induce customers through an advantageous self-selection mechanism, which in turn generates a significantly larger increase in customers' online engagement with the focal brand. The last-mile infrastructure, with its traveling cost, essentially serves as a screening device that encourages interested customers.
- Using online data, the platform can target-to-induce a particular set of customers by taking into account both customers' sample claiming decision and purchase decision to further maximize the effectiveness of induced interaction.

The structure of the paper is as follows. We draw on related research and articulate our contribution in Section 2.2. In Sections 2.3 and 2.4, we examine the causal effects of organic and induced interactions, respectively. We further explore the underlying mechanism in Section 2.5 and propose a personalized targeting framework for induced interaction in Section 2.6. In Section 2.7, we conclude and discuss several directions for future research.

2.2 Literature Review

Our work relates to four different streams of literature arising from the studies of information systems, operations management, and marketing, among others.

First, we contribute to the literature that studies how offline stores affect the online channel, i.e., the offline-to-online effect [42, 141, 148]. Previous research in retailing has shown that depending on the popularity of the brand [160], product assortment or pricing [36, 156], and customer segments [106], offline retail stores may either complement or substitute the online channel. Other studies have investigated how the opening or closing of offline branches affects the online channel [13, 118, 71, 141]. However, no study has examined how offline logistics infrastructure can also be leveraged for commercial activities and increase online sales. Practically speaking, this is a fast-growing and novel commercial channel that could largely benefit online brands. In contrast to retail stores or banks, customers do not visit the offline establishment for engagement with the brands. Thus, the online brands can take the initiative to reach a fundamentally different customer base, mostly new customers. (In our studies of both organic and induced interactions, over 99% of subjects are new customers to the focal brands.) As the number and coverage of offline stations continue to grow in the US and China (Cainiao Network is targeting 100,000 last-mile stations in the next three years in China; see the news announcement: https://finance.sina.com.cn/roll/2019-05-29/doc-ihvhiqay2311441.shtml), it is crucial to understand the business value of the organic traffic to the station and also effective ways to induce the right customers to interact at the station.
Theoretically speaking, this new type of offline interaction between customers and brands takes advantage of a key theoretical grounding, "show-rooming", that drives the complementarity of the offline and online channels, but also puts a new spin on the mechanism. Specifically, "show-rooming" refers to a consumer visiting a physical store to experience or inspect a product and, if she likes it, buying it from an online seller, often at a lower price. In a similar vein, the last-mile stations serve as a physical intermediary for customers to learn about the products and, if they like them, make purchases at the focal brands' online stores. Both analytical models [69, 126, 105, 96] and empirical studies [133, 20, 19, 21, 166, 128] have justified such an effect in the omnichannel retailing context. Our study extends this stream of literature and presents the first empirical evidence on the commercialization of logistics infrastructure. We find that merely leveraging the organic walk-in traffic is not effective and only slightly increases the online sales of the focal brand. It highlights the fact that the logistics infrastructure fundamentally differs from traditional offline stores/branches in that it is void of brand affiliation and, thus, less effective in attracting potential customers to buy at the focal brands. This particular finding also motivates an alternative approach: leveraging online notification to actively induce self-selection of nearby customers. With such a customized solution, the brands can now take advantage of the offline-to-online effect at the last mile using the logistics infrastructure.

Our second important contribution to the literature regards the impact of online interventions on customers' offline behavior, i.e., the online-to-offline effect. Previous studies have examined the effectiveness of buy-online-and-pickup-in-store [67, 68], the presence of electronic commerce [132, 152], online/mobile advertising [114, 97, 173], informational websites [135, 42], and sponsored search [43, 57] on offline sales. (Another stream of research has leveraged the entry of platforms (e.g., Airbnb, Craigslist) to examine the economic and societal impact of online marketplaces on offline activities [41, 143, 79, 80, 169, 3].) Recent studies have shown that mobile targeting is particularly effective in influencing customers' offline behavior [6, 72, 73, 74], especially when the stores are nearby [88, 64, 122, 66, 47]. These studies have focused either on only the induced customers (driven from the online channel by different types of online intervention) or on the induced and organic customers as a whole. To the best of our knowledge, no study has distinguished and compared the value of the two different types of customers: organic and induced customers. The detailed customer-level data in our research allows us to disentangle the two different types of offline traffic and provide important implications for the online-to-offline effect. We show that the additional offline customers driven by online intervention are different from the organic traffic and more likely to make a purchase afterward due to the additional traveling cost. In this way, our study extends the online-to-offline literature and shows that online intervention not only serves as a tool to increase offline traffic but also may encourage the right selection.
We expect that our proposed self-selection mechanism and the framework for testing such an effect also apply to the general context of the online-to-offline effect.

Moreover, we contribute to the stream of literature on free sampling. A long line of research studies the free sampling of physical products in offline stores [85, 124, 18] or of information products online [46, 159, 48, 112, 130]. Recently, [117] study the rating-bias effect of distributing physical products online from the e-commerce platform's perspective. Using free sample distribution as an example in our study, we contribute to this stream of literature by exploring a new channel for distributing physical free samples offline.

Last but not least, previous literature illustrates the concept of the "smart city", in which different opportunities such as zero-emission buildings, robotics [58], internet infrastructure [22], electrified and shared mobility [138], urban logistics, and retail may arise. On the one hand, the smart city enables the recording of consumer behavior offline at a more granular level. For instance, [167] leverages GPS tracking data to investigate the learning behavior of taxi drivers. In our case, the rich data on customers' offline behavior (sample claims, parcel pickup) fully takes advantage of the prevalence of mobile technology, e.g., QR codes. On the other hand, the smart city allows us to synergize seemingly unrelated aspects, e.g., commercialization and logistics. The majority of research on the last-mile infrastructure focuses on its logistics capabilities such as shared mobility in delivery [137], data-driven order assignment [119], or vehicle routing [161]. Our study, instead, investigates the commercial aspect of the last-mile infrastructure and brings out a novel aspect of a smart city: the business value of last-mile infrastructure. Similar to previous literature [162], we also leverage rich online data to predict and influence offline activities. In this way, our study enriches the literature, opens up a new area of focus, and demonstrates abundant opportunities for e-commerce platforms.

2.3 Causal Effect of Organic Interaction

We study the causal effect of organic interaction and induced interaction in collaboration with Alibaba. As one of the world's largest e-commerce platforms, Alibaba has built a large network of more than 30,000 last-mile stations across China to accommodate the delivery and collection of tens of millions of packages on a daily basis. Leveraging the logistics infrastructure, we worked closely with the platform and experimented with its online merchants on a series of commercial activities with a focus on free sample distribution. In a typical free sample distribution, a merchant provides a batch of (10,000 - 100,000) free samples for offline distribution to a number of last-mile stations. The station owners place the free samples on the shelves together with a QR code banner. Any walk-in customer can claim one free sample by scanning the QR code using their Taobao app. In this way, each sample claim is linked to the customer's Taobao online user ID. (Please also see Figure A.1 in Appendix A.1 for pictures of free sample distribution inside a station.) Each free sample distribution campaign could last for a few weeks. On average, one station can distribute a few hundred free samples in two weeks. In this section, we first investigate the causal effect of organic sample claim on the focal brands' online sales.
2.3.1 Experiment Setting and Data

The organic offline interaction can be viewed as a quasi-experiment. The organic visitors' exposure to the free samples was, to some extent, exogenously determined. Specifically, all campaigns were conducted at the stations without any online advertising. Thus, sample claimers did not have prior knowledge about the specific campaigns when coming to the stations (and when making purchases online before that). Such an exogenous assignment helps our identification of the causal effect. Nevertheless, a direct comparison between the sample claimers and the average Alibaba customers who did not claim free samples may be subject to bias. For example, the sample claimers were more likely to be frequent station visitors (who had self-picked up packages more often), and they were more active online in general. Consequently, compared with the average customers, they were more likely to engage the merchant online after claiming free samples, regardless of the actual impact of the sample claim. Thus, it is crucial to take into account these differences in customers' characteristics and past behavior (both offline and online) and rule out confounding explanations.

Combining multiple approaches documented in previous observational studies [13, 19, 106], we apply a series of econometric techniques (exact matching, propensity score matching, and DID) to account for such bias at the customer level. To do so, we construct a rich dataset that includes a proper control group and captures possible confounding factors. First, we obtain all sample claims from March 2018 to June 2018, which encompasses eight campaigns from different brands at different stations. Because a typical station distributes a few hundred free samples in two weeks, we focus on the stations that distributed at least 20 free samples. The resulting data consists of 210,934 sample claims in 1,032 stations across 20 cities and 15 provinces in China. Second, we include all customers who did not claim free samples but had ever been to the 1,032 stations within three months before the campaigns, which amounts to a total of 3,874,389 customers. In total, our raw data includes over 4.08 million customers. Third, we append a granular set of data including customer characteristics, detailed online behaviors (purchases and item-views), and a complete record of offline logistics services (self-pickup and parcel shipping) for each of the more than 4 million customers. Last, we construct a series of outcome measures and build a proper control group through matching using this large and rich dataset.

To facilitate a multi-period DID analysis, we follow the tradition in previous literature and choose customer-week as the unit of analysis. We then define four outcome variables to measure the impact on the focal brands.
Outcome variables:

- Brand purchase or not: a binary variable indicating whether a customer makes a purchase at the focal brand's online store during a given week
- Brand spending (log): a continuous variable measuring a customer's total spending at the focal brand's online store during a given week
- Brand view or not: a binary variable indicating whether a customer visits the focal brand's online store during a given week
- # of brand item-views (log): a continuous variable measuring the total number of items viewed by a customer at the focal brand's online store during a given week

All outcome variables are aggregated to a customer-week panel structure from 2018/02/05 to 2018/08/26 (24 weeks in total). The time span of our data sample ensures that we have observations for at least five weeks before and five weeks after the sample claim week for all customers.

2.3.2 Two-Stage Matching

Recall that the DID analysis is built on the parallel trend assumption, which asserts that the difference between the treatment and control groups is constant over time in the absence of treatment. Leveraging the rich data set, we combine exact matching and propensity score matching [154] to construct a control group whose pre-treatment behavior and customer characteristics are identical to the treatment group to the extent possible. We defer a detailed discussion of our two-stage matching procedure to Appendix A.2 and summarize the overall idea here. In the first stage, we exactly match on a set of binary variables, including customers' purchase and item-view behavior at both the focal brand level and the category level. We also exactly match customers' self-pickup and parcel shipping patterns. Consequently, the pre-treatment trends of the two groups are the same regarding these binary variables. Nevertheless, the exact matching cannot fully guarantee that the pre-treatment trends are parallel regarding the corresponding numeric variables. For example, the binary outcome "self-pickup or not" does not necessarily characterize how many packages were picked up, which may prevent the two trends regarding the number of self-pickups from being parallel. Moreover, the time-invariant characteristics (such as demographics) of the two groups may also differ. Thus, in the second stage, we adopt propensity score matching to further balance the treatment and control groups over a larger number of features. In this way, the matched groups have not only similar tendencies in their behavior but also comparable time-invariant characteristics, which strengthens the validity of our identification.

After matching, we are able to find the best single match for a total of 139,204 sample claims distributed across 1,032 stations. We summarize the organic sample claims by each brand in Table 2.1. Importantly, the matched treatment and control groups now have the same past purchase, browsing, and station usage patterns (through the first-stage exact matching) and indistinguishable propensity scores based on a wide range of characteristics such as demographics, past spending, and location (through the second-stage propensity score matching). To further quantify the causal relationship, we next perform the DID analysis.
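Before turning to the DID estimation, the two-stage matching above can be sketched as follows: exact matching on binary pre-treatment behavior, then one-to-one nearest-neighbor propensity score matching within each exact-match cell. This is a minimal sketch on synthetic data with hypothetical column names (the actual matching variables are listed in Appendix A.2), not the production implementation used at Alibaba's scale.

```python
# Minimal sketch of exact matching followed by propensity-score matching.
# Column names are illustrative placeholders, not the real data schema.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def two_stage_match(df, exact_cols, ps_cols, treat_col="claimed_sample"):
    """Return a list of (treated_index, control_index) matched pairs."""
    # Propensity of claiming a sample given the richer covariate set.
    ps_model = LogisticRegression(max_iter=1000)
    ps_model.fit(df[ps_cols], df[treat_col])
    df = df.assign(pscore=ps_model.predict_proba(df[ps_cols])[:, 1])

    pairs = []
    # Stage 1: exact match on binary pre-treatment behavior (brand/category
    # purchase and item-view indicators, self-pickup, parcel shipping, ...).
    for _, cell in df.groupby(exact_cols):
        treated = cell[cell[treat_col] == 1]
        available = cell[cell[treat_col] == 0].copy()
        # Stage 2: within each exact-match cell, pick the single control whose
        # propensity score is closest (greedy 1-NN without replacement).
        for t_idx, t_row in treated.iterrows():
            if available.empty:
                break
            j = (available["pscore"] - t_row["pscore"]).abs().idxmin()
            pairs.append((t_idx, j))
            available = available.drop(index=j)
    return pairs

# Illustrative usage with synthetic data.
rng = np.random.default_rng(0)
n = 5000
demo = pd.DataFrame({
    "claimed_sample": rng.integers(0, 2, n),
    "bought_brand_before": rng.integers(0, 2, n),
    "used_self_pickup": rng.integers(0, 2, n),
    "age": rng.integers(18, 60, n),
    "past_spend": rng.gamma(2.0, 50.0, n),
})
pairs = two_stage_match(
    demo,
    exact_cols=["bought_brand_before", "used_self_pickup"],
    ps_cols=["age", "past_spend"],
)
print(f"{len(pairs)} matched treatment-control pairs")
```

The greedy one-to-one match without replacement mirrors the "best single match" used in the study; alternatives such as caliper matching or matching with replacement would only change the second stage.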
Table 2.1: Summary of Free Sample Claims by Brands After Matching

Brand ID   Starting week   # of stations   # of sample claims
1          11              32              2,000
2          17              102             7,589
3          18              730             65,921
4          20              291             26,294
5          20              144             10,134
6          20              143             10,953
7          23              131             8,787
8          24              129             7,526
Total                      1,032           139,204

2.3.3 Main Results on Organic Sample Claim

We follow the literature (e.g., [123, 154]) and specify our linear panel data model as below. Our main estimating equation is

    Outcome_{it} = α + β SampleClaimer_i + γ (SampleClaimer_i × After_{it}) + v_i + δ After_{it} + u_t + ε_{it}.    (2.1)

The variable Outcome_{it} denotes the particular outcome of customer i at week t. The treatment-group dummy SampleClaimer_i is a binary variable indicating whether customer i has ever claimed a sample during our study period (i.e., treatment status). After_{it} is a binary variable that indicates the post-treatment period for each treated user and the matched counterpart; this variable accounts for potential temporal factors that may simultaneously influence the treated and control customers. In addition, v_i is the customer-specific fixed effect, and u_t is the week-specific fixed effect. Note that given the panel structure of our data set, we can utilize variation in the outcomes for each customer over time to control for pre-treatment heterogeneity at a granular customer level, and there is no need to include other time-invariant control variables. Moreover, the week-specific fixed effect controls for any temporal differences in outcomes across all individuals. Finally, we are interested in the estimated value of the interaction parameter γ (i.e., the DID estimate), which represents the average treatment effect on the treated, i.e., the average effect of organic sample claim. In all of our results below, we cluster the standard errors of ε_{it} at each matched treatment and control pair to allow for heteroscedasticity.
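For concreteness, the two-way fixed-effects regression in eq. (2.1) could be estimated, for example, with the linearmodels package, as in the sketch below on synthetic data. Variable names are placeholders; SampleClaimer_i is absorbed by the customer fixed effect, and standard errors are clustered at the matched pair, as described above. This is an illustration under those assumptions rather than the exact code used in the study.

```python
# Sketch of the DID panel regression with customer and week fixed effects.
import numpy as np
import pandas as pd
from linearmodels.panel import PanelOLS

rng = np.random.default_rng(1)
n_pairs, n_weeks = 200, 24
rows = []
for pair in range(n_pairs):
    claim_week = rng.integers(6, 18)          # claim week shared within a matched pair
    for claimer in (1, 0):                    # treated customer and matched control
        cust = 2 * pair + claimer
        for week in range(n_weeks):
            after = int(week >= claim_week)
            y = 0.01 * after + 0.02 * claimer * after + rng.normal(0, 0.1)
            rows.append((cust, week, pair, claimer, after, y))
panel = pd.DataFrame(rows, columns=["customer", "week", "pair",
                                    "claimer", "after", "brand_spending"])
panel["claimer_after"] = panel["claimer"] * panel["after"]
panel = panel.set_index(["customer", "week"])

# SampleClaimer_i is collinear with the customer fixed effect, so only the
# interaction and After_it enter as regressors alongside the two effects.
mod = PanelOLS.from_formula(
    "brand_spending ~ claimer_after + after + EntityEffects + TimeEffects",
    data=panel,
)
res = mod.fit(cov_type="clustered", clusters=panel[["pair"]])
print(res.params["claimer_after"], res.std_errors["claimer_after"])
```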
We present our main results in Table 2.2. The DID estimates for all four outcomes are positive and statistically significant. However, the economic impact of organic sample claims is limited. For instance, it increases brand purchase probability by 0.01% (Column (1)) and the # of item-views by 0.24% (Column (4)) each week on average. Besides our main DID model, we also carry out a series of robustness checks in Appendix A.3; the DID estimates are both qualitatively and quantitatively similar. Overall, our first study on organic interaction suggests that leveraging only the organic traffic at the last-mile stations for commercial activities could yield a low return. This finding differs from previous literature [85, 124, 18] that studied the effect of free sampling in brick-and-mortar retailers. One possible explanation is that, in our case, the organic customers do not visit the stations for shopping, i.e., for the focal brands. For example, less than 1% of the customers are existing customers who had made a purchase during the previous year. Converting the arbitrary walk-in customers (who visit the stations for logistics purposes) to make purchases through sampling may not be as effective as in the retail stores. More importantly, by allowing any walk-in customers to claim free samples, the platform cannot selectively target particular customers, which motivates alternative approaches to improve the effectiveness of such offline interaction. Next, we investigate the causal effect of induced interaction, in which customers are driven from the online channel.

Table 2.2: Causal Effect of Organic Sample Claim
                              (1)              (2)             (3)          (4)
                              Brand purchase   Brand spending  Brand view   # of brand
                              or not           (log)           or not       item-views (log)
SampleClaimer_i × After_it    0.0001           0.0003          0.0025       0.0024
                              (0.0000)         (0.0001)        (0.0001)     (0.0001)
After_it                      0.0002           0.0007          0.0017       0.0019
                              (0.0000)         (0.0001)        (0.0001)     (0.0001)
Customer FE                   Yes              Yes             Yes          Yes
Week FE                       Yes              Yes             Yes          Yes
N                             6,681,792        6,681,792       6,681,792    6,681,792
Robust standard errors clustered at each matched pair in parentheses. * p < 0.1, ** p < 0.05, *** p < 0.01.

2.4 Causal Effect of Induced Interaction
In this section, we design a large-scale field experiment to investigate 1) whether the platform can leverage online notification to causally induce more sample claims, and 2) the causal effect of such induced interaction, respectively.
2.4.1 Experiment Setting and Data
We collaborated with a major domestic brand and launched a three-week free sample distribution across 40 stations in four major cities in August 2018. We first identified all eligible Alibaba users in the vicinity of the stations (409,704 users in total) and randomly assigned 80% of them to the treatment group, which received an online ad in the Taobao mobile app about the free sample distribution at the nearby station. We then exclude customers who did not visit the particular page where we placed the ads. As a result, all customers in the remaining sample were exposed to the same page in the Taobao app, which served as the control ads [113, 99, 98]. The resulting sample size reduces from 409,704 to 189,019. Since the treatment assignment is completely randomized in our original data sample, the assignment should remain randomized conditional on the page exposure. In Table 2.3, we check the balance of various pre-treatment covariates and confirm that the randomization is at work. In this way, the ad creates an exogenous variation in awareness (of the free sample distribution) among a group of customers who had a similar tendency to visit the app in general, which may further boost sample claims.

Table 2.3: Summary Statistics and Randomization Check
                                                  Control         Treatment        P-value
                                                  37,469 (20%)    151,550 (80%)    (Treatment = Control)
Female (%)                                        56.5            56.7             0.12
Age (%)
  ≤ 25                                            35.6            35.7             0.37
  26-30                                           22.4            22.4             0.87
  31-35                                           14.0            14.0             0.86
  > 35                                            28.0            27.9             0.34
Distance to assigned station (2) (%)
  ≤ 500 m                                         84.3            84.2             0.77
Annual platform total orders (count)              95              95               0.11
Online behavior (monthly average over past 3 months)
  Brand active (3) (%)                            0.01            0.01             0.46
  Category active (%)                             44.0            44.4             0.20
  Category orders (count)                         0.45            0.45             0.96
Offline behavior (monthly average over past 3 months)
  Assigned station active user (%)                32.8            32.6             0.56
  Assigned station pickup (count)                 0.88            0.88             0.58
  Assigned station parcel shipping (count)        0.07            0.07             0.69
Notes: (1) Each order corresponds to one item (regardless of the quantity); one purchase could involve multiple orders/items. (2) The assigned station refers to the station that hosts the free sample distribution. (3) “Active” refers to whether a customer made a purchase at the focal brand or category level, or used pickup or parcel-shipping services.
Finally, we define the post-treatment period as five weeks after the free sample distribution ends. For each customer, we compute the outcome variables defined in Section 2.3 and aggregate them to a five-week post-treatment window. Thus, the only difference is that each outcome variable in this Section 2.4 covers five weeks, whereas the outcome variables in Section 2.3 cover one week (because of the customer-week panel structure).
2.4.2 Identification Strategy
We now leverage the randomized online notification to identify the causal impact of induced interaction, which can be broken down into two stages corresponding to the two causal effects mentioned earlier. In the first stage, we estimate the following linear model [59]

SampleClaimer_i = α0 + α1 OnlineAds_i + u_i,   (2.2)

where SampleClaimer_i is a binary variable indicating whether customer i claimed a free sample and OnlineAds_i is a dummy variable indicating whether customer i was notified with the mobile ads. The coefficient α1 represents the causal effect of online notification on sample claims. To estimate the causal effect of induced sample claims, we first use the local average treatment effect (LATE) framework to formally define the induced claimers [7]. There are three types of customers: organic claimer, no claimer, and induced claimer. The organic claimers are those who would always claim samples regardless of the online intervention. (Recall that the organic claimers are also the central focus of Section 2.3.) The no claimers are those who would never claim samples. The induced claimers are those who would claim free samples if and only if notified. These three types of customers (organic claimer, no claimer, and induced claimer) correspond to always-takers, never-takers, and compliers in the LATE framework [7, 91]. We are particularly interested in the causal effect for the induced claimers, i.e., the causal effect of induced sample claim. Following the LATE literature, we carry out a two-stage least squares (2SLS) estimation using OnlineAds_i as an instrumental variable (IV) for SampleClaimer_i:

Outcome_i = γ0 + γ1 \widehat{SampleClaimer}_i + ε_i,   (2.3)

where \widehat{SampleClaimer}_i is the predicted value from Equation (2.2). Note that the 2SLS estimation is built on two assumptions. First, the inclusion criterion requires that the IV (online notification) correlates with sample claims. This assumption holds if the first-stage causal effect is significant, which is shown to be the case, as discussed in the following results. Second, the exclusion restriction asserts that the online notification should not directly affect the final outcomes (e.g., online purchases). Our particular experiment meets this requirement for two reasons. First, the mobile ads did not provide a direct link to the focal merchant’s online store, so the ads could not have directly affected customers’ online purchases or item-views. Second, the stations were not informed of the online intervention, so the control and treatment groups could not have received an additional notification or advertising apart from the online notification. Using the LATE framework, we investigate the LATE coefficient γ1, which represents the causal effect of sample claims for the induced claimers only [91, 7].
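For concreteness, the first stage in Equation (2.2) and the 2SLS in Equation (2.3) can be run with off-the-shelf IV tools. The sketch below is illustrative only: the column names and the use of linearmodels’ IV2SLS are our assumptions, not the exact code used for the tables that follow.

```python
# Illustrative LATE estimation for Equations (2.2)-(2.3): the randomized ad
# (online_ads) instruments the sample claim. Column names are assumptions.
import pandas as pd
import statsmodels.formula.api as smf
from linearmodels.iv import IV2SLS

def estimate_late(df: pd.DataFrame, outcome: str):
    """df: one row per customer with online_ads (0/1), sample_claimer (0/1),
    and five-week post-treatment outcomes such as brand_purchase."""
    # First stage (Equation 2.2): does the ad move sample claims?
    first = smf.ols("sample_claimer ~ online_ads", data=df).fit(cov_type="HC1")

    # 2SLS (Equation 2.3): [endogenous ~ instrument] in linearmodels syntax.
    formula = f"{outcome} ~ 1 + [sample_claimer ~ online_ads]"
    late = IV2SLS.from_formula(formula, data=df).fit(cov_type="robust")
    return first, late

# Usage (assuming `df` holds the 189,019 customers in the experiment):
# first, late = estimate_late(df, "brand_purchase")
# first.params["online_ads"]      -> first-stage effect of the ad
# late.params["sample_claimer"]   -> LATE of an induced sample claim
```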
2.4.3 Main Results on Induced Sample Claim
We estimate Equation (2.2) in Table 2.4. The online notification increases sample claims by 14% (p < 0.05), from a baseline of 1.4% in the control group to 1.6% in the treatment group. Recall that in our study sample of the organic interaction, the platform was able to distribute 139,204 free samples organically (Table 2.1). Our estimate suggests that if we were to leverage online notification and inform nearby customers, the platform could induce 19,489 more sample claims within the same period. Such an increase is economically significant and demonstrates that the platform can indeed mobilize customers using simple ads on mobile app pages (as an online notification). In this way, the induced interaction may complement organic interaction and further boost offline sample claims. This finding extends previous literature (e.g., [66, 122, 166]) on mobile targeting by showing how such an online intervention can effectively influence customers’ participation in commercial activities offline at the last-mile stations.

Table 2.4: Causal Effect of Online Notification on Sample Claims
                   Sample Claims
OnlineAds_i        0.002
                   (0.001)
Constant           0.014
                   (0.001)
N                  189,019
Robust standard errors in parentheses. * p < 0.1, ** p < 0.05, *** p < 0.01.

Note that the significant result in Table 2.4 also confirms that the first assumption of LATE (the inclusion criterion) is met, so we have a strong IV for the following 2SLS estimation (based on Equation (2.3)). We present the results in Table 2.5. Contrary to the organic sample claim, the induced sample claim significantly increases customers’ online purchases at the focal brand. Specifically, the purchase probability at the focal brand is lifted by 3.8% (p < 0.01) in Column (1). However, the induced interaction does not affect customers’ item-views at the focal brand. We further examine the long-term effect by extending the post-treatment period to ten weeks in Table A.3; there is no significant long-term effect. The significant increase in customers’ online purchases at the focal brand is particularly interesting. Compared with the estimated sales conversion rate over five weeks (0.01% × 5 = 0.05%) in the organic interaction (Table 2.2), the induced interaction is highly effective in generating more sales for the focal brand. It suggests that the additional 14% of samples were claimed by a group of induced claimers who may have very different characteristics from the organic claimers. Moreover, the insignificant impact on customers’ item-views of the focal brand implies that the induced claimers possibly experienced the product differently (e.g., through usage of the product instead of browsing items online).

Table 2.5: Causal Effect of Induced Sample Claim
                              (1)              (2)             (3)          (4)
                              Brand purchase   Brand spending  Brand view   # of brand
                              or not           (log)           or not       item-views (log)
\widehat{SampleClaimer}_i     0.038            0.208           0.154        0.131
                              (0.013)          (0.070)         (0.111)      (0.176)
N                             189,019          189,019         189,019      189,019
Robust standard errors in parentheses. * p < 0.1, ** p < 0.05, *** p < 0.01.

To summarize, we have used two large-scale experimental studies to accurately estimate the causal effects of organic interaction (Section 2.3) and induced interaction (Section 2.4), respectively. We find that, compared with the organic interaction, the induced interaction is more effective in increasing sales for the focal brands. Moreover, only the organic interaction is effective in delivering brand impressions.
The different causal effects provide concrete guidelines since the platform can design the offline interaction and online notification to control the proportion of induced claimers compared to the organic claimers. For example, if the focal brand requests to focus more on short-term sales, the platform can choose not to display free samples in the stations so that there will be no organic sample claimer. All sample claimers would be induced claimers only and are more likely to make a purchase afterward based on our analysis. Nevertheless, it is not clear why induced interaction is more effective than the organic interaction in increasing short-term sales for the focal brand. Leveraging our field experiment, we next investigate the underlying mechanism. 26 2.5 Mechanism Underlying the Effectiveness of Induced Interaction: Self- Selection Based on our discussion in Section 3.1, the fundamental difference between the induced and organic inter- actions is that their sample claiming processes are rather different. While organic claimers were exposed to the events at the stations, the induced claimers were informed through their mobile phones when they were away from the stations. As a result, it was more convenient for the organic claimers to claim free samples, whereas the induced claimers might consciously incur an additional traveling cost. It is indeed the case in our experiment as about 13.1% of the claimers in the treatment group claimed free samples without receiving any logistics services (i.e., self-pickup or parcel shipping) within a one-hour window. Motivated by this peculiar observation, we propose a key mechanism—self-selection—as a plausible ex- planation for the behavioral difference between organic and induced claimers. Self-selection has its origin in information economics [149] and asserts that individuals (e.g., job applicants, customers) could choose specific arrangements (e.g., labor contract, technology adoption) based on their observed characteristics or unobserved preferences (please also see [140, 52] and references therein for a typical example in the insur- ance market). In our case, the fact that the induced claimers have undertaken additional trips to claim free samples indicates that they may have a stronger preference to the samples, and they may have significantly different (observed or unobserved) characteristics compared with the organic claimers. Moreover, such induced claims could either adversely or advantageously affect whether customers will purchase at the focal brand in the future. Either the induced claimers are attracted only by the “freebie” (adverse self-selection), or they are genuinely interested in the product and have a stronger intention to try (advantageous self-selection). Our findings in Sections 2.3 and 2.4 show that induced interaction is more effective in increasing customers’ purchases at the focal brand. Thus, it signals that the advantageous self- selection outweighs the adverse self-selection. However, since these two studies are carried out for different 27 last-mile stations in a different time period, it is not necessarily a fair comparison. Using the randomized field experiment, we next provide empirical tests to systematically investigate whether the observed or un- observed characteristics between induced and organic claimers would drive the different outcomes and, if so, advantageously or adversely. Recall that there are three types of customers: organic claimer, no claimer, and induced claimer. 
Based on the definition of these customer types, we can partition our study population into four groups (Table 2.6) based on their treatment status (whether being notified with mobile ads) and first-stage decision (whether claimed a sample). Table 2.6: Claimer Type Online Notification (Randomized) . No (Control) Yes (Treatment) Claim No 1. Induced claimer/ No claimer (36,931) 2. No claimer (149,137) Yes 3. Organic claimer (538) 4. Induced claimer/ Organic claimer (2,413 ) In Table 2.6, we can identify all the claimers in the control group (Group 3) to be organic claimers, because they claimed free samples even without being targeted. Furthermore, the claimers in the treatment group (Group 4) may consist of both induced claimers and organic claimers because both would claim free samples if being targeted. Although it is impossible to identify whether a claimer is organic or induced individually [7, 91], this partition is sufficient for us to identify the systematic difference between organic and induced claimers. 2.5.1 Self-Selection Based On Observed Characteristics To test self-selection based on observed characteristics, we first verify whether organic claimers and induced claimers are different in their observed characteristics, and then examine whether such observed difference would explain the outcomes. By directly comparing the pre-treatment covariates of Group 3 and Group 4 28 (Table 2.7), we find that the claimers in Group 4 had significantly more annual purchases on the platform and were less active in the stations. As the difference is only caused by the additionally induced claimers in Group 4, the induce claimers are generally more active online and less active offline, compared with the organic claimers. Table 2.7: Characteristic Difference of Induced Claimers and Organic Claimers (Based on Table 2.6) Group 3 (Organic) Group 4 (Induced/Organic) P-value Female (%) 73.2 74.8 0.46 Age (%) 25 27.1 26.9 0.92 26 - 30 28.8 26.1 0.20 31 - 35 19.7 19.3 0.82 > 35 24.3 27.7 0.10 Distance to station (%) 500 m 90.9 91.2 0.84 Annual platform total orders (count) 164 180 0.03 Online behavior (monthly average over past 3 months) Brand active (%) 0.19 0.04 0.45 Category active(%) 57.2 56.6 0.79 Category orders (count) 0.99 1.06 0.45 Offline behavior (monthly average over past 3 months) Station active user (%) 92.9 90.6 0.06 Station pickup (count) 5.1 4.9 0.55 Station parcel shipping (count) 1.01 1.48 0.38 p< 0:1, p< 0:05, p< 0:01 We then carry out a sub-sample analysis by partitioning our study population based on the median value of customers’ past platform purchases in Table 2.8. We see that only those “high spenders” are induced to claim samples in the first stage and also have a significant increase in sales at the focal brand. This particular evidence suggests that the customers’ self-selection based on observed characteristics (past purchases) is likely to affect the outcomes in a positive way, i.e., advantageous self-selection. 
29 Table 2.8: Testing for Self-Selection Based on Observed Characteristics Sample Claims Sample Claims Subsample: past total order 95 < 95 OnlineAds i 0.003 0.0003 (0.001) (0.001) Constant 0.018 0.011 (0.001) (0.001) N 95,076 93,943 Robust standard errors in parentheses p< 0:1, p< 0:05, p< 0:01 Brand purchase or not Brand spending Brand view or not # of brand item- views (log) (log) \ SampleClaimer i 0.038 0.211 0.989 0.248 (0.013) (0.075) (1.333) (0.174) N (past platform total order 95 ) 95,076 95,076 95,076 95,076 Robust standard errors in parentheses p< 0:1, p< 0:05, p< 0:01 2.5.2 Self-Selection Based on Unobserved Characteristics To further investigate self-selection based on unobserved characteristics, we ask: can we use the observed characteristics to fully predict the causal effect of induced interaction? If the answer is yes, the platform can directly target customers using observed characteristics. If the answer is no, customers may possess private information about their preference over the focal product, and the offline platform has to leverage the last-mile stations as a screening device to induce the right customers. Consequently, it is vital that the platform takes into account customers’ motives to claim free samples. We employ two empirical tests to answer this question. In the first test, we use observed characteristics to predict whether customers could be induced to claim samples and whether such induced claims would still have a significant impact on the outcomes. Specifically, we add various pre-treatment covariates (defined in Table 2.3) to both stages (Equations 2.2 and 2.3) of our 2SLS estimation [155]. In Table 2.9, the causal effect of the induced sample claim remains statistically 30 significant and is quantitatively the same. It shows that the observed characteristics cannot be used to predict the observed differences in the first-stage sample claims and the second-stage causal effects, which suggests that the self-selection is based on unobserved characteristics. 
Table 2.9: Testing for Self-Selection Based on Unobserved Characteristics (LATE with Control Variables) Brand purchase or not Brand spending Brand view or not # of brand item-views (log) (log) \ SampleClaimer i 0.038 0.206 0.005 0.141 (0.013) (0.069) (1.564) (0.175) Female 0.0001 0.0003 0.285 0.001 (0.00002) (0.0001) (0.002) (0.0002) Age 26-30 0.00003 0.0001 0.086 0.0002 (0.00004) (0.0002) (0.003) (0.0003) Age 31-35 0.00002 0.0002 0.120 0.0001 (0.0001) (0.0003) (0.003) (0.0004) Age> 35 0.00001 0.0001 0.157 0.00005 (0.00004) (0.0002) (0.003) (0.0003) Distance to station 500 m 0.00000 0.00001 0.013 0.001 (0.00003) (0.0001) (0.002) (0.0002) Annual platform total spendings 0.00000 0.00000 0.0001 0.00000 (0.00000) (0.00000) (0.00002) (0.00000) Annual platform total orders 0.00004 0.0002 0.193 0.0004 (0.0001) (0.0003) (0.035) (0.0004) Brand active 0.00004 0.0002 0.002 0.553 (0.0001) (0.0003) (0.128) (0.223) Brand monthly spend- ing 0.00000 0.00001 0.001 0.001 (0.00000) (0.00001) (0.001) (0.003) Category active 0.00001 0.0001 0.252 0.001 (0.00003) (0.0002) (0.003) (0.0002) Category monthly spending 0.000 0.00000 0.00000 0.00000 (0.000) (0.00000) (0.00000) (0.00000) Category monthly or- ders 0.00001 0.00004 0.014 0.0003 (0.00001) (0.0001) (0.001) (0.0001) Station active 0.00005 0.0002 0.011 0.001 (0.00005) (0.0002) (0.003) (0.0003) Station pickup 0.00003 0.0002 0.004 0.0003 (0.00002) (0.0001) (0.001) (0.0001) Station pickup 0.00000 0.00000 0.0004 0.00003 (0.00000) (0.00001) (0.0003) (0.00002) Constant 0.001 0.004 0.571 0.002 (0.0002) (0.001) (0.025) (0.003) N 189,019 189,019 189,019 189,019 Robust standard errors in parentheses p< 0:1, p< 0:05, p< 0:01 31 Our second test relies on a testable prediction of the self-selection theory [49]. As we expect that the ad- vantageous self-selection dominates the adverse self-selection, we should see a positive correlation between customers’ group membership (Group 3 versus Group 4) and outcomes. Moreover, if the advantageous self-selection is based on the unobserved characteristics, the positive correlation should exist even after con- trolling for observed characteristics. We include only customers in Group 3 and Group 4 and regress the direct impact outcome variables on a group indicator (indicating whether each customer belongs to Group 4) together with the pre-treatment covariates (Table 2.10). The group membership still significantly corre- lates with the direct impact outcomes, which suggests that the observed characteristics cannot fully explain the different outcomes of Group 3 and 4. This finding again suggests that the self-selection is based on customers’ hidden preference over the product or intention to try. More importantly, it confirms that such a self-selection mechanism is beneficial for the focal brands by attracting customers who are more interested and more likely to buy, i.e., advantageous self-selection. Last but not least, we carried out a follow-up survey (by phone call) to all 2,951 sample claimers in our experiment one month after the event (Figure 2.1) to shed light on the customers’ hidden preferences. The response rate is 32%. Recall that the claimers in Group 3 and 4 already differ significantly in both observed characteristics (Table 2.7) and outcomes (Table 2.8). It would be helpful if we can, at least to some extent, measure customers’ hidden preference. In the survey, we asked whether they had used the free samples and their experience with it. 
We find that significantly more claimers in Group 4 answered that either they had already used or planned to use the free samples. Such finding provides additional evidence and further supports our argument of the existence of the unobserved advantageous self-selection mechanism. 32 Table 2.10: Testing For Self-Selection Based on Unobserved Characteristics Brand purchase or not Brand spending Brand view or not # of brand item-views (Log) (Log) In Group 4 0.002 0.014 0.001 0.002 (0.001) (0.006) (0.019) (0.013) Female 0.003 0.018 0.158 0.032 (0.001) (0.008) (0.020) (0.010) Age 26-30 0.0003 0.003 0.039 0.012 (0.002) (0.011) (0.020) (0.015) Age 31-35 0.002 0.012 0.084 0.011 (0.003) (0.018) (0.023) (0.017) Age> 35 0.001 0.004 0.097 0.005 (0.003) (0.016) (0.021) (0.017) Distance to station 500 m 0.002 0.011 0.008 0.031 (0.001) (0.005) (0.025) (0.013) Annual platform total spending 0.000 0.00000 0.00000 0.00000 (0.00000) (0.00000) (0.00000) (0.00000) Annual platform total orders 0.00000 0.00002 0.0003 0.00005 (0.00000) (0.00002) (0.0001) (0.00003) Brand active 0.028 0.148 0.765 28.191 (0.046) (0.256) (0.494) (0.420) Brand monthly spend- ing 0.0003 0.002 0.010 0.328 (0.001) (0.003) (0.006) (0.005) Category active 0.001 0.006 0.170 0.008 (0.002) (0.011) (0.020) (0.013) Category monthly spending 0.00000 0.00000 0.00002 0.00001 (0.00000) (0.00000) (0.00001) (0.00001) Category monthly or- ders 0.0003 0.001 0.006 0.007 (0.001) (0.003) (0.004) (0.005) Station active 0.002 0.010 0.001 0.020 (0.001) (0.005) (0.028) (0.019) Station pickup 0.0001 0.0003 0.002 0.001 (0.0001) (0.0004) (0.001) (0.001) Station parcel shipping 0.00001 0.00004 0.0003 0.0002 (0.00001) (0.00004) (0.0003) (0.0002) Constant 0.007 0.039 0.571 0.013 (0.003) (0.018) (0.044) (0.025) N (Keeping only Group 3 and 4 defined in Table 2.6) 2,951 2,951 2,951 2,951 Robust standard errors in parentheses p< 0:1, p< 0:05, p< 0:01 2.6 Target-to-Induce the “Right” Customers So far, we have shown that the platform can systematically leverage the stations as a physical hub to induce self-selection and screen customers with strong preference. However, such advantageous self-selection 33 0.0% 20.0% 40.0% 60.0% No Yes or Plan to Group 3: Organic Group 4: Organic/Induced Figure 2.1: A Follow-up Survey on Usage of Free Samples Notes: We carried out a follow-up survey (by phone call) to all sample claimers one month after the free sample distribution. Out of 2,951 sample claimers, 945 (32%) responded. The percentages of claimers who have answered either they have used the free sample or plan to use are 59% in Group 3 and 68% in Group 4 (p = 0:03), respectively. mechanism cannot be explained by the observed characteristics (Table 2.9 and Table 2.10). It creates an additional challenge when the platform seeks to selectively target certain customers because not all cus- tomers being targeted (based on observed characteristics) would comply and claim samples offline. In this section, we take advantage of a state-of-the-art machine learning technique to illustrate how the platform can leverage its big data accumulated from online to further enhance the effectiveness of offline interaction for a focal merchant and also to provide free samples to those who may benefit the most. Our key idea is that the “right” customers to be induced should correspond to the ones who are 1) likely to be induced from online to claim samples and also 2) likely to make a purchase afterward. 
Specifically, the advantageous self-selection mechanism guides our particular choice of method to find the “right” customers. It implies that only a subset of customers would be induced to claim a sample (first- stage decision), and such self-selection is further associated with the effect of sample claim (second-stage outcome). In other words, the customers’ first-stage decision (whether to claim a sample) is correlated with the final outcomes (whether to purchase at the focal merchant) through both observed customer character- istics and hidden preference. The goal of the platform (and our method) is not to maximize the first-stage sample claim but to “induce” those customers who would have a significant increase in their engagement 34 with the merchant to claim the sample. Thus, we choose a method that can take into account both the first- stage decision and second-stage outcomes to accommodate such targeting. Put another way, the platform needs to target-to-induce the right outcomes at the second stage instead of simply targeting to maximize the first-stage turnout. Instrumental forest [10] is particularly designed to estimate the heterogeneous causal effect of induced sample claim (LATE). The customers with the highest LATE estimates are the “right” (or high-value) customers, which corresponds precisely to induced claimers who are more likely to purchase at the focal brand. Our approach is as follows. We train an instrumental forest (with 2,000 trees as suggested in [10]) on our study population with the binary outcome variable “brand purchase or not” to predict the conversion of purchase behavior at the focal brand. We comprise all the covariates defined in Table 2.3, which includes both customers’ online and offline characteristics. For example, we use the number of monthly self-pickup and parcel shipping to capture customers’ offline activities and annual platform orders to capture customers’ overall engagement with the online channel. Since both the first-stage decision (offline) and the second- stage outcome (online) affect the effectiveness of induced interaction, we expect both online and offline characteristics to be associated with the heterogeneity of LATE. The instrumental forest algorithm essen- tially explores the correlation between these covariates and the LATE. For a typical random forest-based method, we can obtain the importance of each variable (in explaining the variations of LATE) by calculating the total amount of decrease in node impurity across all trees [95, p. 319]. We present the top five most im- portant variables in Figure 2.2. The most important variable is station self-pickup, as one of the key offline characteristics, whereas the second most important variable is the total annual order, which measures the overall online purchasing frequency of a customer. It suggests that these variables are the most useful in ex- plaining the heterogeneity of LATE across different customers. In this way, the machine learning algorithm effectively leverages the rich online and offline data and predicts the purchase probability of each customer if she was being targeted and induced for offline interaction. 
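To make the target-to-induce pipeline concrete, the sketch below illustrates the general recipe: fit an instrumental (causal IV) forest with the randomized ad as the instrument, read off variable importances, and summarize the predictions with a shallow surrogate tree. It is a sketch only: the feature and column names are assumptions, the grf R package provides a reference implementation of instrumental forests, and we assume here that econml’s grf module exposes a comparable CausalIVForest estimator (the class name and fit signature may differ across versions).

```python
# Illustrative target-to-induce sketch: heterogeneous LATE via an
# instrumental forest, variable importance, and a depth-2 surrogate tree.
# Column and feature names are assumptions, not the platform's schema.
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text
from econml.grf import CausalIVForest  # assumed import path / API

FEATURES = ["station_self_pickup", "platform_annual_orders",
            "platform_annual_spending", "category_monthly_spending",
            "station_parcel_shipping"]

def target_to_induce(df: pd.DataFrame, budget_frac: float = 0.05):
    X = df[FEATURES].to_numpy()
    T = df["sample_claimer"].to_numpy()   # endogenous first-stage decision
    Z = df["online_ads"].to_numpy()       # randomized instrument
    y = df["brand_purchase"].to_numpy()   # second-stage outcome

    forest = CausalIVForest(n_estimators=2000)  # 2,000 trees, as in the main text
    forest.fit(X, T, y, Z=Z)                    # assumed signature; may differ
    tau_hat = forest.predict(X).ravel()         # per-customer LATE estimates
    # Node-impurity-based importances are available on fitted forests in
    # sklearn-style implementations (e.g., a feature_importances_ attribute).

    # Interpretable segments: shallow tree trained on the forest's output.
    surrogate = DecisionTreeRegressor(max_depth=2).fit(X, tau_hat)
    print(export_text(surrogate, feature_names=FEATURES))

    # Target the customers with the largest predicted effects, subject to a
    # sampling budget (illustrative fraction of the population to notify).
    return df.assign(tau_hat=tau_hat).nlargest(int(budget_frac * len(df)), "tau_hat")
```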
[Figure 2.2 here: a bar chart of relative variable importance for the five most important variables (Station Self-Pickup, Platform Annual Orders, Platform Annual Spending, Category Monthly Spending, Station Parcel Shipping); x-axis: Relative Variable Importance, 0%-100%.]
Figure 2.2: Relative Variable Importance in the Instrumental Forest
Notes: The importance of each variable is calculated based on the total decrease in node impurity across all trees in the forest. The relative variable importance metric is expressed relative to the most important variable (in this case, station self-pickup).
[Figure 2.3 here: a single decision tree of maximum depth two.]
Figure 2.3: Single-Tree Representation of the Instrumental Forest
Notes: We train a single decision tree with a maximum depth of two on the predicted purchase probability and the five most important variables obtained from the instrumental forest in Figure 2.2.
Finally, to provide more insights into the particular types of customers the platform should target, we train a single decision tree with a maximum depth of two to learn the predicted purchase probability from the instrumental forest using these five most important variables (Figure 2.3). The result shows that the best segment (3.6% of the study population) consists of customers with at least 5 monthly self-pickups and 159 total annual orders, which yields an average purchase probability of up to 6.23%. Our proposed framework further facilitates data-driven decision making by the platform. If we can leverage the rich data to accurately predict the effectiveness of a sample claim for each customer, the platform can strike a balance between cost (the total number of samples distributed) and return (the predicted total number of purchases).
2.7 Conclusion and Discussion
In summary, we provide the first empirical examination of how an e-commerce platform can leverage the last-mile logistics infrastructure as an offline platform to connect merchants and customers. Using free sample distribution as an example, we make a clear distinction between organic interaction (the spontaneous walk-in traffic) and induced interaction (customers driven from the online channel who self-select into claiming a sample) and empirically examine their causal impact on customers’ subsequent online purchases at the focal brands, respectively. Our major contribution to the literature is twofold. First, we document that the organic interaction at the last-mile stations only slightly increases online sales of the focal brands, which suggests that previous wisdom in the retailing context cannot be extended to our case. Second, we show that the induced interaction drives significantly more online sales of the focal brands, and we identify the underlying mechanism—advantageous self-selection. It not only guides the design of offline interaction but also furthers our understanding of the role of last-mile infrastructure in connecting customers and merchants offline. We expect that this phenomenon can be generalized to other cases in the omnichannel retailing setting (see, for example, [168, 166]). Last but not least, we propose a target-to-induce framework using the instrumental forest to enhance the effectiveness of induced interaction.
Establishing local logistics infrastructure has been a fast-growing trend for e-commerce platforms to relieve the burden of last-mile delivery. At the same time, it is equally important to develop a smart logistics network in the offline world and generate additional value from the spare capacity of the logistics infrastructure, according to Jack Ma—Founder of Alibaba [131].
Overall, our proposed idea of an “offline platform” is exciting and opens up new opportunities for e-commerce platforms such as Amazon, JD, and 37 Alibaba, who manage their own logistics network to leverage the last-mile infrastructure as a new intermedi- ary for various offline commercial activities. Beyond practice, we believe that our study may lead to several directions for future research. First, we focus on interventions (e.g., free sample distribution) that are purposefully created to bene- fit the platform’s online merchants. Nonetheless, with proper design, offline interaction may also benefit last-mile logistics itself. For example, the platform may distribute coupons to motivate customers to return products to the stations, which may dramatically reduce the logistics costs for e-commerce platforms [109]. We expect this to be effective because even in the case of our particular intervention (i.e., the offline sample claim is designed to benefit the online merchants rather than the station usage), we have also documented an significant offline spillover. We carry out an additional analysis to assess how organic sample claim af- fects the usage of last-mile stations in Appendix A.5. Interestingly, we found that the weekly number of packages picked up and shipped from the stations increase by 2.39% (p < 0:01) and 0.6% (p < 0:01), respectively. Thus, it is vital for future research to explore the best intervention (e.g., offline distribution or online distribution) to improve last-mile logistics efficiency. Generally speaking, in today’s omnichan- nel world [38, 20], we believe that such connection and integration of the online channel and the offline infrastructure are extremely promising in improving the logistics process itself. Second, our study mainly focuses on how to target for induced interaction using random assignment at the customer level. However, the platform can also selectively target particular stations for organic inter- action. Future research could conduct field experiments at the station level and collect a rich set of station characteristics to address the question. We hope that our experiment and proposed targeting method can pro- vide a framework to address the optimal selection of last-mile stations. Future research may also examine how platforms could combine customer-level and station-level data to maximize the effectiveness of offline interactions. An important question is how the customer-level findings can be integrated with station-level data such as geographic location to facilitate station selection in a data-driven manner. 38 Finally, Section 2.6 illustrates our targeting approach in a retrospective way based on a well-executed field experiment. The study population in a single free sample distribution event is both used for estimation and evaluation of targeting methods. However, in practice, the platform often has to rely on historical data for decision making. For example, the merchants in past free sample distribution events might be different from the one in the new campaign. Or, sometimes, we do not have the luxury to carry out randomized experiments every time a new intervention is proposed (e.g., induced interaction for in-station sales). As a result, it imposes new challenges and forces the platform to rely on limited historical data or experiments to facilitate the targeting decision. It is also imperative for the platform to incorporate the uncertainty into their decision making [81, 101]. 
Nevertheless, our study takes the first step and demonstrates the value of data-driven decision making for the offline platform.
Chapter 3
Commercializing the Package Flow: Cross-sampling Physical Products Through E-commerce Warehouses
3.1 Introduction
Major e-commerce platforms around the world have established their own warehouses to facilitate the storage and delivery of packages. For example, Alibaba—one of the world’s largest e-commerce platforms—delivers approximately 80 million packages every day in China through its warehouses, many of which are managed by Cainiao Network—Alibaba’s logistics subsidiary company. Amazon has also invested in more than 175 fulfillment centers all over the globe [4]. The primary function of these warehouses is to fulfill orders generated from the online marketplace. In this study, we demonstrate that the emergence of these platform-owned warehouses also enables the e-commerce platforms to achieve a natural integration of control and information (Table 3.1) and design new business models in the offline world. “Control” refers to the control of physical products/packages by the offline logistics infrastructure. “Information” refers to the rich data generated from the online marketplace. E-commerce warehouses, i.e., warehouses directly managed by e-commerce platforms, are fundamentally new and different from previous models. On one hand, unlike the warehouses operated by third-party logistics providers such as UPS and FedEx, e-commerce warehouses also have access to the detailed customer and product information accumulated from the online marketplaces. On the other hand, compared to digital platforms such as eBay and Craigslist, e-commerce warehouses also directly control the flow of the physical products and packages.

Table 3.1: Categorization of Business Models by Control and Information
                   Information: No                                      Information: Yes
Control: No        N/A                                                  Pure digital platform (e.g., eBay, Craigslist)
Control: Yes       Third-party logistics provider (e.g., UPS, FedEx)    E-commerce warehouse (e.g., Alibaba-Cainiao, Amazon)

Because of this unique integration of control and information, platforms with e-commerce warehouses actively adapt their business models and expand their commercial activities from the digital world to the physical world. In particular, similar to commercial activities in online marketplaces such as sponsored search or recommender systems that enable the communication of information across brands and customers, the offline infrastructure can serve as a profit center and a specialized business platform to connect brands and customers with physical products. As such, various business innovations have occurred regarding the commercialization of the offline platform [83]. Previous research has mainly focused on the role of digital platforms from the perspective of information sharing [120], such as advertising and targeting [75, 108, 122, 78] and e-commerce [39, 37, 164], or on the logistics perspective of warehouses, such as order fulfillment [16, 153] and package delivery [119, 137]. Much less is known about how e-commerce platforms can leverage the logistics infrastructure for commercial activities. To bridge this gap, we study how physical products and samples from different brands can be delivered together in the same package based on real-time purchase information. Our key idea is to take advantage of the control and information naturally endowed by e-commerce warehouses and commercialize the package flow.
In close collaboration with Alibaba, we propose and implement a novel business practice–cross-sampling–through which physical free samples provided by one brand (which we term the sampling brand) can be distributed with items purchased from another brand (which we term the distributing brand). For example, a customer who purchases a pack of Lay’s chips may also receive a free napkin sample from Fluffy in her package. The two brands (Lay’s and Fluffy in the previous example) do not have to be associated with each other and can be from different categories. As many brands may be matched up with various customers, cross-sampling is a novel instrument for promotion and targeting arising from the business platform.
While free sampling of physical products is an essential promotional strategy for brands selling physical experience goods (e.g., cosmetics products), our proposed cross-sampling practice offers three unique advantages. First, cross-sampling takes advantage of the delivery of existing packages. Instead of incurring a notable delivery cost for free samples, cross-sampling can leverage the millions of packages generated by the e-commerce platform every day. As a result, large-scale cross-sampling can be implemented at negligible cost. Second, while free sampling is often limited to retailers’ existing (in-store) customers, cross-sampling enables the retailers to acquire new customers and expand their customer bases, which are crucial for the business success of the retailers. Through the business platform, one brand may reach many prospective customers across brands and categories. Last but not least, combined with the rich information about each package, cross-sampling can personalize free sampling and maximize its effectiveness in a data-driven manner. Specifically, each package contains two types of information: which item is purchased and which user places the order. Both pieces of information may further improve the effectiveness of customer targeting.
At Alibaba, we proposed and implemented cross-sampling through a large-scale field experiment in 2019. More than 55,000 samples provided by six online brands were distributed. We coordinated with each brand and designed a quasi-experiment on cross-sampling. This study provides the first empirical investigation of its effectiveness in driving online sales of the sampling brand. Our empirical findings echo the aforementioned advantages of cross-sampling.
Third, we develop an analytical framework to explore the competitive advantage of cross-sampling based on the value of information. In particular, we compare our proposed channel to the existing channels for free sampling of physical products regarding the usage of User information: customer’s historical purchases and browsing information; and Item information: real-time purchase information and item association information. Cross-sampling is the only channel to leverage both types of information and distribute physical products to the right customers at the right time. In contrast, the existing channels may be limited to leveraging either one of the two types of information. For example, the recently-emerged free sampling platforms (e.g., try.taobao.com, www.amazon.com/samples) allow customers to apply for free samples provided by a wide range of online brands [117]. The platforms select relevant samples based on user information. Meanwhile, 43 customers do not make actual purchases through these channels, so the platform cannot leverage the real- time purchase information for personalization. On the other hand, brick-and-mortar retailers often bundle promotional samples to associated products based on item information, while giving up the opportunity of customization based on user information. Because of this unique advantage of cross-sampling, we need to understand how to utilize both pieces of information and quantify its value. Thus, we further develop a user-based affinity score to measure the similarity between the focal customers and the typical customers of the sampling brand and an item-based affinity score to measure the similarity between the typical customers who would purchase the particular item in the distributing brand and the typical customers of the sampling brand, respectively. Using our empirical data, we find that the user-based affinity score only partially moderates the causal impact of cross sampling regarding sales, while the item-based affinity score has a positive and significant moderating effect. The magnitude of the moderating effect also measures how useful each type of information can be used to improve the effectiveness of free sampling. Finally, we construct a compound affinity score as the weighted average of the user-based and item- based affinity score to examine the advantage of cross-sampling. We find that the compound affinity score has the largest moderating effect, which suggests that combing both user and item information can generate significant business value. Through a counterfactual analysis, we compare the revenue of cross-sampling with the existing channels under resource constraint. We find that cross-sampling can achieve higher revenue by targeting less customers. And such advantage is more prominent when the resource constraint is tight. In this way, we quantify the value of user and item information (both incorporated in each package) and succinctly describe the relationship between our proposed free-sampling channel and the existing ones. Before proceeding, we summarize our major contribution below: 44 1. We suggest an expansion of business model as e-commerce platforms start to leverage the offline infrastructure for commercial activities. Because of the natural integration of control and informa- tion, the logistics infrastructure can serve as a specialized business platform to connect brands and customers with physical products. 2. 
To commercialize the package flow, we propose a novel business practice—cross-sampling—in which physical products and free samples from two unrelated brands can be delivered together in the same package. In close collaboration with Alibaba, we implement and study cross-sampling through a large-scale field experiment and provide the first empirical evidence of its significance. Moreover, we demonstrate that (1) cross-sampling is cost-effective and scalable; (2) it can be used for new customer acquisition; and (3) it enables effective personalized distribution. 3. We develop an analytical framework to understand the competitive advantage of cross-sampling re- garding the information usage, both qualitatively and quantitatively. We highlight the value of the real-time purchase information and show that leveraging both user and item information can further increase the return of free sampling. The structure of this paper is as follows. We highlight our contribution to the literature in Section 3.2. We then describe our research setting in Section 3.3. We present the empirical results and develop the analytical framework to quantify the value of item and user information in Section 3.4 and Section 3.5, respectively. Finally, we conclude and provide a general discussion in Section 3.6. 45 3.2 Literature Review Our study relates to four major streams of literature. The first stream of literature studies the operational efficiency of logistics infrastructure. The other three streams of research investigates commercial activities that are related to cross-sampling. First, warehousing, as a crucial part of the supply chain, has been the main focus of a long line of research. Earlier studies has developed various methodologies for location selection [17, 65], inventory management [125, 157], and fulfillment [51]. Recent studies have also examined the behavioral element of warehouse workers regarding order-picking [16] and pin-packing [153]. Previous literature has studied warehousing from the viewpoint of a cost center. On the contrary, we contribute to this stream of research by proposing a novel business model that treats the logistics infrastructure as a profit center. The second stream of literature studies the economic impact of free sampling. Most of the existing work focuses on either free sampling of information products online [46, 159, 48, 112, 130] or sampling of physical products offline [124, 85, 18]. These studies investigate free sampling under the setting of in-store customers, who are already aware of the sampling brand. A few recent work has also examined the rating- bias effect of free sampling from the online free-sampling platform’s perspective [117] and how to improve the return of free sampling at the last-mile pickup stations [83]. In these two studies, brands distribute samples to customers who are likely unfamiliar with the brand. We contribute to this line of research by proposing an alternative channel, and advance our understanding on how to increase the effectiveness of free- sampling. Compared to the previous study [83], our proposed cross-sampling channel is more effective in increasing sales and brand impressions, even without personalization. More importantly, as cross-sampling is capable of leveraging both user and item information for personalization across brands and categories, we propose a corresponding analytical framework. We show that the real-time item information is highly valuable in improving the effectiveness of free-sampling. 
46 The third stream of literature studies cross-product promotional strategies within a single brand: cross- selling and product bundling. Cross-selling refers to selling related products or services to existing con- sumers. The existing literature has focused on how to improve the effectiveness of cross-selling by predict- ing the purchase probability of new products using customer information [103, 116], finding the right time [102, 115], or from the portfolio optimization’s perspective [2]. Similarly, product bundling is the bundled sales of two or more products or services [14, 31], which has been proven effective, especially for new products [147, 150, 111, 127]. We connect to this stream of literature because both strategies interact with customers’ real-time purchase decisions. However, they are discussed only within a single brand. We add to the literature by taking the stance of a platform, which allows one brand to connect with customers from other brands with physical products. We also shed some light on a new direction for cross-product strategies. Based on our proposed business practice, it is also possible to allow one brand to bundle its products with products from another brand. Finally, we contribute to the literature that studies cross-brand promotional strategies from a platform’s perspective: recommender systems and cross-promotion. Recommender systems is a popular methodol- ogy in online platforms to recommend the right products to the right customers at the right time [139, 1]. The common methods include collaborative filtering [142], content-based filtering [121], and hybrid rec- ommender systems [40]. Each method leverages user information or item information and relies on a basic intuition that, for example, similar customers would also like similar products. Recently, [108] show how a mobile App platform (such as iOS or Android) can match different but similar apps based on user infor- mation and app description (item information) for cross-promotion. Similarly, cross-sampling practice can match complementary brands based on user’s real time purchase information. We build on previous liter- ature and leverage both user and item information to improve the effectiveness of cross-sampling. While only information such as ads, deals, or product features is recommended to the customers in existing stud- ies, our study contributes to this stream of literature by expanding the recommendation of information to 47 the distribution of physical products. Moreover, we propose a practical scoring system and document an inverted U-shape of the value of user and item information, which helps to find the optimal combination of both types of information. 3.3 Research Settings In collaboration with Alibaba Group and its logistics subsidiary company Cainiao Network, we proposed and implemented cross-sampling through a large-scale quasi-experiment during February and March 2019. We worked closely with the platform and recruited six online brands to participate in the pilot project. The goal of the experiment was to implement cross-sampling at scale and generate a proper set of field data to identify the causal effect of cross-sampling. In this section, we showcase the particular implementation, discuss the experiment design, and describe our data preparation process. 3.3.1 Implementation Procedure of Cross-Sampling To implement cross-sampling at scale, it is vital for us to fully leverage the existing warehousing system. 
As the Alibaba’s logistics arm, Cainiao warehouses store and deliver packages for many of its online brands, including the six brands that we were collaborating with. Moreover, online brands such as cosmetics brands frequently carry out free sampling in its online store, for instance, to promote new products. Consequently, the centralized warehousing system distributes free samples with purchases for many brands on a daily basis. To implement cross-sampling, we need to adjust the information system and allow one brand to distribute free samples provided by another brand, potentially from a different category (so as to avoid direct competition between similar brands). Specifically, for each round of our experiment, we invited one brand from the cosmetic product category and one brand from the daily consumption product category. At the beginning of each round, one brand 48 would provide, for example, 10,000 identical free samples. This brand is what we refer to as the sampling brand. To leverage the existing system, the platform would change the ownership of these samples from the sampling brand to its counter-party, i.e., the distributing brand. Finally, the distributing brand carry out a free sample promotion in its online store and treat the free samples provided by the other brand as their own. So any customers who purchased from one brand’s online store would receive a free sample provided the other brand in their packages. We observe that a batch of 10,000 free samples can be distributed in just a few days as the free samples are distributed with the online orders in real-time. By leveraging the warehousing system, the number of samples distributed each day potentially can match the large number of online orders. More importantly, leveraging the existing package flow, cross-sampling does not result in an additional delivery cost and only a per-unit cost of several cent in the order-picking procedure. In this way, our implementation suggests that cross-sampling can indeed be carried out at a large scale with negligible cost, which confirms its first unique advantage (discussed earlier in Section 3.1). Our experiment had six rounds. Each of the six brands provided free samples in some round and dis- tributed samples for another brand in some other round. That is, all brands took on the role of both a sampling brand and a distributing brand. The main focus of our empirical study is the causal effect of cross- sampling on the sampling brand. With this in mind, we next elaborate the elements of the experiment that allows us to cleanly identify such causal effect. 3.3.2 Experiment Design Due to some practical constraints, we were not able to implement a complete randomization of cross- sampling, i.e., the treatment. Because the brands would like to distribute samples for some consecutive 49 time intervals, we designed a quasi-experiment in which the treatment was switched on and off at the spe- cific time points of a day (e.g., beginning of the shift, middle of the shift). Moreover, the treatment and control group were randomized into the resulting time segments across days, so no group is always occupy- ing a specific time slot of the day. By this way, we aimed to maintain the exogeneity of the treatment to the best extent possible. Throughout the experiment, all customers were only informed by a banner ad on the front page of the distributing brand’s online store, saying that all orders would be randomly selected to receive a free sample. 
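A stylized sketch of this kind of on/off schedule is below. The number of daily segments, the date range, and the on/off ratio are made-up illustrations; the actual cut points followed the warehouse shifts described above.

```python
# Stylized sketch of the quasi-experimental schedule: each day is cut into
# fixed time segments, and which segments distribute a free sample
# ("treatment on") is re-randomized across days, so that neither condition
# always occupies the same slot of the day. All values are illustrative.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2019)
segments = ["start_of_shift", "mid_shift", "end_of_shift"]
days = pd.date_range("2019-02-01", "2019-03-31", freq="D")

rows = []
for day in days:
    on_flags = rng.permutation([True, True, False])  # e.g., 2 of 3 segments on
    for segment, on in zip(segments, on_flags):
        rows.append({"day": day.date(), "segment": segment, "treatment_on": bool(on)})
schedule = pd.DataFrame(rows)

# Each order is later assigned to treatment or control according to the
# (day, segment) in which it was placed, which is exogenous from the
# customer's point of view.
```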
After the experiment, customers were sorted into two groups. One group of customers received a free sample from the sampling brand after purchasing at the distributing brand, i.e., the treatment group. The other group of customers also purchased at the distributing brand but did not receive a free sample, i.e., the control group. Note that both groups did not know whether they would receive a free sample before they received the packages (and the underlying order-time-based mechanism that determines whether a customer receives a sample). The only difference between these two groups was when they placed the orders (and the resulting assignment). For this reason, we can claim that the treatment was exogenously determined from the customers’ point of view. Furthermore, we couple our analysis with a propensity score matching (PSM) procedure to refine the balance of the treatment and control group. We next describe the detail of our data collection process and the matching procedure. 3.3.3 Data We first construct a rich customer-level data from the field. During the experiment, a total of 55,434 samples were distributed. We focus on customers who only purchased one item, which leaves us with 52,288 cus- tomers. There were 139,518 customers who also made a single-item purchase but did not receive samples 50 during the experiment. Based on the user ID of these 191,806 customers, we collected detailed purchase and browsing data. At the same time, we appended the subcategory, spending, and discount rate of each item purchased from the distributing brand. Based on the dataset, we carry out a clustered one-to-one PSM on a wide range of pre-treatment co- variates. Specifically, we estimate a logistic regression of the treatment status on the following covariates. We include gender, age segment (25; 26 30; 31 35; > 35), annual total number of orders, and annual total spending from the online marketplace to capture the demographics and overall spending power. We also compute the number of item-views and spending at the sampling brand, sampling category, distributing brand, and distributing category three months before the treatment, respectively. In addition, we include a sequence of the week-level of these variables up to 8 weeks before treatment. These week-level variables measure, for example, the dynamics of customers’ purchase behavior [154, 83]. Finally, we include the spending and discount rate of the items purchased from the distributing brand. Overall, we have constructed a rich set of covariates at both the customer level and the item level. We then find the closest one-to-one match [165, 83, 154] based on the predicted propensity scores clustered at each-item subcategory and each distributing brand. In other words, each treated customer is matched with a control customer who purchased a similar item under the same subcategory (e.g., lipstick) and at the same distributing brand (e.g., L’Oreal). Note that each brand sells hundreds of SKU spanning from 15 to 30 subcategories. Throughout our analysis, we will use subcategory to represent the identity of each item, which can also be considered as a type of item. After matching, we are left with 41,566 customers in the treatment and control group, respectively. We summarize the number of treated customers before and after matching by each brand in Table 3.2. 
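Although the exact implementation belongs to the platform, a minimal sketch of the clustered one-to-one matching step described above might look as follows, assuming a customer-level DataFrame with a 0/1 treatment column and the pre-treatment covariates. The column names are hypothetical, and for brevity the sketch matches with replacement, whereas the procedure described here is one-to-one.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def clustered_psm(df, covariates, treat_col="treated",
                  cluster_cols=("distributing_brand", "subcategory")):
    """Propensity score matching clustered by distributing brand and subcategory."""
    # 1. Estimate propensity scores with a logistic regression of the
    #    treatment status on the pre-treatment covariates.
    logit = LogisticRegression(max_iter=1000)
    logit.fit(df[covariates], df[treat_col])
    df = df.assign(pscore=logit.predict_proba(df[covariates])[:, 1])

    pairs = []
    # 2. Match within each cluster, so every treated customer is paired with
    #    a control customer who bought a similar item at the same brand.
    #    (Matching here is with replacement for simplicity.)
    for _, cluster in df.groupby(list(cluster_cols)):
        treated = cluster[cluster[treat_col] == 1]
        control = cluster[cluster[treat_col] == 0]
        if treated.empty or control.empty:
            continue
        nn = NearestNeighbors(n_neighbors=1).fit(control[["pscore"]])
        _, idx = nn.kneighbors(treated[["pscore"]])
        pairs.append(pd.DataFrame({
            "treated_id": treated.index.to_numpy(),
            "control_id": control.index.to_numpy()[idx.ravel()],
        }))
    return pd.concat(pairs, ignore_index=True)
```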
Although the matching rate may vary across different rounds of campaigns, we achieve an overall matching rate of approximately 80%, which means that our matching procedure preserves 80% of the raw data. 51 Table 3.2: Number of Treated Customers Before and After Matching Sampling Brand Before Matching After Matching Matching Rate (%) Cosmetic 1 3,908 3,803 97.31 Consumption 1 8,089 7,742 95.71 Cosmetic 2 17,867 10,279 57.53 Consumption 2 15,574 12,942 83.10 Cosmetic 3 4,615 4,602 99.72 Consumption 3 2,235 2,198 98.34 Total 52,288 41,566 79.49 Notes: We anonymize the identity of the six brands. For example, “Cosmetic 1” refers to the online brand in the first round of cross-sampling from the cosmetic product category. As the goal of our experiment is to cleanly identify the causal effect of cross-sampling, it is important that the treatment can be considered as if completely-randomized. We now proceed to verify the effectiveness of both our experimental design and matching procedure in two ways. First, we plot the density curve of the propensity scores for the treatment, matched control, and raw control group in Figure 3.1. As we can see, the treatment and matched control group have almost identical distributions of the propensity scores. It suggests that after the PSM, the two groups have a similar propensity to receive the treatment. In other words, the treatment assignment is similar to a randomized assignment. Notice also that the density curve for the raw control group (i.e., the control group before matching) is not very different from that of the treatment group due to our particular experimental design, which ensures the exogeneity of the treatment. Second, to further examine the improvement of covariate balance, we summarize the pre-treatment covariates of the treatment and control group before and after matching in Table 3.3. We again confirm that there is no statistical difference between the two groups after matching regarding these covariates. 
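The covariate balance comparison referenced in Table 3.3 amounts to comparing group means covariate by covariate. A minimal sketch of such a check is below, assuming the same hypothetical customer-level DataFrame; the exact tests used in the dissertation may differ.

```python
import pandas as pd
from scipy import stats

def balance_table(df, covariates, treat_col="treated"):
    """Compare covariate means across groups with Welch two-sample t-tests,
    mimicking the before/after-matching comparison reported in Table 3.3."""
    treated = df[df[treat_col] == 1]
    control = df[df[treat_col] == 0]
    rows = []
    for cov in covariates:
        _, p_value = stats.ttest_ind(treated[cov], control[cov], equal_var=False)
        rows.append({"covariate": cov,
                     "control_mean": control[cov].mean(),
                     "treatment_mean": treated[cov].mean(),
                     "p_value": p_value})
    return pd.DataFrame(rows)

# Run once on the raw sample and once on the matched sample to obtain the
# "Before Matching" and "After Matching" panels.
```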
52 0.00 0.25 0.50 0.75 1.00 Propensity Score Density Group Treatment Matched control Raw control Figure 3.1: Density of the Propensity Scores by Group Table 3.3: Summary Statistics Before and After Matching Before Matching After Matching Control Treatment p-Value Control Treatment p-Value Female (%) 81.90 80.24 0.00 80.52 80.40 0.65 Age (%) 25 59.72 57.63 0.00 58.54 58.39 0.64 26 - 30 16.55 17.28 0.00 16.70 16.86 0.53 31 - 35 8.83 9.61 0.00 9.27 9.29 0.95 > 35 14.89 15.47 0.00 15.48 15.46 0.94 Annual total orders 185.00 187.29 0.35 178.23 178.63 0.90 Annual total spending (RMB) 15,021.99 15,887.17 0.00 14,800.12 14,891.51 0.74 Purchased both brands during previous year (%) 1.89 2.61 0.20 2.29 2.30 0.95 Purchased both categories during past three months (%) 47.73 49.72 0.00 47.36 47.22 0.70 Sampling brand Existing customer 1 (%) 6.31 6.93 0.00 6.73 6.75 0.90 Spending during past three months (RMB) 1.18 1.39 0.01 1.32 1.30 0.77 Viewed items during past three months (%) 10.69 11.29 0.00 11.04 10.97 0.78 Distributing brand Existing customer (%) 22.82 29.86 0.00 26.74 26.67 0.80 Spending during past three months (RMB) 9.17 10.29 0.16 9.25 8.97 0.30 Viewed items during past three months (%) 50.74 52.04 0.00 50.11 49.65 0.18 Sampling category Spending during past three months (RMB) 169.17 196.27 0.00 156.50 154.84 0.63 Number of item-views during past three months 39.47 41.88 0.00 34.90 34.46 0.35 Distributing category Spending during past three months (RMB) 168.92 188.85 0.00 172.69 171.22 0.59 Number of item-views (count) 38.50 44.31 0.00 42.39 42.03 0.48 Single-item purchase from the distributing brand Spending (RMB) 70.79 74.14 0.00 75.08 75.07 0.97 Discount rate (%) 42.97 40.58 0.00 38.96 39.15 0.17 N 139,518 52,288 41,566 41,566 Notes: 1 We define existing customers as those who have have made at least one purchase at the particular brand during previous year. 53 3.3.4 Outcome Variables Finally, we define the following outcome variables to capture the impact of cross-sampling on the sampling brand’s online sales and browsing traffic, respectively. 1. Spending: the total amount of expenditure in RMB. 2. Item-views: the number of item-views at the focal online store. Based on this definition, we plot the outcome trends by each group in Figure A.2. We have three important observations. First, before matching, the treatment group and the raw control group have fairly similar pre- treatment outcome trends regarding both spending and item-views at the sampling brand, which is again due to our carefully-designed quasi-experiment. The similarity of pre-treatment outcome trends suggests that the treatment is an exogenous shock. Second, after matching, we find that the treatment group and the matched control group have very similar pre-treatment outcome trends, which results from our PSM procedure. Specifically, because we have included sequences of week-level variables of these two outcomes when estimating the propensity scores (i.e., the logistic regression), the matching procedure would also try to balance the pre-treatment outcome trends of the two groups. It further accounts for the difference between the treatment and control group that has not already been captured by the pre-treatment covariates. Last but not least, based on Figure A.2, we can also see significant increases in both outcomes in the treatment group, compared with the matched control group, after cross-sampling. It signals a significant causal effect as we can visualize it from the data before carrying out any analysis. 
To further quantify such an effect, we next perform our econometric analysis.

[Figure 3.2: Spending and Item-view Trends by Group. Panel (a): Weekly Spending at the Sampling Brand (RMB); Panel (b): Weekly Item-views at the Sampling Brand. Both panels plot the treatment, matched control, and raw control groups over relative weeks -8 through 12.]

3.4 Causal Impact of Cross-sampling on the Sampling Brand

Our main estimation equation is
\[
\text{Outcome}_i = \alpha + \beta \, \text{Cross\_Sampling}_i + \varepsilon_i, \qquad (3.1)
\]
where Cross_Sampling_i is the treatment variable, indicating whether customer i has received a free sample in her package, and Outcome_i denotes the particular outcome (spending or item-views) of customer i at the sampling brand's online store. We are interested in the estimated value of the parameter \beta, which represents the average treatment effect of cross-sampling. In what follows, we decompose the outcome period into 1-4 weeks, 5-8 weeks, and 9-12 weeks after the treatment, respectively, to examine the short-term and long-term effects. We also report robust standard errors to allow for heteroscedasticity. The estimation results based on Equation (3.1) are presented in Table 3.4.

Table 3.4: Causal Impact of Cross-sampling on the Sampling Brand

                     (1-4 weeks)            (5-8 weeks)            (9-12 weeks)
                  Spending  Item-views   Spending  Item-views   Spending  Item-views
Cross Sampling     0.218     0.115        0.082     0.030        0.052     0.021
                  (0.045)   (0.006)      (0.035)   (0.004)      (0.035)   (0.004)
Constant           0.203     0.054        0.150     0.046        0.155     0.040
                  (0.023)   (0.003)      (0.020)   (0.002)      (0.023)   (0.002)
N                  83,132    83,132       83,132    83,132       83,132    83,132
R^2                0.0003    0.004        0.0001    0.001        0.00003   0.0003
Robust standard errors in parentheses; p<0.1; p<0.05; p<0.01

As the table shows, cross-sampling has a significant impact on both spending and item-views during the following month. Specifically, the treatment group, on average, spends 0.218 RMB (p < 0.01) more than the control group (0.203 RMB) in the sampling brand's online store, which yields a 0.218/0.203 = 107% relative increase. Similarly, the number of item-views is lifted by 0.115/0.054 = 213%. Furthermore, the causal effects on item-views are significant for all the outcome periods (1-4, 5-8, and 9-12 weeks after treatment), which suggests that cross-sampling is also effective in increasing brand impressions in the long run.

The results in Table 3.4 verify the second advantage of our proposed cross-sampling channel: new customer acquisition. As shown in the summary statistics (Table 3.3), less than 7% of the customers had made a purchase at the sampling brand in the previous year, which means that the majority (93%) were not very familiar with the sampling brand. And yet, our empirical findings show that cross-sampling is highly effective in converting prospective customers into paying ones and in increasing brand awareness even in the long run. As such, our proposed cross-sampling practice (even without any personalization) can serve as an attractive alternative for online brands that sell physical experience goods to further expand their customer bases.

We further carry out two robustness checks to validate our estimates of the average treatment effect. First, our main model in Equation (3.1) is built on a key assumption that the treatment variable Cross_Sampling_i is exogenous.
To test this assumption, we estimate the following model with control variables Outcome i = +Cross Sampling i + T x i + i ; (3.2) where x i includes all the pre-treatment covariates defined earlier in Table 3.3. IfCross Sampling i is en- dogenous, the treatment variableCross Sampling i might be correlated with some of the control variables x i , which means that our previous results in Table 3.4 are biased. As a result, we may have different es- timates after including the control variables here. IfCross Sampling i is indeed exogenous, on the other hand, we should obtain the same results. We defer the detailed estimation table in Appendix B.1 Table B.1. We find that all our results are both qualitatively and quantitatively the same, which confirms the robustness of our estimates. Second, another alternative explanation of our main results in Table 3.4 is that the significant difference in outcomes after treatment might be driven by the different pre-treatment trends of the treatment and control group. To make sure that this is not the case, we carry out a falsification test, where we replace the dependent variables as the pre-treatment outcomes (Appendix B.1 Table B.2). For example, we define the outcome “Spending” as spending at the sampling brand 1-4 and 5-8 weeks before the treatment week. We find that there is no significant difference between the two groups during the two months before the treatment, which again enhances the validity of our results. 57 3.5 Value of User and Item Information As discussed in Section 3.1, cross-sampling has three unique advantages. So far, we have shown that it can be implemented at scale with negligible cost and is highly effective in expanding customer bases. Both advantages are primarily due to the “control” over physical products by the logistics infrastructure. The last important advantage is that cross-sampling can leverage the rich “information” contained in each package (both user and item information) for personalization and further improve its effectiveness. To explore this advantage, we develop an analytical framework to understand the competitive advantage of cross-sampling regarding the usage of information. The structure of this section is as follows. We first qualitatively compare our proposed free sampling channel to the existing ones in Section 3.5.1 using a two-by-two matrix. We then propose an affinity scoring system to utilize the user and item information in Section 3.5.2. Using our experimental data, we further examine how each affinity score moderates the causal effect of cross-sampling in Section 3.5.3. Finally, in Section 3.5.4, we leverage both user and item information and compare the performance of different free sampling channels through a counterfactual analysis. 3.5.1 Comparing to the Existing Channels We first compare our proposed free-sampling channel to the existing channels for free sampling of physical products. Recall that we can decompose information contained in the package flow into user informa- tion (customer’s historical purchases and browsing information) and item information (real-time purchase information and item association information). Depending on whether a particular channel can leverage such information for free sampling of physical products, we can characterize the existing channels into a two-by-two information matrix in Table 3.5. 58 Table 3.5: Information Matrix for Free Sampling of Physical Products Item Info. . No Yes User Info. No 1. Organic distribution (e.g., in a mall/supermarket) 2. 
Bundling to item (e.g., brick-and-mortar retailers) Yes 3. Platform-based free-sampling (e.g., try.taobao.com, www.amazon.com/samples) 4. Cross-sampling As a benchmark, Quadrant 1 is the most traditional way of distributing free samples. For example, a food brand may distribute peanut butter samples in a supermarket. Since any customers who walk by can claim a free sample, the brand cannot leverage the user information (e.g., demographics) or item information (e.g., whether a customer just brought bread) to customize the distribution. Put another way, the targeted customers are determined only at the aggregate level, such as geographic region. Quadrant 2 is commonly seen at brick-and-mortar retailers. For instance, a cosmetic brand may bundle a free perfume to each facial mask for sales. The targeted customer segment is those who buy facial masks. In this case, the real-time purchase behavior, i.e., item information, is exploited as a mechanism to segment customers for free sampling. However, this channel cannot selectively target individual customers based on user information (In some cases, retailers may have aggregated user information for each item. For example, they may know, on average, what percentage of females who have purchased each item. But the user information cannot be used directly for personalization.). Quadrant 3 emerges only recently due to the popularity of e-commerce platforms. Many e-commerce platforms have launched online free sampling platforms (e.g., try.taobao.com, www.amazon.com/samples), where customers can apply for free samples provided by a wide range of online brands [117]. The platforms 59 select relevant samples based on user information. Meanwhile, customers do not make actual purchases through these channels, so the platform cannot leverage the real-time purchase information for personaliza- tion. Finally, our proposed cross-sampling is the only channel (Quadrant 4 in Table 3.5) to leverage both user and item information and distribute physical products to the right customers at the right time. Note that although many digital platforms (e.g., eBay, Craiglist) have access to both types of information, they lack the control of physical products. Because of the natural integration of control and information, e- commerce warehouses have the ability to selectively distribute free samples inside packages generated by various customers (based on user information) who purchase different items (based on item information). 3.5.2 Measuring the “Right” Package Recognizing such an advantage, we need to understand how to utilize the user and item information and maximize its effectiveness in a data-driven way. It corresponds to an important practical question: Given the rich information, which package should we choose to distribute the free sample to maximize the effective- ness of cross-sampling? To answer this question, we propose a practical scoring system to exploit the rich user and item information and measure the “right” package for cross-sampling. Our approach is motivated by a basic observation. That is, similar customers should also behave sim- ilarly. For example, if the free sample is distributed to a customer who has similar characteristics as the existing customers of the sampling brand, we should expect that she is likely to make a purchase. This notion of “similarity” is the foundation of many data-driven decision-making tools and machine learning algorithms [139, 1, 108, 81]. One of the most related examples is recommender systems. 
Regardless of their different forms [40, 142, 121], recommender systems are all developed based on the similarity between customers (or sometimes items). For example, a typical recommendation is "what similar users also purchased". Building on the previous literature, we develop a scoring system to precisely quantify this notion of "similarity" using user and item information, respectively.

Our general idea is the following. We define a user-based affinity score to measure the similarity between the focal customer (who places the order) and the existing customers of the sampling brand. We expect that if the focal customer has characteristics similar to the sampling brand's customer base, i.e., a higher user-based affinity score, she is more likely to make a purchase after receiving the sample. For example, if the existing customers of a sampling brand are mostly female and between the ages of 25 and 30, the user-based affinity score should assign a higher value to customers of the distributing brand with similar demographics, whom we expect to be more likely to buy if provided the free samples.

We also define an item-based affinity score to measure the similarity between the customers who usually purchase the focal item (the item ordered by the focal user) and the existing customers of the sampling brand. If the typical customers who usually purchase the focal item have characteristics similar to the sampling brand's customer base, we expect that the focal customer who just bought the item (and thus has a high item-based affinity score) is likely to purchase after receiving the sample. Using the same example as in the previous paragraph, if most customers who used to purchase a particular item are also female and between the ages of 25 and 30, the item-based affinity score should assign a high value to that item, and we expect that free samples (from the sampling brand) bundled with this item would generate a larger return from free sampling than other items. In this way, the item-based affinity score also measures the complementarity between an item and the sampling brand.

Along these lines, we now introduce some notation and formally define the user-based and item-based affinity scores, respectively. For both scores, we will use the matrix
\[
S = \begin{pmatrix} s_1^\top \\ s_2^\top \\ \vdots \\ s_m^\top \end{pmatrix} \in \mathbb{R}^{m \times p}
\]
to represent the characteristics of customers from the sampling brand. Each row is a vector of characteristics s_j of customer j = 1, ..., m. There are a total of p covariates and m customers. This matrix specifically captures the distribution of customers who recently purchased at the sampling brand, for example, during the past three months. Thus, we may refer to S as the sampling brand's customer base, which can be easily obtained from historical data. Both affinity scores are designed to measure the similarity between a user or an item and the sampling brand's customer base S.

User-based Affinity Score: We use u in R^p to denote the vector of characteristics of the focal user, i.e., the customer who places the order. Then, the user-based affinity score is defined based on the Mahalanobis distance
\[
-\sqrt{(u - a)^\top \Sigma^{-1} (u - a)}, \qquad (3.3)
\]
where a in R^p is the mean vector of the sampling brand's customer base S and \Sigma \equiv S^\top S / m is the variance-covariance matrix. Mahalanobis distance is a common metric for covariate matching [81]. It measures the distance between a point (u) and a cluster of points (S) in the p-dimensional space.
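As a minimal numerical illustration of Equation (3.3), the sketch below computes the user-based affinity score with NumPy; S and u are hypothetical inputs, and a small ridge term guards against a singular covariance matrix. This is only a sketch, not the production scoring code.

```python
import numpy as np

def user_affinity_score(u: np.ndarray, S: np.ndarray, ridge: float = 1e-6) -> float:
    """Negative Mahalanobis distance between the focal user u (shape (p,))
    and the sampling brand's customer base S (shape (m, p)), as in Eq. (3.3)."""
    m, p = S.shape
    a = S.mean(axis=0)                       # mean vector of the customer base
    sigma = S.T @ S / m                      # variance-covariance matrix as defined above
    sigma_inv = np.linalg.inv(sigma + ridge * np.eye(p))
    diff = u - a
    return -float(np.sqrt(diff @ sigma_inv @ diff))
```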
The inverse variance-covariance matrix is applied to take into account the different magnitudes (or units) of the dimensions. For example, one covariate may be "age" (ranging from 18 to 80), whereas another is "annual total spending" (ranging from 60,000 to 200,000). We put a negative sign in front to measure closeness instead of distance. As such, the user-based affinity score measures the similarity between the focal customer u and the sampling brand's customer base S. The higher the user-based affinity score is, the more similar she is to the sampling brand's customer base.

Item-based Affinity Score: To operationalize the item information, we further define a matrix I in R^{n x p} similarly to the sampling brand's customer base S. Given a focal item, i.e., the item purchased by the focal user, each row i_k of I is the vector of characteristics of a customer k = 1, ..., n who recently purchased the item from the distributing brand. So the matrix I represents the distribution of the customers who usually purchase the particular item and can also be obtained from historical data. We then define the item-based affinity score based on a discrete version of the L_1 Wasserstein distance, which can be computed using the following optimization problem:
\[
\begin{aligned}
\min_{\pi} \quad & \sum_{k=1}^{n} \sum_{j=1}^{m} \| i_k - s_j \|_{L_1} \, \pi_{kj} \\
\text{s.t.} \quad & \sum_{j=1}^{m} \pi_{kj} = 1/n \quad \forall k = 1, \ldots, n, \\
& \sum_{k=1}^{n} \pi_{kj} = 1/m \quad \forall j = 1, \ldots, m, \\
& \pi_{kj} \ge 0 \quad \forall k = 1, \ldots, n \text{ and } j = 1, \ldots, m.
\end{aligned} \qquad (3.4)
\]
The notation \| \cdot \|_{L_1} indicates the L_1 norm (e.g., \| b \|_{L_1} = |b_1| + |b_2| + ... + |b_p|). The Wasserstein distance is a popular metric for measuring the distance between two distributions [134]. In this manner, the item-based affinity score effectively measures the similarity between the typical customers I who usually purchase the focal item and the sampling brand's customer base S. The higher the item-based affinity score is, the more similar the focal item is to the sampling brand regarding the characteristics of their customers.

Note that the definition above is not necessarily the only or best way to formalize our general idea. For example, the item-based affinity score could also be defined based on other types of distributional distance measures, such as the KL-divergence or the χ² distance [134]. Nonetheless, our key idea of measuring the notion of "similarity" (between a user or an item and the sampling brand's customer base) is general and intuitive, and our particular construction here is a sensible way to formalize such an idea.

More importantly, there are several favorable properties of our proposed framework. First, we now have a practical tool to summarize a wide range of information (or data) into a single score, which can be used to rank each package. We expect that higher affinity scores would lead to a greater return from cross-sampling. Second, our framework is not particularly constrained by the size of the data. Specifically, the user-based affinity score can be calculated efficiently as long as p is not too large (e.g., p = 100,000). For the item-based affinity score, the optimization problem does not grow with p and can be solved efficiently for large n or m, since there exist several efficient algorithms for linear optimization problems. Finally, both affinity scores are based on the same set of p covariates. This is important, especially when we want to compare the two scores. Since both scores leverage the same source of information, any differences in the performance of the two scores would be driven only by the structure of the information.
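Similarly, Equation (3.4) is a standard transportation linear program and can be solved with off-the-shelf LP solvers. Below is a minimal sketch using scipy's linprog; the inputs I and S are hypothetical, and the returned value is the raw distance (the scores are then winsorized and normalized to a 0-100 scale, as described in Section 3.5.3).

```python
import numpy as np
from scipy.optimize import linprog

def l1_wasserstein(I: np.ndarray, S: np.ndarray) -> float:
    """Discrete L1 Wasserstein distance between the empirical distributions
    of I (n, p) and S (m, p), via the transportation LP in Eq. (3.4)."""
    n, m = I.shape[0], S.shape[0]
    # Cost c[k, j] = || i_k - s_j ||_1, flattened row-major to match pi_{kj}.
    cost = np.abs(I[:, None, :] - S[None, :, :]).sum(axis=2).ravel()

    # Row-sum constraints: sum_j pi_kj = 1/n for each k.
    A_rows = np.zeros((n, n * m))
    for k in range(n):
        A_rows[k, k * m:(k + 1) * m] = 1.0
    # Column-sum constraints: sum_k pi_kj = 1/m for each j.
    A_cols = np.zeros((m, n * m))
    for j in range(m):
        A_cols[j, j::m] = 1.0

    A_eq = np.vstack([A_rows, A_cols])
    b_eq = np.concatenate([np.full(n, 1.0 / n), np.full(m, 1.0 / m)])

    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return float(res.fun)
```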
Overall, our proposed scoring system is general, intuitive, and computationally tractable. To further verify its effectiveness, we next compute the affinity scores using our field experimental data and examine whether the scores can significantly moderate the causal effect of cross-sampling.

3.5.3 Moderating the Causal Effect

We first compute the two affinity scores based on the following covariates (p = 11): gender (whether female), age segment (25-30, 30-35, > 35), annual number of orders, annual spending, three-month spending at the distributing category, three-month item-views at the distributing brand, three-month spending at the sampling category, three-month item-views at the sampling brand, and whether the customer purchased from both categories during the past three months. For the sampling brand's customer base (S), we include all customers who made purchases at the sampling brand during the past three months. For the item information (I), we characterize each item based on its subcategory. For example, a hand soap is characterized using all the customers who have purchased any hand soap at the distributing brand. Finally, we winsorize both scores by brand (at 1% and 99%) and normalize them to lie between 0 and 100.

Before proceeding, we summarize the two scores by brand in Table 3.6 to examine the balance between the control and treatment groups. We find no significant difference regarding the user-based affinity score, and there is a perfect balance regarding the item-based affinity score. This is due to our clustered one-to-one PSM within each subcategory: because each subcategory has the same number of customers in the control and treatment groups, the distribution of the item-based affinity score (defined at the subcategory level) is identical for the two groups. (We also summarize both scores by subcategory for the first distributing brand, "Consumption 1," in Appendix B.2 Table B.3 as an example.)

Table 3.6: Balance Check for Affinity Scores by Brand

                      User-based Affinity Score        Item-based Affinity Score
Distributing Brand    Control  Treatment  p-Value      Control  Treatment  p-Value
Consumption 1           65.2     65.4      0.81          91.4     91.4       1
Cosmetic 1              65.5     65.3      0.73          86.1     86.1       1
Consumption 2           53.0     53.6      0.65          83.5     83.5       1
Cosmetic 2              60.7     60.6      0.88          43.4     43.4       1
Consumption 3           72.2     71.7      0.44          67.1     67.1       1
Cosmetic 3              58.8     58.7      0.95          86.8     86.8       1
Total                   61.3     61.1      0.83          70.1     70.1       1

Having confirmed the balance regarding the two scores, we proceed to estimate the following model with an interaction term
\[
\text{Outcome}_i = \alpha + \beta \, \text{Cross\_Sampling}_i + \gamma \, \text{Cross\_Sampling}_i \times \text{Score}_i + \theta^\top x_i + \varepsilon_i, \qquad (3.5)
\]
where the incremental effect of the score is captured by the coefficient \gamma. The statistical significance of \gamma would validate the effectiveness of our proposed scoring system, whereas the economic significance of \gamma precisely quantifies the value of user and item information. Specifically, if the estimate of \gamma is, for example, significantly positive for the item-based affinity score, it suggests that the score can be used to predict the heterogeneous treatment effect of cross-sampling. That is, items with higher scores indeed result in a larger causal effect, which suggests that the scoring system is useful. Furthermore, if the coefficient is also larger in magnitude for the item-based affinity score (compared with the user-based affinity score), it means that the item-based score can explain more of the variation in the heterogeneous treatment effect. Consequently, targeting the same number of customers with the highest item-based affinity scores would generate a larger return.
In this case, the purchase information based on customers’ real-time actions is more valuable than user information. Therefore, the moderating effect is our coefficient of interest. We estimate Equation (3.5) for the user-based and item-based affinity scores in Table 3.7 and Table 3.8, respectively. For the user-based affinity score (Table 3.7), we find a positive moderating effect of the user-based affinity score regarding item-views in the short term (1-4 weeks) and a positive moderating effect regarding spending in the long term (9-12 weeks). For example, cross-sampling a customer with user-based affinity score of 70 (out 100) would double the number of item-views in the short term, compared with a customer with user-based affinity score of 0. As the moderating effect is not consistent for different outcome periods, we find that the user-based affinity score is partially effective moderating the causal effect of cross-sampling. 66 Table 3.7: Influence of User-based Affinity Score on the Causal Impact of Cross-sampling Spending Item-views Spending Item-views Spending Item-views (1-4 weeks) (5-8 weeks) ( 9-12 weeks) Cross Sampling 0.148 0.070 0.053 0.015 0.236 0.021 (0.083) (0.012) (0.076) (0.008) (0.113) (0.009) Cross Sampling User Score 0.001 0.001 0.001 0.0003 0.004 0.00000 (0.001) (0.0002) (0.001) (0.0001) (0.002) (0.0001) Control variables Yes Yes Yes Yes Yes Yes N 83,132 83,132 83,132 83,132 83,132 83,132 R 2 0.003 0.028 0.002 0.014 0.012 0.009 Robust standard errors in parentheses; p<0.1; p<0.05; p<0.01 Table 3.8: Influence of Item-based Affinity Score on the Causal Impact of Cross-sampling Spending Item-views Spending Item-views Spending Item-views (1-4 weeks) (5-8 weeks) ( 9-12 weeks) Cross Sampling 0.152 0.062 0.001 0.004 0.054 0.017 (0.076) (0.011) (0.067) (0.009) (0.072) (0.009) Cross Sampling Item Score 0.005 0.003 0.001 0.0005 0.002 0.001 (0.001) (0.0002) (0.001) (0.0001) (0.001) (0.0001) Control variables Yes Yes Yes Yes Yes Yes N 83,132 83,132 83,132 83,132 83,132 83,132 R 2 0.003 0.032 0.002 0.015 0.004 0.010 Robust standard errors in parentheses; p<0.1; p<0.05; p<0.01 For the item-based affinity score (Table 3.8), we find that the item-based affinity score can significantly explain the increase in spending in the short term. According to our estimation, a 10% increase in the item-based affinity score can increase spending by 0:005 10=0:203 = 24:6% (p< 0:01, and 0.203 is the control average from Table 3.4) compared with the control group. And targeting customers with 0 item- based affinity scores would even decrease the spending at the sampling brand (p< 0:05). Besides, the score can consistently moderate the increase in item-views for in the following three months. Putting together, we find that 1) the user-based affinity score is partially effective in explaining the causal effect; and 2) the item-based affinity score has a significant moderating effect regarding both spending and item-views. The empirical evidence suggests that our proposed scoring system can be applied in practice and predict the heterogeneous treatment effect of free sampling. (We also carry out a follow survey and find 67 that the affinity scores also significantly correlate with customers’ satisfaction of cross-sampling. We defer the details in Appendix B.3.) More importantly, it provides an insight for different channels (specifically Quadrant 2 and 3 in Table 3.5) regarding the usage of real-time purchase information. 
That is, the real-time actions by customers can serve as a crucial mechanism, which can be further explored to improve the effectiveness of free sampling.

Nevertheless, as shown in Table 3.5, our proposed business practice has the unique advantage of leveraging both user and item information at the same time. Although we might expect that combining both types of information would not perform worse than using only one of them, the exact improvement is not clear. In the next section, we further quantify the competitive advantage of our proposed cross-sampling channel (Quadrant 4 in Table 3.5) regarding information usage.

3.5.4 Quantifying the Value of Information

We define a compound affinity score as the weighted average of the user-based affinity score and the item-based affinity score
\[
\text{Compound Score} = \lambda \, \text{User Score} + (1 - \lambda) \, \text{Item Score}. \qquad (3.6)
\]
The compound affinity score essentially combines the user and item information, and the weighting parameter \lambda controls the extent to which user information is used. When \lambda = 0, the compound affinity score reduces to the item-based affinity score. When \lambda = 1, it reduces to the user-based affinity score. If we can show that the compound affinity score has a larger moderating effect, it suggests that combining both user and item information can further improve the effectiveness of free sampling.

We again estimate Equation (3.5) using a 50-50 compound affinity score (\lambda = 0.5) in Table 3.9. We find that the moderating effect of the compound affinity score is positive and significant for both spending and item-views, and the significant effect is consistent for both the short term and the long term. For example, a 10-point increase in the compound affinity score can increase the causal effect (regarding spending in the following month) by 0.007 x 10 / 0.203 = 34.5% relative to the control group. Note that the magnitude of the moderating effect here (0.007, p<0.01) in Table 3.9 is larger than the moderating effect (0.005, p<0.01) of the item-based affinity score in Table 3.8. It suggests that targeting a particular package with both a high user-based affinity score and a high item-based affinity score can further enhance the effectiveness of cross-sampling.

Table 3.9: Influence of 50-50 Compound Affinity Score on the Causal Impact of Cross-sampling

                                  (1-4 weeks)            (5-8 weeks)            (9-12 weeks)
                               Spending  Item-views   Spending  Item-views   Spending  Item-views
Cross Sampling                  0.230     0.090        0.036     0.012        0.258     0.017
                               (0.121)   (0.016)      (0.109)   (0.014)      (0.132)   (0.014)
Cross Sampling x Compound Score 0.007     0.003        0.001     0.001        0.005     0.001
                               (0.002)   (0.0003)     (0.002)   (0.0002)     (0.002)   (0.0002)
Control variables                 Yes       Yes          Yes       Yes          Yes       Yes
N                               83,132    83,132       83,132    83,132       83,132    83,132
R^2                             0.003     0.032        0.002     0.015        0.004     0.010
Robust standard errors in parentheses; p<0.1; p<0.05; p<0.01

Moreover, by leveraging both types of information, our proposed channel can also vary the weighting parameter \lambda to find the combination that yields the largest moderating effect. We plot the moderating effect of the compound affinity score for different values of \lambda in Figure 3.3. We find that \lambda = 0.4 maximizes the moderating effect.

Finally, we carry out a counterfactual analysis to compare the revenue of the different channels in Table 3.5 under a resource constraint. For example, if we have 100 customers and only 50 free samples, each channel in Table 3.5 can use one of our proposed (user-based, item-based, or compound) affinity scores to sort customers and distribute free samples to the top 50, as sketched below.
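A minimal sketch of this procedure, with hypothetical column names, is the following: for each candidate weight we form the compound score of Equation (3.6), re-estimate the interaction model of Equation (3.5), keep the weight with the largest estimated moderating coefficient, and finally rank packages by the resulting score up to the sampling budget. This is only an illustration, not the production implementation.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def moderating_effect(df: pd.DataFrame, score_col: str, outcome: str,
                      controls: list) -> float:
    """Estimated coefficient on the treatment-score interaction in Eq. (3.5)."""
    rhs = f"treated + treated:{score_col} + " + " + ".join(controls)
    fit = smf.ols(f"{outcome} ~ {rhs}", data=df).fit(cov_type="HC1")
    return fit.params[f"treated:{score_col}"]

def best_lambda(df: pd.DataFrame, outcome: str, controls: list,
                grid=np.linspace(0.0, 1.0, 11)):
    """Sweep the weight in Eq. (3.6) and return the best (lambda, gamma_hat)."""
    results = []
    for lam in grid:
        df = df.assign(compound=lam * df["user_score"]
                                + (1 - lam) * df["item_score"])
        results.append((lam, moderating_effect(df, "compound", outcome, controls)))
    return max(results, key=lambda r: r[1]), results

# Hypothetical usage: keep the top fraction of packages allowed by the budget.
# (lam_star, _), _ = best_lambda(matched_df, "spending_1_4", control_cols)
# matched_df["score"] = (lam_star * matched_df["user_score"]
#                        + (1 - lam_star) * matched_df["item_score"])
# targeted = matched_df.nlargest(int(0.5 * len(matched_df)), "score")
```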
We ask: which channel can causally generate a higher return, i.e., more spending at the sampling brand?

[Figure 3.3: Estimated Moderating Effect of the Compound Affinity Score Regarding Spending at the Sampling Brand by Different Weighting. The figure plots the estimated moderating effect against the weighting parameter λ, from 0 (only item information) to 1 (only user information). Notes: We estimate Equation (3.5) with the compound score defined in Equation (3.6) for different λ and obtain the estimated coefficient \hat{\gamma}. The error bars are the 95 percent confidence intervals of the point estimates.]

In Table 3.10, for different resource-constraint levels, we present how much revenue each channel can collect from the customers at the sampling brand in the following month. The key metric here is the percentage revenue gain, which is defined as a ratio. Specifically, using our experimental data, we first calculate the total revenue of free sampling as the average treatment effect (0.218 RMB, estimated from Table 3.4) times the total number of customers (N = 83,132). Similarly, for each channel, we compute the revenue of the targeted customers based on the estimated heterogeneous effect from the previous section ($\sum_{i=1}^{N} (\hat{\beta} + \hat{\gamma}\,\text{Score}_i)$ based on Equation (3.5)). Then, the "percentage revenue gain" is defined as the ratio of the revenue from the targeted customers to the total revenue in the table. It precisely measures the proportion of revenue gained by personalization based on user or item information.

In Table 3.10, since Quadrant 1 cannot leverage any information for personalization, it would randomly select customers for free sampling, so its percentage revenue gain equals the resource constraint. Quadrant 2 and Quadrant 3 personalize free sampling by ranking customers or items based on the user-based and item-based affinity scores, respectively. Our proposed channel (Quadrant 4) would first find the optimal weighting parameter λ for each resource constraint and then rank each package (which consists of a customer and an item) based on the compound affinity score. We find that across different levels of resource constraints, Quadrant 4 always achieves the largest percentage revenue gain. For instance, our channel can gain 81.65% of the total revenue by targeting only 60% of the customers. Moreover, this competitive advantage is more prominent when the resource constraint is tight. For example, when the resource constraint is 60%, Quadrant 4 has only a (81.65 - 81.27)/81.27 = 0.46% improvement over Quadrant 3. And when the resource constraint is 10%, Quadrant 4 has a (17.22 - 15.54)/15.54 = 10.81% improvement over Quadrant 3.

Table 3.10: Percentage Revenue Gain of Different Channels (in Table 3.5) Under Resource Constraint

Resource Constraint (%)   Quadrant 1   Quadrant 2          Quadrant 3          Quadrant 4
                          (No Info.)   (Only User Info.)   (Only Item Info.)   (Full Info.)
10                          10.00        12.03               15.54               17.22
20                          20.00        23.84               29.90               32.83
30                          30.00        35.37               43.48               46.95
40                          40.00        46.67               56.56               59.81
50                          50.00        57.31               69.09               71.45
60                          60.00        67.30               81.27               81.65
Notes: The percentage revenue gain is computed as the sum of revenue from the targeted customers (by each channel), divided by the revenue from all customers.

3.6 Conclusion

In summary, this paper studies a business model that takes advantage of the control and information naturally endowed by e-commerce warehouses. We implement a novel business practice, cross-sampling, with Alibaba through a large-scale field experiment and provide the first empirical evidence of its effectiveness.
We show that because of the control of physical products, cross-sampling allows online brands to effec- tively acquire a large number of new customers with negligible costs by directly recommending physical products. Combined with the rich information (user and item information) accumulated from the online marketplace, we develop an analytical framework to quantify the value of information (both incorporated in each package) and succinctly describe the relationship between our proposed free-sampling channel and the existing ones. In this way, we illustrate how an e-commerce platform can leverage the warehousing system to commercialize the package flow. 71 Chapter 4 Maximizing Intervention Effectiveness 4.1 Introduction Across domains, we observe an increasingly common decision-making paradigm: Researchers assess whether a particular intervention is effective in treating a study population, such as in a randomized con- trol trial. Practitioners then roll out successful interventions to a potentially different candidate population, hoping to achieve similar effectiveness. Although this paradigm is perhaps most familiar in medicine, “in- tervention” and “treatment” can be interpreted quite generally, as these terms might refer to an after-school training program to reduce childhood obesity [158] or a tuition reduction program to increase college en- rollment [54]. Despite its ubiquity, this paradigm faces practical challenges. Many interventions are too expensive to provide to everyone in the candidate population (see, e.g., our case study below). Thus, practitioners must solve a resource allocation problem: Who should be targeted for treatment? A first intuition, motivated by medical practice, might be to target the “sickest patients,” i.e., the individuals most in need of treatment. However, these sickest patients may be too sick to benefit from treatment, so targeting them is arguably an inefficient use of resources. More poignantly, different individuals may respond differently to the same intervention. A prudent decision-maker would ideally target those individuals who benefit the most in order to maximize the aggregate benefit subject to resource constraints. The challenge, of course, is identifying the potential benefit for each individual prior to administering the intervention. 72 In this paper, we propose a robust optimization approach to this resource allocation problem using only the evidence typically published in a research study, including its inclusion/exclusion criteria, the demographic features of its study population and estimates of the average benefit. We provide a precise specification of this evidence in section 4.2.2, but note that it does not usually include the raw, individual- level study data. This restriction to the published evidence is a key distinguishing feature of our work. There is a growing body of literature on predicting the potential benefit of treatment at an individual level, i.e., learning a heterogeneous causal effect, for either estimation or personalization (see, e.g., [90, 9, 101] and references therein). In the marketing literature, these methods are sometimes referred to as “uplift modeling” [82, 170, 8]. There is also a second stream of literature on estimating the average causal effect in a population distinct from the original study population (e.g., [50, 151, 84]). In principle, either approach might be adapted to solve the above resource allocation problem. 
However, these methods typically require access to individual-level data from patients in the study. Such data are rarely available in practice [60]. Indeed, unlike the typical marketing and personalization settings, the decision-makers rolling out interventions in healthcare and policy-making contexts are often distinct from the researchers studying those interventions. Moreover, laws such as the Health Insurance Privacy and Portability Act (HIPPA), the General Data Protection Regulation (GDPR), and Family Educational Rights and Privacy Act (FERPA) heavily restrict those researchers from sharing the raw study data with policy makers out of patient-privacy and ethical concerns. Without this raw study data, it is not possible to implement the above approaches. Worse, even when study data are available, they are frequently inadequate for the task. Since conducting randomized control trials is notoriously expensive, most studies are sized to have just enough power to detect an average causal effect but are typically too small to learn the precise heterogeneous effects across patients. Learning this heterogenous effect is critical to the above resource allocation problem. 73 Since learning heterogeneous effects without the raw study data seems impractical, we take a different approach via robust optimization [23]. Generally, robust optimization methods optimize worst-case per- formance over an uncertainty set of possible realizations of uncertain parameters. Our particular robust approach seeks the subset of patients that maximizes the worst-case aggregate intervention effectiveness, where the worst-case is taken over an uncertainty set of models for the heterogeneous causal effect that are consistent with the published study evidence. In other words, rather than learning a single model for causal effects, we optimize the targeting to ensure the effect over many plausible models is as large as pos- sible. Depending on the precise assumptions on the set of models, this optimization problem can be cast as a mixed-binary linear optimization problem or a mixed-binary second-order cone optimization problem, both of which can be readily solved with off-the-shelf software. The resulting formulations are also flexi- ble enough to easily accommodate side constraints on the targeting, such as budget, operational or fairness constraints, and to incorporate evidence from multiple papers. We prove that our robust approach is equivalent to approximating the targeted subset’s average effec- tiveness by the study population’s average effectiveness minus a penalty that depends on the differences in demographics between these two groups, an insight that we term “covariate matching as regularization.” Intuitively, the more different the targeted subset and study population are, the less accurately the study population’s effectiveness approximates the targeted subset’s effectiveness. The precise form of the penalty depends on the particular uncertainty set. For special cases, we show that the penalty coincides with common techniques used for covariate matching in causal inference – 2 -matching, mean matching, and Mahalanobis matching – highlighting an interesting theoretical connection between these two areas. ([100] observes a similar connection in the context of designing experiments.) We stress that our robust approach does not directly estimate individual-level causal effects, but only approximates the aggregate effect over the targeted subset. 
This “portfolio” viewpoint sharply contrasts with both the aforementioned statistical literature and current practice. Indeed, most common approaches to 74 targeting employ so-called scoring rules: Practitioners assign each individual in the candidate population a score approximating her unknown heterogeneous causal effect and then target individuals with the highest scores. Scores are typically informed by some combination of domain expertise and predictive modeling. However, recent empirical studies have called the effectiveness of such rules into question [94]. In Section 4.3, we provide a theoretical analysis of scoring rules, providing sufficient conditions for their optimality and a tight performance guarantee when they are suboptimal. In particular, we prove that if patients may experience adverse effects from the particular treatment, scoring rules may perform arbitrarily badly and may be worse than not providing treatment to anyone. This research was inspired by our partner hospital, which seeks to reduce excessive emergency de- partment (ED) utilization by adult Medicaid patients by rolling out a case-management intervention. (See Section 4.1.2 for details.) We use real data from this case study as a running example and to assess our methodology (Section 4.5). We summarize our work as follows: 1. We formalize an optimization approach to maximize intervention effectiveness using the evidence typically available in a published study. To the best of our knowledge, we are the first to address this problem using only published study data. 2. We prove tight performance bounds for current practice (scoring rules). In particular, we prove that scoring rules can perform arbitrarily badly when the treatment is potentially harmful. 3. We propose a robust optimization approach to maximize the worst-case performance over a large class of models that agree with the study evidence. To the best of our knowledge, we are the first to apply robust optimization methods to the analysis of summary-level causal inference data. Our model is flexible enough to accommodate a variety of side constraints and can be solved for real-world instances within a few minutes using commercial software. Moreover, its worst-case performance is bounded by a value that depends on the class of models and the true heterogeneous effect. Under 75 some mild assumptions, this constant is zero, ensuring that our robust approach is never worse than not targeting, even when the treatment could be potentially harmful. 4. We provide an intuitive interpretation of our robust model as “covariate matching as regularization,” connecting with the literature on causal inference and illustrating how canonical covariate matching techniques can be recovered as special cases. 5. Using data from our partner hospital, we show that our robust approach performs almost as well as scoring rules when the degree of heterogeneity in causal effects is small and can perform much better than scoring rules as the degree of heterogeneity increases, especially when the treatment is potentially harmful. 4.1.1 Connections to Existing Literature Our work connects to a growing body of robust optimization applications, particularly in healthcare opera- tions (e.g., [34, 56, 44, 45, 77].) Adopting a worst-case perspective is appealing, particularly in healthcare, where decision-makers aspire to “do no harm”, and high-costs and consequences fuel risk aversion. 
A dis- tinguishing feature of our work is that while many robust optimization models seek to immunize solutions against parameter uncertainty or implementation uncertainty, our formulation is more naturally interpreted as immunizing solutions against model uncertainty, i.e., our uncertainty about the true model for hetero- geneous causal effects. In this respect, we are most similar to [30]. At the same time, we contribute to a large body of work connecting robust optimization and regularization [163, 107, 26, 70]. In particular, our work elucidates the connection between the form of regularizer and assumed structure of the causal effects. This perspective, we feel, helps provide an alternate, statistical interpretation of common regularizers and uncertainty sets. 76 Moreover, our restriction to the published study evidence distinguishes our work from existing tech- niques in data-driven robust optimization. In particular, typical data-driven robust optimization models (e.g., [53, 63, 27, 28]) assume that the data are noisy versions of the underlying parameters or a sequence of i.i.d. realizations of random variables depending on those parameters. In our setting, such data would correspond to noisy observations of the potential causal effect in the candidate population before adminis- tering treatment. However, in causal inference settings, it is impossible to observe this effect directly (even noisily), a phenomenon sometimes called the Fundamental Problem of Causal Inference (see [87] or our discussion in section 4.2). Worse, in our setting of interest, the data we do have pertains to a different population, the study population. These features are intrinsic to our application and require new modeling and methods. In focusing on the published evidence, our work also connects to meta-regression techniques in statistics that aim to “pool” the results of different published studies to form a refined estimate of causal effects [86, 29]. We differ from these works in two important respects: First, these methods’ successes rely upon access to multiple distinct papers; it is by combining distinct sources of information that they refine estimates. By contrast, although our robust approach can be applied when multiple published studies are available, it applies equally well with only a single study. Second, and more critically, these methods generally focus on estimation and inference, not on decision-making. A notable exception is [29], which does consider an optimization problem, but in a different context – designing a clinical trial versus rolling out an intervention – and with a different mathematical structure. Finally, we contrast our work with the reinforcement learning literature. An alternate approach to rolling out an intervention might be to proceed sequentially, treating individuals in the candidate population one at a time, observing their response to treatment, and using those observations to decide whom to treat next. This approach is naturally modeled as a contextual multi-arm bandit problem [15, 129]. While reinforcement learning is a reasonable strategy for some interventions, for many others, the outcome of interest may take 77 a long time to observe. For example, it may take months to check for a reduction in the childhood obesity rate. With this time delay, online approaches that proceed sequentially may be impractical, motivating our off-line treatment of the problem. 
4.1.2 Case Study: Emergency Department Visits by Adult Medicaid Patients Medicaid is a public insurance program for low-income, disabled and needy people under the age of 65. At our partner hospital, a large teaching hospital, approximately 50% of all ED visits are from Medicaid patients. Medicaid typically pays 50% less than private insurance[174]. Consequently, our partner hospital is underpaid for each Medicaid patient’s ED visit. On the other hand, Medicaid patients generally suffer from multiple chronic diseases and lack easy access to primary care [33]. This combination of financial burden and patient need has sparked interest in intervention programs that might reduce unnecessary Medicaid ED visits while improving patients’ health outcomes. Case management is the most widely used intervention in reducing ED visits. While the implementation details differ between studies, at a high level, case management involves a team of social workers, nurses, and physicians providing crisis intervention, supportive therapy by phone or in person, referral to substance abuse services, linkage to primary care providers and assistance with making appointments to outpatient care. This team of professionals may also liaise with other assistance programs on the patient’s behalf, such as to find subsidized housing. Prior research has shown case management to be effective in specific populations regarding reducing ED visits and improving patient outcomes [145, 144]. Unsurprisingly, case management is expensive, both financially and in terms of resources. Limited avail- ability of physicians, nurses, social workers and psychiatrists prevents enrollment of all Medicaid patients in the program at our partner hospital. Thus, our case study seeks to use data to target a subset of adult Medicaid ED patients for case management to reduce ED utilization and underpayments while maintaining 78 quality of care for this population. Based on their resource constraints, our partner hospital would ideally like to target approximately 200 patients. 4.2 Model Setup 4.2.1 Candidate Population We seek to target at mostK > 0 patients for intervention from a candidate population of sizeC >K in order to maximize total intervention effectiveness. We adopt a potential outcome framework for causal inference [93]. For each patientc2f1;:::;Cg, there exists a fixed tuple (x c ;y c (0);y c (1);r c ). The parameters x c and r c are assumed known, whiley c (0) andy c (1) are unknown and represent potential outcomes. Specifically, x c 2X denotes patient c’s pre-treatment covariates, e.g., demographic characteristics, and may include discrete and continuous components. The quantityy c (0) (resp. y c (1)) represents the outcome of interest for patientc if she does not (resp. does) receive the treatment. Before choosing whether to administer treatment, we know neithery c (0) nor y c (1). After this choice, exactly one ofy c (0) andy c (1) is revealed depending on on our choice for patientc. In our case study in Section 4.5,y c (0) andy c (1) denote the number of times patientc visits the ED in the next 6 months. In particular, smaller values are better. Consequently, we adopt a non-standard convention and define the causal effect/treatment effect of patientc as c y c (0)y c (1). (It is usually the negative of this quantity.) We stress that c may be positive or negative, i.e., the treatment may benefit or harm an individual patient. Moreover, since we never observe bothy c (0) andy c (1), we cannot observe c , directly, not even noisily. 
79 We define the intervention effectiveness of patientc to ber c c , wherer c 0 represents a known reward. Adopting a linear model for effectiveness is with some loss of generality. However, we believe this model is a good approximation for our case study (see below) and many other applications. In many medical applications, one is not interested in a monetary outcome, but simply the aggregate benefit (in units of c ) across patients. In these cases, one can taker c = 1 for allc. We stress thatr c may differ by patient and might depend in a complex way on the covariatesx c . For example, it might be the output of a machine- learning model that givenx c predicts (dollar) savings for each unit decrease in the outcome. In what follows, we assume without loss of generality thatr c is one of the components of x c since both are known before targeting. We seek to maximize the total intervention effectiveness as follows: max z2Z C X c=1 z c r c c ; whereZ ( z2f0; 1g C C X c=1 z c K ) : (4.1) If we were to observe the causal effects c directly, the optimal solution would be to rank each patient based on the intervention effectivenessr c c and to target the topK patients with non-negative values. Let B f1;:::;Cg denote this solution, which we call the full-information benchmark. The challenge is that since we only observe one ofy c (0) ory c (1) depending on our treatment assignment, we cannot observe the causal effects c directly. 4.2.2 Study Population and Evidence for Treatment Although we cannot learn c , we will assume that we have some evidence from a published paper that the treatment is effective, namely, a confidence interval for the average causal effect in a study population and summary statistics for the pre-treatment covariates of that study population. 80 Formally, let (x s ;y s (0);y s (1)) for s2f1;:::;Sg be the pre-treatment covariates and potential out- comes for each patient in the study population. (Note the distinction between superscripts and subscripts for x. The value x 1 describes the first patient in the study population, while x 1 describes the first patient in the candidate population.) In general, the study and candidate populations may be distinct. The param- eters (x s ;y s (0);y s (1)) are fixed but unknown. Instead, we assume we know an interval [I;I] such that I 1 S P S s=1 s I, where s y s (0)y s (1), for alls2f1;:::;Sg. The quantity 1 S P S s=1 s is the Sample Average Treatment Effect (SATE). Knowing [I;I] is a mild assumption. Most studies, regardless of their precise statistical methodologies, report a confidence interval for SATE that can be used for [I;I]. For example, in randomized control trials (the gold standard for medical research), a simple t-test, a linear regression including pre-treatment covariates and the treatment assignment, or a matching estimator yields a confidence interval for SATE (see, e.g., [91]). There do exist studies that do not report a confidence interval for SATE because of their chosen statistical design, such as a stratified analysis, which instead estimates average causal effects in each stratum. In our opinion, however, such designs are less common in healthcare. In special cases, we can still approximate a confidence interval for SATE for these studies (see, e.g., Section 4.2.3). The interval [I;I] tells us nothing about the distribution of x s in the study population. Most studies therefore also report summary statistics for x s and detailed inclusion/exclusion criteria. 
The precise statistics used (mean, median, standard deviation, etc.) often differ between studies. To provide a flexible modeling framework for summary statistics, we assume that the study reports a set of description functions g :X7!R,g = 1;:::;G and their expectations over the study population, i.e., g 1 S P S s=1 g (x s ). By suitably choosing the functions g (according to the study paper), we can model a wide variety of possible summary statistics as generalized moments of the study population’s covariate 81 distribution. In our opinion, most studies present a combination of summary statistics of the following three types: Partition Description Functions: When there exists a natural partition ofX = S I i=1 X i , e.g., patient race, studies often report the proportion of typei patients i = 1 S P S s=1 I(x s 2X i ) fori = 1;:::;I. We model these statistics with description functions i (x) I(x2X i ) for i = 1;:::;I 1. Note that since I = 1 P I1 i=1 i and I (x) = 1 P I1 i=1 i (x), it suffices to only specify theseI 1 description functions to capture allI statistics. We assume for simplicity that i > 0 fori = 1;:::;I. Linear Description Functions: WhenX R I contains continuous variables, studies often report their mean values in the study population i = 1 S P S s=1 x s i for alli = 1;:::;I. We model these statistics with description functions i (x)x i fori = 1;:::;I. Quadratic Description Functions: WhenX R I , studies may report, in addition to the mean m i 1 S P S s=1 x s i , the standard deviation 2 i 1 S P S s=1 (x s i m i ) 2 of each covariate for alli = 1;:::;I. We model the meanm i with theI description functions above and the standard deviation with additional I description functions: I+i (x)x 2 i and I+i =m 2 i + 2 i for alli = 1;:::;I. We stress that these data – summary statistics and SATE in a separate population – strongly contrast with the data typically available in data-driven robust optimization models. Indeed, traditional data-driven models often assume we observe (noisy) realizations of c which, as mentioned, is impossible in our setting. Nonetheless, we will combine the description functions, their statistics and the study SATE in Section 4.4 to formulate our robust optimization model. 82 4.2.3 Case Study: Setup The candidate population in our partner hospital is of size C = 951. (We defer a detailed description until Section 4.5.1.) Let y c (0) and y c (1) be the number of ED visits in the next 6 months if patient c in the candidate population does not or does, respectively, receive case management, Definey s (0) andy s (1) similarly for the study population. The unknown causal effect c is the potential number of ED visits reduced by case management for patientc. Finally, let the known rewardr c be an estimate of the average charges per ED visit for patientc based on their medical history. We assume a linear model for effectiveness for this application. ED charges typically consist of a rela- tively large fixed component common to most visits and a more variable idiosyncratic component that differs between visits. The fixed component corresponds to charges for doctor and staff time, basic equipment, and routine testing. The variable component corresponds to the additional services for the specific complaint on that visit and can be large for very sick patients. Consequently, charges are highly concentrated around the fixed component with a long tail. (See fig. C.1 in the e-companion.) 
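Before continuing with the case study, the sketch below illustrates how the three families of description functions from Section 4.2.2 reduce a study population's covariate table to the reported generalized moments $\mu_g = \frac{1}{S}\sum_{s} \phi_g(\mathbf{x}^s)$. The data and column names here are invented for illustration; they are not the statistics of [145].

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a study population's covariates (illustrative only).
study = pd.DataFrame({
    "male": np.random.binomial(1, 0.75, size=252),
    "age":  np.random.normal(43.3, 9.5, size=252),
    "race": np.random.choice(["black", "hispanic", "white", "other"], size=252),
})

# Partition description functions: indicators for all but one category.
partition_stats = {f"race={lvl}": (study["race"] == lvl).mean()
                   for lvl in ["black", "hispanic", "white"]}

# Linear description function: mean of a continuous covariate.
linear_stats = {"age": study["age"].mean()}

# Quadratic description function: second moment, so mu = m^2 + sigma^2.
quadratic_stats = {"age^2": (study["age"] ** 2).mean()}

print(partition_stats, linear_stats, quadratic_stats)
```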
Intuitively, case management is unlikely to prevent visits corresponding to extreme medical events (e.g., strokes, falls among the elderly), i.e. the visits with high variable costs. Rather, case management might help prevent “less serious” visits (e.g., a person with dehydration from chronic malnutrition) whose costs are closer to the fixed cost [32]. Thus, we approximate the marginal benefit of reducing 1 visit as a constant that may depend on the patient’s covariates. We use data from [145] as the study evidence because their study population mainly consists of low- income patients with behavioral problems who are similar to the Medicaid population in our hospital. [145] investigate the causal effects of case management in reducing ED visits among ED frequent users at San Francisco General Hospital, an urban public hospital. Patients were eligible for study participation if they had at least 5 visits to the ED in the previous year, were San Francisco residents, were at least 18 years 83 Table 4.1: Evidence of Causal Effects for Case Management from [145] for the Study Population Stratum 1 No. of ED Visits 5 - 11 y Stratum 2 No. of ED Visits 12 Total No. of Patients Treatment 81 (32%) 86 (34%) 167 (66%) Control 40 (15%) 45 (18%) 85 (34%) No. of ED Visits in 6 Months Treatment, mean sd 2.5 3.2 5.2 5.6 3.9 2.0 Control, mean sd 4.6 6.2 8.5 9.6 6.7 8.2 CATE , mean sd, [95% CI] 2.1 1.0, [0.1, 4.1] 3.3 1.6, [0.2, 6.4] SATE, mean sd, [95% CI] 2.7 1.0 z , [0.8, 4.6] Notes.y Patients are stratified based on number of ED visits in the previous year. Approximated by taking the weighted average of the mean and variance for each group. For example, the mean outcome for the treatment group is (812:5+865:2)=(81+86) = 3:9 ED visits. CATE refers to the Conditional Average Treatment Effect within each stratum. We formally define CATE in Section 4.4.1. z The standard deviation is approximated as p s 2 1 =n1 +s 2 2 =n2, wheresi is the standard deviation of the outcome andni is the number of people in groupi fori = 1; 2. old and had psychosocial problems that might be addressed with case management. Such problems include housing problems, medical care problems, substance abuse, and mental health disorders. The study was conducted between 1997 and 1999, and a total of S = 252 eligible patients were enrolled. The authors performed a stratified analysis on two strata based on previous ED visits. We reproduce their results in Table 4.1 and summary statistics in Table 4.2 for the 252 patients. Since [145] do not directly present a confidence interval for SATE, we approximate it (see notes in Table 4.1 and also discussion at the beginning of Section 4.5). The second column of Table 4.2 presents the same summary statistics for theC = 951 Medicaid patients at our partner hospital who satisfy the inclusion/exclusion criteria of [145]. Despite these criteria, these two populations still display some systematic differences. Finally, we map Table 4.2 into our framework with the following description functions: The proportion of male patients can be represented as a partition description function with an indicator for gender. Race and whether the number of ED visits in the previous year exceeds 11 can also be represented with indicators. 
84 Table 4.2: Summary Statistics for the Study and Candidate Populations Study Population (S = 252 patients) Candidate Population (C = 951 patients) Male 188 (75%) 442 (46%) Race/Ethnicity African American 138 (54%) 475 (50%) Hispanic 55 (22%) 7 (1%) White 34 (13%) 283 (29%) Other 28 (11%) 186 (20%) Age, mean sd 43.3 9.5 38.3 12.5 No. of ED Visits in Previous Year 5 - 11 121 (48%) 860 (90%) 12 131 (52%) 91 (10%) Most Frequent Diagnosis y during ED Visits mental disorder (22%) alcohol-related disorders (10%) injury (16%) abdominal pain (6%) skin diseases (8%) back problems (5%) endocrine discorders (5%) nonspecific chest pain (4%) digestion disorders (5%) connective tissue diseases (3%) respiratory illnesses (5%) non-traumatic joint disorders (3%) Notes. Both populations only include patients who have had at least 5 ED visits. y Calculated from primary the ICD-10-CM diagnosis code using Clinical Classification Software [61]. The average age can be represented as a linear description function. The standard deviation of age can be represented by a quadratic description function with correspond- ing summary statistics 43:3 2 + 9:5 2 . 4.3 Scoring Rules Given the structure of the optimal solution to Problem (4.1), a natural heuristic is to approximate the un- known causal effect c with some observable proxy ^ c and then rank patients accordingly. Definition 4.3.1. Given a proxy ^ c > 0, ther ^ -scoring rule ranks each patientc in the candidate population byr c ^ c and targets theK highest-ranked patients with non-negative scores. In principle, one can use any observable metric as the proxy. We focus on two proxies that correspond to common assumptions and methods employed by practitioners: 85 Constant Effect Sizes and Reward Scoring: tIf one believes that the true causal effect is constant, i.e., c = 0 > 0 for allc2f1;:::;Cg, then no matter what the value of 0 is, using the proxy ^ c = 1 and ranking patients byr c (i.e., reward scoring, orr-scoring) yields an optimal solution to Problem (4.1). The assumption of constant effect sizes is common in statistical inference for randomized control trials. In particular, the most common approach for estimating the sampling variance of the SATE estimator assumes that the causal effects are constant across all individuals in the study population imbens2004nonparametric. Proportional Effect Sizes and Outcome Scoring: tIf one believes that the true causal effect is propor- tional to the outcome without treatment, i.e., c = y c (0) for allc2f1;:::;Cg and some > 0, then no matter what the value of is, using the proxy ^ c =y c (0) and ranking patients byr c y c (0) (i.e., outcome scoring, orry(0)-scoring) yields an optimal solution to Problem (4.1). In words, outcome scoring targets high-risk patients. Many studies have developed statistical or machine-learning models to predict the number of ED visits by a specific patient in the near future billings2013dispelling. They estimatey c (0) (i.e., number of ED visits for patientc) and suggest fo- cusing on patients with large estimates. Implicitly, such recommendations assume that the true causal effect c is proportional toy c (0). Neither reward scoring nor outcome scoring leverages the summary statistics for the study population but may leverage the candidate population covariates in a sophisticated way to estimater c andy c (0). In the rest of this section, we show that scoring rules may be highly suboptimal when these underlying assumptions about the causal effects are violated. 
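A minimal sketch of Definition 4.3.1 follows: given any proxy $\hat{\tau}_c$, rank patients by $r_c \hat{\tau}_c$ and target the top $K$ with non-negative scores. Reward scoring takes $\hat{\tau}_c \equiv 1$ and outcome scoring takes $\hat{\tau}_c = y_c(0)$; the full-information benchmark corresponds to plugging in the true, unobservable $\tau_c$. The numbers below are synthetic and purely illustrative.

```python
import numpy as np

def score_and_target(r, tau_hat, K):
    """r*tau_hat-scoring rule: target the K highest-ranked patients
    with non-negative scores (Definition 4.3.1)."""
    score = r * tau_hat
    order = np.argsort(-score)                  # sort descending by score
    chosen = [c for c in order[:K] if score[c] >= 0]
    z = np.zeros(len(r), dtype=int)
    z[chosen] = 1
    return z

# Illustrative inputs: rewards and a predicted baseline outcome y_c(0).
r = np.array([8000., 3000., 2500., 1200., 900.])
y0_hat = np.array([2., 9., 7., 11., 4.])

z_reward  = score_and_target(r, np.ones_like(r), K=2)   # reward scoring
z_outcome = score_and_target(r, y0_hat, K=2)            # outcome scoring
print(z_reward, z_outcome)
```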
86 4.3.1 Performance of Scoring Rules with Benign Treatment The performance of scoring rules depends heavily on whether the treatment might be harmful. If we assume that we can identify and avoid treating patients with potential adverse events to that all treated patients experience positive causal effects, then scoring rules may perform very well. Theorem 4.3.1 (Worst-Case Performance of Scoring Rules with Benign Treatment). Without loss of gener- ality, index patients so thatr 1 ^ 1 r C ^ C 0. SupposeKC=2 and there exists 0< < <1 such that c = ^ c > 0, for allc2f1;:::;Kg, and c = ^ c , where > 0 for allc2fK + 1;:::;Cg. Then, ther ^ -scoring rule obtains at least!(=) of the full-information benchmark optimal value, where !(=) (=) P K c=1 r c ^ c (=) P k c=1 r c ^ c + P 2Kk c=K+1 r c ^ c (4.2) and k = 8 > > > > < > > > > : 0; if (=)r 2K ^ 2K =r 1 ^ 1 arg maxfcj 1cK; (=)r 2Kc+1 ^ 2Kc+1 =r c ^ c g; otherwise. Moreover, for a given value of r and ^ , there exist values of such that the bound is tight. Intuitively, measures how much we may have underestimated the causal effects of patients that were not picked, while measures how much we may have overestimated the causal effect of patients that were picked. With this interpretation, the critical assumption in Theorem 4.3.1 is that c = ^ c > 0, for all c2f1;:::;Kg. Indeed, since r c 0, this implies that r c c 0 for all c2f1;:::;Kg, i.e., targeted patients do not experience adverse effects. 87 At first reading, the function!() in Theorem 4.3.1 appears quite complicated. Importantly, it depends on the unknown causal effects only through the ratio =. Intuitively, this ratio measures the degree of correspondence between the proxy ^ c and the true causal effect c . If the proxy is reasonably accurate, i.e., ^ c c for allc2f1;:::;Cg, then we would expect, and the ratio is close to 1. At the other extreme, if ^ c and c are very different for patientsf1;:::;Kg compared to patientsfK + 1;:::;Cg, the ratio= will be close to 0. Intuitively, scoring rules should improve as the degree of correspondence increases. We prove that our bound!() shares these features: Corollary 4.3.1.1. Under the assumptions of Theorem 4.3.1, (1) !(=) is increasing in=; (2) If=r K+1 ^ K+1 =r K ^ K ,!(=) = 1 and ther ^ -scoring rule is optimal; (3) !(=)! 0 as=! 0. A numerical example showing typical values of the bound and its shape is in Section 4.3.3. 4.3.2 Performance of Scoring Rules with Potentially Harmful Treatment Assuming that we can identify and avoid treating patients that would have an adverse response to the treat- ment is particularly strong; in practice, the treatment may be ineffective or even harmful. The evidence for case management, in particular, is mixed. [110] provided case management to patients with over 3 ED visits in the previous month and found no statistically significant changes in the number of ED visits. [136] provided case management to high-risk patients manually selected by physicians and found a statistically significant increase in the number of ED visits afterward. These mixed results strongly suggest that case management might be ineffective or even harmful when provided to the wrong subpopulation. Unfortu- nately, if the treatment is potentially harmful, scoring rules can perform arbitrarily badly. 88 Remark (Scoring Rules Can be Worse Than not Targeting). Index patients as in Theorem 4.3.1, and sup- pose that c = ^ c = < 0 for all c2f1; ;Kg and P c2B r c c > 0. 
Such a scenario might occur if the treatment only benefited a small subgroup of the population but potentially harmed others, and scoring rules cannot perfectly determine which is which. Then, $r\hat{\tau}$-scoring has intervention effectiveness $\delta \sum_{c=1}^{K} r_c \hat{\tau}_c < 0$, which, in an absolute sense, is worse than not providing treatment to anyone. In terms of relative performance, the performance can be arbitrarily bad when the treatment is only marginally effective, since
$$\frac{\delta \sum_{c=1}^{K} r_c \hat{\tau}_c}{\sum_{c \in B^*} r_c \tau_c} \;\longrightarrow\; -\infty \quad \text{as} \quad \sum_{c \in B^*} r_c \tau_c \;\to\; 0. \tag{4.3}$$
In terms of the performance difference, this may be as large as
$$\bar{\delta} \sum_{c=K+1}^{2K} r_c \hat{\tau}_c \;-\; \delta \sum_{c=1}^{K} r_c \hat{\tau}_c, \tag{4.4}$$
if $B^* = \{K+1, \ldots, 2K\}$ and $\tau_c / \hat{\tau}_c = \bar{\delta} > 0$ for all $c \in B^*$. Such a scenario might occur if there do exist high-reward patients who would benefit from the treatment, but the particular scoring rule does not identify them.

4.3.3 Case Study: Performance of Scoring Rules

Using patient data from our partner hospital, we apply Theorem 4.3.1 and Remark 4.3.2 for reward scoring ($\hat{\tau}_c = 1$) in Figure 4.1. Since $\hat{\tau}_c = 1$, $\delta/\bar{\delta}$ measures the degree of heterogeneity in causal effects. For example, $\delta/\bar{\delta} = 0.1$ means that the causal effects differ by a factor of 10, i.e., if there exist patients for whom case management reduces the number of ED visits by 1, there also exist patients for whom case management reduces the number of ED visits by 10.

When case management always reduces ED visits (i.e., benign treatment), reward scoring performs well and improves as the resource constraint is relaxed ($K/C$ increases). For example, when $\delta/\bar{\delta} = 0.6$, reward scoring obtains approximately 90% of the full-information optimal value. As predicted by Corollary 4.3.1.1, the performance improves as the value of $\delta/\bar{\delta}$ approaches 1. The particularly strong performance of reward scoring in this example is due to a small group of patients in our data who incur a very large charge per ED visit, i.e., $r_c$ has a long tail. The 99th percentile of $r_c$ is 2.8 times the 90th percentile value. Consequently, $k^*$ is generally large.

[Figure: two panels (Benign Treatment; Potentially Harmful Treatment); x-axis: Degree of Correspondence $\delta/\bar{\delta}$, from $-0.2$ to $1.0$; y-axis: Relative Performance (%), from $-50\%$ to $100\%$; one curve per resource constraint $K/C \in \{10\%, 20\%, 30\%, 40\%, 50\%\}$.]

Figure 4.1: Worst-Case Relative Performance of Reward Scoring ($r$-Scoring) for Case Management in Our Partner Hospital. When the treatment is benign, we plot the worst-case relative performance bound (4.2) provided in Theorem 4.3.1. When the treatment is potentially harmful, the worst-case relative performance is $-\infty$, as mentioned in Remark 4.3.2. Thus, we plot $(\delta/\bar{\delta}) \sum_{c=1}^{K} r_c \big/ \sum_{c=K+1}^{2K} r_c$ for comparison.

To the contrary, when case management may be ineffective or increase ED visits (i.e., potentially harmful treatment), reward scoring performs quite badly, even when case management could increase the number of ED visits to a smaller degree than it can reduce them (e.g., $\delta/\bar{\delta} = -0.1$). This poor worst-case performance is also caused by the long tail of $r_c$. Intuitively, when there exists a small proportion of patients with very high marginal rewards, targeting these patients is “risky.” If those patients benefit from treatment, the overall effectiveness will be high, but if they respond negatively, the overall effectiveness will be low. Similar behavior is seen for outcome scoring in Figure C.2 in the e-companion.
Both figures highlight the 90 fact that when the particular score has a long tail for a fixed degree of correspondence, scoring rules will be very sensitive to whether or not the treatment is potentially harmful. 4.4 Robust Targeting The worst-case performance of scoring rules depends strongly on whether the treatment is potentially harm- ful. We next introduce our robust approach, which is less sensitive to this distinction. 4.4.1 Similar Patients Respond Similarly The key idea of our approach stems from the simple intuition that patients with similar pre-treatment covari- ates should respond similarly to treatment. To formalize this idea, we first define the Conditional Average Treatment Effect (CATE) and re-express eq. (4.1). To avoid writing long summations in what follows, we define the random variables ~ c to be a ran- domly chosen patient from the candidate population. Thus, (x ~ c ; ~ c ) denote the pre-treatment covariates and causal effect for this randomly chosen patient. Define ~ s and (x ~ s ; ~ s ) similarly for the study population. We stress that ~ c and ~ s are only defined to simplify the notation; the values (x c ; c )c = 1:::;C and (x s ; s ) s = 1;:::;S are fixed, unknown constants, i.e., non-random. With this notation, the SATE in the study population can now be concisely expressed as 1 S P S s=1 s =E[ ~ s ]: Let the study population CATE for a patient with pre-treatment covariates x2X be E[ ~ s jx ~ s = x] = 1 jfsj x s = xgj X s:x s =x s : 91 The study population CATE is a function of x. Define the candidate population CATE, i.e.,E[ ~ c jx ~ c = x], similarly. Intuitively, CATE represents the average causal effect across all patients in the given population with a particular value of covariate. Using CATE, we can re-express the objective of eq. (4.1). Recall,r c is a component of x c , so thatr ~ c is x ~ c measurable. Suppose thatz 1 ;:::z C represent a targeting policy wherez c depends only on x c , i.e,z ~ c is x ~ c measurable. Then, the objective of Eq. (1) for this policy is C X c=1 z c r c c = CE [z ~ c r ~ c ~ c ] = CE z ~ c r ~ c E [ ~ c j x ~ c ] = C X c=1 z c r c E [ ~ c j x ~ c = x c ]; (4.5) where the first and last equalities follow from the definition of ~ c, and the middle equality uses thatr ~ c is x ~ c measurable. Thus, the objective of eq. (4.1) is equivalent to the objective of: max z2Z C X c=1 z c r c E[ ~ c jx ~ c = x c ]: (4.6) Replacing the objective of eq. (4.1) with the objective of eq. (4.6) is conceptually appealing. Recall, c are fundamentally unobservable since we cannot observe bothy c (0) andy c (1). By contrast,E [ ~ c j x ~ c = x] is estimable given a large enough RCT in the candidate population. This is perhaps why most personalization schemes focus on eq. (4.6) directly (see, e.g., kallus2017recursive,athey2017efficient). Moreover, via a similar argument, we can rewrite the confidence interval for the study SATE as a constraint on the study CATE, i.e., E[ ~ s ] 2 [I;I] () E E[ ~ s j x ~ s ] 2 [I;I]; and focus on CATEs exclusively. The challenge with eq. (4.6) in our setting, however, is that we do not know the candidate population CATE; our data on effectiveness is from the study population. We must assume some “link” between the candidate CATE and the study CATE in order to leverage the study evidence. 
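The identity in eq. (4.5) — that replacing $\tau_c$ by the CATE leaves the objective unchanged whenever $z_c$ and $r_c$ depend on the patient only through $\mathbf{x}_c$ — can also be checked numerically. The sketch below uses a toy discrete covariate and invented effects; it is purely an illustration of the algebra, since $\tau_c$ is never observed in practice.

```python
import numpy as np

rng = np.random.default_rng(1)
C = 1000
x = rng.integers(0, 3, size=C)                          # a single discrete covariate
tau = rng.normal(loc=x.astype(float), scale=1.0)        # heterogeneous causal effects

# Rewards and targeting decisions are functions of x only (x-measurable).
r_of_x = np.array([1.0, 2.0, 5.0])
z_of_x = np.array([0, 1, 1])
r, z = r_of_x[x], z_of_x[x]

lhs = np.sum(z * r * tau)                                # sum_c z_c r_c tau_c
cate = np.array([tau[x == v].mean() for v in range(3)])  # candidate CATE by covariate value
rhs = np.sum(z * r * cate[x])                            # sum_c z_c r_c E[tau | x = x_c]
print(lhs, rhs)   # the two agree up to floating-point error
```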
92 To that end, fix any normkk link onR C and define the constant by E[ ~ s j x ~ s = x c ]E[ ~ c j x ~ c = x c ] C c=1 link : (4.7) In words, measures an aggregate distance between the study CATE and candidate CATE on the candidate population. Importantly, makes rigorous the idea that similar patients in both populations should respond similarly to treatment. For example, when = 0, the CATEs are identical in both populations, and the expected treatment effectiveness of a patient given her pre-treatment covariates does not depend on her population. Positive, but small, bounds the difference in effect between the populations. (The assumption that = 0 and the CATEs are identical is common in statistical techniques that generalize a causal effect from one population to another (e.g., cole2010generalizing, stuart2011use, hartman2015sate). Our proposed procedure will not depend on the true value of , although its performance will (see corollary 4.4.2.4). Consequently, we will formulate our model in the general case when 0 and the study and candidate CATE may differ, but on first reading, the reader may assume = 0 without much loss of generality.) 4.4.2 A First Robust Model Since the candidate CATE is unknown, one approach might be to model this CATE as a random function, say ~ C (), and then seek to solve max z2Z P C c=1 z c r c E h ~ C (x c ) i : Unfortunately, the data at hand do not contain information about the precise structure of the candidate CATE. Consequently, defining the proba- bility distribution of ~ C () would require strong a priori assumptions on this structure that may not be easily validated. 93 Instead, we adopt a robust optimization perspective. Specifically, we maximize the worst-case interven- tion effectiveness over possible values for the candidate CATE that are consistent with the study findings: max z2Z min C ()2U C X c=1 z c r c C (x c ); (4.8) where C () approximates the candidate CATE. Given our study evidence, a first choice for the uncertainty setU might be U ^ = C :X7!R 9 S :X7!R s.t. IE[ S (x ~ s )]I; k S (x c ) C (x c ) C c=1 k link ^ ; (4.9) where S () approximates the study CATE, and the user-defined parameter ^ approximates. Unfortunately, this uncertainty set does not yield practically implementable solutions. Specifically, for any fixed z2f0; 1g C , z6= 0 and any x2X , letq z (x) P C c=1 z c r c I(x c = x)= P C c=1 z c r c denote the reward-weighted covariate distribution of the targeted patients from the candidate population. Define the set Z ( z2f0; 1g C C X c=1 z c K; q z (x) =P(x ~ s = x); 8x2X ) : Theorem 4.4.1 (Trivial Solutions for Unbounded CATEs). For any ^ 0, either 0 is an optimal solution to (4.8) withU ^ or every optimal solution is contained inZ . Theorem 4.4.1 asserts that under uncertainty set (4.9), the worst-case optimal targeting either chooses no patients or matches the study distribution of covariates exactly. Verifying that a solution matches the study distribution of covariates exactly, however, requires access to the full distribution of x ~ s , i.e., the raw study data, making it impossible to compute such a solution. 94 4.4.3 Incorporating Description Functions The fundamental issue is thatU ^ in (4.9) is “too large” and contains many pathological pairs C ; S . Ideally, we would prefer to restrictU ^ to a suitably well-behaved, nonparametric class of functions, such as those belonging to a kernel space kallus2017recursive. 
However, for an arbitrary nonparametric specification of the study CATE, verifyingE[ ~ s ]2 [I;I] may require the full distribution of x ~ s . Consequently, we restrict U ^ to a particular nonparametric class of functions for which we can easily verify this condition. To this end, consider projecting the study CATE onto the affine space spanned by the description func- tionsf1; 1 ();:::; G ()g. Then, for any x2X , we can writeE[ ~ s j x ~ s = x] = 0 + P G g=1 g g (x) + (x); where ( 0 ; )2 arg min 0 ; 0 @ E[ ~ s j x ~ s = x s ] 0 G X g=1 g g (x s ) 1 A S s=1 2 2 : (4.10) Herekk 2 is the ordinary` 2 -norm. Consequently, by construction,E[ (x ~ s )] = 0. By itself, this decomposition does not restrict the set of CATEs under consideration; any CATE can be projected onto this affine subspace. This is why we describe our specification as nonparametric. Nonethe- less, becauseE[ (x ~ s )] = 0, we have thatE[ S (x ~ s )] = 0 + P G g=1 g E[ g (x ~ s )] = 0 + P G g=1 g g . Thus, despite the non-parametric specification, verifying the study SATE agrees with the evidence only requires knowing 0 ; , i.e.,E ~ s 2 [I;I] () 0 + P G g=1 g g 2 [I;I]: Motivated by this decomposition, our new uncertainty set is formed by restricting the size of the coeffi- cients and residual in this decomposition. Specifically, letkk be a norm onR G andkk res be a norm on R C . The main uncertainty set of our robust model is then 95 U ^ ;^ = ( C () :X7!R 9() :X7!R; 0 2R; 2R G ; s.t. S (x) = 0 + G X g=1 g g (x) +(x); (4.11) I 0 + G X g=1 g g I; S (x c ) C (x c ) C c=1 link ^ ; kk ^ 1 ; ((x c )) C c=1 res ^ 2 ; ) : In words, the first equality of our uncertainty set decomposes S () into its projection onto the affine space of description functions and a residual. Any function can be decomposed in this way, so this equality does not limit the class of study CATEs under consideration. The second pair of inequalities model the study evidence. The third inequality bounds the distance between the CATEs as in eq. (4.9). The last two inequalities depend on the user-defined parameters ^ 1 ; ^ 2 and restrict the set of possible CATEs beyond eq. (4.9). Also note that this uncertainty set is always non-empty for any nonnegative (^ ; ^ 1 ; ^ 2 ). One can verify that C (x) = I+I 2 for all x2X is a member ofU ^ ;^ by letting C () = S (), 0 = I+I 2 , = 0 and() = 0. In other words, for any choices of the model parameters, our uncertainty set includes a “nominal” case where there is no heterogeneity in causal effects, and the SATE of the study and targeted population are the same. To build some intuition for these last two constraints, consider the idealized special case where we 1) choose the normkk such thatkk 2 T with gg 0 E[( g (x ~ s ) g )( g 0(x ~ s ) g 0)] for all g;g 0 = 1;:::;G and 2) choose the normkk res such that ((x c )) C c=1 res = ((x s )) S s=1 2 . (This norm can always be specified in this manner whenever the random variables x ~ s and x ~ c are mutually ab- solutely continuous. In this case, simply take ((x c )) C c=1 2 res P C c=1 (x c ) 2 P(x ~ s =xc) P(x ~ c =xc) : One can then check directly that ((x c )) C c=1 res = ((x s )) S s=1 2 .) Then, the constraintkk ^ 1 is equivalent to Var( 0 + P G g=1 g g (x ~ s )) ^ 2 1 , and the constraint on() bounds the residual variance in the regression 96 eq. (4.10). In other words, ^ 1 controls the amount of variability of (x ~ s ) explained by the description functions, while ^ 2 controls the residual variability. 
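To illustrate the decomposition behind (4.10)–(4.11), the following sketch projects a synthetic study CATE onto the affine span of $\{1, \phi_1, \ldots, \phi_G\}$ by least squares. The residual $\epsilon$ has mean zero over the study population, so the SATE is recovered from $(\beta_0, \boldsymbol{\beta})$ and the reported moments $\mu_g$ alone. All values here are invented; in practice the study CATE is not observed patient-by-patient, and this block only illustrates the algebra.

```python
import numpy as np

rng = np.random.default_rng(2)
S, G = 252, 3
phi = rng.normal(size=(S, G))                         # description functions phi_g(x^s)
cate = 2.0 + phi @ np.array([0.5, -0.3, 0.8]) + rng.normal(scale=0.7, size=S)

# Least-squares projection of the study CATE onto the affine span of {1, phi_1, ..., phi_G}.
X = np.column_stack([np.ones(S), phi])
coef, *_ = np.linalg.lstsq(X, cate, rcond=None)
beta0, beta = coef[0], coef[1:]
eps = cate - X @ coef                                 # residual component epsilon(x^s)

mu = phi.mean(axis=0)                                 # reported summary statistics mu_g
print(eps.mean())                                     # ~0: residual is mean-zero by construction
print(cate.mean(), beta0 + beta @ mu)                 # SATE recovered from (beta0, beta, mu)
```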
With these choices of norm, the sum ^ 1 + ^ 2 is the total variance of (x ~ s ) and describes the hetero- geneity in the study CATE. As this sum tends to zero, (x ~ s ) tends to a constant, i.e., the study-CATE is homogenous. At the same time, the ratio ^ 1 ^ 1 + ^ 2 describes the ability of these description functions to capture this hetereogeneity, much like theR 2 of a linear regression. If this ratio is large, these description functions capture most of the heterogeneity, and given g (x s ) we can predict the CATE of patients well. If this ratio is small, these description functions are uninformative, and we cannot predict the CATE of patients well from these values. That said, we stress these choices for norms are idealized, not prescriptive. Specifying the norms in this manner would require detailed information on the distribution of covariates in the study population, which is unavailable. Nonetheless, we will leverage this intuition to motivate specific, practical choices of the norm in special cases in what follows. 4.4.4 Robust Counterparts Using standard techniques, we can compute a robust counterpart. Letkk ,kk link andkk res be dual norms tokk,kk link andkk res , respectively. Theorem 4.4.2 (General Robust Counterpart). The robust targeting problem (4.8) with uncertainty set (4.11) is equivalent to max z2Z I C X c=1 z c r c ^ 1 C X c=1 z c r c ( g (x c ) g ) ! G g=1 ^ 2 (z c r c ) C c=1 res ^ (z c r c ) C c=1 link : (4.12) 97 Remark (Computational Complexity). From a theoretical point of view, Problem (4.12) is NP-Complete, even when G = 1, ^ = ^ 2 = 0, (x) takes binary values andkk is an ` p -norm (Theorem C.1.1, Appendix C.1). From a practical point of view, when the norms correspond to (weighted)` 1 or` 1 -norms, Problem (4.12) is a mixed-binary linear program, and when the norms corresponds to (weighted)` 2 -norms, Problem (4.12) is a mixed-binary second-order cone problem. Although theoretically difficult, moderately sized mixed-binary linear and mixed-binary second-order cone problems (such as those we study in this paper) can be solved efficiently using off-the-shelf software on a personal computer in minutes. Problem (4.12) provides insight into the structure of an optimal targeting: No Dependence onI. The robust counterpart does not depend onI, or, equivalently, the width of the con- fidence intervalII. This lack of dependence is a unique feature of our causal inference setting that distinguishes it from more traditional data-driven robust optimization settings. Specifically, in typical data-driven settings where one directly observes data on the relevant uncertain- ties, the width of the of the confidence interval roughly corresponds to the precision of the estimates of those uncertainties (see, e.g., bertsimas2014robust). With more data, this interval shrinks, and the robust counterpart “converges” to a nominal (full-information) problem. In our setting, we do not directly observe data on the relevant uncertainty, i.e., the candidate CATE. The width of the confidence interval [I;I] does not correspond to the precision of the relevant un- certain parameters, i.e., the candidate CATE, but rather to the precision of the estimate for the study SATE. The precision of this estimator does not affect our targeting. What matters in the targeting problem is the level of the study SATE and the variability of study CATE. 
Intuitively, an RCT with an extremely large sample size could drive the width of the confidence interval to zero, but that would not imply that the study CATE had low variability. There would still be uncertainty in the form of the heterogenous effect in candidate population. 98 Avoiding high-reward patients if ^ or ^ 2 is large. As ^ ! 1 with ^ 1 and ^ 2 fixed, the last term in eq. (4.12) grows. Consequently, an optimal solution selects fewer and fewer patients with very high rewards, until ultimately no patients are targeted. We claim this behavior is intuitive. Recall that ^ proxies (see eq. (4.7)). If a decision-maker believed were quite large (and hence specified ^ to be large), she also believes that the study CATE is not representative of the candidate CATE. Said another way, she believes there is only weak evidence in the study, itself, to guarantee that targeting in the candidate population will be effective. Consequently, an optimal risk-averse targeting should select relatively few patients, and avoid patients with very high rewards. Indeed, recall from our discussion of the pitfalls of reward scoring around fig. 4.1 that high-reward patients are “risky.” If these patients react adversely to treatment, they have very negative effectiveness. The robust model guards against the pitfalls of reward-scoring when the study-evidence is weak. If the evidence is weak enough, it recommends not targeting. A similar behavior holds as ^ 2 !1 with ^ and ^ 1 fixed. Recall that ^ 2 proxies the residual error in eq. (4.10). If a decision-maker believed 2 were large (and hence specified a large ^ 2 ), then even if she believed = 0, i.e., that study CATE and candidate CATE were identical, she should select relatively few patients and avoid patients with high rewards. Indeed, asserting = 0 is tantamount to saying the way causal effects depend on covariates x is the same in both populations. However, asserting that 2 is large implies that this dependence relies on information in x not captured by the values 1 (x);:::; G (x). Consequently, there is still only weak evidence in the study, itself, to guarantee that targeting in the candidate population will be effective. Note the dependence of Problem (4.12) on ^ 1 is more subtle and discussed in detail in the next subsection. Although theorem 4.4.2 is stated in full-generality, it depends on three user-defined parameters ^ , ^ 1 and ^ 2 , and three user-defined normskk,kk link andkk res . Practically, it is not clear that the data at hand, 99 i.e., the summary statistics and the study-evidence, support this detailed a specification. We might prefer a simpler model with fewer parameters to specify. Consequently, in what follows, we propose choosingkk res andkk link to be` 1 -norms. The resulting counterpart has a simple form, with only one effective user-defined parameter and norm. Corollary 4.4.2.1 (Simplified Robust Counterpart). Suppose bothkk res andkk link are taken to be` 1 - norms. Let z be an optimal solution to problem (4.8) with uncertainty set (4.11). Then, 1. IfI ^ 2 ^ 0, z = 0. 2. IfI ^ 2 ^ > 0, z is also an optimal solution to max z2Z C X c=1 z c r c ^ 1 I ^ 2 ^ C X c=1 z c r c ( g (x c ) g ) ! G g=1 ; (4.13) and the optimal value of problem (4.8) is (I ^ 2 ^ ) time the optimal value of problem (4.13). We argue that problem (4.13) represents a good practical modeling compromise. The simplified struc- ture still captures many of the qualitative features of problem (4.12), e.g., for sufficiently large ^ or ^ 2 , we should target no one. 
More importantly, finding an optimal solution only requires specifying one user- defined parameter, i.e., the ratio ^ 1 I ^ 2 ^ and one user-defined norm, i.e.,kk. This ratio further admits a simple interpretation as an “adjusted” coefficient of variation (CV). Specif- ically, in the special case when ^ = ^ 2 = 0, then, under our earlier idealized choice of normkk 2 Var P G g=1 g g (x ~ s ) , this ratio upper bounds the coefficient of variation of our approximate study CATE S (x ~ s ). Equivalently, since ^ = 0, it also upper bounds the coefficient of variation on our approximate candidate CATE on the study population C (x ~ s ). When> 0 or ^ 2 > 0, we adjust this coefficient of vari- ation by reducing our estimate of the mean effectiveness due to differences between the study and candidate populations, i.e., reducingI toI 2 ^ . 100 4.4.5 Covariate Matching as Regularization Problem (4.13) also facilitates a new connection between our approach and covariate matching in statistics. Intuitively, we might expect that the larger the difference between the covariates of the study population and targeted patients is, the less reliable the SATE of the study population is as an estimate of causal effects in the targeted patients. Thus, we should avoid such targetings. To make this intuition precise, note whenI ^ 2 ^ 0, we can rewrite eq. (4.13) as C X c=1 z c r c ^ 1 I ^ 2 ^ C X c=1 w c g (x c ) g ! G g=1 C X c=1 z c r c ; wherew c z c r c P C c=1 z c r c : (4.14) Thus, the objective Problem (4.13) approximates the intervention effectiveness of a candidate targeting z by its total reward minus a penalty that depends on the distance between the summary statistics evaluated on the study population and a reward-weighted average of these statistics ( P C c=1 w c g (x c )) G g=1 evaluated on the targeted patients from the candidate population. Our adjusted CV ^ 1 I ^ 2 ^ controls the trade-off between these two objectives. For small adjusted CV , solutions to (4.13) will target high-reward patients regardless of their covariates. When the adjusted CV is 0, problem (4.13) reduces to reward scoring. As the adjusted CV increases, solutions to (4.13) will match the reward-weighted average summary statistics in the study population more closely. Similar behavior holds for the general problem (4.12). This “covariate matching as regularization” interpretation of our model provides a natural intuition that connects with the literature on matching in design of experiments. Unlike traditional schemes for matching, however, our approach incorporates the rewardsr c both in the objective and in the particular structure of the penalty. We next show that in special cases with appropriately chosen normskk, we recover common matching procedures in the regularizer. For any positive definite matrix A, letktk A p t T At. The corresponding dual norm iskk A 1. 101 Corollary 4.4.2.2 ( 2 -Matching under Partition Description Functions). Suppose there exists a partition X = S G+1 g=1 X g , and g (x) = I(x 2 X g ) are our description functions with statistics g for all g = 1;:::;G. Definekk in (4.11) bykkkk , where diag() T 2R GG . Then, eq. (4.13) with this uncertainty set is equivalent to C X c=1 z c r c ^ 1 I ^ 2 ^ v u u t G+1 X g=1 (q z;g g ) 2 g C X c=1 z c r c ; (4.15) whereq z;g P C c=1 z c r c I(x c 2X g )= P C c=1 z c r c is the (reward-weighted) proportion of typeg patients in the targeted population and G+1 1 P G g=1 g . The penalty term of (4.15) is the 2 -distance between ( g ) G+1 g=1 and (q z;g ) G+1 g=1 . 
The 2 -distance met- ric is commonly used for matching with partitioned covariates imbens2015causal. It arises naturally as a regularizer in our method through the appropriate choice of uncertainty set. Remark. WhenjXj is finite and the partition consists of singletons, every CATE S (x) can be written in the form S (x) = 0 + P G g=1 g I(x2X g ) for some ( 0 ;)2 R G+1 . Consequently, one can take 2 = ^ 2 = 0 without loss of generality. This is the canonical setting for Corollary 4.4.2.2. Corollary 4.4.2.3 (Mean Matching under Linear Description Functions). SupposeX2R G , and g (x) = x g are our description functions with statistics g for all g = 1;:::;G. For any positive definite matrix V 2 R GG , consider uncertainty set (4.11) with a weighted ` 2 -normkk V . Then, Problem (4.13) is equivalent to max z2Z C X c=1 z c r c ^ 1 I ^ 2 ^ C X c=1 w c x c g ! G g=1 V 1 C X c=1 z c r c ; (4.16) wherew c zcrc P C c=1 zcrc . 102 The penalty term of eq. (4.16) is a (weighted) distance between the means of the covariates in the study population and in the target population. These types of distances between means are frequently used to assess the quality of covariate matching [93, pg. 410]. The choice of V controls the weighting. A common choice is to take V to be , the covariance matrix of x ~ s , which recovers so-called Mahalanobis matching [93, pg. 411]. Another common choice is to take V to be diag(). This choice recovers the Euclidean metric [93, pg. 411] or so-called mean-matching penalty Kallus2016. In summary, the interpretation of our method as regularizing via covariate matching highlights that the role of the normkk is primarily to establish a metric between the distribution of covariates in the study population and the (reward-weighted) distribution of covariates in the targeted group. Indeed, any choice of norm enjoys this interpretation. Thus, although it is certainly mathematically elegant to letkk to bekk , when is known, other reasonable norms should still yield good performance. Indeed, we use diag() to specify the norm in our case-study, because the full covariance matrix of covariates is not reported in [145]. In principle, one might ask if is possible to choose an uncertainty set to recover other covariate match- ing techniques. Appendix C.2.4 describes a general construction. However, we consider this construction to be principally of theoretical, rather than practical, interest for two reasons: First, the above matching techniques ( 2 -matching, Mahalanobis Matching and mean-matching) are by far the most common in prac- tice. Second, and more importantly, other covariate matching techniques typically require knowledge of the full-distribution of covariates, not simply knowledge of a few statistics. Since most studies do not report this full distribution, they cannot be used practically as a regularizer in our setting. We refer the readers to Appendix C.2.4 for further details. 103 4.4.6 Performance Guarantee for the Robust Approach Recall that for a sufficiently large radius, our uncertainty set contains the true candidate CATE. Using this observation, we bound the performance of our robust approach. Corollary 4.4.2.4 (Worst-Case Performance of Robust Targeting). Let 1 ; 2 ; be sufficiently large so that E[ ~ c j x ~ c = x c ]2U ; . Let z Rob be an optimizer of problem (4.12) for uncertainty setU ^ ;^ . Let d 1 C X c=1 z Rob c r c ( g (x c ) g ) ! 
G g=1 ; d 2 z Rob c r c C c=1 res ; d 3 z Rob c r c C c=1 link : Then, C X c=1 z Rob c r c E[ ~ c jx ~ c = x c ] I C X c=1 z Rob c r c 1 d 1 2 d 2 d 3 ( ^ 1 1 )d 1 + ( ^ 2 2 )d 2 + (^ )d 3 : corollary 4.4.2.4 describes the performance of the robust model under misspecification of the parameters. The bound only depends on the unknown causal effect through the parameters 1 ; 2 ;. corollary 4.4.2.4 guarantees that if we specify an uncertainty set large-enough, the robust strategy has non-negative effective- ness, i.e., it is not harmful. (A sufficient condition is that ^ 1 1 , ^ 2 2 and ^ .) This is structurally different from targeting rules, where the performance depends strongly on how well the rule matches the underlying causal effect if treatments may be harmful (section 4.3.2). Moreover, we stress that the bound only depends on the covariates differences between the study population and the targeted group, not the candidate population. 4.4.7 Selecting the Adj. CV parameter Thus far, we have not discussed the choice of the adj. CV parameter ^ 1 I ^ 2 ^ . One approach might be to use external information or domain knowledge to 1) estimate the amount of explained heterogeneity in 104 causal effects, 2) estimate the average causal effect in candidate population, and then 3) take their ratio. This approach requires external information or domain knowledge because the reported study-data itself does not contain information about these parameters. We adopt a different viewpoint motivated by the satisficing literature simon1955behavioral. Let z()2 arg max z2Z C X c=1 z c r c C X c=1 z c r c ( g (x c ) g ) ! G g=1 : Note that z(0) is the reward-scoring solution. Then, given an acceptable revenue loss 0<< 1 , we set to be the solution of max 0 s.t. C X c=1 r c z c () (1) C X c=1 r c z c (0): (4.17) In words, we seek the largest amount of robustness such that our targeting still achieves 1 of the optimal procedure under the nominal scenario (no heterogeneity). Similar ideas have been used throughout decision- analysis. and there is growing empirical evidence that such models better capture how real decision-makers think (see, e.g., [35] and references therein). In some sense we have simply replaced the problem of specifying with the problem of specifying . However, from a practical point of view, we believe it is much more natural for a decision-maker to specify that she is willing to give up, say 10%, of revenues in the nominal scenario to protect herself against potential heterogeneity, than it is for her to specify a value for the coefficient of variation of the unknown heterogeneous causal effect in the candidate population. This is the perspective we take in our case study, and specify = 10%. Problem (4.17) can be solved by bisection search over. (See theorem C.1.2 in Appendix.) 105 4.4.8 Extensions of the Base Model Appendix C.2 considers several extensions to our base robust model and shows how one can naturally incor- porate fairness considerations, domain-specific knowledge of the candidate CATE, evidence from multiple papers, and other generalizations. 4.5 Case Study: Comparison of Targeting Methods Using data from our partner hospital, we seek to answer the following two questions: 1) when do robust methods outperform scoring rules, and 2) what drives this performance? We target a subset of Medicaid patients for case management at the end of 2014, with the goal of reducing the underpayments for ED charges from 1/1/2015 to 6/30/2015. 
At the time of writing, there has not yet been a large-scale, landmark study quantifying the heterogeneous causal effects of case management. Hence, we do not have detailed CATE estimates, i.e., we do not have a “ground truth” against which to evaluate our methods. Our best empirical evidence to date is from [145] (Table 4.1). Using these data, we adopt the following approach to compare different methods: 1. We approximate a confidence interval for the SATE (Table 4.1). Had a researcher run a simple ran- domized control trial using data from [145], instead of a stratified analysis, she might have reported this estimate. This estimate is what would be typically available in a (non-stratified) published study. 2. We use this approximate confidence interval and summary statistics (Table 4.2) to compute our robust targeting solutions for two different variations of our uncertainty set (described below) corresponding to different (possibly misspecified) structures of CATEs. 3. We compare the performance of various methods in two different settings: 106 a. Section 4.5.4: We assume that the ground-truth candidate CATE is given by the estimates in [145]. This setting is most relevant if we believe that the strata based on previous ED visits capture most of the heterogeneity in causal effects. b. Section 4.5.5: We assume that the ground-truth candidate CATE depends only on demographic- related covariates: gender, race and age. This setting is most relevant if we believe that these demographics capture most of the heterogeneity in causal effects. A drawback of this setting is that we have no experimentally validated estimates of CATEs. Thus, we must focus on the worst- case performance. Before proceeding to the details, we summarize our main findings: As predicted by Theorem 4.3.1, when treatment is benign or causal effects are nearly homogeneous, reward scoring performs well. However, this performance degrades rapidly if the treatment may be harmful. As the heterogeneity in CATE increases, scoring rules can be worse than not targeting. Our robust method performs almost as well as scoring rules when the heterogeneity (as measured by the adjusted CV) is small and much better than scoring rules as the adjusted CV increases, provided that the study CATE is approximately linear in the chosen description functions. If not, robust methods may perform poorly. These results agree qualitatively with corollary 4.4.2.4. In this particular dataset, the benefits of our robust approach increase as the distribution ofr c has a shorter right tail or as the resource constraintK=C is tightened. 4.5.1 The Candidate Population The data from our partner hospital include information on all adult Medicaid ED visits from 1/1/2014 to 12/31/2014. For each visit, we have the date of visit, total charges, associated primary ICD-10-CM diagnosis 107 code, and patient identifier as well as the patient’s demographic information, including age, gender, race and Medicaid eligibility. For patients who visited the ED in 2014, we also have the number of ED visits from 1/1/2015 to 6/30/2015. Table 4.3 provides the summary statistics for the patients before and after we apply [145]’s inclusion/exclusion criteria and limit our attention to Medicaid patients only. (See notes of table 4.3 for detailed inclusion/exclusion criteria.) Despite applying the same inclusion-exclusion criteria, the resulting candidate population differs significantly from both the Medicaid ED patient population and the study population. 
(See table 4.2 for a comparison of candidate and study populations.) 4.5.2 Implementation Details of Targeting Methods For a fair comparison, all methods are assessed against the same values ofr c . Except in Section 4.5.6, these values are the average charges per ED visit for each patient in 2014. In Section 4.5.6, we assess the methods against other values ofr c with different tail behaviors. To maintain anonymity, we normalize performance metrics in dollar amounts by the full-information optimal value when it is available (Section 4.5.4) or by the summation ofK largestr c when it is not (Section 4.5.5). Except for Section 4.5.6, we fixK = 200 (K=C = 21%). We compare the following four methods: Reward Scoring (r-Scoring) We score byr c . Outcome Scoring (ry(0)-Scoring) We score byr c y c (0), wherey c (0) is the actual number of ED visits for patientc from 1/1/2015 to 6/30/2015. Robust-2 We solve Problem (4.15) using a partition description function for strata. (Equivalently, we add the constraints g = 0 to all other description functions.) The corresponding summary statistics, i.e., the proportion of patients in each stratum, are given by Table 4.1. 108 Table 4.3: Inclusion Criteria and Summary Statistics for the Candidate Population All Medicaid ED Patients (24; 943 patients in 2014) Candidate Population (C = 951 patients) Inclusion Criteria x Age, mean sd 43.3 9.5 38.3 12.5 No. of ED Visits in 2014 1 - 4 23,558 (94%) 0 (0%) 5 - 11 1,286 (5%) 860 (90%) 12 99 (1%) 91 (10%) Comorbidity Alcohol Abuse 1,186 (5%) 777 (18%) Drug Abuse 188 (0.8%) 39( 5%) Psychological Problem 1144 (5%) 216 (23%) Medicaid Type HUSKY A 8,767 (35%) 125 (13%) HUSKY B 43 (0.2%) 0 (0%) HUSKY C 1,697 (7%) 204 (21%) HUSKY D 9,039 (36%) 695 (73%) Other Characteristics Male 11,349 (45%) 442 (46%) Race/Ethnicity African American 9,153 (37%) 475 (50%) Hispanic 177 (1%) 7 (1%) White 7,532 (30%) 283 (29%) Other 8,081 (32%) 186 (20%) Length of Stay (hours), mean sd 5.1 7.4 6.2 5.5 No. of ED Visits from 1/1/2015 to 6/30/2015yc(0), mean sd 1.8 1.7 7.2 4.2 Avg. Charges Per ED Visit in 2014rc ($), mean sd 3,252 4,491 3,324 3,048 Charlson Comorbidity Score z , mean sd 0.08 0.4 1.9 0.9 Most Frequent Diagnosis y during ED Visits back problems (5%) alcohol-related disorders (10%) nonspecific chest pain (5%) abdominal pain (6%) skin diseases (4%) back problems (5%) upper respiratory infection (4%) nonspecific chest pain (4%) alcohol-related disorders (4%) connective tissue diseases (3%) sprains and strains (4%) non-traumatic joint disorders (3%) Notes. x Patients are included in the candidate population if they are at least 18 and below 65 years old; had at least 5 visits to the ED of our partner hospital in 2014; have a history of alcohol abuse, drug abuse, or psychological problems; have disability or blindness (HUSKY C); or have a low income and no dependent child (HUSKY D). Calculated from the primary ICD-10-CM diagnosis code using Elixhauser Comorbidity Software elixhauser1998comorbidity. In Connecticut, Medicaid patients are eligible for one of the four parts (http://www.ct.gov/hh/cwp/view.asp?a=3573&q=421548). HUSKY A covers children, their parents and pregnant women; HUSKY B covers children whose parents earn too much money to qualify for Medicaid; HUSKY C covers low-income patients with disabilities or blindness; and HUSKY D covers the lowest-income patients with no dependent child. z Calculated from the primary ICD-10-CM diagnosis code using Charlson Comorbidity Score charlson1987new. 
† Calculated from the primary ICD-10-CM diagnosis code using Clinical Classification Software (elixhauser2014clinical).

Robust-Full-Linear: We solve Problem (4.16) using partition description functions for strata, gender, and race, and a linear description function for age. The corresponding summary statistics, i.e., the mean and standard deviation of each covariate, are given by Table 4.2.

We focus on reward and outcome scoring because, as mentioned, these methods are optimal when the causal effect is homogeneous and additive or multiplicative, respectively. Moreover, outcome scoring, in particular, closely mirrors the current state of practice for targeting at our partner hospital. We choose the adjusted CV parameter $\hat{\gamma}_1 / (\bar{I} - \hat{\gamma}_2 - \hat{\Gamma})$ via the satisficing heuristic in Section 4.4.7 with $\epsilon = 10\%$, yielding values of approximately 0.55 and 0.3 for Robust-2 and Robust-Full-Linear, respectively. In other words, we are willing to trade off 10% of the cost saving in a nominal case for robustness. Finally, robust optimization problems frequently exhibit multiple optimal solutions. In our case study, we use the Pareto robust optimal solution corresponding to the realization $\tau^C(\cdot) = \bar{I} - \hat{\gamma}_2 - \hat{\Gamma}$, which can be computed as in [89]. Intuitively, this solution is non-dominated among robust optimal solutions when all patients respond to treatment identically.

4.5.3 Properties of the Solutions

Both Robust-2 and Robust-Full-Linear target all $K = 200$ patients when specifying their adjusted CV parameters as described above. This is essentially because targeting fewer than 200 patients would amount to more than a 10% loss in the nominal scenario. Indeed, as seen in fig. C.3 in Appendix C.3.3, as long as one insists on less than a 50% loss in the nominal scenario, both robust methods fully utilize the budget.

Because of their different choices of description functions, the two robust methods match the distribution of covariates in the study population differently. Robust-2 attempts to match the proportion of patients in each stratum. Specifically, Table 4.4 shows that only 48% of patients in the study population are in stratum 1, in contrast to a reward-weighted proportion of 90% in the candidate population. Thus, Robust-2 targets proportionally fewer stratum 1 patients, yielding an overall percentage of 78%. Notice that although this proportion is closer to the study population's 48%, it is not an exact match, since the robust method also balances the competing objective of targeting higher-reward patients (recall Remark 4.4.5). In our candidate population, stratum 2 patients typically have lower $r_c$ than stratum 1 patients (Figure C.4 in the e-companion). Completely matching the proportion of stratum 1 patients would entail a significant loss in rewards. Also, although Robust-2 improves the matching of the proportion of patients in each stratum, it exacerbates differences in other covariates, such as the proportion of Hispanic patients.

Similar observations can be made about Robust-Full-Linear. It matches the means of the covariates in the study population more closely than the candidate population does, but it does not always achieve exact matching. For example, our candidate population has very few Hispanic patients, so Robust-Full-Linear is unable to fully match the 22% of Hispanic patients in the study population. For all other covariates, it achieves a reasonably close match.
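The reward-weighted covariate averages reported in Table 4.4 below follow the weighting $w_c = z_c r_c / \sum_{c'} z_{c'} r_{c'}$ from eq. (4.14). A minimal sketch of the computation (with synthetic inputs, not the hospital data):

```python
import numpy as np

def reward_weighted_average(z, r, covariate):
    """Reward-weighted average of a covariate over the targeted patients,
    using weights w_c = z_c r_c / sum_c z_c r_c (cf. eq. (4.14))."""
    w = (z * r) / np.sum(z * r)
    return np.sum(w * covariate)

# Illustrative, synthetic inputs.
z = np.array([1, 1, 0, 1, 0])                     # targeting decisions
r = np.array([9000., 2000., 4000., 1000., 3000.]) # rewards
male = np.array([1, 0, 1, 1, 0])                  # an indicator covariate

print(reward_weighted_average(z, r, male))        # reward-weighted proportion male
```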
Table 4.4: Characteristics and Reward-Weighted Average Covariates of the Targeted Patients, by Method
Columns: Study Population | Candidate Population | Reward Scoring | Outcome Scoring | Robust-2 | Robust-Full-Linear
Characteristics of Targeted Patients
  Avg. Charges Per ED Visit in 2014, r_c ($): – | 3,324 | 7,547 | 5,965 | 6,795 | 6,815
  Avg. No. of ED Visits from 1/1/2015 to 6/30/2015, y_c(0): – | 2.3 | 2.2 | 5.3 | 3.0 | 2.6
Weighted Avg. Pre-Treatment Covariates
  Demographics
    Male: 75% | 48% | 51% | 52% | 53% | 62%
    African American: 54% | 46% | 43% | 44% | 41% | 53%
    Hispanic: 22% | 0.5% | 0.3% | 0.1% | 0.1% | 0.3%
    White: 13% | 34% | 41% | 42% | 43% | 44%
    Age: 43.3 | 40.7 | 44.9 | 44.9 | 45.3 | 44.3
  Two Strata from [145]
    Stratum 1 (5-11 ED Visits in 2014): 48% | 90% | 90% | 83% | 78% | 84%
Note. Except for the study population, we show the reward-weighted average of pre-treatment covariates. Thus, the reward-weighted summary statistics for the candidate population differ from those in Table 4.3.

4.5.4 When CATEs Depend Only on Previous ED Visits

In this section, the ground-truth candidate CATE is given by

E[ψ̃_c̃ | x_c̃ = x_c] ≡ ψ₁ · I(patient c in stratum 1) + ψ₂ · I(patient c in stratum 2),   (4.18)

for all c ∈ {1, ..., C}, where ψ₁ and ψ₂ denote the true CATEs for stratum 1 (5 to 11 ED visits in the previous year) and stratum 2 (at least 12 ED visits in the previous year). For the particular point estimates ψ₁ = 2.1 and ψ₂ = 3.3 (ED visits reduced) provided by [145], reward scoring achieves 99% of the full-information optimum benchmark, while Robust-2 achieves 95% and Robust-Full-Linear achieves 93%.

However, the point estimates for ψ₁ and ψ₂ are not exact, so this assessment might be optimistic. We next study the performance under small perturbations of these estimates. Specifically, we vary ψ₁ and ψ₂ uniformly within the confidence intervals provided by [145]: [0.1, 4.1] and [0.2, 6.4], respectively. Each pair of values yields a different relative performance for each method and a different level of heterogeneity in the candidate CATE, as measured by its coefficient of variation. This coefficient of variation is given by

CV = sqrt( Σ_{g=1}^{2} π_g (ψ_g − ψ_T)² ) / ψ_T,

where π₁ = 0.48 and π₂ = 0.52 from Table 4.2. We summarize the relative performance of each method by plotting the mean and standard deviation for a given level of CV across all perturbations in Figure 4.2.

[Figure 4.2: line plot of relative performance (%) against the degree of heterogeneity (CV, from 0 to 0.8) for Reward Scoring, Outcome Scoring, Robust-2, and Robust-Full-Linear.]
Figure 4.2: Relative Performance as Both ψ₁ and ψ₂ Vary Uniformly within Their Confidence Intervals. Performance is relative to the full-information benchmark, and ψ₁, ψ₂ are the numbers of ED visits reduced in each stratum. We plot the mean relative performance (point) across choices of ψ₁, ψ₂ with the given CV, plus/minus one standard deviation (error bar). To the left of the dashed line CV = 0.55, the ground-truth CATEs belong to the uncertainty set of the Robust-2 method.

Perhaps unsurprisingly, all methods perform reasonably well, obtaining at least 65% of the full-information optimal value. Indeed, the confidence intervals of the CATEs are both strictly positive, i.e., treatment is benign, and Theorem 4.3.1 ensures that reward scoring cannot be too suboptimal. Nonetheless, as the heterogeneity increases, we do observe qualitative differences in behavior. When CV = 0, reward scoring is optimal (as expected by Corollary 4.3.1.1). As CV increases, reward scoring's performance degrades rapidly. At worst, it obtains about 75% relative performance.
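For concreteness, the heterogeneity measure on the horizontal axis of Figure 4.2 can be evaluated directly. The short sketch below does so at the point estimates of [145], taking ψ_T to be the π-weighted average of the two stratum CATEs, an assumption that is consistent with the SATE of roughly 2.7 discussed next.

import numpy as np

psi = np.array([2.1, 3.3])    # point estimates of psi_1, psi_2 from [145] (ED visits reduced)
pi = np.array([0.48, 0.52])   # stratum proportions pi_1, pi_2 (Table 4.2)

psi_T = pi @ psi                                  # weighted average CATE, approx. 2.72
cv = np.sqrt(pi @ (psi - psi_T) ** 2) / psi_T     # coefficient of variation, approx. 0.22
print(psi_T, cv)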
Robust-2 performs slightly worse than reward scoring when the degree of heterogeneity is small (obtaining about 90% relative performance) and noticeably better than reward scoring once CV increases beyond roughly 0.4 (obtaining almost 95% relative performance). A similar observation can be made for Robust-Full-Linear: it performs almost as well as reward scoring when the degree of heterogeneity is small and better than reward scoring when the heterogeneity is sufficiently large. Notice also that when CV < 0.55, the ground-truth CATEs in our simulation belong to the uncertainty set defined by Robust-2. However, because of its "worst-case perspective," Robust-2 does not outperform reward scoring unless the CV is sufficiently large.

One way to interpret the value CV = 0.4 (where the methods intersect) is that since the SATE is approximately 2.7, CV = 0.4 implies the standard deviation of the candidate CATE is at most 0.4 × 2.7 = 1.08. In other words, given two random patients, the expected absolute difference between their causal effects is at most √2 × 1.08 ≈ 1.5 visits. (For X, Y i.i.d. random variables, E[|X − Y|] ≤ √(E[(X − Y)²]) = √(2 Var(X)) by Jensen's inequality.) Thus, if we believe that the benefit of case management varies by more than 1.5 visits between patients, Robust-2 outperforms reward scoring.

To further investigate the effects of heterogeneity, we vary only one of ψ₁ or ψ₂ in Figure 4.3 and keep the other fixed at its point estimate from [145]. The performance of reward scoring degrades rapidly as ψ₁ decreases, and it can perform very badly when ψ₁ becomes negative, i.e., when case management may increase ED visits for patients in stratum 1. To the contrary, both robust methods significantly outperform reward scoring in these instances. As mentioned in Section 4.5.3, both robust methods effectively target fewer stratum 1 patients, which makes their performance less sensitive to changes in ψ₁. Of course, since these methods target more stratum 2 patients, they are more sensitive to the value of ψ₂ (see the right panel of Figure 4.3), but the magnitude of the changes is substantially smaller. Overall, then, we would argue that the robust methods are indeed more "robust" to uncertainties in ψ₁ and ψ₂.

[Figure 4.3: two-panel line plot of relative performance (%) as the CATE of stratum 1, ψ₁ (left panel, from −1 to 4 ED visits), or the CATE of stratum 2, ψ₂ (right panel, from 0 to 6 ED visits), varies, for Reward Scoring, Outcome Scoring, Robust-2, and Robust-Full-Linear; the 95% confidence interval of the varying parameter is shaded.]
Figure 4.3: Relative Performance as Either ψ₁ or ψ₂ Varies. Negative values of ψ₁, ψ₂ indicate increased ED visits. In the left panel, we fix ψ₂ = 3.3 and vary ψ₁, while in the right panel, we fix ψ₁ = 2.1 and vary ψ₂. The dashed vertical lines indicate the point estimates of the varying parameter.

4.5.5 When CATEs Depend Only on Patient Demographics

In this section, the ground-truth study CATE is given by

E[ψ̃_s̃ | x_s̃ = x_s] ≡ β₀ + Σ_{g=1}^{G} β_g φ_g(x_s) + ε(x_s),   ∀ s ∈ {1, ..., S},   (4.19)

for some β₀, β, ε such that β_g ≠ 0 only if g corresponds to a demographic-related covariate, i.e., the patient's gender, race, or age. Let 𝒢 be the set of indices of these demographic-related description functions. Since we do not have experimentally validated estimates for a CATE with this structure, we will compare to the worst-case performance over the uncertainty set
U_{Γ; Γ₁, Γ₂} = { ψ^C : 𝒳 → ℝ | ∃ ε : 𝒳 → ℝ, β₀ ∈ ℝ, β ∈ ℝ^G such that
    ψ^S(x) = β₀ + Σ_{g=1}^{G} β_g φ_g(x) + ε(x),   β_g = 0 ∀ g ∉ 𝒢,
    I ≤ β₀ + Σ_{g=1}^{G} β_g φ̄_g ≤ Ī,
    ‖(ψ^S(x_c) − ψ^C(x_c))_{c=1}^{C}‖_∞ ≤ Γ,
    ‖β‖_V ≤ Γ₁,
    ‖(ε(x_c))_{c=1}^{C}‖_∞ ≤ Γ₂ },   (4.20)

for varying Γ, Γ₁, and Γ₂. Here, V is diagonal with the variances of the demographic-related covariates as entries (Table 4.2). Using Theorem 4.4.2, we can compute the worst-case objective in closed form. To maintain anonymity, we normalize the worst-case objective by (I − Γ₂) Σ_{c=1}^{K} r_c, where r_1 ≥ ... ≥ r_C. The final worst-case performance metric depends only on the ratio Adj. CV ≡ Γ₁/(I − Γ₂). Inspired by this fact, we present our results relative to this "true" Adj. CV and use the notation Adj. ĈV for the parameter of the uncertainty set used to compute Robust-2 or Robust-Full-Linear, depending on the chosen setting.

Note that both robust methods are "misspecified" under (4.19) and (4.20). Specifically, Robust-2 incorrectly assumes that CATEs depend only on stratum membership, not on demographic covariates. Similarly, Robust-Full-Linear is "over-specified" in the sense that it allows CATEs to depend on the demographic covariates but also allows them to depend on stratum membership. More importantly, it assumes Adj. ĈV = 0.3. In our experiments, we vary the value of Adj. CV in (4.20). Consequently, for most values, the CATE that achieves the worst case in (4.20) is not a member of the uncertainty set used to define Robust-Full-Linear. Finally, we also stress that the treatment in this experiment can be potentially harmful.

We plot the anonymized worst-case performance against Adj. CV for the different methods in Figure 4.4. Consider reward scoring and Robust-Full-Linear. When Adj. CV = 0, reward scoring is optimal. Robust-Full-Linear is not optimal there, but it still attains at least 90% of reward scoring's performance, owing to our proposed heuristic for choosing Adj. ĈV in Section 4.4.7. However, the performance of reward scoring degrades rapidly as Adj. CV increases, while Robust-Full-Linear performs significantly better than the scoring rules once Adj. CV increases above 0.2.

[Figure 4.4: line plot of worst-case performance (%) against Adj. CV (from 0 to 1.5) for Reward Scoring, Outcome Scoring, Robust-2, and Robust-Full-Linear.]
Figure 4.4: Worst-Case Performance When CATEs Depend Only on Demographics. Worst-case performance is the worst-case objective value under the uncertainty set (4.20), normalized by (I − Γ₂) Σ_{c=1}^{K} r_c.

Robust-2 performs poorly in this experiment. Its worst-case performance can be negative and worse than not targeting. We partially explain this behavior using Corollary 4.4.2.4 in Appendix C.3.4. Intuitively, the variation in the true CATE is not well captured by the description functions in Robust-2. Consequently, Γ₂ must be very large before the true CATE is contained in its uncertainty set, making the uncertainty set very large and the worst-case performance very poor. In other words, Robust-2 is highly misspecified.

4.5.6 Sensitivity Analysis

Reward scoring performs well compared to the robust methods when the treatment is benign. We argued that this is due to the long-tail behavior of the reward distribution. In this section, we seek to verify this by exploring how the performance of the robust methods and reward scoring changes as we vary the tail behavior of the r_c distribution and the resource constraint K/C. We only include results for Robust-Full-Linear and reward scoring, but we observe similar behavior for Robust-2 and outcome scoring (see Appendix C.3.5).
Throughout, we focus on the performance difference between Robust-Full-Linear and reward scoring under the ground-truth models of Sections 4.5.4 and 4.5.5. For each setting, we compute the performance metrics for the robust method and reward scoring separately and report their difference. Under the ground truth of Section 4.5.4, we compute the average relative performance with respect to the full-information optimum for each method and take the difference; we refer to this quantity as the average relative performance difference. Under the ground truth of Section 4.5.5, we compute the worst-case performance of Robust-Full-Linear and reward scoring (normalized by (I − Γ₂) Σ_{c=1}^{K} r_c) separately and take the difference; we term this quantity the worst-case performance difference. A positive performance difference implies that Robust-Full-Linear outperforms reward scoring.

[Figure 4.5: two-panel line plot of the average relative performance difference (left panel, for CV = 0, 0.2, 0.4, 0.6, 0.8) and the worst-case performance difference (right panel, for Adj. CV = 0, 0.5, 1, 1.5) against the resource constraint K/C (10% to 50%).]
Figure 4.5: Difference between Robust-Full-Linear and Reward Scoring, Varying the Resource Constraint. The left panel corresponds to Figure 4.2 in Section 4.5.4. The right panel corresponds to Figure 4.4 in Section 4.5.5. In both panels, we plot the performance difference (defined in the main text) against K/C.

To investigate the sensitivity to the resource constraint, we vary the value of K and reproduce the experiments of Figs. 4.2 and 4.4 in terms of the performance difference between Robust-Full-Linear and reward scoring in Figure 4.5. For each value of K, we use our satisficing approach of Section 4.4.7 with a 10% tolerance to specify the parameters of the uncertainty set. For all values of K studied, i.e., K = 10, 20, ..., 50, 100, 150, ..., 800, both Robust-Full-Linear and Robust-2 target all K patients. For brevity, we only present results for Robust-Full-Linear. We see that the performance difference increases as the resource constraint is tightened, i.e., as K/C decreases, in both settings. In addition, for any fixed level of K/C, Robust-Full-Linear tends to increasingly outperform reward scoring as CV or Adj. CV increases, which is consistent with our previous observations.

To investigate the sensitivity to the reward distribution, we reproduce the experiments of Figures 4.2 and 4.4 in terms of performance differences for a variety of different reward distributions. Specifically, define the new rewards

r'_c ≡ F⁻¹_{α₁,10}(F̂(r_c)),   ∀ c = 1, ..., C,   (4.21)

where F̂(·) is the empirical cumulative distribution function (CDF) of the original rewards r_c (average charges per ED visit for patient c in 2014) and F⁻¹_{α₁,α₂}(·) is the inverse CDF of a beta distribution with parameters α₁ and α₂ = 10. By varying α₁, we can alter the tail behavior of the reward distribution (see Figure 4.6). Although this transformation may change the average reward in the candidate population, the scale of the rewards does not affect our performance metrics.

[Figure 4.6: density of the transformed rewards r'_c for α₁ ∈ {1, 3, 5, 7, 20, 50}.]
Figure 4.6: Distribution of the Rewards r'_c Defined in Eq. (4.21). The parameter α₂ = 10 and α₁ varies.
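A small sketch of the transformation in Eq. (4.21), assuming SciPy's beta distribution for F⁻¹_{α₁,α₂}. How the empirical CDF handles ties and the largest observation is an implementation choice not specified in the text, and the rewards below are synthetic.

import numpy as np
from scipy import stats

def transform_rewards(r, alpha1, alpha2=10.0):
    """r'_c = F^{-1}_{alpha1, alpha2}(F_hat(r_c)), with F_hat the empirical CDF of r."""
    r = np.asarray(r, dtype=float)
    # Rank-based empirical CDF; dividing by n + 1 keeps every value strictly inside (0, 1).
    f_hat = stats.rankdata(r, method="average") / (len(r) + 1)
    return stats.beta.ppf(f_hat, alpha1, alpha2)

rng = np.random.default_rng(1)
r = rng.lognormal(mean=8.0, sigma=0.8, size=951)   # synthetic long-tailed rewards
for a1 in (1, 3, 5, 7, 20, 50):                    # the alpha_1 values shown in Figure 4.6
    r_new = transform_rewards(r, a1)
    print(a1, round(r_new.mean(), 3), round(r_new.max(), 3))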
For varying α₁, we re-compute the rewards via Eq. (4.21), re-compute each method, and assess the performance difference with the new rewards under our two possible ground truths (Figure 4.7).

[Figure 4.7: two-panel line plot of the average relative performance difference (left panel, by CV) and the worst-case performance difference (right panel, by Adj. CV) against α₁ (from 0 to 50).]
Figure 4.7: Difference between Robust-Full-Linear and Reward Scoring, Varying the Reward Distribution. The left panel corresponds to Figure 4.2 in Section 4.5.4. The right panel corresponds to Figure 4.4 in Section 4.5.5.

Similar to the previous experiments, for a fixed value of α₁, we see an increasing performance gap as CV or Adj. CV increases. More interestingly, the performance difference increases as α₁ increases, although with severely decreasing marginal returns; for large values of α₁, there is almost no additional benefit. This effect is significantly less pronounced in the right panel, which considers worst-case behavior. Overall, it seems that for this particular dataset, the benefits of the Robust-Full-Linear method over reward scoring increase as the reward distribution has a shorter right tail. We write "for this dataset" since it is possible to construct datasets where this finding does not hold.

This result agrees well with our previous intuition that targeting patients with very high rewards is "risky," since performance then depends strongly on the unknown causal effect of these high-reward patients. Intuitively, if case management causes these high-reward patients to visit the ED more frequently, our effectiveness decreases substantially. In the left panel, reward scoring performs well (i.e., the performance difference is negative) for the initial reward distribution at least partially because the highest-reward patients mostly belong to stratum 1 and because these patients have the highest causal effects. As the reward distribution shifts, the difference in rewards between stratum 1 and stratum 2 patients shrinks, so this benefit erodes and the performance difference becomes positive. By contrast, in the right-hand panel, since we consider worst-case behavior, reward scoring does not enjoy such a benefit under the initial reward distribution, and we see a much smaller gain as we shift the distribution.

4.6 Conclusion

We proposed a robust optimization model to maximize intervention effectiveness using evidence available from published studies. Our approach is intuitive, flexible, and practically tractable using off-the-shelf mixed-integer optimization software, and it outperforms current practice when the underlying heterogeneity in causal effects is large.

References

[1] Gediminas Adomavicius and Alexander Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge & Data Engineering, (6):734–749, 2005.
[2] Özden Gür Ali, Yalçın Akçay, Serdar Sayman, Emrah Yılmaz, and M Hamdi Özçelik. Cross-selling investment products with a win-win perspective in portfolio optimization. Operations Research, 65(1):55–74, 2016.
[3] Mohammed Alyakoob and Mohammad Saifur Rahman. Shared prosperity (or lack thereof) in the sharing economy. Working Paper, 2019.
[4] Amazon.com. Amazon fulfillment center, 2020. Accessed: 2020-01-25.
[5] Amazon.com. Amazon hub locker, 2020. Accessed: 2020-01-10.
[6] Michelle Andrews, Xueming Luo, Zheng Fang, and Anindya Ghose.
Mobile Ad effectiveness: Hyper-contextual targeting with crowdedness. Marketing Science, 35(2):218–233, 2016. [7] Joshua D Angrist, Guido W Imbens, and Donald B Rubin. Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434):444–455, 1996. [8] Eva Ascarza. Retention futility: Targeting high-risk customers might be ineffective. Journal of Marketing Research, 55(1):80–98, 2018. [9] Susan Athey and Guido W Imbens. Machine learning methods for estimating heterogeneous causal effects. Stat, 1050(5), 2015. [10] Susan Athey, Julie Tibshirani, and Stefan Wager. Generalized random forests. The Annals of Statis- tics, 47(2):1148–1178, 2019. [11] Betsy Atkins. Logistics in the e-commerce era, 2020. Accessed: 2020-03-01. [12] David H Autor. Outsourcing at will: The contribution of unjust dismissal doctrine to the growth of employment outsourcing. Journal of Labor Economics, 21(1):1–42, 2003. [13] Jill Avery, Thomas J Steenburgh, John Deighton, and Mary Caravella. Adding bricks to clicks: Predicting the patterns of cross-channel elasticities over time. Journal of Marketing, 76(3):96–111, 2012. [14] Yannis Bakos and Erik Brynjolfsson. Bundling and competition on the internet. Marketing science, 19(1):63–82, 2000. 121 [15] Hamsa Bastani and Mohsen Bayati. Online decision-making with high-dimensional covariates. SSRN: https://ssrn.com/abstract=2661896, 2016. [16] Robert Batt and Santiago Gallino. Finding a needle in a haystack: The effects of searching and learning on pick-worker performance. 2017. [17] William J Baumol and Philip Wolfe. A warehouse-location problem. Operations Research, 6(2):252– 263, 1958. [18] Kapil Bawa and Robert Shoemaker. The effects of free sample promotions on incremental brand sales. Marketing Science, 23(3):345–363, 2004. [19] David Bell, Santiago Gallino, and Antonio Moreno. Showrooms and information provision in omni- channel retail. Production and Operations Management, 24(3):360–362, 2015. [20] David R Bell, Santiago Gallino, and Antonio Moreno. How to win in an omnichannel world. MIT Sloan Management Review, 56(1):45, 2014. [21] David R Bell, Santiago Gallino, and Antonio Moreno. Offline showrooms in omnichannel retail: Demand and operational benefits. Management Science, 64(4):1629–1651, 2017. [22] Rodrigo Belo, Pedro Ferreira, and Rahul Telang. Broadband in school: Impact on student perfor- mance. Management Science, 60(2):265–282, 2013. [23] A. Ben-Tal, L. El Ghaoui, and A. Nemirovski. Robust Optimization. Princeton University Press, Princeton, New Jersey, 2009. [24] Aharon Ben-Tal, Dick Den Hertog, Anja De Waegenaere, Bertrand Melenberg, and Gijs Rennen. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341–357, 2013. [25] Aharon Ben-Tal, Dick Den Hertog, and Jean-Philippe Vial. Deriving robust counterparts of nonlinear uncertain inequalities. Mathematical Programming, 149(1-2):265–299, 2015. [26] Dimitris Bertsimas and Martin S. Copenhaver. Characterization of the equivalence of robustification and regularization in linear and matrix regression. European Journal of Operational Research, 2017. [27] Dimitris Bertsimas, Vishal Gupta, and Nathan Kallus. Robust sample average approximation. Math- ematical Programming, pages 1–66, 2017. [28] Dimitris Bertsimas, Vishal Gupta, and Nathan Kallus. Data-driven robust optimization. Mathematical Programming, 167(2):235–292, 2018. [29] Dimitris Bertsimas, Allison O’Hair, Stephen Relyea, and John Silberholz. 
An analytics approach to designing combination chemotherapy regimens for cancer. Management Science, 62(5):1511–1531, 2016. [30] Dimitris Bertsimas, John Silberholz, and Thomas Trikalinos. Optimal healthcare decision making under multiple mathematical models: application in prostate cancer screening. Health Care Manage- ment Science, pages 1–14, 2016. [31] Hemant K Bhargava. Retailer-driven product bundling in a distribution channel. Marketing Science, 31(6):1014–1021, 2012. 122 [32] John Billings, Nina Parikh, and Tod Mijanovich. Emergency department use in New York City: A substitute for primary care? Issue Brief (Commonwealth Fund), 433:1–5, 2000. [33] John Billings and Maria C Raven. Dispelling an urban legend: Frequent emergency department users have substantial burden of disease. Health Affairs, 32(12):2099–2108, 2013. [34] Thomas Bortfeld, Timothy C. Y . Chan, Alexei Trofimov, and John N Tsitsiklis. Robust management of motion uncertainty in intensity-modulated radiation therapy. Operations Research, 56(6):1461– 1473, 2008. [35] David B Brown and Melvyn Sim. Satisficing measures for analysis of risky positions. Management Science, 55(1):71–84, 2009. [36] Erik Brynjolfsson, Yu Hu, and Mohammad S Rahman. Battle of the retail channels: How product selection and geography drive cross-channel competition. Management Science, 55(11):1755–1765, 2009. [37] Erik Brynjolfsson, Yu Hu, and Duncan Simester. Goodbye pareto principle, hello long tail: The effect of search costs on the concentration of product sales. Management Science, 57(8):1373–1386, 2011. [38] Erik Brynjolfsson, Yu Jeffrey Hu, and Mohammad S Rahman. Competing in the age of omnichannel retailing. MIT Sloan Management Review, 54(4), 2013. [39] Erik Brynjolfsson and Michael D Smith. Frictionless commerce? a comparison of internet and conventional retailers. Management science, 46(4):563–585, 2000. [40] Robin Burke. Hybrid recommender systems: Survey and experiments. User modeling and user- adapted interaction, 12(4):331–370, 2002. [41] Jason Chan and Anindya Ghose. Internet’s dirty secret: Assessing the impact of online intermediaries on HIV transmission. MIS Quarterly, 38(4):955–976, 2013. [42] Jason Chan, Probal Mojumder, and Anindya Ghose. The digital sin city: An empirical study of craigslist’s impact on prostitution trends. Information Systems Research, 30(1):219–238, 2019. [43] Tat Y Chan, Chunhua Wu, and Ying Xie. Measuring the lifetime value of customers acquired from google search advertising. Marketing Science, 30(5):837–850, 2011. [44] Timothy C Y Chan, Derya Demirtas, and Roy H Kwon. Optimizing the deployment of public access defibrillators. Management Science, 62(12):3617–3635, 2016. [45] Timothy C Y Chan, Zuo-Jun Max Shen, and Auyon Siddiq. Robust defibrillator deployment under cardiac arrest location uncertainty via row-and-column generation. Operations Research, 2017. [46] Ramnath K Chellappa and Shivendu Shivendu. Managing piracy: Pricing and sampling strategies for digital experience goods in vertically segmented markets. Information Systems Research, 16(4):400– 417, 2005. [47] Yuxin Chen, Xinxin Li, and Monic Sun. Competitive mobile geo targeting. Marketing Science, 36(5):666–682, 2017. [48] Hsing Kenneth Cheng and Yipeng Liu. Optimal software free trial strategy: The impact of network externalities and consumer uncertainty. Information Systems Research, 23(2):488–504, 2012. 123 [49] Alma Cohen and Peter Siegelman. Testing for adverse selection in insurance markets. 
Journal of Risk and Insurance, 77(1):39–84, 2010. [50] Stephen R Cole and Elizabeth A Stuart. Generalizing evidence from randomized clinical trials to target populations: The ACTG 320 trial. American Journal of Epidemiology, 172(1):107–115, 2010. [51] Ren´ e De Koster, Tho Le-Duc, and Kees Jan Roodbergen. Design and control of warehouse order picking: A literature review. European journal of operational research, 182(2):481–501, 2007. [52] David De Meza and David C Webb. Advantageous selection in insurance markets. RAND Journal of Economics, 32(2):249–262, 2001. [53] Erick Delage and Yinyu Ye. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research, 58(3):595–612, 2010. [54] David Deming and Susan Dynarski. Into college, out of poverty? Policies to increase the postsec- ondary attainment of the poor. Technical report, National Bureau of Economic Research, 2009. [55] Dick Den Hertog. Is DRO the Only Approach for Optimization Problems with Convex Uncertainty?, 2018. [56] Sarang Deo, Kumar Rajaram, Sandeep Rath, Uday S Karmarkar, and Matthew B Goetz. Planning for HIV screening, testing, and care at the Veterans Health Administration. Operations Research, 63(2):287–304, 2015. [57] Isaac M Dinner, Harald J Van Heerde, and Scott A Neslin. Driving online and offline sales: The cross-channel effects of traditional, online display, and paid search advertising. Journal of Marketing Research, 50(5):527–545, 2013. [58] Jay Dixon, Bryan Hong, and Lynn Wu. The employment consequences of robots: Firm-level evi- dence. Working Paper, 2019. [59] Esther Duflo, Rachel Glennerster, and Michael Kremer. Using randomization in development eco- nomics research: A toolkit. Handbook of Development Economics, 4:3895–3962, 2007. [60] Hans-Georg Eichler, Eric Abadie, Alasdair Breckenridge, Hubert Leufkens, and Guido Rasi. Open clinical trial data for all? A view from regulators. PLoS Medicine, 9(4):e1001202, 2012. [61] A Elixhauser, C Steiner, and L Palmer. Clinical Classifications Software (CCS). Agency for Health- care Research and Quality, 2014. [62] EMarketer. In China, Alibaba dominates digital Ad landscape. https://www.emarketer. com/content/in-china-alibaba-dominates-digital-ad-landscape, 2018. Ac- cessed: 2019-06-28. [63] Peyman Mohajerin Esfahani and Daniel Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Pro- gramming, pages 1–52, 2015. [64] Zheng Fang, Bin Gu, Xueming Luo, and Yunjie Xu. Contemporaneous and delayed sales impact of location-based mobile promotions. Information Systems Research, 26(3):552–564, 2015. [65] EFATL Feldman, FA Lehrer, and TL Ray. Warehouse location under continuous economies of scale. Management Science, 12(9):670–684, 1966. 124 [66] Nathan M Fong, Zheng Fang, and Xueming Luo. Geo-conquesting: Competitive locational targeting of mobile promotions. Journal of Marketing Research, 52(5):726–735, 2015. [67] Santiago Gallino and Antonio Moreno. Integration of online and offline channels in retail: The impact of sharing reliable inventory availability information. Management Science, 60(6):1434–1451, 2014. [68] Fei Gao and Xuanming Su. Omnichannel retail operations with buy-online-and-pick-up-in-store. Management Science, 63(8):2478–2492, 2016. [69] Fei Gao and Xuanming Su. Online and offline information for omnichannel retailing. Manufacturing & Service Operations Management, 19(1):84–98, 2016. 
[70] Rui Gao, Xi Chen, and Anton J Kleywegt. Wasserstein distributional robustness and regularization in statistical learning. arXiv preprint arXiv:1712.06050, 2017. [71] Dan Geng, Vibhanshu Abhishek, and Beibei Li. When the bank comes to you: Branch network and customer multi-channel banking behavior. Proceedings of ICIS 2015, 6(1):971–981, 2015. [72] Anindya Ghose, Avi Goldfarb, and Sang Pil Han. How is the mobile internet different? Search costs and local activities. Information Systems Research, 24(3):613–631, 2013. [73] Anindya Ghose, Hyeokkoo Eric Kwon, Dongwon Lee, and Wonseok Oh. Seizing the commuting moment: Contextual targeting based on mobile transportation apps. Information Systems Research, 30(1):154–174, 2019. [74] Anindya Ghose, Beibei Li, and Siyuan Liu. Mobile targeting using customer trajectory patterns. Management Science, 65(11):5027–5049, 2019. [75] Anindya Ghose and Sha Yang. An empirical analysis of search engine advertising: Sponsored search in electronic markets. Management science, 55(10):1605–1622, 2009. [76] Alison L Gibbs and Francis Edward Su. On choosing and bounding probability metrics. International statistical review, 70(3):419–435, 2002. [77] Joel Goh, Mohsen Bayati, Stefanos A. Zenios, Sundeep Singh, and David Moore. Data uncertainty in markov chains: Application to cost-effectiveness analyses of medical innovations. Operations Research, 2018. [78] Avi Goldfarb and Catherine Tucker. Online display advertising: Targeting and obtrusiveness. Mar- keting Science, 30(3):389–404, 2011. [79] Brad N Greenwood and Ritu Agarwal. Matching platforms and hiv incidence: An empirical investi- gation of race, gender, and socioeconomic status. Management Science, 62(8):2281–2303, 2015. [80] Brad N Greenwood and Sunil Wattal. Show me the way to go home: An empirical investigation of ride-sharing and alcohol related motor vehicle fatalities. MIS Quarterly, 41(1):163–187, 2017. [81] Vishal Gupta, Brian Rongqing Han, Song-Hee Kim, and H. Paek. Maximizing intervention effective- ness. Management Science, Forthcoming, 2019. [82] Pierre Gutierrez and Jean-Yves G´ erardy. Causal inference and uplift modelling: A review of the literature. In International Conference on Predictive Applications and APIs, pages 1–13, 2017. 125 [83] Brian Rongqing Han, Tianshu Sun, Leon Yang Chu, and Lixia Wu. Connecting customers and mer- chants offline: Experimental evidence from the commercialization of last-mile stations at alibaba. 2019. [84] Erin Hartman, Richard Grieve, Roland Ramsahai, and Jasjeet S Sekhon. From SATE to PATT: Combining experimental with observational studies to estimate population treatment effects. Journal of Royal Statistical Society: Series A (Statitsics in Society), 10:1111, 2015. [85] Amir Heiman, Bruce McWilliams, Zhihua Shen, and David Zilberman. Learning and forgetting: Modeling optimal product sampling over time. Management Science, 47(4):532–546, 2001. [86] Julian Higgins and Simon G Thompson. Quantifying heterogeneity in a meta-analysis. Statistics in Medicine, 21(11):1539–1558, 2002. [87] Paul W. Holland. Statistics and causal inference. Journal of the American Statistical Association, 81(396):945–960, 1986. [88] Sam K. Hui, J. Jeffrey Inman, Yanliu Huang, and Jacob Suher. The effect of in-store travel distance on unplanned spending: Applications to mobile promotion strategies. Journal of Marketing, 77(2):1–16, 2013. [89] Dan A Iancu and Nikolaos Trichakis. Pareto efficiency in robust optimization. Management Science, 60(1):130–147, 2013. 
[90] Kosuke Imai, Marc Ratkovic, et al. Estimating treatment effect heterogeneity in randomized program evaluation. The Annals of Applied Statistics, 7(1):443–470, 2013. [91] Guido W Imbens. Nonparametric estimation of average treatment effects under exogeneity: A review. Review of Economics and Statistics, 86(1):4–29, 2004. [92] Guido W. Imbens and Joshua D. Angrist. Identification and estimation of local average treatment effects. Econometrica, 62(2):467–475, 1994. [93] Guido W Imbens and Donald B Rubin. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015. [94] Carlos Jackson and Annette DuBard. It’s all about impactability! Optimizing targeting for care management of complex patients. Community Care of North Carolina., Data Brief 4, 2015. [95] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning, volume 112. Springer, 2013. [96] Bing Jing. Showrooming and webrooming: Information externalities between online and offline sellers. Marketing Science, 37(3):469–483, 2018. [97] Garrett Johnson, Randall A Lewis, and David Reiley. Location, location, location: Repetition and proximity increase advertising effectiveness. Working Paper, 2016. [98] Garrett A Johnson, Randall A Lewis, and Elmar I Nubbemeyer. Ghost Ads: Improving the economics of measuring online Ad effectiveness. Journal of Marketing Research, 54(6):867–884, 2017. [99] Garrett A Johnson, Randall A Lewis, and David H Reiley. When less is more: Data and power in advertising experiments. Marketing Science, 36(1):43–53, 2016. 126 [100] Nathan Kallus. Generalized optimal matching methods for causal inference. arXiv preprint arXiv:1612.08321, 2016. [101] Nathan Kallus. Recursive partitioning for personalization using observational data. Proceedings of the 34th International Conference on Machine Learning (ICML), 70:1789–1798, 2017. [102] Wagner A Kamakura. Cross-selling: Offering the right product to the right customer at the right time. Journal of Relationship Marketing, 6(3-4):41–58, 2008. [103] Wagner A Kamakura, Bruce S Kossar, and Michel Wedel. Identifying innovators for the cross-selling of new products. Management Science, 50(8):1120–1133, 2004. [104] Daniel Kuhn, Wolfram Wiesemann, and Angelos Georghiou. Primal and dual linear decision rules in stochastic and robust optimization. Mathematical Programming, 130(1):177–209, 2011. [105] Dmitri Kuksov and Chenxi Liao. When showrooming increases retailer profit. Journal of Marketing Research, 55(4):459–473, 2018. [106] Anuj Kumar, Amit Mehra, and Subodha Kumar. Why do stores drive online sales? Evidence of underlying mechanisms from a multichannel retailer. Information Systems Research, 30(1):319–338, 2019. [107] Henry Lam. Robust sensitivity analysis for stochastic systems. Mathematics of Operations Research, 41(4):1248–1275, 2016. [108] Gene Moo Lee, Shu He, Joowon Lee, and Andrew B Whinston. Matching mobile applications for cross promotion. Available at SSRN 2893338, 2018. [109] Hau L Lee and Seungjin Whang. Winning the last mile of e-commerce. MIT Sloan Management Review, 42(4):54, 2001. [110] Keon-Hyung Lee and Laura Davenport. Can case management interventions reduce the number of emergency department visits by frequent users? The Health Care Manager, 25(2):155–159, 2006. [111] Yikuan Lee and Gina Colarelli O’Connor. New product launch strategy for network effects products. Journal of the Academy of Marketing Science, 31(3):241, 2003. [112] Young-Jin Lee and Yong Tan. 
Effects of different types of free trials and ratings in sampling of consumer software: An empirical study. Journal of Management Information Systems, 30(3):213– 246, 2013. [113] Randall A Lewis and Justin M Rao. The unfavorable economics of measuring the returns to advertis- ing. The Quarterly Journal of Economics, 130(4):1941–1973, 2015. [114] Randall A. Lewis and David H. Reiley. Online ads and offline sales: Measuring the effect of retail ad- vertising via a controlled experiment on yahoo! Quantitative Marketing and Economics, 12(3):235– 266, Sep 2014. [115] Shibo Li, Baohong Sun, and Alan L Montgomery. Cross-selling the right product to the right customer at the right time. Journal of Marketing Research, 48(4):683–700, 2011. [116] Shibo Li, Baohong Sun, and Ronald T Wilcox. Cross-selling sequentially ordered products: An application to consumer banking services. Journal of Marketing Research, 42(2):233–239, 2005. 127 [117] Zhijie Lin, Ying Zhang, and Yong Tan. An empirical study of free product sampling and rating bias. Information Systems Research, 30(1):260–275, 2019. [118] Jun Liu, Vibhanshu Abhishek, and Beibei Li. The impact of mobile adoption on customer omni- channel banking behavior. Working Paper, 2016. [119] Sheng Liu, Long He, and Zuo-Jun Max Shen. Data-driven order assignment for last mile delivery. Working Paper, 2018. [120] Zekun Liu, Dennis Zhang, and Fuqiang Zhang. Information sharing on retail platforms. Available at SSRN 3258109, 2018. [121] Pasquale Lops, Marco De Gemmis, and Giovanni Semeraro. Content-based recommender systems: State of the art and trends. In Recommender systems handbook, pages 73–105. Springer, 2011. [122] Xueming Luo, Michelle Andrews, Zheng Fang, and Chee Wei Phang. Mobile targeting. Management Science, 60(7):1738–1756, 2013. [123] Puneet Manchanda, Grant Packard, and Adithya Pattabhiramaiah. Social dollars: The economic impact of customer participation in a firm-sponsored online customer community. Marketing Science, 34(3):367–387, 2015. [124] Lawrence J Marks and Michael A Kamins. The use of product sampling and advertising: Effects of sequence of exposure and degree of advertising claim exaggeration on consumers’ belief strength, belief confidence, and attitudes. Journal of Marketing Research, 25(3):266–281, 1988. [125] Edward J McGavin, Leroy B Schwarz, and James E Ward. Two-interval inventory-allocation policies in a one-warehouse n-identical-retailer distribution system. Management Science, 39(9):1092–1107, 1993. [126] Amit Mehra, Subodha Kumar, and Jagmohan S Raju. Competitive strategies for brick-and-mortar stores to counter showrooming. Management Science, 64(7):3076–3090, 2017. [127] Barry Nalebuff. Bundling as an entry barrier. The Quarterly Journal of Economics, 119(1):159–187, 2004. [128] Barrie R. Nault and Mohammad S. Rahman. Proximity to a traditional physical store: The effects of mitigating online disutility costs. Production and Operations Management, 28(4):1033–1051, 2019. [129] Diana M Negoescu, Kostas Bimpikis, Margaret L Brandeau, and Dan A Iancu. Dynamic learning of patient response types: An application to treating chronic diseases. Management Science, 2017. [130] Marius F Niculescu and Dong Jun Wu. Economics of free under perpetual licensing: Implications for the software industry. Information Systems Research, 25(1):173–199, 2014. [131] Shijia Ouyang. Alibaba’s cainiao to create smart logistics network. China Daily, 05 2018. [132] Eric Overby and Chris Forman. 
The effect of electronic commerce on geographic purchasing patterns and price dispersion. Management Science, 61(2):431–453, 2015. [133] Eric Overby and Sandy Jap. Electronic and physical market channels: A multiyear investigation in a market for products of uncertain quality. Management Science, 55(6):940–957, 2009. [134] Leandro Pardo. Statistical inference based on divergence measures. Chapman and Hall/CRC, 2005. 128 [135] Koen Pauwels, Peter SH Leeflang, Marije L Teerling, and KR Eelko Huizingh. Does online informa- tion drive offline revenues? Only for specific products and consumer segments! Journal of Retailing, 87(1):1–17, 2011. [136] Georgina Ann Phillips, David S Brophy, Tracey J Weiland, Antony J Chenhall, and Andrew W Dent. The effect of multidisciplinary case management on selected outcomes for frequent attenders at an emergency department. Medical Journal of Australia, 184(12):602, 2006. [137] Wei Qi, Lefei Li, Sheng Liu, and Zuo-Jun Max Shen. Shared mobility for last-mile delivery: Design, operational prescriptions, and environmental impact. Manufacturing & Service Operations Manage- ment, 20(4):737–751, 2018. [138] Wei Qi and Zuo-Jun Max Shen. A smart-city scope of operations management. Production and Operations Management, 28(2):393–406, 2019. [139] Paul Resnick and Hal R Varian. Recommender systems. Communications of the ACM, 40(3):56–59, 1997. [140] Michael Rothschild and Joseph Stiglitz. Equilibrium in competitive insurance markets: An essay on the economics of imperfect information. In Peter Diamond and Michael Rothschild, editors, Uncertainty in Economics, pages 257 – 280. Academic Press, 1978. [141] Jayarajan Samuel, Zhiqiang (Eric) Zheng, and Ying Xie. Value of local showrooms to online com- petitors. MIS Quarterly, Forthcoming, 2019. [142] J Ben Schafer, Dan Frankowski, Jon Herlocker, and Shilad Sen. Collaborative filtering recommender systems. In The adaptive web, pages 291–324. Springer, 2007. [143] Robert Seamans and Feng Zhu. Responses to entry in multi-sided markets: The impact of craigslist on local newspapers. Management Science, 60(2):476–493, 2013. [144] Reema Shah, Charlene Chen, Sheryl O’Rourke, Martin Lee, Sarita A Mohanty, and Jennifer Abra- ham. Evaluation of care management for the uninsured. Medical Care, 49(2):166–171, 2011. [145] Martha Shumway, Alicia Boccellari, Kathy O’Brien, and Robert L Okin. Cost-effectiveness of clini- cal case management for ED frequent users: Results of a randomized trial. The American Journal of Emergency Medicine, 26(2):155–164, 2008. [146] Jennifer Smith. E-commerce driving bigger demand for smaller warehouses, cbre says, 2020. Ac- cessed: 2020-03-02. [147] Dilip Soman and John T Gourville. Transaction decoupling: How price bundling affects the decision to consume. Journal of marketing research, 38(1):30–44, 2001. [148] Gonca Soysal, Alejandro Zentner, and Zhiqiang (Eric) Zheng. Physical stores in the digital age: How store closures affect consumer churn. Production and Operations Management, 28(11):2778–2791, 2019. [149] Joseph E. Stiglitz. The theory of ”screening,” education, and the distribution of income. The American Economic Review, 65(3):283–300, 1975. [150] Stefan Stremersch and Gerard J Tellis. Strategic bundling of products and prices: A new synthesis for marketing. Journal of marketing, 66(1):55–72, 2002. 129 [151] Elizabeth A Stuart, Stephen R Cole, Catherine P Bradshaw, and Philip J Leaf. The use of propensity scores to assess the generalizability of results from randomized trials. 
Journal of the Royal Statistical Society: Series A (Statistics in Society), 174(2):369–386, 2011. [152] Hemang Subramanian and Eric Overby. Electronic commerce, spatial arbitrage, and market effi- ciency. Information Systems Research, 28(1):97–116, 2017. [153] Jiankun Sun, Dennis Zhang, Haoyuan Hu, and Jan A Van Mieghem. Predicting human discretion to adjust algorithmic prescription: A large-scale field experiment in warehouse operations. Available at SSRN 3355114, 2019. [154] Tianshu Sun, Susan Feng Lu, and Ginger Zhe Jin. Solving shortage in a priceless market: Insights from blood donation. Journal of Health Economics, 48:149–165, 2016. [155] Tianshu Sun, Lanfei Shi, Siva Viswanathan, and Elena Zheleva. Motivating effective mobile app adoptions: Evidence from a large-scale randomized field experiment. Information Systems Research, 30(2):523–539, 2019. [156] Qian Tang, Mei Lin, and Youngsoo Kim. Showrooming vs. competing: The role of product assort- ment and price. Working Paper, 2016. [157] S Viswanathan and Kamlesh Mathur. Integrating routing and inventory decisions in one-warehouse multiretailer multiproduct distribution systems. Management Science, 43(3):294–312, 1997. [158] V Mart´ ınez Vizca´ ıno, F Salcedo Aguilar, R Franquelo Guti´ errez, M Solera Mart´ ınez, M S´ anchez L´ opez, S Serrano Mart´ ınez, E L´ opez Garc´ ıa, and F Rodr´ ıguez Artalejo. Assessment of an after-school physical activity program to prevent obesity among 9 to 10-year-old children: A cluster randomized trial. International Journal of Obesity, 32(1):12, 2008. [159] Chong (Alex) Wang and Xiaoquan (Michael) Zhang. Sampling of information goods. Decision Support Systems, 48(1):14 – 22, 2009. [160] Kitty Wang and Avi Goldfarb. Can offline stores drive online sales? Journal of Marketing Research, 54(5):706–719, 2017. [161] Matthias Winkenbach and Milena Janjevic. Classification of last-mile delivery models for e- commerce distribution: A global perspective. City Logistics 1: New Opportunities and Challenges, pages 209–229, 2018. [162] Lynn Wu and Erik Brynjolfsson. The future of prediction: How google searches foreshadow housing prices and sales. In Economic Analysis of the Digital Economy, pages 89–118. University of Chicago Press, April 2015. [163] Huan Xu, Constantine Caramanis, and Shie Mannor. Robustness and regularization of support vector machines. Journal of Machine Learning Research, 10(Jul):1485–1510, 2009. [164] Jiao Xu, Chris Forman, Jun B Kim, and Koert Van Ittersum. News media channels: Complements or substitutes? Evidence from mobile phone usage. Journal of Marketing, 78(4):97–112, 2014. [165] Kaiquan Xu, Jason Chan, Anindya Ghose, and Sang Pil Han. Battle of the channels: The impact of tablets on digital commerce. Management Science, 63(5):1469–1492, 2016. 130 [166] Dennis Zhang, Hengchen Dai, Lingxiu Dong, Qian Wu, Lifan Guo, and Xiaofei Liu. The value of pop-up stores in driving online engagement in platform retailing: Evidence from a large-scale field experiment with Alibaba. Management Science, Forthcoming, 2019. [167] Yingjie Zhang, Beibei Li, and Krishnan Ramayya. Learning individual behavior using sensor data: The case of GPS traces and taxi drivers. Working Paper, 2018. [168] Yuchi Zhang, Xueming Luo, and Fue Zeng. Omnichannel promotion effectiveness. 2016. [169] Zhe Zhang and Beibei Li. A quasi-experimental estimate of the impact of P2P transportation plat- forms on urban consumer patterns. 
In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1683–1692. ACM, 2017.
[170] Yan Zhao, Xiao Fang, and David Simchi-Levi. A practically competitive and provably consistent algorithm for uplift modeling. In Data Mining (ICDM), 2017 IEEE International Conference on, pages 1171–1176. IEEE, 2017.
[171] Jianzhe Zhen, Frans JCT de Ruiter, and Dick den Hertog. Robust optimization for models with uncertain SOC and SDP constraints. Optimization Online PrePrint, 2017.
[172] Jianzhe Zhen, Dick den Hertog, and M Sim. Adjustable robust optimization via Fourier-Motzkin elimination. Operations Research, 2017.
[173] Mi Zhou, Vibhanshu Abhishek, Edward Kennedy, Kannan Srinivasan, and Ritwik Sinha. Linking clicks to bricks: Spillover benefits of online advertising. Working Paper, 2018.
[174] Stephen Zuckerman, Aimee F Williams, and Karen E Stockley. Trends in medicaid physician fees, 2003–2008. Health Affairs, 28(3):w510–w519, 2009.

Appendix A
Supporting Materials for Chapter 2

A.1 Free Sample Distribution Inside an Alibaba Station

Figure A.1: Free Sample Distribution Inside an Alibaba Station
Notes: The top-left and top-right pictures show the inside of a station. The bottom-left picture presents a shelf with displayed free samples. In the bottom-right picture, a customer claims a free sample while picking up his packages.

A.2 Two-Stage Matching from Section 3

In this section, we describe our matching procedure in detail and provide evidence of its effectiveness. To account for the imbalance between the treatment and control groups along multiple dimensions, we first define three sets of main matching variables in Table A.1 to capture the direct impact on the focal brands, the online spillover at the category level, and the offline spillover in terms of utilization of station services, respectively. The direct impact is the same as our main outcome variables. To quantify the online spillover, we compute customers' total spending in the same product category. To quantify the offline spillover, we measure customers' usage of logistics services at the station (i.e., self-pickup and parcel shipping) based on the number of packages. Similar to our definition of outcome variables, we use both binary and numeric variables to characterize each measure. For instance, "Self-pickup or not" is a binary variable indicating whether a customer has picked up any packages at the station, whereas "# of self-pickup" corresponds to the number of packages picked up.

Table A.1: Main Matching Variables
Direct Impact (Main Outcomes): Brand purchase or not; Brand spending; Brand view or not; # of brand item-views
Online Spillover: Category purchase or not; Category spending
Offline Spillover: Self-pickup or not; # of self-pickup; Parcel shipping or not; # of parcel shipping

In the first-stage exact matching procedure, we construct a sequence of binary matching variables (for each binary variable defined in Table A.1) over the weeks before and including the treatment week, i.e., the week of the sample claim, and identify a group of potential matches who have exactly the same historical patterns as each sample claimer. To confirm the effectiveness of the exact matching, we plot the trends in Figure A.2 for the treatment, unmatched control, and matched control groups, respectively. We see that for each of the five binary variables, the treatment group has the identical historical pattern as the matched control group, on average. In fact, this holds for each matched cluster of treated and control customers.
[Figure A.2: five-panel plot of the weekly probabilities of (a) brand purchase, (b) brand view, (c) category purchase, (d) self-pickup, and (e) parcel shipping, from five weeks before to five weeks after the sample claim week, for the treatment, matched control, and unmatched control groups.]
Figure A.2: Matching Variable Trends Before and After Sample Claims by Group
Notes: We implement exact matching on the above variables over five weeks before and including the sample claim week (six weeks in total).

In the second-stage propensity score matching procedure, we estimate the propensity scores based on a logistic regression that includes the following covariates: city, gender, age segment (≤25, 26-30, 31-40, >40), total spending at the focal brand three months before treatment, total spending at the category and platform level six weeks before treatment, and the number of self-pickup and parcel shipping packages six weeks before treatment. Then, from all the potential controls that survived the first-stage exact matching, we find the closest one-to-one match based on the predicted likelihood of being treated, i.e., the propensity score. Finally, we plot the density of the propensity scores by group in Figure A.3. The density of the propensity scores for the treatment group is perfectly matched with that of the matched control group, which assures the validity of the second-stage matching.

[Figure A.3: density of the propensity score (0 to 1) for the treatment, matched control, and unmatched control groups.]
Figure A.3: Density of Propensity Score by Group
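To make the two-stage procedure concrete, the following is a minimal sketch in Python. It is illustrative only: the column names, the synthetic data, the logistic-regression specification, and the greedy without-replacement nearest-neighbor rule are assumptions for exposition, not the exact implementation used in this dissertation. In our setting, the exact-match cells would be defined by the full six-week binary histories of the variables in Table A.1, and the propensity model would use the covariates listed above.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def two_stage_match(df, treat_col, exact_cols, ps_cols):
    """Stage 1: exact matching on binary history variables; stage 2: one-to-one
    nearest-neighbor matching on the propensity score within each exact-match cell."""
    # Propensity scores from a logistic regression of treatment on pre-treatment covariates.
    model = LogisticRegression(max_iter=1000)
    model.fit(df[ps_cols], df[treat_col])
    df = df.assign(pscore=model.predict_proba(df[ps_cols])[:, 1])

    pairs = []
    # Stage 1: only units sharing the exact same pre-treatment binary history are comparable.
    for _, cell in df.groupby(exact_cols):
        treated = cell[cell[treat_col] == 1]
        controls = cell[cell[treat_col] == 0].copy()
        for idx, row in treated.iterrows():
            if controls.empty:
                break
            # Stage 2: greedy nearest neighbor on the propensity score, without replacement.
            j = (controls["pscore"] - row["pscore"]).abs().idxmin()
            pairs.append((idx, j))
            controls = controls.drop(index=j)
    return pairs

# Tiny synthetic example (columns and values are illustrative only).
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),
    "bought_brand_w0": rng.integers(0, 2, n),   # example binary history variable
    "picked_up_w0": rng.integers(0, 2, n),
    "age": rng.integers(18, 60, n),
    "spend_3m": rng.gamma(2.0, 50.0, n),
})
print(len(two_stage_match(df, "treated", ["bought_brand_w0", "picked_up_w0"], ["age", "spend_3m"])))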
A.3 Robustness Check and Long-Term Dynamics from Section 3

Besides our main model, we employ an alternative specification for two purposes. First, the specification allows us to directly test whether the main effects appear before the treatment (sample claim) week, as a key robustness check. Second, we leverage the new specification to inspect the long-term dynamics of the causal effect of organic sample claims. For each customer, we include data for exactly five weeks before and ten weeks after the sample claim week (16 weeks in total) and construct a new panel. The new panel allows us to estimate the leads/lags model

Outcome_it = α + Σ_{τ=-5}^{-1} β_τ · SampleClaimer_i × Lag_{t+τ} + Σ_{τ=1}^{4} β_τ · SampleClaimer_i × Lead_{t+τ} + β₅ · SampleClaimer_i × After_{i,t+4} + v_i + u_t + ε_it,   (A.1)

where Lead and Lag are the corresponding leads and lags of the time dummy variables indicating week t. This specification effectively breaks down the DID estimates by week. The coefficients β₋₅, ..., β₋₁ measure the difference in a particular outcome between the control and treatment groups for each week prior to the sample claim week. (The sample claim week serves as the baseline week.) If the parallel trend assumption holds, we would not expect β₋₅, ..., β₋₁ to be statistically significantly different from zero [12]. Moreover, this specification also estimates the long-term effects over ten weeks. In particular, we are also interested in the coefficients β₁, ..., β₄ and β₅. The coefficients β₁, ..., β₄ measure the short-term effects from one week to four weeks after the sample claim week, respectively. The coefficient β₅ measures the long-term effect after the fourth week, i.e., from the fifth week to the tenth week after the sample claim week. By examining the pattern of these coefficients, we can explore the long-term dynamics of the causal effect.

We present the estimation results in Table A.2. Recall that, by design of our matching procedure, the pre-treatment trends in the binary outcomes must be exactly the same. We further plot the estimates for the numeric outcomes in Figure A.4. All the estimated lag coefficients are close to zero, which further assures that the treatment and control groups have parallel trends before the treatment. As such, the parallel trend assumption is justified in our case, which supports the validity of our approach. Moreover, the graph presents the dynamics of the causal effects in a revealing way. Specifically, focal brand purchase (brand spending) increases only in the short term, during the first three weeks after the sample claim, but stays relatively flat in the long run. Furthermore, the effect of the organic interaction on the number of item-views remains significant over the long run.

[Figure A.4: estimated weekly coefficients from Equation (A.1), with panels for (a) log brand spending and (b) log # of brand item-views, across Lag 5 through Lag 1, Lead 1 through Lead 4, and After 4.]
Figure A.4: Causal Effect of Organic Interaction by Week Relative to Sample Claim Week: Leads/Lags Model
Notes: We plot the estimated coefficients based on Equation (A.1). The error bars represent ±1.96 times the standard error of each point estimate. The detailed estimates are also presented in Table A.2.
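As a rough illustration of how a leads/lags specification like Equation (A.1) can be estimated, the sketch below builds event-time interaction dummies on a small synthetic panel and fits a two-way fixed-effects regression. Everything here is assumed for exposition (synthetic data, variable names, and clustering by customer rather than by matched pair); it is not the dissertation's production code.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Tiny synthetic panel: 200 customers x 16 weeks (event weeks -5..10; week 0 = claim week).
rng = np.random.default_rng(0)
n_cust, weeks = 200, np.arange(-5, 11)
df = pd.DataFrame({
    "cust": np.repeat(np.arange(n_cust), len(weeks)),
    "week": np.tile(weeks, n_cust),
    "claimer": np.repeat(rng.integers(0, 2, n_cust), len(weeks)),
})
df["y"] = 0.1 * df["claimer"] * (df["week"] > 0) + rng.normal(0, 1, len(df))

# Event-time dummies interacted with the claimer indicator; week 0 is the omitted baseline,
# and all post-claim weeks beyond the fourth are pooled into a single "after" dummy.
for tau in range(-5, 5):
    if tau == 0:
        continue
    name = f"lag_{abs(tau)}" if tau < 0 else f"lead_{tau}"
    df[name] = df["claimer"] * (df["week"] == tau)
df["after_4"] = df["claimer"] * (df["week"] > 4)

terms = [c for c in df.columns if c.startswith(("lag_", "lead_", "after_"))]
formula = "y ~ " + " + ".join(terms) + " + C(cust) + C(week)"
fit = smf.ols(formula, data=df).fit(cov_type="cluster", cov_kwds={"groups": df["cust"]})
print(fit.params[terms])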
Table A.2: Estimated Impact of Organic Sample Claim: Leads/Lags Model
Columns: Brand purchase or not | Brand spending (Log) | Brand view or not | # of brand item-views (Log)
SampleClaimer_i × Lag_t-5: -0.0000 | -0.0000 | -0.0000 | 0.0001   (SE: . | 0.0000 | . | 0.0000)
SampleClaimer_i × Lag_t-4: -0.0000 | -0.0000 | -0.0000 | 0.0001   (SE: . | 0.0000 | . | 0.0000)
SampleClaimer_i × Lag_t-3: -0.0000 | -0.0000 | -0.0000 | 0.0001   (SE: . | 0.0000 | . | 0.0000)
SampleClaimer_i × Lag_t-2: -0.0000 | -0.0000 | -0.0000 | 0.0001   (SE: . | 0.0000 | . | 0.0000)
SampleClaimer_i × Lag_t-1: -0.0000 | -0.0000 | -0.0000 | 0.0001   (SE: . | 0.0000 | . | 0.0000)
SampleClaimer_i × Lead_t+1: 0.0001 | 0.0006 | 0.0039 | 0.0040   (SE: 0.0001 | 0.0002 | 0.0002 | 0.0003)
SampleClaimer_i × Lead_t+2: 0.0002 | 0.0009 | 0.0040 | 0.0041   (SE: 0.0001 | 0.0003 | 0.0002 | 0.0003)
SampleClaimer_i × Lead_t+3: 0.0001 | 0.0008 | 0.0035 | 0.0033   (SE: 0.0001 | 0.0003 | 0.0003 | 0.0003)
SampleClaimer_i × Lead_t+4: 0.0001 | 0.0003 | 0.0029 | 0.0029   (SE: 0.0001 | 0.0003 | 0.0002 | 0.0003)
SampleClaimer_i × After_t+4: 0.0000 | 0.0002 | 0.0017 | 0.0017   (SE: 0.0000 | 0.0001 | 0.0001 | 0.0001)
Customer FE: Yes | Yes | Yes | Yes
Week FE: Yes | Yes | Yes | Yes
N: 4,454,528 | 4,454,528 | 4,454,528 | 4,454,528
Robust standard errors clustered at each matched pair in parentheses. * p < 0.1, ** p < 0.05, *** p < 0.01.

A.4 Long-Term Effect of Induced Sample Claim from Section 4

Table A.3: Causal Effect of Induced Sample Claim (with 10-Week Post-Treatment Window)
Columns: Brand purchase or not | Brand spending (log) | Brand view or not | # of brand item-views (log)
\widehat{SampleClaimer}_i: 0.007 | 0.114 | 0.228 | 0.133   (SE: 0.043 | 0.219 | 0.161 | 0.260)
N: 189,019 | 189,019 | 189,019 | 189,019
Robust standard errors in parentheses. * p < 0.1, ** p < 0.05, *** p < 0.01.

A.5 Additional Analysis on Offline Spillover of Organic Sample Claim Following Section 3

In addition to the main effect on the focal brand, we now test whether an organic sample claim has an impact on customers' usage of the last-mile stations. Recall that station usage is one of the dimensions considered in our matching procedure, so our matched sample has balanced pre-treatment trends for both self-pickup and parcel shipping (Figure A.2). In Table A.4, we estimate our main DID specification, Equation (3.1), with the four matching variables defined in Table A.1 that capture the offline spillover. Interestingly, organic sample claims significantly increase the usage of the last-mile stations in terms of both self-pickup and parcel shipping. The weekly numbers of packages picked up at and shipped from the stations increase by 2.39% (p < 0.01) and 0.6% (p < 0.01), respectively (Table A.4). This economically significant offline spillover exemplifies the synergy of combining commercialization and logistics. It shows that connecting merchants and customers at the offline logistics infrastructure can increase customer awareness of the logistics services and their subsequent usage. In this way, the platform could further improve its logistics efficiency, reduce its logistics costs, and also increase the revenue of the infrastructure.
Table A.4: Offline Spillover of Organic Sample Claim Self-pickup or not # of self-pickup Parcel shipping or not # of parcel shipping (Log) (Log) SampleClaimer i After it 0.0238 0.0239 0.0079 0.0063 (0.0009) (0.0009) (0.0003) (0.0003) After it -0.1343 -0.1216 0.0016 0.0012 (0.0012) (0.0012) (0.0004) (0.0003) Customer FE Yes Yes Yes Yes Week FE Yes Yes Yes Yes N 6,681,792 6,681,792 6,681,792 6,681,792 Robust standard errors clustered at each matched pair in parentheses p< 0:1, p< 0:05, p< 0:01 140 Appendix B Supporting Materials for Chapter 3 B.1 Robustness Check for Section 3.4 141 Table B.1: Causal Impact of Cross-sampling on the Sampling Brand (with Control Variables) Spending Item-views Spending Item-views Spending Item-views (1-4 weeks) ( 5-8 weeks) ( 9-12 weeks) Cross Sampling 0.220 0.116 0.084 0.031 0.053 0.021 (0.045) (0.006) (0.035) (0.004) (0.035) (0.004) Female 0.029 0.004 0.020 0.013 0.011 0.005 (0.059) (0.006) (0.047) (0.004) (0.063) (0.004) Age (26 - 30) 0.082 0.038 0.064 0.022 0.027 0.010 (0.070) (0.006) (0.041) (0.005) (0.062) (0.005) Age (31 - 35) 0.098 0.047 0.003 0.027 0.072 0.015 (0.066) (0.007) (0.074) (0.005) (0.069) (0.005) Age (> 35) 0.058 0.021 0.043 0.013 0.017 0.004 (0.063) (0.008) (0.045) (0.005) (0.067) (0.005) Total order 0.0002 0.00001 0.00004 0.00000 0.0002 0.00000 (0.0003) (0.00001) (0.0001) (0.00000) (0.001) (0.00001) Total spending 0.00000 0.00000 0.00000 0.000 0.00000 0.00000 (0.00000) (0.00000) (0.00000) (0.00000) (0.00001) (0.00000) Purchased both brands 0.196 0.039 0.641 0.002 1.040 0.040 (0.296) (0.033) (0.332) (0.026) (0.471) (0.029) Purchased both categories 0.047 0.005 0.037 0.007 0.028 0.004 (0.055) (0.007) (0.040) (0.005) (0.046) (0.005) Sampling brand existing customer 0.239 0.012 0.026 0.049 0.212 0.056 (0.180) (0.019) (0.147) (0.018) (0.473) (0.016) Sampling brand three-month spending 0.007 0.001 0.009 0.001 0.039 0.0004 (0.006) (0.0004) (0.005) (0.0004) (0.028) (0.0004) Viewed sampling brand within three months 0.556 0.171 0.239 0.119 0.063 0.104 (0.113) (0.016) (0.082) (0.012) (0.165) (0.015) Distributing brand existing customer 0.011 0.014 0.039 0.008 0.031 0.00001 (0.054) (0.007) (0.050) (0.006) (0.040) (0.005) Distributing brand three-month spending 0.001 0.0001 0.0002 0.00001 0.001 0.0001 (0.001) (0.0001) (0.001) (0.0001) (0.001) (0.00004) Viewed distributing brand within three months 0.041 0.032 0.022 0.008 0.035 0.011 (0.051) (0.007) (0.037) (0.004) (0.039) (0.005) Sampling category three-month spending 0.0002 0.00001 0.00003 0.00000 0.00003 0.00000 (0.0002) (0.00001) (0.00003) (0.00000) (0.0001) (0.00000) Sampling category three-month item-views 0.001 0.0005 0.0003 0.0001 0.0002 0.00005 (0.001) (0.0001) (0.0003) (0.0001) (0.0002) (0.0001) Distributing category three-month spending 0.0001 0.00003 0.00001 0.00002 0.0002 0.00002 (0.0002) (0.00001) (0.0001) (0.00001) (0.0002) (0.00001) Distributing category three-month item-views 0.001 0.001 0.0001 0.001 0.0003 0.0003 (0.0004) (0.0001) (0.0003) (0.0001) (0.0004) (0.0001) Item spending 0.001 0.001 0.0003 0.0002 0.001 0.0001 (0.0003) (0.0001) (0.0003) (0.00003) (0.0003) (0.00004) Item discount rate 0.107 0.076 0.043 0.020 0.028 0.004 (0.090) (0.012) (0.081) (0.010) (0.090) (0.009) Constant 0.227 0.068 0.100 0.032 0.128 0.016 (0.082) (0.009) (0.058) (0.006) (0.076) (0.006) N 83,132 83,132 83,132 83,132 83,132 83,132 R 2 0.003 0.028 0.002 0.014 0.004 0.009 Robust standard errors in parentheses; p<0.1; p<0.05; p<0.01 Notes: We include all the variables listed in Table 3.3. 
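Table B.2 below reports a falsification (placebo) test in which pre-treatment outcomes are regressed on the cross-sampling indicator. A minimal sketch of such a check, assuming hypothetical column names, is given here; under successful randomization, the coefficient on the treatment indicator should be statistically indistinguishable from zero.

```python
# Minimal placebo-test sketch (not the dissertation's code); column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("cross_sampling_experiment.csv")   # one row per customer

# Pre-treatment outcomes should not "respond" to the (later) treatment assignment.
pre_outcomes = ["spending_pre_1_4", "itemviews_pre_1_4",
                "spending_pre_5_8", "itemviews_pre_5_8"]
for outcome in pre_outcomes:
    fit = smf.ols(f"{outcome} ~ cross_sampling", data=df).fit(cov_type="HC1")
    print(outcome,
          "coef =", round(fit.params["cross_sampling"], 4),
          "p =", round(fit.pvalues["cross_sampling"], 3))
```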
Table B.2: Causal Impact of Cross-sampling on the Sampling Brand (Falsification Test)

                  Spending ((-1)-(-4) weeks)   Item-views ((-1)-(-4) weeks)   Spending ((-5)-(-8) weeks)   Item-views ((-5)-(-8) weeks)
Cross Sampling    0.027 (0.026)                0.001 (0.003)                  0.002 (0.021)                0.001 (0.003)
Constant          0.141 (0.020)                0.038 (0.002)                  0.090 (0.015)                0.037 (0.002)
N                 83,132                       83,132                         83,132                       83,132
R2                0.00001                      0.00000                        0.00000                      0.00000
Robust standard errors in parentheses; * p<0.1; ** p<0.05; *** p<0.01.

B.2 An Example of Affinity Scores

Table B.3: Summary of Both Affinity Scores for Consumption 1 as the Distributing Brand

User-based affinity score by subcategory: Washing Set 93.7; Perfume 68.3; Body Lotion/Cream 67.0; Hand Soap 63.9; Body Care Set 61.9; Hand Cream 58.1; Body Wash 56.2; Car Perfume 54.4; Hair Conditioner 47.8; Shampoo 47.1; Aromatherapy 38.8.

Item-based affinity score by subcategory: Body Lotion/Cream 100.0; Body Care Set 94.9; Perfume 94.3; Hand Cream 93.1; Shampoo 86.6; Body Wash 79.8; Washing Set 64.8; Hair Conditioner 63.4; Car Perfume 59.0; Aromatherapy 24.2; Hand Soap 0.0.

Notes: Recall that "Consumption 1" refers to the brand from the daily consumption product category in the first round of our experiment. Consumption 1, as the distributing brand, is distributing a free sample provided by the cosmetic brand "Cosmetic 1".

B.3 Survey Evidence Combined with Affinity Scores

In this section, we seek to understand, from the customers' perspective, why our proposed scoring system is effective. We carried out an online survey of all 18,847 sample receivers in the first and last rounds of cross-sampling. We then link the affinity scores to the survey respondents, which leaves us with 141 customers who have complete answers (0.7% response rate). We summarize the average scores by each question in Table B.4. Despite the comparatively small sample size, we find some significant differences across customers with different answers. Specifically, based on Question 1, customers who were happy to receive the free samples have significantly higher item-based affinity scores (75.38) than those who were not (65.91, p<0.05). Based on Question 2, customers who have used the free samples have significantly higher item-based affinity scores (76.67) than those who have not (65.67, p<0.05). These two questions ask about customers' engagement with the campaign. Moreover, Questions 2a and 2b ask about customers' experience with the particular brand. For example, we find that whether customers who used the free samples intend to purchase is significantly correlated with the user-based affinity score. Finally, we ask about customers' overall satisfaction with our proposed cross-sampling campaign. Customers who rated it 5/5 have significantly larger item-based affinity scores. Overall, our survey indicates that the item information is correlated with customers' satisfaction with cross-sampling, and the user information is correlated with customers' experience with, or usage of, the physical products.

Table B.4: Comparing Affinity Scores by Each Survey Answer User-Based Affinity Score Item-Based Affinity Score N No Yes p-Value No Yes p-Value 1. Happy to receive free samples 141 (Yes 80) 68.30 69.26 0.87 65.91 75.38 0.03 2.
Whether used the free samples 141 (Yes 72) 70.61 67.14 0.54 65.67 76.67 0.01 2a.Did not use due to unknown brand 69 (Yes 17) 75.30 56.28 0.06 64.80 68.34 0.65 2a.Intend to purchase after usage 72 (Yes 21) 61.32 81.28 0.01 74.47 79.58 0.51 3.Overall, like this form of campaign (score 5/5) 141 (Yes, 52) 67.61 70.95 0.56 67.75 77.34 0.03 p< 0:1, p< 0:05, p< 0:01 146 Appendix C Supporting Materials for Chapter 4 C.1 Proofs Proof. Proof of Theorem 4.3.1. We require the following identity: v +w v +z v +w v +z ; whenevervv; zw: (C.1) To prove the identity, differentiate the left-hand side byv, yielding zw (v +z) 2 0; sincez w by assumption. Thus, decreasingv tov never increases the ratio in (C.1), which proves the identity. Recall thatB denotes the full-information optimal solution to Problem (4.1). We assume thatjB j =K without loss of generality. Since c = ^ c > 0, we haver c c 0 for allc2f1;:::;Kg. IfjB j<K, we can always target additional patients inf1;:::;Kg without decreasing the objective. For notational convenience, let a c r c ^ c and b c c ^ c for all c2f1;:::;Cg. Then, r ^ -scoring is equivalent toa-scoring, and the objective of Problem (4.1) can be written as P C c=1 r c c = P C c=1 a c b c . The relative performance of thea-scoring rule is P c:1cK a c b c P c:c2B a c b c = P c:1cK;c2B a c b c + P c:1cK;c= 2B a c b c P c:1cK;c2B a c b c + P c:cK+1;c2B a c b c : (C.2) 147 SinceB is the full-information optimal solution, X c:c2B a c b c X c:1cK a c b c , X c:cK+1;c2B a c b c X c:1cK;c= 2B a c b c : By assumption, we also haveb c for allc2f1;:::;Kg. Applying the identity in (C.1) yields P c:1cK a c b c P c:c2B a c b c P c:1cK;c2B a c + P c:1cK;c= 2B a c b c P c:1cK;c2B a c + P c:cK+1;c2B a c b c : (C.3) Then, Eq.(C.3) P c:1cK a c P c:1cK;c2B a c + P c:cK+1;c2B a c ; (C.4) sinceb c for allc2f1;:::;Kg. DefinekjB \f1;:::;Kgj to be the number of patients targeted by both methods. Then, Eq. (C.4) P K c=1 a c P k c=1 a c + P 2Kk c=K+1 a c (sincea c is in decreasing order and> 0) P K c=1 a c max 0kK f P k c=1 a c + P 2Kk c=K+1 a c g : (C.5) Consider the maximization in (C.5) and rewrite the summations: k X c=1 a c + 2Kk X c=K+1 a c = 2K X c=K+1 a c + k X c=1 [a c a 2Kc+1 ]: The first term does not depend onk. In the second term, note that forc < c 0 , we havea c a 2Kc+1 a c 0a 2Kc 0 +1 sincea c is in descending order and;> 0. Therefore, the quantitya c a 2Kc+1 is also decreasing inc. It follows that the optimalk is the largestc such that 1cK anda c a 2Kc+1 0 and is 0 if this quantity is always non-positive. This proves the inequality (4.2). 148 We next give an example in which the bound is achieved. Take c = 8 > > > > < > > > > : ^ c ; if 1cK ^ c ; otherwise, , b c = 8 > > > > < > > > > : ; if 1cK ; otherwise. For these values, we will confirm that the true relative performance of a-scoring is given by (4.2). We first show thatB =f1;:::;k g[fK + 1;:::; 2Kk g, wherek is defined in the theorem. To see this, note that since a c is non-increasing and b c is constant on the scale 1 c K, B must be of the formf1;:::;kg[fK + 1; 2Kkg for some 0 k K, and the full-information objective value is max 0kK f P k c=1 a c + P 2Kk c=K+1 a c g. As proven previously,k optimizes this objective so thatB has the required form. Substituting in the definition ofa c ,b c andB into Eq. (C.2) completes the proof. u Proof. Proof of Corollary 4.3.1.1. For the first part, take the derivative of !(=) with respect to = to obtain ! 
0 (=) = P K c=1 r c ^ c P k c=1 r c ^ c ((=) P k c=1 r c ^ c + P 2Kk c=K+1 r c ^ c ) 2 0; where the inequality follows becauser c ^ c 0 for allc2f1;:::;Cg by assumption. For the second part, we apply the definition ofk in Theorem 4.3.1. When= r K+1 ^ K+1 =r K ^ K , k =K and!(=) = 1. Finally, when (=) is sufficiently small, (=) r 2K ^ 2K =r 1 ^ 1 , and we have k = 0. Then, the denominator of! does not depend on=, and the numerator goes to 0 as=! 0, so!(=)! 0. Thus, the proof is complete. u Proof. Proof of Theorem 4.4.1. First note that 0 is feasible and has worst-case performance 0 in (4.8). Thus, it suffices to show that for any z62Z , the worst-case objective over (4.9) is at most 0. For such a z, 149 there must exist x 0 2X such thatq z (x 0 )6=P(x ~ s = x 0 ). Consider the function (x) = I +(I(x = x 0 )P(x ~ s = x 0 )). Since ^ 0, 2U ^ . Furthermore, C X c=1 z c r c (x c ) = C X c=1 z c r c X x2X q z (x) (x c ) ! = C X c=1 z c r c I + X x2X q z (x)(I(x c = x 0 )P(x ~ s = x 0 )) ! = C X c=1 z c r c I +(q z (x 0 )P(x ~ s = x 0 )) : Now, if P C c=1 z c r c = 0, this quantity is 0. Otherwise, if P C c=1 z c r c > 0, then, by taking!1, we have that the worst-case performance over (4.9) of z is1. In either case, the worst-case performance is at most 0. This proves the theorem. u Proof. Proof of Theorem 4.4.2. We apply standard techniques (see, e.g., [23] for a review). Given any feasible z, the inner minimization of (4.8) under uncertainty set (4.11) can be rewritten as min 0 ;;((xc)) C c=1 ;(v(xc)) C c=1 C X c=1 z c r c 0 @ 0 + G X g=1 g g (x c ) +(x c ) +v(x c ) 1 A s.t. I 0 + G X g=1 g g I;kk ^ 1 k((x c )) C c=1 k res ^ 2 ;k (v(x c )) C c=1 k link ^ ; wherev(x c ) represents the difference C (x c ) S (x c ). This optimization problem decomposes into the sum of three separate minimizations: min 0; C X c=1 z c r c 0 + G X g=1 g g (x c ) ! s.t. I 0 + G X g=1 g g I;kk ^ 1 ; min ((xc)) C c=1 : C X c=1 z c r c (x c ) s.t. k((x c ))k res ^ 2 ; min (v(xc)) C c=1 : C X c=1 z c r c v(x c ) s.t. k(v(x c ))k link ^ 150 Consider the second optimization problem. By the Cauchy-Schwarz inequality, the optimal value is at least ^ 2 (z c r c ) C c=1 res . In fact, the optimal value is exactly this quantity by definition of the dual norm. An entirely analogous argument holds for the third optimization problem, which has optimal value ^ (z c r c ) C c=1 link . It remains to evaluate the first optmization problem. For ^ 1 > 0, 0 = (II)=2; = 0, is a strictly feasible solution. It follows that Slater’s condition holds, and we have strong duality. Dualizing the two linear inequalities and rearranging yields the Lagrangian dual sup 1 ; 2 0 G( 1 ; 2 ) where G( 1 ; 2 ) = 1 I 2 I + min 0 0 C X c=1 (z c r c 1 + 2 ) + min :kk ^ 1 G X g=1 C X c=1 z c r c g (x c ) ( 1 2 ) g ! g : The first minimization is finite only if 1 2 = P C c=1 z c r c . By the Cauchy-Schwarz inequality the second minimization is equal to C X c=1 z c r c g (x c ) ( 1 2 ) g ! G g=1 : Substituting these values above yields the dual problem: sup 1 ; 2 1 I 2 I + C X c=1 z c r c g (x c ) ( 1 2 ) g ! G g=1 s.t. 1 2 = C X c=1 z c r c : SinceI >I, we claim at optimality that 2 = 0. Indeed, if this were not true, then the solution ( 1 2 ; 0) is still feasible, but yields a better objective value than ( 1 ; 2 ). We conclude that at optimality, 1 = P C c=1 z c r c and 2 = 0. Substituting above, combining the three subproblems and simplifying proves the result. u Theorem C.1.1. 
Problem (4.12) is NP-Complete even ifG = 1 and ^ = ^ 2 = 0,(x) is binary-valued andkk is the` p -norm. 151 Proof. Proof of Theorem C.1.1. We reduce to the well-known NP-Complete problem Subset Sum. We first state the decision version of subset sum and the relevant special case of problem (4.12). Subset Sum : Given natural numbers a 1 ;:::;a M , and a target number T > 0, is there a subset of Nfa 1 ;:::;a M g that adds up to preciselyT ? Decision Version of Special Case of Problem (4.12): Given parameters I, ^ 1 , K and , sequences r c 0 and(x c )2f0; 1]g forc = 1;:::;C, and a target valueQ, is the objective value of max z2Z I C X c=1 z c r c ^ 1 C X c=1 z c r c ((x c )) at leastQ? We next describe the reduction.: Given an instance of Subset Sum with positive integersfa 1 ;:::;a M g and target sum valueT , for anyI > 0 , let ^ 1 > 2I, and takeC =K =M + 1; = 0:5; Q = 2TI, and r c = 8 > > > > < > > > > : a c ifc = 1;:::;M T ifc =M + 1 ; (x c ) = 8 > > > > < > > > > : 0 ifc = 1;:::;M 1 ifc =M + 1 : Let z be the solution to eq. (4.12) with these parameters. We will prove that the objective value is at least Q if and only if the answer to Subset Sum is “Yes.” Assume without loss of generality that T P m c=1 a c , else the answer to the Subset Sum problem is trivially, “No.” Thus, we have the simple bound I C X c=1 z c r c ^ 1 C X c=1 z c r c ((x c )) I C X c=1 z c r c 2TI: (C.6) 152 Now suppose that the optimal objective is at least Q. We claim that z M+1 must equal 1. Indeed, if z 0 = 0, then the objective can be rewritten as I C X c=1 z c r c ^ 1 C X c=1 z c r c ((x c )) =I M X c=1 z c r c ^ 1 M X c=1 z c r c 1 2 (since(x c ) = 0 for allc = 1;:::;M) = (I ^ 1 =2) M X c=1 z c r c 0 (since ^ > 2I): This contradicts the assumption that the optimal objective is at leastQ. Thus, if the optimal objective is at least Q, each of the inequalities in eq. (C.6) must be equalities. Furthermore, sincez M+1 = 1, it must be that P M c=1 z c r c =T , i.e, thesez encode the relevant subset and the answer to Subset Sum is “Yes.” Now suppose that the answer to Subset Sum is “Yes.” Then, consider a solution z c forc = 1;:::;M encoded by this subset andz M+1 = 1. The objective value of this solution is I C X c=1 z c r c ^ 1 C X c=1 z c r c ((x c )) = 2TI ^ 1 jT (:5) +T (1:5)j = 2TI = Q: Thus, the optimal value must be at leastQ. This completes the proof. u Proof. Proof of corollary 4.4.2.1. Substituting in the appropriate norms to problem (4.12) yields max z2Z (I ^ 2 ^ ) C X c=1 z c r c ^ 1 C X c=1 z c r c ( g (x c ) g ) ! G g=1 : IfI ^ 2 ^ < 0, then both terms of the objective are non-negative and z = 0 is an optimal solution. Else, we can factor out this term from the maximization yielding problem (4.13). u 153 Proof. Proof of Corollary 4.4.2.2. Define G+1 1 T e, where e2< G is a vector of ones. We first claim 1 = diag() 1 + 1 G+1 ee T : (C.7) Indeed, we compute directly 1 = I + 1 G+1 e T e T 1 G+1 ( T e)e T = I + 1 G+1 1 1 G+1 (1 G+1 ) e T = I and 1 = I e T + 1 G+1 e T 1 G+1 ( T e)e T = I 1 1 G+1 + 1 G+1 (1 G+1 ) e T = I; which proves the claim. Problem (4.13) is equivalent to max z2Z C X c=1 z c r c ^ I ^ 2 ^ C X c=1 z c r c (I(x c 2X g ) g ) ! G g=1 1 : (C.8) For notational convenience, let us definef g P C c=1 z c r c (I(x c 2X g ) g ) for allg = 1;:::;G. Conse- quently, f T e = C X c=1 z c r c ( G X g=1 I(x c 2X g ) G X g=1 g ) = C X c=1 z c r c ((1I(x c 2X G+1 )) (1 G+1 )) = C X c=1 z c r c ( G+1 I(x c 2X G+1 )): 154 Simplifying (C.8) yields: Eq. 
(C.8) = C X c=1 z c r c ^ I ^ 2 ^ p f T 1 f = C X c=1 z c r c ^ I ^ 2 ^ q f T diag() 1 + 1 G+1 ee T f = C X c=1 z c r c ^ I ^ 2 ^ q f T diag() 1 f + (f T e) 2 1 G+1 ; which yields (4.15). Thus, the proof is complete. u Proof. Proof of Corollary 4.4.2.3. The proof follows directly by applying corollary 4.4.2.1 and by definition of dual norm. u Proof. Proof of corollary 4.4.2.4 To prove the first inequality, note that there exists ; such that the true CATE ()E[ ~ c j x ~ c = x c ] is a member ofU ; . So we have C X c=1 r c z Rob c E[ ~ c j x ~ c = x c ] min C ()2U ; C X c=1 r c z Rob c (x c ); (C.9) and applying theorem 4.4.2 yields the result. To see the second equality, notice that for any candidate CATE C ()2U ^ ;^ , min C ()2U ^ ;^ C X c=1 r c z Rob c C (x c ) 0; where the inequality follows because z = 0 is a feasible solution to the robust problem while z Rob is an optimal solution. Continuing eq. (C.9), we have C X c=1 r c z Rob c E[ ~ c j x ~ c = x c ] min C ()2U ; C X c=1 r c z Rob c (x c ) min C ()2U ^ ;^ C X c=1 r c z Rob c C (x c ): Applying theorem 4.4.2 again yields the result, which completes the proof. u 155 To show that problem (4.17) can be solved by bisection search for any, it suffices to show that z(0) equals reward scoring, lim !1 z() = 0 and that the constraint is monotonic in. The first two claims are clear from the definition of z(). We prove the last: Theorem C.1.2 (Monotonicity of z()). The function! P C c=1 r c z c () is non-increasing. Proof. Proof of theorem C.1.2. Let 0 1 < 2 . Then, from optimality of z( 1 ), C X c=1 r c z C ( 1 ) 1 C X c=1 z c ( 1 )r c ( g (x c ) g ) ! G g=1 C X c=1 r c z C ( 2 ) 1 C X c=1 z c ( 2 )r c ( g (x c ) g ) ! G g=1 : (C.10) Similarly, from the optimality of z( 2 ), C X c=1 r c z C ( 2 ) 2 C X c=1 z c ( 2 )r c ( g (x c ) g ) ! G g=1 C X c=1 r c z c ( 1 ) 2 C X c=1 z c ( 1 )r c ( g (x c ) g ) ! G g=1 : Adding these two equations and rearranging yields: 0 ( 1 2 ) 0 @ C X c=1 z c ( 1 )r c ( g (x c ) g ) ! G g=1 C X c=1 z c ( 2 )r c ( g (x c ) g ) ! G g=1 1 A Since 1 < 2 , this implies that C X c=1 z c ( 1 )r c ( g (x c ) g ) ! G g=1 C X c=1 z c ( 2 )r c ( g (x c ) g ) ! G g=1 : Substituting back into eq. (C.10) and rearranging shows C X c=1 r c z c ( 1 ) C X c=1 r c z c ( 2 ) + 1 0 @ C X c=1 z c ( 1 )r c ( g (x c ) g ) ! G g=1 C X c=1 z c ( 2 )r c ( g (x c ) g ) ! G g=1 1 A C X c=1 r c z c ( 2 ); which completes the proof. u 156 C.2 Extensions of the Base Model C.2.1 Fairness Constraints, and Domain-Specific Knowledge Adding constraints on z does not significantly increase the complexity of the robust model. Such constraints can be used to enforce fairness, e.g., that equal numbers of men and women be targeted for treatment. Similarly, it is straightforward to adjust the budget to the form d T zK to model the case when different patients have different costs of treatmentd c . Similarly, adding convex constraints to our uncertainty set (4.11) does not significantly increase the complexity of the model. Thus, we might incorporate domain-specific knowledge of the structure on C () by enforcing, e.g., g = 0 for someg, or by bounding its magnitude, as inl c C (x c ) u c : Applying standard techniques yields a corresponding robust counterpart. C.2.2 Incorporating Evidence from Multiple Studies When there are multiple, distinct studies providing evidence that the treatment is effective, we would prefer to incorporate all of them into our model. 
For concreteness, consider the case ofJ studies with correspond- ing confidence intervals [I j ;I j ], description functions j g () and statistics j g for all g = 1;:::;G j and j = 1;:::;J. With these data, let U j ^ j ;^ j = ( C () :X7!R 9() :X7!R; 0 2R; 2R Gj ; s.t. S (x) = 0 + Gj X g=1 g j g (x) +(x); (C.11) I j 0 + Gj X g=1 g j g I j ; S (x c ) C (x c ) C c=1 link ^ j ; kk ^ j 1 ; ((x c )) C c=1 res ^ j 2 ; ) be our usual uncertainty set built using the study evidence of thej th paper. Then a natural extension of our robust model is max z2Z min C ()2 T J j=1 U j ^ j ;^ j C X c=1 z c r c C (x c ); (C.12) 157 i.e., to consider worst-case performance over models for the CATE which are consistent with each of the studies. Again, using fairly standard techniques, we can form the robust counterpart: Theorem C.2.1 (Robust Counterpart for Multiple Papers). Problem (C.12) is equivalent to max z;w J X j=1 0 @ I j C X c=1 w j c ^ j 1 C X c=1 w j c ( j g (x c ) j g ) ! G j g=1 ^ j 2 w j c C c=1 res ^ j w j c C c=1 link 1 A s.t. J X j=1 w j c =z c r c c = 1;:::C: Proof. Proof. LetU j R C be the set n ( C (x c ) :c = 1;:::;C)j C 2U j ^ j ;^ j o : In words,U j are the set of possible realizations of the candidate CATE on the candidate population that are consistent with the study evidence from thej th paper. Then, for a fixed z, the inner minimization of Problem (C.12) is equivalent to max p C X c=1 (z c r c )p c s.t. p2 J \ j=1 U j : We recognize the maximum as the support function of the set T J j=1 U j evaluated at (z c r c :c = 1:::C). Standard results allow us to re-express this support function in terms of the support functions ofU j (see, e.g., [25]). Specifically, the above optimization is equivalent to min w J X j=1 (w j jU j ) s.t. J X j=1 w j c =z c r c ; c = 1;:::;C; where ( w j U j ) sup p2U j p T w is the support function ofU j . Pass the negative sign through the minimization and make the transformation w j !w j to write max w J X j=1 (w j jU j ) s.t. J X j=1 w j c =z c r c ; c = 1;:::;C: 158 Finally, note that (w j jU j ) = min p2U j P C c=1 w j c p c ; and this minimization is precisely the inner problem in the proof of theorem 4.4.2. Applying that result and simplifying completes the proof. u In principle, one could specify the each of the norms and parameters ^ j 1 ; ^ j 2 , ^ j , separately, although practical considerations would favor taking them to be equal. Importantly, the theorem emphasizes that in our meta-analysis setting, adding additional evidence does not affect our uncertainty set by simply shrinking its radius as in more traditional data-driven robust optimization models. Rather, additional evidence adds additional constraints to the set, which yields a more complex robust counterpart. C.2.3 Modeling Uncertainty in r c In many applications we are not interested in (dollar) rewards, but rather aggregate benefit to patients, in which case settingr c = 1 is a natural choice. Even in settings such as our case-study, where one is interested in monetary savings, there is often detailed covariate information available for the candidate population, e.g., medical history and past ED visits, which can be used to build high-quality estimates ^ r of these savings from historical data. In these cases, the uncertainty inr c is often relatively small, much smaller than the uncertainty in c , and approximatingr c ^ r c is reasonable. 
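As a minimal illustration of the preceding point, the estimates of the rewards could be built from historical data with an off-the-shelf regression model. The sketch below is ours, not the dissertation's pipeline, and all feature and column names are hypothetical.

```python
# Minimal sketch (ours, not the dissertation's pipeline) of building reward
# estimates r_hat from historical data. Column names are hypothetical.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

hist = pd.read_csv("historical_patients.csv")
features = ["age", "num_prior_ed_visits", "num_chronic_conditions", "prior_year_charges"]

model = GradientBoostingRegressor(random_state=0)
model.fit(hist[features], hist["avg_charge_per_ed_visit"])

# Point estimates r_hat_c for the candidate population; downstream, these are
# treated as known rewards when solving the robust targeting problem.
candidates = pd.read_csv("candidate_patients.csv")
candidates["r_hat"] = model.predict(candidates[features])
```

The fitted values are then treated as known rewards in the targeting problem, consistent with the approximation discussed above.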
That said, from a theoretical point of view, one could imagine settings where the uncertainty in r is large, and one wishes to“robustify” this parameter. As a simple example, suppose we model r2U r f^ r + r :krk r r g for some point estimate ^ r bounded error r, and we wish to solve max z2Z min ()2U ^ ;^ min r2U r C X c=1 z c r c (x c ): 159 A straightforward computation shows this problem is equivalent to max z2Z min ()2U ^ ;^ C X c=1 z c ^ r c (x c ) r (z c (x c )) C c=1 r wherekk r andkk r are dual norms. We recognize this as a robust problem where the uncertainty occurs in a convex fashion in the inner problem. There are a variety of approaches to attacking such problems whenU ^ is polyhedral, including, e.g., vertex enumeration and converting the problem to adjustable linear program [171, 55] which can be solved exactly via Fourier-Motzkin elimination [172] or approximately via decision-rules [104]. For clarity, U ^ ;^ will be polyhedral whenever the norms defining it are (weighted)` 1 or` 1 norms. Solving robust problems with convex uncertainty can be computationally demanding. Each of the above approaches offers its own strengths and drawbacks. The best approach is often application dependent. Since our target application is well-modeled by a knownr c , we leave a comprehensive study of the computational merits of each of the above approaches to future work. C.2.4 Other Forms of Covariate Matching as Regularization Section 4.4.5 showed that with appropriately chosen norms in our uncertainty set, we can recover several well-known covariate matching techniques as regularizers in the robust counterpart. In this section we show how, given a general covariate matching technique, one can modify the construction of eq. (4.11) to obtain the corresponding uncertainty set. Given a candidate targeting z, let r z = (r c z c ) C c=1 , and let w(z) = rz e > (rz) 2R C + . We can interpret w(z) as a discrete probability distribution onX which assigns mass w c (z) to each point x c 2X , c = 1;:::;C. We will commit a small abuse of notation and refer to w(z) and this probability distribution 160 interchangeably. Similarly, we letP S denote the empirical distribution of the covariates onX in the study population, i.e., the discrete probability distribution that assigns mass 1=S to each point x s 2 X , s = 1;:::;S. Intuitively, covariate matching techniques seek a group such that the distribution of covariates in this group closely matches that of some other, fixed group of interest. (In causal studies, this often amounts to finding a control group who closely matches the treatment group.) We restrict attention to covariate matching techniques of the form min z d(w(z);P S ); where d(;) is a function measuring the “distance” between two probability distributions defined onX . (This is where we commit our aforementioned abuse of notation.) We write “distance” in quotation marks because we do not require that the function be a metric (see below for examples). Almost all common covariate matching techniques can be written in this form for some d(;). For example, we might taked(w(z); P S ) to be an integral probability metric, such as sup AX C X c=1 w c (z)I(x c 2A) 1 S S X s=1 I(x s 2A) (Total Variation Distance) or sup f:X!R; f is 1-Lipschitz C X c=1 w c (z)f(x c ) 1 S S X s=1 f(x s ) (Wasserstein Distance): These metrics minimize the total variation and Wasserstein distance between the distribution of covariates in the target population and study population, respectively. 
Alternatively, we can taked(w(z); P S ) to be a general-divergence metric, 1 S S X s=1 P C c=1 w c (z)I(x c = x s ) 1 S P S r=1 I(x r = x s ) ! (-divergence); 161 where() is a convex function satisfying(1) = 0, 0(a=0)a lim t!1 (t)=t fora> 0 and 0(0=0) 0. (The use of in defining the-divergences is cannonical and unfortunately conflicts with our use of in defining the description functions. We will only refer to -divergences here, and stress, that with this exception of this appendix, all references to refer to description functions in the study evidence.) By specializing the function,-divergences recover many well-known probability metrics including relative entropy, Hellinger-distance, and the Cressie-Read divergences. See [24] for more examples. Importantly, our earlier covariate matching results, namely, Corollaries 4.4.2.2 and 4.4.2.3, can also be obtained as special cases of this framework. Loosely speaking, we taked(;) to be the function of the two probability distributions which first maps each distribution to the expected value of the description functions with respect to that distribution, and then applies a function to these expected values. More specifically, consider the function d(w(z);P S ) = P C c=1 w c (z)x c 1 S P S s=1 x s V 1 ; which effectively computes the mean of each distribution and then computes the weighted` 2 norm between the re- sulting means. Minimizing thisd(;) is equivalent to Mahanoblis matching (compare to Corollary 4.4.2.3). Similarly, suppose as in Corollary 4.4.2.2, that there exists a partitionX = S G+1 g=1 and theG description functions are given byI(x2X g ). Then the functiond(w(z);P S ) given by v u u t G+1 X g=1 (q z;g g ) 2 g ; where q z;g = C X c=1 w c (z)I(x c 2X g ) and g = 1 S S X s=1 I(x s 2X g ) effectively computes the mean of each description function under the two measures, and then computes the 2 -distance between them. Minimizing thisd(;) is equivalent to 2 matching as in corollary 4.4.2.2. 162 We stress that in contrast to integral probability metrics and general-divergences, the last two examples only depend onP S via the statistics of the description functions, not the entire distribution. In summary, the possibilities for distance functions and covariate matching techniques under this frame- work are numerous and our list is non-exhaustive. We refer the reader to [76, 134] and references therein for further examples and discussion. Given such a distance functiond(;), we can define a modification of eq. (4.11) that yields this distance function as a regularizer, i.e., we can obtain a “covariate matching as regularization” interpretation of the robust counterpart for a general covariate matching technique. We will require that d satisfy some mild conditions. Each of the examples above satisfies these conditions. Assumption 1 (Regularity Conditions of Distance Function). We assume that 1. d(;P S ) is convex and lower-semicontinuous in its first argument, 2. d(P S ;P S ) = 0, and 3. d(P;P S )>1 for allP. Theorem C.2.2 (General Covariate Matching as Regularization). Supposed(;) satisfies assumption 1. Let U ^ ;^ ;d = ( C () :X7!R 9() :X7!R; S () :X7!R; ;I2R; y2R C ; s.t. (C.13) S (x c ) =I +y c +(x c ); c = 1;:::;C; III; S (x c ) C (x c ) C c=1 link ^ ; d y ^ 1 ^ 1 ; ((x c )) C c=1 res ^ 2 ; ) ; 163 where d (y) sup w y > wd(w;P S ) is the convex conjugate of d(;P S ). Then, the robust targeting problem eq. (4.8) with uncertainty set eq. (C.13) is equivalent to max z2Z I C X c=1 z c r c ^ 1 C X c=1 r c z c ! 
d w(z);P S ^ 2 (z c r c ) C c=1 res ^ (z c r c ) C c=1 link : (C.14) Proof. Proof. The proof follows the proof of theorem 4.4.2 closely. Indeed, for a fixed z, we can write C (x c ) =I +y c +(x c ) + C (x c ) S (x c ). With this substitution, the inner minimization similarly decouples into the sum of four minimization problems: min I Ie > (r z) s.t. I2 [I;I]; min ;y (e y) > (r z) s.t. d y ^ 1 ^ 1 ; min ((xc)) C c=1 C X c=1 z c r c (x c ) s.t. k((x c ))k res ^ 2 ; min (v(xc)) C c=1 C X c=1 z c r c v(x c ) s.t. k(v(x c ))k link ^ ; where v(x c ) represents S (x c ) C (x c ). The solution to the first minimization problem is trivially I =I. The third and fourth minimization can again be solved using the Cauchy-Schwarz inequality, yielding optimal objectives ^ 2 kr zk res and ^ kr zk link . Only the second optimization problem remains. By Lagrange duality, this minimization is equivalent to sup t0 min (e > (r z) t ^ 1 ) + min y y > (r z) +td y ^ 1 The minimization over is finite only ift = ^ 1 e > (r z). The minimization over y is equivalent to t max y ^ 1 y > (r z=t)d (y) = td ^ 1 t r z; P S ! ; 164 where we’ve used Assumption 1 to conclude the conjugate of d is d itself. Combining and simplifying shows min ;y (e y) > (r z) s.t. d y ^ 1 ^ 1 ; = ^ 1 e > (r z)d r z e > (r z) ; P S : Combining the optimal values of all four subproblems proves the theorem. u Equation (C.14) decomposes the objective into a portion that maximizes effectiveness under a worst- case homogeneous effect scenario, and three penalties, the first of which is the covariate-matching distance and the second two are as in theorem 4.4.2. In the special case thatkk res =kk link =kk 1 , we can also simplify the robust counterpart along the lines of corollary 4.4.2.1, leaving only the covariate-matching distance as a regularizer. Theorem C.2.2 generalizes Theorem 4.4.2. Indeed, by choosing d(;) appropriately we can recover eq. (4.12). Moreover, as already noted above, we can also recover the covariate matching regularizers in Corollaries 4.4.2.2 and 4.4.2.3. Perhaps more importantly, Theorem C.2.2 gives an explicit uncertainty set which recovers general covariate matching techniques, e.g., based on total-variation or-divegences. (As an aside, for most covariate matching techniques listed above, the conjugated is known.) This result thus expands the scope of our “covariate matching as regularization” interpretation of our method. Remark (Computational Complexity). Sinced(;P S ) is convex in its first argument, the function z7! e > (r z)d rz e > (rz) ; P S is also convex in z. (Namely, the function (t; z)7! td rz t ; P S is convex in (t; z) for t > 0 since it is the perspective function of d, and our desired function is obtained by composing this function with the linear mapping z7! (e > (r z); z).) Thus, eq. (C.14) is a mixed-binary convex optimization. Developing algorithms for general mixed-binary convex optimization problems remains an active area of research, but, in our opinion, it is fair to say that from a practical perspective, solving such problems is considerably more difficult than solving mixed-binary linear or mixed-binary convex quadratic 165 optimization problems, and there are many fewer commercial codes available. This increased computational burden makes this approach somewhat less appealing practically than our previous formulations. Remark (Dependence onP S ). 
As stated earlier, covariate matching techniques based on integral probability metrics or φ-divergences typically require access to the full distribution P_S or, equivalently, {x_s : s = 1, ..., S}, in order to evaluate d(w(z), P_S). Most studies do not report these data, making these types of covariate matching impractical in our application setting. For this reason, we consider Theorem C.2.2 to be primarily of theoretical interest. Practical implementations will necessarily have to restrict attention to covariate matching techniques that depend on P_S only through the description functions and their statistics.

C.3 Additional Graphs and Numerical Results

C.3.1 Graphs from Section 4.2.3

[Figure C.1: Histogram of Avg. ED Visit Charges by Patient.]
Notes. ED visit charges are highly concentrated with a long tail. Approximately 75% of charges are between 2 and 8.3 in anonymized monetary units. For comparison, r_c ranges between 0 and 110 in anonymized monetary units, so that approximately 75% of charges occur over 6% of the range.

C.3.2 Graphs from Section 4.3

Figure C.2 shows the worst-case relative performance of outcome scoring (i.e., ry(0)-scoring) based on y_c(0). When the case management is benign, outcome scoring performs well, and its performance improves as the degree of correspondence approaches 1. When the resource constraint is relaxed (i.e., as K/C increases), the benefit of outcome scoring improves slightly. On the contrary, when case management can increase the number of ED visits by only a small fraction of what it can reduce (i.e., a degree of correspondence of -0.1), outcome scoring performs quite badly. Again, this is because the distribution of the outcomes r_c y_c(0) has a long tail for a fixed degree of correspondence.

[Figure C.2: Worst-Case Relative Performance of Outcome Scoring (ry(0)-Scoring), plotted against the degree of correspondence for benign and potentially harmful treatments, with resource constraints K/C of 10% to 50%.]
Notes. When the treatment is benign, we plot the worst-case relative performance bound (4.2) provided in Theorem 4.3.1. When the treatment is potentially harmful, the worst-case relative performance is $-\infty$, as mentioned in Remark 4.3.2; thus, we plot the ratio $\sum_{c=1}^{K} r_c y_c(0) \,/\, \sum_{c=K+1}^{2K} r_c y_c(0)$ for comparison.

C.3.3 Graphs from Section 4.5.3

Recall that for very large values of the Adj. CV parameter, the Robust-2 and Robust-Full-Linear methods may not fully utilize the budget. However, for our dataset, if one specifies the Adj. CV parameter via our method in Section 4.4.7, both methods fully utilize the budget so long as one specifies α < 50%, i.e., requires that the robust methods achieve at least 50% of the rewards in the nominal case. See Figure C.3.

[Figure C.3: Budget Utilization when Varying α; the percentage of the budget utilized by Robust-2 and Robust-Full-Linear as the loss in the nominal scenario α ranges from 0.1 to 0.9.]
Notes. For each value of α, we specify the Adj. CV parameter using the method of Section 4.4.7 and plot the corresponding percentage of the budget utilized.

We present the box-plot of rewards for each stratum given by [145] in Figure C.4.

[Figure C.4: Box-Plot of Rewards by Each Stratum (Stratum 1: 5-11 ED visits; Stratum 2: more than 11 ED visits); the vertical axis is the average charge per ED visit of each patient.]
Notes. Each point represents a patient, and the monetary values of rewards are anonymized. The dashed line is the cut-off of reward scoring for K = 200. Every patient above the line will be targeted by reward scoring.
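As a concrete illustration of the scoring rules referenced in the figure notes, the following sketch, ours rather than the dissertation's code and using synthetic data, ranks a hypothetical candidate population by reward scoring (r_c) and by r-mu_hat scoring (r_c times the estimated effect), selects the top K, and reports the reward-scoring cut-off analogous to the dashed line in Figure C.4.

```python
# Illustrative sketch with synthetic data (not the study's dataset).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
patients = pd.DataFrame({
    "r": rng.lognormal(mean=1.0, sigma=0.8, size=5000),   # reward, e.g., avg charge per ED visit
    "mu_hat": rng.uniform(0.0, 0.3, size=5000),           # estimated treatment effect
})
K = 200   # budget: number of patients that can be enrolled

# Reward scoring: target the K patients with the largest r_c.
reward_targets = patients.nlargest(K, "r")
cutoff = reward_targets["r"].min()        # analogue of the dashed line in Figure C.4
print(f"Reward-scoring cut-off for K={K}: {cutoff:.2f}")

# r-mu_hat scoring: target the K patients with the largest r_c * mu_hat_c.
patients["score"] = patients["r"] * patients["mu_hat"]
score_targets = patients.nlargest(K, "score")
```

When the estimated effect is constant across patients the two rankings coincide; they differ only when the estimated effects are heterogeneous.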
C.3.4 Graphs from Section 4.5.5

In this section, we leverage Corollary 4.4.2.4 to explain the poor performance of Robust-2 in our experiment in Section 4.5.5. Recall that, given a robust solution z^Rob, the worst-case performance bound in Corollary 4.4.2.4 consists of a worst-case homogeneous-effect term, the lower confidence limit net of the residual and link radii multiplied by the total targeted reward, minus a penalty proportional to the imbalance of the description functions over the targeted group, scaled by the remaining radius, where the radii are taken large enough that the corresponding uncertainty set for this method contains the true (unknown) CATE. We argue that, for any values of the radii in Eq. (4.20), the residual radius necessary to cover the worst-case realization in Eq. (4.20) is fairly large. Consequently, this performance bound is quite small, likely negative, and Robust-2 will perform poorly.

To see this, we compute the worst-case realized CATE over Eq. (4.20) and show that the linear projection of this CATE onto the description functions given by the strata necessarily has a large residual. Specifically, we compute the worst-case realization of the CATE over Eq. (4.20) for the solution given by Robust-2 and a particular choice of the radii. We then perform a linear regression of this (candidate) CATE on the description functions of Robust-2 and plot the resulting standard deviation of the residual (which corresponds loosely to the residual radius that Robust-2 must accommodate). Note that the most optimistic case for Robust-2 is obtained by setting the residual and link radii to zero; for any other values, this worst-case residual can only have a larger standard deviation. We plot this standard deviation versus the Adjusted CV in Figure C.5. For comparison, we perform the same procedure with Robust-Full-Linear and also plot its values. Intuitively, we think this provides strong intuition for why Robust-2 performs poorly in this setting: it is highly misspecified, so one needs to accommodate a very large residual to cover the true CATE.

[Figure C.5: Explaining Performance of Robust Methods under the Setting in Section 4.5.5; the norm of the residual for Robust-2 and Robust-Full-Linear as the Adj. CV ranges from 0.1 to 1.0.]
Notes. The standard deviation of the above residual is a rough approximation of the residual radius required for the uncertainty set of the corresponding robust model to cover the true CATE. Larger values imply poorer performance.

C.3.5 Graphs from Section 4.5.6

[Figure C.6: Difference between Robust-2 and Reward Scoring, Varying the Resource Constraint K/C; the left panel plots the average relative performance difference by CV, and the right panel plots the worst-case performance difference by Adj. CV.]
Notes. The left panel corresponds to Figure 4.2 in Section 4.5.4. The right panel corresponds to Figure 4.4 in Section 4.5.5.

[Figure C.7: Difference between Robust-2 and Reward Scoring, Varying the Reward Distribution (parameter α₁ from 0 to 50); the left panel plots the average relative performance difference by CV, and the right panel plots the worst-case performance difference by Adj. CV.]
Notes. The left panel corresponds to Figure 4.2 in Section 4.5.4. The right panel corresponds to Figure 4.4 in Section 4.5.5.
Abstract
Many e-commerce platforms have established their own logistics infrastructure, such as last-mile/pickup stations and warehouses, which has largely benefited the online marketplaces while operating as a cost center. While the existing literature has focused on either the operational efficiency of logistics or the role of digital marketplaces as separate issues, my thesis studies how e-commerce platforms can leverage the logistics infrastructure as an "offline platform" for commercial activities. Our key idea is to integrate the offline capacity with online resources and transform the logistics infrastructure into a profit center as well. Along this line, we empirically investigate how the network of last-mile stations and warehouses can serve as a new medium to connect brands and customers with physical promotional items (e.g., free samples) in the first two essays, respectively. Since there are no prior examples or observational data, we conducted large-scale field experiments in close collaboration with Alibaba, one of the world's largest e-commerce platforms, to understand the causal impact of the new business practices and explore the underlying mechanism. Moreover, because the available data from field experiments in the offline setting is often too noisy to accurately learn the heterogeneous treatment effect, decision-makers have to rely on limited information to improve their business practices. In the third essay, we develop a general data-driven optimization framework to link aggregate-level experimental data to decision making and maximize intervention effectiveness under uncertainty. In this way, my thesis demonstrates abundant opportunities and provides actionable insights for e-commerce platforms that manage their own logistics infrastructure.