OPTIMIZING PENALIZED AND CONSTRAINED LOSS FUNCTIONS WITH APPLICATIONS TO LARGE-SCALE INTERNET MEDIA SELECTION

by Courtney Paulson

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (BUSINESS ADMINISTRATION)

August 2016

Copyright 2016 Courtney Paulson

To my parents, who introduced me to probability via M&M's and family poker nights, to my sisters, who kept me in the games when my M&M supply ran low, and to Brianna, who helped me devour entire M&M fortunes on the road to Vegas

Acknowledgments

First and foremost, I would like to express my sincerest gratitude to my advisor, Dr. Gareth James. I can say without the slightest exaggeration that if it were not for Dr. James, I would never have been in a position to defend a PhD. His continued support during my graduate experience, whether patiently explaining some theoretical result or shaking his head at a misplaced comma in my code, has been invaluable. I cannot thank him enough for making this research and my career possible.

I would also like to thank Dr. Paat Rusmevichientong and Dr. Lan Luo for their immeasurable help in developing my research skills, including their myriad contributions to the work of this dissertation. Their expertise in operations and marketing, respectively, provided key insights at which I never could have arrived on my own. In addition, I would like to thank Dr. Luo for being a vital resource in preparing for both the INFORMS marketing conferences and the submission of the ISMS dissertation proposal award, an honor I could not have achieved without her. Further, I would like to thank Dr. Jinchi Lv and Dr. Peter Radchenko for serving on my dissertation committee and providing crucial feedback during my thesis process. I really appreciate the time and effort put into this process on my behalf.

I would also like to acknowledge the faculty and students of the Data Sciences and Operations department for attending numerous research presentation talks and always giving fantastic feedback on both my presentation skills and my research. Specifically, thank you to all my fellow PhD students, both past and present. I will miss having a convenient office to exchange ideas with you, but I look forward to visiting you all over the world. Even more specifically, thank you to Pallavi Basu and Luella Fu. Your endless moral support was invaluable over the course of my PhD, and I am proud to call you two of my closest friends.

Last but not least, thank you to my family for their unceasing support over not just the past five years but the twenty-some before that as well. I would not be here today without you. (Well, I might be, but the journey likely would have involved a lot less video poker, and what kind of journey is that?)

Contents

Acknowledgments
Contents
List of Tables
List of Figures
Abstract
1 Introduction
1.1 Penalized and Constrained Loss Functions
1.2 Relevant Literature Review
1.3 Optimal Large-Scale Internet Media Selection
1.4 comScore Internet Browsing Data
1.5 Summary of Chapters
2 Penalized Loss Functions
2.1 Optimizing Generalized Penalized Loss Functions
2.1.1 Formulation of the Generalized Loss Function Optimization
2.1.2 Algorithm to Optimize Generalized Loss Function
2.2 Application of Methodology to Internet Media Selection
2.2.1 OLIMS Model Formulation
2.2.2 The Optimization Algorithm
2.2.3 Model Extensions
2.3 Numerical Studies with comScore Data
2.3.1 Comparison between OLIMS and Danaher et al. (2010)
2.3.2 Simulated Large-Scale Problem: 5000 Websites
2.3.3 Case Study 1: McDonald's McRib Sandwich Online Advertising Campaign
2.3.4 Case Study 2: Norwegian Cruise Lines Wave Season Online Advertising Campaign with Mandatory Media Coverage to Travel Aggregate Sites
2.4 Discussion
3 Constrained and Penalized Loss Functions
3.1 Introduction to Penalized and Constrained Loss Functions
3.1.1 Common Motivating Examples: Monotone Curve Estimation and Portfolio Optimization
3.1.2 Larger Class of Problems: Generalized LASSO
3.2 Optimizing Penalized and Constrained Loss Functions
3.3 PAC Simulation Studies
3.3.1 Equality Constraints
3.3.2 Violations of Constraints
3.4 Application of Methodology to Internet Media Selection
3.5 Case Study: Cruise Line Internet Marketing Campaign
3.5.1 Internet Media Metric 1: Maximizing Reach
3.5.2 Internet Media Metric 2: Maximizing Clickthrough
3.6 Discussion
4 Future Directions
4.1 Extension 1: OLIMS Bidding on Impression Cost
4.1.1 Model with Cost Curves
4.1.2 Toy Demonstration Example: 5 Websites
4.1.3 Companion Results to OLIMS in Section 2.3
4.2 Extension 2: PAC Time-Varying Loss Functions
4.2.1 Overview and Approximation
4.3 Discussion
References
A Technical Appendix to Chapter 1
B Technical Appendix to Chapter 2
B.1 Simple Illustration of Correlation in Website Viewership
B.2 Supplementary Information for Model Comparisons with Danaher et al.'s Model
C Technical Appendix to Chapter 3
C.1 Methodology Details
C.2 Le Cam's Theorem

List of Tables

1.1 Number of websites in each of the 16 website categories for the December and January 2011 500-website cleaned comScore data
2.1 The number of websites chosen in each of the 16 website categories for the McRib case study in the three proposed OLIMS optimizations: the original (unconstrained), the demographic-weighted extension, and the targeted ad exposures extension. Also provided for comparison are the total number of websites in each category and the category's average cost (in CPM), as well as the number of websites in each category present in the top 25 and top 50 cost-adjusted naive comparison methods.
2.2 Actual and desired proportions for the McRib demographic-weighted advertising campaign, where the actual proportions are the proportion of users in the December 2011 cleaned comScore data who met the given demographic criteria, and the desired proportions are realistic example target proportions for the McRib advertising campaign.
3.1 Average RMSE over 100 training data sets for four LASSO methods tested in three different simulation settings and two different correlation structures. The numbers in parentheses are standard errors.
3.2 Average misspecification error (in percentages) over 100 training data sets for four binomial methods tested in three different simulation settings and two different correlation structures. The Bayes error rate is given for comparison; it represents the minimum error rate. The numbers in parentheses are standard errors, also in percentages.
3.3 Average RMSE over 100 training data sets in three different simulation settings using the ρ = 0 correlation structure. The numbers in parentheses are standard errors. The true regression coefficients were generated according to (3.21).
4.1 Average CPM, total number of page views, and number of unique visitors for the five-website toy example.
A.1 Overview of viewership correlation within and across the sixteen website categories in the December 2011 data set, used in the McRib case study. The diagonal elements represent the mean absolute correlation in that particular website category, while the off-diagonal elements represent the maximum absolute correlation between each pair of groups.
A.2 Overview of viewership correlation within and across the sixteen website categories in the January 2011 data set, used in the NCL case study. As in Table A.1, the diagonal elements represent the mean absolute correlation in that particular website category, while the off-diagonal elements represent the maximum absolute correlation between each pair of groups.
B.1 True and estimated mean α, r values for the simulated data used for comparisons between the OLIMS method and Danaher et al.'s method in Section 2.3.1.
B.2 True and estimated mean α, r values for the real data used for comparisons between the OLIMS method and Danaher et al.'s method in Section 2.3.1.

List of Figures

2.1 Performance comparison between the OLIMS method and Danaher et al.'s using simulated data, where the true optimal reach is in solid black, the Danaher estimate is in dashed red, and the OLIMS estimate is in dotted blue.
2.2 Performance comparison between the OLIMS method and Danaher et al.'s using real data, where the true optimal reach is in solid black, the Danaher estimate is in dashed red, and the OLIMS estimate is in dotted blue.
2.3 Reach results for the 5000 websites simulated with no viewership correlations, where the optimal reach (solid black) is calculated using the full data set, the OLIMS method reach (heavy dashed blue) is calculated using the 10% subset, and the naive 100 cost-adjusted reach (dashed purple), 50 cost-adjusted reach (dotted brown), 25 cost-adjusted reach (dotted red), and equal 5000-website reach (dotted-dashed green) are calculated using the given budget allocation.
2.4 Reach results for the McDonald's McRib Campaign in which the optimal reach (solid black line) is calculated using the full data set, the OLIMS method (heavy dashed blue) is calculated on the 90% holdout data using the 10% training data subset, and the naive methods (top 50 cost-adjusted websites in small dotted red, top 25 cost-adjusted websites in heavy dotted brown, top 10 cost-adjusted websites in small dashed purple, and equal allocation in dotted-dashed green) are all calculated with a non-optimized budget allocation.
2.5 Reach results with mandatory media coverage for the eight aggregate travel websites in the Norwegian Cruise Line case study, with the constrained optimization in dashed blue, the unconstrained optimization in solid black, and the equal eight-website allocation in dotted green.
3.1 Plots of the reach calculated using (left) the OLIMS reach criterion and (right) the PAC reach criterion for 500 websites using six methods: the PAC (dashed brown), OLIMS (dotted-dashed red), allocation by cost across the top 50 cost-adjusted websites (light dotted-dashed purple), allocation by cost across the top 25 cost-adjusted websites (light dashed cyan), allocation by cost across the top 10 cost-adjusted websites (dotted black), and equal allocation across all 500 websites (solid green).
3.2 Plots of the reach calculated using the PAC reach criterion for 500 websites on (left) the full data set and (right) the subset of travel users using six methods: the unconstrained PAC (thin dashed brown), the unconstrained OLIMS (thin dotted-dashed red), the constrained PAC (thick dashed black), the constrained OLIMS (thick dotted-dashed blue), allocation by cost across the subset of travel websites (dotted purple), and equal allocation across the subset of travel websites (solid green).
3.3 Plots of the clickthrough rate calculated using the PAC reach criterion for 500 websites on (left) the full data set and (right) the subset of travel users using six methods: the unconstrained PAC (thin dashed brown), the unconstrained OLIMS proxy (thin dotted-dashed red), the constrained PAC (thick dashed black), the constrained OLIMS proxy (thick dotted-dashed blue), allocation by cost across the subset of travel websites (dotted purple), and equal allocation across the subset of travel websites (solid green).
3.4 Plots of the ratio of average ad views between target demographic groups corresponding to two of the ten demographic constraints in the multiple-constraint CTR example for four methods: the constrained PAC method (thicker black line), the unconstrained PAC method (dashed brown), the constrained OLIMS method (dotted-dashed blue), and the unconstrained OLIMS method (dotted red). A solid gray line is plotted at the constraint coefficient (2 for the left panel, 1 for the right panel) for comparison.
3.5 Plots of the clickthrough rate calculated using the PAC reach criterion for 500 websites on the subset of travel users who also come from single-person households, are between the ages of 30 and 59, and have a reported income between $35,000 and $75,000 using four methods: the unconstrained PAC (thin dashed brown), the unconstrained OLIMS proxy (thin dotted-dashed red), the constrained PAC (thick dashed black), and the constrained OLIMS proxy (thick dotted-dashed blue).
4.1 Plots of the simplified cost curve formulation for four values of a_j: a_j = 0.5 (top left), a_j = 1 (top right), a_j = 2.5 (bottom left), and a_j = 5 (bottom right).
4.2 Five example cost curves (left) and the corresponding total impressions curves (right). The solid black curve has an average CPM of 3.0 and total page views of 30 million (Website 1 in Table 4.1); the thin dashed blue curve has an average CPM of 4.0 and total page views of 20 million (Website 2); the dotted red curve has an average CPM of 5.0 and total page views of 50 million (Website 3); the dotted-dashed green curve has an average CPM of 6.0 and total page views of 10 million (Website 4); the thick dashed purple curve has an average CPM of 7.0 and total page views of 40 million (Website 5).
4.3 The reach of the OLIMS optimization procedure (left), with three key budgets identified ($100,000 with the star symbol, $500,000 with the triangle symbol, and $2,000,000 with the box symbol), and (right) the five example website cost curves with the chosen c_j values for each budget, i.e. the total budget spent at each website to purchase those impressions. The solid black curve has an average CPM of 3.0 and total page views of 30 million (Website 1); the thin dashed blue curve has an average CPM of 4.0 and total page views of 20 million (Website 2); the dotted red curve has an average CPM of 5.0 and total page views of 50 million (Website 3); the dotted-dashed green curve has an average CPM of 6.0 and total page views of 10 million (Website 4); the thick dashed purple curve has an average CPM of 7.0 and total page views of 40 million (Website 5).
4.4 Reach results for the 5000 websites simulated with no viewership correlations in the cost curve OLIMS approach, where the optimal reach (solid black) is calculated using the full data set, the OLIMS method reach (heavy dashed blue) is calculated using the 10% subset, and the naive 100 cost-adjusted reach (dashed purple), 50 cost-adjusted reach (dotted brown), 25 cost-adjusted reach (dotted red), and equal 5000-website reach (dotted-dashed green) are calculated using the given budget allocation.
4.5 Reach results for the McDonald's McRib Campaign using cost curves in which the optimal reach (solid black line) is calculated using the full data set, the OLIMS method (heavy dashed blue) is calculated on the 90% holdout data using the 10% training data subset, and the naive methods (top 50 cost-adjusted websites in small dotted red, top 25 cost-adjusted websites in heavy dotted brown, top 10 cost-adjusted websites in small dashed purple, and equal allocation in dotted-dashed green) are all calculated with a non-optimized budget allocation.
B.1 Illustration of budget allocation with varying correlations in website viewership, where website 1 is in red, website 2 is in blue, and website 3 is in green.

Abstract

The use of penalized loss functions in optimization procedures is prevalent throughout the statistical literature. However, penalized loss function methodology is less familiar outside the areas of mathematics and statistics, and even within statistics loss functions are typically studied on a problem-by-problem basis. In this thesis I present general loss function approaches for both penalized as well as penalized and constrained loss functions and apply these approaches to a problem outside statistics, namely the optimization of a large-scale Internet advertising campaign.

Recent work in penalized loss functions has focused largely on linear regression methods with excellent results. Methods like the LASSO (Tibshirani 1996a) have demonstrated that penalization can yield gains in both prediction accuracy and algorithmic efficiency. In cases where linearity is not appropriate, however, the best approaches can be unclear. To this end I propose a generalized methodology, applicable to many loss functions for which the first and second derivatives can be readily computed. This leverages the efficiency benefits seen in penalized linear regression approaches and applies them to problems with general nonlinear objectives.

Unfortunately, as useful as penalization is in optimization, many problems require another step: constraints. Problems like portfolio optimization or monotone curve estimation have very natural constraints which need to be taken into account directly during the optimization. Because of this, I propose a further advance in the penalized methodology: PAC, or penalized and constrained regression. PAC is capable of incorporating constraints directly into the optimization, allowing for better parameter estimation when compared to unconstrained methods. Again, the use of penalization allows for efficient algorithms applicable to a wide variety of problems. One such problem comes from the field of marketing: the optimization of large-scale Internet advertising campaigns.
I apply both approaches to the real-world problem encountered by firms and advertising agencies when attempting to select a set of websites to use for advertising and then allocating an advertising budget over those websites. Due to the sheer number of websites available for advertising, all of which represent a unique advertising opportunity with varying cost, traffic, etc., these selection and allocation questions can present daunting challenges to anyone wanting to advertise online. At the same time, online advertising is becoming increasingly necessary to businesses of all industries. To address this, I show how the efficient penalized approaches developed in the general case can be specifically applied to this problem. I carry out two case studies using real data from comScore, Inc., which exemplify campaign considerations faced by companies today. Further, I show how these methods significantly improve on the methods currently available.

I also discuss future directions for the research in this thesis, particularly in the real-world application of Internet media campaigns. Though this research could be extended in a wide variety of directions, I focus on furthering the marketing application to extend to the most current advertising procedure, bidding on ad impressions in real time, as well as extending the penalized and constrained approach to include time-varying objective functions (for problems in which the behavior of one time period affects the optimization of a second time period).

Chapter 1

Introduction

Since both penalized and unpenalized loss functions have been well-studied in the statistical literature, this chapter provides a brief introduction to the concept while also introducing its application to a problem outside the traditional purview of statistics: marketing's study of optimizing Internet advertising campaigns. Section 1.1 provides an introduction to the loss function study expanded on in Chapters 2 and 3, while Section 1.2 provides a literature review for some of the work on which this thesis is based. Section 1.3 provides an overview of the Internet marketing campaign application considered throughout this dissertation. Section 1.4 provides details on the data set used in the results of Chapters 2 and 3. Finally, Section 1.5 provides an outline of the remainder of this dissertation for reference.

1.1. Penalized and Constrained Loss Functions

In statistics, problems are frequently considered in terms of minimizing some undesirable characteristic of a problem, broadly called the "loss" associated with making a particular choice. By formulating a problem in terms of a function to define this loss for each choice under consideration, a function referred to as a loss function, researchers can develop formal procedures to automate the decision-making process and guarantee loss minimization. For now, consider a general loss function, L, with a vector of parameters β and a matrix of data z. In terms of minimizing this loss function, researchers face the general problem of:

E_z L(β, z)        (1.1)

over all β. Researchers have a variety of options for operationalizing the problem of Equation 1.1. How they choose to do the operationalization frequently depends on the characteristics of the problem under consideration. As an example, however, consider one of the most common and thus familiar ways to operationalize E_z L(β, z): finding the parameters which minimize the sum over all the values of the loss function.
The general loss function problem becomes:

min_β ∑_{i=1}^n L(β; z_i),        (1.2)

where β is a p-dimensional vector of weights to optimize over. One very familiar example of this framework is the problem of linear regression. When L(β; z_i) = (y_i − z_i^T β)^2, (1.2) reduces to the linear regression criterion.

However, in many cases minimizing only a loss function results in problems with the model fit, for example overfitting the model. In an overfitted model, the complexity of the model leads to predictive error, since what would otherwise be background noise to the model instead is fit along with the desired data and causes wildly variable estimates given only minor changes in the data used to fit the method. To combat this situation, statisticians place a penalty on the complexity of the model. By adjusting the severity of this penalty, researchers can simultaneously adjust the level of complexity. In order for a more complex model to be preferred over a less complex model, the minimized value of the less complex model must be higher than the minimized value of the more complex model plus its penalty term.

Consider the optimization of the general loss function L above. In this optimization, the complexity of the model is driven by the parameter vector β. So, by penalizing the β vector, the model can also be penalized in terms of its complexity:

min_β ∑_{i=1}^n L(β; z_i) + λ ρ(β),        (1.3)

where ρ(β) is a general penalty term on the parameter vector β. Here, λ is a tuning parameter which controls the size of the penalty. Note that λ = 0 would eliminate the penalty and return to the original optimization problem in Equation 1.2. Thus as λ decreases to zero, so does the penalty on the parameter vector; as λ increases to infinity, likewise so does the penalty. λ = 0 then corresponds to the most complex model and λ = ∞ to the least complex model.

The statistical literature includes a number of varying ρ(·) functions, usually in the form of a regularization penalty to control model complexity as discussed above. For example, in the linear regression setting, performing variable selection with large numbers of independent variables is commonly addressed by optimizing a penalized (convex) loss function, since in this setting overfitting is a serious concern. Although most researchers are probably more familiar with methods like ridge regression (Hoerl and Kennard 1970) and the LASSO (Tibshirani 1996a), some other recent methods include the SCAD (Fan and Li 2001), the elastic net (Zou and Hastie 2005), the group LASSO (Yuan and Lin 2007), and the adaptive LASSO (Zou 2006). (See the following section for a more detailed literature review of these methods and others.) This penalization is especially crucial in the case of high-dimensional regression problems, i.e. where p is large relative to n, making it necessary to induce sparsity in parameter estimation.

However, some problems require a method with an additional step: constraining the optimization. While penalization has been proven to work well for a variety of problems, in many cases researchers wish to actually constrain some aspect of the optimization directly. For example, in budget allocation problems, an optimization cannot allocate more than the total available budget. In problems assigning weights to various assets, such as in portfolio optimization, weights must necessarily be constrained between zero and one (assuming no short-selling of assets) as well as summing to one. In these problems, the optimization procedure itself requires constraints.
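Before turning to such constrained problems, a concrete instance of criterion (1.3) may help fix ideas. The short sketch below evaluates the penalized least squares objective with an ℓ1 penalty, i.e. the LASSO case with L(β; z_i) = (y_i − z_i^T β)^2 and ρ(β) = ||β||_1; the data, function name, and λ values are illustrative only and not taken from this thesis.

import numpy as np

def penalized_loss(beta, y, Z, lam):
    # Criterion (1.3) with L(beta; z_i) = (y_i - z_i^T beta)^2 and rho(beta) = ||beta||_1,
    # i.e. the LASSO form of the penalized loss.
    residuals = y - Z @ beta
    return np.sum(residuals ** 2) + lam * np.sum(np.abs(beta))

# Illustrative data: n = 100 observations, p = 10 predictors, only 3 true non-zero coefficients.
rng = np.random.default_rng(0)
n, p = 100, 10
Z = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]
y = Z @ beta_true + rng.normal(scale=0.5, size=n)

# lam = 0 recovers the unpenalized criterion (1.2); larger lam penalizes complexity more heavily.
for lam in (0.0, 5.0, 50.0):
    print(lam, penalized_loss(beta_true, y, Z, lam))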
For problems with such constraints, even a penalized procedure is not advanced enough to be used solely to optimize the objective function. Instead, the objective function must incorporate the required constraints before a solution can be found. Say a given problem has m constraints on the objective. Further, for simplicity assume these m constraints are linear, so the optimization problem of (1.3) becomes

argmin_β L(β) + λ ρ(β)   subject to   Cβ ≤ b,        (1.4)

where now C ∈ R^{m×p} is a predefined constraint matrix and b ∈ R^m is the corresponding predefined constraint vector. This setup becomes very useful for solving problems with natural constraints. For example, if the weights of a portfolio need to sum to one, C would be a one-row matrix with p entries of 1, while b would be set to 1 (the equality itself can be enforced by also including the negated row, so the sum is bounded from both sides). In this case, L(β) = β^T Σ̂ β, where Σ̂ is the estimated covariance matrix of asset returns, if the objective is minimizing portfolio risk as in Markowitz (1952).

In the following chapters, I will use these formulations to address several current problems, both in statistical methodology and outside the statistics field. The penalized loss function procedure is explored in Chapter 2, while the penalized and constrained extension is discussed in Chapter 3. Further, while I demonstrate both methods are widely applicable to a range of problems, I choose one exemplar application with real-world implications in the field of marketing: optimizing a large-scale Internet media campaign. The background of this application is discussed below in Section 1.3.

1.2. Relevant Literature Review

Because the methods presented in the following chapters benefit strongly from previous statistical work in the areas of high-dimensional/penalized methodologies, in this section I review a selection of the relevant literature for reference. Further, since the methods of Chapters 2 and 3 make use of an ℓ1 regularization procedure, I pay particular attention to previous ℓ1 optimization procedures.

Penalty terms naturally arise in the area of high-dimensional statistics, i.e. the area focusing on problems in which p is large relative to n (in many cases, where p actually exceeds n). One common method of dealing with high-dimensional problems is through regularization, where researchers better define a problem or address overfitting issues by introducing additional information into an objective function via a penalty term. Often this is done through the use of a norm, since norms always assign a positive length to a vector. This makes them ideal for penalty terms, since penalizing a norm necessarily shrinks the value of the vector and thus encourages less complex models.

One of the simplest norms is the ℓ0 "norm," so-called because ||β||_0 counts the number of non-zero elements in a vector β (and thus does not technically fit the exact definition of a norm). In this case, a penalty term would penalize the number of non-zero parameters, giving a general problem which takes the form

L(β; z) − λ||β||_0,

where λ ≥ 0 is a tuning parameter to determine the size of the set of non-zero parameter estimates. The model then optimizes the objective subject to that set size. For example, if ||β||_0 = 10, optimization would involve choosing the best model (i.e. the model that gives the best value of L(β; z)) with exactly 10 non-zero parameters. This idea underlies several common model selection procedures. Two of the most familiar of these procedures, discussed below, are the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), both of which are used to compare models with differing numbers of non-zero parameters.
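To see what an ℓ0 fit involves in practice, the sketch below performs brute-force best subset selection for a fixed number of non-zero coefficients k, scoring every candidate subset by its residual sum of squares (the data, names, and sizes here are illustrative). Because the number of candidate subsets grows combinatorially in p, this approach does not scale, which is one reason criteria such as the AIC and BIC below, and ultimately the ℓ1 methods that follow, are preferred in larger problems.

from itertools import combinations
import numpy as np

def best_subset(Z, y, k):
    # Exhaustive l0 search: fit least squares on every subset of k predictors and
    # keep the subset with the smallest residual sum of squares.
    n, p = Z.shape
    best_rss, best_support = np.inf, None
    for support in combinations(range(p), k):
        cols = list(support)
        coef, rss, _, _ = np.linalg.lstsq(Z[:, cols], y, rcond=None)
        rss_val = rss[0] if rss.size else np.sum((y - Z[:, cols] @ coef) ** 2)
        if rss_val < best_rss:
            best_rss, best_support = rss_val, support
    return best_support, best_rss

rng = np.random.default_rng(1)
Z = rng.normal(size=(50, 8))
y = Z[:, [0, 3]] @ np.array([1.5, -2.0]) + rng.normal(scale=0.3, size=50)
print(best_subset(Z, y, k=2))   # already C(8, 2) = 28 separate least squares fits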
The AIC, proposed by Akaike in 1973, aims to choose a model by minimizing the Kullback-Leibler divergence of the fitted model from the true model. Akaike (1973) shows that, for the maximum likelihood estimator β̂, this is asymptotically equivalent to maximizing

L(β̂) − λ||β̂||_0,

where L is the log-likelihood function. This results in the familiar 2k − 2 ln(L̂) AIC measure of model complexity, where k is the number of estimated parameters and L̂ is the maximized value of the likelihood function. For the Bayesian criterion, Schwarz (1978) advocates using λ = (log n)/2 with prior distributions that have non-zero prior probabilities, resulting in the familiar −2 ln(L̂) + k ln(n) BIC measure of model complexity (where k and L̂ are defined as for the AIC measure, and n is the number of observations in the data used to fit the model). Lv and Liu (2014) take this BIC analysis a step further and generalize both the AIC and BIC in the presence of model misspecification, establishing a Kullback-Leibler divergence interpretation of the BIC as well. These approaches work well to differentiate among models where differing numbers of parameter estimates prove problematic, e.g. linear regression procedures with models of varying size (since typical measures like R^2 do not account well for additional variables). However, more sophisticated penalties exist to solve variable selection problems.

A generalization of this ℓ0 methodology for penalized linear regression is called bridge regression (Frank and Friedman 1993), also known as ℓ_q regression, since here the penalty term is formulated as λ ∑_j |β_j|^q for 0 ≤ q ≤ 2. For bridge regression, q = 0 corresponds to the ℓ0 norm of best subset selection discussed above, while q < 1 gives a non-convex penalty and is thus not ideal for optimization. The penalty is convex for q ≥ 1, however, so numerous penalized regularization procedures have evolved for the q ≥ 1 case. In particular, q = 1 represents the well-known LASSO method (Tibshirani 1996a), while q = 2 represents the well-known ridge regression method (Hoerl and Kennard 1970).

One key difference between the LASSO and ridge regression is in the way they each address model complexity, mainly in penalizing the β vector. While ridge regression can shrink parameter estimates, it can never set an individual β value to zero; all variables in the model thus retain some non-zero value, in contrast to the LASSO's variable selection properties. The LASSO, in turn, applies a constant amount of shrinkage regardless of the size of an estimated parameter, which can be shown (Fan and Li 2001) to result in poor prediction performance. Thus researchers tend to prefer ridge regression when they have many variables with potentially small effects, but they prefer the LASSO when they have only a few significant variables with large effects.

Because of this preference when only a few of the possible p variables are expected to have significant effects, LASSO and LASSO-based procedures have been explored in a number of situations for high-dimensional problems. In these cases, the ℓ1 penalty is generally used to induce sparsity in the parameter estimates, which is often desirable in problems where p is large relative to n and necessary in problems where p > n.
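The contrast between the two penalties is easy to see numerically. Assuming scikit-learn is available, the sketch below fits both estimators to the same sparse-signal data (the data and tuning values are illustrative, not taken from the thesis): the LASSO returns exact zeros, while ridge regression merely shrinks every coefficient.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(2)
n, p = 200, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]          # only a few large effects
y = X @ beta_true + rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print("LASSO non-zero coefficients:", np.sum(lasso.coef_ != 0))   # a small subset of the 20
print("Ridge non-zero coefficients:", np.sum(ridge.coef_ != 0))   # all 20, merely shrunk toward zero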
However, the standard ℓ1 LASSO penalty does suffer from some drawbacks, particularly when dealing with sparse high-dimensional data (Meinshausen 2007) or when variables are highly correlated (Zou and Hastie 2005).

To address the case of sparse high-dimensional data, Meinshausen (2007) suggests a simple two-stage LASSO procedure called the relaxed LASSO. Here, a researcher has a very large p relative to n, so a LASSO with a strict penalty (a large λ value) is ideal to select only a very small set of non-zero coefficients. Unfortunately this means the selected non-zero coefficients are likely overshrunk, since the coefficients were highly penalized. To resolve this, the relaxed LASSO makes two passes: first, the standard LASSO is run with a very high penalty, then the selected set from that highly-penalized LASSO is re-run via the LASSO with a smaller penalty. This has the effect of unshrinking the estimates, since presumably the first run of the LASSO has identified highly significant variables but has not accurately measured them due to the severe penalty.

A more serious systemic issue is the LASSO's behavior with highly correlated data. Zou and Hastie (2005) propose the elastic net procedure to compensate for the high correlation problem. The elastic net adds a ridge regression-like penalty term to the ℓ1 term, i.e. it minimizes

L(β; z) + λ_1 ||β||_1 + λ_2 ||β||_2^2.

Zou and Hastie show this criterion can be rewritten as a LASSO in a transformed parameter vector β^*, where the estimate is recovered as β̂ = β^*/√(1 + λ_2). Further, Zou and Hastie show that when the variables are orthogonal to one another, for a single parameter j, β̂_j is equivalent to (1/(1 + λ_2)) β̂_j^LASSO. This means the final estimates are actually a result of the standard LASSO estimates combined with the additional penalty term. This hybrid LASSO-ridge approach results in highly correlated covariates having similar parameter estimates, and thus the elastic net can be used to group related variables together, overcoming the correlation limitation of the standard LASSO. The standard LASSO typically chooses only one variable out of a highly related group, and there are many situations where the entire group is valuable. One well-known case of this occurs in identifying gene groups. Having the entire group is not only useful to genetic researchers, but it also reduces the prediction error noted by Fan and Li (2001).

Another method which utilizes this grouping idea to address the issue of highly correlated variables is the aptly-named group LASSO (Yuan and Lin 2007). Here, researchers pre-identify groups of related variables rather than using a selection procedure to identify them. This means that after optimization, entire groups are either wholly included or wholly excluded from the solution. This setup frequently occurs in practice; again, genetics is an excellent example, since many gene groups have already been identified or naturally occur together. The optimization criterion then becomes

L(β; z) + λ ∑_{j=1}^J ||β_j||_{K_j},

where j = 1, ..., J indexes the J groups, ||β_j||_{K_j} = (β_j^T K_j β_j)^{1/2}, and K_j is a positive definite matrix corresponding to group j. These matrices are uniquely defined, and through them the group LASSO can recover both the standard LASSO (if no groups are defined and each of the p parameters has a K_j matrix corresponding to the identity matrix) and ridge regression (if all variables are in a single group with the K_j matrix corresponding to the identity matrix). This occurs because the optimization works like ridge regression within a group, meaning all members of the group are given non-zero parameter estimates, but works like the LASSO across groups, meaning each group is treated as a single variable to be selected or not.
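A minimal sketch of the group penalty in the common special case K_j = I (so each group norm is an ordinary Euclidean norm) appears below, together with the block soft-thresholding step that standard group LASSO algorithms apply one group at a time. The group labels and values are illustrative, and the sketch is meant only to show the all-in-or-all-out behavior just described, not the specific algorithm of Yuan and Lin (2007).

import numpy as np

def group_penalty(beta, groups, lam):
    # lam * sum_j ||beta_j||_2: the group LASSO penalty term with K_j = I.
    return lam * sum(np.linalg.norm(beta[groups == g]) for g in np.unique(groups))

def block_soft_threshold(v, t):
    # Shrinks an entire group toward zero; the whole block becomes exactly zero
    # once ||v||_2 <= t, which is what removes or keeps groups as a unit.
    norm = np.linalg.norm(v)
    return np.zeros_like(v) if norm <= t else (1 - t / norm) * v

beta = np.array([0.9, -0.4, 0.05, 0.02, 1.5])
groups = np.array([0, 0, 1, 1, 2])                        # three predefined groups
print(group_penalty(beta, groups, lam=0.5))
print(block_soft_threshold(beta[groups == 1], t=0.1))     # the weak group is zeroed out entirely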
Though these procedures all make use of the ℓ1 penalty, that penalty can also be employed as part of a weighting scheme rather than a grouping scheme. In these cases, the penalty term becomes λ ∑_{j=1}^p w_j |β_j|, where w_j is a weighting function. For example, the adaptive LASSO (Zou 2006) uses the function w_j = 1/|β̂_j|^ν for some ν > 0 (where, in the case of regression, β̂_j is the least squares estimate for β_j). The idea here is to reduce the penalty when an estimate is large, and inflate the penalty when the estimate is small. (Note this procedure is applied iteratively during optimization, so the penalty actually applies to a previous estimate when calculating an updated estimate.) This has the effect of reducing the selection of noise predictors in favor of predictors with particular individual strength. In this way the adaptive LASSO functions very much like the standard LASSO, though with guaranteed consistency in its estimation (a property the standard LASSO lacks).

Unfortunately, the adaptive LASSO has an infinite penalty at a value of zero. Since the adaptive LASSO updates iteratively, once a parameter is set to zero, it will remain at zero. Other weighting functions avoid this, however. One example is the SCAD penalty function (Fan and Li 2001), which results in a weight of λ (and thus reverts to the standard LASSO estimate) when a parameter starts out at zero. This is a consequence of the nonconvex SCAD penalty, defined as

ρ_λ(β_j) = { λ|β_j|                                        if |β_j| ≤ λ,
             −(|β_j|^2 − 2νλ|β_j| + λ^2) / (2(ν − 1))       if λ < |β_j| ≤ νλ,
             (ν + 1)λ^2 / 2                                 if |β_j| > νλ,        (1.5)

where ν > 2 and λ > 0. This penalty has a number of nice properties. For example, the penalty is a quadratic spline function with knots at λ and νλ; it is continuous, and it is differentiable everywhere except at 0, where it is singular, with a derivative of 0 beyond νλ in both directions. Thus small coefficients are set to zero, intermediate coefficients are shrunk toward zero, and large coefficients are unpenalized. This means SCAD can produce a sparse set of parameter estimates which are shown by Fan and Li (2001) to be approximately unbiased for large coefficients. However, the penalty is non-convex, which presents problems in optimization procedures.
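A direct transcription of (1.5) as a NumPy function is given below; evaluating it on a grid makes the flat region beyond νλ, which leaves large coefficients unpenalized, easy to see. The grid and tuning values are illustrative.

import numpy as np

def scad_penalty(beta, lam, nu):
    # Elementwise SCAD penalty of (1.5), with nu > 2 and lam > 0.
    b = np.abs(beta)
    linear = lam * b
    quadratic = -(b ** 2 - 2 * nu * lam * b + lam ** 2) / (2 * (nu - 1))
    constant = (nu + 1) * lam ** 2 / 2
    return np.where(b <= lam, linear, np.where(b <= nu * lam, quadratic, constant))

grid = np.linspace(-4, 4, 9)
print(scad_penalty(grid, lam=1.0, nu=3.7))   # flat at (nu + 1) * lam^2 / 2 once |beta| > nu * lam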
These represent only a few of the penalized approaches used in the statistical literature, but they form the basis of the work presented in Chapters 2 and 3, particularly the use of the ℓ1 regularization penalty term.

Several optimization procedures exist to solve such penalized functions, including the well-known LARS algorithm (Efron et al. 2004), which can be adapted to solve the LASSO criterion. LARS is closely related to traditional forward selection procedures. First, the algorithm identifies the variable most closely correlated with the response vector, then continues along the coefficient path of that predictor until another predictor becomes equally correlated with the residual vector. While forward selection would then continue along the path set by the first predictor, LARS forges a new path, equiangular between the two identified predictors. This procedure is repeated for each new variable encountered along the continuously-updated equiangular path until λ determines a stopping point. Efron et al. (2004) show the LARS algorithm has a number of efficient qualities and can even be further improved upon in cases where the number of coefficients is much less than n. In the case of the LASSO, specifically for high-dimensional problems, LARS does involve slightly higher complexity if the algorithm is forced to drop a variable, but Efron et al. (2004) demonstrate the added complexity is slight and the solution always results in at most n − 1 selected variables. Thus the LARS algorithm provides an efficient solution to the standard LASSO optimization problem.

Numerous other optimization procedures exist for solving ℓ1 regularization problems, however. For the purposes of this literature review, I consider two main categories of such procedures: subgradient optimization strategies and differentiable-approximation optimization strategies. Subgradient strategies, such as coordinate descent and active set methods, rely on non-smooth optimization procedures to solve the non-differentiable ℓ1-penalized objective function. Coordinate descent methods optimize over one variable at a time. That is, after selecting a specific variable, all other variables are fixed, and the objective is optimized over the one unfixed variable. Following this same procedure, coordinate descent algorithms then cycle through the remaining parameters, starting over at the first parameter if a specified convergence criterion has not been reached. Examples of coordinate descent procedures include the shooting algorithm for the least squares criterion (Fu 1998), in which Fu cycles through the variables in order and shows a closed-form update step for each variable, and the Gauss-Seidel algorithm (Shevade and Keerthi 2003) for logistic regression, in which the update step is done via a line search. For the least squares criterion, Gauss-Seidel's line search update is equivalent to the shooting algorithm's update, but the coordinate chosen for the update is not selected by cycling through the variables in order. Instead, Gauss-Seidel chooses the variable which most violates the first-order optimality conditions (and iterates until no variable set at zero violates optimality). This means Gauss-Seidel provides a more general coordinate descent procedure than the shooting algorithm, and it is generally preferable in situations where a high level of sparsity is expected (since it has the potential to include only a few non-zero parameter estimates).

Although coordinate descent procedures work well in situations where cycling through all variables and computing updates is not cost- or time-prohibitive, their time to solution can increase drastically with increases in the dependence among variables. This occurs because updating one variable at a time and keeping the others fixed naturally lends itself to situations in which the variables are largely independent; dependent variables require many iterations to achieve convergence, since the dependent variables are not updated together. One way of addressing this is through active set optimization methods. Here, a working set of non-zero variables is identified, then these variables are optimized subject to an active set of zero-valued variables. One example for ℓ1 regularization is the grafting procedure (Perkins et al. 2003), which works similarly to the Gauss-Seidel procedure except that the grafting line search updates the entire non-zero set at once, meaning it requires fewer total line searches to achieve convergence.
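A compact sketch of the cyclic coordinate descent ("shooting") update for the ℓ1-penalized least squares criterion follows: with all other coordinates held fixed, each one-dimensional problem has a closed-form solution given by soft-thresholding. This is a generic illustration of the update described above, not Fu's (1998) exact implementation; the data, penalty value, and tolerances are illustrative.

import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_shooting(X, y, lam, n_sweeps=200, tol=1e-8):
    # Cyclic coordinate descent for ||y - X beta||^2 + lam * ||beta||_1.
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)
    for _ in range(n_sweeps):
        beta_old = beta.copy()
        for j in range(p):
            # Partial residual with coordinate j removed, then the closed-form 1-D update.
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r_j, lam / 2.0) / col_sq[j]
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
beta_true = np.concatenate([[2.0, -1.0, 0.5], np.zeros(7)])
y = X @ beta_true + rng.normal(scale=0.3, size=100)
print(np.round(lasso_shooting(X, y, lam=20.0), 2))   # null coordinates are typically set exactly to zero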
Park and Hastie (2007) also detail a subgradient descent optimization, in which the definition of the working set is expanded to include not only non-zero variables but also zero-valued variables which do not satisfy the optimality conditions. The subgradient descent approach can take considerable time to converge in cases where it adds more non-zero variables than necessary, since it also takes steps to bring the unnecessary variables back to zero.

The second broad category of strategies to address ℓ1-penalized objective functions consists of methods which employ differentiable approximations to the non-differentiable objective. With these methods, the optimization centers on a twice-differentiable approximation function, which provides a way to optimize the non-differentiable objective with known algorithms developed for differentiable functions. This works so long as the minimizer of the approximating function is sufficiently close to the minimizer of the actual function, which can be difficult to determine. However, many procedures successfully implement this approximation approach. One such procedure is smoothing, as in the example of the epsL1 approximation (Lee et al. 2006). In the epsL1 approach, the regular ℓ1 penalty is replaced with λ ∑_j √(β_j^2 + ε), where ε is some small, positive value such that the limit of √(β_j^2 + ε) as ε goes to zero is |β_j|. As long as this holds, the objective is now differentiable, and the minimizer of the new objective corresponds to the minimizer of the original ℓ1-penalized objective. Unfortunately this smoothing procedure does suffer from significant time to convergence.

Excessive time to convergence notwithstanding, perhaps the most popular differentiable approaches are bounded optimization procedures. With these approaches, the penalty term is not replaced outright to make the objective function differentiable. Instead, each iteration of the bounded optimization replaces the regularization function with a convex upper bounding function that is tight at the current value of β. Once again this becomes a two-step procedure. First, an appropriate convex upper bounding function is computed. Then the total objective function is (approximately) optimized. The bounded optimization finds a solution by alternating between these two steps until a convergence criterion is reached. There are no rigorous conditions required for the convex upper bounding function, so in theory any such function could be used. For example, Krishnapuram et al. (2005) present one common choice: the square root bounding function. Here, the bounding function is |β_j| ≤ β_j^2/(2|β_j^*|) + |β_j^*|/2, where β_j^* is the value of β_j at the previous iteration. This is just one example of an upper bounding function, however. Numerous bounding functions have been used throughout the statistical and machine learning literature to develop efficient upper bounding algorithms (e.g. Tibshirani 1996a, Fan and Li 2001, Krishnapuram et al. 2005).
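To illustrate the bound optimization idea, the sketch below applies the square root bound above to the ℓ1-penalized least squares criterion: replacing each |β_j| by β_j^2/(2|β_j^*|) + |β_j^*|/2 makes every iteration a weighted ridge problem with a closed-form solution. This is a generic majorize-and-minimize sketch in the spirit of these bounding algorithms, not the specific procedure of Krishnapuram et al. (2005); the small constant eps guarding against division by zero is my own addition, and the data are illustrative.

import numpy as np

def lasso_bound_optimization(X, y, lam, n_iter=100, eps=1e-8):
    # Majorize lam * ||beta||_1 with the square root bound at the current iterate,
    # so each iteration reduces to a weighted ridge solve.
    n, p = X.shape
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)   # simple ridge-type start
    for _ in range(n_iter):
        weights = lam / (2.0 * (np.abs(beta) + eps))             # from |b| <= b^2/(2|b*|) + |b*|/2
        beta = np.linalg.solve(X.T @ X + np.diag(weights), X.T @ y)
    return beta

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 10))
y = X[:, 0] * 2.0 - X[:, 1] + rng.normal(scale=0.3, size=100)
print(np.round(lasso_bound_optimization(X, y, lam=20.0), 3))

One practical consequence visible in the output is that coefficients which belong at zero are driven extremely close to zero but never exactly to zero, in contrast to the subgradient-based updates above.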
1.3. Optimal Large-Scale Internet Media Selection

With the increased role of Internet use in the United States economy, Internet advertising is becoming vital for company survival. In 2012, U.S. digital advertising spending (including display, search, and video advertising) totaled 37 billion dollars (eMarketer 2012). Of that 37 billion dollars, Internet display advertising accounted for 40%. Internet display ad spending is also expected to grow to 45.6% of the total in 2016, outpacing paid search ad spending (eMarketer 2012). Such an increasing trend in Internet display advertising is related to a wide range of benefits offered by this advertising format, including building awareness and recognition, forming attitudes, and generating direct responses such as website visits and downstream purchases (Danaher et al. 2010, Hoban and Bucklin 2015, Manchanda et al. 2006).

Nevertheless, firms face considerable challenges in optimal Internet media selection of online display ads. Because each website represents a unique advertising opportunity, the number of websites at which firms may potentially choose to advertise is extremely high. These websites also vary vastly by their traffic and advertising costs. Furthermore, when optimizing advertising budgets across a large number of websites, it is crucial for firms to account for the inevitable correlations in the viewership among these sites. For example, the 2011 comScore Media Metrix data (see the following section for further discussion of this data) show there is over 95% correlation in the viewership of Businessweek.com and Reuters.com. In such cases, heavy advertising on both websites will inefficiently cause firms to advertise twice to mostly the same viewers.

These challenges are so formidable that, although Internet advertising is increasingly recommended to reach consumers (e.g. Unit 2005, Chapman 2009), companies often have to rely on advertising exchanges such as DoubleClick to manage their Internet ad campaigns (Lothia et al. 2003). These exchanges are recent innovations in advertising that allow firms to outsource their Internet ad campaigns, giving firms the opportunity to expand online advertising without having to combat the challenges themselves (Muthukrishnan 2009). Generally, a company will specify campaign characteristics (such as which types of consumers to target) and pay a certain amount of money to the exchange to conduct a campaign with those characteristics. One advantage of ad exchanges is their ability to employ behavioral ad targeting, that is, targeting ads to consumers based on their Internet browsing histories (Chen et al. 2009). This is usually accomplished by installing cookies or web bugs on users' computers to track their online activity. However, this has led to numerous privacy concerns and, in some cases, legal action against behavioral targeters (Hemphill 2000, Goldfarb and Tucker 2011).

Another major concern with outsourcing Internet display ad campaigns to ad exchanges is that companies must turn over the control of the campaign to the exchange, which creates a classical principal-agent problem. While the focal firm can request target demographics, the exchange will ultimately solely determine how funds are allocated (Muthukrishnan 2009). In such cases, the ad exchange serves as a broker who maximizes its own profit by distributing ad impressions across multiple campaigns from multiple firms, rather than allocating funds in line with each individual firm's best interest. Consequently, when running an online ad campaign through an ad exchange, the focal firm's budget allocation may be more or less suboptimal compared with the alternative of managing its own campaign.

Ideally, firms would like a method which overcomes the above challenges and concerns to allow them to retain control of their online advertising campaigns, rather than entirely outsourcing such campaigns to advertising exchanges.
In particular, consider a setting in which a company wishes to maximize reach, i.e. the fraction of customers who are exposed to a given ad at least one time. (Although reach is a well-known example of an optimization metric for advertising campaigns, in practice firms have a number of possible metrics from which to choose. For example, firms may wish to maximize clicks on a particular ad or the number of purchase conversions from the ad.) In such cases, firms still face the same Internet advertising challenges of overwhelming scope and variety. Historically, to be in full control of their own online advertising campaigns, firms often had to employ heuristics to choose a select number of websites over which to advertise. These heuristics include advertising only at big-name websites like Amazon or Yahoo or allocating evenly over the most visited websites under consideration (Cho and Cheon 2004). While such heuristics have been adopted in practice, they can lead to substantially suboptimal budget allocation. For example, the five highest-traffic websites are likely not the optimal sites for firms to advertise over. Consider again the case of Businessweek and Reuters. The two websites are both high in traffic, but they actually share highly similar users. A firm will waste money without gaining many new unique ad viewers by heavily advertising on both websites, even if the firm wishes to target primarily frequent viewers of such websites. In addition, a very popular, high-traffic website may also be very expensive to advertise on and may have a large percentage of repeat visitors. Hence, it may not be the most cost-effective option for firms to spend a considerable portion of their advertising budgets on such websites. In many cases, choosing a less visited but also less expensive website could be a better choice, especially if the less-visited website reaches a particular, unique audience (i.e. the less-visited website has a low viewership correlation with the high-traffic website).

Despite the considerable importance of optimal Internet media selection for online display ads, very few researchers have proposed methods to alleviate the above challenges faced by firms. Danaher's Sarmanov-based model (Danaher et al. 2010) was among the first and most successful attempts to optimally allocate budget across multiple online media vehicles. This Sarmanov-based method has proven to work well for budget allocation in settings on the order of 10 websites. While Danaher's work represents the most state-of-the-art method for allocating Internet advertising budget, under this method the consideration of each additional website increases the optimization difficulty exponentially, such that the Sarmanov criterion becomes very difficult to optimize in settings involving more than approximately 10 websites (Danaher 2007, Danaher et al. 2010). For example, even if firms know they wish to advertise across only 10 out of 50 potential websites, they must test each possible 10-website combination, resulting in over 10 billion individual problem calculations. Since each website represents a separate advertising opportunity, such methods are hindered by the huge volume of Internet websites on which firms could potentially choose to advertise. Because of this, firms require a method which allows them to efficiently select and allocate budget among a large set of websites (e.g., thousands). One reason for the difficulty in considering a large number of websites is that the problem of choosing a subset of websites is generally NP-hard. In a setting involving p potential websites, each of the 2^p possible website subsets must be considered separately, leading to a computationally infeasible problem.
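The scale of that enumeration is easy to verify directly; assuming Python 3.8 or later, a few lines confirm the counts quoted above.

import math

print(math.comb(50, 10))    # 10,272,278,170 possible 10-website subsets of 50 candidate sites
print(2 ** 20)              # 1,048,576 subsets for a mere p = 20 websites
print(len(str(2 ** 500)))   # 2^500 has 151 digits, so enumerating subsets of 500 sites is hopeless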
Due to this complexity, the traditional marketing problem here can benefit hugely from introducing a statistical penalized methodology, as I show in Chapters 2 and 3.

1.4. comScore Internet Browsing Data

I implement the marketing application on two case studies in Chapters 2 and 3, both using subsets of the 2011 comScore Media Metrix data. This data set comes from the Wharton Research Data Service (www.wrds.upenn.edu). comScore uses proprietary software to record daily webpage usage information from a panel of 100,000 Internet users (recorded anonymously by individual computer). Therefore, the comScore data can be used to construct a matrix of all websites visited and the number of times each computer visited each website (and how many webpages were viewed at each visit) during a particular time period. This comScore Media Metrix data is commonly utilized in the marketing literature in which Internet visitation is considered (e.g., Danaher 2007, Liaukonyte et al. 2015, Montgomery et al. 2004, Park and Fader 2004). The comScore Internet browsing data is analogous to the more commonly-known Nielsen ratings for television; comScore collects not only the websites visited by users, but also household demographic information such as income level, geographic area, size of the household, etc. In this way, researchers can use the data to get an overall sense of Internet browsing both at individual levels (by particular machine) and at higher group levels (by demographic characteristics). This data can be supplemented by comScore Inc.'s Media Metrix data from May 2010 (Lipsman 2010) to provide average costs (given as CPMs, or cost per thousand ad impressions) for each website by grouping the websites into common website categories.

The data used in the following chapters is from two months of the 2011 comScore data: December 2011 (in the case of a McDonald's McRib advertising campaign) and January 2011 (in the case of a Norwegian Cruise Line's Wave Sale advertising campaign). For both case studies, in each month I manually identified the 500 most-visited websites which also supported Internet display ads. Although there is considerable overlap between the 500 identified sites for both months, the list of websites is not the same due to differing browsing patterns among users (for example, more retail websites were visited in December, presumably due to holiday shopping).

In the first case study, I consider a hypothetical yearly promotion for McDonald's McRib Sandwich, which is only available for a limited time each year. This promotional period lasts approximately one month, usually in or around December (Morrison 2012) each year. Because of this, the data used in the McRib case study comes from comScore's December 2011 data. This filtered data contains a record of every computer which visited at least one of the 500 most-visited websites at least once (56,666 users) during the month of December. Thus for the McRib case study, the comScore data is a matrix of 56,666 users by 500 websites, where the matrix entries are the total number of webpages viewed by each user at each website during the month of December.
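A sketch of how such a user-by-website matrix can be assembled from raw browsing records, and then used to compute the reach of a candidate set of sites, is given below. The column names (machine_id, domain, pages_viewed) and the tiny example log are hypothetical stand-ins, not the actual comScore schema, and visiting at least one selected site stands in here for being exposed to the ad at least once.

import pandas as pd

# Hypothetical browsing log with made-up field names; treat this purely as a structural illustration.
log = pd.DataFrame({
    "machine_id":   [1, 1, 2, 2, 3, 3, 3],
    "domain":       ["amazon.com", "reuters.com", "amazon.com", "yahoo.com",
                     "reuters.com", "yahoo.com", "espn.com"],
    "pages_viewed": [4, 2, 1, 7, 3, 2, 5],
})

# Users-by-websites matrix of total page views, mirroring the case study data structure.
views = log.pivot_table(index="machine_id", columns="domain",
                        values="pages_viewed", aggfunc="sum", fill_value=0)

# Reach of a candidate website set: fraction of users who visited at least one selected site.
selected = ["reuters.com", "espn.com"]
reach = (views[selected].sum(axis=1) > 0).mean()
print(views.shape, reach)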
Because of this, the data used in the NCL case study comes from comScore's January 2011 data. This filtered data contains a record of every computer which visited at least one of the 500 most-visited websites at least once (48,628 users) during the month of January. Thus for the NCL case study, the comScore data is a matrix of 48,628 users by 500 websites, where again the matrix entries are the total number of webpages viewed by each user at each website, this time during the month of January.

                                Number of Websites
Category          CPM      December       January
Community         2.10         23            24
E-mail            0.94          7             6
Entertainment     4.75         92           100
Fileshare         1.08         28            22
Gaming            2.68         77            75
General News      6.14         12            12
Information       2.52         47            58
Newspaper         6.99         27            12
Online Shop       2.52         29            29
Photos            1.08          9             6
Portal            2.60         30            36
Retail            2.52         57            49
Service           2.52         18            16
Social Network    0.56         17            25
Sports            6.29         17            13
Travel            2.52         10            17
Total                         500           500

Table 1.1: Number of websites in each of the 16 website categories for the December and January 2011 500-website cleaned comScore data

Table 1.1 provides the categorical makeup of the 500 websites in both the December and January 2011 data sets. It includes the sixteen broad categories of websites presented by Lipsman (2010): Social Networking, Portals, Entertainment, E-mail, Community, General News, Sports, Newspapers, Online Gaming, Photos, Filesharing, Information, Online Shopping, Retail, Service, and Travel. The December and January columns give the number of websites in each category in the respective month's data set, and the CPM values given are the average CPM values for that website category as provided by comScore Inc.'s Media Metrix data from May 2010 (Lipsman 2010). (For further details on the relationships among categories, see Tables A.1 and A.2 in Appendix A for an overview of viewership correlations within and across each of the sixteen website categories for both months.) Note that for simplicity, the CPM values given in Table 1.1 are taken from comScore Inc.'s Media Metrix May 2010 data. In practice, however, firms would likely have already obtained actual average CPMs for each individual website from previously collected data or from the advertiser directly.

1.5. Summary of Chapters

The rest of the thesis is organized as follows:

In Chapter 2, I discuss the optimization of generalized penalized loss functions. This chapter includes an overview of the work already done on generalized loss functions and the penalty terms applied to them (Section 2.1), then extends that work to demonstrate an algorithm to optimize general penalized loss functions with the specific example of a loss function selected to optimize a large-scale Internet media campaign (Section 2.2). Finally, in Section 2.3, I demonstrate the results of the method when optimizing two example advertising campaigns using the comScore data.

In Chapter 3, I extend the methodology of Chapter 2 further to problems requiring both penalization and constraints. This chapter begins by considering the problem of penalized and constrained loss functions (Section 3.1) and an algorithm to optimize such loss functions (Section 3.2). Then I again extend that work to demonstrate the methodology with the specific example of a loss function to optimize a more complex large-scale Internet media campaign (Section 3.4). Finally, in Section 3.5, I show the results of the penalized and constrained methodology in a case study using the real comScore data, particularly in comparison to the previous penalized methodology without constraints.
In Chapter 4, I discuss future directions of the research, namely expanding the methodology of Chap- ter 2 to include a more current advertising technique, real-time ad bidding, and extending the methodology of Chapter 3 to time-varying loss functions. In the case of real-time bidding procedures, the extension allows for very dynamic advertising allocation techniques and demonstrates the applicability to even cur- rent, state-of-the-art advertising campaign considerations. In the time-varying extension, I focus on some initial exploration of potential techniques to extend the analysis of Chapters 2 and 3 to problems where researchers may find it useful to consider stages of optimization, that is, where information gathered dur- ing a previous stage may help inform the optimization of a later stage. Both extensions make use of more involved methodology to incorporate desirable features of real-world optimizing problems. 16 Chapter 2 Penalized Loss Functions Loss functions are well-represented throughout the statistical literature, and many common penalty terms are routinely employed in cases where maximizing a loss function involves some sort of potential overfit- ting of the data. However, while these methods may be well-known in the statistical literature, they are less common in other fields. In addition, penalties associated with loss functions in statistics often focus on linear regression procedures, like the well-known LASSO (Tibshirani 1996a). These procedures demon- strate numerous advantages in optimization, particularly in the efficiency of algorithms created to arrive at parameter estimates, but only directly apply when the data shows a linear relationship. Since many prob- lems naturally involve nonlinear relationships, this can be a limitation. In cases where penalized procedures do exist to accommodate a nonlinear relationship, such as genetic algorithms (Grefenstette 1986, Homaifar et al. 1994), they frequently call on linear procedures to supplement the methodology, e.g. Cai et al. (2001). In research where no linear regression procedure is present, the proposed methods solve only one particular problem, that is the solution is only applicable to a specific type of objective function. In this chapter, I aim to introduce a generalized penalized loss function methodology, capable of optimizing both linear and non- linear objective functions. This approach draws on existing statistical methodology to define a procedure for efficiently optimizing these loss functions, including an algorithm implemented via the familiar coordinate descent approach. Further, I demonstrate the method and algorithm on a problem outside statistics with a non-quadratic objective: optimizing the reach of an Internet advertising campaign. Though this is just one of many problems outside the traditional purview of statistics which can benefit from such a generalized methodology, the real data available to test the method makes Internet advertising an excellent example of the advantages other fields can capture by incorporating these statistical procedures, particularly when previous methods cannot handle large-scale optimization. 17 2.1. Optimizing Generalized Penalized Loss Functions 2.1.1 Formulation of the Generalized Loss Function Optimization Recall from Section 1.1 that, in order to optimize a general loss function, L, with a vector of parametersβ and a matrix of dataz, a researcher would want to solve: E z L(β;z), (2.1) whereβ∈ R p is a vector of coefficients over which to optimize. 
In Section 1.1, Equation 2.1 was operationalized using a common summation, which led to the use of an ℓ1 penalty to penalize model complexity via the β vector and avoid overfitting:

\min_{\beta} \sum_{i=1}^{n} L(\beta; z_i) + \lambda \lVert \beta \rVert_1,    (2.2)

where again β is the p-dimensional parameter optimization vector, and ‖·‖_1 represents the ℓ1 norm, Σ_j |β_j|. In the special linear regression case, where L(β; z_i) = (y_i − z_i^T β)^2, (2.2) reduces to the LASSO criterion used for fitting a linear regression model (Tibshirani 1996b). The LASSO has been well-studied, and in particular several attractive and computationally efficient algorithms have been derived for computing its solution, such as the LARS algorithm for computing the entire path of solutions for all values of λ (Efron et al. 2004), and coordinate descent algorithms which may improve even on the LARS algorithm for efficiency (Friedman et al. 2007, 2010a). This efficiency is especially crucial in high-dimensional regression problems, i.e. problems where the number of parameters p is large relative to the number of observations n. Because of the efficiency of its algorithms, the LASSO is a popular choice for high-dimensional problems, and the ℓ1 penalty acts to implement variable selection for these problems. For example, one method discussed in Section 1.2, the adaptive LASSO (Zou 2006), has been shown to perform well in high-dimensional, sparse linear regression problems (Huang et al. 2006). Unfortunately, however, there are numerous high-dimensional problems for which regression is neither appropriate nor desirable. In Section 2.2, I consider one specific example in the context of marketing campaigns.

In this way, (2.2) demonstrates a significant advantage: it is not restricted to regression problems. Because L(β; z) is a general loss function, the problem formulation using L(β; z) allows for the substitution of any such loss function into the analysis with minimal alterations. Thus regression problems are only a subset of the possible applications of the generalized method. This generalization leads to a wider high-dimensional applicability than high-dimensional regression analysis and can easily be implemented across various research areas and platforms, for example advertising budget allocation as considered in Section 2.2. In addition, if the procedure is implemented over a grid of λ values, the tuning parameter λ can be chosen to suit the problem at hand. While such tuning is frequently used in regression-based models to choose the best overall minimization (e.g., Lee et al., 2006), the choice of λ itself can correspond to a significant aspect of the problem, again for example a budget constraint as seen in Section 2.2.

2.1.2 Algorithm to Optimize Generalized Loss Function

One major advantage of a generalized loss function procedure is that an algorithm developed to solve the generalized loss function can be used for many specific loss functions once they are defined. Thus if an algorithm efficiently solves the generalized problem without requiring any particular loss function definition, it can reliably solve very specific loss functions as well. As noted in Section 2.1.1, the linear regression LASSO method has resulted in several very efficient algorithms to optimize the quadratic linear regression loss function plus the ℓ1 penalty.
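To make the flavor of those algorithms concrete, the following minimal Python sketch implements a textbook coordinate-descent update for the LASSO itself, soft-thresholding one coordinate at a time. It is purely illustrative (it is not the implementation used in this thesis), and all function and variable names are my own.

import numpy as np

def soft_threshold(u, lam):
    # S(u, lam) = sign(u) * max(|u| - lam, 0)
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def lasso_coordinate_descent(Z, y, lam, n_sweeps=100):
    # Minimize (1/2) * sum_i (y_i - z_i' beta)^2 + lam * ||beta||_1 by cycling
    # through the p coordinates; Z is n x p, y is length n.  Assumes no column
    # of Z is identically zero.
    n, p = Z.shape
    beta = np.zeros(p)
    col_ss = (Z ** 2).sum(axis=0)                 # sum_i z_ij^2 for each j
    for _ in range(n_sweeps):
        for j in range(p):
            # partial residual with coordinate j's contribution added back in
            r_j = y - Z @ beta + Z[:, j] * beta[j]
            beta[j] = soft_threshold(Z[:, j] @ r_j, lam) / col_ss[j]
    return beta

Each coordinate update is available in closed form, which is why full sweeps over all p coordinates remain cheap even when p is large; the generalized algorithm developed next retains this structure while dropping the linear-regression form of the loss.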
This problem falls into a common statistical form, namely:

f(\beta) = l(\beta) + \sum_{j=1}^{p} k_j(\beta_j),    (2.3)

where l(β) is a differentiable convex function of β and k_j(β_j) is a convex but not differentiable function, e.g. k_j(β_j) = λ|β_j|. It has been shown in Luo and Tseng (1992) that a coordinate descent algorithm, which iteratively minimizes the criterion as a function of one coordinate at a time, will achieve a global minimum for functions of the form in (2.3). Thus if the generalized loss function optimization can be rewritten to follow the form specified by Luo and Tseng in (2.3), convergence using coordinate descent is guaranteed for the criterion.

Unfortunately, however, a general loss function cannot be guaranteed to begin in the proper form. To address that, consider a second-order Taylor approximation to the general function. By Taylor's theorem,

L(\beta; z_i) \approx L(\tilde{\beta}; z_i) + \sum_{j=1}^{p} L'_{ij}(\beta_j - \tilde{\beta}_j) + \frac{1}{2}\sum_{j=1}^{p}\sum_{k=1}^{p} L''_{ijk}(\beta_j - \tilde{\beta}_j)(\beta_k - \tilde{\beta}_k),

where L'_{ij} = \partial L(\beta; z_i)/\partial \beta_j and L''_{ijk} = \partial^2 L(\beta; z_i)/\partial \beta_j \partial \beta_k, both evaluated at β = β̃, and β̃ is a fixed p-dimensional vector. The accuracy of the approximation improves as β → β̃. Hence, (2.2) can be approximated by

\min_{\beta} \sum_{i=1}^{n}\left[ \sum_{j=1}^{p} L'_{ij}(\beta_j - \tilde{\beta}_j) + \frac{1}{2}\sum_{j=1}^{p}\sum_{k=1}^{p} L''_{ijk}(\beta_j - \tilde{\beta}_j)(\beta_k - \tilde{\beta}_k) \right] + \lambda \lVert \beta \rVert_1.    (2.4)

This optimization criterion then takes the form of

f(\beta) = l(\beta) + \sum_{j=1}^{p} k_j(\beta_j),    (2.5)

where l(\beta) = \sum_{i=1}^{n}\left[ \sum_{j=1}^{p} L'_{ij}(\beta_j - \tilde{\beta}_j) + \frac{1}{2}\sum_{j=1}^{p}\sum_{k=1}^{p} L''_{ijk}(\beta_j - \tilde{\beta}_j)(\beta_k - \tilde{\beta}_k) \right] is a differentiable convex function of β and k_j(β_j) = λ|β_j| is a convex but not differentiable function. The problem is then in the form specified by Luo and Tseng and meets the criteria to guarantee algorithmic convergence using coordinate descent.

To minimize (2.4) over β_j, with all β_k, k ≠ j, fixed, first compute the partial derivative with respect to β_j and set it equal to zero:

\sum_{i=1}^{n}\left[ L'_{ij} + L''_{ijj}(\beta_j - \tilde{\beta}_j) \right] + \lambda\,\mathrm{sign}(\beta_j) = 0

for β_j ≠ 0. In order to derive the update step, expand this to

\sum_{i=1}^{n}\left[ L'_{ij} + L''_{ijj}\beta_j - L''_{ijj}\tilde{\beta}_j \right] + \lambda\,\mathrm{sign}(\beta_j).

Then, since the update is for β_j, solve for β_j:

\sum_{i=1}^{n} L''_{ijj}\,\beta_j = \sum_{i=1}^{n} L''_{ijj}\,\tilde{\beta}_j - \sum_{i=1}^{n} L'_{ij} - \lambda\,\mathrm{sign}(\beta_j).

Divide both sides by \sum_{i=1}^{n} L''_{ijj} to get the update step for the parameter estimate for a particular coordinate j:

\beta_j = \begin{cases} \tilde{\beta}_j - \dfrac{\sum_{i} L'_{ij} + \lambda\,\mathrm{sign}(S_j)}{\sum_{i} L''_{ijj}} & \text{for } |S_j| > \lambda \\ 0 & \text{otherwise,} \end{cases}    (2.6)

where S_j = \sum_{i=1}^{n}(\tilde{\beta}_j L''_{ijj} - L'_{ij}). This suggests Algorithm 1 for minimizing (2.4).

Algorithm 1 (Coordinate Descent Algorithm)
0. Initialize β̃ = 0 and j = 1.
1. Compute L'_{ij} and L''_{ijj} for i = 1, ..., n.
2. Compute β_j using (2.6).
3. Repeat Steps 1 and 2 for j = 1, 2, ..., p and iterate until convergence.

Algorithm 1 can be applied for any fixed value of λ. In practice, the algorithm should be implemented over a fine grid of decreasing values of λ, initializing β̃ using the final solution from the previous grid point. Note that to use the general methodology here, L should be both convex and twice-differentiable.

2.2. Application of Methodology to Internet Media Selection

Although Section 2.1 lays out a general procedure for any given loss function, the procedure can be demonstrably useful in very specific situations as well. In this section, I consider one such specific problem in the field of marketing, the optimization of an Internet media selection campaign. (A short numerical sketch of Algorithm 1 is given immediately below, before the application is developed.)
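As a concrete illustration of Algorithm 1, the following minimal Python sketch applies the update in Equation (2.6) coordinate by coordinate for a generic loss. The caller supplies two functions returning the n-vectors of first derivatives L'_{ij} and second derivatives L''_{ijj} evaluated at the current β̃; the sketch assumes the convexity and twice-differentiability conditions stated above, and all names are illustrative rather than part of any existing package.

import numpy as np

def coordinate_descent_general(grad_fn, hess_fn, p, lam, n_sweeps=50, tol=1e-8):
    # Algorithm 1 for sum_i L(beta; z_i) + lam * ||beta||_1, using the update in
    # Equation (2.6).  grad_fn(beta, j) returns the n-vector of L'_{ij} and
    # hess_fn(beta, j) the n-vector of L''_{ijj}, both evaluated at the current beta.
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        beta_old = beta.copy()
        for j in range(p):
            g = grad_fn(beta, j)       # L'_{ij}, i = 1, ..., n
            h = hess_fn(beta, j)       # L''_{ijj}, i = 1, ..., n (positive for a convex loss)
            S_j = np.sum(beta[j] * h - g)
            if abs(S_j) > lam:
                beta[j] = beta[j] - (np.sum(g) + lam * np.sign(S_j)) / np.sum(h)
            else:
                beta[j] = 0.0
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta

In practice this routine would be called over a fine, decreasing grid of λ values, warm-starting β at each grid point from the previous solution, exactly as described for Algorithm 1 above.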
In this case, a firm has a large number of potential advertising opportunities, since each website on the Internet represents a unique potential advertising opportunity. A firm must then decide not only to which websites to allocate budget but also how much budget should be allocated to each. This is a situation in which anℓ 1 penalty makes sense, as it encourages sparsity (the firm would like to minimize the total number of websites on which it advertises to streamline its campaign), but since linear regression is not appropriate, a LASSO procedure cannot be uti- lized. Because of this, the advertising problem is an excellent candidate for the general procedure outlined in Section 2.1, which in deference to the Internet advertising application I refer to as the OLIMS (Optimal Large-scale Internet Media Selection) method. 21 2.2.1 OLIMS Model Formulation Consider a firm that has a budget B for a campaign to be run over a particular time span (e.g., one month or one quarter). A common goal for such a campaign would be to allocate the firm’s budget across a set of p possible websites to maximize the probability an Internet user views the ad at least once during the campaign. This probability is known as the reach of a campaign. 1 Although the definition of reach is clear, the function to model it (in cases where it cannot be calculated empirically) varies. Here I present one such function, based on the Poisson distribution due to ad appearances behaving as an arrival process. Let β j represent the budget allocated to advertising at the jth website, where j = 1,...,p. Further, letX ij represent the number of times an ad appears to customer i during her visits to website j during the course of the ad campaign, wherei = 1,...,n. Hence,Y i = P p j=1 X ij corresponds to the total number of ad appearances to customeri over all websites. LetZ also denote ann byp matrix, withz ij corresponding to the number of webpages visited by customer i at website j during the time span of the ad campaign. In practice, such data (e.g., the comScore Media Metrix data referenced in Section 1.4) are available from commercial browsing-tracking companies. Within this context, the problem can be formulated as a fairly common marketing scenario: given the firm is constrained by a budget B, how does it allocate that budget to maximize reach during its Internet display ad campaign? Mathematically this is equivalent to the following optimization problem: min β 1 n n X i=1 P(Y i = 0|z i ,β) subject to p X j=1 β j ≤B, and β j ≥ 0, j = 1,...,p, (2.7) whereβ = (β 1 ,...,β p ) denotes the budget allocation to thep websites, andz i = (z i1 ,...,z ip ) represents the number of times consumeri visits thep websites over the course of the Internet ad campaign. It is challenging to solve Equation (2.7) because p may be in the thousands, which means this is an extremely high dimensional optimization problem. Additionally, the optimal solution to Equation (2.7) should be able to accommodate corner solutions (i.e., the solution should allowβ j = 0 to arise as an optimal 1 Again note this is simply one metric firms consider in practice. Typically the goal of a marketing campaign changes with the campaign itself. For simplicity, I consider the most familiar advertising metric, reach. 22 solution for certain websites). Both challenges can be addressed using the formulation of Equation 2.2, as shown below. First, express P(Y i = 0|z i ,β) as a function ofz i andβ, whereY i = P p j=1 X ij . 
A natural approach is to model X_ij as a Poisson random variable with expectation γ_ij, i.e. X_ij | z_ij, β_j ∼ Pois(γ_ij) or equivalently,

P(X_{ij} = x \mid z_{ij}, \beta_j) = \frac{e^{-\gamma_{ij}}\,\gamma_{ij}^{x}}{x!}.    (2.8)

In Equation 2.8, γ_ij is modeled as the expected number of ad appearances to consumer i at website j, given the number of webpages visited by the consumer at site j (z_ij) and the amount of money the focal firm spends on advertising at the site (β_j). This expected number of ad appearances is given by the probability of an ad appearing on a random visit to website j (denoted as s_j) multiplied by the number of visits (z_ij), i.e. γ_ij = s_j z_ij. For example, if a firm buys 20% of ad impressions at a particular website, and a consumer visits that website ten times during the course of the ad campaign, γ_ij = 0.2 × 10 = 2. In this example, on average the consumer is expected to see the ad twice during the ten visits. The probability the ad appears is simply the number of ad impressions bought at the website over the total number of expected visits by all customers to the site, so s_j is called the share of ad impressions (Danaher et al. 2010). Note that, because of this, s_j is interchangeable with β_j; buying all ad impressions for website j means s_j = 1 (or, equivalently, β_j is maximized such that the ad appears to all visits to the site), while buying no impressions means s_j = 0 (or, equivalently, β_j = 0).

The variables s_j and β_j actually have an exact correspondence. Let τ_j represent the expected total number of visits at the jth website during the course of the ad campaign. Following Danaher et al. (2010), operationalize τ_j as τ_j = φ_j N, with φ_j being the expected number of per person pages viewed at site j during the ad campaign, and N being the total Internet population. Let c_j represent the cost to purchase 1000 impressions. (Note that this is an industry standard, referred to popularly as CPM.) Then the total number of impressions purchased will be given by 1000 β_j / c_j. Hence, the corresponding relationship between s_j (share of ad impressions) and β_j (budget spent) is as follows:

s_j = \frac{1000\,\beta_j}{\tau_j c_j}.

For example, if the CPM of a particular website is $2, the expected total number of visits to the website during the entire ad campaign is 10 million, and the firm spends $500 advertising on the website, the firm has bought s_j = 2.5% of the ad impressions at that website.

Given γ_ij = s_j z_ij and substituting s_j with 1000 β_j / (τ_j c_j), express γ_ij as a function of z_ij and β_j as below:

\gamma_{ij} = \theta_{ij} \times \beta_j \quad \text{where} \quad \theta_{ij} = \frac{1000\, z_{ij}}{\tau_j c_j}.    (2.9)

In Equation (2.9), θ_ij is a known quantity given values of z_ij, τ_j, and c_j. With this setup, correlations in viewership among the p websites are directly captured in the z_ij terms, which carry into θ_ij and then into γ_ij. See Appendix B for a simple illustration which demonstrates how correlations in the Z matrix are incorporated by the method. Thus Y_i = Σ_{j=1}^{p} X_ij can be conditionally modeled as a Poisson distribution with expected value γ_i = Σ_{j=1}^{p} γ_ij, i.e.

P(Y_i = y \mid z_i, \beta) = \frac{e^{-\gamma_i}\,\gamma_i^{y}}{y!}.    (2.10)

This formulation works due to the conditional independence of ad appearances at a website. That is, given a person visits a website, the probability of an ad appearing to him or her is exactly the same as any other user or visit. Thus combining Equation (2.10) with the original Equation (2.7) gives the optimization criterion:

\min_{\beta} \frac{1}{n} \sum_{i=1}^{n} e^{-\gamma_i} \quad \text{subject to} \quad \sum_{j} \beta_j \le B \quad \text{and} \quad \beta_j \ge 0, \; j = 1, \dots, p.
(2.11) The optimization in Equation (2.11) has the following appealing properties. First, because the objective function is a well-behaved convex and smooth function, it is relatively easy to solve the optimization, even for large values of p. This transforms the original problem from NP-hard to one that is relatively easy to optimize. The algorithm will also not stall at suboptimal local minima. Second, the form of Equation (2.11) encourages sparsity in the solution. Under each given budget, as the number of websites under consideration increases, the optimization criterion will automatically set a budget of zero to more websites (hence the corner solutions as desired; see more discussions on this in Hastie et al. 2009, p. 71). Lastly, given the convex and smooth nature of the objective function, prior budget solutions can be used as effective starting points for neighboring budgets, as utilized in Algorithm 1. Therefore, the function can be efficiently optimized over a range of budgets rather than merely solving one particular budget at a time. 24 2.2.2 The Optimization Algorithm Note that P j β j ≤B in Equation 2.11 is equivalent to theℓ 1 penalty in the procedure outlined in Section 2.1. Here, the tuning parameter λ acts as the budget-setting tool, since there is a one-to-one correspondence between a particular λ and a particular budget B. For a given number of websites, as budget increases, λ decreases, and the algorithm allocates more budget to more websites. As budget decreases,λ increases, and the solution gets sparser. Using theℓ 1 notation, the optimization takes the form: min β 1 n n X i=1 e −γ i +λkβk 1 (2.12) whereλ> 0 since the budget constraint forces the budget to always be nonnegative. Although there is no direct closed form solution to Equation (2.12), the algorithm of Section 2.1.2 can be used to obtain a solution via coordinate descent. For the specific problem of Internet media selection, Algo- rithm 1 simplifies the optimization to a single one-dimensional optimization as described in Algorithm 2: Algorithm 2 Coordinate Descent Algorithm for Budget Optimization 1. Specify a maximum budget,B max . 2. Initialize algorithm with ˜ β =0,j = 1, andλ corresponding toB = 0. 3. Forj in 1 top, (a) Marginally optimize Equation (2.12) over a single website budget β j , keeping β 1 ,β 2 ,...,β j−1 ,β j+1 ,...,β p fixed. (b) Iterate until convergence. 4. Increase budget by incrementally decreasing λ over a grid of values, with each λ corresponding to a budget, and repeat Step 3 until reachingB max . What makes this approach so efficient is that each update step is fast to compute and typically not many iterations are required to reach convergence in Step 3 of the algorithm. Note that again convergence is guaranteed by (Luo and Tseng 1992). Thus the optimization becomes very efficient to solve for a range of budgets at once. 25 The coordinate descent approach used in the algorithm is centered around a pointβ j , so the second order Taylor approximation ofe −γ i aroundβ j is computed as follows: e −γ i ≈ e −˜ γ i 1− p X j=1 θij(βj− ˜ βj)+ 1 2 p X j=1 p X k=1 θijθ ik (βj− ˜ βj)(β k − ˜ β k ) ! s.t. βj,β k ≥ 0, j,k = 1,...,p, (2.13) where ˜ γ i = P p j=1 θ ij ˜ β j , and ˜ β j can be taken as the most recent estimate for β j based on the last iteration of the algorithm. Substituting (2.13) into (2.12) and computing the first order condition with respect to β j , all terms involvingβ 1 ,β 2 ,...,β j−1 ,β j+1 ,...,β p drop out of the criterion. Hence, up to an additive constant (i.e. 
the first term of the Taylor expansion), Equation (2.12) is approximated for a particular coordinate β_j by:

\min_{\beta_j} \frac{1}{n} \sum_{i=1}^{n} e^{-\tilde{\gamma}_i} \left[ -\theta_{ij}(\beta_j - \tilde{\beta}_j) + \frac{1}{2}\theta_{ij}^{2}(\beta_j - \tilde{\beta}_j)^{2} \right] + \frac{\lambda}{n}\beta_j \quad \text{subject to} \quad \beta_j \ge 0.    (2.14)

Because no closed-form solution exists for Equation (2.12), the second-order Taylor approximation in Equation (2.13) is employed, resulting in the simplified criterion of Equation (2.14); its first order condition yields the update in Equation (2.16) below, with the "otherwise" condition enforcing β_j ≥ 0. To minimize Equation (2.14) over β_j, with all β_k, k ≠ j, fixed, first compute the partial derivative with respect to β_j:

\sum_{i=1}^{n} \theta_{ij}\, e^{-\sum_{j} \theta_{ij} \tilde{\beta}_j} \left[ -1 + \theta_{ij}(\beta_j - \tilde{\beta}_j) \right] + \lambda    (2.15)

for β_j > 0. Setting (2.15) equal to zero gives Equation (2.16):

\beta_j = \begin{cases} \tilde{\beta}_j + \dfrac{\sum_{i=1}^{n} e^{-\tilde{\gamma}_i}\theta_{ij} - \lambda}{\sum_{i=1}^{n} e^{-\tilde{\gamma}_i}\theta_{ij}^{2}} & \text{for } H_j > \lambda \\ 0 & \text{otherwise,} \end{cases}    (2.16)

where H_j = \sum_{i=1}^{n} e^{-\tilde{\gamma}_i}\theta_{ij}(\tilde{\beta}_j \theta_{ij} + 1) (note that H_j is always positive here). This "otherwise" condition arises from the non-differentiability of the ℓ1 penalty. Equation (2.15) should actually contain λ sign(β_j) rather than simply λ, but here all β_j must be non-negative. With typical ℓ1 regularization penalties like the LASSO, estimates can be either positive or negative but must have a large enough absolute value to be included in the model, where "large enough" is controlled by λ. In this methodology, Equation (2.16) incorporates the β_j ≥ 0 condition by testing whether the β_j coefficient has been forced below zero by the update. If it has, that coefficient is set to 0, the minimum value allowed since budget cannot be negative. This equation can be computed quite efficiently.

Equation (2.15) can also be used to find a starting point for Algorithm 2, i.e., the λ value corresponding to B = 0. To do this, first employ the same procedure of calculating H_j as used in Equation (2.16). In Equation (2.16), H_j measures whether the update would force a coefficient below zero; the same quantity can be used to initialize the first λ at which B = 0. In particular, first define β̃_j = 0 for all j = 1, ..., p, which corresponds to zero budget. Then, calculate H_j for each website and set the initial λ value to max H_j, with j = 1, ..., p. To calculate increasing budgets, the algorithm uses this value as λ_max and incrementally decreases λ by steps. The step size and number of steps are both parameters of the algorithm and are thus specified by the researcher depending on desired granularity and maximum budget. For example, for the McRib case study in Section 2.3.3, I ran the algorithm with 500 steps at a step size of 0.01. The optimization in Equation (2.12) can be solved by iteratively computing Equation (2.16) for j from 1 to p and repeating until convergence.² The algorithm in Section 2.1.2 works well for the budget allocation problem, even though no closed form solution exists to solve the reach function. (A short numerical sketch of this update and the λ_max initialization is given just after the overview of extensions below.)

2.2.3 Model Extensions

The marketing problem considered in Section 2.2.1 is representative of a basic advertising reach campaign, but in practice firms frequently wish to incorporate more sophisticated and targeted techniques in their campaign. The method described previously in this chapter is flexible enough to incorporate these more advanced techniques. In this section, I describe three common techniques and the ways in which the method and algorithm can utilize these extensions. To see these extensions applied in the context of Internet media selection, see Sections 2.3.3 and 2.3.4.
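To make the preceding update concrete, the following minimal Python sketch implements the coordinate update of Equation (2.16) together with the λ_max initialization, for a precomputed n × p matrix Theta of the θ_ij values. It is a bare-bones illustration only: it omits, for instance, the cap corresponding to s_j ≤ 1, and the function names are mine rather than part of any released code.

import numpy as np

def olims_lambda_max(Theta):
    # With beta = 0 every gamma_i is 0, so H_j = sum_i theta_ij; the largest H_j is
    # the smallest lambda at which all websites receive zero budget (B = 0).
    return Theta.sum(axis=0).max()

def olims_coordinate_descent(Theta, lam, beta=None, n_sweeps=50, tol=1e-8):
    # Coordinate descent for (1/n) sum_i exp(-gamma_i) + (lam/n) sum_j beta_j with
    # beta_j >= 0, where gamma_i = sum_j theta_ij * beta_j (Equation (2.12)).
    n, p = Theta.shape
    beta = np.zeros(p) if beta is None else beta.copy()
    gamma = Theta @ beta
    for _ in range(n_sweeps):
        beta_old = beta.copy()
        for j in range(p):
            w = np.exp(-gamma)                          # exp(-gamma_i) at current estimates
            H_j = np.sum(w * Theta[:, j] * (beta[j] * Theta[:, j] + 1.0))
            if H_j > lam:                               # Equation (2.16)
                new_bj = beta[j] + (np.sum(w * Theta[:, j]) - lam) / np.sum(w * Theta[:, j] ** 2)
            else:
                new_bj = 0.0
            gamma += Theta[:, j] * (new_bj - beta[j])   # keep gamma_i in sync
            beta[j] = new_bj
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta

Decreasing λ from olims_lambda_max(Theta) over a grid, warm-starting each fit from the previous solution, traces out allocations for a whole range of budgets, since each λ corresponds to a particular total spend Σ_j β_j.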
2 Due to the Taylor approximation in the algorithm, I also did some empirical evaluation to verify the convergence of the approximation. I ran the algorithm with numerous initialization points to determine whether the optimization had converged to a global optimum. In all cases, I obtained identical solutions regardless of initialization points, and the convergence was achieved under very few iterations. 27 Extension 1: Targeted Consumer Demographics In this subsection I describe how the method discussed above can be modified to accommodate targeted consumer demographics. Frequently firms have some subset of total consumers for which they wish to max- imize reach. For example, a women’s clothing company would likely prioritize ad views to female viewers over male viewers. Sometimes these firms wish to advertise exclusively to one demographic, but generally a firm instead wishes to target a consumer group without wholly disregarding the general population. Either way, the method can easily incorporate this. To begin, suppose that each individual belongs to one ofm possible demographic groups. For example, if a firm wished to target people based on household income and whether or not they had children, it could havem = 4 possible demographic groups (low household income with or without children, and high house- hold income with or without children). It will often be the case that the “actual” proportions of individuals with these demographics in the data, P 1,a ,...,P m,a , will differ from the targeted demographic makeup, P 1,d ,...,P m,d , of the firm. For instance, it may be that the fraction of individuals with low household income and with children in the dataZ isP LC,a = 0.3, while the focal firm’s target consumer base consists of a much greater percentage of such consumers, e.g.,P LC,d = 0.6. Within this context, the firm would like to upweight individuals with low household income and children in the data sample. This is easily accomplished with a simple adaptation to Equation (2.12): min β 1 n n X i=1 P i e −γ i + λ n X j β j subject to β j ≥ 0, j = 1,...,p. (2.17) whereP i = P D i ,d /P D i ,a andD i represents the demographic group that individual i falls into. SinceP D i ,a is computed from observed data and P D i ,d is based on the focal firm’s target customer base, P i provides a mathematical approximation to the expected value of the target demographic, and additionally P i is a fixed and known quantity. Therefore, optimizing Equation (2.17) is accomplished in exactly the same fashion as for Equation (2.12). Extension 2: Mandatory Media Coverage to Matched Content Websites Aside from targeted consumer demographics, a firm might wish to impose mandatory media coverage to certain subsets of websites. For example, when planning the online advertising campaign for its annual 28 “wave season,” Norwegian Cruise Lines may want to allocate a certain minimum budget to advertising on aggregate travel sites such as Orbitz or Expedia in addition to other websites, since those websites cater to users looking to travel. In this subsection I discuss how the OLIMS method can be modified to accommodate such requirements. Specifically, I can modify Equation (2.12) to requireβ j to be above a certain threshold, sayβ j ≥ min j , to ensure that a minimum budget is allocated to each aggregate travel websitej. 
Using the same approach as for optimizing Equation (2.12) I can show that the new optimization is accomplished by setting the “otherwise” condition in Equation (2.16) to a minimum non-zero amount. Specifically, replace Equation (2.16) with the following: β j = ˜ β j + P n i=1 e −˜ γ iθ ij −λ P n i=1 e −˜ γ iθ 2 ij forH j −λ> min j min j otherwise. (2.18) In Equation (2.16), the update step came from solving the first order condition set equal to zero, since the algorithm needed to address any optimization that crossed the threshold β j = 0 for a particular coordinate j. If it did cross the threshold, i.e. fall below 0, the optimization procedure needed to set it to 0 to ensure β≥ 0, since an advertising agency cannot spend a negative amount of money at a website. This extension works in much the same way. For the websites identified as part of the mandatory media coverage set, the algorithm cannot allocate less than a new minimum amount, min j , rather than 0. Extension 3: Target Frequency of Ad Exposure Another practical consideration in an online advertising campaign is the target frequency of ad exposures (e.g., Krugman 1972, Naples 1979, Danaher et al. 2010). For example, sales conversions and profits from online display ads might be highest when the consumer is served an ad within a certain range of frequencies (e.g., one to three times) during the duration of the ad campaign. The OLIMS method can also be readily modified to accommodate such considerations. Within the optimization context, this corresponds toP(k a ≤ Y i ≤ k b |z i ,β) wherek a < k b respectively represent lower and upper bounds on ad exposures. Given prior experience, the firm might determine the lower bound (i.e.,k a ) and the upper bound (i.e., k b ) for the target range of ad exposures. This is known as effective frequency or frequency capping (the latter typically sets the lower bound at 1 and imposes an upper bound on the number of ad exposures). 29 Equation (2.11) can be modified as follows to accommodate such considerations: max β 1 n n X i=1 k b X y=ka P(Y i =y|z i ,β) subject to X j β j ≤B, and β j ≥ 0, (2.19) where as before P(Y i = y|z i ,β) = e −γ iγ y i y! . Using the example of 1≤ Y i ≤ 3, the problem involves maximizing 1 n n X i=1 e −γ i (γ i + 1 2 γ 2 i + 1 6 γ 3 i ). (2.20) Again, after taking a second-order Taylor expansion, the resulting equations have a similar form to Equation (2.14) and Equation (2.16). 2.3. Numerical Studies with comScore Data To demonstrate the method of Sections 2.1 and 2.2 in practice, in this section I present some empirical results with both simulated and real data. In Section 2.3.1, I compare the OLIMS method with the method by Danaher et al. (2010) to show the method’s performance in a small-scale setting for which proven methodology already exists. In Section 2.3.2, I demonstrate how the method can be used for optimal budget allocation when the number of websites under consideration is very large (e.g., 5000 websites), which is computationally prohibitive for extant methods. In Sections 2.3.3 and 2.3.4, I extend the demonstration to two case studies using the comScore data described in Section 1.4. The two case studies, McDonald’s McRib and Norwegian Cruise Lines’ Wave Season online advertising campaigns, demonstrate both the OLIMS method and its extensions from Section 2.2.3. As noted in Section 1.4, the empirical illustrations here are based on the 2011 comScore Media Metrix data, which comes from the Wharton Research Data Service (www.wrds.upenn.edu). 
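Before turning to the data, it may help to note that the three extensions above reduce to a few lines of code once the matrix of θ_ij values is available. The sketch below shows the demographic weights P_i of Equation (2.17), a simple post-hoc way to impose minimum budgets in the spirit of Equation (2.18) (the method itself builds the floor directly into the coordinate update), and the effective-reach objective of Equation (2.20) for the one-to-three-exposure window; all helper names are mine and the snippets are illustrative rather than the implementation used in the case studies.

import numpy as np

def demographic_weights(group, desired_props):
    # P_i = P_{D_i,d} / P_{D_i,a}: ratio of the firm's desired proportion to the
    # observed proportion for user i's demographic group.  `group` is a length-n
    # array of group labels; `desired_props` maps each label to its target share.
    labels, counts = np.unique(group, return_counts=True)
    actual = dict(zip(labels, counts / len(group)))
    return np.array([desired_props[g] / actual[g] for g in group])

def weighted_objective(Theta, beta, P):
    # Equation (2.17) without the penalty term: (1/n) sum_i P_i exp(-gamma_i).
    return np.mean(P * np.exp(-(Theta @ beta)))

def apply_budget_floors(beta, floors):
    # Crude stand-in for Equation (2.18): never allocate less than the per-site
    # floor (floors is a length-p vector, zero for unconstrained sites).
    return np.maximum(beta, floors)

def effective_reach_one_to_three(Theta, beta):
    # Equation (2.20): average of P(1 <= Y_i <= 3) = exp(-g) * (g + g^2/2 + g^3/6).
    g = Theta @ beta
    return np.mean(np.exp(-g) * (g + g ** 2 / 2.0 + g ** 3 / 6.0))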
comScore uses propri- etary software to record daily webpage usage information from a panel of 100,000 Internet users (recorded anonymously by individual computer). Therefore, the comScore data can be used to construct a matrix of all websites visited and the number of times each computer visited each website during a particular time period. A number of prior studies in marketing have utilized comScore Media Metrix data (e.g., Danaher 2007, Liaukonyte et al. 2015, Montgomery et al. 2004, Park and Fader 2004). 3 3 I followed Danaher et al. (2010) to calculate the effective Internet population size for the data (denoted asN in Section 2.2). I first consider the size of the U.S. population at the time of the data set, which is 310.5 million (Schlesinger 2010). I then multiply 30 2.3.1 Comparison between OLIMS and Danaher et al. (2010) Comparison using Data Simulated from Danaher Sarmanov Function To date, the state-of-the-art method for optimal budget allocation of Internet display ads is by Danaher et al. (2010). A basic premise of this method is that the number of webpages viewed by individuals at visits to websites (denoted as an byp matrixZ in the context of Section 2.2) can be characterized by a multivariate negative binomial distribution (referred to as MNBD hereafter). Within this setup, Danaher et al. (2010) proposes an optimization method to maximize reach for each given budget. To get an idea of Danaher et al.’s optimization, consider their Sarmanov-based reach criterion (for only three websites, since the function is increasingly complex for each additional website): Reach =1− α 1 α 1 +s 1 r 1 α 2 α 2 +s 2 r 2 α 3 α 3 +s 3 r 3 × 1+ω 1 1− α 1 s 1 (1−e −1 )+α 1 r 1 1− α 2 s 2 (1−e −1 )+α 2 r 2 +ω 2 1− α 1 s 1 (1−e −1 )+α 1 r 1 1− α 3 s 3 (1−e −1 )+α 3 r 3 +ω 3 1− α 2 s 2 (1−e −1 )+α 2 r 2 1− α 3 s 3 (1−e −1 )+α 3 r 3 +ω 4 1− α 1 s 1 (1−e −1 )+α 1 r 1 1− α 2 s 2 (1−e −1 )+α 2 r 2 1− α 3 s 3 (1−e −1 )+α 3 r 3 Here α j and r j , j = 1,2,3, are the usual parameters associated with a MNBD. More interestingly for the Internet media selection problem,E(Z j ) =r j /a j , whileω j,j ′ is a set of correlation parameters denoting the correlation coefficient in webpage viewership between websites j and j ′ . For Danaher’s approach, the optimization is done over s j , which is equivalent to the OLIMS notation of 1000β j τ j c j . However, since τ j and c j are constants, there is a one-to-one correspondence between the Danaher optimization and the OLIMS optimization. Below I provide some ideas behind the method used to simulate data according to this MNBD distribution, but Appendix B.2 provides further details. it by the proportion of users who actually visited at least one website in the data set (for example, 48.63% in the comScore January 2011 data). I then define N as 155.25 million (48.63%*310.5 million). It is worth noting that, because the specific value of N simply serves as a baseline effective Internet population estimate in the reach calculation, the relative performance of various methods remain qualitatively intact ifN is defined as a smaller/greater proportion of the U.S. population. 31 To examine how the OLIMS method performs under the basic premise of Danaher et al.’s approach, I first simulate aZ matrix from an MNBD distribution with a set of known parameters. (Note that the OLIMS Z matrix is equivalent to Danaher et al.’sX matrix.) Based on the simulated Z matrix, the true optimal reach is known under each budget. 
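Appendix B.2 gives the exact data-generation procedure. Purely as a rough illustration of how correlated count data with negative binomial marginals can be produced, the sketch below uses a Gaussian copula; this is a stand-in, not the Sarmanov/MNBD construction actually used, and it assumes the shape-rate convention E(Z_j) = r_j / α_j quoted above.

import numpy as np
from scipy.stats import norm, nbinom

def simulate_correlated_pageviews(n_users, r, alpha, corr, seed=0):
    # Draw page-view counts whose marginals are negative binomial with shape r_j and
    # rate alpha_j (so E[Z_j] = r_j / alpha_j) and whose dependence comes from a
    # Gaussian copula with (positive definite) correlation matrix `corr`.
    rng = np.random.default_rng(seed)
    r, alpha = np.asarray(r, dtype=float), np.asarray(alpha, dtype=float)
    L = np.linalg.cholesky(corr)
    z = rng.standard_normal((n_users, len(r))) @ L.T     # correlated standard normals
    u = norm.cdf(z)                                      # uniform marginals
    p = alpha / (1.0 + alpha)                            # scipy's (n, p) parameterization
    return np.column_stack([nbinom.ppf(u[:, j], r[j], p[j]) for j in range(len(r))])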
Next, I apply both methods on the simulated Z matrix and compare the discrepancies between the true optimal reach and the reach obtained based on the budget allocations suggested by the two methods. Because theZ matrix in this case originates from the MNBD distribution (which is the basic premise of Danaher et al.’s method), Danaher et al.’s (2010) method would be expected to perform better than OLIMS under such comparisons. Nevertheless, I aim to evaluate the extent to which OLIMS could achieve a reach that is similar to the true optimal or the reach obtained under Danaher et al.’s (2010) method. Because Danaher et al.’s (2010) method is only computationally efficient for budget allocation across a relatively small number of websites, I demonstrate such comparisons for the case of seven websites below. I first generate the Internet usage matrix, Z, with 5000 rows (users) and 7 columns (websites), based on a MNBD with α j and r j , j = 1,...,7, the usual parameters associated with a MNBD, and ω j,j ′, a set of correlation parameters denoting the correlation coefficient in viewership between websites j and j ′ . To make the simulation as realistic as possible, I establish α j , r j , and ω j,j ′ as the values from the seven most visited websites from the December 2011 comScore data. I also use the CPMs provided by comScore’s 2010 Media Metrix (Lipsman 2010) in this stimulation. See Appendix B.2 for more details on the data generation method. I then employ the following procedure to compare the two methods. First, obtain the true optimal reach under each budget based on the true α j , r j , and ω j,j ′ parameters and the optimal criterion in Danaher et al.’s (2010) method. Next, apply both the OLIMS and Danaher et al.’s (2010) methods on the simulatedZ matrix to obtain the corresponding reach estimates. Note that Danaher’s methodology optimizes over share of impressions, s j , instead of monetary spending, β j , and uses estimates of α,r, and ω. Nevertheless, s j can be readily converted toβ j using the formulas j = 1000β j τ j c j as given in Section 2.2. 4 Given that the OLIMS and Danaher et al.’s (2010) methods each have their own definitions of the reach function, I report the reach comparisons using a more neutral reach function. Both methods share a common 4 Since Danaher et al.s reach function is highly nonconvex, it can find local optima during optimization. Consequently, I run their optimization with several initialization points and choose the results with the highest reach in the result comparisons. Since a firm cannot buy more than 100% of ad impressions at a website (i.e., 0 ≤ sj ≤ 1), I also force the OLIMS algorithm to stop allocating budget during the optimization to a website onceβj = τ j c j 1000 is reached (corresponding tosj = 1). 32 metric, share of impressions (s j ) at each website, and the data matrices used to optimize the two differing reach functions contain a record of visits. For each user’s visit to a website, then, the probability of seeing the ad iss j , and the user has the same probability of viewing the ad at each visit. Thus the reach naturally fits well into the setup for a binomial problem, and a binomial reach function (1− 1 n n P i=1 Q j (1− s j ) z ij ) should work well for comparison purposes. (Note the Poisson is the limiting distribution of the binomial, so OLIMS may be expected to benefit slightly from that. 
However, in these comparisons, the data is the more significant factor, as these results are purely to demonstrate the two methods perform similarly.) Figure 2.1 shows the reach curves for the average reach estimate at each budget across the 100 simulation runs using the binomial definition of reach. Note that the true optimal reach is in solid black, the Danaher estimate is in dashed red, and the OLIMS estimate is in dotted blue. As Figure 2.1 demonstrates, both methods yield reach very close to the true optimal reach. As expected, Danaher’s method performs slightly better under this comparison, because the data is generated from the MNBD assumed by Danaher’s method. However, at all budget levels, the methods perform remarkably similarly even though the data is generated to correspond to that assumed in Danaher et al. (2010). Overall, the comparisons above demonstrate that, even when the Internet usage matrixZ is simulated from the MNBD as assumed in Danaher’s method, OLIMS performs well. Comparing the computation speed of the two methods, I find the computation speed of the OLIMS method is over ten times faster under this setting. Given the highly non-convex nature of the optimization criterion in Danaher et al. (2010), the discrepancies in computation speed would be expected to increase exponentially for larger-scale problems. Comparison using comScore Data In the preceding comparison, the data was generated in order to theoretically favor Danaher et al.’s (2010) method. In this subsection, I compare the two methods using the December 2011 comScore Media Metrix data. Specifically, I use Internet usage data from the top seven most visited websites that support Internet display advertisements. The full month of data contained 51,093 users who visited one of the seven websites at least once in December 2011. I fit both the OLIMS and Danaher’s methods to 100 randomly chosen subsets of these users, each of size 5,109 (approximately ten percent of the population). Again, I use the CPMs as given in comScore Inc.’s Media Metrix data from May 2010 (Lipsman 2010). 33 0.0 0.5 1.0 1.5 2.0 0.0 0.2 0.4 0.6 0.8 1.0 Budget [in millions] Reach Figure 2.1: Performance comparison between the OLIMS method and Danaher et al.’s using simulated data, where the true optimal reach is in solid black, the Danaher estimate is in dashed red, and the OLIMS estimate is in dotted blue. Figure 2.2 shows the reach curves for the average reach (calculated using the binomial reach definition given in the preceding section) at each budget across the 100 sample runs. Within this context, I define the true optimal reach (black solid) as that obtained from the OLIMS method applied to the entire data set of 51,093 users. Danahers (red dashed) and the OLIMS method’s (blue dotted) estimates are both computed from the 10% subsets of the data. This also approximates real-world conditions in which a company has access to only part of the total browsing history of all Internet users. All reach curves in Figure 2.2 are then calculated on the ninety percent holdout data to ensure fair comparisons across methods. 34 0.0 0.5 1.0 1.5 2.0 0.0 0.2 0.4 0.6 0.8 1.0 Budget [in millions] Reach Figure 2.2: Performance comparison between the OLIMS method and Danaher et al.’s using real data, where the true optimal reach is in solid black, the Danaher estimate is in dashed red, and the OLIMS estimate is in dotted blue. Again, as seen in Figure 2.1, the two methods yield very similar reach results. 
Similar to the comparison in Figure 2.1 when the data is generated to favor Danaher et al., the OLIMS reach estimate slightly outper- forms the Danaher reach when using the more neutral binomial reach calculation. Presumably this occurs because the real data does not follow precisely an MNBD distribution. As both the simulated and real data comparisons here illustrate, the OLIMS and Danaher’s methods perform similarly when applied to problem settings scalable to the latter. However, due to its non-convex optimization criterion, the Danaher approach is considerably slower to compute and as a result encounters significant computational difficulties in settings involving a large number of websites. In the next section, I 35 demonstrate that, while computationally prohibitive for extant methods, the OLIMS method can be used to optimally allocate advertising budget across a very large number of websites. 2.3.2 Simulated Large-Scale Problem: 5000 Websites In practice, most Internet media selection problems involve far more than a handful of websites. In this subsection I illustrate how OLIMS can optimize over thousands of websites. To demonstrate this, I simulate an Internet usage matrix of 50,000 people over 5000 websites. 5 The visits to each website are randomly generated from a standard normal distribution (after both rounding and taking the absolute value, since website views are positive integers) which are then multiplied by a random integer from zero to ten with higher weight on a value of zero. This ensures that the simulated data set has similar characteristics to the observed comScore data, since a high percentage of matrix entries in the real data appear as zeros. The CPMs of these websites are randomly generated, chosen from 0.25 to 8.00 in increments of 0.25. Similarly as before, I run the OLIMS method over a 10% subset of the 50,000 users, then calculate reach on the 90% holdout data in the result comparisons. Because it is computationally prohibitive for Danaher et al.’s method to optimize over 5000 websites, I compare OLIMS to the following benchmark approaches: 1) equal allocation over all 5000 websites; and 2) cost-adjusted equal allocation (i.e. average number of visits/CPM) over the most visited 25, 50, and 100 websites. These alternative approaches should mimic approaches often used in practice when the sheer number of websites is infeasible to examine individually, such as those outlined in Cho and Cheon (2004). Figure 2.3 shows the resulting reach comparisons, where the reach on the full data is given by the solid black line, the reach using OLIMS on the subset is in the heavy dashed blue line, the reach on the top 100 cost-adjusted websites is in the dashed purple line, the reach on the top 50 cost-adjusted websites is in the dotted brown line, the reach on the top 25 cost-adjusted websites is in the dotted red line, and the reach with an equal allocation to all 5000 websites is in the dotted-dashed green line. As Figure 2.3 shows, even with only a ten percent subset of the data, OLIMS yields reach estimates very similar to the optimal reach estimate based on the entire dataset. In addition, OLIMS outperforms all the benchmark approaches. The comparisons show the equal allocation approach is by far the worst. The 5 I choose to simulate this dataset because data cleaning in comScore for 5000 websites is highly time consuming. For simplicity, the data is generated independently without correlations. 
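The benchmark allocations and the holdout reach metric used in these comparisons are straightforward to compute. The sketch below spells them out; the cost-adjusted rule shown is one plausible reading of "average number of visits/CPM" (the exact benchmark construction may differ in detail), the arguments are assumed to be NumPy arrays, and the function names are mine.

import numpy as np

def budget_to_share(beta, tau, cpm):
    # s_j = 1000 * beta_j / (tau_j * c_j), capped at 1 since a firm cannot buy
    # more than 100% of a site's impressions.
    return np.minimum(1000.0 * beta / (tau * cpm), 1.0)

def binomial_holdout_reach(Z_holdout, s):
    # 1 - (1/n) * sum_i prod_j (1 - s_j)^{z_ij}; assumes 0 <= s_j < 1.
    s = np.asarray(s, dtype=float)
    return 1.0 - np.mean(np.exp(Z_holdout @ np.log1p(-s)))

def equal_allocation(budget, p):
    # Benchmark 1: spread the budget evenly over all p websites.
    return np.full(p, budget / p)

def cost_adjusted_top_k(budget, avg_visits, cpm, k):
    # Benchmark 2: allocate over the k most-visited sites, in proportion to
    # average visits / CPM, with zero budget elsewhere.
    top = np.argsort(avg_visits)[::-1][:k]
    w = avg_visits[top] / cpm[top]
    beta = np.zeros(len(avg_visits))
    beta[top] = budget * w / w.sum()
    return beta

Evaluating any allocation then amounts to converting dollars to shares with budget_to_share and calling binomial_holdout_reach on the 90% holdout matrix.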
Since OLIMS is designed to leverage correlations across sites, this setup provides a lower bound with respect to advantages from the OLIMS approach. 36 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.0 0.2 0.4 0.6 0.8 1.0 Budget (in millions) Reach Method Optimal Proposed Top 100 Top 50 Top 25 Equal Figure 2.3: Reach results for the 5000 websites simulated with no viewership correlations, where the optimal reach (solid black) is calculated using the full data set, the OLIMS method reach (heavy dashed blue) is calculated using the 10% subset, and the naive 100 cost-adjusted reach (dashed purple), 50 cost-adjusted reach (dotted brown), 25 cost-adjusted reach (dotted red), and equal 5000-website reach (dotted-dashed green) are calculated using the given budget allocation. cost-adjusted approaches perform better, but still worse than OLIMS. Overall, the results show OLIMS can be used to effectively allocate advertising budget across a very large set of websites, especially considering the data here is specifically simulated without correlations. Any advantage OLIMS achieves by leveraging relationships among websites is zeroed out here, and yet OLIMS still clearly demonstrates an advantage. 37 2.3.3 Case Study 1: McDonald’s McRib Sandwich Online Advertising Campaign Sections 2.3.1 and 2.3.2 demonstrate the results of the OLIMS method in two extremes: a small-scale setting with few websites, and a large-scale setting with thousands of websites. In this section and Section 2.3.4, I demonstrate how OLIMS can be applied in medium-scale, real-world settings. In this first case study, I consider a yearly promotion for McDonald’s McRib Sandwich, which is only available for a limited time each year (approximately one month). Because McRib is often offered in or around December (Morrison 2012), I consider the comScore data from December 2011 to approximate a McRib advertising campaign. In particular, as noted in Section 1.4, I manually went through the comScore data set to identify the 500 most visited websites that also supported Internet display ads. The data then contains a record of every computer that visited at least one of these 500 websites at least once (56,666 users). ThusZ is a 56,666 by 500 matrix. I then separate the full data set into a ten percent training data set (5667 users) and a ninety percent holdout data set. Similarly as before, I use the training data to fit the method, then calculate reach on the holdout data. Table 2.1 provides the categorical makeup of the 500 websites considered in this application. As detailed in Section 1.4, the data includes sixteen categories of websites: Social Networking, Portals, Entertainment, E-mail, Community, General News, Sports, Newspapers, Online Gaming, Photos, Filesharing, Information, Online Shopping, Retail, Service, and Travel. The Total Number column provides the total number of websites in each category. For simplicity, the CPM values for each website are based on average costs of the website categories provided by comScore Inc.’s Media Metrix data from May 2010 (Lipsman 2010). 6 Table 2.1 shows that Entertainment and Gaming are by far the largest categories (with 92 and 77 websites out of 500, respectively), while Sports, Newspaper, and General News are the most expensive at which to advertise (all over $6.00). Additionally, it appears in Table 2.1 that advertising costs vary considerably across these website categories. 
See Appendix A for Table A.1, which provides an overview of viewership correlations within and across each of the sixteen website categories. Table 2.1 also shows the number of websites chosen in each of the sixteen website categories over three different methods, discussed below: 1) the original approach that maximizes overall reach, 2) the extension to maximize reach among targeted consumer demographics, and 3) the extension to maximize effective 6 In practice firms could readily apply actual CPMs of all sites in such an optimization. 38 OLIMS Benchmark Budget = $500K Budget = $2 million Total Targeted Targeted Targeted Targeted Top Top Category Number CPM Original Consumers Exposures Original Consumers Exposures 25 50 Community 23 2.10 8 8 11 14 14 20 1 4 E-mail 7 0.94 7 7 7 7 7 7 3 5 Entertainment 92 4.75 2 1 10 13 10 29 0 0 Fileshare 28 1.08 23 20 26 24 22 28 2 7 Gaming 77 2.68 30 40 44 37 45 59 0 1 General News 12 6.14 0 0 0 0 0 0 0 0 Information 47 2.52 24 25 29 27 27 36 1 3 Newspaper 27 6.99 0 0 0 0 0 0 0 0 Online Shop 29 2.52 11 12 15 15 15 26 1 1 Photos 9 1.08 6 6 9 8 9 9 0 2 Portal 30 2.60 13 14 17 16 16 26 5 7 Retail 57 2.52 33 39 39 36 41 49 2 7 Service 18 2.52 13 14 10 14 14 12 2 2 Social Network 17 0.56 16 17 17 17 17 17 8 11 Sports 17 6.29 0 0 1 1 0 1 0 0 Travel 10 2.52 6 7 8 8 8 8 0 0 Table 2.1: The number of websites chosen in each of the 16 website categories for the McRib case study in the three proposed OLIMS optimizations: the original (unconstrained), the demographic-weighted exten- sion, and the targeted ad exposures extension. Also provided for comparison are the total number of websites in each category and the category’s average cost (in CPM), as well as the number of websites in each cate- gory present in the top 25 and top 50 cost-adjusted naive comparison methods. reach with target frequency of ad exposures. This table also provides the number of websites chosen in each category when using only the top 25 and top 50 most visited sites as benchmarks to the OLIMS approach. More details about the result comparisons are provided below. McRib Campaign: Maximizing Overall Reach In this subsection, to provide a baseline, I assume McDonald’s goal is simply to reach as many users as possible during its McRib campaign. Again, because Danaher et al’s (2010) method cannot optimize over 500 websites, I use the following benchmark methods in the model comparisons: equal allocation over all 500 websites, and cost-adjusted equal allocation across the top 10, 25, and 50 most visited websites. 7 Table 2.1 reports the categorical makeup of chosen sites under two budgets ($500K and $2 million). This categorical makeup shows how many websites in each category were chosen with non-zero budget allocation in the solutions of the optimization. It is not surprising that the optimization does not select many websites in relatively expensive categories such as Sports, Newspaper, and General News. Advertising 7 Note that, while included in Figure 2.4, the 10-website benchmark method is omitted from Table 2.1 for space considerations. 39 at a relatively expensive website is only desirable when that website can reach an otherwise unreachable audience. In this case, other websites offer reach without the high price. Social Networking, for example, offers a relatively inexpensive way to reach consumers who are visiting other websites as well. 
Note that in Table A.1 in Appendix A, the Social Networking sites have relatively high correlations in viewership across other site categories with the only exception being Email and Gaming sites. Consequently, the optimization ultimately includes all 17 Social Networking websites and leaves out the expensive categories where reach would be duplicated. It is also worth noting that the OLIMS optimization selects all websites in the Email category. In addition to the relatively lower cost of advertising on these websites, there is a very low within-category correlation in viewership among email sites (0.01 absolute average correlation; see Appendix A). This indicates the same consumer often does not visit more than one email site, so including an additional email website in the optimization can result in a larger increase in reach. Figure 2.4 shows the results from the OLIMS method with the comparison methods. The optimal reach (solid black line) is again calculated using the full December 2011 cleaned comScore data set, while the OLIMS method for comparison (blue dashed line) is calculated using the 10% subset. The naive methods (budget allocation using the top 50 cost-adjusted websites in small dotted red, the top 25 cost-adjusted websites in heavy dotted brown, the top 10 cost-adjusted websites in small dashed purple, and the equal allocation in dotted-dashed green) all calculate reach based on allocating the budget accordingly with no optimization. Figure 2.4 demonstrates that OLIMS again performs well with ten percent calibration data. The reach estimates based on the ten percent calibration data are very close to those from the true optimal based on the entire data. Additionally, the reach estimates from the naive approaches are significantly below both, and the OLIMS method takes less than a second on average to fit per budget using a personal laptop computer. McRib Campaign with Targeted Consumer Demographics While the previous subsection provides an excellent starting point for the McRib case study analysis, in practice companies often have more challenging campaign considerations in mind. One such major con- sideration is the ability to target specific demographics when running online display ads. In this section I demonstrate that OLIMS can be readily modified to accommodate such needs. For illustration purposes, 40 0.0 0.5 1.0 1.5 2.0 2.5 0.0 0.2 0.4 0.6 0.8 Budget (in millions) Reach Figure 2.4: Reach results for the McDonald’s McRib Campaign in which the optimal reach (solid black line) is calculated using the full data set, the OLIMS method (heavy dashed blue) is calculated on the 90% holdout data using the 10% training data subset, and the naive methods (top 50 cost-adjusted websites in small dotted red, top 25 cost-adjusted websites in heavy dotted brown, top 10 cost-adjusted websites in small dashed purple, and equal allocation in dotted-dashed green) are all calculated with a non-optimized budget allocation. I consider two demographic variables frequently desired for targeting by McDonald’s and companies like it: income level and the presence of children in the household (a binary variable in the comScore data set indicating whether or not a particular tracked computer is placed in a household with at least one occupant under the age of 18). I choose these two demographic variables because McDonald’s has historically targeted families with children (Mintel 2014). In addition, fast food in general tends to target lower-income households (Drewnowski and Darmon 2005). 
Because of this, I illustrate the OLIMS approach in a scenario where the McRib campaign wishes to reweight the comScore data set with greater emphasis on individuals from lower-income households with children.

Table 2.2: Actual and desired proportions for the McRib demographic-weighted advertising campaign, where the actual proportions are the proportions of users in the December 2011 cleaned comScore data who met the given demographic criteria, and the desired proportions are realistic example target proportions for the McRib advertising campaign.

                            Actual    Desired
No Children                  0.344     0.25
Children                     0.656     0.75
Income below 15,000          0.135     0.25
Income 15,000-24,999         0.074     0.20
Income 25,000-34,999         0.100     0.20
Income 35,000-49,999         0.150     0.15
Income 50,000-74,999         0.260     0.10
Income 75,000-99,999         0.140     0.05
Income above 100,000         0.141     0.05

Following the procedure outlined in Section 2.2.3, I reweight the comScore data with the target population makeup in each variable category as shown in Table 2.2. For example, for "children present," since I want to give individuals with children greater weight than those without children, I assign a weight of 0.75 to having children and 0.25 to not having children. I use a similar weighting for income level. I choose these desired weights arbitrarily to demonstrate the method, but in practice companies would presumably have data on target proportions before running the campaign.

Table 2.1 shows the number of websites chosen in the reweighted setup compared to the standard setup. In this example, reweighting the data does not drastically change the types of websites chosen during the optimization: families with children and lower-income households did not represent a significant deviation from the overall data set in terms of their Internet browsing behavior. However, Table 2.1 does show some slight changes. For example, the number of Gaming websites increases when reweighting the data. Most of the gaming websites in the data set are online flash-based game websites which primarily target young players (360i 2008). Hence, it is likely that proportionally more McDonald's consumers frequently visit such sites.

McRib Campaign with Target Frequency of Ad Exposure

In this subsection I demonstrate another common campaign consideration, targeting a specific frequency of ad exposures as discussed in Section 2.2.3. In this case, McDonald's wishes to allocate its ad budget such that each individual is exposed to the ad no more than three times during the course of the McRib campaign (so between one and three ad exposures). For simplicity, I use the original December 2011 cleaned comScore data set without demographic reweighting, although both approaches could readily be used together. In this case, the "effective reach" is the value of the function $e^{-\gamma}\left(\gamma + \tfrac{1}{2}\gamma^2 + \tfrac{1}{6}\gamma^3\right)$, which under the Poisson exposure model is the probability that a user with expected exposure $\gamma$ sees the ad between one and three times.

Again, Table 2.1 shows the optimization's allocation across website categories for this extension. In general, under this extension, OLIMS chooses more websites, with a correspondingly lower average budget at each one. This allows more viewers to be reached with the ad but limits the probability that an ad will appear to a particular viewer more than three times. One example of this is the increase in the number of Gaming websites chosen by the algorithm. Gaming websites have many repeat visitors, but low correlation among visitation to websites within the Gaming category (as shown in Table A.1 in Appendix A).
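The effective-reach expression above is simple to evaluate directly. As a small illustration (the exposure levels below are hypothetical and not taken from the comScore data; this is a sketch, not the OLIMS implementation), the following computes the probability of one to three exposures under a Poisson model:

import math

def effective_reach(gamma):
    # P(1 <= X <= 3) for X ~ Poisson(gamma): the chance a user sees the ad
    # at least once but no more than three times.
    return math.exp(-gamma) * (gamma + gamma**2 / 2 + gamma**3 / 6)

# Hypothetical expected exposure levels for a few users.
for gamma in [0.5, 1.0, 2.0, 4.0]:
    print(f"gamma = {gamma:.1f}: effective reach contribution = {effective_reach(gamma):.3f}")

Note that the contribution rises and then falls as gamma grows, which is exactly why the optimization spreads budget thinly across many sites under this extension.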
The algorithm chooses to advertise a small amount at a number of Gaming sites, which gives consumers a low probability of seeing the ad on any particular visit but ultimately reaches different consumers with each ad appearance. Overall, the algorithm less often includes websites with high repeat visitation, which helps ensure that a consumer does not see the ad more times than desired. Another example of this is that the algorithm chooses more Entertainment websites. Although the Entertainment category is more expensive than others, the Z matrix shows low repeat visitation for Entertainment websites. These websites seem to be more universally visited, so advertising on an Entertainment website results in more distinct people seeing an ad.

2.3.4 Case Study 2: Norwegian Cruise Lines Wave Season Online Advertising Campaign with Mandatory Media Coverage to Travel Aggregate Sites

For the second case study, I consider a very different industry, the cruise industry. Each year, the cruise industry advertises for its annual "wave season", which begins in January. Norwegian Cruise Lines (NCL) is among the cruise lines that participate heavily in wave season (Satchell 2011). Because consumers who are interested in booking a cruise often use travel aggregation sites like Orbitz and Priceline to compare offerings across multiple cruise lines, I use this case study to demonstrate the extension in which OLIMS is applied to such a scenario. Suppose that NCL wants to allocate at least a minimum amount of budget to a set of major aggregate travel websites. While this is a hypothetical example, it is realistic and can be readily applied to similar scenarios.

OLIMS handles such scenarios using the extension described in Section 2.2.3. Imagine NCL wants to allocate at least twenty percent of any given budget to eight major aggregate websites (CheapTickets.com, Expedia.com, Hotwire.com, Kayak.com, Orbitz.com, Priceline.com, Travelocity.com, and TripAdvisor.com). Thus the optimization is required to place at least 2.5 percent of the budget at each of these eight sites. (Note this is only one simple way to ensure at least 20% of the budget is allocated to these eight sites. Many other methods, including more sophisticated approaches, could be employed. One such sophisticated method is described in Chapter 3.)

I follow the same procedure as in the previous case study, described in Section 1.4, to obtain the 500 most visited websites in January 2011 that supported online display advertisements. These 500 websites are also divided into sixteen categories and assigned an average CPM based on their category. (For the categorical makeup of these 500 websites, please refer to Table 1.1 in Section 1.4.) 48,628 users visited at least one of these 500 websites during January 2011, meaning the Z matrix is 48,628 by 500. I again divide this data into a 10% subset (4,863 users) of calibration data and use the remaining 90% as holdout data.

Figure 2.5 demonstrates the reach curves under this extension. I refer to the optimization with mandatory media coverage of aggregate travel sites as the constrained optimization (in dashed blue), and the standard optimization approach as unconstrained (in solid black). I also include a naive method, allocating the entire budget evenly to the eight aggregate sites (in dotted green). The curves on the left show the calculation of reach using the entire data set, i.e. the full 90% holdout data.
As expected, the unconstrained curve performs slightly better than the constrained curve, since it is impossible to improve overall reach by constraining the optimization. In addition, the naive approach performs poorly: because the aggregate travel websites do not reach a majority of the users in the data set, allocating budget only to these eight websites naturally limits the ad's exposure across all Internet users.

The curves on the right show the reach for the subset of users who visited at least one of the eight aggregate travel websites in January 2011 (there are 6,431 such individuals in the data set). Presumably these consumers are more likely to be interested in searching for travel deals than the others. In this case, the constrained curve significantly outperforms the unconstrained curve. By constraining the optimization to allocate a percentage of the budget to each aggregate travel website, OLIMS reaches far more of the users who actually visit these sites, which is the group NCL would like to target. NCL can therefore meet its aggregate travel site requirements without sacrificing much overall reach, meaning that most users will still view the ad in general, but NCL can also be confident it has reached the subset of people most likely to book a cruise.

[Figure 2.5 here; two panels sharing the horizontal axis Budget (in millions): the left panel plots Reach based on Full Data, the right panel plots Reach based on Travel Users Subset.]
Figure 2.5: Reach results with mandatory media coverage for the eight aggregate travel websites in the Norwegian Cruise Line case study, with the constrained optimization in dashed blue, the unconstrained optimization in solid black, and the equal eight-website allocation in dotted green.

In the right panel of Figure 2.5, the naive approach of equal allocation across the eight travel aggregate sites performs slightly better than OLIMS when reach is calculated based on the subset of aggregate travel site users (i.e., constrained reach). But this result is expected: NCL is most likely to reach users on the aggregate sites by putting as much budget as possible into those eight sites. As seen in the overall reach curves on the left, however, that method will not capture users on other websites who might also be attracted to NCL's Wave Season campaign but did not visit one of the eight aggregate travel websites. Depending on whether the firm wants to reach a broader audience or a targeted audience, either the constrained or the unconstrained optimization could be employed in such online ad campaigns.

2.4. Discussion

Although most statistical researchers are familiar with loss functions and the benefits of using them in optimization procedures, many other fields have not yet taken full advantage of the gains provided by this statistical methodology. One such field is marketing, specifically in the area of Internet media selection. In the current advertising climate, firms need an online presence more than ever. Nevertheless, the ever-increasing number of websites presents not only endless opportunities but also tremendous challenges for firms' online display ad campaigns. Although online advertising is limited only by the sheer number of websites, optimal Internet media selection among thousands of websites is a prohibitively challenging task. While existing methods can only solve the Internet budget optimization for moderately-sized problems
(e.g., 10 websites), using a penalized loss function optimization method and algorithm allows for significant gains in both efficiency and scope. The OLIMS method proposed in this chapter allows firms to efficiently allocate budget across a large number of websites (e.g., 5,000). I demonstrate the applicability and scalability of the algorithm in real-world settings using the comScore data on Internet usage. I also illustrate how OLIMS extends easily to accommodate common practical Internet advertising considerations, including targeted consumer demographics, mandatory media coverage to matched content websites, and target frequency of ad exposures.

Furthermore, the low computational cost means that OLIMS can rapidly examine a range of possible budgets. As a result, firms can easily examine the correspondence between budget and reach, giving them the ability to spend only as much money as required to achieve a desired level of reach. Consequently, the OLIMS method provides firms with great flexibility and adaptability in their online display advertising campaigns. Accordingly, firms can retain control of their own Internet display ad campaigns, alleviating the need to turn to ad agencies or large advertising exchanges that give firms little to no oversight of their own campaigns. Also, while the research here focuses specifically on the metric of reach, firms could readily modify the approach and use Internet browsing-tracking data to maximize clickthrough and/or downstream purchases from their Internet display ad campaigns. Additionally, this research is presented from the perspective of an individual firm that wishes to maximize reach for its particular campaign. The method could be further extended for use by an advertising broker who wishes to maximize reach over a set of clients: advertising brokers must provide clients with the best possible campaigns but also use as much of their existing ad space inventory as possible. Thus an interesting extension of the method would be to maximize over multiple campaigns from the perspective of an advertising agency, though this is outside the scope of this dissertation.

However, while the advertising application considered here is an exciting area of real-world practical research, it illustrates a much more generally important facet of the methodology: it works well even for non-quadratic criteria. Much of the statistical literature focusing on penalized loss functions does so in the context of linear regression procedures. While these represent significant advances in the area of regression and demonstrate the advantages of penalized approaches, they are unfortunately limited to problems in which the data are linearly related. In this way, the development of a method capable of optimizing a general loss function represents a significant contribution to the existing statistical loss function literature. Further, like the LASSO method in linear regression, the use of the $\ell_1$ penalty in the generalized methodology allows for efficient optimization even in large-scale or high-dimensional problems by encouraging sparsity in the parameter estimates.

OLIMS does have some limitations, the most significant of which is the use of the second-order Taylor approximation to the objective function for optimization purposes. In cases where the loss function being approximated is highly complex and not well-behaved, further testing may need to be performed to be sure the approximation is reasonable.
In addition, since the approximation requires first and second derivatives of the function, the loss function must be twice differentiable.

Chapter 3
Constrained and Penalized Loss Functions

The penalized loss function procedure discussed in Chapter 2 has a number of advantages in its ability to efficiently optimize even nonlinear objective functions. However, penalized loss function methodologies are well-studied in the statistics literature, even if such methodology is traditionally developed on a problem-by-problem basis. A more complex problem involves situations where researchers not only want to penalize an objective function but also want to constrain the solution values, i.e. β. For example, in the mandatory media coverage extension presented in Section 2.2.3 and demonstrated in the Norwegian Cruise Lines case study in Section 2.3.4 for the Internet media selection application, a firm has identified a select set of websites over which it wants to allocate a set percentage of its budget. The firm wishes to constrain the optimization, that is, force a given amount of the total advertising campaign budget to be allocated to these websites. In Chapter 2, this is handled by evenly distributing this minimum budget as a lower threshold for each website in the subset. This is unlikely to be the best allocation, however. The firm would ideally like to have a variable allocation across this subset of websites, allowing an optimization procedure to potentially allocate (for example) the entire portion of the budget to one or two "best" websites in the subset. This is just one example of a problem in which constraints on the solution to the objective function prove both desirable and necessary.

In this chapter, I introduce a procedure for optimizing both penalized and constrained (PAC) loss functions. This extends the usefulness and applicability of the loss function procedure outlined in Chapter 2, as I also demonstrate on the Internet media selection application previously considered. For this, I again use the NCL case study, as it serves as a way to directly compare the two methods and show the increased utility of the penalized and constrained approach. The PAC method demonstrates a number of advantages, primarily in the scope of problems it can optimize. While again the Internet media selection application is an exemplary real-world problem for this demonstration, the PAC procedure can be readily applied to any number of related problems, as explored below.

3.1. Introduction to Penalized and Constrained Loss Functions

As seen in Chapter 2, one common class of high-dimensional statistical problems involves optimizing a general loss function subject to an $\ell_1$ penalty term, i.e.
\[ \arg\min_{\beta} \; L(\beta) + \lambda\|\beta\|_1, \tag{3.1} \]
where L is some well-behaved, convex loss function and β ∈ R^p is a vector of coefficients over which to optimize. Several optimization problems of this form have been extensively studied in the statistical literature. For example, when L(β) = ‖Y − Xβ‖^2, (3.1) reduces to the LASSO (Tibshirani 1996b). Alternatively, when L(β) is the negative log-likelihood of a generalized linear model (GLM), (3.1) implements a GLM extension of the LASSO. However, this generalized form is extremely versatile and can be applied in a variety of settings beyond fitting standard statistical models.
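To ground the generic criterion (3.1), the following minimal sketch minimizes a smooth convex loss plus an $\ell_1$ penalty by proximal gradient descent with soft-thresholding. It is purely illustrative and is not the PAC algorithm developed later in this chapter (which also handles constraints); the squared-error loss, step size, and all variable names below are assumptions for the example.

import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient_l1(grad, beta0, lam, step, n_iter=1000):
    # Minimize L(beta) + lam * ||beta||_1 for a smooth convex L, given grad(beta).
    beta = beta0.copy()
    for _ in range(n_iter):
        beta = soft_threshold(beta - step * grad(beta), step * lam)
    return beta

# Illustrative example with squared-error loss L(beta) = 0.5 * ||y - X beta||^2.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + rng.standard_normal(100)
grad = lambda b: X.T @ (X @ b - y)
step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant of the gradient
beta_hat = proximal_gradient_l1(grad, np.zeros(20), lam=5.0, step=step)
print(np.round(beta_hat, 2))             # sparse estimate: most entries are exactly zero

The same template applies to any twice-differentiable loss, which is what makes criterion (3.1) so broadly useful.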
In Chapter 2, I demonstrated the use of this generalized $\ell_1$ penalty procedure for an important problem in marketing: optimizing budget allocations across Internet website advertising opportunities. Advertising on websites entails a variety of challenges not present in advertising on traditional media such as newspapers or magazines, particularly the wide variations in cost and traffic across websites as well as the sheer number of websites available. To address these concerns without sacrificing optimization efficiency, a penalized loss function provides an appealing option.

Unfortunately, in practice firms often wish to optimize reach subject to a set of constraints on the allocation of the advertising budget. For example, imagine a firm is developing an advertising campaign to promote a new NCAA sports mobile app. The firm knows its target audience visits sports update websites, e.g. ESPN or Yahoo Sports. Because of this, the firm might wish to allocate, for example, 50% of its advertising budget to these sports websites, since it knows it will reach more likely consumers at these sites. In this case (3.1) cannot be directly applied. Instead, if the firm wishes to incorporate m different linear constraints on its advertising budget, the optimization problem of (3.1) becomes
\[ \arg\min_{\beta} \; L(\beta) + \lambda\|\beta\|_1 \quad \text{subject to} \quad C\beta \le b, \tag{3.2} \]
where L(β) is the same well-behaved, convex loss function, but now C ∈ R^{m×p} is a predefined constraint matrix and b ∈ R^m is the corresponding predefined constraint vector. I refer to this method as Penalized And Constrained regression (PAC). In the case of the Internet advertising example, PAC would be operationalized as follows: L(β) would be a function defining a common advertising metric (e.g. reach as considered in Chapter 2, or clickthrough rate, which is also considered later in this chapter), the p-dimensional β vector would be the amount of budget allocated to each of the p websites under consideration, and constraints would be introduced directly into the optimization via C and b.

In Section 3.4, I continue to use the Internet advertising example as the primary motivating example for the PAC methodology due to the direct comparisons to Chapter 2 and the appealing real-world setup and available comScore data. However, this problem setup is not unique to the marketing application. Many familiar procedures, including many already well-studied in the statistical literature, actually fall into this framework. To motivate this research beyond the Internet advertising example, I discuss a few such procedures in Sections 3.1.1 and 3.1.2 below.

3.1.1 Common Motivating Examples: Monotone Curve Estimation and Portfolio Optimization

Many familiar problems fall into the framework of Equation 3.2. First, imagine a common problem of interest involving estimating a curve which is constrained to be monotonically increasing or decreasing. Here researchers face the problem of fitting a smooth function, h(x), to a set of observations {(x_1, y_1), ..., (x_n, y_n)}, subject to the constraint that h must be monotone. There are many examples of such a setting. For instance, when estimating demand as a function of price, researchers need to ensure the curve is non-increasing. Alternatively, when estimating a cumulative distribution function (CDF), researchers need to ensure that the function is non-decreasing. Similarly, when performing curve registration to align a set of curves, the estimated warping functions must be non-decreasing.
If h is modeled as a parametric function, the monotonicity constraint is not hard to impose. However, one often wishes to produce a more flexible fit using a non-parametric approach. For example, one could model h(x) = B(x)^T β, where B(x) is some p-dimensional basis function. A natural approach to produce a very flexible estimate for h(x) would then be to choose a large value for p and smooth the estimate by solving (3.1) with L(β) = Σ_{i=1}^n (y_i − B(x_i)^T β)^2. However, minimizing (3.1) does not ensure a monotone curve. Hence, one instead solves (3.2), constraining the coefficients such that
\[ C\beta \le 0, \tag{3.3} \]
where the ℓth row of C is the derivative B′(u_ℓ) of the basis functions evaluated at u_ℓ, and u_1, ..., u_m form a fine grid of points over the range of x. Enforcing (3.3) ensures that the derivative of h is non-positive, so h will be monotone decreasing.

Aside from monotone curve estimation, another, more universally familiar optimization problem is portfolio management. In portfolio optimization, an investor (in a traditional stock portfolio setup) wishes to maximize the return on investment, generally subject to minimizing the overall risk of the portfolio. In this framework, suppose the investor has p assets indexed by 1, 2, ..., p whose covariance matrix is denoted by Σ. Markowitz (1952, 1959) developed the seminal framework for mean-variance analysis. In particular, his approach involved choosing asset weights, β, to minimize the portfolio risk
\[ \beta^T\Sigma\beta \tag{3.4} \]
subject to β^T 1 = 1. An investor may also choose to impose additional constraints on β to control the expected return of the portfolio, the allocations among sectors or industries, or the risk exposures to certain known risk factors.

In practice Σ is unobserved, so it must be estimated using the sample covariance matrix, Σ̂. However, it has been well documented in the finance literature that when p is large, which is the norm in real-world applications, minimizing (3.4) using the sample covariance matrix gives poor estimates for β. One possible solution involves regularizing Σ̂, but more recently attention has focused on directly penalizing or constraining the weights, an approach analogous to penalizing the coefficients in a regression setting. Fan et al. (2012) recently adopted this framework using a portfolio optimization problem with a gross-exposure parameter c, defined by
\[ \min_{\beta} \; \beta^T\hat\Sigma\beta \quad \text{subject to} \quad \|\beta\|_1 \le c, \;\; \beta^T\mathbf{1} = 1, \tag{3.5} \]
where c is a tuning parameter. Fan et al. (2012) reformulated (3.5) as a penalized linear regression problem and used LARS to estimate the weights. However, as they point out, their regression formulation only approximately solves (3.5). It is not hard to verify that (3.5) can be expressed in the form (3.2), where C has at least one row (to constrain β to sum to one) but may also have additional rows if constraints are placed on the expected return, industry weightings, etc. Hence, implementing PAC with L(β) = β^T Σ̂ β allows investors to exactly solve the gross-exposure portfolio optimization problem.
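As a concrete illustration of how the gross-exposure problem maps into the PAC form (3.2), the short sketch below assembles the ingredients for a small simulated market: the quadratic loss L(β) = β^T Σ̂ β, its gradient, and the single equality-constraint row forcing the weights to sum to one. The returns are randomly generated, no PAC fit is performed, and all names are illustrative.

import numpy as np

rng = np.random.default_rng(8)
T, p = 250, 15                                   # 250 days of returns on 15 assets
returns = rng.standard_normal((T, p)) * 0.01

Sigma_hat = np.cov(returns, rowvar=False)        # sample covariance matrix

# PAC ingredients for the gross-exposure problem (3.5):
loss = lambda beta: beta @ Sigma_hat @ beta      # L(beta) = beta' Sigma_hat beta
grad = lambda beta: 2.0 * Sigma_hat @ beta       # its gradient (used by the quadratic approximation)
C = np.ones((1, p))                              # one constraint row: the weights
b = np.array([1.0])                              #   must satisfy C beta = 1 (fully invested)
# The gross-exposure bound ||beta||_1 <= c corresponds to the l1 penalty's tuning parameter.

beta_equal = np.full(p, 1.0 / p)                 # equal-weight portfolio as a feasible point
print(loss(beta_equal), C @ beta_equal)          # portfolio risk and constraint value (1.0)

Additional rows could be appended to C and b for sector weightings or expected-return targets in exactly the same way.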
3.1.2 Larger Class of Problems: Generalized LASSO

Another, more statistically-oriented problem which fits into the paradigm of Equation 3.2 is the generalized LASSO. Tibshirani and Taylor (2011) introduce the generalized LASSO problem as
\[ \arg\min_{\Theta} \; \tfrac{1}{2}\big\|Y - \tilde{X}\Theta\big\|_2^2 + \lambda\|D\Theta\|_1, \tag{3.6} \]
where D ∈ R^{r×p}. When rank(D) = r, and thus r ≤ p, Tibshirani and Taylor (2011) show that the generalized LASSO can be converted to the classical LASSO problem as discussed in Chapter 2. However, if r > p then such a reformulation is not possible. Lemma 3.1.1 shows that when r > p and D has full column rank, there is an interesting connection between the generalized LASSO and PAC (though note the constraints here are equality constraints, a special case of the more general inequality constraints used in Equation 3.2).

Lemma 3.1.1 (Generalized LASSO is a Special Case of PAC). If r > p and rank(D) = p, then there exist matrices A, C, and X such that, for all values of λ, the solution to (3.6) is equal to Θ = Aβ, where β is given by
\[ \arg\min_{\beta} \; \tfrac{1}{2}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_1 \quad \text{subject to} \quad C\beta = 0. \]

Proof. Since D has full column rank, by reordering the rows if necessary, D can be written as
\[ D = \begin{bmatrix} D_1 \\ D_2 \end{bmatrix}, \]
where D_1 ∈ R^{p×p} is an invertible matrix and D_2 ∈ R^{(r−p)×p}. Then
\[ \tfrac{1}{2}\big\|Y - \tilde{X}\Theta\big\|_2^2 + \lambda\|D\Theta\|_1 = \tfrac{1}{2}\big\|Y - \tilde{X}D_1^{-1}D_1\Theta\big\|_2^2 + \lambda\|D_1\Theta\|_1 + \lambda\|D_2\Theta\|_1 = \tfrac{1}{2}\big\|Y - (\tilde{X}D_1^{-1})D_1\Theta\big\|_2^2 + \lambda\|D_1\Theta\|_1 + \lambda\big\|D_2D_1^{-1}D_1\Theta\big\|_1. \]
Using the change of variables
\[ \beta_1 = D_1\Theta, \qquad \beta_2 = D_2D_1^{-1}D_1\Theta = D_2D_1^{-1}\beta_1, \qquad \beta = \begin{bmatrix}\beta_1\\ \beta_2\end{bmatrix}, \]
the generalized LASSO problem can be rewritten as
\[ \min_{\Theta\in\mathbb{R}^p} \tfrac{1}{2}\big\|Y-\tilde{X}\Theta\big\|_2^2 + \lambda\|D\Theta\|_1 = \min_{\beta\in\mathbb{R}^r}\Big\{\tfrac{1}{2}\big\|Y-\tilde{X}D_1^{-1}\beta_1\big\|_2^2 + \lambda\|\beta\|_1 : D_2D_1^{-1}\beta_1 - \beta_2 = 0\Big\} = \min_{\beta\in\mathbb{R}^r}\Big\{\tfrac{1}{2}\|Y-X\beta\|_2^2 + \lambda\|\beta\|_1 : C\beta = 0\Big\}, \]
where X = [\tilde{X}D_1^{-1} \;\; 0] and C = [D_2D_1^{-1} \;\; -I]. Note that Θ = [D_1^{-1} \;\; 0]\beta, and thus the generalized LASSO is a special case of the constrained LASSO.

Hence, any problem that falls into the generalized LASSO paradigm can be solved as a PAC problem with L(β) = ½‖Y − Xβ‖_2^2 and b = 0. Tibshirani and Taylor (2011) further demonstrate that a variety of common statistical methods can be formulated as special cases of the generalized LASSO. One example is the 1d fused LASSO (Tibshirani et al. 2005). The fused LASSO encourages blocks of adjacent estimated coefficients to all have the same value. This type of structure often makes sense in situations where there is a natural ordering in the coefficients. If instead the data have a two-dimensional ordering, such as for an image reconstruction, this idea can be extended to the 2d fused LASSO. Other examples of statistical methodologies that fall into the generalized LASSO (and hence PAC) setting include polynomial trend filtering, where one penalizes discrete differences to produce smooth piecewise polynomial curves, wavelet smoothing, and the FLiRTI method (James et al. 2009). See Tibshirani and Taylor (2011) for more details on these methods and further examples of the generalized LASSO paradigm. She (2010) also considers a criterion similar to (3.6) and discusses special cases such as the "clustered LASSO". Lemma 3.1.1 shows that all of these various approaches can be solved using a PAC optimization.
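To illustrate the construction in the proof of Lemma 3.1.1, the sketch below builds the matrices X, C, and A from a small randomly generated D (assumed, as in the lemma, to have full column rank) and a toy design X̃. It only forms the matrices; it does not solve the resulting constrained problem, and all names are illustrative.

import numpy as np

rng = np.random.default_rng(1)
n, p, r = 30, 4, 7                      # r > p, so (3.6) cannot be reduced to an ordinary LASSO
X_tilde = rng.standard_normal((n, p))
D = rng.standard_normal((r, p))         # generically has rank p

# Split D into an invertible p x p block D1 and the remaining rows D2.
D1, D2 = D[:p, :], D[p:, :]
D1_inv = np.linalg.inv(D1)

# Matrices from Lemma 3.1.1: beta = (beta_1, beta_2) with beta_1 = D1 Theta.
X = np.hstack([X_tilde @ D1_inv, np.zeros((n, r - p))])   # X = [X_tilde D1^{-1}, 0]
C = np.hstack([D2 @ D1_inv, -np.eye(r - p)])              # C = [D2 D1^{-1}, -I]
A = np.hstack([D1_inv, np.zeros((p, r - p))])             # Theta = A beta

print(X.shape, C.shape, A.shape)        # (30, 7) (3, 7) (4, 7)

Feeding X, C, and b = 0 into the PAC machinery of the next section therefore solves the generalized LASSO exactly.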
3.2. Optimizing Penalized and Constrained Loss Functions

In the standard generalized linear models setting, it is common to maximize the likelihood by computing a quadratic approximation at the current parameter estimate, maximizing the quadratic, recomputing the approximation, and iterating; the so-called iteratively reweighted least squares algorithm. A similar approach can be used to solve the PAC criterion in Equation (3.2).

As demonstrated in Section 2.2, the second-order Taylor approximation to L(β) about the current parameter estimate, β̃, is given by L(β) ≈ L(β̃) + d̃^T(β − β̃) + ½(β − β̃)^T H̃ (β − β̃), where H̃ and d̃ are respectively the Hessian and gradient of L(β) at β̃. Let H̃ = UDU^T represent the singular value decomposition of the Hessian. Then it is not hard to show that the Taylor expansion can be rearranged in such a way that, up to an irrelevant additive constant, L(β) is approximated by ½‖Y − Xβ‖^2, where X = D^{1/2}U^T and Y = X(β̃ − H̃^{-1}d̃). To see this, note H̃ = X^T X = UDU^T = (D^{1/2}U^T)^T(D^{1/2}U^T), so X = D^{1/2}U^T and H̃^{-1} = UD^{-1}U^T. Then H̃^{-1}X^T = UD^{-1}U^T(D^{1/2}U^T)^T = UD^{-1/2}, and thus Y = X(β̃ − H̃^{-1}d̃), disregarding the additive constant term. Hence, approximate (3.2) by
\[ \arg\min_{\beta} \; \tfrac{1}{2}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_1 \quad \text{subject to} \quad C\beta = b, \tag{3.7} \]
a constrained version of the standard LASSO. (To simplify the explanation, I use equality rather than inequality constraints in (3.7); by introducing slack variables, the same basic approach can be used to optimize over inequality constraints, as discussed later in this section.)

Unfortunately, even though many algorithms exist to fit the LASSO, the constraint on the coefficients in (3.7) makes it difficult to solve directly. However, Lemma 3.2.1 provides an approach for reformulating (3.7) as an unconstrained optimization problem.

Lemma 3.2.1. Let A represent a given index set corresponding to an m-dimensional subset of β (so C_A is an m by m matrix), and let X_A and X_Ā respectively represent the columns of X corresponding to A and to the complement of A. Further define Θ_A by
\[ \Theta_A = \arg\min_{\Theta} \; \tfrac{1}{2}\|Y^* - X^*\Theta\|_2^2 + \lambda\|\Theta\|_1 + \lambda\big\|C_A^{-1}(b - C_{\bar A}\Theta)\big\|_1, \tag{3.8} \]
where Y^* = Y − X_A C_A^{-1} b and X^* = X_{\bar A} − X_A C_A^{-1} C_{\bar A}. Then, for any index set A such that C_A is non-singular, the solution to (3.7) is given by
\[ \beta = \begin{bmatrix} C_A^{-1}(b - C_{\bar A}\Theta_A) \\ \Theta_A \end{bmatrix}. \]

Proof. Consider any index set A such that C_A is non-singular. The constraint Cβ = b can be written as
\[ C_A\beta_A + C_{\bar A}\beta_{\bar A} = b \;\iff\; \beta_A = C_A^{-1}\big(b - C_{\bar A}\beta_{\bar A}\big), \]
and thus β_A can be determined from β_Ā. Then, for any β such that Cβ = b,
\[ \tfrac{1}{2}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_1 = \tfrac{1}{2}\big\|Y - X_{\bar A}\beta_{\bar A} - X_A\beta_A\big\|_2^2 + \lambda\|\beta_{\bar A}\|_1 + \lambda\|\beta_A\|_1 = \tfrac{1}{2}\big\|Y - X_{\bar A}\beta_{\bar A} - X_AC_A^{-1}(b - C_{\bar A}\beta_{\bar A})\big\|_2^2 + \lambda\|\beta_{\bar A}\|_1 + \lambda\big\|C_A^{-1}(b - C_{\bar A}\beta_{\bar A})\big\|_1 = \tfrac{1}{2}\big\|\big(Y - X_AC_A^{-1}b\big) - \big(X_{\bar A} - X_AC_A^{-1}C_{\bar A}\big)\beta_{\bar A}\big\|_2^2 + \lambda\|\beta_{\bar A}\|_1 + \lambda\big\|C_A^{-1}(b - C_{\bar A}\beta_{\bar A})\big\|_1. \]
Using the change of variable Θ = β_Ā, the constrained LASSO problem is therefore equivalent to the unconstrained optimization problem
\[ \min_{\Theta} \; \tfrac{1}{2}\|Y^* - X^*\Theta\|_2^2 + \lambda\|\Theta\|_1 + \lambda\big\|C_A^{-1}(b - C_{\bar A}\Theta)\big\|_1, \]
and if Θ_A denotes a solution to this problem, then a solution to the original constrained LASSO problem is given by
\[ \beta = \begin{bmatrix}\beta_A\\ \beta_{\bar A}\end{bmatrix} = \begin{bmatrix}C_A^{-1}(b - C_{\bar A}\Theta_A)\\ \Theta_A\end{bmatrix}, \]
completing the proof of Lemma 3.2.1.

To reduce notation, assume without loss of generality that the elements of β are ordered so that the first m correspond to A. Equation 3.8 has the advantage that one does not need to incorporate any constraints on the parameters. Hence, one might imagine using a coordinate descent algorithm to compute Θ_A. However, for such an algorithm to reach a global optimum it must be the case that the parameters are separable in the non-differentiable $\ell_1$ penalty. In other words, the criterion must be of the form
\[ f(\theta_1,\dots,\theta_{p-m}) = L(\theta_1,\dots,\theta_{p-m}) + \sum_{j=1}^{p-m} h_j(\theta_j), \tag{3.9} \]
where L(·) is differentiable and convex, and the h_j(·) functions are convex. The standard LASSO criterion satisfies (3.9) because ‖β‖_1 = Σ_{j=1}^p |β_j|. Unfortunately, the second $\ell_1$ term in (3.8) cannot be separated into additive functions of θ_j. Hence, there does not appear to be any simple approach for directly computing Θ_A. Fortunately, Lemma 3.2.2 shows that an alternative, more tractable criterion can be used to compute Θ_A.
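Before presenting that criterion, the variable-elimination step of Lemma 3.2.1 can be made concrete with a short sketch. The inputs are random toy matrices, the index set A is simply the first m coordinates as in the text, and the Θ used for the reconstruction is a random stand-in rather than the solution of (3.8).

import numpy as np

rng = np.random.default_rng(2)
n, p, m = 40, 10, 3
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)
C = rng.standard_normal((m, p))
b = rng.standard_normal(m)

A = np.arange(m)                      # index set A: first m coordinates
Abar = np.arange(m, p)
CA, CAbar = C[:, A], C[:, Abar]
XA, XAbar = X[:, A], X[:, Abar]
CA_inv = np.linalg.inv(CA)            # requires C_A to be non-singular

# Transformed response and design from Lemma 3.2.1.
Y_star = Y - XA @ CA_inv @ b
X_star = XAbar - XA @ CA_inv @ CAbar

# Given any Theta, the full coefficient vector is recovered as in the lemma.
Theta = rng.standard_normal(p - m)
beta = np.concatenate([CA_inv @ (b - CAbar @ Theta), Theta])
print(np.allclose(C @ beta, b))       # True: the equality constraints hold exactly

The point of the construction is visible in the last line: any Θ, once mapped back through C_A^{-1}(b − C_Ā Θ), automatically satisfies the constraints.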
Lemma 3.2.2. For a given index set A and vector s, define Θ_{A,s} by
\[ \Theta_{A,s} = \arg\min_{\Theta} \; \tfrac{1}{2}\|\tilde Y - X^*\Theta\|_2^2 + \lambda\|\Theta\|_1, \tag{3.10} \]
where Ỹ = Y^* + λX^-(C_A^{-1}C_{\bar A})^T s, and X^- is a matrix such that (X^*)^T X^- = I. Then, for any index set A, it will be the case that Θ_A = Θ_{A,s} provided
\[ s = \operatorname{sign}\big(\Theta^-_{A,s}\big), \tag{3.11} \]
where Θ^-_{A,s} = C_A^{-1}(b − C_{\bar A}Θ_{A,s}) and sign(a) is a vector of the same dimension as a with ith element equal to 1 or −1 depending on the sign of a_i.

Proof. Consider an arbitrary Θ_{A,s} and s such that s = sign(C_A^{-1}(b − C_{\bar A}Θ_{A,s})). Let F : R^{p−m} → R_+ denote the objective function of the optimization problem in (3.8); that is, for each Θ,
\[ F(\Theta) = \tfrac{1}{2}\|Y^* - X^*\Theta\|_2^2 + \lambda\|\Theta\|_1 + \lambda\big\|C_A^{-1}(b - C_{\bar A}\Theta)\big\|_1. \]
By definition of Θ_A, F(Θ_A) ≤ F(Θ_{A,s}). To complete the proof, it suffices to show that F(Θ_{A,s}) ≤ F(Θ_A). Suppose, on the contrary, that F(Θ_{A,s}) > F(Θ_A). For each α ∈ [0,1], define Θ_α ∈ R^{p−m} and g(α) ∈ R_+ by
\[ \Theta_\alpha \equiv (1-\alpha)\Theta_{A,s} + \alpha\Theta_A \quad\text{and}\quad g(\alpha) \equiv F(\Theta_\alpha) = F\big((1-\alpha)\Theta_{A,s} + \alpha\Theta_A\big). \]
Note that g is a convex function on [0,1] because F(·) is convex. Moreover, g(0) = F(Θ_{A,s}) > F(Θ_A) = g(1). Thus, for all 0 < α ≤ 1,
\[ g(\alpha) = g\big(\alpha\cdot 1 + (1-\alpha)\cdot 0\big) \le \alpha g(1) + (1-\alpha)g(0) < g(0). \]
By the assumed hypothesis, |s_i| = 1 for all i, and thus every coordinate of the vector C_A^{-1}(b − C_{\bar A}Θ_{A,s}) is bounded away from zero. Hence α_0 can be chosen sufficiently small so that
\[ \operatorname{sign}\big(C_A^{-1}(b - C_{\bar A}\Theta_{\alpha_0})\big) = \operatorname{sign}\big(C_A^{-1}(b - C_{\bar A}\Theta_{A,s})\big). \]
It then follows that
\[ F(\Theta_{\alpha_0}) = \tfrac{1}{2}\|Y^* - X^*\Theta_{\alpha_0}\|_2^2 + \lambda\|\Theta_{\alpha_0}\|_1 + \lambda s^TC_A^{-1}(b - C_{\bar A}\Theta_{\alpha_0}) = \frac{(Y^*)^TY^*}{2} - (Y^*)^TX^*\Theta_{\alpha_0} - \lambda s^TC_A^{-1}C_{\bar A}\Theta_{\alpha_0} + \frac{(X^*\Theta_{\alpha_0})^TX^*\Theta_{\alpha_0}}{2} + \lambda\|\Theta_{\alpha_0}\|_1 + \lambda s^TC_A^{-1}b = \frac{(Y^*)^TY^*}{2} - (Y^*)^TX^*\Theta_{\alpha_0} - \big(\lambda X^-(C_A^{-1}C_{\bar A})^Ts\big)^TX^*\Theta_{\alpha_0} + \frac{(X^*\Theta_{\alpha_0})^TX^*\Theta_{\alpha_0}}{2} + \lambda\|\Theta_{\alpha_0}\|_1 + \lambda s^TC_A^{-1}b = \tfrac{1}{2}\big\|\tilde Y - X^*\Theta_{\alpha_0}\big\|_2^2 + \lambda\|\Theta_{\alpha_0}\|_1 + d, \]
where the last equality follows from Ỹ = Y^* + λX^-(C_A^{-1}C_{\bar A})^Ts, the third equality follows from (X^-)^TX^* = I, and d is defined by
\[ d = -\lambda(Y^*)^TX^-(C_A^{-1}C_{\bar A})^Ts - \tfrac{1}{2}\big\|\lambda X^-(C_A^{-1}C_{\bar A})^Ts\big\|_2^2 + \lambda s^TC_A^{-1}b. \]
It follows from the same argument that
\[ F(\Theta_{A,s}) = \tfrac{1}{2}\big\|\tilde Y - X^*\Theta_{A,s}\big\|_2^2 + \lambda\|\Theta_{A,s}\|_1 + d. \]
Since g(α_0) < g(0), F(Θ_{α_0}) < F(Θ_{A,s}), and this implies that
\[ \tfrac{1}{2}\big\|\tilde Y - X^*\Theta_{\alpha_0}\big\|_2^2 + \lambda\|\Theta_{\alpha_0}\|_1 < \tfrac{1}{2}\big\|\tilde Y - X^*\Theta_{A,s}\big\|_2^2 + \lambda\|\Theta_{A,s}\|_1, \]
but this contradicts the optimality of Θ_{A,s}. Therefore it must be the case that F(Θ_{A,s}) ≤ F(Θ_A), which completes the proof.

There is a simple intuition behind Lemma 3.2.2. The difficulty in computing (3.8) lies in the non-differentiability (and non-separability) of the second $\ell_1$ penalty. However, if (3.11) holds then
\[ \big\|C_A^{-1}(b - C_{\bar A}\Theta_{A,s})\big\|_1 = s^TC_A^{-1}(b - C_{\bar A}\Theta_{A,s}), \]
and the $\ell_1$ penalty can be replaced by a differentiable term which no longer needs to be separable. Of course the key to this approach is selecting A and s such that (3.11) holds. Note that Θ^-_{A,s} corresponds to the elements of β indexed by A, so the general principle for choosing A and s will be to:
1. Produce β^0, an initial estimate for β.
2. Set A equal to the m largest elements of |β^0|.
3. Set s equal to the signs of the corresponding m elements of β^0.
In order for this strategy to work, the procedure only requires that the m largest elements of β^0 are sign consistent, i.e. they have the same sign as the corresponding elements in the final solution to (3.7). Corollary 3.2.1 shows that Lemmas 3.2.1 and 3.2.2 provide the solution to (3.7).
Corollary 3.2.1. By Lemmas 3.2.1 and 3.2.2, provided A and s are chosen such that C_A is non-singular and (3.11) holds, the solution to (3.7) is given by
\[ \beta = \begin{bmatrix} \Theta^-_{A,s} \\ \Theta_{A,s} \end{bmatrix}. \]

Lemma 3.2.2 and Corollary 3.2.1 provide an attractive means of computing the PAC solution because (3.10) is a standard LASSO criterion, so it can be solved using any one of the many highly efficient LASSO optimization approaches, such as the LARS algorithm or the coordinate descent seen in Chapter 2. These results suggest Algorithm 3 for solving (3.7) over a grid of λ.

Algorithm 3: PAC with Equality Constraints
1. Initialize β^0 by solving (3.7) using λ_0 = λ_max.
2. At step k, select A_k and s_k using the largest m elements of |β^{k−1}| and set λ_k ← 10^{−α} λ_{k−1}, where α > 0 controls the step size.
3. Compute Θ_{A_k,s_k} by solving (3.10). Let Θ^−_{A_k,s_k} = C_{A_k}^{−1}(b − C_{Ā_k} Θ_{A_k,s_k}).
4. If (3.11) holds, set β^k = (Θ^−_{A_k,s_k}, Θ_{A_k,s_k}), k ← k + 1, and return to Step 2.
5. If (3.11) does not hold, then one of the largest m elements of β^{k−1} has changed sign. Since the coefficient paths are continuous, this means the step size in λ was too large. Hence, set λ_k ← λ_{k−1} − ½(λ_{k−1} − λ_k) and return to Step 3.
6. Iterate until λ_k < λ_min.

This algorithm relies on the fact that β_λ will be continuous in λ. Hence, if λ takes a small enough step, then the signs of the m largest elements of β^{k−1} and β^k will be identical. If this is not the case then the step is too large; to resolve this, reduce the step size and recompute β^k. In practice, unless m is very large or α is set too large, (3.11) generally holds without needing to decrease the step size. Step 3 is the main computational component of this algorithm, but Θ_{A_k,s_k} is easy to compute because (3.10) is just a standard LASSO criterion, so any one of a number of optimization tools can be used. The initial solution, β^0, can be computed by noting that as λ → ∞ the solution to (3.7) will be
\[ \arg\min_{\beta} \; \|\beta\|_1 \quad\text{such that}\quad C\beta = b, \tag{3.12} \]
which is a linear programming problem that can be efficiently solved using standard algorithms. In addition, the procedure also makes use of a reversed version of this algorithm, in which the first step sets λ_0 = λ_min and computes β^0 as the solution to a quadratic programming problem, and λ is then increased at each step until λ_k > λ_max.

Implementing the PAC LASSO algorithm requires making a choice for X^-, which is generally not difficult. If p ≤ n + m then it is easy to see that X^- = UD^{-1}V^T satisfies (X^*)^T X^- = I, where X^* = UDV^T represents the singular value decomposition of X^*. If p > n + m then, in general, for Lemma 3.2.2 to hold X^- must be chosen such that
\[ \Theta_{A,s} = (X^-)^TX^*\,\Theta_{A,s}, \tag{3.13} \]
where Θ_{A,s} is the solution to (3.10). But standard properties of the LASSO tell us that Θ_{A,s} can have at most n non-zero components. Hence, (3.13) will hold if X^- is chosen to be the inverse of the submatrix of X^* formed by the columns corresponding to the (at most) n non-zero elements of Θ_{A,s}. Of course it is impossible to know a priori with complete certainty which elements of Θ_{A,s} will be non-zero. However, based on the solution to the previous step in the algorithm, it is easy to identify the elements that are furthest from becoming non-zero, and these can generally be safely ignored in computing X^-. On the rare occasions where an incorrect set of columns is selected, simply reduce the step size in λ. As mentioned above, one can generally initialize the algorithm using the solution to (3.12).
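As an illustration of this initialization step, the sketch below solves (3.12) with scipy's linear-programming routine by splitting β into positive and negative parts, a standard reformulation. The toy C and b are randomly generated and purely illustrative; this is not the dissertation's implementation.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
m, p = 3, 10
C = rng.standard_normal((m, p))
b = rng.standard_normal(m)

# Write beta = u - v with u, v >= 0, so that ||beta||_1 = sum(u) + sum(v):
# minimize 1'u + 1'v  subject to  C(u - v) = b,  u, v >= 0.
cost = np.ones(2 * p)
A_eq = np.hstack([C, -C])
res = linprog(cost, A_eq=A_eq, b_eq=b, bounds=[(0, None)] * (2 * p))
beta0 = res.x[:p] - res.x[p:]
print(np.allclose(C @ beta0, b))   # True, up to solver tolerance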
However, this approach could potentially fail if one of the constraints in C is parallel with ‖β‖_1, for example Σ_j β_j = 1, in which case there may not be a unique solution to (3.12). In this setting quadratic programming can be used to initialize the algorithm, which is slightly less efficient but does not unduly impact the computational burden because the solution only needs to be found for a single value of λ.

This equality-constraint approach can be extended in much the same way to inequality constraints by using slack variables. Thus consider the more general version of (3.7) in which the constraints take the inequality form Cβ ≤ b. One might imagine that a reasonable approach would be to initialize with β such that
\[ C\beta \le b \tag{3.14} \]
and then apply a coordinate descent algorithm subject to ensuring that (3.14) is not violated at each update. Unfortunately, this approach typically gets stuck at a constraint boundary point where no improvement is possible by changing a single coordinate. In this setting the criterion can often be improved by moving along the constraint boundary, but such a move requires adjusting multiple coefficients simultaneously, which is not possible using coordinate descent because it only updates one coefficient at a time. Instead, consider a set of m slack variables, δ, so that (3.14) can be equivalently expressed as
\[ C\beta + \delta = b,\;\; \delta \ge 0 \quad\text{or}\quad \tilde C\tilde\beta = b,\;\; \delta \ge 0, \tag{3.15} \]
where β̃ = (β, δ) is a (p + m)-dimensional vector, C̃ = [C  I], and I is an m-dimensional identity matrix. Let e_δ(a) be a function which selects out the elements of a that correspond to δ. For example, e_δ(β̃) = δ while e_δ̄(β̃) = β. Then the inclusion of the slack variables in (3.15) means the constrained LASSO criterion (3.7) can be reexpressed as
\[ \arg\min_{\tilde\beta} \; \tfrac{1}{2}\|Y - \tilde X\tilde\beta\|_2^2 + \lambda\|e_{\bar\delta}(\tilde\beta)\|_1 \quad\text{such that}\quad \tilde C\tilde\beta = b,\;\; e_\delta(\tilde\beta) \ge 0, \tag{3.16} \]
where X̃ = [X  0] and 0 is an n by m matrix of zero elements.

The criterion in (3.16) is very similar to the equality-constrained LASSO (3.7). The only differences are that the components of β̃ corresponding to δ do not appear in the penalty term and are required to be non-negative. Even with these minor differences, the same basic approach from Section 3.2 can still be adopted for fitting (3.16). In particular, Lemma 3.2.3 provides a set of conditions under which this inequality-constrained version of (3.7) can be solved.

Lemma 3.2.3. For a given index set A and vector s such that e_δ(s) = 0, define Θ̃_{A,s} by
\[ \tilde\Theta_{A,s} = \arg\min_{\Theta} \; \tfrac{1}{2}\|\tilde Y - X^*\Theta\|_2^2 + \lambda\|e_{\bar\delta}(\Theta)\|_1 \quad\text{such that}\quad e_\delta(\Theta) \ge 0, \tag{3.17} \]
where Ỹ = Y − X̃_A C̃_A^{-1} b + λX^-(C̃_A^{-1}C̃_{\bar A})^T s, X^* = X̃_{\bar A} − X̃_A C̃_A^{-1} C̃_{\bar A}, and X^- is a matrix such that (X^*)^T X^- = I. Suppose
\[ e_{\bar\delta}(s) = \operatorname{sign}\big(e_{\bar\delta}(\tilde\Theta^-_{A,s})\big), \tag{3.18} \]
\[ e_\delta\big(\tilde\Theta^-_{A,s}\big) \ge 0, \tag{3.19} \]
where Θ̃^-_{A,s} = C̃_A^{-1}(b − C̃_{\bar A}Θ̃_{A,s}). Then the solution to the inequality-constrained criterion (3.7) is given by e_δ̄(β̃), where
\[ \tilde\beta = \begin{bmatrix} \tilde\Theta^-_{A,s} \\ \tilde\Theta_{A,s} \end{bmatrix}. \]

Lemma 3.2.3 shows that, provided an appropriate A and s are chosen, the solution to PAC with inequality constraints can still be computed by solving a LASSO-type criterion, (3.17). However, both (3.18) and (3.19) must hold. Condition (3.18) is equivalent to (3.11) in the equality-constraint setting, while (3.19), along with the constraint in (3.17), ensures that δ ≥ 0. The same strategy as in Section 3.2 can be used. First, obtain an initial coefficient estimate, β̃^0. Next, select A corresponding to the largest m elements of |β̃^0|, say β̃_A. Finally, the m-dimensional s vector is chosen by fixing e_δ̄(s) = sign(e_δ̄(β̃_A)) and setting the remaining elements of s to zero, i.e. e_δ(s) = 0.
β̃_A is the initial guess for Θ̃^-_{A,s}, so as long as β̃_A is sign consistent for Θ̃^-_{A,s}, (3.18) will hold. Similarly, by only including the largest current values of δ in A, for a small enough step in λ none of these elements will become negative, and (3.19) will hold. Hence, Algorithm 4 can be used to fit the inequality-constrained PAC.

Algorithm 4: PAC with Inequality Constraints
1. Initialize β̃^0 by solving (3.16) using λ_0 = λ_max.
2. At step k, select A_k and s_k using the largest m elements of |β̃^{k−1}| and set λ_k ← 10^{−α} λ_{k−1}.
3. Compute Θ̃_{A_k,s_k} by solving (3.17). Let Θ̃^−_{A_k,s_k} = C̃_{A_k}^{−1}(b − C̃_{Ā_k} Θ̃_{A_k,s_k}).
4. If (3.18) and (3.19) hold, set β̃^k = (Θ̃^−_{A_k,s_k}, Θ̃_{A_k,s_k}), k ← k + 1, and return to Step 2.
5. If (3.18) or (3.19) does not hold, the step size must be too large. Hence, set λ_k ← λ_{k−1} − ½(λ_{k−1} − λ_k) and return to Step 3.
6. Iterate until λ_k < λ_min.

Notice that Algorithm 4 involves only slight changes from Algorithm 3. In particular, solving (3.17) in Step 3 poses little additional complication over fitting the standard LASSO. The only differences are that the elements of Θ that correspond to δ, i.e. e_δ(Θ), have zero penalty and must be non-negative. However, these changes are simple to incorporate into a coordinate descent algorithm. For any θ_j that is an element of e_δ(Θ), first compute the unshrunk least squares estimate, θ̂_j, and then set
\[ \tilde\theta_j = \big[\hat\theta_j\big]_+. \tag{3.20} \]
It is not hard to show that (3.20) enforces the non-negativity constraint on δ while also ensuring that no penalty term is applied to the slack variables. The update step for the original β coefficients, e_δ̄(Θ) (those that do not involve δ), is identical to that for the standard LASSO. The initial solution, β̃^0, can still be computed by solving a standard linear programming problem, subject to inequality constraints.
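A minimal sketch of the modified coordinate update just described follows: standard LASSO coordinate descent in which designated "slack" coordinates are left unpenalized and truncated at zero, as in (3.20). The toy data are randomly generated, and this is an illustration rather than the dissertation's implementation.

import numpy as np

def coord_descent_with_slacks(Y, X, lam, is_slack, n_sweeps=200):
    # Minimize 0.5*||Y - X theta||^2 + lam * sum over non-slack j of |theta_j|,
    # with slack coordinates unpenalized and constrained to be >= 0.
    n, q = X.shape
    theta = np.zeros(q)
    col_ss = (X ** 2).sum(axis=0)            # column sums of squares
    r = Y - X @ theta                        # current residual
    for _ in range(n_sweeps):
        for j in range(q):
            r += X[:, j] * theta[j]          # remove coordinate j from the fit
            rho = X[:, j] @ r                # unshrunk univariate estimate numerator
            if is_slack[j]:
                theta[j] = max(rho / col_ss[j], 0.0)   # update (3.20): positive part, no shrinkage
            else:
                theta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ss[j]   # soft-threshold
            r -= X[:, j] * theta[j]          # put coordinate j back
    return theta

# Toy illustration: the last two coordinates play the role of the slack variables delta.
rng = np.random.default_rng(4)
X = rng.standard_normal((50, 8))
Y = rng.standard_normal(50)
is_slack = np.array([False] * 6 + [True] * 2)
theta_hat = coord_descent_with_slacks(Y, X, lam=2.0, is_slack=is_slack)
print(theta_hat[-2:] >= 0)   # slack coordinates are non-negative by construction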
3.3. PAC Simulation Studies

Before examining a very specific use of PAC via the Internet advertising example, shown in Section 3.4, first consider a demonstration of the advantages of PAC using simulation results to compare PAC's performance relative to unconstrained fits. This shows how PAC can offer significant gains in prediction and optimization in several known settings. Here, I consider two such settings, corresponding to data generated from a standard Gaussian linear regression with L(β) = ‖Y − Xβ‖^2 and data generated from a binomial logistic regression model with L(β) equal to the corresponding negative log-likelihood. In Section 3.3.1, I show that when the true underlying parameters satisfy equality constraints, the PAC formulation can yield significant improvements in prediction accuracy over unconstrained methods. In addition, Section 3.3.2 shows that the PAC approach is robust, so that even when the true parameters violate some of the constraints, PAC still yields superior estimates.

3.3.1 Equality Constraints

I consider six simulation settings: three different combinations of n observations and p predictors (incorporating both traditional and high-dimensional problems) and two different correlation structures, ρ_jk = 0 and ρ_jk = 0.5^{|j−k|} (where ρ_jk is the correlation between the jth and kth variables). The training data sets were produced using a random design matrix generated from a standard normal distribution. For each setting I randomly generate a training set, fit each method to the data, and compute the error over a test set of N = 10,000 observations. For the binomial responses, the error metric is the percentage of incorrect binomial predictions; for the Gaussian case, the error computations are performed using the root mean squared error,
\[ \mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^N\big(\hat Y_i - E(Y_i\mid X_i)\big)^2}. \]
This process was repeated 100 times for each of the six settings. In all cases, the m-by-p constraint matrix C and the constraint vector b were randomly generated from the same normal distribution as the data. The true coefficient vector β* was produced by first generating β*_Ā using 5 non-zero random uniform components and p − m − 5 zero entries, and then computing β*_A = C_A^{-1}(b − C_Ā β*_Ā). Note that this process resulted in β* having at most m + 5 non-zero entries and ensured that the constraints held for the true coefficient vector. For each set of simulations, the optimal value of λ was chosen by minimizing error on a separate validation set, which was independently generated using the same parameters as the corresponding training data.

Table 3.1: Average RMSE over 100 training data sets, for four LASSO methods tested in three different simulation settings and two different correlation structures. The numbers in parentheses are standard errors.

Setting               ρ              LASSO        PAC          Relaxed LASSO  Relaxed PAC
n = 100, p = 50,      0              0.57 (0.01)  0.50 (0.01)  0.43 (0.01)    0.29 (0.01)
m = 5                 0.5^|i−j|      0.59 (0.01)  0.46 (0.01)  0.50 (0.02)    0.32 (0.01)
n = 50, p = 500,      0              3.39 (0.08)  3.28 (0.09)  3.29 (0.09)    3.07 (0.10)
m = 10                0.5^|i−j|      2.61 (0.08)  2.39 (0.09)  2.50 (0.09)    2.23 (0.10)
n = 50, p = 100,      0              3.90 (0.08)  1.57 (0.05)  3.81 (0.07)    1.47 (0.04)
m = 30                0.5^|i−j|      3.15 (0.06)  1.45 (0.04)  3.00 (0.05)    1.34 (0.03)

Table 3.2: Average misclassification error (in percentages) over 100 training data sets for four binomial methods tested in three different simulation settings and two different correlation structures. The Bayes error rate is given for comparison; it represents the minimum error rate. The numbers in parentheses are standard errors, also in percentages.

Setting               ρ              Bayes         GLM           PAC           Relaxed GLM   Relaxed PAC
n = 100, p = 50,      0              12.27 (0.11)  19.36 (0.23)  18.56 (0.24)  19.33 (0.45)  17.68 (0.31)
m = 5                 0.5^|i−j|       9.30 (0.10)  14.60 (0.18)  14.08 (0.19)  14.51 (0.20)  13.34 (0.25)
n = 1000, p = 500,    0              11.02 (0.19)  12.33 (0.22)  12.00 (0.26)  12.14 (0.26)  11.68 (0.28)
m = 10                0.5^|i−j|       8.60 (0.19)  10.15 (0.28)   9.76 (0.31)  10.01 (0.33)   9.44 (0.27)
n = 50, p = 100,      0               8.20 (0.06)  43.17 (1.07)  36.06 (0.85)  41.60 (1.01)  31.23 (0.63)
m = 30                0.5^|i−j|       7.26 (0.10)  37.73 (1.58)  28.03 (0.78)  35.67 (1.47)  24.54 (0.70)

For each method I explored three combinations of n, p, and m: a low-dimensional, traditional problem with few constraints; a more complex, higher-dimensional problem with few constraints; and a high-dimensional problem with more constraints. The test error values for the six resulting settings are displayed in Tables 3.1 and 3.2, respectively, for the Gaussian and binomial responses. For each setting, I compared results from four different approaches: the standard unconstrained but penalized fit, i.e. (3.1) (Friedman et al. 2010b), PAC, the relaxed unconstrained fit (relaxed LASSO in the Gaussian case, relaxed GLM in the binomial case), and the relaxed PAC. The latter two methods use a two-step approach in an attempt to reduce the overshrinkage problem commonly exhibited by the $\ell_1$ penalty. In the first step, the given method is used to select a candidate set of predictors. In the second step, the final model is produced using an unshrunk ordinary least squares fit on the variables selected in the first step. The first set of results in each table corresponds to a setting with only m = 5 constraints, n = 100, p = 50.
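For readers who wish to reproduce the flavor of this design, the following is a minimal sketch of the constraint-consistent data generation described above for the Gaussian case with ρ = 0 (the seed, variable names, and the omission of the validation set and model fitting are all simplifying assumptions):

import numpy as np

rng = np.random.default_rng(5)
n, p, m = 100, 50, 5

# Design, constraint matrix, and constraint vector all drawn from a standard normal.
X = rng.standard_normal((n, p))
C = rng.standard_normal((m, p))
b = rng.standard_normal(m)

# beta*_Abar: 5 non-zero uniform components, the rest zero; then beta*_A is chosen
# so that the equality constraints hold exactly.
beta_Abar = np.zeros(p - m)
beta_Abar[:5] = rng.uniform(size=5)
CA, CAbar = C[:, :m], C[:, m:]
beta_A = np.linalg.solve(CA, b - CAbar @ beta_Abar)
beta_star = np.concatenate([beta_A, beta_Abar])

Y = X @ beta_star + rng.standard_normal(n)
print(np.allclose(C @ beta_star, b), int((beta_star != 0).sum()))  # True, and at most m + 5 non-zeros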
In the binomial case, PAC shows a moderate improvement over the standard logistic regression fit, which one might expect given this is a relatively low-dimensional problem with only a small number of constraints. However, even with this low value for m, in the Gaussian case PAC shows highly statistically significant improvements over the unconstrained methods. Both relaxed methods display lower error rates than their unrelaxed counterparts, and the higher correlations in the ρ = 0.5^{|i−j|} design structure do not change the relative rankings of the four approaches.

The second setting represents a more complex situation with n = 50, p = 500, and m = 10 for the Gaussian case, and n = 1000, p = 500, and m = 10 for the binomial case. (I use a larger value for n in the binomial setting because these distributions provide less information for estimating the regression coefficients.) As one would expect, given the low ratio of m relative to p, PAC only shows small improvements over its unconstrained counterparts. However, the main purpose in examining this setting is to show that the PAC algorithm is still efficient enough to optimize the constrained criterion even for large data sets and very high dimensional data.

The final setting examined data with n = 50, p = 100 and a larger number of constraints, m = 30. This setting is more statistically favorable for PAC because it has the potential to produce significantly more accurate regression coefficients by correctly incorporating the larger number of constraints. However, this is also a computationally difficult setting for PAC because a large value of m causes the coefficient paths to be highly variable. Nevertheless, the large improvements in accuracy for both PAC and relaxed PAC in Tables 3.1 and 3.2 demonstrate the algorithm is quite capable of dealing with this added complexity.

3.3.2 Violations of Constraints

The results presented in the previous section all correspond to an ideal situation where the true regression coefficients exactly match the equality constraints. I also investigated the sensitivity of PAC to deviations of the regression coefficients from the assumed constraints. In particular, I generated the true regression coefficients according to
\[ C\beta^* = (1 + u)\cdot b, \tag{3.21} \]
where u = (u_1, ..., u_m), u_ℓ ∼ Unif(0, a) for ℓ = 1, ..., m, and the vector product is taken pointwise. The PAC and relaxed PAC were then fit using the usual (but in this case incorrect) constraint, Cβ = b. Table 3.3 reports the new RMSE values for three Gaussian settings under the ρ = 0 correlation structure; the first two settings correspond exactly to the first two settings considered in Section 3.3.1, while the last is a setting with a very large number of constraints to demonstrate robustness even when n < m.

Table 3.3: Average RMSE over 100 training data sets in three different simulation settings using the ρ = 0 correlation structure. The numbers in parentheses are standard errors. The true regression coefficients were generated according to (3.21).

Setting            a       LASSO        PAC          Relaxed LASSO  Relaxed PAC
n = 100,           0.25    0.57 (0.01)  0.51 (0.01)  0.43 (0.01)    0.30 (0.01)
p = 50,            0.50    0.57 (0.01)  0.51 (0.01)  0.42 (0.01)    0.33 (0.01)
m = 5              0.75    0.57 (0.01)  0.52 (0.01)  0.42 (0.01)    0.35 (0.01)
                   1.00    0.57 (0.01)  0.53 (0.01)  0.42 (0.01)    0.38 (0.01)
n = 50,            0.25    3.34 (0.08)  3.22 (0.09)  3.24 (0.09)    2.98 (0.10)
p = 500,           0.50    3.34 (0.08)  3.22 (0.09)  3.23 (0.09)    2.99 (0.10)
m = 10             0.75    3.37 (0.08)  3.28 (0.09)  3.26 (0.09)    3.07 (0.10)
                   1.00    3.36 (0.08)  3.28 (0.09)  3.23 (0.09)    3.06 (0.10)
n = 50,            0.25    6.66 (0.07)  1.17 (0.03)  6.80 (0.08)    0.95 (0.03)
p = 100,           0.50    6.66 (0.07)  1.18 (0.03)  6.81 (0.08)    0.99 (0.03)
m = 60             0.75    6.65 (0.07)  1.21 (0.03)  6.79 (0.08)    1.03 (0.03)
                   1.00    6.68 (0.07)  1.27 (0.04)  6.86 (0.08)    1.08 (0.04)
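A small sketch of how such constraint violations can be generated for a given a follows; it simply reuses the construction from the equality-constraint simulations with b replaced by the perturbed right-hand side (1 + u)·b. All numbers and names are illustrative.

import numpy as np

rng = np.random.default_rng(6)
p, m, a = 50, 5, 0.5
C = rng.standard_normal((m, p))
b = rng.standard_normal(m)

# Perturb the right-hand side: C beta* = (1 + u) * b with u_l ~ Unif(0, a).
u = rng.uniform(0.0, a, size=m)
b_violated = (1.0 + u) * b

# Build beta* to satisfy the *perturbed* constraints, exactly as before.
beta_Abar = np.zeros(p - m)
beta_Abar[:5] = rng.uniform(size=5)
beta_A = np.linalg.solve(C[:, :m], b_violated - C[:, m:] @ beta_Abar)
beta_star = np.concatenate([beta_A, beta_Abar])

# PAC would then be fit with the original (now incorrect) constraint C beta = b.
print(np.abs(C @ beta_star - b) / np.abs(b))   # relative violations: equal to the draws u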
I tested four values for a: 0.25, 0.50, 0.75, and 1.00. The largest value of a corresponds to a 50% average error in the constraint. The results suggest that PAC and relaxed PAC are surprisingly robust to random violations in the constraints. While both methods deteriorated slightly as a increased, they were still both superior to their unconstrained counterparts for all values of a and all settings.

3.4. Application of Methodology to Internet Media Selection

Although PAC can be applied in many settings, as in Chapter 2 the main application for demonstration purposes is a case study of real-world Internet media campaigns. The focus of these campaigns has traditionally fallen into two main problems of interest: maximizing reach and maximizing clickthrough rate (CTR), both subject to some maximum allowed budget, B, across a defined advertising campaign. As discussed previously, reach is defined as the fraction of customers who are exposed to a given ad at least one time, while clickthrough rate is defined as the fraction of customers who click on a given ad. Modeling these first requires defining the functions themselves, as there is no standard reach or CTR function for Internet media, as seen in the discussion of reach in Chapter 2.

CTR is a natural extension of reach, since before an Internet user can click on an ad, he or she must first be exposed to it. I thus begin by defining the reach function in a slightly different (though related) manner to Section 2.2. As before, let β_j be the budget allocated to website j and υ_j represent the effectiveness of a dollar of advertising at website j (note that in the notation of Section 2.2, υ_j would be equivalent to θ_ij/z_ij). Again, υ_j is a function of website characteristics, such as cost and traffic, such that υ_j β_j represents the probability of an ad appearing to a user on website j given a firm has spent β_j dollars. (This is equivalent to s_j in Section 2.2.) The simplest formulation for υ_j is υ_j = 1/(τ_j · CPM_j/1000), where CPM_j is the cost per thousand impressions at website j and τ_j is the expected total number of visits to the jth website during the course of the ad campaign. Under this formulation, τ_j · CPM_j/1000 represents the total cost to buy all expected impressions at website j, so υ_j β_j (the budget divided by this total cost) gives advertisers the proportion of total impressions bought at website j.

Thus, if υ_j β_j represents the probability an ad appears to a user on website j, the probability an ad does not appear to a user on a random visit to website j is 1 − υ_j β_j. Because ad appearances are independent across users and previous visits, each time a user visits website j, he or she has the same probability of failing to view the ad (i.e. 1 − υ_j β_j). Hence, if user i visits each website a total of z_ij times, then the probability that the user never views the ad across all p websites is Π_{j=1}^p (1 − β_j υ_j)^{z_ij}. If this is averaged over all n users in the data, the reach loss function (one minus the expected reach) becomes
\[ L(\beta) = \frac{1}{n}\sum_{i=1}^n\prod_{j=1}^p(1-\beta_j\upsilon_j)^{z_{ij}}, \]
where the sum of money spent across the p websites, Σ_j β_j, must be less than the total allowed budget, B, and spending at all websites must be nonnegative. Thus maximizing reach can be formalized as
\[ \arg\min_{\beta} \; \frac{1}{n}\sum_{i=1}^n\prod_{j=1}^p(1-\beta_j\upsilon_j)^{z_{ij}} + \lambda\|\beta\|_1, \tag{3.22} \]
where λ is a tuning parameter with a one-to-one correspondence to the budget B, as noted previously.
This is an instance of (3.1). Further, the nonnegativity constraint on the β_j values is unnecessary, since the solution will never improve by setting a β_j to a negative value: the minimization of the function ensures β_j will be, at minimum, zero, because a negative β_j would actually increase the value of the objective function.

This reach function can be naturally extended to the other most common Internet advertising metric, clickthrough rate. In practice the clickthrough rate for each website can be obtained either directly from past click logs (e.g. Dave and Varma (2010)) or estimated in numerous ways if historical data are not available (e.g. Immorlica et al. (2005)). Let q_j represent the expected clickthrough probability on website j, conditional on a user viewing the ad. Hence, the unconditional probability of a user viewing the ad and clicking through is β_j υ_j q_j, and the corresponding probability that the user fails to click through is 1 − β_j υ_j q_j. Thus the CTR function looks very similar to the reach function, L(β) = (1/n) Σ_{i=1}^n Π_{j=1}^p (1 − β_j υ_j q_j)^{z_ij}, which leads to
\[ \arg\min_{\beta} \; \frac{1}{n}\sum_{i=1}^n\prod_{j=1}^p(1-\beta_j\upsilon_j q_j)^{z_{ij}} + \lambda\|\beta\|_1. \tag{3.23} \]
Note the objective function here is binomial and decidedly nonconvex. However, following Le Cam's Theorem (Le Cam 1960), it can be shown that given independent random variables with a Bernoulli distribution, the sum of the Bernoulli variables follows an approximate Poisson distribution with a bounded approximation error. The problem considered here falls into this framework and can thus be approximated by an exponential function, which is convex in β; see Appendix C.2 for more details. (Note this Poisson distribution is the underlying basis for the method considered in Chapter 2, though there it is only considered for a formulation of reach due to the more complex problem of modeling clickthrough rate.)

In practical applications many advertising campaigns include constraints on β. For example, firms frequently wish to impose a minimum level of media coverage on certain subsets of websites. In this case, firms have a subset S of websites on which they know they want to advertise, so they dedicate a minimum proportion of their budget to this subset. The NCL case study in Section 2.3.4 is one example of this. A problem like this fits very naturally into the constraint matrix setup by defining
\[ C_S^T\beta \ge b_S, \tag{3.24} \]
where C_S identifies the websites in the subset S, and b_S is the proportion of budget the firm wishes to allocate to the subset S. I explore an example of this in the case study in Section 3.5, in which I again consider the problem of Norwegian Cruise Lines (NCL) running an Internet advertising campaign when it knows that users who visit aggregate travel websites such as Orbitz.com are more likely to purchase cruises than users at other websites.

Another, more complex motivating application of the constrained setup arises in problems where firms want to maximize either reach or CTR but do not wish to solely maximize one or the other. That is, firms may choose their primary goal (e.g. maximizing CTR) but wish to solve it subject to constraints on their secondary goal (e.g. maximizing reach to a particular demographic). One common example of this from the marketing literature (Burke 2002, Moe 2003) involves a firm offering the same products both online and in-store.
One common example of this primary/secondary structure from the marketing literature (Burke 2002, Moe 2003) involves a firm offering the same products both online and in-store. The firm may wish to maximize CTR, since those users are more likely to purchase the product once they visit the firm's website, but it also wants to be sure its ad reaches as many older users as possible, since older users are more likely to see the ad online but visit the physical store to purchase. The firm can build this, and other, requirements directly into the optimization using the constraint matrix, $C$, and optimizing

$$\sum_{i=1}^{n} \prod_{j=1}^{p} (1 - \beta_j \upsilon_j q_j)^{z_{ij}} + \lambda \|\beta\|_1 \quad \text{subject to} \quad C\beta \leq b, \qquad (3.25)$$

an instance of (3.2). These (and other) constraints can be included simultaneously in a campaign. Often, firms wish to design and carry out a campaign subject to many constraints at once. One such example involves the changes in targeting implemented by NCL in 2011. In 2011, NCL began expanding its cruise line offerings to a new market: solo cruise buyers. Historically, cruise lines had focused on double-occupancy rooms, requiring solo travelers to room with a stranger or incur the cost of booking a room for two people (Clements 2013). In 2011, NCL specially designed single-occupancy rooms to appeal to solo cruise travelers, a niche which had been previously unexplored by the cruise industry. Capitalizing on this niche can be extremely valuable; Cruise Lines International Association expected 23 million solo travelers in 2015 in North America alone (Ambroziak 2015).

Solo cruise line travelers typically fall into an age range of 30-59 with an income of $35,000 to $70,000 (Clements 2013). The demographic information included with the comScore data can be used to primarily target this demographic segment, while also ensuring the advertising campaign meets other targets as well. For example, solo travelers are most likely to come from single-person households, which is logged directly in the comScore data through a "number of people in the household" variable. However, another key solo traveling group on the rise is wealthier multiple-person householders in which one person chooses to travel alone for some reason (typically a family reunion cruise, a bachelorette party cruise, etc.) (Clements 2013). Because of this, a marketing campaign for these single-person cruise fares must include multiple constraints.

One very interesting extension of the PAC method not currently implemented in other optimization procedures is the ability to optimize CTR while also constraining reach. That is, a firm like NCL can maximize the number of clicks on its ad but also set constraints on the characteristics of the customers to whom that ad appears, meaning NCL should get its clicks from more likely, target cruise buyers. PAC can do this by constraining the average number of views for a user in a target group $H$ to be some factor (say, two times) larger than the average number of views for a user not in the target group. The constraint thus looks like this:

$$\frac{1}{n_H} \sum_{i \in H} \sum_{j=1}^{p} z_{ij} \upsilon_j \beta_j \;\geq\; 2 \cdot \frac{1}{n - n_H} \sum_{i \notin H} \sum_{j=1}^{p} z_{ij} \upsilon_j \beta_j, \qquad (3.26)$$

where $n_H$ is the number of people in the target group, and $z_{ij} \upsilon_j \beta_j$ represents the expected number of ad appearances to person $i$ at website $j$ (since $z_{ij}$ is the number of times person $i$ visits website $j$, and $\upsilon_j \beta_j$ is the probability the ad appears to user $i$ at website $j$ on any given visit). This can be used to extend the analysis of Section 2.3.4 to a more advanced NCL cruise campaign, as demonstrated in the next section.
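Because constraint (3.26) is linear in $\beta$, it reduces to a single row of the constraint matrix. The sketch below assembles that row from the visit matrix and the effectiveness values; the function and variable names are illustrative assumptions rather than the dissertation's actual code.

```python
import numpy as np

def target_views_constraint(Z, upsilon, in_target, factor=2.0):
    """Build a row c so that c @ beta >= 0 encodes constraint (3.26).

    Z         : (n, p) visit counts z_ij
    upsilon   : (p,) per-dollar ad-appearance probabilities
    in_target : (n,) boolean mask for membership in the target group H
    factor    : required multiple of average views (2.0 in the text)
    """
    n_H = in_target.sum()
    n_other = (~in_target).sum()
    # Average expected ad views per person, written as a linear function of beta
    target_avg = Z[in_target].sum(axis=0) * upsilon / n_H
    other_avg = Z[~in_target].sum(axis=0) * upsilon / n_other
    return target_avg - factor * other_avg   # constraint: c @ beta >= 0

# Example with hypothetical data
rng = np.random.default_rng(1)
Z = rng.poisson(2.0, size=(200, 10))
upsilon = np.full(10, 0.01)
in_target = rng.random(200) < 0.3
c = target_views_constraint(Z, upsilon, in_target)
print(c.shape)   # (10,)
```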
3.5. Case Study: Cruise Line Internet Marketing Campaign

In this section, I again consider an exemplar real-world case study for the cruise line Norwegian Cruise Lines (NCL). As discussed in Section 1.4, each year the cruise industry advertises for its annual "wave season", a promotional cruise period which begins in January. NCL participates heavily in wave season (Satchell 2011). Because consumers who are interested in booking a cruise often use travel aggregation sites like Orbitz and Priceline to compare offerings across multiple cruise lines, and cruise lines frequently want to make these sales known to potential customers without sacrificing clickthrough to their own websites, this case study is ideal for demonstrating both PAC in general and the common advertising constraints discussed in Section 3.4, as well as providing a direct comparison to the method in Chapter 2.

Because the wave season sale begins in January, I again consider the comScore data from January 2011 to approximate an NCL advertising campaign. Following the procedure outlined in Section 1.4, I manually went through the comScore data set to identify the 500 most visited websites which also supported Internet display ads. The data then contains a record of every computer which visited at least one of these 500 websites at least once (48,628 users). Thus $Z$ is a 48,628 by 500 matrix. I then separate the full data set into a ten percent training data set (4,863 users) on which the methods are fit and a ninety percent holdout data set used to calculate the resulting reach, again mimicking the procedure used in Section 2.3.

The results using the PAC methodology are laid out as follows. Section 3.5.1 demonstrates PAC in the setting in which NCL wishes to maximize only reach. The section shows how PAC performs in this reach setup in two scenarios: (1) in the absence of constraints, and (2) with a simple constraint like the mandatory media coverage extension of Section 2.2.3. Section 3.5.2 demonstrates PAC in the setting in which NCL wishes to maximize CTR. Although no directly analogous methods exist for modeling CTR, some comparisons can be made to naive and modified reach approaches, as I demonstrate here both for the simple single-constraint case as well as a scenario in which NCL wishes to maximize CTR in the presence of several constraints, including constraints on reach for target customers.

3.5.1 Internet Media Metric 1: Maximizing Reach

Reach

I first consider a naive advertising budget allocation problem, analogous to an unconstrained reach problem. Here, NCL simply wishes to maximize the reach of its ad campaign for certain budget levels without having any additional insights into the campaign. While this scenario considers a case in which NCL does not have demographic information for users who visit the 500 websites under consideration, NCL does have access to facets of the advertising campaign, such as the cost of advertising on each of the 500 websites in the data set and their corresponding traffic. Hence $\upsilon_j$ is available for each website as defined in Section 3.4. While this is necessarily less complex than a campaign being run in a real-world situation, this scenario provides a good baseline comparison to existing methods, since reach is a commonly considered advertising metric and this setup has no constraints. Figure 3.1 demonstrates how reach changes with budget for this scenario for a variety of allocation methods.
PAC is computed by solving (3.22) (dashed brown line) and compared to four naive methods: equal budget allocation across all 500 websites (solid green) and cost-adjusted allocation across the top 10 (dotted black), 25 (light dashed cyan), and 50 most-visited websites (light dotted-dashed purple). In addition, I compare this to the more state-of-the-art method described in Chapter 2. To easily differentiate between the two methods here, I refer to the method from Chapter 2 as OLIMS (Optimal Large-scale Internet Media Selection). Recall OLIMS (dotted-dashed red) is an unconstrained optimization based on modeling views of Internet ads as a Poisson arrival process. In this way, OLIMS is similar to an unconstrained PAC, since PAC takes a similar approach except that it assumes a binomial process. However, because the two methods are based on two related but separate distributions, the two optimization functions are slightly different as well. Because of this, each method has its own definition of reach. (The binomial PAC reach function is defined as in Section 3.4 and is actually used as the neutral comparison method between the OLIMS and Danaher et al. methods in Section 2.3.1; see Chapter 2 for the OLIMS Poisson definition of reach. The two are related in that the Poisson is the limiting distribution of the binomial, though the binomial is non-convex as discussed in Section 3.4 and further in Appendix C.2.)

Figure 3.1: Plots of the reach calculated using (left) the OLIMS reach criterion and (right) the PAC reach criterion for 500 websites using six methods: the PAC (dashed brown), OLIMS (dotted-dashed red), allocation by cost across the top 50 cost-adjusted websites (light dotted-dashed purple), allocation by cost across the top 25 cost-adjusted websites (light dashed cyan), allocation by cost across the top 10 cost-adjusted websites (dotted black), and equal allocation across all 500 websites (solid green).

As Figure 3.1 shows, the four naive methods perform similarly and significantly below the two more sophisticated optimization procedures. More interestingly, OLIMS and PAC perform similarly to each other regardless of the definition of reach. As expected, OLIMS slightly outperforms PAC when using the OLIMS definition of reach, while PAC slightly outperforms OLIMS with the PAC definition of reach. In this basic reach advertising setup, PAC performs comparably to the most sophisticated reach optimization available, which is not surprising given the Poisson is the limiting distribution for the binomial. Because of this similarity, for simplicity in what follows I will only report results using the PAC definition of reach.

Reach subject to Mandatory Media Coverage

In practice, NCL will likely not wish to run an unconstrained campaign as in Section 3.5.1. For real-life advertising campaigns, firms attempt to leverage business insights in order to improve their advertising campaigns by reaching target customers. Although NCL does want to reach as many potential cruisers as possible, it also knows which characteristics make a consumer more likely to purchase a cruise. For example, because consumers who are interested in booking a cruise often use travel aggregation sites like Orbitz and Priceline to compare offerings across multiple cruise lines, NCL will reach more likely customers at these websites.
Because of this, NCL may want to allocate at least a minimum amount of budget (say, 20%) to a set of major aggregate travel websites. This induces a constraint in the optimization: NCL wishes to optimize total overall reach, but subject to 20% of the budget being spent at the set of aggregate travel websites. While PAC can handle this directly through a single constraint (where $C_S$ identifies the aggregate travel sites and $b = 0.20B$), OLIMS cannot handle constraints in this form. Instead, OLIMS handles budget allocation in this scenario by placing a minimum budget at each of the aggregate travel websites, thus ensuring at least 20% of the budget overall is allocated to these sites. Eight major aggregate websites (CheapTickets.com, Expedia.com, Hotwire.com, Kayak.com, Orbitz.com, Priceline.com, Travelocity.com, and TripAdvisor.com) exist in the cleaned January 2011 comScore data set. Because NCL wants 20% of its budget to be allocated to this set, OLIMS will satisfy the constraint by requiring the optimization to place at least 2.5% of the budget at each of these eight sites, as seen in Section 2.3.4.

Figure 3.2 shows the results for the changes in reach with budget, both on the overall data set and on the target users, the ones who have visited the aggregate travel websites. Here, I compare both the constrained (thick black dashed line) and unconstrained PAC (thin brown dashed line) to two naive methods as well: equal budget allocation across the eight aggregate travel websites (solid green line) and cost-adjusted allocation across those eight aggregate websites (dotted purple line). In addition, I again compare to the OLIMS method. In this scenario, however, there is both a constrained (thick blue dotted-dashed line) and unconstrained (thin red dotted-dashed line) OLIMS procedure as well, since OLIMS can accommodate the allocation constraint by setting minimum budget values at each of the eight websites.

Figure 3.2: Plots of the reach calculated using the PAC reach criterion for 500 websites on (left) the full data set and (right) the subset of travel users using six methods: the unconstrained PAC (thin dashed brown), the unconstrained OLIMS (thin dotted-dashed red), the constrained PAC (thick dashed black), the constrained OLIMS (thick dotted-dashed blue), allocation by cost across the subset of travel websites (dotted purple), and equal allocation across the subset of travel websites (solid green).

As Figure 3.2 shows, once constraints are introduced, PAC begins to consistently outperform OLIMS and the naive methods. Because the PAC optimization incorporates the budget allocation constraint directly, it has more flexibility in allocating across the subset of websites. OLIMS is forced to allocate a minimum to each website, whether that website is preferred over others or not. Most importantly, however, on the target subset of users (those who visit travel websites, shown in the right panel), the constrained PAC very clearly outperforms all other methods, while overall reach is relatively unchanged between the constrained and unconstrained PAC methods. This means NCL is reaching its target customers at the aggregate sites without sacrificing much overall reach. In contrast, the naive allocation methods actually beat the constrained OLIMS on the aggregate travel users' subset.
PAC provides an option to maximize reach over the target consumer base without losing other potential customers at the non-aggregate websites.

3.5.2 Internet Media Metric 2: Maximizing Clickthrough

In this section, I consider a more sophisticated advertising budget allocation problem: allocating budget to maximize clickthrough, as described in Section 3.4. Here, NCL wishes to maximize the number of people who click on its ad for certain budget levels. Maximizing clickthrough rate (CTR) is a more recent area of interest in the marketing literature, and as such it has been far less explored than the traditional reach setup. This section demonstrates how PAC can be used with CTR as an alternate metric for advertisers running a directed click campaign.

Clickthrough Rate

I first imagine NCL wishes to maximize the CTR for its ad campaign at certain budget levels, since clicking on the ad directs users to NCL's own website and increases the likelihood of purchasing a cruise with NCL specifically. To approximate the CTR for each of the 500 websites under consideration, I use MediaMind's 2012 Global Benchmarks Report (med 2015). This report provides average display ad clickthrough rates by industry for 2011-2012. Thus to provide CTR values for each website, $q_j$, I first group the websites by industry, then use the industry average as the optimization CTR. In practice, advertisers would have specific CTR values at each website for their campaign and would update these throughout the campaign. These values are specifically chosen to correspond to the 2011 data used in the case study analysis, but the 0.12% average banner ad CTR for North America provided by MediaMind approximates well the April 2016 industry standard of 0.10% (Ramirez and King 2016, Chaffey 2016).

For comparison purposes, first consider a campaign analogous to the one in Section 3.5.1 above. Instead of maximizing reach subject to a constraint on the subset of aggregate travel websites, NCL wishes to maximize CTR subject to the same budget constraint. As shown in Section 3.4, PAC using (3.23) does this directly, but I am not aware of any other method that currently exists to maximize CTR on a large-scale problem such as a 500-website optimization. To compare, however, I provide a proxy to the OLIMS procedure. Whereas the typical OLIMS methodology is designed for reach only, the reach criterion can be modified to incorporate a CTR parameter by multiplying the probability of an ad appearance by the probability a user will click on it (the CTR parameter, $q_j$).

Figure 3.3: Plots of the clickthrough rate calculated using the PAC reach criterion for 500 websites on (left) the full data set and (right) the subset of travel users using six methods: the unconstrained PAC (thin dashed brown), the unconstrained OLIMS proxy (thin dotted-dashed red), the constrained PAC (thick dashed black), the constrained OLIMS proxy (thick dotted-dashed blue), allocation by cost across the subset of travel websites (dotted purple), and equal allocation across the subset of travel websites (solid green).
This is not directly a CTR optimization, since CTR is defined as the proportion of users who click on an ad at least once and thus does not fit neatly into a Poisson arrival process definition, but in the absence of other analogous methods, it works well for comparison purposes. Figure 3.3 shows the results for the changes in CTR with budget, both on the overall data set and on the target users who have visited the aggregate travel websites. Again, I compare both the constrained (thick black dashed line) and unconstrained (thin brown dashed line) PAC to the two naive methods: equal budget allocation across the eight aggregate travel websites (solid green line) and cost-adjusted allocation across those eight aggregate websites (dotted purple line). In addition, I again compare to the OLIMS method, which in this case is the OLIMS proxy. In this scenario both a constrained (thick blue dotted-dashed line) 76 and unconstrained OLIMS proxy (thin red dotted-dashed line) are also again present, since the proxy does not change OLIMS’s accommodation of the allocation constraint by setting minimum budget values at each of the eight websites. Figure 3.3 displays similar results to Figure 3.2, as expected since PAC optimizes CTR similarly to reach. Because constraints are still present, PAC continues to consistently outperform OLIMS and the naive methods. Overall clickthrough is much lower than reach (again, as expected, since not all users who see the ad will click on it), but for the subset of aggregate travel site visitors, CTR is almost double that of the overall advertising campaign. This indicates NCL is directing more of the likely cruise purchasers to its website. These clicks are presumably much more valuable than clicks from the general Internet population. In addition, the relative ranking of the methods is unchanged by using CTR rather than reach. This verifies that the PAC method is flexible enough to accommodate more complex optimization functions. Clickthrough Rate subject to Multiple Constraints Though the situations considered previously demonstrate simple constraints known to be of interest to online marketers, as noted in Section 3.4, often firms wish to consider many constraints on a single campaign. Section 3.4 discusses an example of this for NCL, in which NCL expanded its cruise line offerings into the solo cruise buyers market. Solo cruise line travelers traditionally have well-known demographics, as discussed by Clements (2013). Solo cruise line travelers typically fall into an age range of 30-59 with an income of $35,000 to $70,000 (Clements 2013). Using these target metrics, NCL can design a marketing campaign for these single-person cruise fares with multiple constraints, in particular using Equation 3.26 from Section 3.4. This procedure can be used to develop some potential constraints for a more complex NCL cruise campaign. First, since solo cruises are aimed primarily at single-person households in the 30-59 age range with incomes between $35,000 and $70,000, NCL may wish to constrain the optimization to ensure twice as many average views come from this group as from all others, i.e. a randomly chosen person in this group is twice as likely to have viewed the ad as someone not in this group. In addition, NCL still wishes to constrain the optimization to force 20% of the budget to the aggregate travel websites as previously considered. 
Further, NCL can add six additional constraints corresponding to the seven income levels provided by comScore; the optimization can be constrained to ensure a randomly chosen person at a higher income level is more 77 likely to have seen the ad than a randomly chosen person at a lower income level (that is, average ad views for the highest income level are higher than average ad views from the second-highest, which are higher than average ad views from the third-highest, etc.). The seven income levels given by comScore are below $15,000, from $15,000 to $24,999, from $25,000 to $34,999, from $35,000 to $49,999, from $50,000 to $74,999, from $75,000 to $99,999, and above $100,000. Finally, because solo travelers necessarily travel without children, NCL will likely wish to constrain the optimization to have twice as many average ad views come from households without children present, and since cruise lines frequently offer rates targeted at specific geographic regions, here the optimization is also constrained to have more ad views from the “West” region than any other. This ultimately provides ten constraints for this example, though firms could conceivably add as many as they have target areas. Comparing the CTR or reach of PAC with other (unconstrained) methods in this multiple-constraint situation provides no useful information over the whole data set, since constraining the optimization will necessarily decrease overall CTR and reach. Instead these results concentrate on using Figure 3.4, which demonstrates that the constraints are satisfied, and Figure 3.5, which demonstrates how CTR varies across the methods for a specific subset of targeted users, to show that PAC effectively enforces these constraints. Figure 3.4 shows graphically that the optimization is indeed satisfying the constraints. For simplicity, I choose two of the ten constraints to represent in the plot: the relative ad views for the “no family” constraint (that is, the two times as many ad views should come from households without children) and the relative ad views for the regional constraint (that is, that more ad views should come from users in the “West” region). For both comparisons here, the plots show the ratio of expected views by someone in the target constraint to the expected views by someone not in the target constraint. In cases where no constraint is present (as in the non-PAC comparison methods), these lines should hover above or below a value of one, since the optimization has no incentive to prioritize any particular group over another. However, the constrained optimization for PAC stays above two in the left panel (since the constraint for households with no children present was twice as many views as with children) and above one in the right panel (since the constraint for households in the West region was simply more than the households in any other region). The constrained PAC method here is in the thicker black line, while the unconstrained PAC method is in dashed brown, the constrained OLIMS method (with the constraint on the eight aggregate travel websites) is in dotted-dashed blue, and the unconstrained OLIMS method is in dotted red. 
Figure 3.4: Plots of the ratio of average ad views between target demographic groups corresponding to two of the ten demographic constraints in the multiple-constraint CTR example for four methods: the constrained PAC method (thicker black line), the unconstrained PAC method (dashed brown), the constrained OLIMS method (dotted-dashed blue), and the unconstrained OLIMS method (dotted red). A solid gray line is plotted at the constraint coefficient (2 for the left panel, 1 for the right panel) for comparison.

Though not shown in Figure 3.4, the constraints here do affect one another, particularly when the constraints target similar characteristics. For example, in the case of the income bracket constraints, the optimization forces successively higher income brackets to have more average ad views than the previous bracket. Because of this, there is an increase in the ratio as well. Most notably, a jump occurs between the "$25,000 to $34,999" and "$35,000 to $49,999" income groups, presumably because an additional constraint for the target group begins to take effect at incomes above $35,000 (because NCL is specifically targeting users in the $35,000 to $70,000 income range). Above and below that constraint, the ratio behaves as expected, with successive income groups hovering around the previous income group, but as expected the constraints work together to achieve more effective targeting.

Figure 3.5: Plots of the clickthrough rate calculated using the PAC reach criterion for 500 websites on the subset of travel users who also come from single-person households, are between the ages of 30 and 59, and have a reported income between $35,000 and $75,000, using four methods: the unconstrained PAC (thin dashed brown), the unconstrained OLIMS proxy (thin dotted-dashed red), the constrained PAC (thick dashed black), and the constrained OLIMS proxy (thick dotted-dashed blue).

To supplement the demonstration of Figure 3.4, I also provide Figure 3.5. Figure 3.5 shows how the four methods considered above perform on a very specific targeted subset of users: single-household users in the target 30-59 age range with incomes between $35,000 and $70,000 ($75,000 using the comScore income ranges) who also visited one or more of the aggregate travel websites during the month of January 2011. Because the optimization will perform successively worse on overall reach metrics with the addition of more and more constraints, this target group demonstrates how NCL is performing with the group of users it most specifically wishes to encourage to click on the solo cruise ad. Presumably in a situation like this, NCL has allocated a given amount of budget to these users, and thus the overall optimization results do not matter. (In fact, in a situation like this where the target group is very specific, NCL would likely run two separate campaigns: one for solo cruise buyers and another for general cruise buyers.)

As Figure 3.5 shows, by incorporating the constraints, PAC (in thick dashed black) vastly improves clickthrough rate relative to the other three methods (the unconstrained PAC in thin dashed brown, the unconstrained OLIMS proxy in thin dotted-dashed red, and the constrained OLIMS proxy in thick dotted-dashed blue).
In fact, only the constrained OLIMS proxy even approaches the success of the PAC, which is expected given at least the OLIMS proxy here has a constraint to guarantee budget allocation to the travel websites. While other methods may be able to be adjusted to get closer to the constrained PAC results, PAC handles these constraints directly and easily, without needing extensive modification. PAC performs demonstrably better, ensuring NCL gets almost a 30% clickthrough rate if it spends $2 million on just the target group (though as Figure 3.5 shows, NCL would likely not spend that much, since it can achieve very similar results with smaller budget spend.) This 30% may seem high compared to the 0.1% average clickthrough rate for a typical website, but note this is an expected value based on the objective function used to model the clickthrough rate. That is, this assumes if a user has an expected 0.1% probability of clicking on an ad given she sees it, and she sees the ad 1,000 times, she would be expected to click once. With enough money spent on the target group, the algorithm eventually buys all ad impressions the target group sees, even if they have clicked on the ad previously or will never click on the ad. 3.6. Discussion While the penalized loss function procedure in Chapter 2 represents a significant step in generalizing loss function optimization, it does not address a more complex class of problems, namely problems requiring both penalization and constraints. Examples of such problems range from well-studied portfolio optimiza- tion to statistical regression procedures to real-world problems which are not rigorously defined. Because of this, a procedure capable of incorporating constraints directly rather than through any sort of ad hoc approach becomes very useful. The methodology outlined in this chapter works for a variety of problems, both for already-studied statistical applications like the generalized LASSO as well as problems outside the area of statistics for which loss function procedures are relatively novel, such as the Internet advertising example. The PAC procedure also demonstrates computationally efficient path algorithms for computing its solutions, which is vital to any procedure being implemented today. The PAC method and algorithm can generally outperform unconstrained estimates, as demonstrated using simulated data. This 81 is true not only when the constraints hold exactly, but also when there is some reasonable error in the assumption on the constraints. Though the general analyses are demonstrated using representative simulated data, in the case of the specific Internet media selection example, PAC can also perform extremely well (better than extant methods) on real-world data. This is demonstrated here using the comScore data in the context of an Internet media campaign, but many other problems can potentially benefit from the use of PAC methodology. By handling constraints directly, PAC presents a significant advantage over current methods used to address problems like Internet media selection. PAC can handle multiple linear constraints with no additional ad hoc optimization requirements, and it is not limited to a single optimization criterion. In the framework of an Internet media campaign, this means firms can run campaigns to maximize reach, clickthrough rate, or almost any other metric for which they have a known function. 82 Chapter 4 Future Directions 4.1. 
Extension 1: OLIMS Bidding on Impression Cost

4.1.1 Model with Cost Curves

For the Internet media campaign example formulated in Sections 2.1.1 and 3.4, the optimization criterion makes one major assumption: the cost to advertise on a website is constant. That is, whether a firm buys one impression or one million impressions at a website, the cost per impression will be the same. This is a reasonable assumption in cases where advertising space is bought well ahead of the campaign. In these situations, firms negotiate with websites before the campaign begins, agreeing to buy a certain amount of ad impressions at a fixed rate. While this used to be the dominant form of Internet media spend, today that practice is reserved for "whitelist" websites, a select group of websites a firm knows will work well for a campaign. The rest of the advertising spend is allocated via bidding on advertising space, in which case an automated bid is placed for each ad impression that appears (Ramirez and King 2016). Because of this, firms outsource their campaigns to ad agencies, which run these campaigns through demand-side platforms (DSPs). These DSPs have some set algorithm (generally unknown to both the contracted agency and the firm) which makes a bid for an available ad space based on the incoming characteristics of the ad impression's placement and the user viewing it. The DSP's algorithm makes this decision almost instantaneously, and these algorithms typically follow rule-based procedures involving the demographic and historical characteristics associated with the ad appearance (Wang 2015).

In this situation, the price of an ad is not constant even at a given website. Here, anyone wishing to advertise at a website must deal with cost curves, which describe the relationship between the bid price of an ad and the likelihood of winning that bid. Note, however, that if a company places the same bid on every ad impression available, this is equivalent to purchasing that same percentage of total ad space. That is, if bidding at a CPM of $3.00 results in a 50% chance of winning the bid for a single ad impression, bidding $3.00 on all available ad impressions which appear will lead to an expected buy of 50% of the advertising space.

Figure 4.1: Plots of the simplified cost curve formulation for four values of $a_j$: $a_j = 0.5$ (top left), $a_j = 1$ (top right), $a_j = 2.5$ (bottom left), and $a_j = 5$ (bottom right).

While this type of advertising spend is observed much more readily in practice than pre-purchased advertising space, it complicates the optimization criterion. Specifically, instead of having a constant $c_j$ value for each website, the optimization takes a $c_j$ function. Denote this cost curve function as $F_j(c_j)$, since now the share of impressions bought at website $j$ depends on the CPM a firm has paid for them. For a very simple illustration of potential cost curves, see Figure 4.1. In this figure, I have generated four example cost curves from the function $\frac{c_j}{a_j + c_j}$ using various $a_j$ values over a range of CPM values from 0 to 10 to show how variations in $a_j$ affect the shape of the curve. Here, the four panels correspond to $a_j = 0.5$ (top left), $a_j = 1$ (top right), $a_j = 2.5$ (bottom left), and $a_j = 5$ (bottom right).
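To make the illustration concrete, the following short sketch evaluates the simplified cost curve $c/(a_j + c)$ used to generate Figure 4.1; the CPM grid and the printed summaries are purely illustrative.

```python
import numpy as np

def simple_cost_curve(cpm, a_j):
    """Share of impressions won when bidding `cpm`, using the illustrative form c / (a_j + c)."""
    return cpm / (a_j + cpm)

cpm_grid = np.linspace(0.0, 10.0, 101)
for a_j in (0.5, 1.0, 2.5, 5.0):          # the four panels of Figure 4.1
    F = simple_cost_curve(cpm_grid, a_j)
    print(f"a_j={a_j}: F(3.0)={simple_cost_curve(3.0, a_j):.2f}, F(10.0)={F[-1]:.2f}")
```

Smaller $a_j$ values make the curve rise more steeply, so most of a site's inventory can be won with low bids, while larger $a_j$ values require substantially higher bids to buy the same share.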
This cost curve framework requires an update to the objective function in Section 2.2, since $\beta_j$ is no longer a linear function of cost. Following the formulation laid out in Equation 2.12, the reach criterion is now

$$L(c) = \frac{1}{n} \sum_{i=1}^{n} e^{-\sum_{j=1}^{p} z_{ij} F_j(c_j)}, \qquad (4.1)$$

where $F_j(c_j)$ is the share of impressions, previously denoted as $s_j$. In this formulation, what was previously $\beta_j$ would now be $F_j(c_j)\,\tau_j c_j / 1000$: the total cost to purchase a given number of impressions is the share of impressions bought, $F_j(c_j)$, multiplied by the cost to purchase all impressions at that site, $\tau_j c_j / 1000$.

Note this means the choice of budget spend at website $j$ now depends on $c_j$ rather than $\beta_j$, since changing $c_j$ also changes the amount of advertising space bought. Because of this, the optimization is actually carried out over the $c_j$ values rather than the previous $\beta_j$, which complicates the optimization. However, one nice feature of this reformulation is that the $F_j(c_j)$ curves are typically known to advertisers. Ad agencies do a period of burn-in bidding, in which the agency submits bids at a wide range of prices, then charts which bids were accepted and which were not (Ramirez and King 2016). Assuming these cost curves are relatively smooth and well-behaved, the same procedure and methodology of Section 2.2 can be used.

To approximate this using the second-order Taylor method as before, again take a quadratic approximation to the overall $L(c)$ function. (Footnote 1: If desired, due to an extremely complicated $F_j$ function, a linear approximation to $F_j$ could be done first, followed by the usual quadratic approximation to the function $L(c)$.) To make it easier to follow, fix a customer $i$ and consider calculating the derivatives with respect to a single coordinate, $c_j$. Then the initial $L$ function becomes

$$L(c_j) = e^{-\sum_{j=1}^{p} z_{ij} F_j(c_j)}. \qquad (4.2)$$

The first and second derivatives of $L(c_j)$ are

$$L'(c_j) = -z_{ij} F_j'(c_j)\, e^{-\sum_{j=1}^{p} z_{ij} F_j(c_j)} \quad \text{and} \quad L''(c_j) = z_{ij}\, e^{-\sum_{j=1}^{p} z_{ij} F_j(c_j)} \left( z_{ij} F_j'(c_j)^2 - F_j''(c_j) \right).$$

Thus the quadratic approximation (around the previous $c_j$ estimate, $\tilde{c}_j$) becomes

$$L(c_j) \approx e^{-\sum_{j=1}^{p} z_{ij} F_j(\tilde{c}_j)} - z_{ij} F_j'(\tilde{c}_j)\, e^{-\sum_{j=1}^{p} z_{ij} F_j(\tilde{c}_j)} (c_j - \tilde{c}_j) + \frac{1}{2} z_{ij}\, e^{-\sum_{j=1}^{p} z_{ij} F_j(\tilde{c}_j)} \left( z_{ij} F_j'(\tilde{c}_j)^2 - F_j''(\tilde{c}_j) \right) (c_j - \tilde{c}_j)^2. \qquad (4.3)$$

In cases where the cost curve presents more of an obstacle to the second-order Taylor method, it may be preferable to simplify the optimization procedure by first taking a first-order approximation to $F_j(c_j)$ before doing the second-order Taylor approximation. In this case, the linear approximation to $z_{ij} F_j(c_j)$ can be represented as $K_{ij} + F_j'(\tilde{c}_j) c_j z_{ij}$, where $K_{ij}$ is a constant term. The $L(c_j)$ function of Equation 4.2 then has the following first and second derivatives (again for a single coordinate $c_j$ and a specific individual $i$):

$$L'(c_j) = -z_{ij} F_j'(\tilde{c}_j)\, e^{-\sum_{j=1}^{p} \left( K_{ij} + F_j'(\tilde{c}_j) c_j z_{ij} \right)} \quad \text{and} \quad L''(c_j) = z_{ij}^2 F_j'(\tilde{c}_j)^2\, e^{-\sum_{j=1}^{p} \left( K_{ij} + F_j'(\tilde{c}_j) c_j z_{ij} \right)}.$$

Thus the quadratic Taylor approximation (around the previous $c_j$ estimate, $\tilde{c}_j$) becomes

$$L(c_j) \approx e^{-\sum_{j=1}^{p} \left( K_{ij} + F_j'(\tilde{c}_j) \tilde{c}_j z_{ij} \right)} - z_{ij} F_j'(\tilde{c}_j)\, e^{-\sum_{j=1}^{p} \left( K_{ij} + F_j'(\tilde{c}_j) \tilde{c}_j z_{ij} \right)} (c_j - \tilde{c}_j) + \frac{1}{2} z_{ij}^2 F_j'(\tilde{c}_j)^2\, e^{-\sum_{j=1}^{p} \left( K_{ij} + F_j'(\tilde{c}_j) \tilde{c}_j z_{ij} \right)} (c_j - \tilde{c}_j)^2. \qquad (4.4)$$
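The coefficients in the quadratic approximation (4.3) are straightforward to compute once $F_j$, $F_j'$, and $F_j''$ are available. The sketch below does this for one user and one coordinate, using the simple illustrative curve $c/(a_j + c)$; the function and variable names are assumptions made for the example, not part of the OLIMS code.

```python
import numpy as np

def quadratic_coeffs(c_tilde, Z_i, F, dF, d2F, j):
    """Second-order Taylor terms of L(c_j) = exp(-sum_j z_ij F_j(c_j)) around c_tilde[j]
    for one user i and one coordinate j, following (4.3).

    c_tilde     : (p,) current CPM estimates
    Z_i         : (p,) visit counts z_ij for user i
    F, dF, d2F  : callables returning F_j(c), F_j'(c), F_j''(c) for (j, c)
    """
    exponent = -sum(Z_i[k] * F(k, c_tilde[k]) for k in range(len(c_tilde)))
    L0 = np.exp(exponent)                                    # L evaluated at c_tilde
    g = -Z_i[j] * dF(j, c_tilde[j]) * L0                     # first derivative in c_j
    h = Z_i[j] * L0 * (Z_i[j] * dF(j, c_tilde[j])**2 - d2F(j, c_tilde[j]))   # second derivative
    # L(c_j) is approximated by L0 + g*(c_j - c_tilde_j) + 0.5*h*(c_j - c_tilde_j)^2
    return L0, g, h

# Example with the simple curve F_j(c) = c / (a_j + c) and hypothetical inputs
a = np.array([0.5, 1.0, 2.5])
F   = lambda j, c: c / (a[j] + c)
dF  = lambda j, c: a[j] / (a[j] + c)**2
d2F = lambda j, c: -2.0 * a[j] / (a[j] + c)**3
L0, g, h = quadratic_coeffs(np.array([3.0, 3.0, 3.0]), np.array([2, 0, 5]), F, dF, d2F, j=0)
print(L0, g, h)
```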
There is one other downside to having a cost curve rather than a fixed cost per website: previously the budget constraint on the optimization, $\sum_{j=1}^{p} \beta_j \leq B$, was linear. Now, however, the constraint becomes $\sum_{j=1}^{p} \tau_j c_j F_j(c_j)/1000 \leq B$, which is nonlinear. This can be handled easily through a linear approximation to $\tau_j c_j F_j(c_j)/1000$ for the constraint instead.

As demonstrated here, the penalized loss function methodology still works in cases of non-constant cost. Further, the method gives a transparent algorithm in the face of black-box DSP algorithms, though the proposed method would be unlikely to be used in instantaneous bidding situations. That is, the algorithm would not need to be run every time a new ad impression appeared. The algorithm could be run before the campaign, leveraging relationships among viewership at websites to provide a prescriptive bidding heuristic or budget-setting tool for the campaign, then updated as needed throughout the remainder of the campaign.

4.1.2 Toy Demonstration Example: 5 Websites

First, to demonstrate how this cost curves approach performs in practice (and compares to the fixed cost approach of Section 2.2), consider a toy example with five simulated sample websites. I simulate these in order to control certain facets of the webpage viewership distribution across $Z$, since otherwise it is hard to see precisely how the OLIMS method allocates budget (and beyond five websites, it becomes difficult to represent this allocation graphically).

The $Z$ matrix for the five websites is generated according to the parameters for number of page views and number of unique visitors given in Table 4.1. In each case, the number of unique visitors was kept constant, so the only factors influencing the budget allocation across the five websites would be the total viewership of the websites and the cost curves of the websites. These websites correspond to moderately- to highly-visited websites. For example, Website 3 has 50,000,000 total webpages viewed across its 10,000 unique visitors, meaning an average of 5,000 webpages viewed per user. This is an example of a highly-visited website, since if this data is monthly, each user is viewing over 150 webpages at this site per day (on average). This is consistent with the highly-visited websites catalogued in the real comScore, Inc. data.

Although in practice a DSP would have derived the underlying cost curves for each website through its initial "burn-in" advertising period, for these case studies I demonstrate the method using a reasonable cost curve approximation: the logistic curve. A standard logistic cost curve follows the equation

$$F_j(c_j) = \frac{e^{a_j c_j + b_j}}{1 + e^{a_j c_j + b_j}},$$

where $c_j$ is the CPM for website $j$ and $a_j$ and $b_j$ are the standard parameters for the logistic curve. For simplicity, I assume $b_j = 0$ for all $j$. Also, since the standard logistic function does not equal zero at $c_j = 0$ (it only approaches zero asymptotically as $c_j$ decreases), I shift the logistic curve to ensure $F_j(c_j) = 0$ when $c_j = 0$ (since otherwise the function would allow for small amounts of reach from an ad that is not appearing at a website, i.e. when $c_j = 0$). The $a_j$ parameter for each cost curve is then found by setting the $F_j(c_j)$ function equal to 50% at the average CPM, $c_0$.
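The text does not spell out the exact form of the shift, so the sketch below uses one plausible choice: rescale the logistic so it starts at zero and still approaches one, then solve in closed form for $a_j$ so the curve equals 50% at the average CPM $c_0$. Both the specific shift and the calibration formula are assumptions made for illustration.

```python
import numpy as np

def shifted_logistic(cpm, a_j):
    """Shifted logistic cost curve: equals 0 at cpm = 0 and approaches 1 for large bids.
    One plausible shift: rescale sigma(a_j * c) so the curve starts at zero."""
    sig = 1.0 / (1.0 + np.exp(-a_j * cpm))
    return 2.0 * sig - 1.0               # equivalently tanh(a_j * cpm / 2)

def calibrate_a(c0):
    """Choose a_j so the win probability is 50% at the average CPM c0."""
    # Solve 2*sigma(a*c0) - 1 = 0.5  =>  a = ln(3) / c0
    return np.log(3.0) / c0

for c0 in (3.0, 4.0, 5.0, 6.0, 7.0):     # the five toy websites' average CPMs
    a_j = calibrate_a(c0)
    print(f"c0={c0}: F(0)={shifted_logistic(0.0, a_j):.2f}, "
          f"F(c0)={shifted_logistic(c0, a_j):.2f}")
```

Any monotone curve with $F_j(0) = 0$ and $F_j(c_0) = 0.5$ would serve the same illustrative purpose; the rescaled logistic is simply convenient because its calibration has a closed form.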
Table 4.1: Average CPM, total number of page views, and number of unique visitors for the five-website toy example.

Website      $c_0$ (Average CPM)    Total Page Views (in mil.)    Unique Visitors
Website 1    3.0                    30                            10,000
Website 2    4.0                    20                            10,000
Website 3    5.0                    50                            10,000
Website 4    6.0                    10                            10,000
Website 5    7.0                    40                            10,000

To demonstrate how the method allocates budget and how differences in viewership across the websites can affect that allocation, each website is given a different $c_0$ value as well as a different number of total page views. As noted above, the number of unique visitors stays the same for each website, as seen in Table 4.1.

The left panel of Figure 4.2 shows the five cost curves plotted on the same chart, where the chance of winning a bid at a particular cost point (and hence the proportion of total available ad space at the website a firm could buy at that cost) is plotted against the bid CPM. Here, the solid black curve has the lowest average CPM (3.0, for Website 1), followed by the thin dashed blue curve (4.0, Website 2), then the dotted red curve (5.0, Website 3), then the dotted-dashed green curve (6.0, Website 4), and finally the thick dashed purple curve (7.0, Website 5). The right panel of Figure 4.2 shows the number of impressions an advertiser buys for each cost curve at each CPM value. This provides a more complete picture of the considerations used in allocating budget, since based purely on the cost curves, the solid black line for Website 1 with an average CPM of 3.0 looks like the best option for advertising. However, Website 3's middle red line ($c_0 = 5.0$) ends up being a better investment overall.

Figure 4.2: Five example cost curves (left) and the corresponding total impressions curves (right). The solid black curve has an average CPM of 3.0 and total page views of 30 million (Website 1 in Table 4.1); the thin dashed blue curve has an average CPM of 4.0 and total page views of 20 million (Website 2); the dotted red curve has an average CPM of 5.0 and total page views of 50 million (Website 3); the dotted-dashed green curve has an average CPM of 6.0 and total page views of 10 million (Website 4); the thick dashed purple curve has an average CPM of 7.0 and total page views of 40 million (Website 5).

Figure 4.3 demonstrates how budget is being allocated at the five websites at three particular budget points: $100,000 (star symbol), $500,000 (triangle symbol), and $2,000,000 (box symbol). The left-hand plot shows the total reach achieved by the OLIMS method and where the three budgets fall by symbol. On the right-hand side, the five website cost curves show how budget was allocated across the five websites by the algorithm; the curves correspond to Websites 1 through 5 exactly as in Figure 4.2, using the same line styles. Each symbol on each cost curve corresponds to the CPM bid at that particular website for that particular budget. (For example, at the $2,000,000 budget level, the algorithm was bidding almost 17.5 at Website 3, which has a $c_0$ of 5.0.)
As the plots in Figure 4.3 demonstrate, the algorithm first allocates the majority of the budget to the website with the best total impressions for the cost (Website 3). Then, as viewership at that website is maxed out, the algorithm begins to allocate to the next best websites (Websites 1 and 5), and finally it allocates to the websites with the lowest impressions per cost (Websites 2 and 4).

Figure 4.3: The reach of the OLIMS optimization procedure (left), with three key budgets identified ($100,000 with the star symbol, $500,000 with the triangle symbol, and $2,000,000 with the box symbol), and (right) the five example website cost curves with the chosen $c_j$ values for each budget, i.e. the total budget spent at each website to purchase those impressions. The solid black curve has an average CPM of 3.0 and total page views of 30 million (Website 1); the thin dashed blue curve has an average CPM of 4.0 and total page views of 20 million (Website 2); the dotted red curve has an average CPM of 5.0 and total page views of 50 million (Website 3); the dotted-dashed green curve has an average CPM of 6.0 and total page views of 10 million (Website 4); the thick dashed purple curve has an average CPM of 7.0 and total page views of 40 million (Website 5).

4.1.3 Companion Results to OLIMS in Section 2.3

The cost curve approach actually provides very similar results to the fixed-cost approach of Section 2.2. To compare, I ran the same optimizations as in Section 2.3 with the same data matrices and $\tau_j$ values. Now, however, instead of a constant, fixed CPM at each website, each website has a cost curve associated with its advertising space. These cost curves are generated in the same manner as in the preceding toy example; that is, the previous fixed CPM value is now the CPM value corresponding to a 50% chance of making a successful bid according to the website's shifted logistic cost curve.

Though all the analyses of Section 2.3 can be repeated using the updated cost curve methodology, here I present two example results to demonstrate how similar the results are. (Note the resulting reach curves are similar, though this does not necessarily mean the resulting allocations are similar.) For example, I first repeat the analysis with 5000 simulated websites, as seen in Section 2.3.2. I use the same 5000-website matrix, simulated without correlation. Further, I again run the OLIMS method over a 10% subset of the 50,000 users, then calculate reach on the 90% holdout data in the result comparisons. I compare OLIMS to the same benchmark approaches: 1) equal allocation over all 5000 websites; and 2) cost-adjusted equal allocation (i.e. average number of visits/CPM) over the most visited 25, 50, and 100 websites. One further step in the cost curves approach is that in order to compare to methods based on a particular budget allocation (e.g. the equal budget allocation across the 5000 websites), computing the share of impressions bought requires finding the point on a cost curve corresponding to a given budget (that is, where the spend implied by $F_j(c_j)$ equals the allocated $\beta_j$ from the previous formulation). Because there is a one-to-one correspondence, however, this is not a difficult procedure.
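Because the spend at a website is increasing in the bid CPM, that inversion can be done with a simple one-dimensional search. The sketch below uses bisection with a shifted-logistic curve; the function names, tolerance, and upper cap are assumptions made for illustration.

```python
import numpy as np

def cpm_for_budget(budget, tau_j, F_j, c_max=1000.0, tol=1e-6):
    """Invert the spend curve: find c_j so that tau_j * c_j * F_j(c_j) / 1000 == budget.

    budget : dollars allocated to website j (e.g. from an equal-allocation benchmark)
    tau_j  : expected total impressions at website j
    F_j    : monotone cost curve giving the share of impressions won at CPM c
    Bisection suffices because total spend is increasing in c.
    """
    spend = lambda c: tau_j * c * F_j(c) / 1000.0
    lo, hi = 0.0, c_max
    if spend(hi) < budget:                 # budget exceeds the site's full inventory cost
        return hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if spend(mid) < budget else (lo, mid)
    return 0.5 * (lo + hi)

# Example: Website 3 of the toy study (tau = 50 million, shifted logistic with c0 = 5)
F3 = lambda c: 2.0 / (1.0 + np.exp(-np.log(3.0) / 5.0 * c)) - 1.0
c_star = cpm_for_budget(budget=100_000, tau_j=50_000_000, F_j=F3)
print(round(c_star, 2), round(F3(c_star), 3))   # CPM bid and resulting impression share
```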
Figure 4.4 shows the resulting reach comparisons, where (as in Figure 2.3) the reach on the full data is given by the solid black line, the reach using OLIMS on the subset is in the heavy dashed blue line, the reach on the top 100 cost-adjusted websites is in the dashed purple line, the reach on the top 50 cost-adjusted websites is in the dotted brown line, the reach on the top 25 cost-adjusted websites is in the dotted red line, and the reach with an equal allocation to all 5000 websites is in the dotted-dashed green line.

Figure 4.4: Reach results for the 5000 websites simulated with no viewership correlations in the cost curve OLIMS approach, where the optimal reach (solid black) is calculated using the full data set, the OLIMS method reach (heavy dashed blue) is calculated using the 10% subset, and the naive 100 cost-adjusted reach (dashed purple), 50 cost-adjusted reach (dotted brown), 25 cost-adjusted reach (dotted red), and equal 5000-website reach (dotted-dashed green) are calculated using the given budget allocation.

As Figure 4.4 shows, even with only a ten percent subset of the data, OLIMS yields reach estimates very similar to the optimal reach estimate based on the entire dataset. In addition, OLIMS outperforms all the benchmark approaches. The comparisons show the equal allocation approach is again the worst. The cost-adjusted approaches perform better, but still worse than OLIMS. Overall, the results show OLIMS can be used to effectively allocate advertising budget across a very large set of websites, especially considering the data here is specifically simulated without correlations. Any advantage OLIMS achieves by leveraging relationships among websites is zeroed out here, and yet OLIMS still clearly demonstrates an advantage. The results in the cost curve bidding example show the same pattern as the results in the fixed cost example.

Figure 4.5 shows the results from the OLIMS method with the comparison methods in the repeated case of maximizing reach for the McRib campaign using comScore data from December 2011. As in Section 2.3.3, the December 2011 data was divided into a 10% training data set and a 90% holdout testing data set. The optimal reach (solid black line) is again calculated using the full December 2011 cleaned comScore data set, while the OLIMS method for comparison (blue dashed line) is calculated using the 10% subset. The naive methods (budget allocation using the top 50 cost-adjusted websites in small dotted red, the top 25 cost-adjusted websites in heavy dotted brown, the top 10 cost-adjusted websites in small dashed purple, and the equal allocation in dotted-dashed green) all calculate reach based on allocating the budget accordingly with no optimization.

Figure 4.5 demonstrates that OLIMS again performs equally well with ten percent calibration data in this new cost curves setup. The reach estimates based on the ten percent calibration data are very close to those from the true optimal based on the entire data. Additionally, the reach estimates from the naive approaches are significantly below both, as seen in Section 2.3.3. In fact, this plot looks extremely similar to Figure 2.4, showing that while this cost curves approach may represent a significant step forward in methodology, it does not fundamentally change the results of the OLIMS method.
Figure 4.5: Reach results for the McDonald's McRib Campaign using cost curves, in which the optimal reach (solid black line) is calculated using the full data set, the OLIMS method (heavy dashed blue) is calculated on the 90% holdout data using the 10% training data subset, and the naive methods (top 50 cost-adjusted websites in small dotted red, top 25 cost-adjusted websites in heavy dotted brown, top 10 cost-adjusted websites in small dashed purple, and equal allocation in dotted-dashed green) are all calculated with a non-optimized budget allocation.

4.2. Extension 2: PAC Time-Varying Loss Functions

4.2.1 Overview and Approximation

One major difference between the cost curves setup for the OLIMS method from Section 4.1.1 and the OLIMS method from Section 2.2 is that cost curves are constantly changing in real-time bidding procedures. Whereas with a fixed cost per website, advertisers buy space in advance at sites and know exactly how much they will pay before their campaign is even run, real-time bidding procedures mean an optimization is highly sensitive to how it responds to uncertainty in the distribution of cost. If a firm has no idea what a website will charge to advertise two days from now, how can it plan a week-long campaign?

This is just one example of a situation in which information is updated for subsequent time periods based on information gathered from optimization choices made in initial time periods. While many problems fit neatly into penalized (or penalized and constrained) loss function setups, if these loss functions change over multiple time periods, a much more difficult problem exists. For example, imagine a doctor is trying to develop a treatment plan for a particular region suffering from a disease with an available vaccine. If she only has a limited supply of the vaccine, she must judiciously apply the vaccine to stop the spread of the infection. In this case, it may be advantageous to first identify an initial treatment area, then observe the effects of that treatment before selecting a second treatment area. The difficulty here comes if the doctor wants to optimize the treatment for the two time periods simultaneously, that is, to incorporate the learning from the first time period but optimize over the entire treatment regime.

For this setup, the main optimization function for two time periods (following the operationalization of Equation 2.1 in Section 2.1) would be

$$\min_{\beta} \sum_{i=1}^{n} L(\beta_1; \beta_2; z_{1i}; z_{2i}), \qquad (4.5)$$

where now the loss function $L$ requires two separate $z$ data matrices: $z_1$, which corresponds to data from the first time period, and $z_2$, which corresponds to data from the second time period; and two separate $\beta$ vectors: $\beta_1$, a $p$-dimensional vector of weights for the first time period, and $\beta_2$, a $p$-dimensional vector of weights for the second time period. $L$ here would be uniquely defined for any given time-varying problem, so for ease of explanation, I will consider the reach problem setup from the Internet marketing campaign example in Section 3.4 to provide a clear connection.
Recall here the optimization criterion for maximizing reach in a single time period (without the penalty term) is

$$\arg\min_{\beta} \frac{1}{n} \sum_{i=1}^{n} \prod_{j=1}^{p} (1 - \beta_j \upsilon_j)^{z_{ij}}, \qquad (4.6)$$

where again $\beta$ is the vector of budget allocations across $p$ websites, $\upsilon$ is the effectiveness per dollar spent at each of the $p$ websites, and $Z$ is the $n$ by $p$ data matrix of the number of webpages visited by each of the $n$ users at each of the $p$ websites. Using this setup, for two time periods the optimization criterion would be

$$\frac{1}{n} \sum_{i=1}^{n} \prod_{j=1}^{p} (1 - \beta_{1j} \upsilon_j)^{z_{1ij}} (1 - \beta_{2j} \upsilon_j)^{z_{2ij}}, \qquad (4.7)$$

where $\upsilon_j$ represents the overall effectiveness of website $j$ (including visitation and clickthrough rate, if a firm desires to optimize over clickthrough rather than reach). This formulation corresponds to firms who adjust their ad campaigns during the run of the campaign itself. For example, consider a firm running a month-long campaign leading up to a major Labor Day weekend sale. The first two weeks of the campaign are likely too far ahead of the holiday to warrant a serious advertising push, so the firm might use those two weeks to gather information and adjust its pitch leading up to the two weeks of advertising about which it is most concerned, the two weeks right before the sale.
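A minimal sketch of how the two-period criterion (4.7) could be evaluated for given allocations is below; the simulated inputs and names are hypothetical and serve only to make the notation concrete.

```python
import numpy as np

def two_period_reach_loss(beta1, beta2, Z1, Z2, upsilon):
    """Evaluate the two-period criterion (4.7): the average probability a user
    never sees the ad in either period, given allocations beta1 and beta2."""
    miss1 = np.clip(1.0 - beta1 * upsilon, 1e-12, 1.0)
    miss2 = np.clip(1.0 - beta2 * upsilon, 1e-12, 1.0)
    log_no_view = Z1 @ np.log(miss1) + Z2 @ np.log(miss2)
    return np.exp(log_no_view).mean()

# Hypothetical illustration: same sites, different visit patterns per period
rng = np.random.default_rng(2)
n, p = 500, 8
Z1, Z2 = rng.poisson(2.0, (n, p)), rng.poisson(3.0, (n, p))
upsilon = np.full(p, 0.02)
print(two_period_reach_loss(np.full(p, 5.0), np.full(p, 10.0), Z1, Z2, upsilon))
```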
The reach objective for the second time period, $\frac{1}{n}\sum_{i=1}^{n}\prod_{j=1}^{p}(1-\beta_{2j}\upsilon_j)^{z_{2ij}}$, thus splits into two separate components, one in which the υ*_j values are known and one in which they must still be modeled:

$$\min_{\beta_2} \frac{1}{n}\sum_{i=1}^{n}\prod_{j\in A}(1-\beta_{2j}\upsilon_j^*)^{z_{2ij}}\; E_{\upsilon_{\bar{A}}}\prod_{j\in \bar{A}}(1-\beta_{2j}\upsilon_j)^{z_{2ij}}, \qquad (4.9)$$

subject to the usual λ‖β‖_1 penalty (and where Ā is the set of websites not in A). There are a variety of ways to approximate the expectation component $E_{\upsilon_{\bar{A}}}\big(\prod_{j\in \bar{A}}(1-\beta_{2j}\upsilon_j)^{z_{2ij}}\big)$, in which the υ_j values are still unknown. Consider two such methods: (1) a simulation of random draws of υ_j for the Ā websites, and (2) a quadratic approximation to the $(1-\beta_{2j}\upsilon_j)^{z_{2ij}}$ function in the υ vector, where β_{2j} is treated as fixed.

For the simulation approach, the expectation is approximated using a given number N of random draws from the υ_j population. The expectation thus becomes $\frac{1}{N}\sum_{k=1}^{N}\prod_{j\in \bar{A}}(1-\beta_{2j}\upsilon_{jk})^{z_{2ij}}$, resulting in a new minimization of

$$\min_{\beta_2} \frac{1}{n}\sum_{i=1}^{n}\prod_{j\in A}(1-\beta_{2j}\upsilon_j^*)^{z_{2ij}}\; \frac{1}{N}\sum_{k=1}^{N}\prod_{j\in \bar{A}}(1-\beta_{2j}\upsilon_{jk})^{z_{2ij}}. \qquad (4.10)$$

Though this is more computationally expensive to calculate, particularly depending on how large N must be to obtain convergent results, the process should be fairly quick in practice. A computer can easily handle the additional summation, after which a quadratic approximation to the sum makes the process equivalent to the existing methodology. Previously, the algorithm optimized the problem

$$(1-\beta_1\upsilon_1)^{z_{i1}}(1-\beta_2\upsilon_2)^{z_{i2}}(1-\beta_3\upsilon_3)^{z_{i3}}\cdots, \qquad (4.11)$$

since it treated all υ values as known. Now, however, it requires a summation over N random draws of the υ values for each of the j websites, so the problem becomes

$$\frac{1}{N}\sum_{k=1}^{N}(1-\beta_1\upsilon_{1k})^{z_{i1}}(1-\beta_2\upsilon_{2k})^{z_{i2}}(1-\beta_3\upsilon_{3k})^{z_{i3}}\cdots, \qquad (4.12)$$

since the z_{ij} values remain the same regardless of k, which only affects the value of υ_j chosen for the particular random draw. Note the optimization here is done only for the β values after the summation over k, so there is no k associated with the β values. For one particular coordinate j, then, all values are fixed except β_j. For example, for the first coordinate β_1 the problem becomes

$$\frac{1}{N}\sum_{k=1}^{N}(1-\beta_1\upsilon_{1k})^{z_{i1}}(1-\beta_2\upsilon_{2k})^{z_{i2}}(1-\beta_3\upsilon_{3k})^{z_{i3}}\cdots, \qquad (4.13)$$

but $(1-\beta_2\upsilon_{2k})^{z_{i2}}(1-\beta_3\upsilon_{3k})^{z_{i3}}\cdots$ are all constants and can be combined with 1/N into a c_k term:

$$\sum_{k=1}^{N} c_k (1-\beta_1\upsilon_{1k})^{z_{i1}}. \qquad (4.14)$$

For this, a quadratic approximation is required in order to calculate the value of the function, so take a second-order Taylor approximation to $(1-\beta_1\upsilon_{1k})^{z_{i1}}$ (where the first derivative with respect to υ_{1k} is $-\beta_1 z_{i1}(1-\beta_1\upsilon_{1k})^{z_{i1}-1}$ and the second derivative is $\beta_1^2 z_{i1}(z_{i1}-1)(1-\beta_1\upsilon_{1k})^{z_{i1}-2}$) to get

$$\sum_{k=1}^{N} c_k \left[(1-\beta_1\mu_1)^{z_{i1}} - \beta_1 z_{i1}(1-\beta_1\mu_1)^{z_{i1}-1}(\upsilon_{1k}-\mu_1) + \tfrac{1}{2}\beta_1^2 z_{i1}(z_{i1}-1)(1-\beta_1\mu_1)^{z_{i1}-2}(\upsilon_{1k}-\mu_1)^2\right]. \qquad (4.15)$$

This can be rewritten as

$$\sum_{k} c_k (a_{1k} + b_{1k}\beta_1)^2, \qquad (4.16)$$

where a_{1k} and b_{1k} are functions of non-β_1 terms. From there, $c_k(a_{1k}+b_{1k}\beta_1)^2$ can be expanded and ultimately rewritten as $a + 2b\beta_1 + c\beta_1^2$, or $\alpha_1(\alpha_2+\alpha_3\beta_1)^2 + \alpha_4$.
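As a small illustration of the simulation approach in Equation 4.10, the sketch below approximates the Ā expectation with N random draws of υ. The sizes, the normal prior, and the clipping of υ at zero are assumptions made for this example only, not part of the thesis code.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, N = 500, 10, 200                 # users, websites outside A, Monte Carlo draws

Z2 = rng.poisson(1.0, size=(n, p))     # second-period visits to the A-bar websites
mu = rng.uniform(0.002, 0.008, p)      # prior means of upsilon_j
sigma = 0.001                          # common prior standard deviation
beta2 = np.full(p, 1.0)                # fixed trial allocation for the A-bar websites

# Draw N values of upsilon for each website; clip at zero so the
# miss probabilities stay in [0, 1] (an assumption of this sketch).
ups_draws = np.clip(rng.normal(mu, sigma, size=(N, p)), 0.0, None)

def mc_expectation(beta):
    """Monte Carlo estimate of E_upsilon prod_j (1 - beta_j * upsilon_j)^z_ij,
    averaged over users, as in the inner term of Equation 4.10."""
    # miss[k, i, j] = (1 - beta_j * upsilon_{jk}) ** z_{2ij}
    miss = (1.0 - beta[None, None, :] * ups_draws[:, None, :]) ** Z2[None, :, :]
    per_user = np.prod(miss, axis=2).mean(axis=0)   # average over the N draws
    return per_user.mean()                          # then average over users

print("simulated expectation term:", mc_expectation(beta2))
```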
For the quadratic approximation method of approximating the expectation component of Equation 4.9, instead take a quadratic approximation to the υ function first, rather than calculating the quadratic approximation after the simulated draws. In this case, ideally $(1-\beta_{2j}\upsilon_j)^{z_{2ij}}$ should be rewritten in the form $E\big((\upsilon-b)^T C(\upsilon-b)\big)$. Because the υ_j values are normal, this can be written as an expectation after some calculation. By pivoting the Taylor approximation around the mean μ_j for the jth website, the expected value calculation is simplified. For one variable, the second-order Taylor approximation is

$$E[f(\upsilon_j)] \approx E\left[f(\mu_j) + f'(\mu_j)(\upsilon_j-\mu_j) + \tfrac{1}{2}f''(\mu_j)(\upsilon_j-\mu_j)^2\right],$$

which simplifies to $f(\mu_j) + \tfrac{1}{2}f''(\mu_j)\sigma_j^2$.
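As a quick numerical check on this one-variable approximation (a sketch only; the values of μ, σ, β, and z below are arbitrary assumptions), one can compare $f(\mu) + \tfrac{1}{2}f''(\mu)\sigma^2$ against a Monte Carlo estimate of $E[(1-\beta\upsilon)^{z}]$ for υ ~ N(μ, σ²):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 0.005, 0.001     # assumed mean and sd of upsilon for one website
beta, z = 50.0, 4            # assumed allocation and visit count

f = lambda u: (1.0 - beta * u) ** z
# Second derivative of f with respect to upsilon, evaluated at mu:
f2 = beta ** 2 * z * (z - 1) * (1.0 - beta * mu) ** (z - 2)

taylor = f(mu) + 0.5 * f2 * sigma ** 2
monte_carlo = f(rng.normal(mu, sigma, size=200_000)).mean()

print(f"Taylor approximation: {taylor:.6f}")
print(f"Monte Carlo estimate: {monte_carlo:.6f}")
```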
As the optimization above for the second time period shows, the problem naturally divides into two stages: (1) selecting the set A of websites at which to advertise in the first time period in order to maximize expected clickthrough in the second time period, and (2) using the results from the first time period to establish the υ*_j values, allowing for the optimization of the budget allocation over all p websites in the second time period (including the Ā websites not advertised on during the first time period).

To simplify for now, assume the set A is known. Then, if the first time period is used purely to gather information (not to also add to the overall reach), the problem in the first time period becomes optimizing

$$E_{\upsilon}\, \min_{\beta_2} \frac{1}{n}\sum_{i=1}^{n}\prod_{j\in A}(1-\beta_{2j}\upsilon_j^*)^{z_{2ij}}\, E_{\upsilon_{\bar{A}}}\prod_{j\in \bar{A}}(1-\beta_{2j}\upsilon_j)^{z_{2ij}}. \qquad (4.17)$$

This requires the expected value of even the υ*_j values for A, because unlike in the approximation setup, the values of υ*_j are not known until the campaign has actually been run in the first time period. In order to identify the set A, then, the expected clickthrough in the second time period must be based on the selection of the sites in A, not on the υ*_j values themselves (which are not known before the campaign begins). Assuming A is unknown, the problem of selecting A to gather information on the υ_j values becomes

$$\min_{\text{choice of } A}\; E_{\upsilon}\, \min_{\beta_2} \frac{1}{n}\sum_{i=1}^{n}\prod_{j\in A}(1-\beta_{2j}\upsilon_j^*)^{z_{2ij}}\, E_{\upsilon_{\bar{A}}}\prod_{j\in \bar{A}}(1-\beta_{2j}\upsilon_j)^{z_{2ij}}. \qquad (4.18)$$

If there were no cost to adding websites to A, any optimization would naturally choose all p websites for the set A. In practice, however, there will necessarily be a cost. As an example, consider websites that will not allow less than a minimum amount of advertising (or, equivalently, imagine a firm must advertise a certain amount at a website in order to obtain useful information). Then the cardinality of A becomes important, say |A| ≤ B_1, where B_1 is the budget allocated to time period 1.

Still, the function in Equation 4.18 is not as complicated as Equation 4.8, because Equation 4.18 only refers to the choice of the set A. It does not take into account the reach achieved during time period 1, nor the allocation of budget across the two time periods. If a firm a priori splits its budget into B_1 for the first time period and B_2 for the second, incorporating reach into the equation results in Equation 4.8:

$$\min_{\|\beta_1\|_1 \le B_1} E_{\upsilon}\, \frac{1}{n}\sum_{i=1}^{n}\prod_{j=1}^{p}(1-\beta_{1j}\upsilon_j)^{z_{1ij}} \;\; \min_{\|\beta_2\|_1 \le B_2} \frac{1}{n}\sum_{i=1}^{n}\prod_{j\in A}(1-\beta_{2j}\upsilon_j^*)^{z_{2ij}}\, E_{\upsilon_{\bar{A}}}\!\prod_{j\in \bar{A}}(1-\beta_{2j}\upsilon_j)^{z_{2ij}}.$$

There are a number of difficulties in optimizing this criterion. First, though the methodology for the second time period seems very achievable via either the simulation or the approximation approach, it assumes the set A, and thus the values of υ*_j, are known. In fact, these must come from the overall optimization, and it is not trivial to select the set A, nor to optimize the problem even assuming A is already known (that is, even if advertisers had already selected the set of websites they were going to use in the first time period). The two separate minimizations, one over β_1 and the other over β_2, are compounded by taking expectations over the unknown υ values inside the minimization. While simulation has seemed promising in the case of the inner loop for time period 2, extending it to the entire problem quickly becomes computationally infeasible.

As such, this time-varying extension represents ongoing work as a further development of the methodology of Chapter 3. The major consideration for the use of the time-varying methodology above is the selection of the set A of websites, which is a much more complex problem than it might first appear, followed by the subsequent optimization over both time periods simultaneously.

4.3. Discussion

Though the penalized and the penalized and constrained methodologies of Chapters 2 and 3 both present interesting solutions to broad optimization problems, the methodology can be extended still further in a number of exciting ways. In this chapter, I consider just two of those. In the case of the OLIMS method of Chapter 2, the Internet advertising example considered in this section is an extension to a more dynamic problem. The features of Internet advertising and the methods employed to optimize Internet ad campaigns are constantly changing. One such change is the shift from buying advertising space on websites in advance, so cost is known going into the campaign, to buying ad space in real time via bidding. In the bidding environment, firms cannot possibly run their own campaigns and must rely on advertising agencies, which themselves rely on data service platforms to make bids on their behalf. These DSP bidding algorithms are black boxes, proprietary to the company that owns the DSP. The methodology presented in the previous chapters can be extended to this case as well, giving users a transparent option for bidding choices and budget allocation. Further, because the method is more versatile than a standard DSP bidding algorithm, it can be used to set budgets in a manner similar to the reach-budget tradeoffs managers routinely make with campaigns. This means the method provides not only a rubric for bidding decisions but also a budget-setting tool, so ad agencies can give firms a clear idea of how much return they should expect on various investments.

In addition, my most current ongoing work focuses on extending the PAC methodology to cases where the objective function under consideration benefits from a time component in some way. There are many situations in which using one time period to help predict the next benefits researchers. In the exemplar case of Internet media selection considered throughout this thesis, this benefit occurs in the form of providing additional information to better inform advertising allocation in a subsequent time period. Since Internet usage is constantly changing, the effectiveness measures for a website from last year or even last month may not be accurate measures for the current month. Breaking a campaign into distinct time periods can help update these measures and provide more reliable allocations.
The cost curve example provides a further justification: if cost curves are changing continuously, and advertising at websites provides the necessary information to update those cost curves, real-time bidding procedures could benefit significantly from optimizing time-varying campaigns.

References

MediaMind global benchmarks report. Technical report, MediaMind Technologies, Inc., 2015.
360i. Point of view on gaming. Technical report, May 2008.
H. Akaike. Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory (eds. B. N. Petrov and F. Czáki), Akadémiai Kiadó, Budapest, pages 267–281, 1973.
A. Ambroziak. Cruise lines increasingly enticing solo travelers aboard, June 2015.
R. Burke. Technology and the customer interface: What consumers want in the physical and virtual store. Journal of the Academy of Marketing Science, 30(4):411–432, 2002.
X. Cai, D. McKinney, and L. Lasdon. Solving nonlinear water management models using a combined genetic algorithm and linear programming approach. Advances in Water Resources, 24(6):667–676, June 2001.
D. Chaffey. Display advertising clickthrough rates, April 2016. URL: http://www.smartinsights.com/internet-advertising/internet-advertising-analytics/display-advertising-clickthrough-rates/. Accessed 27 May 2016.
M. Chapman. Digital advertising's surprising economics. Adweek, 50(10):8, March 2009.
Y. Chen, D. Pavlov, and J. Canny. Large-scale behavioral targeting. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, June 2009.
C. Cho and H. Cheon. Why do people avoid advertising on the internet? Journal of Advertising, 33(4):89–97, Winter 2004.
M. Clements. Solo travelers find a berth with Norwegian Cruise Lines, January 2013.
P. Danaher. Modeling page views across multiple websites with an application to internet reach and frequency prediction. Marketing Science, 26(3):422–437, May-June 2007.
P. Danaher, L. Janghyuk, and L. Kerbache. Optimal internet media selection. Marketing Science, 29(2):336–347, March-April 2010.
K. Dave and V. Varma. Learning the click-through rate for rare/new ads from similar ads. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 897–898. ACM, 2010.
A. Drewnowski and N. Darmon. Food choices and diet costs: an economic analysis. The Journal of Nutrition, 135(4):900–904, April 2005.
B. Efron, T. Hastie, I. Johnston, and R. Tibshirani. Least angle regression (with discussion). The Annals of Statistics, 32(2):407–451, 2004.
eMarketer. Digital ad spending tops 37 billion, 2012. URL: http://www.emarketer.com/newsroom/index.php/digital-ad-spending-top-37-billion-2012-market-consolidates. Accessed 4 Jun 2015.
J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, December 2001.
J. Fan, J. Zhang, and K. Yu. Vast portfolio selection with gross-exposure constraints. Journal of the American Statistical Association, 107:592–606, 2012.
I. E. Frank and J. H. Friedman. A statistical view of some chemometrics regression tools. Technometrics, 35(2):109–135, 1993.
J. Friedman, T. Hastie, H. Hoefling, and R. Tibshirani. Pathwise coordinate optimization. Annals of Applied Statistics, 1:302–332, 2007.
J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):302–332, 2010a.
J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):302–332, 2010b.
W. Fu. Penalized regressions: The bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397–416, 1998.
A. Goldfarb and C. Tucker. Online display advertising: Targeting and obtrusiveness. Marketing Science, 30(3):389–404, 2011.
J. J. Grefenstette. Optimization of control parameters for genetic algorithms. IEEE Transactions on Systems, Man, and Cybernetics, 16(1):122–128, January 1986.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, second edition, 2009.
T. Hemphill. DoubleClick and consumer online privacy: An e-commerce lesson learned. Business and Society Review, 105(3):361–372, Fall 2000.
P. R. Hoban and R. E. Bucklin. Effects of internet display advertising in the purchase funnel: Model-based insights from a randomized field experiment. Journal of Marketing Research, LII:375–393, June 2015.
A. E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
A. Homaifar, C. X. Qi, and S. H. Lai. Constrained optimization via genetic algorithms. Simulation, 62(4):242–253, 1994.
J. Huang, S. Ma, and C. Zhang. Adaptive lasso for sparse, high-dimensional regression models. Statistica Sinica, 18(374):1603–1618, November 2006.
N. Immorlica, K. Jain, M. Mahdian, and K. Talwar. Click fraud resistant methods for learning click-through rates. Internet and Network Economics, pages 34–45, 2005.
G. M. James, J. Wang, and J. Zhu. Functional linear regression that's interpretable. Annals of Statistics, 37:2083–2108, 2009.
B. Krishnapuram, L. Carin, M. Figueiredo, and A. Hartemink. Sparse multinomial logistic regression: Fast algorithms and generalization bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6):957–968, June 2005.
H. Krugman. Why three exposures may be enough. Journal of Advertising Research, 12(6):11–14, 1972.
L. Le Cam. An approximation theorem for the Poisson binomial distribution. Pacific Journal of Mathematics, 10(4):1181–1197, 1960.
S. Lee, L. Honglak, P. Abbeel, and A. Ng. Efficient L1 regularized logistic regression. American Association for Artificial Intelligence, pages 401–408, 2006.
J. Liaukonyte, T. Teixeira, and K. Wilbur. Television advertising and online shopping. Marketing Science, 34(3):311–330, 2015.
A. Lipsman. The New York Times ranks as top online newspaper according to May 2010 U.S. comScore Media Metrix data. Technical report, comScore, Inc., June 2010.
R. Lothia, N. Donthu, and E. Hershberger. The impact of content and design elements on banner advertising click-through rates. Journal of Advertising Research, 43(04):410–418, December 2003.
Z. Luo and P. Tseng. On the convergence of the coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72(1):7–35, January 1992.
J. Lv and J. Liu. Model selection principles in misspecified models. Journal of the Royal Statistical Society, Series B, 76:141–167, 2014.
P. Manchanda, J. Dubé, K. Goh, and P. Chintagunta. The effect of banner advertising on internet purchasing. Journal of Marketing Research, 43:98–108, February 2006.
H. M. Markowitz. Portfolio selection. Journal of Finance, 7(1):77–91, 1952.
H. M. Markowitz. Portfolio Selection: Efficient Diversification of Investments. New York: Wiley, 1959.
N. Meinshausen. Relaxed lasso. Computational Statistics and Data Analysis, pages 374–393, 2007.
Mintel. Kids as influencers–U.S. Technical report, Mintel, April 2014.
W. Moe. Buying, searching, or browsing: Differentiating between online shoppers using in-store navigational clickstream. Journal of Consumer Psychology, 13:29–39, 2003.
A. L. Montgomery, S. Li, K. Srinivasan, and J. Liechty. Modeling online browsing and path analysis using clickstream data. Marketing Science, 23(4):579–595, 2004.
M. Morrison. Can the McRib save Christmas? Ad Age, September 2012.
S. Muthukrishnan. Ad exchanges: Research issues. Technical report, Google, Inc., 2009.
M. Naples. Effective frequency: The relationship between frequency and advertising effectiveness. Technical report, Association of National Advertisers, New York, 1979.
M. Park and T. Hastie. An L1 regularization-path algorithm for generalized linear models. Journal of the Royal Statistical Society, Series B, 69:659–677, 2007.
Y. Park and P. Fader. Modeling browsing behavior at multiple websites. Marketing Science, 23(3):280–303, 2004.
S. Perkins, K. Lacker, and J. Theiler. Grafting: Fast, incremental feature selection by gradient descent in function space. Journal of Machine Learning Research, 3:1333–1356, 2003.
G. Ramirez and K. King. Interview with mbuy agency representatives. Personal interview, February 2016.
A. Satchell. Norwegian: Cruise fares to increase up to 10 percent April 1. South Florida Sun-Sentinel, March 2011.
R. Schlesinger. U.S. population, 2011: 310 million and growing. U.S. News, December 2010.
G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6:461–464, 1978.
Y. She. Sparse regression with exact clustering. Electronic Journal of Statistics, 4:1055–1096, 2010.
S. Shevade and S. Keerthi. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics, 19:2246–2253, 2003.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1996a.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1996b.
R. Tibshirani and J. Taylor. The solution path of the generalized lasso. Annals of Statistics, 39(3):1335–1371, 2011.
R. Tibshirani, M. Saunders, S. Rosset, and J. Zhu. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society, Series B, 67(1):91–108, 2005.
The Economist Intelligence Unit. Business: The online ad attack. The Economist, 375(8424):63, April 2005.
C. Wang. Are media buyers using the best decision algorithms?, March 2015.
M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68:49–67, 2007.
H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101:1418–1429, 2006.
H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67:301–320, 2005.
Appendix A

Technical Appendix to Chapter 1

Tables A.1 and A.2 provide an overview of correlation in viewership among the 16 website groups in the data for the McRib and NCL case studies, respectively. The tables include both within-group correlations and among-group correlations. Within-group correlation is calculated by taking the mean of all absolute correlations between websites in a particular group; these values are displayed on the diagonal of each table. For example, in the December 2011 McRib data (Table A.1), the Newspaper category shows moderately high average correlation in viewership among websites, with a value of 0.48. In contrast, there is not much correlation in viewership among websites in the E-mail category, only 0.01 on average.

The off-diagonal elements of Tables A.1 and A.2 show the maximum absolute correlation between each pair of groups. This is calculated by taking the maximum correlation between two websites from the respective groups. For example, in the December 2011 McRib data (Table A.1), there is a high correlation of 0.96 between Newspaper and Portal sites. In contrast, there is a low correlation between Filesharing and E-mail sites, only 0.03. Both tables follow the sketch below.
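The within-group and between-group summaries in the tables can be reproduced with a short script along the following lines. This is a sketch only: `views` is a hypothetical users-by-websites viewership matrix and `group` a hypothetical vector of category labels, standing in for the comScore data and categories used in the thesis.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the comScore data: one column per website,
# one row per user, plus a category label for each website.
rng = np.random.default_rng(4)
views = pd.DataFrame(rng.poisson(1.0, size=(500, 12)),
                     columns=[f"site{j}" for j in range(12)])
group = pd.Series(["Newspaper"] * 4 + ["Portal"] * 4 + ["Email"] * 4,
                  index=views.columns)

# Absolute pairwise correlations, with each site's self-correlation removed.
corr = np.abs(np.corrcoef(views.values, rowvar=False))
np.fill_diagonal(corr, np.nan)
corr = pd.DataFrame(corr, index=views.columns, columns=views.columns)

def within_group(g):
    """Diagonal entries: mean absolute correlation among websites in one group."""
    sites = group.index[group == g]
    return np.nanmean(corr.loc[sites, sites].values)

def between_groups(g1, g2):
    """Off-diagonal entries: maximum absolute correlation across two groups."""
    s1, s2 = group.index[group == g1], group.index[group == g2]
    return corr.loc[s1, s2].values.max()

print("within Newspaper:        ", round(within_group("Newspaper"), 2))
print("max Newspaper vs Portal: ", round(between_groups("Newspaper", "Portal"), 2))
```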
Category        Com  Email Ent  File Game Gen  Info News Onl  Photo Port Ret  Serv Soc  Sport Travel
Community       0.02 0.14 0.82 0.14 0.77 0.14 0.47 0.16 0.55 0.88 0.21 0.39 0.21 0.26 0.12 0.15
Email           .    0.01 0.07 0.03 0.28 0.04 0.07 0.05 0.09 0.04 0.87 0.10 0.10 0.06 0.12 0.04
Entertainment   .    .    0.02 0.78 0.32 0.90 0.76 0.92 0.28 0.83 0.90 0.30 0.69 0.24 0.79 0.10
Fileshare       .    .    .    0.05 0.27 0.05 0.15 0.56 0.67 0.13 0.17 0.10 0.13 0.14 0.10 0.07
Gaming          .    .    .    .    0.01 0.12 0.82 0.32 0.85 0.12 0.25 0.14 0.95 0.09 0.51 0.09
General News    .    .    .    .    .    0.28 0.76 0.94 0.08 0.04 0.96 0.08 0.10 0.34 0.85 0.11
Information     .    .    .    .    .    .    0.02 0.77 0.51 0.18 0.76 0.30 0.11 0.24 0.65 0.27
Newspaper       .    .    .    .    .    .    .    0.48 0.10 0.05 0.96 0.36 0.12 0.26 0.86 0.15
Online Shop     .    .    .    .    .    .    .    .    0.03 0.49 0.16 0.26 0.75 0.42 0.19 0.10
Photos          .    .    .    .    .    .    .    .    .    0.02 0.11 0.09 0.09 0.41 0.04 0.05
Portal          .    .    .    .    .    .    .    .    .    .    0.06 0.19 0.19 0.12 0.87 0.09
Retail          .    .    .    .    .    .    .    .    .    .    .    0.04 0.19 0.18 0.25 0.12
Service         .    .    .    .    .    .    .    .    .    .    .    .    0.01 0.15 0.19 0.05
Social Network  .    .    .    .    .    .    .    .    .    .    .    .    .    0.02 0.10 0.26
Sports          .    .    .    .    .    .    .    .    .    .    .    .    .    .    0.07 0.08
Travel          .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    0.18

Table A.1: Overview of viewership correlation within and across the sixteen website categories in the December 2011 data set, used in the McRib case study. The diagonal elements represent the mean absolute correlation in that particular website category, while the off-diagonal elements represent the maximum absolute correlation between each pair of groups.

Category        Com  Email Ent  File Game Gen  Info News Onl  Photo Port Ret  Serv Soc  Sport Travel
Community       0.02 0.06 0.27 0.15 0.23 0.57 0.43 0.14 0.16 0.24 0.49 0.35 0.07 0.21 0.20 0.12
Email           .    0.00 0.12 0.04 0.14 0.05 0.08 0.04 0.04 0.02 0.89 0.07 0.09 0.07 0.05 0.03
Entertainment   .    .    0.02 0.38 0.38 0.61 0.53 0.32 0.22 0.23 0.74 0.28 0.12 0.19 0.39 0.09
Fileshare       .    .    .    0.05 0.25 0.06 0.53 0.09 0.09 0.10 0.28 0.08 0.05 0.15 0.11 0.04
Gaming          .    .    .    .    0.01 0.13 0.96 0.11 0.16 0.11 0.25 0.15 0.89 0.32 0.10 0.92
General News    .    .    .    .    .    0.05 0.61 0.31 0.08 0.06 0.62 0.11 0.15 0.16 0.53 0.14
Information     .    .    .    .    .    .    0.02 0.24 0.52 0.12 0.65 0.33 0.12 0.39 0.34 0.33
Newspaper       .    .    .    .    .    .    .    0.06 0.08 0.04 0.46 0.13 0.06 0.27 0.87 0.31
Online Shop     .    .    .    .    .    .    .    .    0.04 0.06 0.15 0.22 0.07 0.59 0.06 0.12
Photos          .    .    .    .    .    .    .    .    .    0.03 0.11 0.13 0.03 0.16 0.08 0.03
Portal          .    .    .    .    .    .    .    .    .    .    0.03 0.26 0.27 0.24 0.45 0.10
Retail          .    .    .    .    .    .    .    .    .    .    .    0.03 0.11 0.20 0.19 0.14
Service         .    .    .    .    .    .    .    .    .    .    .    .    0.01 0.04 0.18 0.06
Social Network  .    .    .    .    .    .    .    .    .    .    .    .    .    0.02 0.09 0.36
Sports          .    .    .    .    .    .    .    .    .    .    .    .    .    .    0.03 0.06
Travel          .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    0.14

Table A.2: Overview of viewership correlation within and across the sixteen website categories in the January 2011 data set, used in the NCL case study. As in Table A.1, the diagonal elements represent the mean absolute correlation in that particular website category, while the off-diagonal elements represent the maximum absolute correlation between each pair of groups.

Appendix B

Technical Appendix to Chapter 2

B.1. Simple Illustration of Correlation in Website Viewership

In this section, I provide a simple illustration of how the OLIMS method handles correlation in the Z data matrix. To demonstrate the basic intuition, I illustrate the effects of correlation on budget allocation by considering a case with three websites, all generated from the same distribution with the same cost. However, the viewership for websites 1 and 2 has a measurable correlation ranging from 0.0 (fully independent) to 1.0 (perfect positive correlation), and website 3's viewership is generated entirely independently of the other two websites (correlation of 0).

Figure B.1 shows the change in budget allocation across the three websites as the correlation between websites 1 and 2 changes, where the red line is website 1's allocation, the blue line is website 2's allocation, and the green line is website 3's allocation. When the correlation between websites 1 and 2 is zero, all three websites are completely independent. In this case, the algorithm allocates one-third of the budget to each of the three websites, since no website has a clear advantage over the other two. As the correlation between websites 1 and 2 increases, the algorithm gradually allocates more budget to website 3 and splits the remaining budget among websites 1 and 2. When these two websites become perfectly correlated, the algorithm divides the budget in half, allocating one half to website 3 and the other half across websites 1 and 2.

Figure B.1: Illustration of budget allocation with varying correlations in website viewership (x-axis: correlation between websites 1 and 2; y-axis: proportion of budget allocated), where website 1 is in red, website 2 is in blue, and website 3 is in green.

B.2. Supplementary Information for Model Comparisons with Danaher et al.'s Model

I first describe how I generate the Z matrix using the multivariate negative binomial marginal distribution as described in Danaher et al. (2010). Under Danaher et al.'s approach, page impressions are equivalent to ad appearances; thus what Danaher et al. refer to as X is equivalent to the proposed Z in the OLIMS method. To keep terminology consistent, I use Z_j for the methodology in this section.

I first generate website 1's data, Z_1, from a typical negative binomial distribution (i.e., the marginal f_1(Z_1) distribution). Then, following Danaher (2007, p. 425), the conditional distribution of Z_2 given Z_1 is f(Z_2 | Z_1) = f(Z_1, Z_2)/f(Z_1), or

$$f(Z_2 = z_2 \mid Z_1 = z_1) = f_2(z_2)\left[1 + \omega\left(e^{-z_1} - \left(\frac{\alpha_1}{1-e^{-1}+\alpha_1}\right)^{r_1}\right)\left(e^{-z_2} - \left(\frac{\alpha_2}{1-e^{-1}+\alpha_2}\right)^{r_2}\right)\right]. \qquad (B.1)$$
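A minimal sketch of drawing Z_2 given Z_1 from this conditional follows. The parameter values (α_j, r_j, ω), the truncation of the support, and the use of scipy's negative binomial parameterization are all assumptions for the example, not the estimates or code used in the thesis.

```python
import numpy as np
from scipy.stats import nbinom

# Assumed NBD parameters and Sarmanov-style mixing weight for the sketch.
# scipy's nbinom(n=r, p=alpha/(1+alpha)) matches P(Z = 0) = (alpha/(1+alpha))^r.
alpha1, r1 = 0.19, 0.29
alpha2, r2 = 0.02, 0.06
omega = 0.5
z_max = 500                      # truncation of Z2's support for the sketch
rng = np.random.default_rng(7)

def laplace_term(alpha, r):
    # E[exp(-Z)] for the NBD: (alpha / (1 - e^{-1} + alpha))^r, as in (B.1)
    return (alpha / (1.0 - np.exp(-1.0) + alpha)) ** r

def sample_z2_given_z1(z1):
    k = np.arange(z_max + 1)
    f2 = nbinom.pmf(k, r2, alpha2 / (1.0 + alpha2))            # marginal of Z2
    mix = 1.0 + omega * (np.exp(-z1) - laplace_term(alpha1, r1)) * \
                        (np.exp(-k) - laplace_term(alpha2, r2))
    probs = np.clip(f2 * mix, 0.0, None)                        # guard tiny negatives
    probs /= probs.sum()                                        # renormalize (truncated)
    return rng.choice(k, p=probs)

# Draw Z1 from its NBD marginal, then Z2 | Z1 from the conditional (B.1).
Z1 = nbinom.rvs(r1, alpha1 / (1.0 + alpha1), size=5000, random_state=123)
Z2 = np.array([sample_z2_given_z1(z) for z in Z1])
print("mean visits:", Z1.mean(), Z2.mean())
```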
I then use the following approach to generate the Z matrix following Danaher's methodology:

1. Randomly generate n synthetic respondents from a negative binomial distribution, corresponding to Z_11, ..., Z_n1.
2. For each Z_i1, randomly generate Z_i2 by sampling from the probability distribution given by (B.1).
3. Repeat the process for Z_3, Z_4, etc. until the desired number of websites is reached.

Note the calculation of the conditional f(Z_j | Z_1, ..., Z_{j-1}) becomes increasingly complex as each successive website's viewership is calculated. For example, f(Z_1, Z_2, Z_3) = f(Z_1) f(Z_2 | Z_1) f(Z_3 | Z_1, Z_2). Thus I extend to seven websites for the example used in Section 2.3.1. By combining these vectors, I can create the Z matrix based on the multivariate negative binomial distribution.

To make the simulated data as realistic as possible, I generate the simulated data using values of α and r estimated from the top 7 most visited websites in the December 2011 comScore data as the true parameter values of the MNBD. Since E(Z_j) = r_j/α_j, we have r̂_j/α̂_j = Z̄_j, or equivalently r̂_j = Z̄_j α̂_j. Z̄_j can be found easily from the data, as it is simply the mean of the visit values for a particular website j. Further, given that the probability of an NBD random variable taking the value zero is (α_j/(1+α_j))^{r_j}, estimate α̂_j as the solution to

$$y_j = \left(\frac{\hat{\alpha}_j}{1+\hat{\alpha}_j}\right)^{\bar{Z}_j \hat{\alpha}_j}, \qquad (B.2)$$

where y_j denotes the observed fraction of zero visits to a given website j. Equation (B.2) can be solved easily using a root-solving function, and in turn r_j is estimated via r̂_j = Z̄_j α̂_j. I used this approach to estimate α and r from Amazon, AOL, Edgesuite, Live, MSN, Weatherbug, and Yahoo, which provided the basis for the seven-website simulation in Section 2.3.1.

Table B.1 shows a comparison between the estimated and true α and r for the simulated data. Here, the true values are from the seven previously mentioned websites, while the estimated values are mean values from 100 simulation runs with matrices of 50,000 users each.¹ The table also shows the mean squared error between the true and estimated values over the 100 runs, as well as the mean absolute deviation. It is evident that the estimated and true α and r values are reasonably close to one another.

        Website 1  Website 2  Website 3  Website 4  Website 5  Website 6  Website 7
α       0.187      0.017      0.093      0.038      0.043      0.025      0.032
α̂       0.187      0.018      0.093      0.039      0.043      0.025      0.033
MSE     2e-5       2e-5       4e-5       6e-5       6e-5       3e-5       3e-5
MAD     0.008      0.001      0.004      0.001      0.002      0.002      0.002
r       0.287      0.056      0.174      0.093      0.167      0.051      0.444
r̂       0.287      0.057      0.174      0.093      0.168      0.051      0.445
MSE     8e-5       7e-5       2e-5       8e-5       6e-5       6e-5       7e-5
MAD     0.009      0.003      0.005      0.004      0.005      0.004      0.006

Table B.1: True and estimated mean α, r values for the simulated data used for comparisons between the OLIMS method and Danaher et al.'s method in Section 2.3.1.

¹ Note the simulation used in Section 2.3.1 is done with 5,000 synthetic respondents due to the computational complexity involved in estimating Danaher et al.'s method for 50,000 synthetic respondents.
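A minimal sketch of this mean-and-zero-fraction estimation follows; the simulated visit vector `z` is a hypothetical stand-in for one website's column of the comScore data, and `scipy.optimize.brentq` is one convenient choice of root solver.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(5)
# Hypothetical visit counts for one website (in practice, a column of Z).
# True alpha = 0.3, r = 0.3 in the thesis's parameterization.
z = rng.negative_binomial(n=0.3, p=0.3 / 1.3, size=50_000)

z_bar = z.mean()                 # sample mean, so r_hat = z_bar * alpha_hat
y = np.mean(z == 0)              # observed fraction of zero visits

def zero_fraction_gap(alpha):
    """Equation (B.2): (alpha/(1+alpha))^(z_bar*alpha) minus the observed zero fraction."""
    return (alpha / (1.0 + alpha)) ** (z_bar * alpha) - y

alpha_hat = brentq(zero_fraction_gap, 1e-6, 100.0)   # root-solve for alpha_hat
r_hat = z_bar * alpha_hat                            # then r_hat = z_bar * alpha_hat

print(f"alpha_hat = {alpha_hat:.3f}, r_hat = {r_hat:.3f}")
```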
Table B.2 shows a comparison between the estimated and full α and r for the seven-website data from comScore in Section 2.3.1, where the full values are based on the entire December 2011 comScore data set and the estimated values are mean values across 100 runs on random 10% subsets. The table also shows the mean squared error between the full and estimated values over the 100 runs, as well as the mean absolute deviation. Again, the estimated values based on the subset data closely resemble the values obtained from the full data.

              Amazon  AOL    Edgesuite  Live   MSN    Weatherbug  Yahoo
Full α        0.187   0.017  0.093      0.038  0.043  0.025       0.032
Estimated α   0.188   0.017  0.094      0.038  0.043  0.025       0.032
MSE           2e-5    2e-6   3e-5       8e-6   6e-6   4e-6        2e-6
MAD           0.010   0.001  0.005      0.002  0.002  0.002       0.001
Full r        0.287   0.056  0.174      0.093  0.167  0.051       0.444
Estimated r   0.288   0.056  0.175      0.093  0.167  0.051       0.444
MSE           1e-4    4e-6   3e-5       1e-5   2e-5   4e-6        1e-4
MAD           0.010   0.002  0.005      0.003  0.004  0.002       0.008

Table B.2: True and estimated mean α, r values for the real data used for comparisons between the OLIMS method and Danaher et al.'s method in Section 2.3.1.

Appendix C

Technical Appendix to Chapter 3

C.1. Methodology Details

The first step in the methodology is computing the quadratic Taylor approximation. By Taylor's Theorem,

$$L(\beta) \approx L(\tilde{\beta}) + \sum_{j=1}^{p} g'_{ij}(\beta_j - \tilde{\beta}_j) + \frac{1}{2}\sum_{j=1}^{p}\sum_{k=1}^{p} g''_{ijk}(\beta_j - \tilde{\beta}_j)(\beta_k - \tilde{\beta}_k),$$

where $g'_{ij} = \frac{\partial L(\beta)}{\partial \beta_j}\big|_{\beta=\tilde{\beta}}$, $g''_{ijk} = \sum_{i=1}^{n}\frac{\partial^2 L(\beta)}{\partial \beta_j \partial \beta_k}\big|_{\beta=\tilde{\beta}}$, and β̃ is a fixed p-dimensional vector. Thus, over all i, this gives the approximation

$$\min_{\beta} \sum_{i=1}^{n}\left[\sum_{j=1}^{p} g'_{ij}(\beta_j - \tilde{\beta}_j) + \frac{1}{2}\sum_{j=1}^{p}\sum_{k=1}^{p} g''_{ijk}(\beta_j - \tilde{\beta}_j)(\beta_k - \tilde{\beta}_k)\right]. \qquad (C.1)$$

This procedure can be used to solve for any generalized linear model using the model's canonical link function. Let l(·) denote the link function and V(·) the variance function. Analogous to the iteratively reweighted least squares approach, given the current parameter estimates β̄, a second-order Taylor approximation to the optimization problem in (3.2), up to irrelevant constants, is given by

$$\arg\min_{\beta}\; \frac{1}{2}\sum_{i=1}^{n} w_i\left(z_i - \beta_0 - x_i^T\beta\right)^2 + \lambda\sum_{j=1}^{p}\upsilon_j|\beta_j| \quad \text{such that } C\beta \le b, \qquad (C.2)$$

where $\upsilon_j = \rho'(|\bar{\beta}_j|)$, $z_i = \bar{\beta}_0 + x_i^T\bar{\beta} + \frac{y_i - \bar{\mu}_i}{w_i}$, $w_i = V(\bar{\mu}_i)$, and $\bar{\mu}_i = l^{-1}\big(\bar{\beta}_0 + x_i^T\bar{\beta}\big)$.

For example, consider a logistic regression involving binary responses. In this case, $l(\mu) = \log\frac{\mu}{1-\mu}$, $V(\mu) = \mu(1-\mu)$, and for i = 1, ..., n,

$$\mathrm{LogLik}_i(\beta) = y_i(\beta_0 + x_i^T\beta) - \log\left(1 + e^{\beta_0 + x_i^T\beta}\right).$$

Thus, given the current parameter estimates, $z_i = \bar{\beta}_0 + x_i^T\bar{\beta} + \frac{y_i - \bar{\mu}_i}{w_i}$, $w_i = \bar{\mu}_i(1-\bar{\mu}_i)$, and $\bar{\mu}_i = \frac{e^{\bar{\beta}_0 + x_i^T\bar{\beta}}}{1 + e^{\bar{\beta}_0 + x_i^T\bar{\beta}}}$.

Nor is the procedure restricted to linear models. The motivating examples of Sections 3.4 and 3.5, reach and clickthrough rate for a Norwegian Cruise Lines advertising campaign, are highly nonlinear in nature. However, the same procedure can readily be applied to put these nonlinear functions into the quadratic formulation. Consider the clickthrough rate function example, where $L(\beta) = \arg\min_{\beta}\sum_{i=1}^{n}(1-\beta q \upsilon)^{z_i}$ (note this is analogous to the reach function when q = 1 for all websites). To see how the first and second derivatives behave in this case, consider the simplified CTR optimization problem for a single website j:

$$\arg\min_{\beta}\sum_{i=1}^{n}(1-\beta_j q_j \upsilon_j)^{z_{ij}} + \lambda\|\beta\|_1 \quad \text{subject to } C\beta \le b.$$

Because all terms but β_j are constant here, the first and second derivatives can easily be computed as, respectively,

$$g'(\beta_j) = -z_{ij} q_j \upsilon_j (1-\beta_j q_j \upsilon_j)^{z_{ij}-1} \quad \text{and} \quad g''(\beta_j) = \frac{1}{2} z_{ij}(z_{ij}-1) q_j^2 \upsilon_j^2 (1-\beta_j q_j \upsilon_j)^{z_{ij}-2}.$$

Thus, even with a highly nonlinear objective function, the quadratic approximation can utilize the PAC methodology and algorithm of Section 3.2 as long as the first and second derivatives can be readily computed.
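The following sketch checks the first derivative and curvature of a single CTR term $(1-\beta_j q_j \upsilon_j)^{z_{ij}}$ against central finite differences. The values of q, υ, z, and β are arbitrary assumptions for the example; note that the 1/2 factor quoted in g'' above is the coefficient the curvature receives in the quadratic Taylor term, so the code compares the raw second derivative and comments on where the 1/2 enters.

```python
import numpy as np

# Arbitrary assumed values for one user-website term (not from the thesis data).
q, ups, z, beta = 0.1, 0.05, 3, 2.0
f = lambda b: (1.0 - b * q * ups) ** z      # one term of the CTR objective

# Analytic first derivative and curvature of f with respect to beta.
d1 = -z * q * ups * (1.0 - beta * q * ups) ** (z - 1)
d2 = z * (z - 1) * q**2 * ups**2 * (1.0 - beta * q * ups) ** (z - 2)

# Central finite differences for comparison.
h = 1e-5
fd1 = (f(beta + h) - f(beta - h)) / (2 * h)
fd2 = (f(beta + h) - 2 * f(beta) + f(beta - h)) / h**2

print(f"first derivative:  analytic {d1:.8f}  finite diff {fd1:.8f}")
print(f"second derivative: analytic {d2:.8f}  finite diff {fd2:.8f}")
# In the quadratic approximation the curvature enters with a 1/2 coefficient:
# f(beta + s) ~ f(beta) + d1 * s + 0.5 * d2 * s**2.
```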
C.2. Le Cam's Theorem

Le Cam's Theorem states the following. Suppose:

• V_1, ..., V_p ∈ {0, 1} are independent (not necessarily identically distributed) Bernoulli random variables,
• P(V_j = 1) = ρ_j for j = 1, 2, 3, ...,
• μ_p = Σ_{j=1}^{p} ρ_j,
• S_p = Σ_{j=1}^{p} V_j, so S_p follows a Poisson binomial distribution.

Then the error between the Poisson binomial and Poisson distributions is bounded by

$$\sum_{k=0}^{\infty}\left|P(S_p = k) - \frac{\mu_p^k e^{-\mu_p}}{k!}\right| < 2\sum_{j=1}^{p}\rho_j^2. \qquad (C.3)$$

If ρ_j is set to μ_p/p, the general Poisson limit theorem results.

In the case of the Internet advertising example, V_j is whether or not a particular user views (or clicks on) an ad at a given website j. Thus S_p (the sum of all ad appearances to this user across the p websites) should ideally not equal zero. Fixing customer i and letting $p_i(\mu) = \sum_{j=1}^{p} z_{ij}$ (the total number of visits customer i makes to the p websites in the set),¹ Le Cam's result gives

$$\left|\prod_{j=1}^{p}(1-\beta_j\upsilon_j)^{z_{ij}} - e^{-\sum_{j=1}^{p} z_{ij}\beta_j\upsilon_j}\right| < \frac{2\sum_{j=1}^{p} z_{ij}\beta_j^2\upsilon_j^2}{\sum_{j=1}^{p} z_{ij}\beta_j\upsilon_j}, \qquad (C.4)$$

since $\prod_{j=1}^{p}(1-\beta_j\upsilon_j)^{z_{ij}}$ is the total probability that user i does not see the ad at any visit to the p websites under consideration, P(S_p = 0), and $e^{-\sum_j z_{ij}\beta_j\upsilon_j}$ is the Poisson approximation to that same Bernoulli setup (that is, again the probability user i does not see the ad at any visit). If $\sum_{j=1}^{p} z_{ij}\beta_j\upsilon_j \to K$ as p → ∞ (where K is a constant) and $\max_{j=1,\ldots,p}\beta_j\upsilon_j \to 0$ as p → ∞, then the approximation is correct (with β_jυ_j ≤ 1 as well).

¹ Note that ζ here is the probability parameter of the Bernoulli distribution. Thus this is an approximation via the usual Poisson limit theorem, where Poisson(μ) is used to approximate Binomial(p, ζ); as p → ∞ and ζ → 0, pζ → μ.
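A small numerical illustration of the approximation and the bound in (C.4) follows; the number of websites, visit counts, allocations, and effectiveness values are arbitrary assumptions for the sketch, not quantities from the comScore data.

```python
import numpy as np

rng = np.random.default_rng(6)
p = 100                                   # websites visited by one user
z = rng.poisson(2.0, size=p)              # visit counts z_ij for this user
beta = rng.uniform(0.0, 2.0, size=p)      # budget allocations
ups = rng.uniform(0.001, 0.01, size=p)    # effectiveness per dollar

exact = np.prod((1.0 - beta * ups) ** z)          # P(user never sees the ad)
poisson = np.exp(-np.sum(z * beta * ups))         # Poisson approximation
bound = 2.0 * np.sum(z * (beta * ups) ** 2) / np.sum(z * beta * ups)

print(f"exact product      : {exact:.6f}")
print(f"Poisson approx.    : {poisson:.6f}")
print(f"|difference|       : {abs(exact - poisson):.6f}")
print(f"Le Cam-style bound : {bound:.6f}")
```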
Abstract
The use of penalized loss functions in optimization procedures is prevalent throughout the statistical literature. However, penalized loss function methodology is less familiar outside the areas of mathematics and statistics, and even within statistics loss functions are typically studied on a problem-by-problem basis. In this thesis I present general loss function approaches for both penalized as well as penalized and constrained loss functions and apply these approaches to a problem outside statistics, namely the optimization of a large-scale Internet advertising campaign.

Recent work in penalized loss functions has focused largely on linear regression methods with excellent results. Methods like the LASSO (Tibshirani 1996a) have demonstrated that penalization can yield gains in both prediction accuracy and algorithmic efficiency. In cases where linearity is not appropriate, however, the best approaches can be unclear. To this end I propose a generalized methodology, applicable to any loss function for which the first and second derivatives can be readily computed. This leverages the efficiency benefits seen in penalized linear regression approaches and applies them to problems with general nonlinear objectives.

Unfortunately, as useful as penalization is in optimization, many problems require another step: constraints. Problems like portfolio optimization or monotone curve estimation have very natural constraints which need to be taken into account directly during the optimization. Because of this, I propose a further advance in the penalized methodology: PAC, or penalized and constrained regression. PAC is capable of incorporating constraints directly into the optimization, allowing for better parameter estimation when compared to unconstrained methods. Again, the use of penalization allows for efficient algorithms applicable to a wide variety of problems.

One such problem comes from the field of marketing: the optimization of large-scale Internet advertising campaigns. I apply both approaches to the real-world problem encountered by firms and advertising agencies when attempting to select a set of websites to use for advertising and then allocating an advertising budget over those websites. Due to the sheer number of websites available for advertising, all of which represent a unique advertising opportunity with varying cost, traffic, and so on, these selection and allocation questions can present daunting challenges to anyone wanting to advertise online. At the same time, online advertising is becoming increasingly necessary to businesses of all industries. To address this, I show how the efficient penalized approaches developed in the general case can be specifically applied to this problem. I carry out two case studies using real data from comScore, Inc., which exemplify campaign considerations faced by companies today. Further, I show how these methods significantly improve on the methods currently available.

I also discuss future directions for the research in this thesis, particularly in the real-world application of Internet media campaigns. Though this research could be extended in a wide variety of directions, I focus on furthering the marketing application to extend to the most current advertising procedure, bidding on ad impressions in real time, as well as extending the penalized and constrained approach to include time-varying objective functions (for problems in which the behavior of one time period affects the optimization of a second time period).