CUSTOMIZED DATA MINING OBJECTIVE FUNCTIONS by Greg Harris A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE) May 2017 Copyright 2017 Greg Harris Dedication To Rhonda, for her encouragement and support. ii Acknowledgments I would like to express my gratitude to my advisor, Professor Viktor K. Prasanna. He has truly supported me in my research in every sense of the word. As with each of his students, he has placed my growth, development, and happiness as the highest priorities. He has exemplified wisdom, excellence, patience, steadiness, optimism, and kindness. I am fortunate to have had him as a mentor. I would like to express my gratitude to Professor Anand Panangadan, with whom I collaborated even after his time as a Postdoctoral Research Associate in our lab came to an end. Anand’s example and guidance helped me become a better researcher. I value his friendship and association. I appreciate the help and support of all the good people at Chevron, particularly our champions Brian Thigpen, Tamas Nemeth, and Randy McKee. I also appreciate the assistance and kindness of Juli Legat, Kathryn Kassar, and Janice Thompson. Most of all, I express my gratitude to my wife, Rhonda Harris. She built me up when I was discouraged. She has been a pillar of support and encouragement, from the beginning with graduate school applications all the way through the defense and the job interviews. I appreciate all the times she put our children to bed on her own while I stayed late at the lab. iii Table of Contents Dedication ii Acknowledgments iii List of Tables vii List of Figures ix Abstract xiii Chapter 1: Introduction 1 1.1 Interpretable Machine Learning . . . . . . . . . . . . . . . . . . . . 1 1.1.1 Importance of Interpretability . . . . . . . . . . . . . . . . . 2 1.1.2 Properties of Interpretable Models . . . . . . . . . . . . . . 5 1.1.3 Rule-based Models . . . . . . . . . . . . . . . . . . . . . . 7 1.2 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.3 Research Contributions . . . . . . . . . . . . . . . . . . . . . . . . 11 1.3.1 Learning Objective Functions from the Data . . . . . . . . . 12 1.3.2 Incorporating Domain Knowledge into Objective Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 1.3.3 Learning Objective Functions from Domain Expert Feed- back . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.3.4 GPU-Accelerated Beam Search . . . . . . . . . . . . . . . 14 Chapter 2: Background 15 2.1 Classification Rule Induction . . . . . . . . . . . . . . . . . . . . . 15 2.1.1 Classification Rule Evaluation Metrics . . . . . . . . . . . . 16 2.1.2 Search Algorithms . . . . . . . . . . . . . . . . . . . . . . 21 2.1.3 Separate-and-Conquer . . . . . . . . . . . . . . . . . . . . 23 2.1.4 Classification Rule Learning Systems . . . . . . . . . . . . 23 2.2 Regression Rule Induction . . . . . . . . . . . . . . . . . . . . . . 26 2.2.1 Regression Rule Evaluation Metrics . . . . . . . . . . . . . 26 2.2.2 Regression Rule Learning Systems . . . . . . . . . . . . . . 27 iv Chapter 3: Related Work 30 3.1 Measure Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.2 Choosing the Frequency Bias . . . . . . . . . . . . . . . . . . . . . 32 3.3 Choosing Search Parameters . . . . . . . . . . . . . . . . . . . . . 33 3.4 Randomization Methods . . . . . . . . . . . . . . . . . . . . . . . 
34 3.5 Comparison against Real Human Interest . . . . . . . . . . . . . . . 34 Chapter 4: Learning Objective Functions from the Data 36 4.1 Precision@k . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.2 FrontierMiner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.2.1 Pareto Frontier . . . . . . . . . . . . . . . . . . . . . . . . 43 4.2.2 Penalty Functions . . . . . . . . . . . . . . . . . . . . . . . 45 4.2.3 Bootstrapping . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.2.4 Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 4.3 Empirical Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.3.1 Baseline Method . . . . . . . . . . . . . . . . . . . . . . . 52 4.3.2 Experiment 1: Untuned Baseline on Synthetic Data . . . . . 54 4.3.3 Experiment 2: FrontierMiner and Tuned Baseline on Syn- thetic Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.3.4 Experiment 3: FrontierMiner and Tuned Baseline on Real- world Data . . . . . . . . . . . . . . . . . . . . . . . . . . 61 4.3.5 Learning Multiple Rules . . . . . . . . . . . . . . . . . . . 69 4.4 Computational Complexity . . . . . . . . . . . . . . . . . . . . . . 74 4.4.1 Computational Complexity of the Untuned Baseline . . . . . 74 4.4.2 Relative Complexity of the Tuned Baseline . . . . . . . . . 75 4.4.3 Relative Complexity of FrontierMiner . . . . . . . . . . . . 75 Chapter 5: Incorporating Domain Knowledge into Objective Functions 78 5.1 Intervention Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 78 5.2 Intervention Optimization . . . . . . . . . . . . . . . . . . . . . . . 79 5.3 Interpretability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 5.4 PRIMER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 5.4.1 Separate-and-Conquer with Beam Search . . . . . . . . . . 84 5.4.2 Objective Function . . . . . . . . . . . . . . . . . . . . . . 85 5.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 5.5.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 5.5.2 PRIMER Settings . . . . . . . . . . . . . . . . . . . . . . . 90 5.5.3 Evaluation Method . . . . . . . . . . . . . . . . . . . . . . 92 5.5.4 Baselines for Comparison . . . . . . . . . . . . . . . . . . . 92 5.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 5.6.1 Example Rules . . . . . . . . . . . . . . . . . . . . . . . . 96 v 5.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 Chapter 6: Learning Objective Functions from Domain Expert Feed- back 100 6.1 Finance Background . . . . . . . . . . . . . . . . . . . . . . . . . 102 6.1.1 Equity Graphs . . . . . . . . . . . . . . . . . . . . . . . . . 102 6.1.2 Distribution-Based Measures . . . . . . . . . . . . . . . . . 103 6.1.3 Multi-Period-Based Measures . . . . . . . . . . . . . . . . 104 6.1.4 Selecting an Investment Performance Metric . . . . . . . . . 107 6.2 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 6.2.1 Generating Equity Graphs . . . . . . . . . . . . . . . . . . 109 6.2.2 Collection of Ranking Data . . . . . . . . . . . . . . . . . . 109 6.2.3 Data Quality . . . . . . . . . . . . . . . . . . . . . . . . . . 110 6.2.4 Learning-to-Rank . . . . . . . . . . . . . . . . . . . . . . . 112 6.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 113 6.3.1 Simple Model . . . . . . . . . . . . . . . . . . . . . . . . . 119 6.4 Idiosyncratic Preferences . . . . . . . . . . 
. . . . . . . . . . . . . 121 Chapter 7: Conclusion 125 7.1 FrontierMiner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 7.2 PRIMER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 7.3 Learning Investment Performance Metrics . . . . . . . . . . . . . . 127 7.4 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 Bibliography 130 Chapter A: GPU-Acclerated Beam Search 140 vi List of Tables 2.1 Contingency table for rule A! B showing notational conven- tions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.2 Non-parameterized classification rule evaluation metrics . . . . . . 18 2.3 Parameterized classification rule evaluation metrics . . . . . . . . . 20 2.4 Regression rule evaluation metrics, wheren is the number of sam- ples being evaluated,y i is the true value, ^ y i is the predicted value, y is the mean value of all samples, andr is the current rule. . . . . . . 27 4.1 Synthetic data attribute values are drawn from the following distri- butions, and samples and rules are randomly generated such that the attribute conditions are satisfied. . . . . . . . . . . . . . . . . . . . 55 4.2 Un-tuned rule-finding methods compared against the baseline. . . . 58 4.3 Results of Experiment 2, evaluating the effect of tuning the param- eters of the baseline method using grid search and cross-validation. Parameters in parenthesis (m, b, l) indicate the tuned parameters. Parameters left untuned are set to the default values from the base- line method: m-estimate withm = 22:466, beam search withb = 5, and maximum rule length set tol = 10. p-Values are from a paired- sample sign test. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 vii 4.4 Results of Experiment 3 on 138 real-world classification tasks. Each method is compared against the baseline method (m-estimate with m = 22:466, beam search withb = 5, and maximum rule length set tol = 10) to calculate the win-loss-tie record. p-Values are from a paired-sample sign test. . . . . . . . . . . . . . . . . . . . . . . . . 63 4.5 Results of Experiment 3 on 1,000 synthetic classification datasets. Each method is compared against the baseline method (m-estimate with m = 22:466, beam search with b = 5, and maximum rule length set tol = 10) to calculate the win-loss-tie record. p-Values are from a paired-sample sign test. . . . . . . . . . . . . . . . . . . 64 4.6 Symbols used in the complexity expression. . . . . . . . . . . . . . 74 5.1 Average total response of the events covered by the topx% of pre- dictions of each model (top 1%, top 5%, etc.) . . . . . . . . . . . . 95 5.2 Highest-impact rules learned by PRIMER on the insider trading data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 6.1 Baseline accuracies when tested against filtered community rank- ings (FCR), all community rankings (ACR), and author rankings (AR). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 6.2 Mean and standard deviation of accuracies across the (13) partici- pants who ranked 25 or more chart pairs, for selected top perfor- mance measures. . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 viii List of Figures 1.1 A Weka screenshot showing an example of the many options rule learning algorithms can have. . . . . . . . . . . . . . . . . . . . . . 10 2.1 An example binary classification rule . . . . . . . . . . . . . . . . . 
15 4.1 An illustration of the trade-off between precision and estimation error for an example generating function. The expected value of precision@k for the top k samples decreases with each additional sample, however the confidence intervals get tighter. We have cho- sen precision@100 as our evaluation metric, because it has reason- ably tight confidence intervals yet maintains an emphasis on the top predictions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 4.2 Penalty isometric line illustration for them-estimate (m = 22:466), with equal class proportions withn(B) = 500 andn(:B) = 500. Each isometric line corresponds to a level of rule precision. . . . . . 47 4.3 Example penalty curve learned by FrontierMiner after 10 sampling iterations. Each blue point represents the amount by which the pre- cision of a Pareto-optimal rule was overestimated. The red line is fitted to the points using least-squares isotonic regression. . . . . . . 50 ix 4.4 The number of datasets where cross-validation selected each beam size. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 4.5 (1 of 4) Cumulative precision and coverage of rules found using a separate-and-conquer strategy. The synthetic datasets shown were selected for their diversity. FrontierMiner was run with iters = 20. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 4.6 (2 of 4) Cumulative precision and coverage of rules found using a separate-and-conquer strategy. The synthetic datasets shown were selected for their diversity. FrontierMiner was run with iters = 20. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 4.7 (3 of 4) Cumulative precision and coverage of rules found using a separate-and-conquer strategy. The synthetic datasets shown were selected for their diversity. FrontierMiner was run with iters = 20. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 4.8 (4 of 4) Cumulative precision and coverage of rules found using a separate-and-conquer strategy. The synthetic datasets shown were selected for their diversity. FrontierMiner was run with iters = 20. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 4.9 Log-log plot showing that in practice FrontierMiner exhibits poly- nomial computational complexity. The slope of the red line is approx- imately 0:6, making the complexityO(itersNM 0:6 ). The runtime is normalized by dividing by the number of samples (N) in order to reduce noise and focus on the complexity as a function of the num- ber of features (M). Each point on this plot represents one synthetic dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 x 4.10 Histogram of FrontierMiner relative runtimes on 1,000 synthetic datasets, withiters = 50. . . . . . . . . . . . . . . . . . . . . . . . 77 5.1 Illustration using synthetic data, where each response sample is gen- erated using the power law function with white noise added. Aver- aging many samples improves the goodness-of-fit, raising the lower confidence bound and increasing the score of a rule. . . . . . . . . . 88 5.2 Average response of all events in the insider purchase dataset. . . . . 91 5.3 Average total response of the events covered by the topx% of pre- dictions. PRIMER is highlighted to show it is competitive with state-of-the-art models. . . . . . . . . . . . . . . . . . . . . . . . . 95 6.1 The red chart on the right was generated by permuting the daily returns from the blue chart on the left. 
Both have the same dis- tribution of daily returns, and hence the same daily Sharpe ratio. This figure illustrates how distribution-based performance measures cannot capture some features preferred by traders, such as a small maximum drawdown. . . . . . . . . . . . . . . . . . . . . . . . . . 105 6.2 The Pain index is the area colored black. The Ulcer index is the root mean squared height of each vertical black line. . . . . . . . . . . . 106 6.3 “Max Days Since First at This Level” is the longest horizontal line that can be drawn between two points on the graph. . . . . . . . . . 107 6.4 Web page used to collect pairwise investment performance ranking data from traders and quantitative analysts. . . . . . . . . . . . . . . 111 6.5 Accuracy of learning-to-rank models trained and tested on crowd- sourced “real human interest” data in the form of pairwise rank- ings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 xi 6.6 These charts show the relationship between accuracy, number of rankings, and confidence for two users (an author, and the commu- nity member with the greatest contribution – User42). The results are based on randomly choosingk of the user’s rankings, using them to choose the most accurate performance measure, then testing the accuracy of that measure on the rest of the user’s rankings. Confi- dence was evaluated by doing this calculation 1,000 times perk. k is capped at 50 because User42 only ranked 100 chart pairs, and we held out at least 50 for testing. . . . . . . . . . . . . . . . . . . . . 123 xii Abstract Data mining is the process of sifting through large amounts of information to find interesting patterns or relationships. It is a broad field because of the many possible ideas of what “interesting” means. Each task should ideally be matched with a custom solution, tailored to the objective and to the characteristics of the data. The problem is that current algorithm options and parameters are not specified in terms that align with domain expert knowledge. The result is that domain expertise yields little benefit, and that options and parameters must be set through trial-and-error or simply left at default recommended values. Customizing data mining objective functions is difficult because it involves specify- ing the trade-offs between multiple heuristics. Some heuristics align with the domain objective, some favor simplicity and parsimony, and others are included for regulariza- tion, or to reduce the chances of finding spurious patterns. Furthermore, the optimal objective function is dependent on the search algorithm. For example, the effect of over-searching, or finding false discoveries by chance due to an extensive search, can be mitigated by choosing an objective function with higher preference for frequently occur- ring patterns. In general, even among use-cases with the same objective, the optimal set of techniques and parameters to use varies due to differences in the data. xiii Our approach to objective function customization is two-pronged: First, we explore new and existing ways for domain experts to contribute the knowledge they have. Sec- ond, we make extensive use of bootstrapping and cross-validation to customize objective functions in the absence of domain expertise. Our first contribution is FrontierMiner, a new rule-based algorithm for predicting a target class with high precision. It has no tuning parameters and learns an objective function directly from the data. 
We show evidence from a large-scale study that Fron- tierMiner finds higher-precision rules more often than competing systems. Our second contribution is PRIMER, a new algorithm for maximizing event impact on time series. It has an objective function that adapts to the level of noise in the data. It also incorporates user-provided input on the expected response pattern as a heuristic that helps prevent over-fitting. Our third contribution is a method of learning an objective function from user feedback in the form of pairwise rankings. With this feedback, we use learning- to-rank algorithms to combine existing heuristics into an overall objective function that more closely matches the user’s preference. xiv Chapter 1 Introduction Data mining is about knowledge discovery. It is the process of discovering interest- ing patterns and relationships in large volumes of data. It is commonly performed by domain experts looking for information that can be used to further a business objective. Data mining practitioners make use of machine learning algorithms. Their goal is not necessarily to automate a task. Rather, their goal is to gain further insight by examining the trained model. In order to gain insight through machine learning, a domain expert must be able to understand the reasons the model makes the predictions that it does. The model must be interpretable in some way. Our work in this thesis seeks to improve the accuracy of interpretable machine learn- ing methods while at the same time improving one aspect in which such methods remain uninterpretable. 1.1 Interpretable Machine Learning Interpretability in machine learning is a multi-faceted concept. In this section, we discuss why interpretability is important and how interpretability can have many defini- tions. Additional detail and discussion can be found in the recent publication by Lipton et al. (2016). 1 1.1.1 Importance of Interpretability Trust Trust is prerequisite for the adoption and deployment of machine learning models. Empirical evidence that a model makes accurate predictions naturally increases trust. However, empirical evidence alone is not always sufficient. Swartout (1983) stated: Trust in a system is developed not only by the quality of its results, but also by clear description of how they were derived. Domain experts can be wary of data-derived predictions and want to understand the reasoning behind them. In some domains, even the term “data mining” is used in a pejorative sense, referring the practice of selectively trying to find data to support a particular hypothesis (Berry & Linoff, 1997). Spurious findings can even occur when researchers follow best practices, carefully evaluating models on held-out data. This can happen due to the problem of multiple testing, where many researchers are investigating the same data, leading to false discoveries. A satirical example of flawed data analysis is a study by Leinweber (2007). After searching through thousands of economic variables, he found a model that fits the US stock market (S&P 500) withR 2 = 0:99. The three variables in the model were: butter production in Bangladesh, US cheese production, and the sheep population in Bangladesh! Data snooping by forecasters is a significant problem in Finance, because the market has only one past, yet new predictive variables are constantly devised and tested. Interpretability is one tool for identifying over-fitting. It allows domain experts to vet what was learned from the data. 
By comparing the findings against prior knowledge, domain experts gain trust in a model. 2 Causality Machine learning models can only find associative relationships in data. They cannot concretely establish the presence of causal relationships. Causality can only be deter- mined through scientific investigation and controlled experimentation. Interpretable models have the benefit of making plain the associative relationships discovered. These relationships can then be treated as hypotheses and further studied. In fields such as Medicine, understanding cause-effect relationships is the ultimate goal. While associative relationships are useful in diagnosis, causal relationships are directly useful in finding cures and improving outcomes for patients. Transferability Interpretable models give humans the opportunity to judge whether the associations being learned will likely generalize and transfer well to unseen situations. This can help make models less brittle. A relevant example is described by Dreyfus & Dreyfus (1992) regarding an effort by the army to train a neural network to recognize tanks partly hidden by trees in the woods. The training was successful, in that the model had high accuracy recognizing tanks in images from the dataset that were held out for testing. However, after taking more pictures in the same woods, the researchers found the model to perform very poorly. The mystery was only solved when someone finally noticed the original photos of woods without tanks were all taken on an overcast day, and the photos with tanks were all taken on a sunny day. The neural network apparently learned to identify shadows instead of tanks. Had it been possible to gain insight into the reasons behind the neural network predictions, the researchers would have more easily recognized the reliance on such an ungeneralizable association. 3 The transferability of learned associations is important in situations where where the deployment environment is not perfectly represented by the training data. This is partic- ularly important in situations where actions influenced by a model alter the environment, invalidating future predictions (Lipton et al., 2016). Examples of this can be found in the field of Healthcare, where it is usually not ethical to to perform randomized tests in order to collect unbiased training data. Instead, each patient is given care according to the best practices known at the time. Caruana et al. (2015) describe a specific example of this, and show how interpretable machine learning helped discover and correct the bias. Their task was to learn to predict the possibility of death for pneumonia patients, so that patients with high risk could be admitted to the hospital instead of being treated as outpatients. They built on earlier work (Cooper et al., 1997), and used multiple machine learning algorithms to predict the risk of death based on features describing each patient. They found that the most accurate predictive models were relatively unintelligible. In the end, an interpretable model was chosen for deployment, because it uncovered a bias and allowed them to correct it. By reviewing the interpretable model, the researchers found it predicted that patients with asthma were considered low-risk. This is known to be false, yet it was reflected in the data. The reason is that in this situation, patients with asthma were considered at such high risk that by policy they were admitted not only to the hospital but directly to the Intensive Care Unit. 
Such aggressive care was the reason asthma patients had such good outcomes. The interpretable model was able to be manually corrected to remove the bias in the data related to asthma. In this way, domain experts identified a non-transferable association that was likely also learned by the more complex models. 4 Informativeness Informativeness describes how well a model imparts what it learned from the data in a way humans can understand. It is an important attribute for situations such as data mining, where the purpose of the model is to provide evidence and assistance to human decision makers. 1.1.2 Properties of Interpretable Models Transparency Transparency in machine learning can have multiple definitions. It can refer specif- ically to the understandability of the trained model. One way a trained model can be transparent is if it can be simulated in the mind. This means that within a reasonable amount of time, a person can recreate the prediction process. Huysmans et al. (2011) propose evaluating the comprehensibility of predictive mod- els through user testing. They presented model representations and questions to users. They asked users to perform model-based classification tasks, to answer logical yes/no questions, and to determine the equivalence of two models. Example questions included: Is it correct that applicants with a high income are more likely to be accepted than applicants with a low income? and How does the representation classify observation X? They assert that users presented with more-comprehensible models would answer the questions with higher accuracy, lower response time, and with more confidence in their answers. We cite this study because it helps define model transparency using concrete metrics. 5 Transparency must sometimes be traded-off for accuracy. Models produced by com- plex modern machine learning algorithms can be highly accurate, but relatively unintel- ligible. Examples include: deep neural networks, kernelized support vector machines, random forests, bagged trees, boosting methods, and ensemble methods. Even models that are classically considered transparent can become unwieldy, such as when rule lists become long or when decision trees become deep. A complex model may still be considered transparent in some way if it can be decomposed into parts that make sense. Domain experts may not be able to contem- plate an entire model at once, but my still find value in studying components of the model individually. Components can be individual nodes in a decision tree, for exam- ple, or individual rules in a rule list. Linear models may also be considered transparent in this way, because individual parameters represent the strength of association between features and labels. Decomposability requires that the features are sensible and inter- pretable, and not highly engineered or anonymous. Transparency in machine learning may also refer not to the model, but rather to the algorithm for training the model. For example, one may feel more comfortable with an algorithm that is guaranteed to converge to a unique solution. Post-hoc Interpretability When accuracy cannot be sacrificed and a complex model is needed, post-hoc meth- ods of interpretation can still be used to gain trust in the model. These methods do not fully describe the exact prediction reasoning, but they do give some explanation. This is similar to how humans might partially explain complex reasoning using only simple natural language explanations. 
6 One method of post-hoc interpretation is the use of text-based explanations. For example, with topic modeling, latent topics can often be explained or represented by showing the top words associated with each topic. Another method of post-hoc interpretation is the use of case-based explanations (Caruana et al., 1999). Methods such as k-nearest neighbors can be used to identify similar situations from past data that help justify a prediction. This is similar to the use of case studies in Medicine. Visualizations can also be used to explain what a model has learned. With image classification using convolutional neural networks, an image-specific class saliency map can be created to highlight the areas of the given image most discriminative with respect to the given class (Simonyan et al., 2013). These maps take the form of a new image and often visibly represent the class in a recognizable way. 1.1.3 Rule-based Models Arguably, the most interpretable form of model is a short set of concise rules. Rules take the form, “IF <conditions>, THEN <prediction>.” Modern rule induction has its roots in Psychology. It was called “concept attainment” by Bruner, Goodnow, and Austin in their book A Study of Thinking (Bruner et al., 1956). The authors spent three years performing dozens of experiments to better understand how people learn in a vari- ety of conditions. They argue strongly that that rule learning is at the core of how people think: We wish to make it as clear as possible that the task of isolating and using a concept is deeply imbedded [sic] in the fabric of cognitive life; that indeed it represents one of the most basic forms of inferential activity in all cognitive life. (p. 79) 7 Rule-based machine learning techniques are a natural choice for data mining tasks, due to their close alignment with the way people learn and understand concepts. For this reason, much of the work described in this thesis is focused on the improvement of rule learning techniques. 1.2 Problem Description At a high level, the problem we seek to address in this thesis is the lack of inter- pretability in mechanisms for adapting data mining algorithms to individual domains. Many data mining and machine learning algorithms have tuning parameters and user- selected options. These hyper-parameters allow the algorithms to be tuned and cus- tomized. This makes the algorithms suitable for wider use, in more domains. The prob- lem is that, with few exceptions, the hyper-parameters are uninterpretable to domain experts. By uninterpretable, we mean that even extensive domain knowledge does not prepare the user to estimate the optimal parameter value or choose the option best suited to the domain. These algorithms ask questions the user cannot answer. An example is the regularization parameterC used in support vector machines to control the trade-off between the margin and the size of the slack variables (Cortes & Vapnik, 1995). The optimal value ofC cannot be estimated through knowledge of the domain, even when coupled with knowledge of the theory and mathematics behind support vector machines. Rather,C must be optimized through trial and error using held-out data. Evaluating more than one hyper-parameter value using the same held-out data is a form of multiple testing, which can lead to over-fitting. Multiple testing becomes a problem as the number of hyper-parameters increases, because the number of parameter 8 combinations grows exponentially with the number of hyper-parameters. 
The use ofk- fold cross-validation reduces the chances of over-fitting but increases the computational cost of the optimization. Rule learning algorithms are among those that tend to have many parameters and options. Figure 1.1 shows one example of a rule learning algorithm with many hyper- parameters. Using cross-validation to optimize the combination of so many hyper- parameters is time consuming and prone to over-fitting. It is not surprising that data mining practitioners and even researchers often use default or general-purpose recom- mended parameter values instead of tuning on each dataset. A review of 100 exper- imental studies on Inductive Logic Programming found that only 38 mention values assigned to some parameters, and only 17 describe an enumerative search over a small set of possible values to optimize typically one parameter (Srinivasan & Ramakrishnan, 2011). In this thesis, we focus specifically on the improvement of data mining objective functions through domain-specific customization. The methods we explore all improve the user experience by eliminating uninterpretable hyper-parameters. We explore three methods of objective function customization: 1. Learning objective functions from data, with no user input 2. Incorporating domain knowledge into objective functions 3. Learning objective functions from domain expert feedback 9 Figure 1.1: A Weka screenshot showing an example of the many options rule learning algorithms can have. 10 1.3 Research Contributions Many data mining techniques, such as rule learning, involve maximizing an objec- tive function. These techniques are based on heuristics. We define a heuristic as a sensible rule of thumb learned through past experience which is generally expected by practitioners to guide a data mining algorithm toward its primary objective. The most obvious heuristic is that the past is useful in predicting the future. In particular, if the primary objective is to find classification rules with high-precision predictions on future data, this heuristic says it would be sensible to find rules with high-precision predictions on historical data. Following just this one heuristic, however, often leads to over-fitting. A second common heuristic is that rules which cover many examples in historical data are more reliable and less likely to be spurious. Objective functions in rule learning incorporate this heuristic, as well as the first. These objective functions are called interestingness measures. They are bivariate functions, producing a score based on both precision and some measure of frequency. A third common heuristic, acknowledging the problems associated with multiple testing, is that reducing the search space leads to the discovery of fewer spurious rules. This heuristic is usually incorporated into the search algorithm. For example, greedy search methods are often able to find good solutions while covering only a fraction of the search space. Another example is using feature selection to reduce the search space to cover only the most promising attributes. Optimally defining the heuristic trade-offs for any given task is challenging. In this thesis, we examine three methods of combining heuristics into domain-customized objective functions. Our goals include improving model accuracy, taking advantage of available domain knowledge, and better representing user preferences in an objective function. 
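To make the trade-off concrete before the contributions are introduced, the short sketch below shows one hypothetical way the first two heuristics (in-sample precision and frequency) could be blended into a single rule score; the weight w and the functional form are illustrative assumptions, not an objective function proposed or used in this thesis.

```python
# Illustrative only: one hypothetical blend of the precision and frequency
# heuristics into a single score. The weight `w` is an assumption for exposition.

def rule_score(tp: int, fp: int, n_total: int, w: float = 0.8) -> float:
    """Score a rule from its covered true positives (tp) and false positives (fp)."""
    covered = tp + fp
    if covered == 0:
        return 0.0
    precision = tp / covered        # heuristic 1: high precision on historical data
    coverage = covered / n_total    # heuristic 2: frequently occurring patterns
    return w * precision + (1.0 - w) * coverage

# A narrow, precise rule versus a broad, less precise one (1,000 training samples):
print(rule_score(36, 4, 1000))     # 40 covered at 90% precision
print(rule_score(280, 120, 1000))  # 400 covered at 70% precision
```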
11 1.3.1 Learning Objective Functions from the Data The first contribution in this thesis, found in Chapter 4, is in the area of classifi- cation rule learning. As we described previously, objective functions in classification rule learning define a trade-off between in-sample precision and frequency. Dozens of general-purpose interestingness measures have been proposed as objective functions. Some are parameterized, giving the user one degree of control over the trade-off func- tion. Domain experts have no intuitive way of knowing which measure is best suited to their domain, or what parameter value to use, when applicable. They must either choose a general-purpose interestingness measure shown to work well in many domains, or they must use trial-and-error or cross-validation to find the one that best helps them reach their objective. We introduce a new method, FrontierMiner, for rule learning which does not use a pre-specified formulaic interestingness measure. Instead, it uses bootstrapping to learn a domain-specific objective function. It learns a non-parametric penalty function for cor- recting the over-optimism of in-sample precision. We show that FrontierMiner outper- forms the general-purpose interestingness measures in our experiments involving 1,000 synthetic datasets and 138 real-world datasets. It also outperforms the cross-validation method of selecting or tuning an interestingness measure. FrontierMiner involves no parameter tuning or option selection, which is important for the user experience. 1.3.2 Incorporating Domain Knowledge into Objective Functions Our second contribution, in Chapter 5, is in an area we introduce as Intervention Optimization. Intervention optimization extends classical intervention analysis to a data mining framework with thousands of intervention events and time series. 12 Intervention optimization can be performed using traditional regression techniques. However, such techniques fail to incorporate domain knowledge that is available for this task: the expected response pattern. We introduce PRIMER, a new regression rule system developed for intervention optimization. PRIMER allows a domain expert to specify a transfer function, which is the pattern expected to appear in the time series following an intervention. The PRIMER objective function incorporates the goodness of fit to the transfer function. This helps it avoid finding spurious rules covering events with apparently large impacts but improb- able response patterns. By incorporating the transfer function into the objective function, PRIMER achieves accuracy competitive with state-of-the-art regression algorithms in a large-scale event study. The rule sets produced by PRIMER are also more interpretable than those pro- duced by competing regression rule systems for the purpose of intervention optimiza- tion. 1.3.3 Learning Objective Functions from Domain Expert Feedback Our third contribution, in Chapter 6, evaluates a method of learning an objective function directly from domain expert feedback. The objective functions we study are investment performance measures common in financial data mining. These can be used to screen and rank investment performance time series from stocks, mutual funds, and hedge funds. They can also be used to evaluate the hypothetical performance of system- atic trading strategies back-tested on historical data. Over 100 performance measures have been proposed, primarily differing on how they define historical risk. 
There is no consensus as to which measure is best to use. The choice lies with the user, however the user has no intuitive way of knowing which measure is best suited to their needs. Our research evaluates the use of learning-to-rank algorithms to learn a function of the proposed measures that more closely matches user preferences. For training and testing data, we used crowd-sourcing to collect pairwise rankings of equity graphs from a community of domain experts. Furthermore, we show that learning an individualized objective function based on a reasonably small number of pairwise rankings can more accurately represent an individual user's preferences.

1.3.4 GPU-Accelerated Beam Search

Our final contribution, described in Appendix A, is an algorithm for accelerating beam search on a general-purpose graphics processing unit (GPU). We are the first to show a significant speedup, up to 28x, when performing beam search on a GPU as compared to a multi-threaded CPU implementation. Beam search, which is described in detail in Section 2.1.2, is the most commonly used search algorithm in classification rule finding. We use beam search extensively in the experiments in Chapter 4 to establish a baseline level of accuracy achievable through currently proposed rule-finding methods tuned with cross-validation. The scale of our experiments is orders of magnitude larger than other comparative studies we have found in the literature. We estimate the experiments would have taken one year to complete on a single commodity desktop machine, had we used a CPU implementation instead of our GPU implementation. Taking advantage of the massively parallel architecture of a commodity GPU allows us to evaluate computationally intensive techniques such as cross-validation at a larger scale.

Chapter 2: Background

In this chapter, we provide background information on rule-based interpretable machine learning methods. We define terms and give context to the discussion in subsequent chapters. Our research builds upon the concepts described here. We first describe classification rules and strategies for learning them. We then describe regression rules and some systems that have been developed for learning them.

2.1 Classification Rule Induction

Classification rules take the form A → B, which is understood to read, "If A, then B." A is referred to as the body, and B is referred to as the head (see Figure 2.1).

Figure 2.1: An example binary classification rule.

In binary classification, the head is always the same: (class = Positive). The body consists of a conjunction of logical conditions placed on input attributes, called features. Rule length is the number of features in the body. For categorical attributes, each attribute-value pair is a feature. Continuous variables must be discretized. The discretization can be applied as a set of disjoint value ranges over an attribute, for example:

(12 < X1 ≤ 15) ∧ (0.1 < X2 ≤ 1.3) → (class = Positive)

The discretization can also be applied as a set of cut-points bisecting the attribute value range:

(X1 ≥ 3) ∧ (X2 < 0.1) → (class = Positive)

In this case, constraints should be applied to avoid searching for mutually exclusive features, such as:

(X1 ≥ 3) ∧ (X1 < 1) → (class = Positive)

or for uninformative combinations of features, such as:

(X1 ≥ 3) ∧ (X1 ≥ 4) → (class = Positive)
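As a concrete illustration of this representation, the sketch below encodes a rule body as a conjunction of interval conditions over discretized attributes and counts how many positive and negative samples it covers; the attribute names, thresholds, and data are hypothetical.

```python
# A rule body as a conjunction of interval conditions (attributes and data are made up).
conditions = [("X1", 12.0, 15.0),   # 12 < X1 <= 15
              ("X2", 0.1, 1.3)]     # 0.1 < X2 <= 1.3

def covers(sample: dict) -> bool:
    """The body is a logical AND: every condition must hold."""
    return all(lo < sample[attr] <= hi for attr, lo, hi in conditions)

data = [({"X1": 13.2, "X2": 0.5}, "Positive"),
        ({"X1": 14.0, "X2": 0.9}, "Negative"),
        ({"X1": 12.5, "X2": 1.0}, "Positive"),
        ({"X1": 11.0, "X2": 0.7}, "Negative")]

# Counts in the notation introduced in the next section: n(AB), n(A¬B), N.
n_ab = sum(1 for x, y in data if covers(x) and y == "Positive")
n_a_notb = sum(1 for x, y in data if covers(x) and y == "Negative")
print(n_ab, n_a_notb, len(data))   # -> 2 1 4
```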
2.1.1 Classification Rule Evaluation Metrics

Rule-finding is a combinatorial optimization problem with the objective of maximizing an interestingness measure. Many interestingness measures have been proposed. Surveys on the topic compare and contrast the proposed measures (Tan et al., 2002; Geng & Hamilton, 2006; Lenca et al., 2008; Janssen & Fürnkranz, 2010a). The choice of measure depends partly on personal preference, and partly on the data itself. Interestingness measures are calculated from values found in a 2×2 contingency table, as seen in Table 2.1.

Table 2.1: Contingency table for rule A → B showing notational conventions.
                     Covered by rule   Not covered by rule
Positive samples     n(AB)             n(¬AB)                 n(B)
Negative samples     n(A¬B)            n(¬A¬B)                n(¬B)
                     n(A)              n(¬A)                  N

We use the same notational conventions as Geng & Hamilton (2006). We denote counts with n(·) and probabilities with P(·). The number of positive samples covered by the rule is n(AB). The total number of positive samples is n(B). The number of negative samples covered by the rule is n(A¬B). The total number of negative samples is n(¬B). The overall total number of samples is N. In this notation, precision is defined as:

precision = n(AB) / n(A) = P(B|A)    (2.1)

Interestingness measures are designed to trade off precision with frequency, which is also called completeness (Fürnkranz et al., 2012). Frequency can mean either coverage:

coverage = n(A) / N = P(A)    (2.2)

or support:

support = n(AB) / N = P(AB)    (2.3)

Each measure has a unique way of trading off precision for frequency. A measure favoring precision in a noisy dataset may over-fit by finding rules that cover only individual positive samples in the training data. Likewise, a measure favoring frequency may under-fit by overlooking low-frequency, but high-precision rules. Table 2.2 lists and defines the non-parameterized interestingness measures we evaluate in this thesis. Table 2.3 lists and defines the parameterized measures we evaluate.

Table 2.2: Non-parameterized classification rule evaluation metrics
Support: n(AB), or sometimes P(AB)
Confidence/Precision: P(B|A)
Coverage: n(A), or sometimes P(A)
Prevalence: P(B)
Recall: P(A|B)
Specificity: P(¬B|¬A)
Accuracy: P(AB) + P(¬A¬B)
Lift/Interest: P(B|A) / P(B)
Leverage: P(B|A) − P(A)P(B)
Added Value: P(B|A) − P(B)
Relative Risk: P(B|A) / P(B|¬A)
Jaccard: P(AB) / (P(A) + P(B) − P(AB))
Certainty Factor: (P(B|A) − P(B)) / (1 − P(B))
Odds Ratio: P(AB)P(¬A¬B) / (P(A¬B)P(¬AB))
Yule's Q: (P(AB)P(¬A¬B) − P(A¬B)P(¬AB)) / (P(AB)P(¬A¬B) + P(A¬B)P(¬AB))
Yule's Y: (√(P(AB)P(¬A¬B)) − √(P(A¬B)P(¬AB))) / (√(P(AB)P(¬A¬B)) + √(P(A¬B)P(¬AB)))
Conviction: P(A)P(¬B) / P(A¬B)
Collective Strength: (P(AB) + P(¬B|¬A))(1 − P(A)P(B) − P(¬A)P(¬B)) / ((P(A)P(B) + P(¬A)P(¬B))(1 − P(AB) − P(¬B|¬A)))
Laplace Correction: (n(AB) + 1) / (n(A) + 2)
Gini Index: P(A)(P(B|A)² + P(¬B|A)²) + P(¬A)(P(B|¬A)² + P(¬B|¬A)²) − P(B)² − P(¬B)²
J-Measure: P(AB) log(P(B|A) / P(B)) + P(A¬B) log(P(¬B|A) / P(¬B))
One-Way Support: P(B|A) log₂(P(AB) / (P(A)P(B)))
Two-Way Support: P(AB) log₂(P(AB) / (P(A)P(B)))
Two-Way Support Variation: P(AB) log₂(P(AB) / (P(A)P(B))) + P(A¬B) log₂(P(A¬B) / (P(A)P(¬B))) + P(¬AB) log₂(P(¬AB) / (P(¬A)P(B))) + P(¬A¬B) log₂(P(¬A¬B) / (P(¬A)P(¬B)))
Linear Correlation: (P(AB) − P(A)P(B)) / √(P(A)P(B)P(¬A)P(¬B))
Piatetsky-Shapiro: P(AB) − P(A)P(B)
Cosine: P(AB) / √(P(A)P(B))
Loevinger: 1 − P(A)P(¬B) / P(A¬B)
Information Gain: log(P(AB) / (P(A)P(B)))
Sebag-Schoenauer: P(AB) / P(A¬B)
Least Contradiction: (P(AB) − P(A¬B)) / P(B)
Odd Multiplier: P(AB)P(¬B) / (P(B)P(A¬B))
Example and Counterexample Rate: 1 − P(A¬B) / P(AB)
Zhang: (P(AB) − P(A)P(B)) / max(P(AB)P(¬B), P(B)P(A¬B))
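To show how such formulas are applied in practice, the sketch below computes a handful of the measures from Table 2.2 directly from contingency counts; it is a simplified illustration with only crude guards against zero denominators, not evaluation code from this thesis.

```python
# A few measures from Table 2.2 computed from contingency counts (illustrative).
def some_measures(n_ab: int, n_a_notb: int, n_b: int, n_total: int) -> dict:
    p_ab = n_ab / n_total                  # P(AB)
    p_a = (n_ab + n_a_notb) / n_total      # P(A)
    p_b = n_b / n_total                    # P(B)
    p_a_notb = n_a_notb / n_total          # P(A¬B)
    p_b_given_a = p_ab / p_a if p_a else 0.0
    return {
        "precision": p_b_given_a,                                   # P(B|A)
        "lift": p_b_given_a / p_b if p_b else 0.0,                  # P(B|A)/P(B)
        "leverage": p_b_given_a - p_a * p_b,                        # P(B|A) - P(A)P(B)
        "piatetsky_shapiro": p_ab - p_a * p_b,                      # P(AB) - P(A)P(B)
        "conviction": p_a * (1 - p_b) / p_a_notb if p_a_notb else float("inf"),
        "laplace": (n_ab + 1) / (n_ab + n_a_notb + 2),              # (n(AB)+1)/(n(A)+2)
    }

print(some_measures(n_ab=40, n_a_notb=10, n_b=200, n_total=1000))
```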
Table 2.3: Parameterized classification rule evaluation metrics
Cost Measure (c): c·P(AB) − (1 − c)·P(A¬B)
Relative Cost Measure (c): c·(n(AB) / n(B)) − (1 − c)·(n(A¬B) / n(¬B))
m-Estimate (m): (n(AB) + m·P(B)) / (n(AB) + n(A¬B) + m)
Klösgen (ω): P(A)^ω · (P(B|A) − P(B))
F-Measure (β): (β² + 1)·P(B|A)·P(A|B) / (β²·P(B|A) + P(A|B))

2.1.2 Search Algorithms

Finding the best rules is a search problem over the space of all possible feature combinations. Given such an expansive search space, heuristic search is often used to find good rules relatively quickly, albeit without a guarantee of finding the very best rules. Many heuristic search algorithms have been proposed for rule finding (Fürnkranz et al., 2012), some of which include:
- Hill-climbing
- Beam search
- Tabu search
- Simulated annealing
- Genetic algorithms
- Ant colony optimization
- Random-restart hill-climbing

Complete search methods are guaranteed to find globally maximal rules. Examples of complete search include exhaustive search methods such as breadth-first search and best-first search (Fürnkranz et al., 2012). Other complete search algorithms can prune the search space if given a minimum support level. This constraint reduces execution time, as many branches of the search tree can then safely remain unexplored. Examples of this type of algorithm include Apriori (Agrawal et al., 1996) and FP-Growth (Han et al., 2000).

Counter-intuitively, searching more extensively for good rules can lead to worse out-of-sample performance, a phenomenon called over-searching (Quinlan & Cameron-Jones, 1995; Jensen & Cohen, 2000; Možina et al., 2006; Janssen & Fürnkranz, 2009). Searching involves generating hypotheses, or candidate rules, and evaluating them on the training data. The repeated testing of candidate rules against the training data is a form of multiple testing, and increases the false-discovery rate. Quinlan & Cameron-Jones (1995) speculate that any training dataset contains "fluke theories" that fit the data well but have poor predictive accuracy. The more extensive the search is, the higher the probability of discovering the flukes. Greedy heuristic search algorithms, such as beam search, work well in this regard. They tend to find good rules quickly, while covering only a small fraction of the search space. By contrast, complete search algorithms such as Apriori (Agrawal et al., 1996) or FP-Growth (Han et al., 2000) are guaranteed to discover all flukes as well as all good rules.

Top-down beam search has emerged as the most commonly used search algorithm in rule finding (Fürnkranz et al., 2012). Beam search starts with an empty rule that covers all samples and greedily refines it by adding features as conditions. It maintains a beam of the best b rules of each rule length. To find rules of length l, each feature is successively added as a refinement to each rule in the beam for length l − 1. The refined candidate rules are then evaluated, and the top b rules are stored in the beam for length l. After reaching some stopping criterion, the best rule from all lengths is selected. Algorithm 4 implements beam search in the FINDBESTRULE function (lines 8–21). Limiting the beam size limits the extent of the search. Setting b = 1 makes the algorithm equivalent to hill-climbing. Setting b = ∞ makes it equivalent to exhaustive search.
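The sketch below is a compact, generic rendering of the top-down beam search just described. It is not the thesis's Algorithm 4 and not the GPU implementation of Appendix A; the boolean feature encoding and the scoring function passed in are placeholder assumptions.

```python
# Generic top-down beam search over conjunctive rule bodies (illustrative sketch).
def beam_search(features, labels, score, b=5, max_length=3):
    """features: list of (name, bool list) columns, True where the feature holds.
    labels: bool list, True for the positive class. score(mask, labels) -> float."""
    n = len(labels)
    beam = [((), [True] * n)]                    # start from the empty rule
    best_body, best_score = (), score([True] * n, labels)
    for _ in range(max_length):
        candidates = []
        for body, mask in beam:                  # refine each rule in the beam
            for name, col in features:
                if name in body:
                    continue
                new_mask = [m and c for m, c in zip(mask, col)]
                candidates.append((body + (name,), new_mask))
        candidates.sort(key=lambda bc: score(bc[1], labels), reverse=True)
        beam = candidates[:b]                    # keep only the best b per length
        if beam and score(beam[0][1], labels) > best_score:
            best_body, best_score = beam[0][0], score(beam[0][1], labels)
    return best_body, best_score

def precision(mask, labels):                     # placeholder objective function
    covered = sum(mask)
    return sum(m and y for m, y in zip(mask, labels)) / covered if covered else 0.0

# Example: beam_search([("f1", [True, False, True]), ("f2", [True, True, False])],
#                      [True, False, True], precision, b=2, max_length=2)
```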
2.1.3 Separate-and-Conquer

Rule search methods such as beam search are prone to finding rules which are only slight variations of each other and which cover substantially the same samples. For this reason, most rule learning systems use a separate-and-conquer strategy to build up an ordered set of rules. This strategy was originally proposed in the Aq system (Michalski, 1969). A survey of systems that use separate-and-conquer is given by Fürnkranz (1999). The strategy iteratively identifies the events not yet covered by any rule (separate), and then learns a single new rule using only the uncovered events (conquer). The new rule covers additional events, which are then removed from consideration in the next iteration. This guarantees subsequent rules have diversity in coverage. Algorithm 4 implements separate-and-conquer in lines 1–7.

2.1.4 Classification Rule Learning Systems

An overview of classification rule learning systems can be found in Fürnkranz et al. (2012). Here, we briefly describe some influential systems.

CN2

Clark & Niblett (1989) propose the CN2 rule learning algorithm. CN2 uses beam search with b = 5 in conjunction with the separate-and-conquer strategy to find ordered rule lists. The objective function is an information-theoretic entropy measure:

Entropy = −Σᵢ pᵢ log₂ pᵢ    (2.4)

where pᵢ represents the proportion of class i in samples covered by the candidate rule. The best rule is the rule with the lowest entropy. The class prediction of the best rule is the majority class of samples covered by the rule. A test of statistical significance is applied to each rule.

FOIL

Quinlan (1990) proposes the FOIL rule learning algorithm. FOIL uses hill-climbing in conjunction with the separate-and-conquer strategy to find ordered rule lists. The objective function is a measure of information gain associated with adding each new feature to a rule:

Gain(R₀, R₁) = c · (log₂ p₁ − log₂ p₀)    (2.5)

Here, R₀ is the rule before adding an extra feature, and R₁ is the rule after adding an extra candidate feature. p₀ is the positive class proportion of samples covered by R₀. p₁ is the positive class proportion of samples covered by the refined rule R₁. c is the number of true positives covered by both R₀ and R₁.

RIPPER

Cohen (1995) proposes Repeated Incremental Pruning to Produce Error Reduction (RIPPER). This algorithm is still considered state-of-the-art and is widely used (Fürnkranz et al., 2012). RIPPER has two stages: the rule building stage, and the ruleset optimization stage. In the rule building stage, RIPPER first separates the training data into two-thirds growing data and one-third pruning data. On the growing data, RIPPER learns each rule using greedy top-down hill-climbing. The objective function for growing a rule is FOIL's information gain. Each rule is pruned immediately after it is grown. For pruning, the features are removed in the reverse order that they were added. Final sequences of features are removed to maximize this objective function:

pruning_metric = (p − n) / (p + n)    (2.6)

where p is the number of positive samples covered by the rule, and n is the number of negative samples covered by the rule. The ordered list of pruned rules is learned using the separate-and-conquer strategy. Once the rules are learned, the second stage is performed a pre-specified number of times. The second stage involves optimizing the ruleset by considering two alternatives for each rule: a newly learned replacement rule, and a refinement of the current rule. The best ruleset is returned, as measured by minimum description length. Rules may also be deleted during this phase, if doing so improves the ruleset.
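As a concrete illustration of the growing and pruning criteria just described, the sketch below computes FOIL's information gain for a candidate refinement and RIPPER's pruning metric; the counts are hypothetical.

```python
import math

# Illustrative computation of FOIL's gain (Eq. 2.5) and RIPPER's pruning
# metric (Eq. 2.6). All counts below are made up for the example.

def foil_gain(p0: int, n0: int, p1: int, n1: int) -> float:
    """p0/n0: positives/negatives covered by R0; p1/n1: covered by the refinement R1.
    Because R1 refines R0, the positives covered by both are exactly p1 (= c)."""
    prop0 = p0 / (p0 + n0)
    prop1 = p1 / (p1 + n1)
    return p1 * (math.log2(prop1) - math.log2(prop0))

def ripper_pruning_metric(p: int, n: int) -> float:
    """(p - n) / (p + n), evaluated on the held-out pruning data."""
    return (p - n) / (p + n)

print(foil_gain(p0=100, n0=80, p1=60, n1=10))    # gain from a refinement
print(ripper_pruning_metric(p=60, n=10))         # value to maximize while pruning
```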
PART

Frank & Witten (1998) propose the PART rule learning algorithm. PART builds up a rule list from partial decision trees using the separate-and-conquer strategy. For each rule, it grows a decision tree, prunes the tree, makes a rule from the leaf with the highest coverage, and then discards the remainder of the tree. It grows decision trees based on average entropy, expanding nodes with the smallest entropy first. PART has an efficiency advantage over RIPPER, by not performing a global optimization step.

ACO

Parpinelli et al. (2002) propose the ACO algorithm, which uses ant colony optimization to find rules. For the objective function, ACO uses the same information-theoretic entropy measure used in the CN2 system. It uses the separate-and-conquer strategy to generate an ordered rule list. ACO also includes a post-pruning procedure to reduce rule length.

2.2 Regression Rule Induction

Regression rules are similar to classification rules. The rule bodies are identical, but the heads are different. Regression rules make real-valued predictions in the head, instead of classification predictions:

IF <conditions> THEN predict ŷ

A complete ruleset forms a piece-wise constant regression model.

2.2.1 Regression Rule Evaluation Metrics

Regression rule learning objective functions are different from the interestingness measures used in classification rule learning. Table 2.4 lists common regression rule evaluation metrics suitable for real-valued prediction evaluation.

Table 2.4: Regression rule evaluation metrics, where n is the number of samples being evaluated, yᵢ is the true value, ŷᵢ is the predicted value, ȳ is the mean value of all samples, and r is the current rule.
Mean Absolute Error (MAE): (1/n) Σᵢ |yᵢ − ŷᵢ|
Mean Squared Error (MSE): (1/n) Σᵢ (yᵢ − ŷᵢ)²
Deviation from Mean (DVM): (1/n) Σᵢ (yᵢ − ȳ)²
Normalized Mean Squared Error (NMSE): MSE / DVM
Relative Coverage (RC): Coverage(r) / n
Relative Cost Measure (RCM): c · (1 − NMSE) + (1 − c) · RC
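The sketch below evaluates a single constant-prediction rule with the metrics of Table 2.4, under one straightforward reading of the table in which errors are computed over the samples the rule covers; the toy data and the cost parameter c are placeholders.

```python
# Evaluate one regression rule on toy data with the metrics of Table 2.4.
y_true  = [3.0, 5.0, 4.5, 8.0, 2.0, 6.0]
covered = [True, True, False, True, False, True]   # coverage mask of the rule
y_hat   = 5.5                                      # the rule's constant prediction

n = len(y_true)
mean_y = sum(y_true) / n
y_cov = [y for y, c in zip(y_true, covered) if c]

mae  = sum(abs(y - y_hat) for y in y_cov) / len(y_cov)     # mean absolute error
mse  = sum((y - y_hat) ** 2 for y in y_cov) / len(y_cov)   # mean squared error
dvm  = sum((y - mean_y) ** 2 for y in y_true) / n          # deviation from mean
nmse = mse / dvm                                           # normalized MSE
rc   = len(y_cov) / n                                      # relative coverage
c    = 0.5                                                 # trade-off parameter
rcm  = c * (1 - nmse) + (1 - c) * rc                       # relative cost measure
print(round(mae, 3), round(nmse, 3), rc, round(rcm, 3))
```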
2.2.2 Regression Rule Learning Systems

M5'Rules

Holmes et al. (1999) propose M5'Rules for learning a set of rules that have a linear model in the head instead of a constant prediction. The algorithm uses separate-and-conquer to learn the set of rules, stopping when all samples are covered. Each rule is learned by first generating a decision tree and then extracting the rule corresponding to the best leaf. The best leaf is determined according to a user-specified heuristic. Its authors tested three heuristics: percent root mean squared error, mean absolute error divided by coverage, and the correlation between predicted and actual values for samples covered by a leaf multiplied by the number of samples in the leaf.

REGENDER

Dembczyński et al. (2008) propose REGENDER for learning an ensemble of regression rules using forward stage-wise additive modeling. REGENDER greedily builds up a rule list one rule at a time, but not using a separate-and-conquer strategy. Instead, it finds rules to minimize a loss function, which is either the sum of squared error or the sum of absolute error, calculated over all samples. The number of rules to learn is an input to the algorithm which acts as the stopping criterion. The minimization technique can be specified as either gradient boosting or a least angle approach. To reduce correlation between rules as well as computational complexity, the training of each rule is done using a random subset of the training data. The fraction of samples to use for training is an input to the algorithm. The final parameter is a shrinkage factor which reduces the degree to which previously generated rules affect the generation of the successive one in the sequence. The algorithm outputs an unordered list of rules. The prediction for a given sample is calculated by summing the contributions of all rules that cover the sample.

SeCoReg

Janssen & Fürnkranz (2010b) propose SeCoReg, a regression rule system based on the separate-and-conquer strategy. The algorithm uses hill-climbing to find each rule. The objective function maximized by hill-climbing is a weighted combination of the relative root mean squared error and the relative coverage:

h_cm = α · (1 − L_RRMSE) + (1 − α) · RC    (2.7)

Here, the parameter α controls the trade-off between error and coverage. The stopping criterion for the algorithm is set as the fraction of samples that can be left uncovered by the rules learned. A third user-specified parameter controls the number of split-points found by a supervised clustering algorithm for discretizing numeric attributes.

Dynamic Reduction to Classification

Janssen & Fürnkranz (2011) propose Dynamic Reduction to Classification, a method of converting a regression problem into a multi-class classification problem. This enables the use of well-studied classification rule induction techniques. For each rule, the predicted value is the median of the values covered. The rule quality is measured by how well it identifies samples valued within one standard deviation of the predicted value. Samples valued within one standard deviation of the predicted value are set as the positive class, and all other values outside this range are considered negative. In this way, traditional classification rule heuristics can be used as an objective function. The heuristics tested by its authors include: correlation, relative cost measure, Laplace measure, and weighted relative accuracy. The algorithm uses separate-and-conquer combined with hill-climbing. The stopping criterion for the algorithm is the fraction of samples that can be left uncovered by the rules learned. Another algorithm very similar to this has been proposed, with the addition of rule pruning based on held-out data as a post-processing step (Sikora et al., 2012).

Ant-Miner-Reg

Brookhouse & Otero (2015) propose Ant-Miner-Reg, a version of SeCoReg with hill-climbing replaced by Ant Colony Optimization. It requires three additional user-specified parameters to control the optimization.

Chapter 3: Related Work

For related work, we consider research on methods of adapting rule learning to meet the needs of individual domains. We also include survey papers which draw connections between measures and between methods.

3.1 Measure Comparisons

Hilderman & Hamilton (2001) describe five principles that well-behaved measures must satisfy regarding the class distributions of covered samples:

1. Minimum Value Principle – The minimum interestingness score should be attained when the distributions are even.
2. Maximum Value Principle – The maximum interestingness score should be attained when the distributions are as uneven as possible.
3. Skewness Principle – Maximally uneven distributions are more interesting when there are fewer classes.
4. Permutation Invariance Principle – Every permutation of a given distribution is equally interesting.
5. Transfer Principle – Interestingness must increase if covered samples were transferred from one class to another class which already has more covered samples.
5. Transfer Principle – Interestingness must increase if covered samples were transferred from one class to another class which already has more covered samples.

They evaluate thirteen interestingness measures and find that only four satisfy all five principles. Their work builds on earlier work by Piatetsky-Shapiro (1991), who proposed three key properties a measure should satisfy. Tan et al. (2002) continue this type of analysis and evaluate twenty-one interestingness measures against eight properties. Vaillant et al. (2004) experiment with twenty measures on ten datasets. They perform a cluster analysis of the measures and find there are four main groups of measures.

McGarry (2005) surveys both objective and subjective interestingness measures, reviewing the literature in addition to listing the measure formulas.

Blanchard et al. (2005) propose an interestingness measure for association rule mining and provide a detailed comparison against five information-theoretic measures. They mine rules with the Apriori algorithm on two real-world datasets and on two synthetic datasets generated using the IBM synthetic data generator (Agrawal et al., 1996).

Geng & Hamilton (2006) survey both objective and subjective interestingness measures for data mining. They find that the measures emphasize nine criteria when evaluating interestingness: conciseness, coverage, reliability, peculiarity, diversity, novelty, surprisingness, utility, and actionability. This survey is the most comprehensive with respect to the number of interestingness measures evaluated and the number of principles used as evaluation criteria.

Lenca et al. (2008) survey twenty objective interestingness measures and rate them according to eight properties:

1. Asymmetric processing of A and B – whether or not the measure evaluates A → B the same as B → A
2. Decrease with n(B) – whether the measure decreases when n(¬AB) increases
3. Reference situations, independence – whether the measure has the same constant value when P(B|A) = P(B)
4. Reference situations, logical rule – how the measure behaves when it covers no counter-examples
5. Linearity with P(A¬B) around 0⁺ – whether the measure changes abruptly when only a few counter-examples are added to a rule that otherwise had none
6. Sensitivity to N, the total number of samples – whether the measure is a descriptive measure or a statistical measure
7. Easiness to fix a threshold – whether or not a p-value threshold can readily be calculated for the measure
8. Intelligibility – based on three factors:
(a) whether the measure integrates only simple arithmetic operations on the frequencies
(b) whether the variations of the values taken by the measure are easily interpretable
(c) whether the measure definition is intelligible

The authors propose a multi-criteria decision aid process wherein the user assigns weights to the values of each principle, and the process then ranks the measures for appropriateness. Additionally, they describe a visualization technique that indicates measure clusters.

3.2 Choosing the Frequency Bias

Cestnik (1990) proposes the m-estimate, a parameterized interestingness measure for estimating conditional probabilities. Dzeroski et al. (1993) test the m-estimate and show an improvement over the Laplace estimate by trying many values of m on each dataset. They suggest cross-validation as a possibility for choosing m but do not perform an evaluation.
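To make the behavior of these two estimates concrete, the short Python sketch below compares the Laplace estimate with the m-estimate for a single candidate rule. The counts, prior, and values of m are illustrative only and are not drawn from any study cited above.

# Sketch: Laplace estimate vs. m-estimate for one candidate rule.
# The counts and values of m below are illustrative, not taken from any experiment.

def laplace_estimate(pos, neg, num_classes=2):
    # Laplace-corrected precision: (pos + 1) / (pos + neg + num_classes)
    return (pos + 1) / (pos + neg + num_classes)

def m_estimate(pos, neg, prior, m):
    # m-estimate: (pos + m * prior) / (pos + neg + m)
    return (pos + m * prior) / (pos + neg + m)

# A rule covering 40 positives and 5 negatives, in data where 30% of samples are positive.
pos, neg, prior = 40, 5, 0.30
print(laplace_estimate(pos, neg))          # pulls the raw precision (0.889) toward 1/2
print(m_estimate(pos, neg, prior, m=2))    # small m: stays close to the raw precision
print(m_estimate(pos, neg, prior, m=50))   # large m: pulled strongly toward the prior

Varying m in the last two calls shows why the choice of m acts as a frequency bias: larger values require more covered samples before a rule's estimate can move far from the class prior.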
Janssen & Fürnkranz (2010a) evaluate parameterized interestingness measures by comparing their micro- and macro-averaged accuracies on 30 datasets after first tuning the parameters on 27 datasets to find optimal general-purpose recommended values. Our research relies on these recommended values as baselines against which to compare per-dataset tuning methods. The authors also train a multilayer perceptron to learn a general-purpose interestingness measure. Minnaert et al. (2015) evaluate parameterized interestingness measures used with genetic optimization algorithms for searching. They show that general-purpose recom- mended parameter values are different for each search algorithm. They do not address the tuning of search parameters in this work. They also do not tune measure parameters for each individual dataset, but mention it as a possibility for future work. 3.3 Choosing Search Parameters Quinlan & Cameron-Jones (1995) evaluate choosing the extent of search by adapting the beam size to the given dataset. They use the Laplace estimate for an interestingness measure. They iteratively increase the beam size until two successive iterations fail to find a better rule. They call this layered search, and report an improvement over a more extensive search. Srinivasan & Ramakrishnan (2011) study parameter optimization for the Aleph and Toplog Inductive Logic Programming (ILP) systems. These systems have many options, so screening the important options is one primary purpose of this research. They propose first using a regression model to find the most relevant factors. This is followed by response surface optimization on the relevant parameters. Their optimization method 33 uses gradient ascent to find a local maximum. They do not use cross-validation for parameter optimization. In their experiment, they consider four parameters: maximum rule length, maximum nodes explored in the search, minimum precision, and minimum support. They validate their method on 8 biochemistry datasets. 3.4 Randomization Methods Možina et al. (2006) remove the multiple testing-induced optimism in rule evaluation measures by performing permutation tests. They choose the rule that is most statistically significant based on permutation tests for each rule length. Zhang et al. (2016) propose a mathematical model and method of evaluating the statistical significance of rules found in uncertain data, or data with errors. The user specifies the largest acceptable family-wise false positive error rate. The algorithm then adjusts to meet the requirement through a process involving randomization tests. 3.5 Comparison against Real Human Interest Tan et al. (2002) study ways to select the best interestingness measure for association rules. They propose a method of selecting a small number of contingency tables which a domain expert could evaluate when choosing a measure. Their method selects the contingency tables which are furthest apart in terms of relative rankings my the various measures. Such a method could help a domain expert identify their preferred measure more quickly than reviewing contingency tables at random. Ohsaki et al. (2004) experimentally compare interestingness measures against real human interest in medical data mining. They generated prognosis-prediction rules from a clinical dataset on hepatitis. They then had a medical expert evaluate rules as Especially-Interesting, Interesting, Not-Understandable, and Not-Interesting. 34 Carvalho et al. (2005) build on Ohsaki et al. 
(2004) with evaluations on eight datasets. They presented nine rules to each expert for each interestingness measure: the best three, the worst three, and three in the middle. Experts were asked to assign a subjective degree of interestingness to each rule. 35 Chapter 4 Learning Objective Functions from the Data Most research on objective functions for classification rule learning has focused on developing mathematical formulas for pre-specifying the trade-off between precision and frequency. These interestingness measures are designed for general-purpose use. In some cases, they have parameters, allowing them to be tuned to work better with a particular domain. Section 2.1.1 describes the measures in detail. Our research in this chapter, however, seeks to learn objective functions directly from the data. We seek to reduce reliance on pre-specified relationships and assumptions. We refer to a parallel situation in another area of machine learning – the difference between generative and discriminative models. Ng & Jordan (2002) show that generative models follow a pattern of having higher predictive accuracy than discriminative models in low data regimes, but that discriminative models perform better in regimes with more train- ing data. The assumptions specified in generative models provide structure such that they require less training data. However, in situations with large amounts of training data, the assumptions restrict the ability of the models to adapt to the nuance found in the data. Our proposal for learning objective functions is to remove assumptions and instead rely on what can be learned from the data. In this chapter, we evaluate methods based on their ability to predict a target class with the highest-possible precision. We restrict our research to models which produce binary classification rules, as described in Section 2.1. Such rules are commonly used 36 to encode knowledge in a form interpretable to domain users. Our research aims to help domain users who want only the highest-probability predictions, and who want to know why each prediction was made. An example use-case for this research can be found in the oil and gas industry. Oilfield optimization teams consist of relatively few engineers tasked with extracting the most value possible from fields with thousands of oil wells. They are resource- constrained and cannot attend to many wells each day. Classification rules can assist by identifying wells that are beginning to fail, wells that need operating parameters recon- figured, and wells that would benefit from thermal stimulation. The transparency of rule-based classifiers is appealing to engineers who both vet and learn from the rules. Additionally, each positive prediction is associated with a specific rule, and the condi- tions of that rule provide information useful for diagnosing the root cause of an issue. In this use-case, the precision of the top predictions is more important than the overall accuracy of the classifier. This is because the engineers can respond to only a limited number of automated alerts. Their primary goal is to avoid false-positives and maximize the value of their time. We propose a new rule learning system, FrontierMiner, for finding interpretable, high-precision rules. FrontierMiner differs from traditional systems because it does not use a formulaic interestingness measure as an objective function during the search for rules. Instead, it learns a per-dataset objective function through the use of bootstrapping. 
In each bootstrap iteration, FrontierMiner’s search function finds Pareto-optimal rules in a random subset of training samples. Pareto-optimal rules include the highest-precision rules found for each minimal level of support. By testing these rules on the remainder of the training samples, information is collected on the level of over-optimism found at each level of support. This information is used to develop a non-parametric objective function that corrects precision over-optimism in order to better compare and rank rules 37 found in the final search. Because it learns an objective function for each dataset, Fron- tierMiner does not require the user to choose a traditional interestingness measure or tune any related parameters. FrontierMiner is publicly available for download. 1 Identifying the state-of-the-art baseline against which to compare FrontierMiner is challenging. Many methods have been proposed. (An overview of the methods can be found in Fürnkranz et al., 2012). With each rule learning system, the user is presented with choices of settings and parameters, enabling it to be optimized for a given domain. The optimal combination of parameter values is difficult to specify, even for a domain expert. Therefore, users generally rely on trial-and-error, default settings recommended by system designers, and optimal general-purpose parameters recommended by studies involving many datasets (e.g. Janssen & Fürnkranz, 2010a). Evidence of this is seen in a survey by Srinivasan & Ramakrishnan (2011). They reviewed 100 experimental stud- ies on Inductive Logic Programming and found that only 38 mention values assigned to some parameters, and only 17 describe an enumerative search over a small set of possible values to optimize typically one parameter. We have not found any large-scale study comparing many rule learning algorithms across many datasets that also tunes the parameters of each algorithm for each dataset. For this reason, we dedicate a significant portion of our research effort to establishing a baseline for comparison which can reasonably be expected to outperform other methods for finding high precision rules across a variety of domains. We take a greedy approach, following this sequence of steps: 1. Choose a candidate baseline method based on studies in the literature using fixed, general-purpose recommended parameter values 1 http://ganges.usc.edu/wiki/FrontierMiner 38 2. Validate that the chosen baseline method outperforms other methods (using their respective recommended parameter values) across a large variety of synthetic datasets 3. Validate that parameter tuning benefits the baseline method when performed through grid search with cross-validation, again tested on synthetic data 4. Compare FrontierMiner to the baseline method (with parameter tuning) both on synthetic datasets and on a variety of real-world datasets Our experimental results show evidence that FrontierMiner learns higher-precision rules than the baseline method. When tested on real-world datasets, it found the higher- precision rules more than twice as often as the baseline. It also outperformed the tuned baseline, where up to three parameters were optimized using grid search with cross- validation. FrontierMiner involves no prior assumptions as to the optimal form of the trade-off between precision and support. The only constraint applied is that the penalty to correct for precision over-optimism must decrease monotonically with support. 
It has only one parameter, the number of bootstrap iterations, which is not domain-dependent and need not be tuned. In our experiments, FrontierMiner proved to be more computationally efficient than cross-validation, given the number of folds and the grid search values we chose.

The rest of this chapter is organized as follows: Section 4.1 discusses precision@k, which is the evaluation metric we use in this study. Section 4.2 describes the FrontierMiner algorithm in detail, including information on Pareto frontiers, penalty functions, and bootstrapping. Section 4.3 describes our experiments to establish a baseline and evaluate FrontierMiner. It also describes our synthetic data generation process.

4.1 Precision@k

Most studies on classification rules compare systems based on their classification accuracy across all test samples. Our comparisons, however, are based on precision@k, which measures the correctness of only the top k predictions. Precision@k is more commonly used in Information Retrieval, where only the relevance of the top few search results need be considered (Manning et al., 2008). Our choice of precision@k not only matches our emphasis on high-precision rules, but also matches our emphasis on interpretability. With large datasets, the search for a complete classifier with the highest-possible accuracy can result in rulesets with hundreds or thousands of rules. Such large rulesets can no longer be considered interpretable, negating perhaps the primary benefit of using a rule-based classifier.

Specifically, we use precision@100 for all empirical comparisons. The choice of k involves a trade-off between focusing strictly on the best predictions and having sufficient predictions to accurately estimate precision. At the extreme end, a rule covering only a single test sample must have precision of either 0% or 100%. Therefore, comparing two rule learning systems based only on their single best prediction would be difficult. A large number of experiments would be required in order for the comparison to be statistically significant.

Figure 4.1 illustrates the trade-off between precision and estimation error using a hypothetical example. In this illustrative example, we assume a generating function that produces a sequence of 500 binary "test samples" each time it is run. The first test sample in the sequence is generated with probability 0.8 of being positive, and the last is generated with probability 0.5 of being positive. The samples in between have probabilities that decrease linearly from the first sample to the last, such that the probability of sample i being positive is:

$p_i = 0.8 - \frac{i - 1}{500 - 1}(0.8 - 0.5)$   (4.1)

The red dashed line in Figure 4.1 shows the expected value of precision@k for the given generating function:

$E[\text{precision@}k] = \frac{1}{k}\sum_{i=1}^{k} p_i$   (4.2)

When working with real-world data, as opposed to synthetic data, we do not know the generating function, so we must estimate the precision of the top samples. In this example, the solid black lines in Figure 4.1 show the 80% confidence intervals of the precision estimation. The confidence intervals, x, are found using the inverse of the Beta cumulative distribution function:

$x = F^{-1}(q \mid \alpha, \beta)$   (4.3)

using q = 0.1 for the lower confidence interval and q = 0.9 for the upper confidence interval, where:

$q = F(x \mid \alpha, \beta) = \frac{1}{B(\alpha, \beta)} \int_0^x t^{\alpha - 1}(1 - t)^{\beta - 1}\, dt$   (4.4)

and $B(\alpha, \beta)$ is the Beta function.
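The computation behind Figure 4.1 can be sketched in a few lines of Python (assuming numpy and scipy are available), using the Beta parameters α and β defined in Eqs. 4.5–4.6 immediately below. This is an illustration of the confidence-interval construction only, not code from our experiments.

# Sketch of the Figure 4.1 computation: expected precision@k and 80% Beta confidence
# intervals for the illustrative generating function of Eqs. 4.1-4.4.
import numpy as np
from scipy.stats import beta

n = 500
i = np.arange(1, n + 1)
p = 0.8 - (i - 1) / (n - 1) * (0.8 - 0.5)   # Eq. 4.1: linearly decreasing P(positive)

cum_p = np.cumsum(p)
k = np.arange(1, n + 1)
expected_prec_at_k = cum_p / k              # Eq. 4.2

# Beta parameters as defined in Eqs. 4.5-4.6: alpha = 1 + sum(p_i), beta = 1 + k - sum(p_i)
a = 1 + cum_p
b = 1 + k - cum_p
lower = beta.ppf(0.1, a, b)                 # Eq. 4.3 with q = 0.1
upper = beta.ppf(0.9, a, b)                 # Eq. 4.3 with q = 0.9

for kk in (1, 10, 100, 500):
    print(kk, round(expected_prec_at_k[kk - 1], 3),
          round(lower[kk - 1], 3), round(upper[kk - 1], 3))

Printing a few values of k reproduces the qualitative behavior described above: very wide intervals for small k, with most of the narrowing already achieved by k = 100.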
We calculate the parameters α and β for each value of k:

$\alpha = 1 + \sum_{i=1}^{k} p_i$   (4.5)

$\beta = 1 + k - \sum_{i=1}^{k} p_i$   (4.6)

Figure 4.1 shows the result, which is that confidence intervals are very wide when estimating precision@k for small k. The intervals are substantially narrower at k = 100, with only a small amount of further narrowing at k = 500. In this example, each additional test sample decreases the confidence interval width, but also has the adverse effect of lowering the expected precision@k. We have chosen to use precision@100 to compare rule learning systems because the top 100 predictions still represent a small fraction of predictions in any reasonably large dataset, and because smaller choices of k would require a prohibitively large number of datasets/experiments to draw a statistically significant conclusion.

Figure 4.1: An illustration of the trade-off between precision and estimation error for an example generating function (x-axis: k covered test samples; y-axis: precision@k; plotted: expected precision@k and 80% confidence intervals). The expected value of precision@k for the top k samples decreases with each additional sample, however the confidence intervals get tighter. We have chosen precision@100 as our evaluation metric, because it has reasonably tight confidence intervals yet maintains an emphasis on the top predictions.

4.2 FrontierMiner

We propose FrontierMiner as an algorithm for finding the single binary classification rule expected to have the highest out-of-sample precision. It uses a non-parametric bootstrapping method of learning the optimal trade-off between precision and frequency for the given dataset. It is designed with the goal of having no interestingness measure to choose, no options to select, and no parameters to tune.

FrontierMiner incorporates a novel greedy search algorithm for finding a set of Pareto-optimal rules. It begins by repeatedly running the search algorithm on random subsets of training data to gather information on the propensity to over-fit at each level of support. From that information, it develops a customized, non-parametric rule evaluation function. Finally, it runs the greedy search algorithm once more on the entire training dataset and returns the highest-scoring rule according to the learned evaluation function.

4.2.1 Pareto Frontier

We begin by describing the greedy search algorithm which is invoked multiple times by FrontierMiner. The purpose of the search algorithm is to efficiently find as many rules as possible which lie on the Pareto frontier. We treat rule learning as a multi-objective optimization problem, where the two objectives are precision and support. The Pareto frontier includes every rule that is not worse than any other rule with regard to precision and support. Concretely, for rules $r_1$ and $r_2$, $r_1$ is worse than $r_2$ if:

$\mathrm{sup}(r_1) \leq \mathrm{sup}(r_2) \;\wedge\; \mathrm{prec}(r_1) < \mathrm{prec}(r_2)$, or   (4.7)

$\mathrm{sup}(r_1) < \mathrm{sup}(r_2) \;\wedge\; \mathrm{prec}(r_1) \leq \mathrm{prec}(r_2)$.   (4.8)

Bayardo Jr & Agrawal (1999) show that the optimal rule according to various interestingness measures must lie on the Pareto frontier, which they call the "support/confidence border." They propose a constraint-based rule miner that finds rules on this border. Their algorithm does a complete best-first search for rules, pruning branches of the search tree that cannot contain a Pareto-efficient rule. Our search algorithm is inspired by theirs; however, we have chosen greedy search instead of complete search.
This is because FrontierMiner requires many rules, but not necessarily all rules, and computational efficiency is a primary concern. Minnaert et al. (2015) also search for sets of Pareto-optimal rules. They use a param- eterized interestingness measure in conjunction with meta-heuristic search algorithms (genetic algorithms, ant colony optimization, particle swarm optimization, etc.) They find rules along the Pareto frontier by varying the parameter of the interestingness mea- sure. Using meta-heuristic search algorithms may find more rules than greedy search, but it comes at the cost of computational efficiency. Additionally, the method of scan- ning through a range of parameter values leads to gaps on the Pareto frontier if the parameter step size is too large. If the step size is too small, computational inefficiency can result if care is not taken to avoid the redundant evaluation of rules. Our search algorithm performs breadth-first search, greedily refining only Pareto- optimal rules. It finds a list of frontiers, where each frontier is itself a list of Pareto- optimal rules of maximum lengthl. The first frontier contains only the default rule which covers all samples and has length 0. Next, the frontier of max-length 1 is initialized to contain the rule from the max-length-0 frontier. Each candidate rule of length 1 is then tested sequentially. If a candidate rule is found not to be worse than all the rules currently in the frontier, meaning it is either better than or incomparable to all rules, then it is added to the frontier. If any rules already in the frontier are worse than the newly added rule then they are removed. This process is repeated for each maximum 44 rule length, where each new frontier is initialized to contain the rules from the previous frontier. The process stops when no new rules can be found to add to the previous frontier. Algorithm 1 describes our search algorithm in detail. 4.2.2 Penalty Functions Fürnkranz & Flach (2005) compare various rule evaluation metrics visually by plot- ting isometrics in coverage space, which is related to ROC space. The two-dimensional plots have covered negative examples on the x-axis and covered positive examples on the y-axis. The plotted isometric lines connect points in this space that have equal scores, according to the interestingness measure used. In Figure 4.2 we show that rule evaluation metrics can also be visualized in terms of a penalty applied to precision. In this type of visualization, each isometric line corre- sponds to a level of precision. Given the precision and support of a candidate rule, one can see on the y-axis the corresponding penalty assigned by the interestingness measure. The penalty can be thought of as an amount of skepticism to be applied to the candidate rule, in units of precision. Figure 4.2 represents the penalty function associated with using them-estimate interestingness measure withm = 22:466, and assuming a dataset with equal class proportions withn(B) = 500 andn(:B) = 500. We see that lower penalties are applied to rules claiming precision closer to the positive class proportion. Increasing the value ofm increases the penalty at all levels of precision. Penalties higher than the difference between precision and the positive class proportion are not sensible, but can still be used to compare two candidate rules. 
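As an aside, the dominance test of Eqs. 4.7–4.8 and the frontier update performed by the greedy search of Section 4.2.1 can be sketched compactly. The following Python fragment is a simplified illustration in which a rule is reduced to a (support, precision) pair; it is not FrontierMiner's implementation.

# Simplified sketch of the dominance test (Eqs. 4.7-4.8) and the frontier update used by
# the greedy search in Section 4.2.1. Rules are (support, precision) pairs here.

def worse_than(r1, r2):
    # True if r1 is Pareto-dominated by r2, per Eqs. 4.7-4.8.
    s1, p1 = r1
    s2, p2 = r2
    return (s1 <= s2 and p1 < p2) or (s1 < s2 and p1 <= p2)

def update_frontier(frontier, candidate):
    # Add the candidate if it is not dominated by any rule on the frontier,
    # and remove any frontier rules the candidate dominates.
    if any(worse_than(candidate, r) for r in frontier):
        return frontier                      # candidate is dominated; frontier unchanged
    return [r for r in frontier if not worse_than(r, candidate)] + [candidate]

frontier = []
for rule in [(500, 0.62), (300, 0.70), (300, 0.65), (80, 0.91), (600, 0.55)]:
    frontier = update_frontier(frontier, rule)
print(sorted(frontier))   # the dominated (300, 0.65) rule has been dropped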
To compare two rules, $r_1$ and $r_2$, we compare their penalized precision:

$r_1 \prec r_2 \iff \mathrm{prec}(r_1) - \mathrm{penalty}(r_1) < \mathrm{prec}(r_2) - \mathrm{penalty}(r_2)$   (4.9)

The penalty function in Figure 4.2 was calculated by first converting the m-estimate into a function of precision and support. Using notation defined in Section 2.1.1 and the definition of the m-estimate from Table 2.3:

$\text{m-estimate} = \frac{n(AB) + m \cdot P(B)}{n(AB) + n(A\neg B) + m}$   (4.10)

Given $supp = n(AB)$, $prec = P(B \mid A) = \frac{n(AB)}{n(A)}$, and $n(A\neg B) = n(A) \cdot (1 - prec)$, we can make substitutions to make it a function of precision and support:

$\text{m-estimate} = \frac{supp + m \cdot P(B)}{supp + supp \cdot \frac{1 - prec}{prec} + m}$   (4.11)

The penalty function is found by algebraically solving for the amount of additional precision that would be required to bring the m-estimate value up to the same level as the precision:

$\mathrm{penalty}(prec, supp, m) = \left[ \frac{1}{\frac{1}{prec} + \frac{m}{supp}\left(\frac{P(B)}{prec} - 1\right)} \right] - prec$   (4.12)

Figure 4.2: Penalty isometric line illustration for the m-estimate (m = 22.466), with equal class proportions, n(B) = 500 and n(¬B) = 500. Each isometric line corresponds to a level of rule precision (curves shown for prec = 1, 0.9, 0.8, 0.7, and 0.6, with support on the x-axis and penalty on the y-axis).

The reason for introducing this penalty-function way of visualizing interestingness measures is that FrontierMiner evaluates rules directly in terms of penalties. Through bootstrapping, FrontierMiner explicitly learns the penalty function it will use to downward-adjust rule precision before ranking discovered rules.

4.2.3 Bootstrapping

FrontierMiner applies a penalty to rule precision, downward-adjusting in-sample precision to a level more likely to be true on out-of-sample data. In the initial phase of the algorithm, it empirically estimates the appropriate penalty using a combination of bootstrapping and cross-validation techniques as described by Efron & Gong (1983).

There is one key difference between the empirical penalty function in FrontierMiner and the penalty functions implicit in traditional interestingness measures. It is that FrontierMiner learns a univariate function, while traditional measures such as the m-estimate define bivariate functions (see Eq. 4.12, assuming constant m). The penalty learned by FrontierMiner is a function only of support, and not of both precision and support. This is because bootstrapping is not effective for estimating the full two-dimensional penalty space. Finding rules in data samples that have both high support and high precision is too rare an occurrence to provide sufficient examples. For that same reason, however, the penalty function will also likely never need to be evaluated at such values during the final rule finding phase. Likewise, the penalty function will likely never be evaluated at low values for both support and precision. The penalty function is only used to rank rules from the Pareto frontier, so only Pareto-optimal rules are considered during the bootstrapping phase. The effect is that, rather than holding precision constant, the FrontierMiner penalty function implicitly uses the precision typically associated with each level of support on Pareto frontiers found in sampled training data. In this way, the penalty sampling occurs in the same region of the two-dimensional space that is eventually needed for evaluating the final Pareto-optimal rule candidates.

Algorithm 2 describes FrontierMiner in detail. Lines 1 – 13 outline the process of finding the rule with the highest expected precision.
The steps include learning the penalty curve, finding the candidate Pareto-optimal rules, and ranking the candidate rules according to penalized precision. Lines 14 – 53 describe the bootstrapping technique used to estimate the penalty curve. In lines 19 – 27 the data is randomly separated into training data and test data. The fraction of samples to use as training data is randomly chosen between 0.25 and 0.75, with the remainder used as test samples. The greedy search procedure from Algorithm 1 is run on the training data to find a set of frontiers with increasing maximum rule lengths. Each rule from each frontier is then evaluated on both the training and test data. The training precision is expected to be higher than the test precision; the difference between them is the slippage, which gets appended to an array of such data points (allY), as does the corresponding level of training support for each rule (allX). These arrays form the evidence used to estimate the penalty curve. The sampling process is repeated iters times.

Repeatedly searching for rules on randomly-sized partitions of training data, set by the random variable frac on line 20, causes slippage data to be collected for a rich variety of support levels. Our goal in sampling is to test a variety of Pareto-optimal rules at each support level, and not to repeatedly evaluate the slippage of the same set of rules. If frac is held constant, clusters of points form, due to the same rules getting found on each equal-sized training sample. We vary frac randomly at each iteration between 0.25 and 0.75. These bounds are chosen for practical reasons and are not intended to be tunable parameters. The upper bound is set well below 1 in order to reduce estimation error when evaluating the test precision for each rule. The lower bound is set well above 0 for efficiency. This is because the majority of frontier rules found are already in the low-support region of the curve, and there is no need to deliberately seek out more low-support examples. High-support frontier rules are comparatively rare, but this is not problematic because the penalty curve flattens out in high-support regions. Figure 4.3 shows an example scatterplot of allX and allY after 10 iterations.

Figure 4.3: Example penalty curve learned by FrontierMiner after 10 sampling iterations (support on the x-axis, penalty on the y-axis). Each blue point represents the amount by which the precision of a Pareto-optimal rule was overestimated. The red line is fitted to the points using least-squares isotonic regression.

Lines 40 – 45 of Algorithm 2 pre-process the data, averaging slippage values for each distinct level of support. In line 46, we fit the slippage points using least-squares isotonic regression based on the "pool adjacent violators" algorithm (Kruskal, 1964). Isotonic regression makes no assumption as to the form of the curve; however, it does enable us to enforce a weak monotonicity constraint on y. We assume that the penalty should never increase as support increases:

$\min_{y} \sum_i (y_i - avgY_i)^2 \quad \text{subject to} \quad y_i \leq y_{i-1},\; i > 1$   (4.13)

The two negative signs in line 46 are included because isotonic regression assumes an increasing function, while we require a decreasing function. In line 47, we filter out points where $y_i = y_{i-1}$ in order to make the curve strictly decreasing.
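A minimal Python sketch of lines 40 – 47 follows, with scikit-learn's IsotonicRegression (increasing=False) standing in for the pool-adjacent-violators fit; the slippage points below are made up for illustration and are not from any experiment.

# Sketch of Algorithm 2, lines 40-47: average the slippage per support level, fit a
# non-increasing curve with isotonic regression, then keep only strictly decreasing points.
import numpy as np
from sklearn.isotonic import IsotonicRegression

# (support, slippage) pairs collected during bootstrapping, i.e. allX / allY
all_x = np.array([10, 10, 25, 25, 60, 150, 150, 400, 1000, 3000])
all_y = np.array([0.30, 0.22, 0.18, 0.25, 0.12, 0.08, 0.10, 0.05, 0.03, 0.02])

x = np.unique(all_x)                                        # distinct support levels
avg_y = np.array([all_y[all_x == xi].mean() for xi in x])   # lines 40-45: average slippage

y = IsotonicRegression(increasing=False).fit_transform(x, avg_y)   # line 46: monotone fit

keep = np.concatenate(([True], np.diff(y) < 0))             # line 47: strictly decreasing
x, y = x[keep], y[keep]
print(list(zip(x.tolist(), np.round(y, 3).tolist())))

The resulting (x, y) points are exactly what the endpoint padding and linear interpolation described next turn into the penalty function.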
The penalty function is a linear interpolation of the penalty curve (line 8), so in lines 48 – 51 we add points to the extremity of the curve to prevent any need for extrapolation. We prepend the point (0; 1) to the fitted curve, and we append a point with zero slippage for the maximum possible support. Figure 4.3 shows an example portion of a penalty curve (in red) learned after 10 sampling iterations. 4.2.4 Usage In summary, FrontierMiner has two main phases: 1. Learning the penalty function 2. Finding Pareto-optimal rules, ranked according to penalized precision FrontierMiner is used for finding the single rule with the highest expected precision. If more coverage is needed, the user can incorporate FrontierMiner into a separate-and- conquer strategy (Fürnkranz, 1999) to build up an ordered set of rules. This strategy iteratively identifies the samples not yet covered by any rule (separate), and then learns a new rule using only the uncovered samples (conquer). The new rule covers additional samples, which are then removed from consideration in the next iteration. This strategy guarantees subsequent rules have diversity in coverage. If computational resources are limited, the user can choose to re-use the penalty function from the previous separate- and-conquer iteration. 51 4.3 Empirical Tests Our experimental approach begins by choosing a baseline for comparison. The base- line is the rule learning method that will most likely identify the highest-precision rules. This is subjective, based on our reading of existing literature. Experiment 1, in Sec- tion 4.3.2, is designed to validate our choice of baseline using a large variety of syn- thetic datasets. Experiment 2, in Section 4.3.3, tests the benefits of parameter tuning on the baseline, also with synthetic data. Experiment 3, in Section 4.3.4, compares FrontierMiner against the tuned baseline, using both synthetic datasets and a variety of real-world datasets. 4.3.1 Baseline Method For a baseline, we choose beam search with the m-estimate, using parameter val- ues recommended in the literature. This represents arguably the best general-purpose algorithm for finding high-precision rules. In this sub-section, we explain our choice in more detail. Search Algorithm As described in Section 2.1.2, top-down beam search is a greedy heuristic search method which is used by the vast majority of rule learning algorithms (Fürnkranz et al., 2012). It is used by systems built on the AQ algorithm (Clark & Niblett, 1987), CN2 (Clark & Niblett, 1989), and SeCo (Fürnkranz, 1999). Beam search generalizes hill-climbing, which is used in FOIL (Quinlan, 1990) and RIPPER (Cohen, 1995). The parameter b for beam size varies the extent of search. For the baseline beam size, we useb = 5. This value has been commonly used and is the default value used in CN2 (Clark & Niblett, 1989; Možina et al., 2006). Low values ofb have been shown to 52 often out-perform large values when tested out-of-sample (Quinlan & Cameron-Jones, 1995; Janssen & Fürnkranz, 2009). Interestingness Measure Them-estimate (Cestnik, 1990) in conjunction with beam search is commonly used in rule learning, with accuracy comparable to other state-of-the-art algorithms. Them- estimate generalizes the Laplace estimate, replacing the assumption of a uniform initial distribution with a more flexible probability density function: f(x) = 1 B(a;b) x a1 (1x) b1 (4.14) where a > 0, b > 0, and B(a;b) is the Beta function. 
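For concreteness, the following Python sketch shows top-down beam search over conjunctive rules scored by the m-estimate, in the spirit of the baseline just described. It is deliberately simplified (binary features only, no pruning or tie handling) and is not the implementation used in our experiments; the conjunction planted in the toy data merely gives the search something to find.

# Compact sketch of the baseline idea: top-down beam search over conjunctive rules,
# scored by the m-estimate, with beam size b = 5 and m = 22.466.
import numpy as np

def m_est(cover, y, m, prior):
    pos = int(y[cover].sum())
    n = int(cover.sum())
    return (pos + m * prior) / (n + m)

def beam_search(X, y, b=5, m=22.466, max_len=10):
    n_samples, n_features = X.shape
    prior = y.mean()
    all_covered = np.ones(n_samples, dtype=bool)
    best = (m_est(all_covered, y, m, prior), (), all_covered)   # empty rule covers everything
    beam = [best]
    for _ in range(max_len):
        candidates = {}
        for _, conds, cover in beam:
            for f in range(n_features):          # refine each beam rule by one condition
                if f in conds:
                    continue
                new_conds = tuple(sorted(conds + (f,)))
                if new_conds in candidates:
                    continue
                new_cover = cover & (X[:, f] == 1)
                if new_cover.any():
                    candidates[new_conds] = (m_est(new_cover, y, m, prior), new_conds, new_cover)
        if not candidates:
            break
        beam = sorted(candidates.values(), key=lambda c: c[0], reverse=True)[:b]
        best = max([best, beam[0]], key=lambda c: c[0])
    return best   # (score, feature indices in the rule body, coverage mask)

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 20))
y = ((X[:, 3] & X[:, 7]) | (rng.random(1000) < 0.05)).astype(int)
score, conds, cover = beam_search(X, y)
print(conds, round(score, 3))   # typically recovers the planted conjunction (3, 7)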
The values for a and b are determined from an estimate of the class proportion and a parameterm, which is chosen based on the amount of noise in the domain. We repeat here the formula for the m- estimate: m-estimate = n(AB) +mP (B) n(AB) +n(A:B) +m (4.15) Janssen & Fürnkranz (2010a) included the m-estimate in a study comparing five parameterized interestingness measures. They optimized the parameter value for each measure with respect to ruleset accuracy across 27 datasets and validated the results using that parameter on 30 datasets. The optimal general-purpose value of m they reported ism = 22:466, which we then use form in the baseline. Rule Length Shorter rules are easier to interpret and easier to find. In addition, shorter rules are believed by some to generalize better to unseen data. The number of features in a ruleset is a measure of model complexity. Fürnkranz et al. (2012) have proven that when 53 learning from noisy datasets, the size of an over-fitting model grows at least linearly with the number of samples. They say that biasing a learner toward shorter rules is a typical way to counter the growth and avoid over-fitting. The bias is generally not explicit, but implemented through pre- or post-pruning. One way to ensure rules are short is simply to constrain the maximum rule length, an approach often taken in association rule mining. The baseline method we have chosen does not have a restrictive maximum rule length. However, we include it as a search parameter that can be tuned. Setting a maximum rule length adds little extra complexity to the tuning method we test, because it entails simply storing intermediate results during the search for longer rules. In summary, the baseline we compare against uses: 1. Beam search (b = 5) 2. m-Estimate (m = 22:466) 3. Maximum rule length (l =1) 4.3.2 Experiment 1: Untuned Baseline on Synthetic Data The purpose of Experiment 1 is to compare the chosen baseline method against other methods which use only recommended parameter values and involve no per-dataset tun- ing. Synthetic Data In order to report statistically significant results, we test on many datasets. Using synthetic data allows us generate unlimited datasets. It also allows us to experiment on data with a wide variety of attributes. 54 Table 4.1: Synthetic data attribute values are drawn from the following distributions, and samples and rules are randomly generated such that the attribute conditions are satisfied. Attribute Value Distribution Sample count (N) NUf50000; 500000g Feature count (M) MUf50; 5000g Positive class proportion () U(0:1; 0:9) Mean noise features per sample () Uf10;bM=4cg Noise features for samplen (f n ) f n Poisson()jf n 1 Rule count (R) RUf1; 20g Max rule lift () U(0; 0:95) Max rule support () =dN=10e Length for ruler (l r ) l r Uf2; 6g Precision for ruler (p r ) p r U(; +) Support for ruler (s r ) s r Uf5;p r g We have identified important attributes used to describe datasets. Before generating each dataset, we first pick reasonable values at random for each attribute (see Table 4.1). We then generate random binary samples and labels, constrained to meet the chosen attribute values. We also create random rules with reasonable attributes, and we incorpo- rate them into the dataset without affecting the other attributes. This gives the otherwise random data some amount of predictable structure. We follow this process to create two datasets, one for training, and one for testing. 
Each has identical attributes and contains the same structure from the generated rules. Experiment Plan For this experiment, we test beam search (b = 5) with a large selection of non- parameterized interestingness measures. We also test five parameterized interestingness measures, using parameter values recommended by Janssen & Fürnkranz (2010a). One 55 of these tests, them-estimate (m = 22:466), is the baseline method all other experiments are compared against. These tests should ideally be done with no maximum rule length, but for practical purposes, we set the maximum rule length tol = 10. This limit should not generally be restrictive, because the maximum true rule length in our synthetic data isl = 6. We also test an implementation of RIPPER modified for our use case. Our imple- mentation only finds enough rules to cover 100 test samples. We do not generate the full rule list or perform ruleset optimization, but we still post-prune the individual rules we do find. Additionally, we only search for rules predicting the positive class. We call this, “Modified RIPPER.” We use precision@100 as our evaluation metric, as discussed in Section 4.1. We combine as many of the best rules as needed until they cover at least 100 samples in the test data. We evaluate each method on 1,000 synthetic datasets generated using the method described in Section 4.3.2. Rule finding is discrete in nature, and the test results are neither continuous nor normally distributed. For these reasons, we use a paired-sample one-sided sign test to evaluate statistical significance. Some rule finding methods are similar enough to the baseline method that they often find the same best rules, resulting in ties. The sign test is calculated from wins and losses, but does not handle ties. To resolve this, we use the recommended approach (Gibbons & Chakraborti, 2011) of randomly breaking ties when computingp-values. Results of Experiment 1 Experiment 1 results are listed in Table 4.2. The first noticeable feature is that most of the p-values are essentially 1, meaning the method is unlikely to be higher precision than the baseline, given the win-loss-tie record. These results confirm the results of 56 Janssen & Fürnkranz (2010a), finding that them-estimate (m = 22:466) outperforms all non-parameterized interestingness measures. Both our study and theirs also confirm that RIPPER and them-estimate perform equally well. Our results differ, however, in that Relative Cost (c r = 0:342) performed better in their study than ours. 4.3.3 Experiment 2: FrontierMiner and Tuned Baseline on Syn- thetic Data The purpose of Experiment 2 is to test whether parameter tuning improves the pre- cision of the baseline method described in Section 4.3.1. We use the common strategy of grid search combined withk-fold cross-validation to optimize parameters using only the training data. This strategy is computationally intensive, however, in Appendix A we show how beam search can be greatly accelerated through the use of a commodity GPU. We assert that frequency bias and extent of search should be optimized jointly. Both affect statistical significance. Increasing the extent of search increases the number of false discoveries. This lowers the statistical strength of the search, making it difficult to recognize good low-frequency rules among all the spurious ones. Only higher frequency rules can still possibly be found statistically significant as the false discovery rate rises. Therefore, the optimal frequency bias depends on the extent of search. 
In noisy datasets with little structure, dropping unnecessary hypothesis tests will enable a reduction in frequency bias, potentially leading to an increase in out-of-sample precision. Experiment Plan We use 3-fold cross-validation in a three-dimensional grid search to find the best values form in them-estimate,b in beam search, andl, the maximum rule length. We test all combinations of the following values: 57 Table 4.2: Un-tuned rule-finding methods compared against the baseline. Method Win - Loss - Tie p-Value Modified RIPPER 497 - 497 - 6 0.51 J-Measure 408 - 582 - 10 ~1 Klösgen (! = 0:4323) 403 - 553 - 44 ~1 Two-Way Support Variation 402 - 587 - 11 ~1 Klösgen (! = 0:5) 395 - 580 - 25 ~1 Correlation 392 - 584 - 24 ~1 Linear Correlation 391 - 585 - 24 ~1 Example and Counterexample Rate 375 - 622 - 3 ~1 Odds Ratio 372 - 580 - 48 ~1 Odd Multiplier 371 - 582 - 47 ~1 Added Value 370 - 626 - 4 ~1 Conviction 369 - 585 - 46 ~1 Lift/Interest 369 - 626 - 5 ~1 Certainty Factor 366 - 626 - 8 ~1 Sebag-Schoenauer 366 - 587 - 47 ~1 Yule’s Y 365 - 630 - 5 ~1 Information Gain 364 - 631 - 5 ~1 Yule’s Q 361 - 634 - 5 ~1 One-Way Support 357 - 636 - 7 ~1 Zhang 354 - 642 - 4 ~1 Relative Risk 352 - 589 - 59 ~1 Rel Cost (c r = 0:342) 339 - 634 - 27 ~1 Laplace Correction 325 - 575 - 100 ~1 Leverage 316 - 680 - 4 ~1 Two-Way Support 282 - 716 - 2 ~1 Accuracy 259 - 731 - 10 ~1 Piatetsky-Shapiro 259 - 741 - 0 ~1 Cost (c = 0:437) 255 - 733 - 12 ~1 Least Contradiction 254 - 735 - 11 ~1 Gini Index 244 - 756 - 0 ~1 Collective Strength 242 - 758 - 0 ~1 Loevinger 237 - 761 - 2 ~1 Cosine 217 - 783 - 0 ~1 Jaccard 217 - 783 - 0 ~1 F-score ( = 0:5) 217 - 783 - 0 ~1 58 m2f1; 2; 5; 10; 22:466; 50; 100; 1000g b2f1; 2; 3; 5; 10; 20; 50; 100; 1000g l2f1; 2; 3; 4; 5; 6; 7; 8; 9; 10g Cross-validation evaluates each parameter combination on folds of the training data. The best combination is then used to search the full training data for rules which are then evaluated on the test data. Three-dimensional grid search seems computationally cumbersome, but the third dimension, maximum rule length, does not add much complexity. This is because longer rules build on shorter rules when using top-down beam search. If the intermediate results are stored, a single beam search for long rules collects the same information as separate beam searches for smaller values ofl. The baseline recommendation is having no max- imum rule length at all, meaning long rules need to be searched anyway. In addition to testing the cross-validation method for all three parameters, we also evaluate the cross-validation method for individual parameters (one-dimensional grid search) and pairs of parameters (two-dimensional grid search), with left-out parameters being set to recommended values from the baseline approach. We use the same experimental setup used in Experiment 1: precision@100 for the evaluation metric, the paired-sample sign test to evaluate statistical significance, and we test on the same 1,000 synthetic datasets. Results of Experiment 2 Experiment 2 results are listed in Table 4.3. The results confirm that grid search with cross-validation can be used to improve on the baseline method with respect to out-of- sample precision. The best result, “Cross-validation (m,b),” has a win-loss-tie record of 511-416-73 against the untuned baseline. 59 Table 4.3: Results of Experiment 2, evaluating the effect of tuning the parameters of the baseline method using grid search and cross-validation. Parameters in parenthesis (m, b,l) indicate the tuned parameters. 
Parameters left untuned are set to the default values from the baseline method: m-estimate withm = 22:466, beam search withb = 5, and maximum rule length set tol = 10. p-Values are from a paired-sample sign test. Method Win - Loss - Tie p-Value Cross-validation (m,b) 511 - 416 - 73 0.00044 Cross-validation (m) 420 - 344 - 236 0.01 Cross-validation (m,b,l) 494 - 438 - 68 0.027 Cross-validation (m,l) 436 - 390 - 174 0.077 Cross-validation (b) 419 - 403 - 178 0.44 Cross-validation (b,l) 444 - 438 - 118 0.49 Cross-validation (l) 275 - 391 - 334 ~1 While cross-validation with all three parameters still outperforms the baseline with statistical significance, it is not the best result. The results show worse performance in every case where the maximum rule length (l) is included in the cross-validation. Evidently, either enforcing a maximum rule length is counter-productive, or else the optimal value ofl is difficult to determine using grid search with cross-validation. Figure 4.4 confirms our assertion that extent of search and frequency bias should be optimized jointly. It shows that large beam sizes are optimal in the training data more often whenm can be varied to compensate, leading to the better results seen in Table 4.3. 60 Beam Size (b) 1 2 3 5 10 20 50 100 1000 Number of Times Chosen 0 50 100 150 Frequency of b Selection by Cross-Validation CV (b) CV (m,b) Figure 4.4: The number of datasets where cross-validation selected each beam size. 4.3.4 Experiment 3: FrontierMiner and Tuned Baseline on Real- world Data The purpose of Experiment 3 is to compare FrontierMiner against the baseline method, both tuned and untuned, on both synthetic data and real-world data. An addi- tional purpose is to see the effect of varying the number of iterations FrontierMiner uses to learn the penalty function (Section 4.2.2). Real-world Data For real-world data, we use the KEEL dataset repository (Alcalá et al., 2010). We use all 75 standard classification datasets. 2 Each dataset has an output variable specified containing two or more target classes. For each target class, we create a separate binary 2 http://sci2s.ugr.es/keel/category.php?cat=clas 61 classification task. We also discretize continuous input variables into 10 equal density bins. Experiment Plan To better evaluate the performance of each method on the real-world datasets, we use 5-fold cross-validation. For each method, we iteratively train rules on 4 folds and test on the fifth fold, and we report the average of the 5 resulting precision@100 values. We only experiment on datasets and target classes large enough such that each of the 5 cross-validation folds has at least 100 positive samples. After applying this filter, 138 binary classification tasks remain part of the experiment. In this experiment, we once again compare the tuned baseline against the untuned baseline, this time on real-world data. We only test the best tuning baseline method from Experiment 2 (using 3-fold cross-validation to estimatem andb). The maximum rule length is raised tol = 20 to be sure it not generally restrictive. This 3-fold cross- validation runs in an inner loop for parameter optimization. This process is repeated 5 times for each fold of the outer 5-fold cross-validation loop, which has the purpose of improving precision@100 estimation. In this experiment, we also compare FrontierMiner against the baseline on the real- world classification tasks. We use the same random splits for the outer 5-fold cross- validation loop previously discussed. 
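The preprocessing just described is straightforward; the sketch below shows quantile-based (equal-density) discretization into 10 bins and the construction of one binary task per target class, using pandas. The column names and data are placeholders, not taken from any KEEL dataset.

# Sketch of the preprocessing described above: 10 equal-density bins per continuous
# attribute, and one binary classification task per target class.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "attr1": rng.normal(size=1000),
    "attr2": rng.exponential(size=1000),
    "target": rng.choice(["a", "b", "c"], size=1000),
})

# quantile binning; duplicates='drop' guards against repeated quantile edges
for col in ["attr1", "attr2"]:
    df[col + "_bin"] = pd.qcut(df[col], q=10, labels=False, duplicates="drop")

# one binary task per target class
tasks = {cls: (df["target"] == cls).astype(int) for cls in df["target"].unique()}
print(df[["attr1_bin", "attr2_bin"]].head())
print({cls: int(y.sum()) for cls, y in tasks.items()})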
We re-run FrontierMiner with six different settings of the bootstrapping iterations parameter, to see if performance is sensitive to the number of iterations. We test settingiters2f5; 10; 20; 50; 100; 500g. For completeness, we also run FrontierMiner on the same 1,000 synthetic datasets used in the first two experiments. Again, we use precision@100 for the evaluation metric, and the paired-sample sign test to evaluate statistical significance. 62 Table 4.4: Results of Experiment 3 on 138 real-world classification tasks. Each method is compared against the baseline method (m-estimate with m = 22:466, beam search withb = 5, and maximum rule length set tol = 10) to calculate the win-loss-tie record. p-Values are from a paired-sample sign test. Method Win - Loss - Tie p-Value FrontierMiner,iters = 5 93 - 40 - 5 ~0 FrontierMiner,iters = 20 92 - 39 - 7 ~0 FrontierMiner,iters = 500 92 - 37 - 9 ~0 FrontierMiner,iters = 50 91 - 41 - 6 ~0 FrontierMiner,iters = 100 91 - 40 - 7 1e-05 FrontierMiner,iters = 10 91 - 39 - 8 1e-05 Cross-validation (m,b) 87 - 38 - 13 1e-05 Results of Experiment 3 Experiment 3 results are listed in Tables 4.4 and 4.5. The results in Table 4.5 confirm that tuning the parameters of the baseline method using grid search and cross-validation outperforms the untuned baseline method with statistical significance on real-world data. The results in both tables also show FrontierMiner outperforms the baseline method more often than even the cross-validation tuned method. Neither Table 4.4 nor Table 4.5 show a pattern of increasing performance with an increasing number of bootstrap iterations. More analysis needs to be done to understand the effect ofiters on performance. Until then, the consistently high precision of Fron- tierMiner at all iteration settings indicatesiters is not a critical parameter for users to adjust. 63 Table 4.5: Results of Experiment 3 on 1,000 synthetic classification datasets. Each method is compared against the baseline method (m-estimate withm = 22:466, beam search withb = 5, and maximum rule length set tol = 10) to calculate the win-loss-tie record. p-Values are from a paired-sample sign test. Method Win - Loss - Tie p-Value FrontierMiner,iters = 20 577 - 392 - 31 ~0 FrontierMiner,iters = 50 577 - 387 - 36 ~0 FrontierMiner,iters = 100 569 - 402 - 29 ~0 FrontierMiner,iters = 5 562 - 413 - 25 ~0 FrontierMiner,iters = 10 559 - 410 - 31 ~0 FrontierMiner,iters = 500 557 - 412 - 31 1e-05 Cross-validation (m,b) 511 - 416 - 73 0.00044 64 Algorithm 1 FindFrontierRules Input: samples . training samples, each one a binary feature vector labels . binary training labels Output: frontiers . list of frontiers indexed such that the rules infrontiers[l] has a max rule length ofl 1: function FINDFRONTIERRULES(samples,labels) 2: emptyRule:features { } . empty rule has no conditions, covers all events 3: emptyRule:supp RULESUPPORT(emptyRule,samples) 4: emptyRule:prec RULEPRECISION(emptyRule,samples,labels) 5: frontiers [ ] . init empty list of frontiers 6: frontiers[0] [emptyRule] . init length-0 frontier to hold the empty rule 7: M total number of features 8: forl = 1 toM do 9: frontier frontiers[l 1] . init with rules from last one 10: candidates [ ] . init empty list of candidate rules to test 11: for eachrule2frontier do 12: ifrule is lengthl 1 then . already tried refining shorter rules 13: forf = 1 toM do 14: c rule . create a copy of the rule 15: c:features.append(f) . 
refine the rule by addingf 16: c:supp RULESUPPORT(c,samples) 17: c:prec RULEPRECISION(c,samples,labels) 18: ifc:supp> 0 then 19: candidates.append(c) 20: end if 21: end for 22: end if 23: end for 65 Algorithm 1 FindFrontierRules (continued) 24: numNewRules 0 . number of new rules of lengthl added tofrontier 25: for eachc2candidates do 26: cBetterThanAny False . isc better than any rule infrontier 27: cNotWorseThanAll True .c not worse than all rules infrontier 28: for eachr2frontier do 29: crEqual (c:supp =r:supp) and (c:prec =r:prec) 30: cWorse [(c:supp r:supp) and (c:prec < r:prec)] or [(c:supp<r:supp) and (c:precr:prec)] 31: rWorse [(r:supp c:supp) and (r:prec < c:prec)] or [(r:supp<c:supp) and (r:precc:prec)] 32: if:(:cWorse and:crEqual) then 33: cNotWorseThanAll False 34: end if 35: if rWorse then 36: cBetterThanAny True 37: frontier.remove(r) 38: end if 39: end for 40: ifcBetterThanAny orcNotWorseThanAll then 41: frontier.append(c); 42: numNewRules numNewRules + 1 43: end if 44: end for 45: ifnumNewRules> 0 then 46: SORT(frontier) . sortfrontier in descending order of precision 47: frontiers.append(frontier) 48: else 49: break . greedy search is at a dead-end; no more frontiers to add 50: end if 51: end for 52: returnfrontiers 53: end function 66 Algorithm 2 FrontierMiner Input: samples . training samples, each one a binary feature vector labels . binary training labels iters . number of bootstrapping iterations Output: ruleList . ordered list of discovered rules 1: x;y ESTIMATEPENALTYCURVE(samples,labels,iters) 2: frontiers FINDFRONTIERRULES(samples,labels) 3: ruleList [ ] . init output set of rules 4: for eachfrontier2frontiers do . one frontier for each max rule length 5: for eachrule2frontier do 6: rule:supp RULESUPPORT(rule,samples) 7: rule:prec RULEPRECISION(rule,samples,labels) 8: rule:penalty LINEARINTERPOLATION(x,y,rule:supp) 9: rule:score rule:precrule:penalty 10: ruleList.append(rule) 11: end for 12: end for 13: SORT(ruleList) . sort output rule list in descending order of score 14: function ESTIMATEPENALTYCURVE(samples,labels,iters) 15: allX [ ] 16: allY [ ] 17: N COUNT(samples) 18: fori = 1 toiters do 19: permix RANDPERM(N) . random permutation vector of lengthN 20: frac random number between 0:25 and 0:75 21: n ROUND(fracN) . random number of training samples 22: trainix permix[1 ton] . index into training samples 23: testix permix[(n + 1) toN] . index into test samples 24: trainSamples samples[trainix] 25: trainLabels labels[trainix] 26: testSamples samples[testix] 27: testLabels labels[testix] 28: frontiers FINDFRONTIERRULES(trainSamples,trainLabels) 67 Algorithm 2 FrontierMiner (continued) 29: for eachfrontier2frontiers do . one frontier for each max rule length 30: for eachrule2frontier do 31: trainSupp RULESUPPORT(rule,trainSamples) 32: trainPrec RULEPRECISION(rule,trainSamples,trainLabels) 33: testPrec RULEPRECISION(rule,testSamples,testLabels) 34: slippage trainPrectestPrec 35: allX.append(trainSupp) 36: allY .append(slippage) 37: end for 38: end for 39: end for 40: x distinct values ofallX, sorted 41: avgY ZEROSSIZEOF(x) . initavgY to be the same size asx 42: fori = 1 to COUNT(x) do 43: ix FIND(allX =x[i]) 44: avgY [i] MEAN(allY [ix]).avgY is the avg slippage per level of support 45: end for 46: y LSQISOTONIC(x,avgY ) . approximate avgY such that y monotonically decreases 47: x,y keep only points (x i ;y i ) wherey i <y i1 . make curve strictly decreasing 48: x.prepend(0) . prepend zero-support point to enable interpolation 49: y.prepend(1) . 
use 100% slippage for zero-support rules 50: x.append(RULESUPPORT(emptyRule,samples)) . append overall support 51: y.append(0) . use 0% slippage for rules with complete support 52: returnx;y 53: end function 68 4.3.5 Learning Multiple Rules In this section we explore some synthetic datasets in more detail by learning multiple rules from each. The datasets are chosen for their diverse characteristics. Algorithm 3 describes the method we use to find multiple rules. We use a separate-and-conquer strat- egy, using FrontierMiner to find the single best rule, then remove the covered samples from the training data. We stop either when 100 rules are learned, or when no positive samples remain in the training data. For comparison, we also use the same strategy to find multiple rules using the baseline method. We evaluate the rule lists on test data and present the results graphically (Fig- ures 4.5 – 4.8). The charts plot the cumulative precision as a function of the cumulative coverage for both FrontierMiner and the baseline method. Concretely, the first data point (x 1 , y 1 ) represents the coverage and precision of the first rule (R 1 ). The second data point (x 2 ,y 2 ) represents the coverage and precision of the first two rules together (R 1 _R 2 ). The third data point (x 3 ,y 3 ) represents the coverage and precision of the first three rules together (R 1 _R 2 _R 3 ), and so on. Algorithm 3 FrontierMiner in a separate-and-conquer covering strategy 1: ruleList [ ] . initialize empty rule list 2: repeat 3: rule FRONTIERMINER(samples) 4: ruleList.append(rule) 5: coveredSamples samples covered byrule 6: samples samplesncoveredSamples . remove covered samples 7: until 100 rules are learned, or until no positive samples remain 69 0 2000 4000 Coverage 0.85 0.9 0.95 1 Precision Dataset with Most Samples Baseline FrontierMiner 0 2000 4000 Coverage 0.1 0.15 0.2 0.25 0.3 Precision Dataset with Fewest Samples Baseline FrontierMiner 0 2000 4000 Coverage 0.6 0.65 0.7 0.75 0.8 Precision Dataset with Most Features Baseline FrontierMiner 0 2000 4000 Coverage 0.2 0.3 0.4 0.5 0.6 Precision Dataset with Fewest Features Baseline FrontierMiner Figure 4.5: (1 of 4) Cumulative precision and coverage of rules found using a separate- and-conquer strategy. The synthetic datasets shown were selected for their diversity. FrontierMiner was run withiters = 20. 70 0 2000 4000 Coverage 0.88 0.89 0.9 0.91 0.92 0.93 Precision Dataset with Highest Positive Class Proportion Baseline FrontierMiner 0 2000 4000 Coverage 0 0.05 0.1 0.15 Precision Dataset with Lowest Positive Class Proportion Baseline FrontierMiner 0 2000 4000 Coverage 0.65 0.7 0.75 0.8 0.85 0.9 Precision Dataset with Most Incorporated Rules Baseline FrontierMiner 0 2000 4000 Coverage 0.25 0.3 0.35 0.4 0.45 Precision Dataset with Fewest Incorporated Rules Baseline FrontierMiner Figure 4.6: (2 of 4) Cumulative precision and coverage of rules found using a separate- and-conquer strategy. The synthetic datasets shown were selected for their diversity. FrontierMiner was run withiters = 20. 
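The covering loop of Algorithm 3 can also be expressed compactly. In the sketch below, find_best_rule is a placeholder for any single-rule learner (FrontierMiner or the baseline), and rule.covers() is an assumed interface returning a boolean coverage mask; neither name comes from the released implementation.

# Minimal sketch of the separate-and-conquer covering loop in Algorithm 3.
import numpy as np

def separate_and_conquer(X, y, find_best_rule, max_rules=100):
    rule_list = []
    remaining = np.ones(len(y), dtype=bool)
    while len(rule_list) < max_rules and y[remaining].sum() > 0:
        rule = find_best_rule(X[remaining], y[remaining])   # learn on uncovered samples
        if rule is None:
            break
        rule_list.append(rule)
        covered = np.zeros(len(y), dtype=bool)
        covered[remaining] = rule.covers(X[remaining])      # separate: mark newly covered samples
        remaining &= ~covered                               # conquer the rest next iteration
    return rule_list

Because each iteration removes the samples covered so far, subsequent rules are forced to cover different regions of the data, which is the diversity guarantee noted in Section 4.2.4.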
[Figure 4.7: four panels plotting Precision vs. Coverage for Baseline and FrontierMiner: "Dataset with Highest Lift Incorporated Rule," "Dataset with Lowest Lift Incorporated Rule," "Dataset with Highest Precision Incorporated Rule," and "Dataset with Lowest Precision Incorporated Rule."]
Figure 4.7: (3 of 4) Cumulative precision and coverage of rules found using a separate-and-conquer strategy. The synthetic datasets shown were selected for their diversity. FrontierMiner was run with iters = 20.

[Figure 4.8: four panels plotting Precision vs. Coverage for Baseline and FrontierMiner: "Dataset with Highest Coverage Incorporated Rule," "Dataset with Lowest Coverage Incorporated Rule," "Dataset with Highest First Rule Found Difference," and "Dataset with Lowest First Rule Found Difference."]
Figure 4.8: (4 of 4) Cumulative precision and coverage of rules found using a separate-and-conquer strategy. The synthetic datasets shown were selected for their diversity. FrontierMiner was run with iters = 20.

4.4 Computational Complexity

Although FrontierMiner was designed to take advantage of excess modern processing power, it still proved to be more computationally efficient than the method we tested of tuning parameters using grid search with cross-validation. In this section, we show the computational complexity of the untuned baseline method, the computational complexity of the method of tuning through cross-validation, and also the computational complexity of FrontierMiner.

4.4.1 Computational Complexity of the Untuned Baseline

Table 4.6 defines the symbols used in the complexity expression for the untuned baseline method (beam search with b = 5, m-estimate with m = 22.466, and l = 10 as a proxy for unlimited rule length).

Table 4.6: Symbols used in the complexity expression.
Symbol | Description
N | Sample count
M | Feature count
b | Beam size
l | Maximum rule length

The computational complexity of the untuned baseline method is:

C_{base} = O(NMbl^2)    (4.16)

4.4.2 Relative Complexity of the Tuned Baseline

We define the relative complexity of a rule learning method as the ratio of its computational complexity to the computational complexity of the baseline method. The relative complexity of the tuned baseline method can be calculated directly. Its computational complexity depends on the specific values of b used during the grid search. The formula for the relative complexity of tuning with cross-validation is:

C_{tuned} = \sum_{b} C_{base} \cdot \frac{b}{b_{base}} \cdot folds \cdot |m|    (4.17)

where folds is the number of cross-validation folds, b_{base} is the beam size of the baseline method, and |m| is the number of values of m evaluated during cross-validation. In our experiments, we use b ∈ {1, 2, 3, 5, 10, 20, 50, 100, 1000}. Given these values, and that b_{base} = 5, the relative complexity of the tuned baseline method is 5,717.
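As a quick illustration of Eq. 4.17, the ratio C_tuned / C_base can be computed directly from the grid of beam sizes. In the Python sketch below, folds and num_m_values are hypothetical placeholders, not values restated from the dissertation; the formula only depends on their product, and a product of 24 reproduces approximately the 5,717 figure quoted above.

# Sketch of the relative-complexity calculation behind Eq. 4.17.
beam_sizes = [1, 2, 3, 5, 10, 20, 50, 100, 1000]   # grid-searched values of b
b_base = 5                                          # beam size of the untuned baseline

def relative_complexity(folds, num_m_values):
    """C_tuned / C_base for a beam-size grid search with cross-validation."""
    return sum(b / b_base for b in beam_sizes) * folds * num_m_values

# Example with placeholder settings whose product is 24.
print(relative_complexity(folds=3, num_m_values=8))   # ~5716.8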
4.4.3 Relative Complexity of FrontierMiner

Predicting the computational complexity of FrontierMiner is challenging. First, FrontierMiner is a nondeterministic algorithm. The bootstrap sampling procedure used to learn the penalty function selects a random fraction of the data during each iteration. Second, the FrontierMiner search algorithm relies on pre-pruning, which can drastically reduce the search space. The search algorithm progressively refines newly-found frontier rules and stops when no refinements are found which are themselves Pareto-optimal. The search complexity is therefore data-dependent. Datasets in which many frontier rules are found will take longer to explore, as each rule must be refined.

The worst-case runtime would occur if every rule evaluated during every search was Pareto-optimal, even up to the final rule which contains all possible features. In this case, the search would be exhaustive. The worst-case complexity of the full FrontierMiner algorithm would be:

C_{frontier} = O(iters \cdot N \cdot M \cdot 2^M)    (4.18)

where iters is the number of bootstrap iterations.

[Figure 4.9: scatter plot of the log of normalized FrontierMiner runtime vs. the log of the number of features, with a fitted line.]
Figure 4.9: Log-log plot showing that in practice FrontierMiner exhibits polynomial computational complexity. The slope of the red line is approximately 0.6, making the complexity O(iters · N · M^{0.6}). The runtime is normalized by dividing by the number of samples (N) in order to reduce noise and focus on the complexity as a function of the number of features (M). Each point on this plot represents one synthetic dataset.

Figure 4.9 represents the actual runtime of FrontierMiner (iters = 50) on each of the 1,000 synthetic datasets. The runtime is normalized by the number of samples in each dataset (N) and plotted against the number of features (M). Figure 4.9 is a log-log plot. On such a plot, a linear relationship between the independent variable and the dependent variable would indicate polynomial computational complexity. In this figure, the red line represents a linear model fitted to the log-transformed variables. The slope of the red line is approximately 0.6. To the extent that a linear model appropriately fits the transformed data, the slope indicates the approximate measured computational complexity:

C_{measured} = O(iters \cdot N \cdot M^{0.6})    (4.19)
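The exponent in Eq. 4.19 comes from fitting a line to the log-transformed runtimes. A minimal NumPy sketch of that fit is shown below; the arrays are illustrative stand-ins for the measurements behind Figure 4.9.

import numpy as np

# Illustrative measurements: one entry per synthetic dataset.
num_features = np.array([20, 50, 80, 120, 200, 350, 500])          # M
num_samples = np.array([5000, 4000, 6000, 5500, 7000, 6500, 8000]) # N
runtime_secs = np.array([0.9, 1.6, 2.1, 2.8, 3.9, 5.5, 6.8])       # total runtime

# Normalize runtime by N, then fit a line in log-log space.
normalized = runtime_secs / num_samples
slope, intercept = np.polyfit(np.log(num_features), np.log(normalized), deg=1)

# The fitted slope plays the role of the exponent in O(iters * N * M^slope).
print(f"estimated exponent: {slope:.2f}")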
Figure 4.10 is a histogram of the relative FrontierMiner runtime (FrontierMiner runtime divided by the baseline method runtime) on the synthetic datasets. This histogram shows that FrontierMiner (with iters = 50) generally takes between 20x and 60x as long to run as the baseline method does. FrontierMiner is an order of magnitude more computationally complex than the baseline method; however, it is two orders of magnitude less computationally complex than the grid search with cross-validation method we used for parameter tuning, which is 5,717x more computationally complex than the baseline method (Section 4.4.2).

Figure 4.10: Histogram of FrontierMiner relative runtimes on 1,000 synthetic datasets, with iters = 50.

Chapter 5
Incorporating Domain Knowledge into Objective Functions

One benefit of using machine learning algorithms that produce interpretable models is the opportunity it provides to vet the results. Vetting a model is done by comparing the knowledge learned from the data with prior domain knowledge. In this chapter, instead of using domain knowledge for vetting, we study a way of incorporating it directly into the objective function as a way to improve model accuracy. We choose the area of intervention analysis as a test-bed for exploring the effect of incorporating domain knowledge into regression problems.

5.1 Intervention Analysis

Intervention analysis was introduced by Box & Tiao (1975) as a means of assessing the impact of a special event on a time series. In one example, they evaluate whether gasoline regulation in 1960 impacted smog levels in Los Angeles. The effect is not obvious in the noisy graph of monthly smog levels. However, their method is able to quantify even weak effects in such noisy time series. Their method has three steps:

1. Identification – frame a model for change which describes what is expected to occur given knowledge of the known intervention;
2. Fitting – work out the appropriate data analysis based on that model;
3. Diagnostic Checking – if the model proves inadequate for inference, make necessary adjustments and repeat the analysis.

5.2 Intervention Optimization

Our research extends intervention analysis from the case of one event to the case of many events. The goal is to reliably predict which events will have the highest impact (defined later in this section) on their corresponding time series. We call this intervention optimization and have not found it previously discussed in data-mining literature. An example use-case is optimizing the impact of advertising campaigns on same-store sales for a retail business. In this case, the events include various kinds of advertising campaigns, such as locally airing a television ad or mailing fliers. Each event is expected to affect a corresponding time series, in this case, sales at the targeted store location. Another example use-case is optimizing cyclic steam injection for enhanced oil recovery in highly viscous oil fields. Pausing production periodically to send steam down an oil well warms the surrounding oil, making it easier to extract once pumping resumes. The increase in production depends on the well, the reservoir, and the characteristics of the steam job. Intervention optimization is useful in these use-cases, because it helps maximize the impact of limited resources (ad budget or steam supply).

Intervention optimization requires training a model that predicts the impact of an event based on a set of descriptive features. Modeling the highest-impact events is challenging due to the propensity to over-fit. Models that make inferences based on too-few historical high-impact examples can have low out-of-sample accuracy. Additionally, over-fitting can be caused by noise in the time series, which affects the estimation of impact. PRIMER, the intervention optimization system we propose, adapts to noisy time series by requiring more samples for inference. We evaluate PRIMER and other modeling techniques by the average impact of their top-predicted x% of intervention events, using held-out data and given only the features describing each event. The same techniques can be applied to general event studies where one has no direct control over the events, but would still benefit from a predictive or explanatory impact model.

We define the impact of an event as the cumulative subsequent "boost" in time series values caused by the event. Concretely, we assume a given event E occurs at time t_0 and is expected to affect time series R of realized values. In our notation, R_t refers to the value of R at time index t. The first value of R subject to the influence of E is R_{t_0}, and the value one time step after the event is R_{t_0+1}. We first calculate a baseline time series B of expected values had the event not occurred. In the simplest case, B_t = R_{t_0-1}, ∀t : t ≥ t_0, which assumes the time series would have simply remained at the last known pre-event level. In more complex cases, B must be determined from a domain-specific model, including considerations such as autocorrelation and seasonality. We define S such that:

S_t = R_t − B_t    (5.1)

making it the unexplained residual after all known effects unrelated to E have been removed from R. If E were to have no impact, then the expectation \mathbb{E}[S] = 0. Finally, we define the impact of E on S over a finite period (n time steps) as:

impact_E(S) = \sum_{t=t_0}^{t_0+n-1} S_t    (5.2)

We assume the effect of E begins at t_0, meaning the event is unanticipated in the time series: B_t = R_t, ∀t : t < t_0.
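To make Eqs. 5.1 and 5.2 concrete, the following is a minimal Python sketch of the impact calculation under the simplest baseline (the series is assumed to stay at its last pre-event level); the function and variable names are illustrative.

import numpy as np

def event_impact(realized, t0, n):
    """Impact of an event at index t0 on realized series R, per Eqs. 5.1-5.2.

    Uses the simplest baseline B_t = R_{t0-1} for t >= t0, i.e. the series is
    assumed to have stayed at its last known pre-event level.
    """
    baseline = np.full(n, realized[t0 - 1])        # B_t for t0 <= t < t0 + n
    residual = realized[t0:t0 + n] - baseline      # S_t = R_t - B_t
    return residual.sum()                          # impact = sum of S_t

# Example: a series that jumps from 10 to 12 at the event and decays back.
R = np.array([10.0, 10.0, 10.0, 12.0, 11.5, 11.0, 10.5, 10.0, 10.0])
print(event_impact(R, t0=3, n=5))   # (2 + 1.5 + 1 + 0.5 + 0) = 5.0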
5.3 Interpretability

Modeling impact as a function of input variables is a regression problem, and we include state-of-the-art regression algorithms in the experiments in Section 5.5. However, our interest is in interpretable models which more easily provide insight to the user. For this reason, we also evaluate regression rule learning algorithms, which are designed to emphasize interpretability over accuracy. As described in Section 2.2, regression rules are simple piece-wise constant models, taking the form:

(F_1 \wedge F_2 \wedge F_3) \rightarrow C    (5.3)

This is understood to mean that an event containing features F_1, F_2, and F_3 would have a predicted impact of C. Some systems generate ordered rule lists where the predicted value of a sample comes from the first rule that covers it, meaning the first rule where the sample contains all the rule features (Brookhouse & Otero, 2015; Janssen & Fürnkranz, 2010b, 2011; Sikora et al., 2012). Other systems generate unordered ensembles of rules, where all rules that cover a sample provide an additive contribution to the overall prediction (Dembczyński et al., 2008).

Currently-proposed regression rule learning systems are not specifically designed to produce rules easily interpretable for the task of intervention optimization. The ideal set of rules for this task should be:

- short, covering only the highest-impact events
- sorted in descending order of predicted impact
- free of exceptions or caveats to the predictions

Current systems do not produce rule sets with these attributes. Ensemble-based systems produce rule sets which cannot be shortened without affecting the predictions. Ensemble rule sets can be sorted in descending order of impact contribution to improve interpretability, but caveats remain in the form of rules with a negative impact contribution. For example, a rule with features F_1 and F_2 may have a high impact contribution, but another rule with only feature F_1 may have a negative impact contribution. Therefore, a user cannot reliably identify high-impact events by simply remembering the high-impact rules. Likewise, ordered rule lists generated by current systems are complicated by exceptions. The rules are not ordered according to impact, but are ordered according to how well they reduce a loss function such as mean squared error. The rule lists cannot be re-sorted by impact, because the order of precedence must be maintained for accurate predictions. For example, a high-impact rule in the middle of the list with feature F_1 may be preempted by an earlier low-impact rule with feature F_2. So, a user cannot simply say that events with feature F_1 will have high impact, without also mentioning the exception for events that also contain feature F_2.

PRIMER is a rule-based system designed specifically to generate interpretable rules for intervention optimization. The rule lists generated by PRIMER are ordered according to expected impact.
The user can review and retain only as many rules as needed to cover a sufficient number of events. Because of the rule ordering, predictions can be interpreted as minimum predictions. There is no concern about exceptions, because earlier rules in the list have predictions at least as high.

The requirement of using interpretable models sometimes means accepting lower accuracy. PRIMER, however, is able to maintain high accuracy by taking advantage of domain knowledge specific to the task of intervention optimization. The domain knowledge provided by the user is the functional form of the expected response pattern. By knowing the expected pattern to find in the time series, PRIMER is better able to disregard spurious fluctuations and noise. In this paper, we limit our scope to response patterns that exhibit a strong initial response that decays following the event, until eventually the effect has dissipated. The two example use-cases mentioned both have this form of response pattern. In the retail business use-case, the effect of an ad is likely to be strongest initially, before it slowly gets forgotten. In the oil recovery use-case, production is highest immediately after steam injection. As the oil cools, production gradually decays back to its original level.

PRIMER has a unique objective function designed specifically for intervention optimization. It combines three heuristics to improve out-of-sample performance: impact, coverage, and goodness-of-fit to the expected response pattern. We show its effectiveness in a large-scale event study modeling the impact of insider trading filings on intra-day stock returns. The study of market reactions to news, and filings in particular, is an active area of research in Behavioral Finance (Li & Ramesh, 2009; You & Zhang, 2009). PRIMER has the capability of providing insight into which filing characteristics most influence the market. In our tests, we show that PRIMER is competitive with state-of-the-art regression algorithms at identifying high-impact filings, while the model output has improved interpretability.

5.4 PRIMER

We propose a new method for intervention optimization: Pattern-specific Rule-based Intervention analysis Maximizing Event Response (PRIMER). In this section, we refer to Algorithm 4 as we describe each part of the method.

The inputs to PRIMER, as shown in Algorithm 4, include a set of events. Each event contains a descriptive set of binary features. Each also contains the corresponding response, which is a short time series segment beginning with the event at time t_0 and lasting until the impact of the event has substantially decayed. The event response is a sub-sequence of S, defined in Eq. 5.1.

5.4.1 Separate-and-Conquer with Beam Search

PRIMER uses the separate-and-conquer strategy (Fürnkranz, 1999) common to most rule learning systems, discussed in Section 2.1.3. This strategy iteratively identifies the events not yet covered by any rule (separate), and then learns a new rule using only the uncovered events (conquer). The new rule covers additional events, which are then removed from consideration in the next iteration. This guarantees subsequent rules have diversity in coverage. Lines 1 – 7 in Algorithm 4 describe our implementation of this strategy. This loop is repeated until rules are discovered that cover all events, or until the empty rule is returned because no better rule could be found.

To find each new rule, PRIMER uses top-down beam search, discussed in Section 2.1.2. Beam search is a greedy heuristic search method which is used by the vast majority of rule learning algorithms (Fürnkranz et al., 2012). Beam search starts with an empty rule that covers all samples and greedily refines it by adding features as conditions. It maintains a beam of the best b rules of each rule length. To find rules of length l, each feature is successively added as a refinement to each rule in the beam for length l − 1. The refined candidate rules are then evaluated, and the top b rules are stored in the beam for length l. After reaching some stopping criterion, the best rule from all lengths is selected. Limiting the beam size limits the extent of the search. Setting b = 1 makes beam search equivalent to hill-climbing. Setting b = ∞ makes it equivalent to exhaustive search. PRIMER implements top-down beam search in the FINDBESTRULE function (lines 8 – 21). The function returns the single highest-scoring rule discovered during the search.
Beam search is a greedy heuristic search method which is used by the vast majority of rule learning algorithms (Fürnkranz et al., 2012). Beam search starts with an empty rule that covers all samples and greedily refines it by adding features as condi- tions. It maintains a beam of the bestb rules of each rule length. To find rules of length l, each feature is successively added as a refinement to each rule in the beam for length l 1. The refined candidate rules are then evaluated, and the topb rules are stored in the beam for lengthl. After reaching some stopping criterion, the best rule from all lengths is selected. Limiting the beam size limits the extent of the search. Settingb = 1 makes beam search equivalent to hill-climbing. Setting b =1 makes it equivalent to exhaustive search. PRIMER implements top-down beam search in the FINDBESTRULE function 84 (lines 8 – 21). The function returns the single highest-scoring rule discovered during the search. 5.4.2 Objective Function During beam search, each candidate rule is evaluated and scored by an objective function. PRIMER’s unique objective function combines three heuristics to improve out-of-sample performance: impact, goodness-of-fit, and coverage. Impact. The goal of intervention optimization is to maximize the impact of future interven- tions. We evaluate rules according to the average impact of the out-of-sample events they cover, so historical impact is naturally an important heuristic. For rule evaluation, the impact for an event is defined in Eq. 5.2. During model training, however, PRIMER optimizes only on impact that fits the expected response pattern given by a user with domain expertise. This helps avoid over-fitting to noise or spurious fluctuations in the time series. The key to the intervention analysis method proposed by Box & Tiao (1975) is specifying the expected response pattern. Their method is able to quantify weak effects in noisy time series by relying on the use of a transfer function, a tentative specification of the stochastic model form. The transfer function is based on prior knowledge of the intervention, and how the time series is expected to react. Some example transfer functions they listed include linear, pulse, and step functions. PRIMER inherits the use of a transfer function from their work on intervention analysis. 85 With PRIMER, we have tested response patterns that exhibit an abrupt initial response at timet 0 which decays back down to the pre-intervention level withinn time steps. Specifically, we have tested the exponential decay function: f(t) =Ae k(t1) ; t 1 (5.4) and the power law function: f(t) =At k ; t 1 (5.5) In both cases,A is the scaling parameter, andk determines the rate of decay. Restricting the impact to only include the fitted area under the transfer function curve makes the algorithm more robust by down-weighting, for example, spurious spikes that occur well aftert 0 . The first step in calculating the fitted impact of a rule is to average the responses of all events covered by the rule (avgResponse in line 23 of Algorithm 4). The next step is to fit the the transfer function to the average response by minimizing the sum of squared differences: minimize A;k n X t=1 (f(t)avgResponse t ) 2 subject to A;k 0 (5.6) We use the trust-region-reflective optimization algorithm (Coleman & Li, 1996), which allows us to specify lower bound constraints of zero on the fit parameters. Goodness-of-Fit. 
Goodness-of-Fit.

In PRIMER, we score each rule conservatively, based on the goodness-of-fit of the average response to the transfer function. In line 24 of Algorithm 4, we calculate confidence intervals for the fit parameters optimized in Eq. 5.6. We calculate the confidence intervals based on the asymptotic normal distribution for the parameter estimates (Seber & Wild, 1989). The level of confidence used in the interval calculation is a user-specified parameter, α, which produces 100(1 − α) percent confidence intervals.

In line 28 of Algorithm 4, we choose the more conservative values from the confidence intervals for each fit parameter. For the scale parameter A in the exponential decay transfer function (Eq. 5.4) and the power law transfer function (Eq. 5.5), we use the lower bound confidence interval. For the rate of decay parameter k, we use the upper bound confidence interval. We calculate the rule score as the area under the transfer function using the conservative confidence interval values for parameters (line 29). This score is lower than the fitted impact due to the reduced scale and increased rate of decay. This penalizes rules with poorly-fitting average response curves which would otherwise have had high impact according to the fitted parameters.

We also use the confidence intervals as a stopping criterion. If either parameter is not significantly greater than zero according to the confidence intervals, the rule is given a score of zero. If no rule can be found with a score greater than zero, the default empty rule is chosen by the beam search, since it has been assigned a small positive score (line 10). The empty rule has no conditions and covers all remaining samples, causing the separate-and-conquer loop to terminate.

Coverage.

PRIMER includes a bias toward rules with high coverage. This trade-off of impact for coverage in the objective function has the potential to improve out-of-sample performance, because high-coverage rules are less prone to over-fitting.

Due to random noise in the time series, an individual event response is unlikely to closely resemble the transfer function. A rule that covers only a small number of events will have a noisy average response. As discussed previously, a poor fit of the average response to the transfer function reduces the rule score, possibly to zero. A rule that covers many events will have a well-behaved average response where the random noise averages out. With less noise, the fit becomes better, and the rule score increases. In this way, high-coverage rules are favored.

The trade-off between coverage and impact is controlled by the parameter α, which was introduced in the previous section. High α equates to low-confidence bounds, which are close to the least squares fit parameters. Inversely, low α increases confidence that the true parameters are within the intervals by widening the intervals. Effectively, lowering α reduces tolerance for noisy average responses, which then increases the bias toward rules with less noise; and rules with less noise tend to be rules with higher coverage.

Figure 5.1 illustrates how increasing coverage increases the score for a rule on synthetic data. This plot involves samples with the same power law decay response added to noisy time series. Adding samples smooths the average response, which raises the lower bound on the fitted curve.

Figure 5.1: Illustration using synthetic data, where each response sample is generated using the power law function with white noise added. Averaging many samples improves the goodness-of-fit, raising the lower confidence bound and increasing the score of a rule.
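Putting the pieces of Section 5.4.2 together, the conservative scoring step (lines 24 – 29 of Algorithm 4) can be sketched as follows. The normal-approximation intervals built from curve_fit's covariance matrix and the discrete sum standing in for the area under the curve are simplifications of the dissertation's procedure, and the data is illustrative.

import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

def power_law(t, A, k):
    return A * t ** (-k)                          # Eq. 5.5

avg_response = np.array([0.020, 0.009, 0.006, 0.004, 0.003,
                         0.002, 0.002, 0.001, 0.001, 0.001])
t = np.arange(1, len(avg_response) + 1)
(A_hat, k_hat), pcov = curve_fit(power_law, t, avg_response,
                                 p0=[0.01, 1.0], bounds=(0, np.inf))

alpha = 0.2                                       # user-specified confidence parameter
z = norm.ppf(1 - alpha / 2)                       # two-sided normal critical value
se_A, se_k = np.sqrt(np.diag(pcov))               # asymptotic standard errors

# Conservative parameter choices: lower bound on scale A, upper bound on decay k.
A_low, k_high = A_hat - z * se_A, k_hat + z * se_k

if A_low <= 0 or (k_hat - z * se_k) <= 0:
    score = 0.0                                   # a parameter is not significantly > 0
else:
    score = power_law(t, A_low, k_high).sum()     # conservative area under the curve
print(score)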
5.5 Experiments

We evaluate PRIMER on readily available data from the U.S. financial markets. Our objective is to predict which insider trading reports have the largest positive effect on intra-day stock prices. We cannot influence such financial news, so this use-case is not true intervention optimization. However, as an event study, we can evaluate PRIMER's ability to identify high-impact events.

5.5.1 Data

Events.

For events, we use Form-4 regulatory filings disseminated by the U.S. Securities and Exchange Commission (ftp://ftp.sec.gov/edgar/Feed/). Form-4 filings are submitted by insiders (officers, directors, etc.) in publicly traded companies to disclose changes in personal ownership of company shares. We filter these down to include only purchases of common stock, which are considered more informative by analysts and market participants than sales. Insiders may sell for a variety of uninformative reasons, including diversification and raising cash for personal reasons, whereas they buy primarily because they believe company shares will rise in value. We further filter the filings down to just those that become public during business hours when the markets are open. This allows us to measure the intra-day response to each filing.

Each filing has such features as:

- the insider type: director, officer, 10% owner, or other
- the total dollar value of direct purchases, discretized
- the total dollar value of indirect purchases, discretized
- various transaction codes

We also include the market capitalization of the company prior to the filing, discretized. Our final set of events has 136 binary features covering 158,983 events from years 2004 – 2014.

Time Series.

For time series, we use stock returns of the company associated with each filing. We use tick data provided by Wharton Research Data Services (https://wrds-web.wharton.upenn.edu/wrds/). We pre-process the data by converting it to 5-minute bars, creating a time series P of the last traded price within each 5-minute time period. The return time series is calculated as:

S_t = \frac{P_t - P_{t-1}}{P_{t-1}}    (5.7)

The event response listed as an input in Algorithm 4 is the length-10 sub-sequence of S beginning with the first return affected by the event:

event.response = [S_{t_0}, S_{t_0+1}, \ldots, S_{t_0+8}, S_{t_0+9}]    (5.8)

In cases with insufficient trades to calculate each bar of the response, the event is removed from the dataset. The total response is the target variable to be maximized by PRIMER.
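A minimal Python sketch of this pre-processing step (Eqs. 5.7 and 5.8) is given below; the price series and event index are illustrative, and the window length of 10 matches the response length described above.

import numpy as np

def simple_returns(prices):
    """Eq. 5.7: S_t = (P_t - P_{t-1}) / P_{t-1} for a series of 5-minute bars."""
    prices = np.asarray(prices, dtype=float)
    return (prices[1:] - prices[:-1]) / prices[:-1]

def event_response(returns, t0, length=10):
    """Eq. 5.8: the length-10 sub-sequence of S starting at the event index t0.

    Returns None when there are not enough bars to fill the window,
    mirroring the removal of such events from the dataset.
    """
    window = returns[t0:t0 + length]
    return window if len(window) == length else None

# Illustrative 5-minute closing prices around a filing that becomes public at bar 3.
prices = [20.00, 20.02, 20.01, 20.25, 20.30, 20.28, 20.31, 20.29,
          20.30, 20.32, 20.31, 20.33, 20.34, 20.33]
S = simple_returns(prices)
resp = event_response(S, t0=3)
print(resp, resp.sum())   # total response = target to be maximized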
5.5.2 PRIMER Settings

We choose a transfer function based on prior knowledge that the financial markets respond positively to insider purchases, and that the effect of the news decays once the information is fully disseminated and acted upon by market participants. We choose the power law decay transfer function over the exponential decay function, because it more closely fits the average response of all events in the dataset, as seen in Figure 5.2.

Figure 5.2: Average response of all events in the insider purchase dataset.

The parameter b for beam size varies the extent of search. For the entire experiment, we use b = 5. This value has been commonly used in rule learning (Clark & Niblett, 1989; Možina et al., 2006). Similarly low values of b have been shown to often out-perform large values when tested out-of-sample (Quinlan & Cameron-Jones, 1995; Janssen & Fürnkranz, 2009).

The parameter l constrains the maximum rule length. We set l = 10 for the entire experiment, a value we believe is high enough to impose minimal constraint.

The parameter α determines the confidence intervals. A low value means more emphasis on the goodness-of-fit heuristic. If the value is too low, few rules will be discovered. We experiment with three values: α ∈ {0.1, 0.2, 0.3}.

5.5.3 Evaluation Method

Evaluation is based on the average impact (average sum of return bars) as a function of the percent of test data covered. First, test events are sorted in descending order according to predicted impact. Then the average actual impact is calculated using the top x% of events, for x = 1 ... 100. Models are preferred which best predict the highest impact events for each given coverage percentile.

All experiments are evaluated using 10-fold cross-validation, where each model is trained on 90% of the data and tested on the remaining 10%. This is performed ten times, once for each test partition, and the results are averaged. All models are evaluated using the same ten randomly partitioned folds. For some models, we optimize a hyper-parameter by further use of 10-fold cross-validation in an inner loop. For each fold of the outer loop, once the hyper-parameter value is chosen, it is used to train a model on the full set of training data in the fold. In total, for a model with one hyper-parameter, we run the training 110 times on subsets of data.
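The evaluation just described reduces to sorting test events by predicted impact and averaging the realized impact of the top x%. A minimal Python sketch, with illustrative arrays, is shown below.

import numpy as np

def avg_impact_of_top(predictions, actual_impacts, pct):
    """Average realized impact of the top pct% of events ranked by prediction."""
    order = np.argsort(predictions)[::-1]            # descending by predicted impact
    k = max(1, int(round(len(order) * pct / 100.0)))
    return actual_impacts[order[:k]].mean()

# Illustrative test-set predictions and realized total responses.
rng = np.random.default_rng(0)
pred = rng.normal(size=1000)
actual = 0.5 * pred + rng.normal(scale=1.0, size=1000)

for pct in (1, 5, 10, 25, 50):
    print(pct, round(avg_impact_of_top(pred, actual, pct), 3))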
5.5.4 Baselines for Comparison

We compare PRIMER with common regression models as well as state-of-the-art regression rule learning algorithms:

- Ridge Regression. For linear regression with L_2-norm regularization, we use the MATLAB function ridge, in conjunction with cross-validation to find the optimal value for the regularization parameter, out of the values {10^{-4}, 10^{-3}, ..., 10^{5}, 10^{6}}.
- LASSO. For linear regression with L_1-norm regularization, we use the MATLAB function lasso, which has built-in cross-validation (Friedman et al., 2010). For the regularization parameter, we set the function to use a geometric sequence of ten values, the largest just sufficient to produce a model with all zeros.
- Support Vector Regression. For linear L_2-regularized Support Vector Regression (Ho & Lin, 2012), we use liblinear (Fan et al., 2008). We use cross-validation to find the optimal value for the cost parameter, out of the values c ∈ {10^{-4}, 10^{-3}, ..., 10^{3}, 10^{4}}.
- RegENDER. Regression Ensemble of Decision Rules (REGENDER) is available as a Java library which integrates with the Weka data-mining environment (Hall et al., 2009). We run it with the parameters recommended by the authors (gradient boosting, squared-error loss function, 200 rules, ν = 0.5, with resampling set to 50% drawn without replacement) (Dembczyński et al., 2008). We also run it with the default parameters in Weka, which have three differences (simultaneous minimization, 100 rules, and ν = 1.0).
- M5'Rules. We use the implementation of M5'Rules (Holmes et al., 1999) included with Weka. We use the default parameters (minimum number of instances = 4, build regression tree = false, unpruned = false, use unsmoothed = false).
- Dynamic Reduction to Classification. Dynamic Reduction to Classification (Janssen & Fürnkranz, 2011) also integrates with Weka. We tried numerous heuristics and found Laplace to be the best. We tried using a minimum coverage of 90% as recommended by the authors, but found that using 100% worked better for our dataset. In this case, we report results only for the best settings.
- SeCoReg. Our testing of Separate-and-Conquer Regression (SeCoReg) (Janssen & Fürnkranz, 2010b) was inconclusive. We confirmed that the recommended parameters (α = 0.591, minimum coverage of 90%) worked well on small datasets. However, our dataset is too large, and we did not wait for the algorithm to complete.

5.6 Results

Table 5.1 and Figure 5.3 both show the test results. No model performed significantly better than all others over the entire set of coverage levels. REGENDER, using its authors' recommended parameters, has the best overall performance. However, several models have similar performance, including PRIMER.

These results show that PRIMER outperforms Dynamic Reduction to Classification, which is its closest algorithm in terms of model form. We assert that the reason PRIMER has such strong performance despite being restricted to a simple model form is that it incorporates extra domain knowledge which helps it avoid over-fitting.

Table 5.1: Average total response of the events covered by the top x% of predictions of each model (top 1%, top 5%, etc.)
Algorithm | 1% | 5% | 10% | 25% | 50%
PRIMER, α = 0.1 | 0.0207 | 0.0142 | 0.0099 | 0.0045 | 0.0023
PRIMER, α = 0.2 | 0.0203 | 0.0142 | 0.0100 | 0.0046 | 0.0023
PRIMER, α = 0.3 | 0.0204 | 0.0141 | 0.0101 | 0.0046 | 0.0024
Ridge Regression | 0.0202 | 0.0136 | 0.0097 | 0.0047 | 0.0024
LASSO | 0.0202 | 0.0129 | 0.0098 | 0.0047 | 0.0024
Support Vector Regression | 0.0161 | 0.0126 | 0.0087 | 0.0042 | 0.0023
REGENDER (authors) | 0.0214 | 0.0145 | 0.0100 | 0.0049 | 0.0025
REGENDER (Weka) | 0.0208 | 0.0143 | 0.0100 | 0.0049 | 0.0025
M5'Rules | 0.0203 | 0.0141 | 0.0098 | 0.0049 | 0.0025
Dynamic Reduction | 0.0198 | 0.0128 | 0.0084 | 0.0040 | 0.0022

Figure 5.3: Average total response of the events covered by the top x% of predictions. PRIMER is highlighted to show it is competitive with state-of-the-art models.

5.6.1 Example Rules

In Table 5.2, we list some top rules learned by PRIMER on the insider trading data. This is to show concretely the type of model produced by PRIMER. The interpretability of models such as this provides transparency into the relationships learned from the data.

Table 5.2: Highest-impact rules learned by PRIMER on the insider trading data.
Rule | Features | Impact | Coverage
#1 | directInsiderBuyDollars ≥ $90,041; isOfficer = true; marketcap < $371,668m; hasTransactionCodeP = true | 0.023 | 565
#2 | directInsiderBuyDollars ≥ $51,473; isOfficer = true; marketcap < $371,668m; hasTransactionCodeP = true | 0.018 | 323
#3 | indirectInsiderBuyDollars ≥ $89,187; isDirector = true; marketcap < $371,668m; hasTransactionCodeP = true | 0.017 | 354
#4 | directInsiderBuyDollars ≥ $90,041; isDirector = true; marketcap < $371,668m; hasTransactionCodeP = true | 0.017 | 492
#5 | directInsiderBuyDollars ≥ $27,764; directInsiderBuyDollars < $90,041; marketcap < $371,668m; hasTransactionCodeP = true | 0.014 | 1428

5.7 Discussion

PRIMER may not work with every dataset. Ideally, the average response of the full set of events closely fits the given transfer function, despite the diluted average impact. Because PRIMER uses greedy search, an average response sufficiently fitting the transfer function must be located at least within the first search iteration (length-1 rule). Otherwise, the program will exit with only the default rule. For best results, there must be a greedy path through the search tree to find high-impact branches. The decay patterns we have studied work well in this regard. The average of two exponential decay functions is another exponential decay function if the decay parameters are the same and only the scale parameters differ. The same is true for power law functions.
Even when the decay parameters are not the same, the average of two decay functions often resembles another decay function closely enough for greedy search to work.

Algorithm 4 PRIMER
Input:
    events    ▷ set of events or interventions
    event.features    ▷ each event has a set of descriptive features
    event.response    ▷ time series segment starting at the time of the event
    b    ▷ beam size
    l    ▷ max rule length
    T    ▷ transfer function
    α    ▷ alpha for calculating fit parameter confidence intervals
Output:
    ruleList    ▷ ordered list of discovered rules
1: ruleList ← [ ]    ▷ init empty rule list
2: repeat
3:     rule ← FINDBESTRULE(events, b, l, T, α)
4:     ruleList.append(rule)
5:     coveredEvents ← events in events covered by rule
6:     events ← events \ coveredEvents
7: until events = ∅
8: function FINDBESTRULE(events, b, l, T, α)
9:     emptyRule.conditions ← {}    ▷ empty rule has no conditions, covers all events
10:     emptyRule.score ← ε    ▷ a tiny score makes it rank higher than 0-score rules
11:     beam ← [ ]    ▷ beam is an array of rule lists; beam[2] holds a list of length-2 rules, etc.
12:     beam[0] ← [emptyRule]    ▷ init length-0 rule list to hold the empty rule
13:     for i = 1 to l do
14:         rules ← all refinements to rules in beam[i − 1]
15:         for each rule ∈ rules do
16:             rule.score ← EVALUATERULE(rule, events, T, α)
17:         end for
18:         beam[i] ← BEST(rules, b)    ▷ keep b best rules of length i
19:     end for
20:     return best rule in all of beam
21: end function

Algorithm 4 PRIMER (continued)
22: function EVALUATERULE(rule, events, T, α)
23:     avgResponse ← average response of events covered by rule
24:     ci ← calculate parameter confidence intervals of T fitted to avgResponse
25:     if min(ci) ≤ 0 then
26:         score ← 0    ▷ because a parameter is not significantly different from zero
27:     else
28:         p_min ← MINAUC(ci)    ▷ for each parameter, choose the value from ci that minimizes the area under the curve
29:         score ← AUC(T, p_min)    ▷ smallest confident area under the curve
30:     end if
31:     return score
32: end function

Chapter 6
Learning Objective Functions from Domain Expert Feedback

One data mining goal is to automatically identify interesting patterns in a dataset. Algorithms developed for this purpose utilize an interestingness measure as an objective function that assigns a numerical score to each discovered pattern, in order to evaluate and rank them, as described in Section 2.1.1. Numerous interestingness measures have been proposed, surveyed, and evaluated for different domains (Hilderman & Hamilton, 2001; Vaillant et al., 2004; McGarry, 2005; Geng & Hamilton, 2006; Lenca et al., 2008; Tan et al., 2002; Blanchard et al., 2005). The choice of interestingness measure depends on the specific domain since a pattern can exhibit multiple desirable attributes which must be traded-off against each other.

Designing an interestingness measure for a specific domain is challenging and typically requires a domain expert to create a new function and identify a set of features that can be calculated from the dataset attributes (Ohsaki et al., 2004). As an alternate approach, we propose a method to learn an interestingness measure from crowd-sourced data collected from end-users in the domain community. In our approach, domain users are presented with pairs of candidate patterns and are asked to rank one over the other. Pairwise ranking is a non-arduous way for domain users to share preference information. It also facilitates the combining of preference information from multiple users.
The collected pairwise rankings are then provided as input to a learning-to-rank algorithm to learn a model of user preference which can be used as an interestingness measure. The features in the learning model are previously proposed interestingness measures for the domain. The result is a custom measure that represents "real human interest" (Ohsaki et al., 2004) in the domain as expressed by its users.

We demonstrate the proposed approach and evaluate its effectiveness in the domain of finance, specifically the task of learning an investment performance measure that reflects the preferences of investment professionals. Investment preference rankings are collected from users of online discussion forums comprised of quantitative analysts and traders. The model features that are used in the learning-to-rank algorithm include existing investment performance metrics and ratios. The learned model achieves an accuracy of 80% for predicting the domain users' preference, while the highest accuracy of any single existing performance measure is 77%.

We believe that learning such an interestingness measure can benefit this domain since there is a large number of investment choices. For instance, the United States has over 5,000 exchange-traded stocks and over 7,000 mutual fund choices. Our proposed approach can enable individuals to locate investments that match their specific interests. Moreover, the learned interestingness measure can also be used as an objective function for portfolio selection and optimization.

The contributions in this chapter are as follows:

1. We propose a novel approach based on learning-to-rank algorithms that enables a domain-specific performance measure to be learned from domain community contributions. The method requires only pairwise preferences from domain experts.
2. We evaluate this approach in the domain of investment ranking and show that the learned performance measure has higher accuracy than existing domain-specific measures. We also address issues of data quality that are critical in crowd-sourced datasets.
3. We show the benefit for individuals of using even a moderate amount of training data to guide the selection of an existing performance measure.
4. We provide all data collected as part of this study to encourage further research in this area (http://thames.usc.edu/rank.zip).

6.1 Finance Background

Investment performance measures are designed to weigh the risk as well as the reward, and are therefore called "risk-adjusted returns." Metrics are structured as ratios, with return on investment in the numerator, and risk in the denominator. In this way, a single metric can compare two investment options with different risk profiles.

While return on investment is a standard measure of reward, there are numerous measures of risk, and hence consensus has not yet been reached as to which performance measure is best (Farinelli et al., 2008). New performance metrics continue to be proposed (Cogneau & Hübner, 2009; Mistry & Shah, 2013), and investors have to choose from among them (Bacon, 2012). We first describe equity graphs, which provide a visualization of asset performance, followed by a summary of performance measures that will be used as features in our learning model.

6.1.1 Equity Graphs

Historical performance is often presented as an equity graph, which shows the value of one's investment account over time. Equity graphs enable domain experts to rapidly evaluate historical performance. While there are different types of equity graphs, in our work we use the common variant where the graph presents a cumulative sum of daily returns. This is equivalent to assuming exactly one dollar was invested each day, with profits removed from the account. Such a graph is easy to examine, since the ideal is a straight line from the lower left corner to the upper right corner. Examples are shown in Figures 6.1, 6.2, and 6.3.
While there are different types of equity graphs, in our 1 http://thames.usc.edu/rank.zip 102 work we use the common variant where the graph presents a cumulative sum of daily returns. This is equivalent to assuming exactly one dollar was invested each day, with profits removed from the account. Such a graph is easy to examine, since the ideal is a straight line from the lower left corner to the upper right corner. Examples are shown in Figures 6.1, 6.2, and 6.3. 6.1.2 Distribution-Based Measures Many performance measures calculate risk based on the distribution of returns. For a time seriesR, the return on investment for each period,R t is: R t S t S t1 S t1 whereS t is the asset value at timet. The baseline investment performance measure is the reward to variability ratio, the Sharpe ratio (Sharpe, 1966). The Sharpe ratio is widely used (Eling, 2008), with surveys showing its use by up to 93% of money managers (Bacon, 2012). This performance measure is “optimal” if the return distribution is normal. The Sharpe ratio is closely related to the t-statistic for measuring the statistical significance of the mean differential return (Sharpe, 1994). Using the same notation as Sharpe (1994), letR Ft be the return of the investment in periodt,R Bt the return of the benchmark security (commonly the risk-free interest rate) in periodt, andD t the differential return in periodt: D t R Ft R Bt 103 Let D be the average value ofD t from periodt = 1 throughT : D 1 T T X t=1 D t and D be the standard deviation over the period: D s P T t=1 (D t D) 2 T 1 The Sharpe Ratio (S h ) is: S h D D Many performance evaluation measures are modifications of the Sharpe ratio. Given that asset returns are often non-normal, researchers have developed measures that incor- porate higher moments of the distribution (Keating & Shadwick, 2002). The Sortino ratio (Sortino & Van Der Meer, 1991) is similar to the Sharpe ratio, except it uses the semi-standard deviation (downside risk) in the denominator. Other measures consider only the very worst returns in the tail of the return distribution (Dowd, 1998; Alexander & Baptista, 2003). 6.1.3 Multi-Period-Based Measures Shape-based measures focus on multi-period drawdowns instead of return distribu- tions. The Maximum Drawdown is defined as the maximum peak-to-valley decline in the equity graph. Figure 6.1 shows how two orderings of returns can have very different maximum drawdowns while still having the same daily Sharpe ratio. The chart on the right has an unappealing drawdown of 22%, yet it has the exact same distribution of returns as the chart on the left (with a drawdown of only 6%). 104 0 0.2 0.4 0.6 0.8 1 Cumulative Returns Time (most recent 5 years) Sharpe ratio = 2.5 Maximum drawdown = −0.057 Time (most recent 5 years) Sharpe ratio = 2.5 Maximum drawdown = −0.22 Figure 6.1: The red chart on the right was generated by permuting the daily returns from the blue chart on the left. Both have the same distribution of daily returns, and hence the same daily Sharpe ratio. This figure illustrates how distribution-based performance measures cannot capture some features preferred by traders, such as a small maximum drawdown. Drawdown can also be defined as a string of consecutive negative returns. Many performance measures consider aspects of the distribution of such drawdowns instead of returns, including the mean, standard deviation, and selected number of worst draw- downs. 
The Martin ratio, or "Ulcer performance index," has the same numerator as the Sharpe ratio, but has the Ulcer index as the denominator. Using the notation in Bacon (2012), let D'_i be the drawdown since the previous peak in period i. The Ulcer index is then defined as:

\text{Ulcer index } UI = \sqrt{\sum_{i=1}^{n} \frac{D_i'^2}{n}}

Figure 6.2 shows an equity graph with each D'_i shown in black. The Ulcer index penalizes long drawdowns.

The Pain ratio also has the same numerator as the Sharpe ratio. The denominator is the Pain index, a modified form of the Ulcer index:

\text{Pain index } PI = \sum_{i=1}^{n} \frac{|D_i'|}{n}

The Pain index also penalizes long drawdowns but does not penalize deep drawdowns as severely as the Ulcer index.

[Figure 6.2: equity graph of cumulative returns over the most recent 5 years, with the drawdown below each prior peak shaded black.]
Figure 6.2: The Pain index is the area colored black. The Ulcer index is the root mean squared height of each vertical black line.

Max Days Since First at This Level is an intuitive measure that we define as the longest horizontal line that can be drawn between two points on the graph, as shown in Figure 6.3. We introduce it here because it is not found in the literature, and we find it ranks highly in our experiments.

[Figure 6.3: equity graph of cumulative returns over the most recent 5 years, annotated with a horizontal segment marking 466 business days since first reaching this height.]
Figure 6.3: "Max Days Since First at This Level" is the longest horizontal line that can be drawn between two points on the graph.
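The drawdown-based measures above can be computed directly from the per-period drawdown series D'_i. A minimal Python sketch follows; the function names are illustrative, and the Max Days function encodes one natural reading of the definition above (for each point, count periods since the equity first reached that level), which may differ in detail from the implementation used in the study.

import numpy as np

def drawdown_series(returns):
    """D'_i: drawdown since the previous peak of the cumulative-sum equity graph."""
    equity = np.cumsum(returns)
    return equity - np.maximum.accumulate(equity)     # zero at peaks, negative in drawdowns

def ulcer_index(returns):
    d = drawdown_series(returns)
    return np.sqrt(np.mean(d ** 2))                   # root mean squared drawdown

def pain_index(returns):
    d = drawdown_series(returns)
    return np.mean(np.abs(d))                         # mean absolute drawdown

def max_days_since_first_at_this_level(returns):
    """For each point, periods since the equity first reached its current level; max over points."""
    equity = np.cumsum(returns)
    longest = 0
    for j in range(len(equity)):
        first = np.argmax(equity >= equity[j])        # first index at or above this level
        longest = max(longest, j - first)
    return longest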
6.1.4 Selecting an Investment Performance Metric

To the best of our knowledge, no work has been published on comparing investment performance measure rankings against real human interest. Justification for proposed measures is axiomatic, based on the properties of the measures (Alexander & Baptista, 2003; Keating & Shadwick, 2002).

Farinelli et al. (2008) compare eleven performance ratios. Their work includes a limited empirical simulation, evaluating how well each ratio performed forecasting five stock indexes. They find that asymmetrical performance ratios work better and recommend that more than a single performance ratio be used. Cogneau & Hübner (2009) survey over 100 investment performance measures. They provide a taxonomy and classification of measures based on their objectives, properties, and degree of generalization. Bacon (2012) also provides a thorough survey of measures grouped into categories.

Some of the current research indicates that different performance metrics produce substantially the same rank orders. Hahn et al. (2002) used 10 performance measures to rank data from two proprietary trading books and found high values of Spearman's rank correlation. Eling & Schuhmacher (2007) find high rank correlation (0.96) between 13 performance measures that were used to rank the returns of 2,763 hedge funds. Eling (2008) confirmed the high rank correlation between measures when applied to 38,954 mutual funds from 7 asset classes. On the other hand, Zakamouline (2010) describes several less correlated measures and suggests the use of Kendall's tau instead of Spearman's rho for measuring rank correlation. None of these four studies considered the Pain, Ulcer, and Martin-related measures discussed in Section 6.1.3.

6.2 Approach

We now describe our approach to learn an investment performance measure with higher rank prediction accuracy than the current performance measures, using crowd-sourced domain user input. The steps of our approach are as follows:

1. Generate equity graphs simulating reasonable investment performance.
2. Collect preference data for the generated equity graphs from domain users in the form of pairwise rankings.
3. Use learning-to-rank algorithms with individual performance measures as features to create a new performance measure.

6.2.1 Generating Equity Graphs

Our approach uses equity graphs as a means for enabling domain experts to rapidly compare two strategies or investments. We generated (synthetic) equity graphs that follow a log-normal random walk. In this model, the asset price, S_t, follows the stochastic differential equation:

dS_t = \mu S_t \, dt + \sigma S_t \, dW_t

where \mu is the constant drift, \sigma is the constant volatility, and dW_t is a Wiener process.

We generated discrete differential simple returns representing five years with 252 business days per year. The returns are normally distributed with a mean of 0.125 and a standard deviation of 1. These values were chosen to lead to a broad distribution of Sharpe ratios centered around 2. Of these, only graphs with Sharpe ratios between 1.5 and 2.5 are retained. This range corresponds to the range of Sharpe ratios typically encountered. Ratios below 1.5 are unattractive as an investment, and ratios greater than 2.5 are very rare in practice. In total, we generated 2,000 charts.

For each graph, we normalize the set of returns to sum to 1. Normalizing the cumulative return enables domain experts to directly compare risk metrics (such as the maximum drawdown) on the same scale.
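A minimal Python sketch of this generation step is shown below. Annualizing the Sharpe ratio by √252 before filtering is our assumption, made so that the quoted mean and standard deviation line up with the 1.5 – 2.5 range; the details of the study's own filter are not restated here.

import numpy as np

rng = np.random.default_rng(42)
DAYS = 5 * 252                      # five years of business days

def make_chart():
    """One synthetic equity graph: daily returns, filtered by (annualized) Sharpe ratio."""
    while True:
        r = rng.normal(loc=0.125, scale=1.0, size=DAYS)
        sharpe = r.mean() / r.std(ddof=1) * np.sqrt(252)   # annualization assumed here
        if 1.5 <= sharpe <= 2.5:
            return r / r.sum()       # normalize returns so the cumulative return is 1

charts = [make_chart() for _ in range(10)]    # the study generated 2,000 such charts
equity = np.cumsum(charts[0])                 # cumulative-sum equity graph for plotting
print(equity[-1])                             # 1.0 by construction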
6.2.2 Collection of Ranking Data

One of our innovations is the collection of domain expert preferences in the form of pairwise rankings. We believe that it is easier for a participant to choose between two equity graphs than to decide on a numeric score for every individual graph. In particular, numeric scores require that these be normalized before aggregating scores to account for the different preference scales of participants. This normalization would be difficult for cases where a participant only labeled a small number of charts. In contrast, our pairwise ranking-based method is fast for human users, with median ranking time between 3 and 4 seconds.

We created a web page that described our research goal and presented two randomly chosen equity graphs side-by-side, as seen in Figure 6.4. A participant is asked which of these two investments is more attractive to invest in for the future. We requested participation from domain experts in two online forums. The first forum targets quantitative analysts and risk managers. The second forum targets individual traders, although some members run small hedge funds or are commodity trading advisers. 66 different anonymous people from these forums ranked a total of 1,004 chart pairs. We believe that the participation of many professionals is validation of community interest in improving investment performance measurement.

Figure 6.4: Web page used to collect pairwise investment performance ranking data from traders and quantitative analysts.

One author also ranked 1,659 equity graph pairs, including a re-ranking of every pair ranked by the community. In order to estimate self-consistency of rankings, the author later re-ranked each of the same 1,659 graph pairs. The estimate of self-consistency is 90%. In all rankings and re-rankings, the equity graph positions (i.e., left or right side) were chosen randomly.

6.2.3 Data Quality

Ensuring quality of crowd-sourced data is a recognized problem (Lease, 2011). As expected, we found that some of the crowd-sourced data was of low quality. In this section, we describe the steps performed to derive a higher quality data subset from the crowd-sourced annotations.

One author tagged each of the pairs of equity graphs used for crowd-sourced ranking as either "close call" (81%) or "clear choice" (19%). A "clear choice" tag indicates that the author's preference was strong and this view was likely to reflect universal preferences. The author was 100% self-consistent when re-ranking "clear choice" equity graph pairs.

To identify low quality contributions, we evaluated each contribution according to the following characteristics:

- Small median time between clicks
- A high fraction of times the participant clicked the same button (i.e., left or right), rather than alternating approximately uniformly between the two
- A systematic preference for the chart with the lower Sharpe ratio
- A relatively high fraction of rankings that contradict the author's "clear choice" rankings

Overall, we filtered out 129 rankings, leaving 875 of the original 1,004. As such a data quality filter is subjective, we also ran all experiments on the unfiltered dataset in addition to making the data publicly available.

6.2.4 Learning-to-Rank

A learning-to-rank algorithm predicts the order of two objects given training data consisting of partial orders of objects (and their features). We use the learning-to-rank algorithm proposed by Herbrich et al. (1999). In this method, the ranking task is transformed into a supervised binary classification task by considering the difference between corresponding features. This transformation also enables the use of other learning algorithms in addition to support vector machines as originally proposed by Herbrich et al. (1999). The three classification algorithms we use in this work are:

1. Logistic regression, with L_1-norm regularization (Friedman et al., 2010)
2. Random forests (Breiman, 2001)
3. SVM with linear and RBF kernels (Chang & Lin, 2011)

Given two objects, A and B, the learning-to-rank task is to predict if A > B based on their respective features. It is redundant to include both A > B and B > A (with negated feature differences) when training a model. In order to ensure balanced numbers of classes for the model to learn, we chose one of either A > B or B > A for each instance such that there were equal numbers of positive and negative instances in the training data. Balancing the training data also ensures that the intercept or bias term will be zero for logistic regression.
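A minimal Python sketch of this pairwise transformation is given below. The feature rows stand for per-chart performance-measure values, the names are illustrative, and random flipping is a simplification of the exact class balancing described above.

import numpy as np

def pairwise_to_classification(features_a, features_b, a_preferred, rng=None):
    """Turn pairwise preferences into a (roughly) balanced binary classification task.

    Each instance is the feature difference between the two charts in a pair;
    the label says whether the first chart of the (possibly flipped) pair was preferred.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    diffs, labels = [], []
    for fa, fb, pref_a in zip(features_a, features_b, a_preferred):
        if rng.random() < 0.5:                  # flip about half the pairs
            fa, fb, pref_a = fb, fa, not pref_a
        diffs.append(fa - fb)                   # difference of corresponding features
        labels.append(1 if pref_a else 0)
    return np.array(diffs), np.array(labels)

# Illustrative use: 3 performance-measure features per chart, 4 ranked pairs.
A = np.random.default_rng(1).normal(size=(4, 3))
B = np.random.default_rng(2).normal(size=(4, 3))
prefs = [True, False, True, True]               # did the rater prefer chart A?
X, y = pairwise_to_classification(A, B, prefs)
print(X.shape, y)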
Features

The features we use as inputs to the machine learning models include relevant risk and performance metrics found in Bacon's comprehensive survey (Bacon, 2012), which also provides descriptions of each measure using uniform notation. Note that for our normalized charts, risk metrics produce identical rank orderings as their respective performance measures. We nevertheless include both, because models such as logistic regression use linear combinations of features, and we do not know a priori which feature will combine best with other features.

6.3 Experimental Results

In our experiments, we consider the following three datasets:

1. The full set of all 1,004 community rankings (ACR)
2. The filtered set of 875 community rankings (FCR)
3. The set of 1,659 author rankings (AR)

Each experiment followed these steps for evaluation:

1. Randomly shuffle the data
2. Separate 25% of the data for testing
3. Choose optimal hyper-parameters using 5-fold cross-validation on the training data
4. Test the accuracy of the final model on the held-out test data

We performed each experiment 8 times and averaged the test accuracies. All models were trained and tested on the same random shuffle of the data to better compare their accuracies. In order to estimate the impact of the number of pairwise rankings needed for training on the accuracy of the learned performance measure, we tested progressively increasing amounts of training data. The data was not reshuffled as training instances were added, i.e., for n = 200, the first 100 data points are the same ones used for n = 100.

Figure 6.5 shows accuracies obtained for each of the three datasets, using each of the models, trained with an increasing number of pairwise ranking samples. Each point on the graphs represents the average of 8 runs. For reference, we show the most commonly used performance measure as a baseline, the monthly Sharpe ratio. In addition, we also show the performance of the ex post facto best measure for each dataset, although in practice which measure would perform the best on a given dataset would not be known.

From these experiments, we observed that none of the established performance measures in this domain is able to fully predict domain expert preferences. Our performance measure trained from domain expert preferences is able to achieve better prediction accuracy. For the filtered community ranking dataset, the random forests approach narrowly outperformed logistic regression, with 80% accuracy. The best baseline for this dataset is the monthly Pain index, with 77% accuracy (Table 6.1). For the dataset containing all community rankings, logistic regression has the best performance, with 74% accuracy. The best baseline for this dataset is the daily Pain index, with 74% accuracy.

[Figure 6.5: three panels of test accuracy (average of 8 runs) vs. number of training samples, titled "Filtered Community Rankings," "All Community Rankings," and "Author Rankings," comparing logistic regression, svm-linear, random forest, svm-rbf, the best baseline, and the common baseline; the author is 90% self-consistent, making that the upper bound in the author panel.]
Figure 6.5: Accuracy of learning-to-rank models trained and tested on crowd-sourced "real human interest" data in the form of pairwise rankings.

For the dataset containing author rankings, logistic regression again has the best performance, with 86% accuracy. Note that for this dataset, the same author performed each pairwise ranking twice. As these two sets of rankings have an agreement rate of 90%, this forms an upper bound for any model's predictive accuracy. The best baseline for this dataset is the daily Martin ratio, with 79% accuracy.

Learning-to-rank accuracies are lower for the community datasets than the author dataset. This is because community members have idiosyncratic preferences, contributing inconsistency to the community training and test data.

The learning curves in Figure 6.5 are relatively flat. This indicates that ranking more equity graph pairs would not lead to higher accuracies, given the models and features we have chosen. A small number of rankings (approximately 300) is adequate to learn a trader's preferences. Given median ranking times between 3 and 4 seconds, a trader would likely spend 15 to 20 minutes ranking 300 chart pairs.
Table 6.1 shows the accuracy of each baseline performance measure on each dataset, sorted in descending order for the filtered community rankings. There was no train/test split in the data for this table. All pairwise rankings were treated as test data and evaluated against each measure. The best measures are consistently those based on the Pain index and Ulcer index, described in Section 6.1.3.
Table 6.1: Baseline accuracies when tested against filtered community rankings (FCR), all community rankings (ACR), and author rankings (AR).
Performance Measure  FCR  ACR  AR
Monthly Pain Index 77% 73% 78%
Monthly Pain Ratio 77% 73% 78%
Daily Martin Ratio 77% 74% 79%
Daily Pain Index 77% 74% 79%
Daily Pain Ratio 77% 74% 79%
Daily Ulcer Index 77% 74% 79%
Max Days Since First at This Level 76% 72% 73%
Monthly Martin Ratio 76% 72% 78%
Monthly Ulcer Index 76% 72% 78%
Daily Reward to Max Days Since First at This Level 75% 72% 73%
MAR Ratio 74% 70% 76%
Max Drawdown 74% 70% 76%
Monthly Modified Burke Ratio 73% 70% 75%
Monthly Average Drawdown 73% 70% 74%
Monthly Sterling-3 Ratio 72% 69% 76%
Sterling-Calmar Ratio 72% 69% 76%
Monthly Gini Ratio 71% 68% 74%
Monthly Prospect Ratio 71% 68% 75%
Monthly Parametric VaR 71% 68% 74%
Monthly Sharpe Ratio 71% 68% 74%
Monthly Half-Variance 70% 68% 75%
Monthly Reward to Half-Variance 70% 68% 75%
Monthly Downside Risk Sharpe Ratio 70% 67% 74%
Monthly MAD Ratio 70% 68% 74%
Monthly Pure Downside Risk 70% 67% 74%
Monthly Pure Downside Variance 70% 67% 74%
Monthly Sortino Ratio 70% 67% 74%
Monthly Bernardo-Ledoit Ratio 70% 68% 74%
Monthly Pure Downside Potential 70% 68% 74%
Monthly Pure Upside Potential 70% 68% 74%
Monthly Conditional Sharpe Ratio 70% 67% 71%
Monthly Conditional VaR 70% 67% 71%
Monthly Upside Potential Ratio 70% 67% 73%
Weekly Sharpe Ratio 70% 67% 73%
Monthly Modified Sharpe Ratio 70% 67% 74%
Monthly Modified VaR 70% 67% 74%
Daily MAD Ratio 70% 67% 72%
Monthly Drawdown Deviation 70% 67% 73%
Daily Historical Simulation VaR 69% 67% 70%
Daily Reward to Historical Simulation VaR 69% 67% 70%
Daily Bernardo-Ledoit Ratio 69% 67% 72%
Daily Pure Downside Potential 69% 67% 72%
Daily Pure Upside Potential 69% 67% 72%
Max Absolute Deviation from Straight Line 69% 68% 70%
Daily Gini Ratio 69% 67% 72%
Monthly Largest Individual Drawdown 69% 66% 72%
Daily Modified Burke Ratio 69% 67% 71%
Daily Average Drawdown 69% 67% 71%
Daily Half-Variance 69% 67% 71%
Daily Reward to Half-Variance 69% 67% 71%
Daily Parametric VaR 69% 66% 71%
Daily Prospect Ratio 69% 66% 71%
Daily Sharpe Ratio 69% 66% 71%
Monthly Variability Skewness 69% 66% 71%
Monthly Volatility Skewness 69% 66% 71%
Daily Downside Risk Sharpe Ratio 68% 66% 71%
Daily Pure Downside Risk 68% 66% 71%
Daily Pure Downside Variance 68% 66% 71%
Daily Sortino Ratio 68% 66% 71%
Daily Drawdown Deviation 68% 66% 71%
Daily Modified Sharpe Ratio 68% 66% 71%
Daily Modified VaR 68% 66% 71%
Daily Tail Gain 68% 66% 71%
Monthly Adjusted Sharpe Ratio 68% 66% 71%
Monthly Historical Simulation VaR 68% 66% 70%
Monthly Reward to Historical Simulation VaR 68% 66% 70%
Daily Adjusted Sharpe Ratio 67% 65% 70%
Yearly Sharpe Ratio 67% 65% 66%
Daily Conditional Sharpe Ratio 67% 65% 70%
Daily Conditional VaR 67% 65% 70%
Daily Drawdown at Risk 67% 65% 69%
Daily Conditional Drawdown at Risk 66% 64% 71%
Daily Reward to Conditional Drawdown 66% 64% 71%
Daily Upside Potential Ratio 66% 65% 70%
Monthly Tail Gain 66% 64% 68%
Daily Sterling-3 Ratio 66% 63% 67%
Root Mean Squared Deviation from Straight Line 65% 63% 67%
Daily Variability Skewness 65% 63% 68%
Daily Volatility Skewness 65% 63% 68%
Daily Largest Individual Drawdown 64% 62% 67%
Mean Absolute Deviation from Straight Line 63% 62% 66%
Days in Max Drawdown 63% 62% 64%
Monthly Double VaR Ratio 63% 62% 63%
Monthly Rachev Ratio 63% 62% 64%
Daily Sharpe Ratio Last Three Years 62% 60% 70%
Daily Tail Ratio 62% 60% 64%
Daily Tail Risk 62% 60% 64%
Daily Sharpe Ratio Last Two Years 61% 60% 71%
Monthly d Ratio 61% 60% 60%
Daily d Ratio 59% 58% 58%
Daily Double VaR Ratio 59% 57% 58%
Fraction of Positive Months 59% 57% 60%
Fraction of Positive Weeks 58% 57% 62%
Monthly Tail Ratio 57% 56% 58%
Monthly Tail Risk 57% 56% 58%
Daily Sharpe Ratio Last Year 57% 56% 69%
Fraction of Positive Days 57% 56% 60%
Daily Rachev Ratio 56% 54% 58%
Convexity Last Two Years 53% 53% 56%
Convexity Last Three Years 53% 53% 60%
Monthly Skewness-Kurtosis Ratio 52% 52% 54%
Convexity Last Five Years 52% 52% 60%
Lag-1 Autocorrelation 51% 51% 54%
Daily Skewness-Kurtosis Ratio 51% 51% 51%
6.3.1 Simple Model
Complex models involving many features are less likely to be adopted. For this reason, we also created a simple model. The form is a linear combination of two features, with the relative weight between them found using un-regularized logistic regression. Our method was similar to that of the main experiment. We first randomly shuffled the pairwise ranking data. We then held out 25% as test data. We performed 10-fold cross-validation on the training data. For each fold, we trained models on 90% of the training data and tested on the remaining 10%. We tested every combination of two features in each fold. We chose the pair of features with the highest average validation accuracy. We then retrained a model with the two chosen features on all the training data and tested on the 25% held-out test data.
The accuracy for filtered community rankings is 80%, outperforming the best single feature by 3%. Let F_1 = Daily Pain Ratio and let F_2 = Max Days Since First at This Level; then the simple model for the filtered community rankings is a linear combination of the two baseline measures:
SM_FCR = F_1 - 0.0311 F_2     (6.1)
The accuracy for author rankings is 85%, outperforming the best single feature by 6%. Let F_1 = Daily Martin Ratio and let F_2 = Daily Sharpe Ratio Last Year; then the simple model for the author rankings is:
SM_AR = F_1 + 1.4211 F_2     (6.2)
Combining two performance measures is not arduous, and the results are comparable with the full model using all performance measures. The simple models are also easy to interpret. We see that the author places more emphasis on recent performance than the community does.
6.4 Idiosyncratic Preferences
Initially, we collected author rankings only for the purpose of checking if crowd-sourced rankings looked reasonable. However, we see in Figure 6.5 a noticeable difference in accuracy between the model learned for community rankings and the model learned for author rankings. We believe the community model is limited because the community is not as self-consistent as an individual.
To investigate further, we disaggregate the filtered community data back into individual datasets. We calculate the accuracy of each performance metric for each individual contributor.
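As an illustration of how the per-measure accuracies in Table 6.1, and the per-contributor accuracies discussed next, are computed, the following Python sketch scores a single performance measure against a set of pairwise rankings. The variable names and data layout are assumptions made for illustration, not our actual data structures.

import numpy as np

def measure_accuracy(score_first, score_second, preferred_first):
    """Fraction of pairs on which a performance measure agrees with the human ranking.

    score_first[i] and score_second[i] are the measure's values for the two
    charts in pair i; preferred_first[i] is True if the human preferred the
    first chart.  Ties in the measure count as disagreements.
    """
    predicts_first = score_first > score_second
    return float(np.mean(predicts_first == preferred_first))

def per_contributor_accuracy(pairs_by_user):
    """Accuracy of one measure for each contributor.

    pairs_by_user maps a user id to (score_first, score_second, preferred_first)
    arrays for that user's ranked pairs, scored with the measure under evaluation.
    """
    return {user: measure_accuracy(*arrays) for user, arrays in pairs_by_user.items()}

# Hypothetical example with three ranked pairs:
score_first = np.array([1.8, 0.4, 2.1])
score_second = np.array([1.1, 0.9, 2.0])
preferred_first = np.array([True, False, False])
print(measure_accuracy(score_first, score_second, preferred_first))  # 2 of 3 pairs agree -> 0.67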
Table 6.2 shows the average accuracy across the (13) individuals who contributed 25 or more rankings, for a subset of top performance measures. We see that the average accuracy for the top measures is 76-78%, about the same as the accuracy when run on the combined dataset. The interesting point this table brings out is the high standard deviation of accuracies, 6-10%. Each measure works very well for some individuals and relatively poorly for others. These results suggest that individuals would benefit from finding out which measure works best for them personally. Our recommendation is to do that by ranking charts using a web page we have set up for that purpose (http://thames.usc.edu/rank.php).
Finally, we evaluate how many pairwise rankings are needed to know which individual performance measures are best suited for a person. We use the author rankings for this purpose. As a second data point, we also study rankings from the individual (shown as User42) who contributed the most to the community dataset (100 rankings).
For this experiment, we:
1. Randomly choose k rankings from a dataset
2. Choose the single performance measure with the highest accuracy on the k rankings
3. Test the accuracy using the rest of the dataset, excluding the k rankings
4. Repeat this process 1,000 times per k to estimate confidence intervals
(A code sketch of this procedure follows Table 6.2.)
In order to have sufficient test data for User42, we limit k to 50. Figure 6.6 shows, for both the author and User42, the relationship between accuracy, k, and confidence. The results indicate that, for high levels of confidence, ranking 50 chart pairs can increase expected accuracy by 10-15%. Given median ranking times between 3 and 4 seconds, we estimate that ranking 50 charts should take less than 5 minutes.
Figure 6.6: These charts show the relationship between accuracy, number of rankings, and confidence for two users (an author, and the community member with the greatest contribution – User42). The results are based on randomly choosing k of the user's rankings, using them to choose the most accurate performance measure, then testing the accuracy of that measure on the rest of the user's rankings. Confidence was evaluated by doing this calculation 1,000 times per k. k is capped at 50 because User42 only ranked 100 chart pairs, and we held out at least 50 for testing.
Table 6.2: Mean and standard deviation of accuracies across the (13) participants who ranked 25 or more chart pairs, for selected top performance measures.
Performance Measure  Mean Accuracy  Accuracy St. Dev.
Simple Filtered Community Model 78% 7%
Simple Author Model 77% 8%
Daily Martin Ratio 77% 7%
Monthly Pain Ratio 77% 10%
Daily Pain Ratio 76% 7%
Monthly Martin Ratio 76% 11%
Max Days Since First at This Level 75% 11%
MAR Ratio 74% 10%
Monthly Modified Burke Ratio 74% 9%
Monthly Average Drawdown 74% 8%
Sterling-Calmar Ratio 73% 8%
Monthly Sterling-3 Ratio 73% 9%
Monthly Prospect Ratio 72% 8%
Monthly Gini Ratio 71% 8%
Monthly Drawdown Deviation 71% 9%
Monthly Sharpe Ratio 71% 8%
Monthly Reward to Half-Variance 71% 8%
Monthly Sortino Ratio 70% 7%
Max Abs. Deviation from Straight Line 70% 6%
Monthly Bernardo-Ledoit Ratio 70% 10%
Monthly Modified Sharpe Ratio 70% 8%
Monthly Conditional Sharpe Ratio 69% 8%
Monthly Adjusted Sharpe Ratio 67% 8%
Daily Sharpe Ratio 67% 8%
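The selection procedure referenced above can be sketched in Python as follows; the agreement matrix and the placeholder data are assumptions for illustration, and the percentile summary at the end is just one way to read off confidence bands.

import numpy as np

def select_and_test(agree, k, n_trials=1000, seed=0):
    """Held-out accuracy of the measure chosen from k sampled rankings.

    agree is an (n_pairs, n_measures) boolean array; agree[i, m] is True when
    measure m agrees with the user's ranking of pair i.  Each trial samples k
    pairs, picks the measure with the best accuracy on them, and evaluates
    that measure on the remaining pairs.
    """
    rng = np.random.default_rng(seed)
    n_pairs = agree.shape[0]
    held_out = np.empty(n_trials)
    for t in range(n_trials):
        idx = rng.choice(n_pairs, size=k, replace=False)
        mask = np.zeros(n_pairs, dtype=bool)
        mask[idx] = True
        best = agree[mask].mean(axis=0).argmax()    # best measure on the k sampled pairs
        held_out[t] = agree[~mask, best].mean()     # its accuracy on the rest
    return held_out

# Placeholder data: 100 ranked pairs, 20 candidate measures.
agree = np.random.default_rng(1).random((100, 20)) < 0.7
print(np.percentile(select_and_test(agree, k=50), [5, 50, 95]))  # confidence band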
Chapter 7
Conclusion
In this thesis we have presented novel methods of combining heuristics to form customized data mining objective functions. Our contributions include:
1. FrontierMiner, a new algorithm for classification rule learning
2. PRIMER, a new algorithm for regression rule learning
3. A method for learning custom investment performance metrics
Each contribution focuses on improving interpretability in data mining. In particular, we seek to facilitate the customization of objective functions by developing algorithms that ask only for information a domain expert can reasonably provide.
7.1 FrontierMiner
In Chapter 4 we propose FrontierMiner as a new high-precision classification rule finding algorithm. Many classification rule finding methods have already been proposed. Each typically has options to choose from and parameters to adjust that can potentially tune the algorithm for better performance on a particular dataset. However, per-dataset tuning is uncommon in the literature. We have not found a comprehensive comparison of rule finding methods that performs thorough and joint parameter tuning on each dataset. For this reason, a considerable portion of our experiment effort is spent developing a baseline against which FrontierMiner can be compared.
We show that FrontierMiner does outperform the baseline, tuned and untuned, on a large number of both synthetic and real-world classification tasks. FrontierMiner uses bootstrap sampling to learn, on a per-dataset basis, a non-parametric penalty function for evaluating rules. The penalty function corrects the over-optimism of in-sample precision for low-coverage rules. FrontierMiner is designed to be simple to use. There are no options to choose from, and we have shown performance is insensitive to the only user-defined parameter value, which is the number of bootstrap iterations.
FrontierMiner has higher computational complexity than the untuned baseline method, primarily because of the initial bootstrapping phase when it learns the penalty function. However, based on our overall experiment completion times, we estimate FrontierMiner is an order of magnitude faster than using grid search with cross-validation, at least with the settings and datasets we have chosen.
7.2 PRIMER
In Chapter 5 we propose PRIMER as a new regression rule learning system for intervention optimization. The results of our experiments show that, with respect to impact prediction ranking, PRIMER is competitive with state-of-the-art regression techniques in a large financial event study.
One advantage of using PRIMER is the interpretability of the model. PRIMER outputs a list of rules ordered by predicted impact. This allows the top rules to stand alone without exceptions or caveats. It also allows the algorithm to be terminated early, once a sufficient number of the highest-impact rules are found. This is useful in domains where resource constraints limit the potential number of interventions.
Rule learning systems require the user to specify multiple critical operational parameters. The optimal values are domain-dependent, yet domain experts have no intuitive way of selecting values, except through trial-and-error or by using default values. PRIMER has an advantage in this regard. The most critical user input to PRIMER is the transfer function, which is likely well-known to a domain expert. The confidence interval parameter does have some effect on the number of rules learned. However, experimental results show relative insensitivity to it.
7.3 Learning Investment Performance Metrics
In Chapter 6 we propose a method for learning a custom objective function from user preference data.
This method uses pairwise learning-to-rank algorithms with previously proposed performance measures as input features. We demonstrate and evaluate this approach for the case of learning a performance measure to rank investments. Our experimental results show that machine learning algorithms can find combinations of performance measures that improve accuracy in this domain.
We are the first to use crowd-sourcing to collect gold-standard pairwise investment ranking data from traders and quantitative analysts. From a data quality perspective, we report that a few participants provided random rankings, and at least one provided contrary rankings. We show ways of detecting these bad actors and filtering their contributions. Using this data, we find that performance ratios based on the Pain index and the Ulcer index most accurately reflect the preferences of the trader community. We also show that machine learning algorithms can find linear combinations of performance measures that improve accuracy. In the simplest case (a linear combination of two measures), accuracy is boosted from 77% to 80% with the filtered community rankings, and from 79% to 85% with the author rankings. We find the author to be 90% self-consistent, so the simple model has only 5% room for improvement. The learning-to-rank models we use require only 300 pairwise rankings for training, which should typically take a person 20 minutes to perform.
Learning-to-rank accuracies are lower for the community dataset than the author dataset. This is because community members have idiosyncratic preferences, contributing inconsistency to the community training and testing data. We show that traders can find the best individual performance measures for them, personally, by ranking around 50 chart pairs. We provide a web-based user interface where investors can perform ranking and receive recommendations on performance measures best suited to their preferences. We also provide all data (charts, measure calculations, and rankings) and code to encourage further study (http://thames.usc.edu/rank.zip).
7.4 Future Work
There are many possible directions for future work in the areas we have explored. One direction would be to incorporate FrontierMiner into a separate-and-conquer strategy for learning a complete classifier. Further study is needed to see how the penalty function evolves as the number of uncovered samples decreases. It may be the case that the penalty function does not need to be re-learned before each new rule is found. Research is needed to develop a strategy for determining the most beneficial times to re-run the bootstrapping phase of the algorithm. Perhaps patterns will emerge as to how penalty functions evolve, which may lead to sensible constraints that keep the penalty functions from fluctuating too significantly throughout the process, potentially reducing over-fitting.
We believe the bootstrapping techniques used in FrontierMiner would also be effective in regression rule learning. Instead of combining the precision and frequency heuristics, a new algorithm might combine frequency with some heuristic relating to the goodness of the regression fit.
Another possible direction for future work would be to simplify the way domain experts specify the expected pattern using the transfer function in PRIMER. It might be possible to replace the transfer function with something such as a simple assumption or constraint.
For example, instead of specifying an exact decay function, one might simply specify that the response is expected to decay monotonically over time. PRIMER may also work with no user-defined transfer function. Rather, the expected response pattern could be learned automatically from the pattern seen in the default rule. The pattern found when averaging responses to all events may have diluted impact, but it may indicate a useful transfer function to use for the search for refinements to the default rule. In this way, PRIMER could be made even simpler for domain experts to use.
One possible direction for future work with investment performance metrics would be improving the customization with fewer pairwise rankings using techniques from active learning. We believe collaborative filtering techniques could also be used to help personalize a metric in the context of the broader information about the preferences of other domain experts.
At a higher level, we plan to continue research into ways of improving data mining accuracy, while also improving interpretability in the way such algorithms are tuned by practitioners.
Bibliography
Agrawal R, Mannila H, Srikant R, Toivonen H, Verkamo AI et al. (1996) Fast discovery of association rules. Advances in Knowledge Discovery and Data Mining 12:307–328.
Alcalá J, Fernández A, Luengo J, Derrac J, García S, Sánchez L, Herrera F (2010) KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic and Soft Computing 17:255–287.
Alexander GJ, Baptista AM (2003) Portfolio performance evaluation using value at risk. The Journal of Portfolio Management 29:93–102.
Bacon CR (2012) Practical Risk-adjusted Performance Measurement. John Wiley & Sons.
Bayardo Jr RJ, Agrawal R (1999) Mining the most interesting rules. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 145–154. ACM.
Berry MJ, Linoff G (1997) Data Mining Techniques: For Marketing, Sales, and Customer Support. John Wiley & Sons, Inc.
Blanchard J, Guillet F, Gras R, Briand H (2005) Using information-theoretic measures to assess association rule interestingness. In Data Mining, Fifth IEEE International Conference on, p. 8. IEEE.
Box GE, Tiao GC (1975) Intervention analysis with applications to economic and environmental problems. Journal of the American Statistical Association 70:70–79.
Breiman L (2001) Random forests. Machine Learning 45:5–32.
Brookhouse J, Otero FE (2015) Discovering regression rules with ant colony optimization. In Proceedings of the Companion Publication of the 2015 Annual Conference on Genetic and Evolutionary Computation, GECCO Companion '15, pp. 1005–1012, New York, NY, USA. ACM.
Bruner JS, Goodnow JJ, Austin GA (1956) A Study of Thinking. New York: John Wiley & Sons, Inc.
Caruana R, Kangarloo H, Dionisio J, Sinha U, Johnson D (1999) Case-based explanation of non-case-based learning methods. In Proceedings of the AMIA Symposium, p. 212. American Medical Informatics Association.
Caruana R, Lou Y, Gehrke J, Koch P, Sturm M, Elhadad N (2015) Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1721–1730. ACM.
Carvalho DR, Freitas AA, Ebecken N (2005) Evaluating the correlation between objective rule interestingness measures and real human interest. In Knowledge Discovery in Databases: PKDD 2005, pp. 453–461. Springer.
Cestnik B (1990) Estimating probabilities: A crucial task in machine learning. In ECAI, Vol. 90, pp. 147–149.
Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2:27.
Clark P, Niblett T (1987) Induction in noisy domains. In EWSL, pp. 11–30. Citeseer.
Clark P, Niblett T (1989) The CN2 induction algorithm. Machine Learning 3:261–283.
Cogneau P, Hübner G (2009) The (more than) 100 ways to measure portfolio performance. Part 1: standardized risk-adjusted measures. Journal of Performance Measurement 13.
Cohen WW (1995) Fast effective rule induction. In Twelfth International Conference on Machine Learning, pp. 115–123. Morgan Kaufmann.
Coleman TF, Li Y (1996) A reflective Newton method for minimizing a quadratic function subject to bounds on some of the variables. SIAM Journal on Optimization 6:1040–1058.
Cooper GF, Aliferis CF, Ambrosino R, Aronis J, Buchanan BG, Caruana R, Fine MJ, Glymour C, Gordon G, Hanusa BH et al. (1997) An evaluation of machine-learning methods for predicting pneumonia mortality. Artificial Intelligence in Medicine 9:107–138.
Cortes C, Vapnik V (1995) Support-vector networks. Machine Learning 20:273–297.
Dembczyński K, Kotłowski W, Słowiński R (2008) Solving regression by learning an ensemble of decision rules. In International Conference on Artificial Intelligence and Soft Computing, 2008, Vol. 5097 of Lecture Notes in Artificial Intelligence, pp. 533–544. Springer-Verlag.
Dowd K (1998) Beyond Value at Risk: The New Science of Risk Management, Vol. 3. Wiley Chichester.
Dreyfus HL, Dreyfus SE (1992) What artificial experts can and cannot do. AI & Society 6:18–26.
Dzeroski S, Cestnik B, Petrovski I (1993) Using the m-estimate in rule induction. Journal of Computing and Information Technology 1:37–46.
Efron B, Gong G (1983) A leisurely look at the bootstrap, the jackknife, and cross-validation. The American Statistician 37:36–48.
Eling M (2008) Does the measure matter in the mutual fund industry? Financial Analysts Journal pp. 54–66.
Eling M, Schuhmacher F (2007) Does the choice of performance measure influence the evaluation of hedge funds? Journal of Banking & Finance 31:2632–2647.
Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: a library for large linear classification. The Journal of Machine Learning Research 9:1871–1874.
Fang W, Lu M, Xiao X, He B, Luo Q (2009) Frequent itemset mining on graphics processors. In Proceedings of the Fifth International Workshop on Data Management on New Hardware, pp. 34–42. ACM.
Farinelli S, Ferreira M, Rossello D, Thoeny M, Tibiletti L (2008) Beyond Sharpe ratio: Optimal asset allocation using different performance ratios. Journal of Banking & Finance 32:2057–2063.
Frank E, Witten IH (1998) Generating accurate rule sets without global optimization. In Shavlik J, editor, Fifteenth International Conference on Machine Learning, pp. 144–151. Morgan Kaufmann.
Friedman J, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33:1.
Fürnkranz J (1999) Separate-and-conquer rule learning. Artificial Intelligence Review 13:3–54.
Fürnkranz J, Flach PA (2005) ROC 'n' rule learning–towards a better understanding of covering algorithms.
Machine Learning 58:39–77.
Fürnkranz J, Gamberger D, Lavrač N (2012) Foundations of Rule Learning. Springer Science & Business Media.
Geng L, Hamilton HJ (2006) Interestingness measures for data mining: A survey. ACM Computing Surveys (CSUR) 38:9.
Gibbons JD, Chakraborti S (2011) Nonparametric Statistical Inference. Springer.
Hahn C, Wagner FP, Pfingsten A (2002) An empirical investigation of the rank correlation between different risk measures. In EFA 2002 Berlin Meetings Presented Paper, pp. 02–01.
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter 11:10–18.
Han J, Pei J, Yin Y (2000) Mining frequent patterns without candidate generation. In ACM SIGMOD Record, Vol. 29, pp. 1–12. ACM.
Herbrich R, Graepel T, Obermayer K (1999) Large margin rank boundaries for ordinal regression. Advances in Neural Information Processing Systems pp. 115–132.
Hilderman RJ, Hamilton HJ (2001) Evaluation of interestingness measures for ranking discovered knowledge. In Advances in Knowledge Discovery and Data Mining, pp. 247–259. Springer.
Ho CH, Lin CJ (2012) Large-scale linear support vector regression. The Journal of Machine Learning Research 13:3323–3348.
Holmes G, Hall M, Frank E (1999) Generating rule sets from model trees. In Proceedings of the 12th Australian Joint Conference on Artificial Intelligence (AI-99), pp. 1–12. Springer.
Huysmans J, Dejaeger K, Mues C, Vanthienen J, Baesens B (2011) An empirical evaluation of the comprehensibility of decision table, tree and rule based predictive models. Decision Support Systems 51:141–154.
Janssen F, Fürnkranz J (2009) A re-evaluation of the over-searching phenomenon in inductive rule learning. In SDM, pp. 329–340. SIAM.
Janssen F, Fürnkranz J (2010a) On the quest for optimal rule learning heuristics. Machine Learning 78:343–379.
Janssen F, Fürnkranz J (2010b) Separate-and-conquer regression. Proceedings of LWA 2010: Lernen, Wissen, Adaptivität, Kassel, Germany pp. 81–89.
Janssen F, Fürnkranz J (2011) Heuristic rule-based regression via dynamic reduction to classification. In IJCAI Proceedings-International Joint Conference on Artificial Intelligence, Vol. 22, p. 1330.
Jensen DD, Cohen PR (2000) Multiple comparisons in induction algorithms. Machine Learning 38:309–338.
Keating C, Shadwick WF (2002) A universal performance measure. Journal of Performance Measurement 6:59–84.
Kruskal JB (1964) Nonmetric multidimensional scaling: a numerical method. Psychometrika 29:115–129.
Langdon W, Yoo S, Harman M (2011) Non-recursive beam search on GPU for formal concept analysis. RN 11:1.
Lease M (2011) On quality control and machine learning in crowdsourcing. In Human Computation.
Leinweber DJ (2007) Stupid data miner tricks: overfitting the S&P 500. The Journal of Investing 16:15–22.
Lenca P, Meyer P, Vaillant B, Lallich S (2008) On selecting interestingness measures for association rules: User oriented description and multiple criteria decision aid. European Journal of Operational Research 184:610–626.
Li EX, Ramesh K (2009) Market reaction surrounding the filing of periodic SEC reports. The Accounting Review 84:1171–1208.
Lipton ZC, Kale DC, Elkan C, Wetzell R, Vikram S, McAuley J, Wetzell RC, Ji Z, Narayaswamy B, Wang CI et al. (2016) The mythos of model interpretability. IEEE Spectrum.
Manning CD, Raghavan P, Schütze H et al. (2008) Introduction to Information Retrieval, Vol. 1. Cambridge University Press.
McGarry K (2005) A survey of interestingness measures for knowledge discovery. Knowledge Eng. Review 20:39–61.
Michalski RS (1969) On the quasi-minimal solution of the general covering problem. In Proceedings of the Fifth International Symposium on Information Processing, Vol. A3, pp. 125–128.
Minnaert B, Martens D, De Backer M, Baesens B (2015) To tune or not to tune: rule evaluation for metaheuristic-based sequential covering algorithms. Data Mining and Knowledge Discovery 29:237–272.
Mistry J, Shah J (2013) Dealing with the limitations of the Sharpe ratio for portfolio evaluation. Journal of Commerce and Accounting Research 2:10–18.
Možina M, Demšar J, Žabkar J, Bratko I (2006) Why is rule learning optimistic and how to correct it. Springer.
Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems 14:841–848.
Ohsaki M, Kitaguchi S, Okamoto K, Yokoi H, Yamaguchi T (2004) Evaluation of rule interestingness measures with a clinical dataset on hepatitis. In Knowledge Discovery in Databases: PKDD 2004, pp. 362–373. Springer.
Parpinelli RS, Lopes HS, Freitas AA (2002) Data mining with an ant colony optimization algorithm. IEEE Transactions on Evolutionary Computation 6:321–332.
Piatetsky-Shapiro G (1991) Discovery, analysis, and presentation of strong rules. Knowledge Discovery in Databases pp. 229–238.
Quinlan J, Cameron-Jones R (1995) Oversearching and layered search in empirical learning. Breast Cancer 286:2–7.
Quinlan JR (1990) Learning logical definitions from relations. Machine Learning 5:239–266.
Seber G, Wild C (1989) Nonlinear Regression. John Wiley & Sons, New York.
Sharpe WF (1966) Mutual fund performance. Journal of Business pp. 119–138.
Sharpe WF (1994) The Sharpe ratio. Journal of Portfolio Management 21:49–58.
Sikora M, Skowron A, Wróbel Ł (2012) Rule quality measure-based induction of unordered sets of regression rules. In Artificial Intelligence: Methodology, Systems, and Applications, pp. 162–171. Springer.
Simonyan K, Vedaldi A, Zisserman A (2013) Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.
Sortino FA, Van Der Meer R (1991) Downside risk. The Journal of Portfolio Management 17:27–31.
Srinivasan A, Ramakrishnan G (2011) Parameter screening and optimisation for ILP using designed experiments. The Journal of Machine Learning Research 12:627–662.
Swartout WR (1983) Xplain: A system for creating and explaining expert consulting programs. Artificial Intelligence 21:285–325.
Tan PN, Kumar V, Srivastava J (2002) Selecting the right interestingness measure for association patterns. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 32–41. ACM.
Vaillant B, Lenca P, Lallich S (2004) A clustering of interestingness measures. In Discovery Science, pp. 290–297. Springer.
You H, Zhang Xj (2009) Financial reporting complexity and investor underreaction to 10-K information. Review of Accounting Studies 14:559–586.
Zakamouline V (2010) The choice of performance measure does influence the evaluation of hedge funds. Available at SSRN 1403246.
Zhang A, Shi W, Webb GI (2016) Mining significant association rules from uncertain data. Data Mining and Knowledge Discovery pp. 1–36.
Appendix A
GPU-Accelerated Beam Search
In this section, we describe a method to accelerate beam search using a commodity general purpose graphics processing unit (GPU). Such acceleration makes it more feasible to use cross-validation for tuning hyper-parameters in order to customize data mining objective functions for individual domains.
There has been some work related to GPU-accelerated rule learning. Langdon et al. (2011) implemented beam search on a GPU for the purpose of Formal Concept Analysis, although they did not achieve any speedup relative to an optimized CPU algorithm. Fang et al. (2009) implemented the Apriori algorithm on a GPU and showed up to 7.8x speedup versus their CPU implementation. Conclusions are difficult to draw from comparisons with these two papers, given the different algorithms, datasets, and hardware.
Our large-scale experiments in Chapter 4 are made feasible through the considerable processing power and high memory bandwidth provided by GPUs. To take advantage of this power, an algorithm must be structured to fit the single instruction, multiple data (SIMD) paradigm. Each thread in a GPU performs the exact same operations, and essentially only the thread ID is unique. The thread ID is used to index into different parts of an array in global memory, so that each thread can be made to see different data.
For simplicity, we parallelize the algorithm by having each thread evaluate one rule at a time. This choice works well for medium and large problems, but does not fully utilize the GPU threads for small problems. As a rough rule of thumb, if the number of candidate rules for each rule length (number of features times beam size) is greater than 1,000, the GPU outperforms a pure CPU implementation.
Algorithm 5 shows a high-level overview of our approach to beam search on a GPU. The CPU loads data, generates candidate rules, and manages the beam. The GPU performs the computationally intensive task of comparing each candidate rule against each sample in the training data.
Algorithm 5 GPU Beam Search
1:  procedure BEAMSEARCH
2:      Load data into GPU global memory
3:      beam[0] ← {}                          ▷ init with empty rule
4:      for i = 1 to l do                     ▷ l = maximum rule length
5:          Prepare rules by refining beam[i − 1]
6:          EVALUATECANDIDATES(rules)
7:          beam[i] ← BEST(rules, b)          ▷ b = beam size
8:      end for
9:  end procedure
10: function EVALUATECANDIDATES(rules)
11:     Copy rules from host to GPU global memory
12:     for rule ∈ rules do                   ▷ one thread per rule
13:         p ← 0                             ▷ init covered positives
14:         n ← 0                             ▷ init covered negatives
15:         for each sample s ∈ data do
16:             if rule covers s then
17:                 if s is positive class then
18:                     p ← p + 1
19:                 else
20:                     n ← n + 1
21:                 end if
22:             end if
23:         end for
24:     end for
25:     Copy all p and n back to host
26:     Calculate interestingness measure for each rule
27: end function
The GPU code consists of copying all candidate rules of the same length to the GPU, evaluating the rules, and copying back the counts of n (covered negative samples) and p (covered positive samples) to the host. We benchmark the execution time of the GPU code compared to functionally equivalent multi-threaded CPU code. The GPU code runs 20 times faster on our largest datasets. These tests were run on a desktop computer with an Intel Core i7-3820 CPU and an NVIDIA GeForce GTX 980 Ti GPU. Two GPU code optimizations further increase the speedup to 28x. The first is caching the rules under evaluation into shared memory, which has higher bandwidth than global memory.
The second optimization is having a single thread per block load each sample and label into shared memory visible to the other threads. This reduces the number of threads reading from global memory.
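For reference, the per-thread work in EVALUATECANDIDATES can be written down on the CPU side as the following Python/NumPy sketch. The binary-feature rule encoding and the array names are assumptions made for illustration; they are not the data layout of our GPU implementation.

import numpy as np

def evaluate_candidates(rules, data, labels):
    """Count covered positives (p) and negatives (n) for each candidate rule.

    data:   (n_samples, n_features) binary feature matrix
    labels: (n_samples,) boolean array, True for the positive class
    rules:  (n_rules, n_features) binary matrix; a rule covers a sample when
            every feature required by the rule is present in the sample.
    Each iteration of the outer loop corresponds to the work of one GPU thread.
    """
    n_rules = rules.shape[0]
    p = np.zeros(n_rules, dtype=np.int64)
    n = np.zeros(n_rules, dtype=np.int64)
    for r in range(n_rules):                                # one GPU thread per rule
        covers = (data[:, rules[r] == 1] == 1).all(axis=1)  # coverage of each sample
        p[r] = np.count_nonzero(covers & labels)
        n[r] = np.count_nonzero(covers & ~labels)
    return p, n

# Toy example: 3 candidate rules over 4 binary features.
data = np.array([[1, 0, 1, 1], [1, 1, 1, 0], [0, 1, 0, 1]])
labels = np.array([True, False, True])
rules = np.array([[1, 0, 0, 0], [0, 0, 1, 0], [1, 0, 1, 0]])
print(evaluate_candidates(rules, data, labels))  # per-rule (p, n) counts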
Abstract
Data mining is the process of sifting through large amounts of information to find interesting patterns or relationships. It is a broad field because of the many possible ideas of what "interesting" means. Each task should ideally be matched with a custom solution, tailored to the objective and to the characteristics of the data. The problem is that current algorithm options and parameters are not specified in terms that align with domain expert knowledge. The result is that domain expertise yields little benefit, and that options and parameters must be set through trial-and-error or simply left at default recommended values.
Customizing data mining objective functions is difficult because it involves specifying the trade-offs between multiple heuristics. Some heuristics align with the domain objective, some favor simplicity and parsimony, and others are included for regularization, or to reduce the chances of finding spurious patterns. Furthermore, the optimal objective function is dependent on the search algorithm. For example, the effect of over-searching, or finding false discoveries by chance due to an extensive search, can be mitigated by choosing an objective function with higher preference for frequently occurring patterns. In general, even among use-cases with the same objective, the optimal set of techniques and parameters to use varies due to differences in the data.
Our approach to objective function customization is two-pronged: First, we explore new and existing ways for domain experts to contribute the knowledge they have. Second, we make extensive use of bootstrapping and cross-validation to customize objective functions in the absence of domain expertise.
Our first contribution is FrontierMiner, a new rule-based algorithm for predicting a target class with high precision. It has no tuning parameters and learns an objective function directly from the data. We show evidence from a large-scale study that FrontierMiner finds higher-precision rules more often than competing systems. Our second contribution is PRIMER, a new algorithm for maximizing event impact on time series. It has an objective function that adapts to the level of noise in the data. It also incorporates user-provided input on the expected response pattern as a heuristic that helps prevent over-fitting. Our third contribution is a method of learning an objective function from user feedback in the form of pairwise rankings. With this feedback, we use learning-to-rank algorithms to combine existing heuristics into an overall objective function that more closely matches the user's preference.