Learning to Diagnose from Electronic Health Records Data

by David C. Kale

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (Computer Science)

December 2018

Copyright 2018 David C. Kale

Dedication

For Teresa and Wally and for my father.

Acknowledgements

Applied research is an inherently collaborative process, and the research presented in this thesis is no exception. Throughout my Ph.D. and before, I have been blessed by an embarrassing wealth of amazing collaborators and colleagues. These folks hail from many different walks of life and a variety of backgrounds. Some were mentors, some were supporters, and some were in the trenches with me, but they all contributed to this thesis and to making my Ph.D. one of the richest parts of my life.

I will begin where this all began: I want to thank Dr. Randall Wetzel and his entire VPICU team for opening the door for me to clinical machine learning. They are my original collaborators: there would be no "learning to diagnose" without the three years I spent at CHLA. I particularly want to thank the VPICU for giving me the freedom and discretion to not only explore but also reach out and make connections. Had it not been for the VPICU, I would never have met Prof. Ben Marlin, with whom I published my first ever machine learning for healthcare paper. Ben was my first machine learning mentor, and I recall many of the lessons he taught me, particularly about patience, with research and with toddlers.

Next I want to thank my advisor Prof. Greg Ver Steeg for supporting me as I wandered far and wide. Greg epitomizes that most important trait in research, curiosity, and fosters it in his students. Further, even though our individual research programs are quite different, he finds value in what I do. I will never forget the way that nearly every conversation in his office eventually led to the whiteboard. I still hold out hope that he will one day do a performative math poster presentation at a conference. I owe an equally large thank you to Prof. Aram Galstyan. Should I have the chance to manage a team of my own someday, I will remember the lessons of leadership that Aram has taught me. Aram is always there to provide a push or a helping hand, whichever his students need. My only consolation for leaving the MINDS lab is that I don't have to say a real goodbye, at least for a while!

I want to thank my other two committee members, Gaurav Sukhatme and Raghu Raghavendra. Their support and feedback have been invaluable to me throughout my thesis process. I owe gratitude to two other mentors who have played critical roles in my development as a researcher: Prof. Yan Liu and Prof. Nigam Shah. Yan went to bat for me during admissions and helped me start my Ph.D. at a fast pace, both of which I'll never forget. Nigam has opened a variety of doors for me and provided fantastic advice over the years. I also owe a huge thanks to the Alfred E. Mann Institute for funding me during the second half of my Ph.D. and to Lizsl Deleon, the single most important person in the USC Computer Science Department.

I want to thank my primary collaborators in my research, the foremost among them Prof. Zack Lipton. In retrospect, our collaboration seems brief, but it was fruitful and so well-timed that it almost seemed fated. The wild ride of ICLR and MLHC was one of the most fun periods of my Ph.D.
We really need to put together a reunion and try to recapture the magic. Another person I would like to work with again is Taha Bahadori, who was perhaps the largest influence on me during my early Ph.D. days. Taha provided unceasing support and never failed to help when asked. I loved discussing his crazy ideas with him. I also want to thank my virtual labmate Ken Jung for the many valuable brainstorming sessions; I will miss our weekly Skype call. Finally, I owe a massive debt of gratitude to Hrayr Harutyunyan and Hrant Khachatrian. Without their hard work, the benchmark project would never have come to fruition. I look forward to our next joint endeavor!

I want to acknowledge another group of people who have been incredibly influential upon me and my career: Finale Doshi-Velez, Jenna Wiens, Byron Wallace, Rajesh Ranganath, and Jim Fackler. Organizing the Machine Learning for Healthcare Conference with them has been one of the highlights of my career. The conclusion of this thesis was inspired by a quote from Jim. And now that I am finished with my Ph.D., I hope that Finale and I will finally get to do that computational fantasy literature project that we've been discussing for years!

Last but certainly not least, I want to thank my family. My father Robert has always been my inspiration, my mother Jo Ann my rock and foundation. Dad and Mom shaped my thinking about not only life but also medicine, as both are medical professionals. I would not be where I am today without their hard work and support. Meg, as I've always said, is my "older" younger sister: I have lived on this earth longer than she has, but she has far surpassed me in wisdom. I have always relied and will continue to rely on her for good advice. I want to offer a special thank you to my California family: Carol, John, Keith, and Elizabeth. Their nonstop support throughout my Ph.D., particularly over the last year, made a huge difference. Also, shout out to Uncle Andy!

Finally, thank you to my wife Teresa, my best friend in the world, and to my son, Wally. From day one Teresa fully supported my decision to forfeit an income in order to return to graduate school. No doubt she is relieved that I am defending, but she would never say so. And this entire process would never have been possible if Wally hadn't started sleeping through the night six months ago.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Key Challenges
    1.1.1 Aligning machine learning objectives with clinical goals
    1.1.2 Choosing a useful representation for digital health data
    1.1.3 Learning without large labeled data sets
    1.1.4 Enabling reproducibility and accelerating innovation
  1.2 Contributions
    1.2.1 Biomedical Informatics
    1.2.2 Machine Learning
  1.3 Publications
    1.3.1 Included in Thesis
    1.3.2 Other
  1.4 Organization

Chapter 2: Background
  2.1 Problem Formulation and Notation
  2.2 Electronic Health Records Data
    2.2.1 Inpatient observational time series
      2.2.1.1 VPICU Dataset
      2.2.1.2 MIMIC-III Database
      2.2.1.3 Physionet Challenge 2012
    2.2.2 Clinician-generated text
      2.2.2.1 i2b2 Obesity Challenge 2008
  2.3 Related Work
    2.3.1 Diagnostic expert systems
    2.3.2 Computational phenotyping
    2.3.3 Machine learning for diagnosis and phenotyping

Part I: Searching to Diagnose from Similar Patients

Chapter 3: Accelerated Similarity Search for Clinical Time Series
  3.1 Introduction
  3.2 Background
    3.2.1 Notation
    3.2.2 Problem Definition
    3.2.3 Related Work
  3.3 Kernelized Hashing of Time Series
    3.3.1 Kernelized locality sensitive hashing
  3.4 Time series similarity
    3.4.0.1 Multivariate dynamic time warping
    3.4.0.2 Global alignment kernels
    3.4.0.3 Vector autoregressive kernels
    3.4.0.4 Euclidean distance
    3.4.1 Kernelized hashing framework
  3.5 Experiments
    3.5.1 Data
    3.5.2 Design and goals
    3.5.3 Results and Discussion
  3.6 Conclusion and Future Work

Part II: Learning to Diagnose with Deep Neural Networks

Chapter 4: Learning and Analyzing Representations of Clinical Data
  4.1 Introduction
  4.2 Related Work
  4.3 Methods
    4.3.1 Background: feature extraction from time series
    4.3.2 Unsupervised autoencoders for phenotyping clinical time series
    4.3.3 Supervised neural networks for phenotyping clinical time series
    4.3.4 Discovery of causal phenotypes from clinical time series
  4.4 Experiments
    4.4.1 Classification performance
    4.4.2 Causal analysis
  4.5 Discussion
  4.6 Acknowledgements

Chapter 5: Multilabel Classification of Long Clinical Time Series
  5.1 Introduction
  5.2 Related Work
    5.2.1 LSTM RNNs
    5.2.2 Neural Networks for Medical Data
    5.2.3 Neural Networks for Multilabel Classification
    5.2.4 Machine Learning for Clinical Time Series
    5.2.5 Target Replication
    5.2.6 Regularizing Recurrent Neural Networks
    5.2.7 Key Differences
  5.3 Data Description
  5.4 Methods
    5.4.1 LSTM Architectures for Multilabel Classification
    5.4.2 Sequential Target Replication
    5.4.3 Auxiliary Output Training
    5.4.4 Regularization
  5.5 Experiments
    5.5.1 Multilabel Evaluation Methodology
    5.5.2 Baseline Classifiers
    5.5.3 Results
  5.6 Discussion
  5.7 Conclusion
  5.8 Hourly Diagnostic Predictions
  5.9 Learning Curves
  5.10 Per Diagnosis Results

Chapter 6: Modeling Missing Values in Clinical Time Series
  6.1 Introduction
  6.2 Data
    6.2.1 Inputs
    6.2.2 Diagnostic labels
  6.3 Recurrent Neural Networks for Multilabel Classification
  6.4 Missing Data
    6.4.1 Imputation
    6.4.2 Learning with Missing Data Indicators
    6.4.3 Hand-engineered missing data features
  6.5 Experiments
    6.5.1 Results
  6.6 Related Work
  6.7 Discussion
    6.7.1 The Perils and Inevitability of Modeling Treatment Patterns
    6.7.2 Complex Models or Complex Features?
    6.7.3 Future Work
  6.8 Per Diagnosis Classification Performance
  6.9 Missing Value Statistics

Chapter 7: Heterogeneous Multitask Learning for Clinical Time Series
  7.1 Introduction
  7.2 Related Work
  7.3 Multitask Clinical Prediction Data Set
  7.4 Methods
    7.4.1 Per-task output layers
    7.4.2 Channel-wise LSTM
    7.4.3 Deep supervision
    7.4.4 Multitask learning LSTM
  7.5 Experiments
    7.5.1 Logistic regression baseline
    7.5.2 Experimental setup
    7.5.3 Hyperparameters
  7.6 Conclusion

Part III: Learning to Diagnose Without Cleanly Labeled Data

Chapter 8: Accelerating Active Learning with Transfer Learning
  8.1 Introduction
  8.2 Related Work
  8.3 Methods
    8.3.1 Transfer learning framework
    8.3.2 Active learning framework
    8.3.3 Transfer active learning
    8.3.4 Implementation
  8.4 Theory
    8.4.1 Deviation bound
    8.4.2 Generalization bound
    8.4.3 Label complexity
  8.5 Experiments
    8.5.1 Data
    8.5.2 Results
  8.6 Discussion

Chapter 9: Weakly Supervised Learning to Diagnose
  9.1 Introduction
  9.2 Methods
  9.3 Related Work
  9.4 Results and Discussion
    9.4.1 20 Newsgroups
    9.4.2 i2b2 Obesity Challenge 2008
    9.4.3 Anchored CorEx for Discriminative Tasks
  9.5 Conclusion

Part IV: Conclusion

Chapter 10: A Learning to Diagnose Benchmark
  10.1 Introduction
  10.2 The Case for a Clinical Benchmark
  10.3 Related Work
  10.4 Benchmark Data Set and Tasks
    10.4.1 In-hospital mortality
    10.4.2 Physiologic Decompensation
    10.4.3 Forecasting length of stay
    10.4.4 Acute care phenotype classification
  10.5 Evaluation
    10.5.1 Adding error bars to performance metrics
  10.6 Discussion
    10.6.1 In-hospital mortality prediction
    10.6.2 Decompensation
    10.6.3 Length-of-stay prediction
    10.6.4 Phenotype classification
    10.6.5 Published work using our benchmark
  10.7 Conclusion

Chapter 11: Concluding Remarks
  11.1 Open Challenges
    11.1.1 Representing heterogeneous clinical data
    11.1.2 Learning from small data with less compute
    11.1.3 Incorporating domain knowledge into data-driven models

Bibliography

List of Tables

2.1 Categories of digital health data and whether they are recorded in modern EHRs.
3.1 Relative speed-up (time_g / time_h) for each time series kernel and data set, when performing an exhaustive search. The ground truth VAR kernel was intractable for surgical and eeg.
3.2 Semantic (label) accuracy and gap for 10nn retrieval across data sets and kernels. For PCPC codes, we measure the average squared difference from the query point's score (i.e., lower is better).
4.1 Classification performance on the Physionet Challenge 2012 data set. We report mean and standard deviation (across 10 folds) for each metric. We use the following abbreviations: R: raw time series, H: hand-designed features, NNet(I,L): L-layer neural network with input I.
4.2 Magnitude of the causal relationships identified by using the representations learned by the deep learning algorithm.
5.1 Results on performance metrics calculated across all labels. mAUC and mF1 refer to the micro-averaged metrics, MAUC and MF1 the macro-averaged metrics. DO, TR, and AO indicate dropout, target replication, and auxiliary outputs, respectively. AO (Diagnoses) uses the extra diagnosis codes and AO (Categories) uses diagnostic categories as additional targets during training.
5.2 LSTM-DO-TR performance on the 6 diagnoses with highest F1 scores.
5.3 F1 and AUC scores for individual diagnoses.
6.1 Results on performance metrics calculated across all labels. mAUC and mF1 refer to the micro-averaged metrics, MAUC and MF1 the macro-averaged metrics. Performance on aggregate metrics for logistic regression (Log Reg), MLP, and LSTM classifiers with and without imputation and missing data indicators.
6.2 AUC and F1 scores for individual diagnostic codes.
6.3 Sampling rates and missingness statistics for all 13 features.
9.1 Evolution of Obesity and Obstructive Sleep Apnea (OSA) topics as anchors are added. Colors and font weight indicate anchors, spurious terms, and intruder terms from other known topics. Multiword and negated terms are the result of the preprocessing pipeline.
9.2 F1 scores on soc.religion.christianity.
9.3 Classification performance on Obesity 2008.
10.1 Prevalence of ICU phenotypes in the benchmark data set.
10.2 Results for length of stay prediction task (regression).
10.3 Per-phenotype classification performance for best overall multitask LSTM.

List of Figures

3.1 Results for picu data set. Top row: semantic accuracy gap for diagnosis label and PCPC code. Bottom left: 10-nearest neighbor recall. Bottom right: average query times.
3.2 Left: 10-nearest neighbor recall for surgical. Right: semantic accuracy gap for eeg.
4.1 Deep neural networks for phenotyping from clinical time series.
4.2 Causal features learned from ICU time series.
5.1 A simple RNN model for multilabel classification. Green rectangles represent inputs. The recurrent hidden layers separating input and output are represented with a single blue rectangle. The red rectangle represents targets.
5.2 An RNN classification model with target replication. The primary target (depicted in red) at the final step is used at prediction time, but during training, the model back-propagates errors from the intermediate targets (purple) at every sequence step.
5.3 Our data set contains many labels. For our task, a subset of 128 are of interest (depicted in red). Our Auxiliary Output neural network makes use of extra labels as additional training targets (depicted in purple). At inference time we generate predictions for only the labels of interest.
5.4 Training curves showing the impact of the DO, AO, and TR strategies on overfitting.
5.5 Each chart depicts the probabilities assigned by each of four models at each (hourly re-sampled) time step. LSTM-Simple uses only targets at the final time step. LSTM-TR uses target replication. LSTM-AO uses auxiliary outputs (diagnoses), and LSTM-TR,AO uses both techniques. LSTMs with target replication learn to make accurate diagnoses earlier.
5.6 Training and validation performance plotted for the simple multilabel network (LSTM-Simple), LSTM with target replication (LSTM-TR), and LSTM with auxiliary outputs (LSTM-AO). Target replication appears to increase the speed of learning and confers a small regularizing effect. Auxiliary outputs slow down the speed of learning but impart a strong regularizing effect.
6.1 Missingness artifacts created by discretization.
6.2 (Top left) no imputation or indicators, (bottom left) imputation absent indicators, (top right) indicators but no imputation, (bottom right) indicators and imputation. Time flows from left to right.
6.3 Depiction of RNN zero-filled inputs and missing data indicators.
7.1 Correlations between task labels.
7.2 LSTM-based network architecture for multitask learning.
7.3 Results for in-hospital mortality prediction task.
7.4 Results for decompensation prediction task.
7.5 Results for length of stay prediction task.
7.6 Results for phenotyping task.
8.1 20 Newsgroups results. The lefthand plots show test set error versus number of points seen by the active learner. The upper right shows test set error versus number of queries. The bottom right shows the query rate.
8.2 sentiment results. The lefthand plots show test set error versus number of points seen by the active learner. The upper right shows test set error versus number of queries. The bottom right shows the query rate.
8.3 Summary of experiments, including number of labeled source examples, approximate d_H∆H distances, and parameter values.
9.1 A hierarchical topic model learned by CorEx. Anchored latent factors are labeled in red with anchor words marked with a "*".
10.1 Progress on ImageNet classification task from 2010 until the competition was retired in 2016.
10.2 Benchmark generation process.
10.3 Distribution of LOS. Plot (a) shows the distribution of LOS for full ICU stays and remaining LOS per hour. The rightmost 5% of both distributions are not shown to keep the plot informative. Plot (b) shows a histogram of bucketed patient and hourly remaining LOS (less than one day, one each for 1-7 days, between 7 and 14 days, and over 14 days).
10.4 Calibration of in-hospital mortality and decompensation prediction by the best linear, non-multitask, and multitask LSTM-based models.
10.5 In-hospital mortality prediction performance vs. length-of-stay. The confidence intervals and standard deviations are estimated with bootstrapping on the data of each bucket.
10.6 Prediction of C + DS baseline over time. Each row shows the last days of a single ICU stay. Darker colors mean high probability predicted by the model. Red and blue colors indicate the ground-truth label is negative and positive, respectively. Ideally, the right image should be all white, and the left image should be all white except the right-most 24 hours, which should be all dark blue.
Abstract

With the widespread adoption of electronic health records (EHRs), US hospitals now digitally record millions of patient encounters each year. At the same time, we have seen high-profile successes by machine learning, including superhuman performance in complex games. These factors have driven speculation that similar breakthroughs in healthcare are just around the corner, but there are major obstacles to replicating these successes. In this thesis, we study these challenges in the context of learning to diagnose, which involves building software to recognize diseases based on the analysis of historical data rather than expert knowledge. Our central hypothesis is that we can build such systems while minimizing the burden of effort on clinical experts. We demonstrate one of the first successful applications of recurrent neural networks to the classification of multivariate clinical time series. We then extend this framework to model non-random missing values and heterogeneous prediction tasks. We next examine the scenario in which labeled data are scarce, proposing practical solutions based on active learning and weak supervision. Finally, we describe a public benchmark for clinical prediction and multitask learning that promotes reproducibility and lowers the barrier to entry for new researchers. We conclude by considering the broader impact of information technology on healthcare and how machine learning can help fulfill the vision of a learning healthcare system.

Chapter 1: Introduction

One of the most anticipated frontiers in machine learning is health care, where adoption of electronic health records (EHRs) and growing consumer use of wearable sensors has created an explosion in digital health data. Millions of hospital encounters are recorded in EHRs in the United States (US) annually: as of 2015, more than 83% of US hospitals use a basic EHR system [125], and over 30 million patients visit hospitals every year [5]. In the same period, there have been several high-profile successes by machine learning: computer vision systems achieved better-than-human performance in face recognition [256], while the AlphaGo deep reinforcement learning system soundly defeated the Go world champion, Lee Sedol [274]. Together these factors have increased speculation that similar breakthroughs in medicine must be just around the corner.

There remain, however, major obstacles to replicating these successes in the clinical domain. One such obstacle is the nature of the data itself. Machine learning's most publicized successes have come in domains where the input space is relatively homogeneous (pixel intensities), and the data are for the most part clean and fully observed. Data may be structured, e.g., sequences of words, but the structure is regular and well-behaved. EHR data, in contrast, have mixed data types, are inherently noisy with missing values, and exhibit complex structure such as sequences with irregular sampling and uncertain timestamps [187]. This makes choosing the right representation for clinical data, which is critical to the successful application of machine learning, more challenging than more traditional domains like computer vision and speech recognition.

An equally formidable challenge is the lack of large-scale annotated data sets. State-of-the-art performance in object recognition and Go required training on millions of labeled images and tens of millions of moves from real and simulated games, respectively.
In health care, large databases are increasingly accessible, but ground truth labels for many predictive tasks are unavailable, unreliable, or incomplete. Many outcomes of interest, such as adverse reactions to medications, occur outside the hospital and are not observed, while many diseases may be misdiagnosed or simply not recorded in the patient's record [212].

In this thesis, we study these challenges in the context of diagnosis, which we define as "the art or act of identifying a disease from its signs and symptoms" [192]. Diagnosis is often a critical first step in clinical care, providing guidance for subsequent clinical decisions regarding tests and treatments. Misdiagnosis can harm patients by delaying necessary treatment or prompting unnecessary or inappropriate interventions [276]. Unfortunately, recent estimates suggest that one in twenty adults will be misdiagnosed during their lifetime and at least half of these errors could lead to a poor outcome [210, 277]. In inpatient and acute care settings where timely intervention is critical, the rate of misdiagnosis is often even higher with deadlier consequences [78, 320, 152]. What is more, there is no discernible relationship between clinician confidence and diagnostic accuracy [193], and research suggests that human cognitive biases are key contributing factors in diagnostic errors [6, 113]. Thus, diagnosis presents an important opportunity for computational decision support [194].

Computer scientists have been interested in medical diagnosis and clinical decision making for decades [174, 107, 219, 121, 281]. Traditional expert system-based approaches have relied heavily on clinical expertise, encoded as rules [288] or as conditional independencies in a probabilistic model [121, 122]. More recent work on the related problem of computational phenotyping defines diseases in terms of inclusion and exclusion criteria derived from clinical expertise and guidelines. These definitions are often implemented as queries against an EHR database [66, 246]. Such approaches are expensive to develop and validate because they require significant time and effort from clinicians and researchers. This approach does not easily scale to large numbers of patient records and diseases.

As a result, biomedical researchers have begun exploring the creation of data-driven definitions of disease based on the application of machine learning techniques to large EHR databases [2, 246], a problem we refer to as learning to diagnose. In this setting, diagnosis is most commonly formulated as a binary classification problem, in which data-driven models are trained to predict a diagnosis label [246]. Such models have a number of advantages over expert systems: they can more efficiently utilize the large amounts of data available in EHRs, and can readily adapt to new data as it becomes available. They are inherently probabilistic and can provide uncertainty estimates along with predictions. They also afford the opportunity for serendipitous discovery of new patterns and disease boundaries [160]. Machine learning-based diagnostic models have also found a variety of applications in healthcare beyond diagnosis: cohort construction for genomic, pharmaceutical, and clinical research [66]; comorbidity detection and risk adjustment [85]; and quality improvement [313].

Despite the promise of this approach and some early successes [40, 54, 129], progress in this area has been stymied by many of the same obstacles that limit machine learning's impact on health care in general.
Paradoxically, overcoming these obstacles frequently requires input from clinical experts, e.g., manual annotation of training data for statistical classifiers. Thus, the true development cost of data-driven diagnosis models is no less expensive than that of expert-based systems. In the proposed research, we will show that we can learn to accurately recognize diseases from digital health data using data-driven machine learning. We demonstrate how to build such models while minimizing the time and labor required from clinical experts, making our approach more practical and scalable in real world settings. Finally, where prior clinical knowledge is available, we can incorporate it in a lightweight, low-labor fashion.

1.1 Key Challenges

We formulate "learning to diagnose" as a supervised learning task and aim to build statistical classifiers to recognize dozens to hundreds of diseases. In doing so, we face a number of challenges.

1.1.1 Aligning machine learning objectives with clinical goals

The foremost challenge in practical machine learning involves aligning machine learning objectives with real world goals. It is deceptively easy to apply innovative machine learning techniques to a healthcare problem and achieve a high accuracy, only to discover afterward that the result is fatally flawed in some subtle way or practically useless. A simple example involves modeling risk of readmission, where a large dataset from a single institution likely includes "frequent flyer" patients [155]. These are patients, usually with persistent chronic diseases, who are admitted multiple times in a year, often for the same reason each time. When splitting such a dataset into training and test sets, we must be careful to split based on patient rather than admission. Otherwise, we may end up with the same patient in both splits, leaking test set information and enabling a statistical model to achieve artificially high test set accuracy by memorizing information about frequent flyer patients.
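To make the point concrete, the following is a minimal sketch of a patient-level split. It assumes a hypothetical admissions table with a patient_id column; the schema and library choices (pandas/NumPy) are illustrative only and do not correspond to any particular dataset used in this thesis.

```python
import numpy as np
import pandas as pd

def split_by_patient(admissions: pd.DataFrame, test_frac: float = 0.2, seed: int = 0):
    """Split admissions into train/test so that no patient appears in both sets.

    Splitting on admissions directly would let repeat ("frequent flyer")
    patients leak across the split; here we shuffle and split unique patients.
    """
    rng = np.random.default_rng(seed)
    patients = admissions["patient_id"].unique()
    rng.shuffle(patients)
    n_test = int(round(test_frac * len(patients)))
    test_ids = set(patients[:n_test])
    in_test = admissions["patient_id"].isin(test_ids)
    return admissions[~in_test], admissions[in_test]
```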
A more subtle issue concerns model performance evaluation: clinical machine learning researchers often overlook model calibration in their pursuit of higher discrimination (accuracy), but research suggests that good calibration is critical for clinical utility and preventing harm [202]. Calibration is especially important at the modest levels of accuracy (AUC 0.7-0.8) frequently reported in clinical machine learning research papers [39]. Nonetheless, few researchers report model calibration and instead tout (often modest) improvements in classification accuracy. As Shah et al. [202] point out,

    For example, more than 1,000 cardiovascular clinical prediction models have been developed and cataloged, yet only a small number of these are routinely used to support decision making in clinical care. It seems unlikely that incremental improvements in discriminative performance of the kind typically demonstrated in machine learning research will ultimately drive a major shift in clinical care.

This argument seems overly pessimistic but provides an important admonition regardless: clinical impact is likely more complicated than achieving a modest boost in AUC. In this thesis, we take concrete steps to align our work to realistic clinical objectives and to consider constraints that might arise in real world clinical applications. In Chapter 3, we propose similar patient retrieval as an interpretable form of diagnostic decision support. In Chapters 5, 6, 7, and 10, we use model evaluations that are inspired by clinical applications, and in Chapter 7, we report calibration for certain models.

1.1.2 Choosing a useful representation for digital health data

Figure 1.1: (a) Criteria for new biomarkers established by the Biomarkers Definitions Working Group [30]: measurable attributes of a patient or disease; independent of other biomarkers; separate patients into meaningful groups; improve outcome prediction and risk assessment; clinically plausible and interpretable. (b) Properties of good representations for machine learning [23]: measurable properties of objects; independent, disentangling factors of variation; form natural clusters; useful for discriminative and predictive tasks; interpretable, providing insight into the problem.

The successful application of machine learning to real world prediction problems depends critically on how we represent the data [23]. The right representation emphasizes important patterns and facilitates prediction, while the wrong representation may obliterate any useful signal, rendering prediction difficult or impossible. Suppose, for example, that we want to determine whether a hospitalized infant experienced a transient bradycardia, an abnormally low heart rate lasting less than a minute [191]. We would be much more likely to detect it by examining the infant's minimum heart rate, rather than the mean, which might wash out a handful of irregular measurements.
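As a small illustration of this kind of hand-designed feature, the sketch below summarizes a heart-rate series by its mean and minimum. The function name and the numbers are hypothetical, chosen only to show why the minimum preserves a brief bradycardia that the mean washes out.

```python
import numpy as np

def heart_rate_features(hr_bpm: np.ndarray) -> dict:
    """Hand-designed summary features for one episode's heart-rate series (bpm)."""
    hr_bpm = np.asarray(hr_bpm, dtype=float)
    return {"hr_mean": float(hr_bpm.mean()), "hr_min": float(hr_bpm.min())}

# An infant with a ~140 bpm baseline and a one-minute dip to 70 bpm:
# the mean barely moves, but the minimum makes the bradycardia obvious.
hr = np.full(60, 140.0)
hr[30] = 70.0
print(heart_rate_features(hr))  # {'hr_mean': ~138.8, 'hr_min': 70.0}
```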
The complex and diverse nature of digital health data can make finding the right representation challenging. In the example of bradycardia, prior knowledge enabled us to design an effective representation by hand, but for many other problems, we may lack the required insight, forcing us to conduct an ad hoc search for a good representation for each problem. Such an approach does not scale to the large number of diseases we wish to learn to diagnose. Such representations can be susceptible to noisy data: if, for example, an electrocardiogram (ECG) lead becomes displaced, we might record heart rates of zero, triggering a false alert and obscuring an actual bradycardia. Moreover, designing a representation based on domain knowledge encodes pre-existing biases about the problem and limits the opportunity for discovery of unexpected patterns and relationships [160].

A natural question to ask is what makes for a good representation of digital health data in general? One place to look for guidance is research on clinical biomarkers, usually defined as measurable attributes of a patient or disease process. Figure 1.1a enumerates the criteria for new biomarkers determined by the Biomarkers Definitions Working Group in 2001 [30]. In Figure 1.1b, we list the properties of a good representation in the context of machine learning [23]. There is a striking one-to-one correspondence, which suggests that we consider the recent trend in machine learning away from hand-designed representations, e.g., SIFT features for images [267], and toward representations learned from data using tools like neural networks [158].

In this thesis, we investigate the effectiveness of data-driven, rather than hand-designed, representations of clinical data. In Chapter 3, we work directly with raw clinical time series and focus our effort on defining appropriate methods for comparison. We also propose a framework for accelerating searches for similar patients over large time series databases. In Chapters 4, 5, 6, 7, and 9, we use flexible modeling techniques, such as neural networks, to discover rich latent representations of clinical time series and text.

1.1.3 Learning without large labeled data sets

Many practical breakthroughs in machine learning have relied on the availability of datasets with thousands to millions of labeled examples [158]. In contrast, clinical datasets are usually orders of magnitude smaller and frequently lack labels for many outcomes of interest. In particular, ground truth diagnoses are rarely recorded during the course of clinical care and are unavailable in large numbers [246]. Structured data, such as ICD-9 (Ninth Revision of the International Classification of Diseases [214]) diagnostic codes, are assigned by administrative staff with the goal of justifying clinical decisions for billing, rather than documenting actual health conditions. While we could treat these as noisy labels, most research on learning from noisy labels assumes that the noise is modest and sufficiently random [200, 245]. In contrast, the bias in diagnostic codes is often systemic and quite large [212, 43], with positive predictive values (or precision) as low as 34% [135, 319]. Moreover, recent results suggest that powerful prediction models such as neural networks can model data with randomly assigned labels, further emphasizing the importance of having cleanly labeled data [335].

In clinical research, diagnosis labels assigned via retrospective chart review are considered the "gold-standard." However, chart review by clinical experts can be prohibitively expensive, especially since reviewers may require additional guidelines and training to be reliable [190]. Even in combination with techniques like active learning, chart review often does not scale to large datasets [54]. This is due to the cold start problem [144]: when starting with zero labeled examples, active learning performs no better than random sampling until we acquire enough labeled examples to train a reasonably accurate classifier that can make useful queries [144]. Confounding this problem, many diseases are so rare that finding positive cases becomes a needle-in-a-haystack problem. Acute myocardial infarction (AMI), for example, has an estimated population prevalence of less than 1%: this means we would have to review over 10,000 records to find a mere 100 cases [327].
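As a back-of-the-envelope check of that figure, and assuming records are sampled and reviewed uniformly at random, the expected review burden scales inversely with prevalence:

\[
\mathbb{E}[\text{records reviewed}] \approx \frac{\text{positive cases needed}}{\text{prevalence}} = \frac{100}{0.01} = 10{,}000.
\]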
This thesis proposes several solutions to these problems. In Chapter 8, we accelerate active learning by initializing our classifier using transfer learning, alleviating the cold start problem. In Chapter 9, we use domain knowledge to provide a weak form of supervision in lieu of ground truth labels.

1.1.4 Enabling reproducibility and accelerating innovation

Publicly available benchmarks serve a vital role in accelerating machine learning research and innovation. The MNIST digit dataset [164] and Penn Treebank [186] have been used in thousands of research projects over more than two decades. The introduction of ImageNet [80] led directly to the dominance of neural networks in image recognition research, and even models used in medical imaging are often pretrained on ImageNet [296, 267]. Benchmark datasets increase participation by lowering the barrier to entry and focus the community on specific research questions. Equally important, benchmarks facilitate reproducibility by reducing the effort required to replicate published results.

Sadly, few such benchmarks exist in clinical machine learning research. Even in cases where data are publicly accessible, they do not constitute a benchmark or lead to reproducible research. One example is the MIMIC-III database [142], one of the most widely researched digital health databases. Even work using MIMIC-III to address the same question is often difficult or impossible to reproduce or compare [38, 101, 183, 166, 262]. A recent metareview of research using MIMIC-III to study risk of in-hospital mortality was unable to reproduce the patient cohorts used in nearly two dozen published papers [141]. This is because the released MIMIC-III database was not prepared for any particular prediction task, and so for each project, researchers must choose an appropriate subset of patients, engineer features, derive labels, etc. It is understandably difficult to describe these choices in sufficient detail in manuscripts with page limits. Moreover, working with MIMIC-III is challenging for researchers without prior experience with EHR data or access to medical experts.

In this thesis, we propose a first-of-its-kind public clinical prediction benchmark suite, derived from MIMIC-III. In Chapter 10 we describe the design of the benchmark tasks and datasets and review published research that has utilized it already.

1.2 Contributions

This thesis contributes both practical results within the area of biomedical informatics and novel methodology within machine learning and data mining. Our simplest contribution is perhaps one of the most important: we are among the first to show that deep learning (nouveau neural networks) and related techniques can be successfully applied to EHR data.

1.2.1 Biomedical Informatics

We use sophisticated machine learning techniques to learn to accurately recognize dozens of diagnoses from modern EHR data, including clinical time series and free text discharge reports.

- We provide empirical evidence that modern deep learning techniques, including recurrent neural networks (RNNs), can be trained successfully on modest-sized clinical time series datasets (Chapters 4, 5, 6, and 7).
- We study the impact and import of clinical data that is not missing at random. In particular, we learn to accurately recognize a number of acute conditions based solely on how often particular variables are measured (Chapter 6).
- We successfully train a single multitask neural network architecture to solve multiple clinical prediction tasks simultaneously, including diagnosis, modeling risk of mortality, and forecasting length of stay (Chapter 7).
- We learn to recognize a dozen diagnoses in clinical discharge notes without labeled training data by incorporating informal clinical knowledge (Chapter 9).
- We propose and publish a first-of-its-kind public benchmark suite for clinical prediction tasks including risk of mortality, decompensation, length of stay, and diagnosis classification (Chapter 10).

1.2.2 Machine Learning

- We study the application of kernelized locality sensitive hashing to nearest neighbor search for multivariate time series and show that it can substantially accelerate nearest neighbor searches (Chapter 3).
- We propose a simple deep supervision technique that improves the accuracy of RNNs trained to classify long sequences where the most discriminative subsequences vary in length and location (Chapters 5, 6, and 7).
- We demonstrate the ability of RNNs to exploit the signal in data that are not missing at random (Chapters 6 and 7).
- We propose an original multitask RNN architecture that learns four different tasks that are heterogeneous in both output type and temporal structure (Chapter 7).
- We propose a novel algorithm that uses importance weighting to combine active learning and transfer learning and provide a theoretical analysis that bounds the empirical risk of the resulting classifier (Chapter 8).
- We describe a novel extension of the Correlation Explanation latent factor model, based on the information bottleneck, that enables us to perform weakly supervised topic modeling (Chapter 9).
- We propose and publish a first-of-its-kind public benchmark suite for heterogeneous multitask learning and clinical prediction that lowers the barrier to entry for machine learning researchers interested in healthcare (Chapter 10).

1.3 Publications

1.3.1 Included in Thesis

H. Harutyunyan, H. Khachatrian, D. Kale, G. Ver Steeg, and A. Galstyan. "Multitask Learning and Benchmarking with Clinical Time Series Data." In preparation.

H. Harutyunyan, H. Khachatrian, D. Kale, G. Ver Steeg, and A. Galstyan. "A Public Benchmark for Clinical Prediction and Multitask Learning." NIPS 2017 Workshop on Machine Learning for Health.

Z. Lipton, D. Kale, and R. Wetzel. "Directly Modeling Missing Data in Sequences with RNNs: Improved Classification of Clinical Time Series." Proceedings of the Machine Learning for Healthcare Conference (MLHC), 2016.

K. Reing, D. Kale, G. Ver Steeg, and A. Galstyan. "Toward Interpretable Topic Discovery via Anchored Correlation Explanation." ICML Workshop on Human Interpretability in Machine Learning (ICML WHI), 2016.

Z. Lipton, D. Kale, C. Elkan, and R. Wetzel. "Learning to Diagnose with LSTM Recurrent Neural Networks." International Conference on Learning Representations, 2016.

Z. Lipton, D. Kale, and R. Wetzel. "Phenotyping of Clinical Time Series with LSTM Recurrent Neural Networks." NIPS 2015 Workshop on Machine Learning for Healthcare, 2015.

D. Kale, Z. Che, M. T. Bahadori, W. Li, and Y. Liu. "Causal Phenotype Discovery via Deep Networks." Proceedings of the 2015 American Medical Informatics Association Annual Symposium (AMIA), 2015.

D. Kale, D. Gong, Z. Che, Y. Liu, G. Medioni, R. Wetzel, and P. Ross. "An Examination of Multivariate Time Series Hashing with Applications to Health Care." Proceedings of the IEEE 14th International Conference on Data Mining (ICDM), 2014.

D. Kale, Z. Che, Y. Liu, and R. Wetzel. "Computational Discovery of Physiomes in Critically Ill Children Using Deep Learning." 1st AMIA Workshop on Data Mining for Medical Informatics (AMIA DMMI), 2014.

D. Kale and Y. Liu. "Accelerating Active Learning with Transfer Learning." Proceedings of the IEEE 13th International Conference on Data Mining (ICDM), 2013.

1.3.2 Other

S. Dubois, N. Romano, D. Kale, N. Shah, and K. Jung. "Efficient Representations of Clinical Text." In preparation.

T. Quisel, L. Foschini, A. Signorini, and D. Kale. "Collecting and Analyzing Millions of mHealth Data Streams." Proceedings of the 23rd ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2017.

S. Dubois, N. Romano, D. Kale, N. Shah, and K. Jung. "The Effectiveness of Transfer Learning in Electronic Health Records." International Conference on Learning Representations Workshop Track, 2017.

T. Quisel, D. Kale, and L. Foschini. "Intra-day Activity Better Predicts Chronic Conditions." NIPS 2016 Workshop on Machine Learning for Healthcare, 2016.

G. Iglesias, D. Kale, and Y. Liu.
"An Examination of Deep Learning for Extreme Climate Pattern Analysis." 5th International Workshop on Climate Informatics Proceedings, 2015.

Z. Che, D. Kale, W. Li, M. T. Bahadori, and Y. Liu. "Deep Computational Phenotyping." Proceedings of the 21st ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2015.

D. Kale, M. Ghazvininejad, A. Ramakrishna, J. He, and Y. Liu. "Hierarchical Active Transfer Learning." Proceedings of the 2015 SIAM International Conference on Data Mining (SDM), 2015.

M. T. Bahadori, D. Kale, Y. Fan, and Y. Liu. "Functional Subspace Clustering with Application to Time Series." Proceedings of the 2015 International Conference on Machine Learning (ICML), 2015.

D. Kale, S. Di, Y. Liu, and Y. Gil. "Capturing Data Analytics Expertise with Visualization in Workflows." AAAI Fall Symposium Series Discovery Informatics Workshop (DIS), 2013.

1.4 Organization

The dissertation is organized as follows. Chapter 2 defines the learning to diagnose problem and relevant notation, describes the opportunities and challenges presented by EHR data, and discusses related work.

Part I describes a framework in which we search for patients with similar patterns of physiology in order to help diagnose a query patient. Chapter 3 describes how to use kernelized locality sensitive hashing to accelerate similarity search for multivariate time series.

Part II investigates the use of neural networks to learn to diagnose hospitalized patients from clinical time series. Chapter 4 describes the use of unsupervised autoencoders to discover meaningful patterns of physiology and the application of techniques from causal inference to analyze the learned representations. Chapter 5 demonstrates that RNNs can successfully learn to diagnose patients from physiologic time series and proposes the use of target replication to improve classification accuracy when working with long sequences. Chapter 6 shows empirically that neural networks can extract meaningful signal from missing values that are not missing at random and discusses challenges that this presents for clinical modeling. Chapter 7 proposes an original RNN architecture that uses channel-wise layers and heterogeneous multitask output layers to simultaneously model four heterogeneous clinical prediction tasks.

Part III discusses a number of solutions to the problem of learning to diagnose without cleanly labeled training data. Chapter 8 proposes a novel algorithm that uses importance weighting to combine active learning and transfer learning, thus alleviating the cold start problem. Chapter 9 describes a weakly supervised extension of the Correlation Explanation latent factor model and its application to discovering diagnoses in clinical discharge notes.

Part IV concludes the dissertation by discussing grand challenges in the application of machine learning to healthcare. Chapter 10 describes a first-of-its-kind public benchmark suite for clinical prediction and multitask learning, covering four different tasks: risk of mortality, decompensation, length of stay, and diagnosis classification.

Chapter 2: Background

Our objective in this thesis is to build computer programs that can learn to recognize known medical conditions by analyzing past patient encounters. We refer to this task as learning to diagnose [178]. During learning, the input to this program is a digital database of historical medical records including observed signs and symptoms as well as known outcomes such as diagnoses.
When the trained system is applied to a new patient, its goal is to decide which diseases, if any, best explain the patient's observed condition. In this way, a learning to diagnose system might be considered a computer-aided diagnosis or clinical decision support system [197].

Learning to diagnose is also closely related to computational phenotyping, a biomedical informatics task whose goal is the development of computable definitions of disease [209], with subtle differences. Learning to diagnose refers specifically to systems that automatically learn to recognize diseases by analyzing historical data with minimal human intervention. In contrast, computational phenotyping encompasses a broader range of approaches, many of which rely heavily on human expertise and use data only indirectly or not at all [66, 134]. Moreover, the primary application of learning to diagnose is to identify diseases as they develop, in order to support clinical decision making. Computational phenotypes are more commonly applied to historical records to identify patients who have or had a disease for, e.g., case identification in research [66] or quality improvement [313]. Nonetheless, there is significant overlap: computational phenotypes can be used for diagnosis and vice versa. In some cases, we actually train our diagnostic models to detect diseases after observing a full patient encounter, a problem structurally equivalent to phenotyping [178]. Hence, we use learning to diagnose and computational phenotyping interchangeably throughout this thesis.

2.1 Problem Formulation and Notation

We formulate the problem of predicting a diagnosis as binary classification: we wish to design or learn a function that accepts a patient record as an input and outputs a decision about whether the patient has a disease. Alternatively, the function may output the probability that the patient has a disease. We will represent each input patient record as a mathematical object $x$, often a vector or matrix, that we will usually refer to as a data point or sample. Depending on context, we will refer to the individual entries in $x$ variously as inputs, observations, measurements, or variables. In order to distinguish raw data from derived representations, we will usually reserve the term features for the latter. In particular, we will refer to hand-engineered representations and hidden unit activations in neural networks as features. However, we may at times use features to refer to inputs in general when the meaning is clear.

In most of our work, $x$ refers to a multivariate time series, a sequence of $T$ vector-valued measurements $x^{(1)}, x^{(2)}, \ldots, x^{(T)}$ taken at a sequence of time steps $1, 2, \ldots, T$. We index time steps with $t$ and indicate the measurements at time $t$ with $x^{(t)} \in \mathbb{R}^p$. When $x$ is a time series of length $T$, we often add a superscript, $x^{(1:T)}$, to make this fact clear. We should note that the above notation implies a regular sampling rate, i.e., that measurements are taken at regular intervals, e.g., every hour. However, we can represent irregularly sampled data by encoding time as an additional input, and so we refrain from introducing explicit notation for irregularly sampled time series (see [93] for an example of such notation). We represent a patient's diagnoses using a binary vector $y \in \{0, 1\}^K$ where $y^{(k)} = 1$ if and only if the patient received the $k$th diagnosis, e.g., acute respiratory disease, from an ordered list of $K$ potential non-mutually exclusive diagnoses.
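For readers who prefer code to notation, a toy sketch of these objects might look like the following; the dimensions and variable names are made up purely for illustration.

```python
import numpy as np

# Illustrative dimensions only: T time steps, p variables per step, K candidate diagnoses.
T, p, K = 48, 13, 25

x = np.random.rand(T, p)    # x^{(1:T)}: row t is the measurement vector x^{(t)} in R^p
y = np.zeros(K, dtype=int)  # y in {0, 1}^K: non-mutually exclusive diagnosis labels
y[[3, 17]] = 1              # e.g., this patient carries the 4th and 18th diagnoses
```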
We seek to learn a vector-valued function (or hypothesis) h (or alternatively, K distinct functions h^(k) for k = 1, ..., K) to decide which of the K diagnoses to assign to a given patient. We use ŷ = h(x) to refer to the predicted diagnoses and define ŷ^(k) = h^(k)(x) as the diagnosis predicted by the kth output of h (or by the kth separate function). Our goal in learning to diagnose is to choose an optimal h* ∈ H from a designated hypothesis class H that will minimize the probability of misdiagnosing a patient, which we write formally as

h* = argmin_{h ∈ H} P{h(x) ≠ y} = argmin_{h ∈ H} E_{⟨x,y⟩ ∼ D} [ I(h(x) ≠ y) ]

where ⟨x, y⟩ is a record and corresponding diagnosis vector sampled from a true but unknown underlying probability distribution over patients D. This is known as risk minimization, where the probability of a mistake is referred to as the risk. This probability is equal to the expected value of an indicator function that returns 1 when we make a mistake and 0 otherwise; this is known as the 0-1 loss.

To make this feasible, we must make several approximations. First, we must replace the true risk, which we cannot directly estimate since D is unknown, with the empirical risk with respect to a finite sample. We refer to this sample as S = {⟨x_i, y_i⟩}_{i=1}^N, say that S contains N samples, and index samples from S with subscript i. S is often referred to as training data. We can now minimize the empirical risk with respect to S:

ĥ = argmin_{h ∈ H} E_{⟨x,y⟩ ∼ S} [ I(h(x) ≠ y) ] = argmin_{h ∈ H} Σ_{i=1}^N I(h(x_i) ≠ y_i)

We use ĥ to denote the hypothesis that minimizes the empirical risk in order to distinguish it from h*, the optimal hypothesis with respect to D. It is worth noting that there is no guarantee that ĥ also minimizes the true risk.

Second, directly optimizing the 0-1 loss is known to be NP-hard, making it very difficult to optimize efficiently, regardless of our choice of H. Hence, we replace it with an approximate loss function, usually one that is convex, that permits us to apply techniques from continuous optimization, e.g., gradient descent. In most of our work, we use the logistic loss, which can be interpreted as the cross entropy between the true and predicted diagnosis labels, if we view them as random variables. For K-vector valued labels, the cumulative cross entropy loss can be written as

L_CE(ŷ, y) = Σ_{k=1}^K [ −y^(k) log ŷ^(k) − (1 − y^(k)) log(1 − ŷ^(k)) ]

and our empirical risk minimization problem becomes

ĥ = argmin_{h ∈ H} Σ_{i=1}^N L_CE(h(x_i), y_i)

We should note that when choosing a hypothesis class H, we should be mindful that the logistic loss becomes undefined or unbounded when ŷ^(k) < 0 or ŷ^(k) > 1, respectively. For a suitable choice of hypothesis class H, e.g., logistic regression, this becomes a straightforward optimization problem.
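As a concrete illustration, the following minimal sketch minimizes the empirical risk under the cumulative cross entropy loss for exactly such a hypothesis class: one independent logistic regression output per diagnosis, trained by plain gradient descent on synthetic data. The dimensions, learning rate, and number of steps are arbitrary choices for illustration only; this is not the training code used elsewhere in the thesis.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y_hat, y, eps=1e-12):
    # Cumulative cross-entropy L_CE summed over the K outputs, averaged over samples.
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat), axis=1))

# Toy training set S = {(x_i, y_i)}: N samples, d features, K non-mutually-exclusive labels.
rng = np.random.default_rng(0)
N, d, K = 500, 20, 5
X = rng.normal(size=(N, d))
Y = (rng.random(size=(N, K)) < 0.2).astype(float)

# Hypothesis h(x) = sigmoid(xW + b): one logistic output per diagnosis.
W = np.zeros((d, K))
b = np.zeros(K)
lr = 0.1
for step in range(200):
    Y_hat = sigmoid(X @ W + b)
    grad = Y_hat - Y                      # gradient of the cross-entropy w.r.t. the logits
    W -= lr * (X.T @ grad) / N
    b -= lr * grad.mean(axis=0)

print("empirical risk (cross-entropy):", cross_entropy(sigmoid(X @ W + b), Y))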
Conveniently, minimizing the logistic loss can be viewed as maximizing the likelihood of our observed data under a probabilistic model. Without loss of generality, let us consider a single diagnosis, e.g., diabetes: let Y^(1) be a binary random variable representing whether a patient has diabetes and (with a slight abuse of notation) X be a collection of random variables representing the patient record. Now suppose that H includes probabilistic models with a specific form, parameterized by θ ∈ Θ. Then the conditional probability distribution for a patient having diabetes, under models in H, is P(Y^(1) = 1 | X = x; θ) = h_θ^(1)(x). The probability assigned by our model to a particular patient's set of diagnoses is ∏_{k=1}^K P(Y^(k) = y^(k) | X = x), which we abbreviate p(y | x). Note that this assumes the probabilities of two different diagnoses are conditionally independent given the patient record. This is a reasonable assumption to make if the patient record contains all information required to make each diagnosis, which is sensible if not always true in practice. Now we can learn by choosing the parameters that maximize the likelihood of our data:

θ̂ = argmax_{θ ∈ Θ} ∏_{i=1}^N p(y_i | x_i; θ)
  = argmax_{θ ∈ Θ} ∏_{i=1}^N ∏_{k=1}^K p(y_i^(k) | x_i; θ)
  = argmin_{θ ∈ Θ} −log ∏_{i=1}^N ∏_{k=1}^K p(y_i^(k) | x_i; θ)
  = argmin_{θ ∈ Θ} −Σ_{i=1}^N Σ_{k=1}^K log p(y_i^(k) | x_i; θ)
  = argmin_{θ ∈ Θ} −Σ_{i=1}^N Σ_{k=1}^K log [ h_θ^(k)(x_i)^{y_i^(k)} (1 − h_θ^(k)(x_i))^{1 − y_i^(k)} ]
  = argmin_{θ ∈ Θ} −Σ_{i=1}^N Σ_{k=1}^K [ y_i^(k) log h_θ^(k)(x_i) + (1 − y_i^(k)) log(1 − h_θ^(k)(x_i)) ]
  = argmin_{θ ∈ Θ} Σ_{i=1}^N Σ_{k=1}^K L_CE(h_θ^(k)(x_i), y_i^(k))

The second step uses our conditional independence assumption, while the third and fourth exploit the properties of concave and convex functions and the fact that the logarithm of a product is equal to the sum over logarithms. The fifth step utilizes the fact that h^(k)(x_i) outputs the probability that y_i^(k) = 1 and that our binary labels can act as "switches" turning individual terms on or off. From there we use the properties of the logarithm and the definition of the cross entropy loss. Here we see the connection between empirical risk minimization and maximum likelihood, under the assumption that our labels are binary random variables and that our hypothesis class models a conditional Bernoulli distribution.

Now all that remains is to choose a hypothesis class H and a procedure to update the parameters of our model in order to minimize the empirical risk. Throughout this thesis, we focus on neural networks, commonly referred to collectively as deep learning, which are in most cases fully deterministic and differentiable, facilitating the application of gradient descent for learning. We point the reader to [105] as the most complete resource on the topic of deep learning and neural networks to date and defer detailed discussion of model architectures to individual chapters.

2.2 Electronic Health Records Data

The EHR is intended as a digital replacement for the traditional paper medical chart and so records all possible information from each encounter a patient has with the healthcare system. Table 2.1 lists broad categories of data commonly recorded in modern EHR systems, along with examples of each and details. We can quickly see that EHR data are highly complex and heterogeneous, including a variety of data types recorded at various frequencies in different settings. To get a handle on the diverse data, we can group them according to three criteria:

1. data type (text, image, time series)
2. whether the data result from a measurement or are generated by clinical staff or the patient themselves
3. context of care (outpatient encounter, hospitalization, imaging)
The result is a fairly detailed record of a hospital stay, often lasting a day or more, that includes physiologic and medication time series, one or more text documents, and some structured codes. In contrast, most outpatient encounters, e.g., visits to a primary care doctor, generate far less data. Clinicians will record a snapshot of a patient's vital signs, but this may happen only once or twice in a year. Lab tests may not be ordered at all, and if a patient is taking a medication at home, then the EHR may only record the medication order with no details about actual administration. However, due to the need to justify care decisions for billing, nearly all clinical encounters will generate doctor's notes and administrative codes. Thus, in longitudinal care settings, in which we are interested in a patient's health over years, the data most commonly available to us are structured and text data representing (billable) care decisions, rather than direct measurements of a patient. What nearly all data in modern EHRs have in common is that they are generated during an ocial encounter with the healthcare system and veried by clinical sta before 25 being documented in the patient's ocial medical record. This is a critical point with signicant implications that are easy to overlook in the age of ubiquitous tracking and wearable sensors. The average consumer can record, view, and share their per-minute heart rate with a $100 wrist-worn sensor. Hospitals have signicantly more expensive and sophisticated ECG devices so it is natural to assume that they would record heart rate continuously and at a much higher rate with greater delity. Nonetheless, this is not the case { even in hospitals with continuous monitoring, clinical sta must still verify vital signs before they are charted. This has several implications [187]. First, observations are recorded at a sparse, irregular rate that can vary across dierent variables, over time, and with patient condition. This poses challenges to traditional time series models which typically assume a xed clock rate. Second, manual charting results in subtle forms of selection bias { in particular, clinical sta are more likely to chart abnormal values than normal values. This is especially true for laboratory tests requiring a blood draw, which clinical sta will often order only if they expect an abnormal result. However, this bias is present even in routinely charted vital signs because hospital policies often recommend more frequent measurement for unstable patients. The result is that charted data may make a patient's condition appear more deranged than would measurements recorded at random. Robust machine learning models are likely to learn this bias, which could have a deleterious eect on future accuracy if new data do not re ect this same skew in distribution. Finally, manual charting produces non-random missing values that indirectly encode information about patient condition and care. For example, certain variables like fraction inspired oxygen (FIO 2 ) are more commonly measured during interventions provided to 26 sicker patients. Hence, the presence or absence of such measurements may leak informa- tion about treatment, enabling a machine learning model to, for example, predict that a patient has a respiratory disease because she is ventilated. For some applications, e.g., phenotype classication to construct cohorts for clinical research, this may be acceptable and even desirable. 
However, learning such correlations could invalidate a model whose purpose is to inform treatment decisions [220]. 2.2.1 Inpatient observational time series In most experiments presented in this thesis, we utilize time series of clinical observations from hospitalized patients recorded in EHR databases. These include a wide range of measurement types: physiological measurements (vital signs like heart rate) manual assessments (Glasgow coma scale) laboratory test results (glucose) intervention-related variables (end-tidal expired CO 2 ) time varying patient characteristics (weight) These vary signicantly with respect to distribution, sampling rate, dynamics and rate of change, body system, and clinical relevance. Heart rate, for example, measures cardio- vascular health, has an approximately Gaussian distribution, is measured on roughly an hourly rate, and can change rapidly, as in the case of a bradycardia. In contrast, Glasgow coma scale, a subjective measure of cognitive function and awareness, is in fact a discrete distribution, takes only 13 integer values between 3 and 15, and changes infrequently. 27 There are a variety of other data sources and types recorded in EHRs that constitute time series but which we do not use in our experiments. Primary among them are interventions, such as intravenous uids, medications, and ventilator settings. Many applications require us to consider interventions separately from observational data. For example, for treatment recommendation learned via reinforcement learning, observational data become measurements of state, while interventions are treated as actions which we seek to optimize in order to obtain some reward. For learning to diagnose, incorporating such data is relatively straightforward: we simply treat them as additional inputs to our models. Although treatments may leak in- formation about our diagnostic labels, this is already a problem with purely observational data for patients undergoing treatment, as discussed earlier. There is reason to think that explicitly including treatments may in fact help us control for treatment eects [42]. We utilize three distinct time series datasets in this thesis. 2.2.1.1 VPICU Dataset This is a collection of roughly 10,000 pediatric ICU (PICU) episodes from Children's Hospital Los Angeles, created and maintained as a part of an IRB approved study by the Laura P. and Leland K. Whittier Virtual Pediatric Intensive Care Unit (VPICU). First described in [187], the VPICU dataset consists of irregularly sampled time series for thirteen clinical variables, including vital signs (heart rate), lab results (glucose), uid outputs (urine), and subjective assessments (Glascow coma scale), as well as a handful of variables related to ventilation (end-tidal CO 2 ). The most frequently measured variables, such as pulse oximetry and respiratory rate, are recorded on average once per hour, 28 while lab tests like pH are recorded once or twice per day. The data also include several non-temporal patient characteristics, such as age, weight, and height. Each episode is associated with one primary diagnosis and zero or more secondary diagnoses, chosen from an in-house taxonomy used for research, making these diagnoses more reliable than codes assigned for billing purposes. The taxonomy includes a mapping to ICD-9 codes for convenience. Prevalent diagnoses include acute respiratory distress, congestive heart failure, seizures, renal failure, and sepsis. 
Other outcomes available include whether the patient died before discharge and PICU length of stay. The duration of PICU visits is highly variable, though the vast majority of stays last fewer than seven days and over half last fewer than two days. The minimum visit length is 12 hours. 2.2.1.2 MIMIC-III Database The Medical Information Mart for Intensive Care (MIMIC-III) database [142] is a publicly available database of 60,000 anonymized ICU episodes from Beth Israel Deaconess Hos- pital. Similar in character to the VPICU data, the MIMIC-III database contains sparsely sampled time series of clinical observations recorded during the delivery of inpatient care. However, because the MIMIC-III database is an extract of the full EHR database, it also includes interventions and medications, clinician notes, and microbiology reports. Its breadth, along with its size, make it a compelling data source for retrospective clinical studies and machine learning research. However, MIMIC-III is not appropriate for machine learning research out-of-the-box. For each project, researchers must rst dene an outcome or task and derive the corre- sponding label or target, often using a surrogate when ground truth is not available. Then 29 they must dene a patient cohort, i.e., which patients to include in the study, and gener- ate their own training and test splits. Next they must perform signicant preprocessing of the data: identifying which charted events correspond to heart rate, disambiguating units of measurement (Celsis vs. Farenheit), and detecting outliers, among other things. Fi- nally, they must engineer features appropriate for their modeling task. As such, research using MIMIC-III can be dicult or impossible to reproduce [141], and understanding and using MIMIC-III is challenging for researchers without prior experience with EHR data or access to medical experts. These challenges motivate our work on a benchmark data set derived from MIMIC-III, as described in Chapter 10. 2.2.1.3 Physionet Challenge 2012 The Physionet Challenge 2012 dataset [273] is a curated, public subset of the MIMIC-III database designed to support a competition involving the development of innovative new risk of mortality scores. It is similar in size to the VPICU dataset (8,000 patients) but includes only adult patients who were deceased at the time of its release. Further, it includes encounters from not only the general ICU but also two specialty ICUs (coronary care and cardiac) and a general surgery recovery unit. It also has nearly three times as many variables, including many more laboratory tests, e.g., Bilirubin, and explicit treatment information in the form of a binary variable indicating that the patient is ventilated. Each episode has 48 hours of data and has several associated severity of illness scores, including SAPS-I [163] and SOFA [302]. These data have several shortcomings for our purposes: rst, the only outcomes in- cluded are mortality and length of stay. There are no diagnosis codes so we cannot 30 perform our standard learning to diagnose task (although detecting patients at risk of in-hospital mortality is highly related). In addition, outcome labels are available for only half of patient encounters, making the labeled dataset very small. Nonetheless, it is an important source of clinical time series data since it is highly curated, fully public, and widely used in research. 
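Across all three of these sources, the charted observations are sparse and irregularly sampled, as discussed above. The sketch below illustrates one common way such records are regularized before modeling: bin raw charted events into hourly buckets, carry the last observation forward, and retain a binary mask recording which values were actually measured. The long-format layout, column names, and toy values are hypothetical; this shows the general recipe rather than the exact preprocessing pipeline used in our experiments.

import numpy as np
import pandas as pd

# Hypothetical long-format chart events: one row per (time, variable, value) observation.
events = pd.DataFrame({
    "hours": [0.2, 0.9, 1.5, 3.1, 3.4],
    "variable": ["heart_rate", "glucose", "heart_rate", "heart_rate", "glucose"],
    "value": [112.0, 95.0, 108.0, 121.0, 140.0],
})
variables = ["heart_rate", "glucose"]
n_hours = 5

# Bin to hourly buckets, averaging multiple measurements that fall in the same bucket.
events["bucket"] = events["hours"].astype(int)
grid = (events.pivot_table(index="bucket", columns="variable", values="value", aggfunc="mean")
              .reindex(index=range(n_hours), columns=variables))

mask = grid.notna().astype(int)      # 1 where a value was actually charted
filled = grid.ffill()                # carry the last observation forward
# Any remaining NaNs (nothing observed yet) could be imputed with a per-variable "normal" value.
print(filled)
print(mask)

Keeping the mask alongside the filled grid preserves exactly the kind of informative missingness discussed earlier, so a downstream model can either exploit it or be shielded from it, depending on the application.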
2.2.2 Clinician-generated text Clinical notes, such as daily progress reports and discharge summaries, often record details of clinical care that cannot be found elsewhere and provide complementary information to clinical measurements and structured data [100, 38, 92]. They have been used in a wide variety of applications ranging from discovering o-label drug uses [143] to phar- macovigilance [169, 120], as well as learning to diagnose [92, 331, 2]. However, clinical text poses a number of obstacles even beyond the standard challenges of natural language processing and information extraction. Clinical text often uses large, diverse vocabulary rife with technical jargon. Many clinical concepts have a variety of synonyms that are dicult to recognize without spe- cialized knowledge, e.g., myocardial infarction and heart attack. This increases both the dimensionality and sparsity of the data: suppose, for example, we have M synonyms for each of N concepts appearing in a corpus of doctor's notes. Then a simple vocabulary will have MN terms even though it represents only N actual concepts. Additionally, each individual term will appear only 1=M times as frequently as the actual concept it represents. This problem is further exacerbated by the fact that dierent clinicians often use their own dialect, shorthand, and abbreviations. 31 Clinical notes often exhibit a complex \sequence of sequences" structure uncommon in more traditional text [82]: in particular, we often have a sequence of documents recorded over time with the same kinds of irregular timestamps found in other EHR data. It is not immediately obvious what the right representation or model is for these irregular sequences of documents or what the signicance of time might be. One notable advantage of clinical notes is that they often contain clinical insights and inferences rather than mere observations: references to prescribed treatments, potential diagnoses, symptoms not recorded elsewhere, etc. This information provides a poten- tial source of supervision for machine learning models in the absence of cleanly labeled data [114, 2, 116, 117]. This is the context in which we study clinical text in this thesis: as a source of weak supervision for learning to diagnose in the absence of clean labels (Chapter 9). 2.2.2.1 i2b2 Obesity Challenge 2008 The i2b2 Obesity Challenge 2008 dataset is a collection of 1,237 deidentied clinical dis- charge summaries from the Partners HealthCare Research Patient Data Repository [295]. Each summary has been labeled by clinical experts with one or more of 16 conditions related to obesity, ranging from coronary artery disease to depression. 32 2.3 Related Work 2.3.1 Diagnostic expert systems There is a long history of research on computer-aided diagnosis (CAD) systems, dat- ing back half a century. Most early research on CAD involved expert systems focused on reasoning about databases of clinical knowledge. Examples include CASNET, which provided consultations for glaucoma [314]; MYCIN which helped clinicians diagnose infec- tious diseases [288, 269]; and INTERNIST-I and CADUCEUS, which formulated a general framework for identifying likely diagnostic explanations for abnormal observations [230]. PIP was designed to study the cognitive processes underlying medical decision making and was applied to diagnosis of edema [219]. CONSIDER [174] and RECONSIDER [33] performed diagnosis by matching terms in large databases of disease denitions stored as structured natural language. 
The spiritual successors of these systems are computational phenotyping algorithms which use similar combinations of term-matching and rules but can be applied automatically to EHR data, rather than requiring human interaction. Automated diagnosis took a large leap forward with the advent of probabilistic ap- proaches to articial intelligence. Perhaps one of the most famous examples applied to di- agnosis was Pathnder [124, 123], which was designed to diagnose over 60 dierent lymph node diseases. Pathnder began as a rule-based expert system, but second and third ver- sions based on a Naive Bayes classier (with expert-provided probabilities) signicantly outperformed the rules. The nal version, Pathnder IV, was a full Bayesian network with over 10,000 parameters [157, 121]. Similarly, CPCS [231] and QMR-DT [270, 195] 33 were Bayesian networks derived from the INTERNIST-1 and QMR knowledge databases, respectively. 2.3.2 Computational phenotyping In medicine, a phenotype consists of the measurable properties (biologic, physiologic, be- havioral, clinical) of health or a disease process [209]. Phenotypes are most commonly used for case identication in research [66], although they also have applications in de- cision support (for example, comorbidity detection [85]) and quality improvement and reporting [313]. The \gold standard" approach to phenotyping is manual chart review, which does not scale to large numbers of records. There is also some disagreement in the community about the general reliability of chart review, with some studies showing high levels of inter-rater agreement [292, 103, 211] while others nd sucient variability as to warrant recommendations about best practice [298, 326]. With the advent of EHRs, researchers began developing search- and query-based denitions, with several large scale collabora- tive eorts dedicated to validating and disseminating computational phenotype denitions [40, 66, 134]. These denitions are similar in characteristic to classic knowledge-based expert systems and hence suer many of the same weaknesses. In particular, development of such denitions remains time-consuming and so this approach does not scale to large numbers of phenotypes. Further, these denitions are often validated on relatively small sets of labeled records and sometimes underperform when applied at new institutions, raising questions about their generalizability. 34 2.3.3 Machine learning for diagnosis and phenotyping The literature on using statistical learning to build medical diagnosis systems is extensive, dating back multiple decades and covering a wide range of applications. Rather than attempt (and fail) to survey it in its entirety, we will restrict our attention to the most representative work since the passage of the Health Information Technology for Economic and Clinical Health (HITECH) Act in the United States in 2009, when the adaoption of EHRs really began to accelerate in the US. We will also focus on work using the same EHR data types (observational time series and text) that we do, specically excluding work on medical images and waveforms like ECG. One broad category of work that is highly relevant uses supervised machine learning to detect research-grade phenotypes in logintudinal EHRs consisting primarily of structured data and text [54, 92, 246]. Several of these works propose innovative solutions for using domain knowledge and ontologies to derive surrogate labels to train classiers in the absence of ground truth diagnosis [2, 116, 117]. 
Another relevant line of clinical machine learning research seeks to diagnose life-threatening conditions, such as sepsis. There are clinically accepted criteria for recognizing sepsis, so researchers have focused on accurately predicting sepsis before obvious symptoms manifest. Approaches vary, ranging from hand-engineered features [126] to linear dynamical systems with latent factors [283, 284] to recurrent neural networks [93]. Contemporaneous with the work in this thesis, there has been a boom in research on using deep learning to perform multilabel diagnosis classification from sequential clinical data, including inpatient time series [50], longitudinal lab tests [243], and medical codes [55].

Much of the work in machine learning-based phenotyping has been unsupervised, where the goal is to discover clusters or latent representations that explain observed symptoms and may be associated with known diseases. An early example is [187], which applied a Gaussian mixture model to the same VPICU dataset we use in our experiments. They discovered clusters that were weakly correlated with major diagnostic categories and predictive of future mortality. [257] describes a hierarchical Bayesian framework for discovering clusters of trajectories and uses it to discover potential subtypes among patients from a national scleroderma registry. [129] formulate phenotype discovery as a tensor factorization problem, discovering latent factors that explain correlations across different types of medical codes. These methods are challenging to evaluate, as they are often applied in settings where no ground truth labels are available. Most published work thus far utilizes subjective, qualitative evaluations by a small number of clinical experts. One notable exception is [160], who demonstrated that features extracted by an unsupervised neural network could distinguish between two different chronic diseases.

Category         Example               Type             Note
Currently recorded in EHR:
Demographics     ethnicity             categorical
Characteristics  age                   numerical
History          past surgery          text
Vital signs      heart rate            numerical        Time series in acute care settings only
Observations     pain                  mixed            Time series in acute care settings only
Lab tests        glucose               categorical      Time series in acute care settings only
Medications      penicillin            categorical      Actual administered amount in acute care settings only
Radiology        X-ray                 2D or 3D image   Typically not integrated with other data
Notes            discharge summary     text             Sequences of documents
Codes            diagnosis, procedure  categorical      Primarily administrative
Not currently recorded in EHR:
Raw signals      ECG                   numerical        May be stored separately
Wearables        heart rate            numerical
Mobile health    diet log              various
Genetics

Table 2.1: Categories of digital health data and whether they are recorded in modern EHRs.

Part I
Searching to Diagnose from Similar Patients

Chapter 3
Accelerated Similarity Search for Clinical Time Series

In many cases, the best representation of clinical time series may be the data themselves. Many diseases, particularly acute conditions, manifest as characteristic patterns of symptoms and signs, which are recorded in EHRs as multivariate time series. Patients with similar conditions will manifest similar patterns, and so diagnosing a new patient may reduce to searching a database for past patients with similar clinical patterns.
How- ever, similarity search for multivariate time series is challenging because (1) there is no consensus on the right denition of similarity and (2) computing similarity between two multivariate time series is often computationally expensive. In this chapter, we inves- tigate this problem and show that kernelized locality sensitive hashing can accelerate similarity search for time series using a number of dierent denitions of similarity. We then apply our framework to three large data sets of clinical time series to demonstrate the speed and accuracy of this approach of this approach. 39 3.1 Introduction Multivariate time series data are becoming ubiquitous and big. Nowhere is this trend more obvious than in healthcare, with the growing adoption of electronic health records (EHRs) systems. According to a 2009 survey, hospital intensive care units (ICUs) in the United States (US) treated nearly 55,000 patients per day, 1 generating digital health databases containing millions of individual measurements, many of which constitute mul- tivariate time series. Clinicians naturally want to utilize these data in new and innovative ways to aid in the diagnosis and treatment of new patients. An increasingly popular idea is to search these databases to nd \patients like mine," i.e., past cases that are similar to the present one [223]. This classic data mining task, known as similarity search, must be both accurate and fast, which depends crucially on two choices: representation and similarity measure. For traditional data types (e.g., structured data, free text, images, etc.), the standard approach is to dene a set of features that we extract from each object in our database and then apply straightforward measures of similarity (e.g., Euclidean distance) to these. This has been applied to time series data [168], but designing good features can be dicult and time-consuming. Researchers have shown empirically that the best representation for time series is often the data themselves, combined with specialized similarity measures [150]. The classic example is dynamic time warping (DTW), an extension of Euclidean distance that permits nonlinear warping along the temporal axis in order to nd the optimal alignment 1 From the American Hospital Association Hospital Statistics survey conducted in 2009 and published in 2011 by Health Forum, LLC, and the American Hospital Association. 40 between two time series [306]. The choice in time series similarity measures ranges from simple approaches, such as Euclidean distance and DTW, to complex approaches based on tting parametric models [187]. Indeed, there has been an explosion in the number and variety of time series similarity and distance metrics proposed in the literature over the last decade [171] [52] [68] [173]. In order to implement fast similarity search for multivariate time series, we make the following two observations: (1) dierent similarities work best for dierent data and prob- lems; and (2) the most eective similarity measures are often computationally expensive and ill-suited to large scale search. The rst observation is best demonstrated by the thorough empirical evaluation in [173], in which no single metric achieves the best perfor- mance for all data sets and problems. In other words, choosing the right similarity metrics (much like designing good features) requires experience, intuition, and experimentation. 
The second observation is more nuanced: some similarity metrics can be computed quickly using a combination of smart engineering and heuristics (e.g., the excellent work on computing DTW quickly in [240]). However, these speed-ups cannot easily be generalized beyond specific tasks or extended to other metrics.

In this chapter, we investigate a general solution that applies to a large class of time series similarities via locality sensitive hashing (LSH) [104]. LSH utilizes one or more hash functions to map the input data to a fixed-length representation (typically a binary code), which can then be used as an index in a large-scale storage architecture. Well-constructed hash functions will assign similar codes to similar objects, allowing us to store them together and to find them with a quick lookup. LSH has been used to build fast search and retrieval algorithms over massive databases of text and images [104, 108, 75]. Here we utilize a specific hashing framework, namely kernelized locality-sensitive hashing (KLSH) [159], for fast search over multivariate time series. KLSH kernelizes the LSH search framework so that it can be used with arbitrary time series similarity metrics. We investigate a series of time series kernel functions, including modified Euclidean distance, multivariate dynamic time warping [240], global alignment kernels [67], and vector autoregressive kernels [68], and show that KLSH provides significant speed-ups across all similarity metrics without significantly compromising the quality of the search results. We believe that this work is but a first step toward a comprehensive approach to fast similarity search for time series (and other structured objects) and that it opens the door to innovative work on time series hashing.

3.2 Background

3.2.1 Notation

We define a univariate time series of length T as a set of T samples from a random process parameterized by time (and indexed by t): x = {x(1), ..., x(t), ..., x(T)}. In this chapter, we assume that time is discrete and that measurements are regularly sampled with no missing values. We will denote a single univariate time series as x ∈ R^T. A multivariate time series (MVT) is a set of P time series (one per each of P variables) sampled at the same time intervals:

X = { (x_1(1), x_2(1), ..., x_P(1))^⊤, ..., (x_1(t), x_2(t), ..., x_P(t))^⊤, ..., (x_1(T), x_2(T), ..., x_P(T))^⊤ }

We can represent an MVT as a P-by-T matrix X ∈ R^{P×T}, where the ith row is the ith univariate time series x_i ∈ R^T and column t is a P-vector of observations at time t: x(t) ∈ R^P. We refer to the dimensionality of this MVT as P and consistently use a subscript plus lowercase bold (x_i) to indicate the ith dimensional time series within a single MVT X. Throughout this chapter, we will often refer to P-dimensional MVTs simply as "time series" when our meaning is clear from context.

Suppose now that we have a database (which we will refer to as D) of P-dimensional MVTs, N in total: D = {X_j : X_j ∈ R^{P×T_j}}_{j=1}^N. We will consistently use a subscript plus uppercase bold (X_j) to index MVTs in a database. All of the time series in our database are of the same dimension P but may vary in length.
43 3.2.2 Problem Denition We dene the task of multivariate time series similarity search as follows: given a query time seriesX q 2R PTq , we are interested in nding the time seriesX 2D that is most similar toX q , given an arbitrary denition of similarity between MVTs S(X;X q ): X = argmax X2D S(X q ;X) i.e., the time series X 2D that maximizes S(X q ;). Alternatively, if we are given a distance measure D instead of similarity, we seek to minimize D(X q ;). More generally we are interested in the k nearest neighbors (knn) problem: nding the K most similar time series to X q . Furthermore, we are interested in performing this search as fast as possible. 3.2.3 Related Work Fast search in a database of variable length MVTs presents two interrelated challenges that thwart most standard approaches: 1. Designing a time series similarity (or distance) measure that is both eective and fast can be dicult. 2. Many traditional approximate search methods cannot easily be applied to variable length MVTs. Computing the similarity between variable length time series can be quite computa- tionally expensive and so an approach based on an exhaustive search (i.e., linear scan) 44 of the database will not scale to large-scale data sets (i.e., some combination of large P , largeT , and largeN). This problem is not unique to variable length time series. In com- puter vision, for example, researchers have found that the best representations for images tend to be extremely high dimensional feature vectors, which can be expensive in terms of both memory and computation [108]. With such data, linear scan searches using Eu- clidean distance become prohibitively slow, necessitating the development of alternative approaches. One possible solution is to use a fast approximate search for nearest neighbors via an alternative time series representation. A framework that has enjoyed widespread success and popularity is symbolic approximate aggregation (SAX), which discretizes the contin- uous observation space and replaces real values with symbols [171]. By converting the data to a discrete representation, we gain access to fast algorithms for mining sequences of symbols (e.g., longest common subsequence). Subsequent work has shown that SAX does a good job of approximating the true similarity between short time series and can be used to perform scalable indexing of large time series databases [263]. How to apply SAX to multivariate time series remains an open research question. Another alternative representation that enables both ecient storage and fast search is binary codes, such as those produced by hashing. Locality sensitive hashing (LSH) is one of the most widely used approaches to constructing xed length binary representations of data for fast approximate nearest neighbor search; it is based on the intuition that the probability that two objects share the same code should be proportional to their similarity [104]. LSH has received numerous extensions, including the incorporation of supervision through the use of metric learning [76]. There is also a wide variety of alternative methods 45 for learning binary embeddings of data which can then be used for hashing and fast comparison. These include semantic hashing [251]; spectral hashing [315]; shift-invariant kernel hashing [237]; directly learning a hamming distance metric [208]; etc. All of these approaches share a common limitation: they require xed size data representations, which prohibits their direct application to time series data. 
Two major advances in the last ve years should enable us to perform fast approximate nearest neighbor search over MVTs via hashing. The rst is kernelized locality sensitive hashing (KLSH) [159]. KLSH permits the hashing of objects for which similarity is dened using arbitrary kernel functions. The second major advance has been a large number of proposed time series similarity and distance measures. Many of these can naturally be formulated into valid kernels and used within the KLSH framework, allowing us to combine the power of these similarity measures with the speed of hashing based search. 3.3 Kernelized Hashing of Time Series In this section, we rst provide a thorough review of kernel-based hashing and time series kernels, and then describe how to connect these two concepts into a unied kernelized time series hashing framework. 3.3.1 Kernelized locality sensitive hashing Locality sensitive hashing (LSH) enables fast similarity search by generating bit vector representations of objects, which can be compared very rapidly on modern hardware [104]. Given an object X, we can generate a B-bit hash code for the object using a collection ofB hash functionsfh 1 ;:::;h B g. For each hash function, the probability that 46 two objects receive the same bit (1 or 0) is proportional to their similarity: Pfh i (X) = h i (X 0 )g/ S(X;X 0 ) for all i = 1;:::;B. A common way to achieve this is by dening h i (X) = 1 ifa > X 0 and 0 otherwise, wherea is a random hyperplane sampled from a zero mean, unit variance Gaussian: asN (0;I) [48]. Standard LSH assumes xed size feature vectors and uses cosine similarity. It cannot be used directly with more specialized notions of similarity or objects with variable sizes or special structure (e.g., time series). Kernel methods allow us to apply machine learning to problems where the data does not live in a xed dimensional space or where we want to use similarity measures that incorporate specialized information or structure [260]. Assume we have a positive denite kernel function that produces a cosine-like similarity between X and X 0 : (X;X 0 ). Here induces a feature space for which we cannot directly examine(X) or calculate, for example, a Euclidean distance between two examples (k(X)(X 0 )k 2 ). The kernel trick allows us to compute the desired similarity using since(X;X 0 ) =(X) > (X 0 ). We can use this kernel function to perform kernelized LSH in much the same way that we use kernel methods for support vector machines [159]. We compute the random hyperplane a as a weighted sum over M N training examples in the kernel-induced feature space: a = P M j=1 w j (X j ). Again, we cannot explicitly compute this, but we observe that we can evaluatea > (X 0 ) for some new example (X 0 ): a > (X 0 ) = 0 @ M X j=1 w j (X j ) 1 A (X 0 ) = M X j=1 w j (X j )(X 0 ) = M X j=1 w j (X j ;X 0 ) 47 and so our KLSH hashing function becomes h a (X 0 ) = 8 > > < > > : 1 if sign P M j=1 w j (X j ;X 0 ) 0 0 otherwise Two questions remain: how big should M be (and how do we choose our M points?); and how do we choose our weightsw j such thata is a \valid" hyperplane for performing LSH (i.e., is asN (0;I))? Clearly if M is very large (i.e., M N), then evaluating hash function becomes almost as expensive as performing a linear scan using our kernel- based similarity. Thus, we wantMN. [159] suggests samplingM =O( p N) points at random, which guarantees sublinear runtime and usually produces good accuracy. For the second question, we choose w =K 1=2 e R . 
K2R MM is our similarity kernel matrix over our M samples (i.e., K jl = (X j ;X l )). e R 2f0; 1g M is an indicator vector with only R < M nonzero entries, corresponding to a subset of R points chosen at random from the M points included inK. [159] gives a derivation showing that, by the central limit theorem,a will be approximately distributed according toN (0;I P ). The main computational expense of KLSH is the initial training, which requires com- puting and the inverting the M-by-M kernel matrixK (anO(M 3 ) operation) to ndw. Searching for the nearest neighbors of a new queryX q is very fast. First, we evaluate our kernel function O(BM) times to compute B bits. By choosing MN, we ensure that this is much faster than performing an exhaustive search, which requires O(N) kernel evaluations. In the second step, we search over binary hash codes to nd nearest neigh- bors, which is trivially fast, even in large data sets. Finally, we can optionally rene our set of nearest neighbors by evaluating the true similarity over a small set of candidates 48 [75]. The result is a signicant speed-up over an exhaustive search using the ground truth similarity metric. 3.4 Time series similarity Time series similarity measures can roughly be classied into two groups: those that capture shape and those that capture higher level or global structure [173]. The most popular shape-based approach is dynamic time warping (DTW) [306]. It has been shown to be very eective in a wide variety of settings and applications. It is computationally expensive (O(T 2 ) for comparing two time series each of length T ), but researchers have shown a variety of ways to speed it up [250][151][252][240]. A more signicant weakness is the fact that it does not obey the triangle inequality and so does not constitute a true distance metric. This prevents its use in kernel methods without modication [266]. A generalization of DTW called the Global Alignment (GA) kernel, which computes the similarity between two time series based on all possible alignments, has been shown to produce positive denite kernels and is often competitive with DTW [67]. Structure-based similarity measures attempt to capture higher level structure in the time series. [172] rst convert univariate time series to a symbolic representation and then construct histograms over a \vocabulary" of symbol subsequences. They call this a bag-of-patterns representation. Also included in this general category are model-based approaches, which compare time series using an assumed model. [52] describe a kernel based on echo state networks that rst trains a recurrent reservoir model and then com- pares two time series in the model space [52]. [68] propose an autoregressive time series 49 kernel (ARK) based on linear vector autoregressive models that eliminates the two-step \train, then compare" process by showing that it can be formulated as a covariance kernel. Here we describe four representative similarity measures that can be combined with kernelized hashing to yield fast similarity search over multivariate time series. However, our framework applies to any time series similarity or distance. 3.4.0.1 Multivariate dynamic time warping Dynamic time warping was rst proposed nearly fty years ago and remains one of the gold standards for comparing time series. 
Suppose we have a pair of time series X2 R PT andX 0 2R PT 0 and a measure of discrepancy d(x(t 1 );x 0 (t 2 )) between a pair of time pointsx(t 1 ) andx 0 (t 2 ), one from each time series, wheret 1 andt 2 need not be equal. For example, we might use the squared Euclidean distance between these points: d(x(t 1 );x 0 (t 2 )) =kx(t 1 )x 0 (t 2 )k 2 2 = P X i=1 x i (t 1 )x 0 i (t 2 ) 2 Next we dene an alignment (or warping function) =f(1; 1);:::; (t j ;t 0 j );:::; (T;T 0 )g as a list of non-decreasing pairs of indices ofX,X 0 [67]. satises these properties: jjT +T 0 1. (j + 1)(j) for all j = 1;:::;jj. (1; 1)(j) (T;T 0 ) for all j = 1;:::;jj. (j + 1)(j)2f(0; 1); (1; 0); (1; 1)g for all j = 1;:::;jj. 50 The jth pair of indices aligns the points x((j; 1)) and x((j; 2)). Every point must be aligned, and we only allow forward movements along X or X 0 . We then dene the -alignment distance betweenX andX 0 as D (X;X 0 ) = jj X j=1 d(x((j; 1));x 0 ((j; 2)) The multivariate dynamic time warping (MDTW) distance is dened as the minimum -alignment distance over a set of possible alignments given by (X;X 0 ): MDTW(X;X 0 ) = 1 j j min 2(X;X 0 ) D (X;X 0 ) where = arg min 2(X;X 0 ) D (X;X 0 ). We normalize the -alignment distance by j j so that we can compare MDTW distances within a database of variable length time series. To convert MDTW distance to similarity, we use a radial basis function (RBF) kernel: MDTW (X;X 0 ) = expfMDTW(X;X 0 )=(2)g where is a param- eter that we can choose based on the application data. In our experiments, we let = median X;X 0 2D MDTW(X;X 0 ), i.e., the median MDTW distance between time series in our database. MDTW is computationally expensive, requiringO(T 2 ) evaluations of our discrepancy function. One of the main ways to speed it up is to impose constraints on our set of alignments . For example, the Sakoe-Chiba band requires that aligned points be within a certain number of steps of one another, i.e., thatj(j; 1)(j; 2)jC [250]. In practice such constraints are implemented by using a limited window of possible alignments, but 51 formally we add weights to the denition of D , such that any two points that violate a constraint (e.g., fall outside the Sakoe-Chiba band) receive a very large discrepancy. The computational complexity of MDTW with the Sakoe-Chiba band is O(TC), which can be a relatively large gain if T is large but C is small. The price we pay for the speed-up is potentially suboptimal alignments. 3.4.0.2 Global alignment kernels MDTW does not obey the triangle inequality and cannot be used (without modication) to dene a positive denite kernel, a requirement for KLSH. MDTW also has a practical limitation: it denes distance based only on the optimal alignment between time series, ignoring other potentially poor alignments. A reasonable alternative might consider all possible alignments, in order to reward pairs of time series with multiple good alignments. Global alignment kernels address both the theoretical and pragmatic limitations of MDTW [67]. We can think of a global alignment (GA) kernel as a weighted average over all possible-alignment distances in (X;X 0 ), smoothing out the-distances of outlier 52 alignments (i.e., unusually small or large distances). 
More formally, [67] denes a global alignment kernel as an exponentiated soft-minimum: GA (X;X 0 ) = X 2(X;X 0 ) expfD (X;X 0 )g = X 2(X;X 0 ) expf jj X j=1 d(x((j; 1));x 0 ((j; 2))g = X 2(X;X 0 ) jj Y j=1 expfd(x((j; 1));x 0 ((j; 2))g = X 2(X;X 0 ) jj Y j=1 s(x((j; 1));x 0 ((j; 2)) where the function s(x(t 1 );x 0 (t 2 )) gives a measure of similarity between time points, analogous to the discrepancy function. This is also called the local kernel within the framework of mapping kernels, a recent generalization of convolutional kernels. [268] describe a set of conditions under which a mapping kernel is guaranteed to be positive denite for all local kernels, and [67] show that GA kernel satises these with only the mild assumption of geometric divisibility. Informally, this means that for positive denite local kernel s(x;y), the ratio s(x;y)=(1 +s(x;y)) is also positive denite. This means that we should only use geometrically divisible kernels in our GA kernel denition to ensure its positive deniteness. This excludes, for example, Gaussian and Laplace kernels [67]. Like MDTW, GA kernel runs in O(T 2 ) but can be sped up to O(TC) using a second local kernel !(t;t 0 ) over positions (instead of values), which is then multiplied by the 53 similarity kernel. [67] recommends using a triangular kernel: !(t;t 0 ) = (1jtt 0 j=C) + , which is analogous to the Sakoe-Chiba band. 3.4.0.3 Vector autoregressive kernels The previous two kernels measure time series similarity based on shape. An alternative paradigm involves comparing time series in terms of higher level structures, such as temporal dependencies and correlations between dierent variables. Such structure is often detected by rst specifying a type of model (often parametric and probabilistic) and then tting a separate model to each time series to be compared. We can then measure similarity between time series based on the similarity of their model parameters. One simple but very eective model is linear vector autoregression (VAR) [184]. An order-L VAR or VAR(L) model for a P dimensional time series species that x i (t) is equal to a linear combination of observations of all P variables from the previous L time steps plus some zero mean Gaussian noise: x t = P L l=1 A l x tl +b +" t . L is the lag, A 1 ;:::;A L 2R PP are the transition matrices,b2R P the intercept, and"sN (0; ) the noise. We can rapidly train a VAR(L) model on a time series X 2R PT using linear regression. We rst preprocess the data into two matrices, Z 2R P(TL) and W 2R (PL+1)(TL) . Z contains TL observations of each variable, which we stack horizontally to form a matrix of responses. W containsTL lagged windows that we use to predict the next time step, which we reshape into feature vectors for our regression. We then solve the ordinary least squares problemZ =BW . The resulting b B =ZW 1 contains our transition matrices and intercept. We can then compare two time series on the basis of their respective VAR(L) transition matrices and intercepts. 54 In [68], an elegant solution is developed to integrate the two-step \train, then com- pare" process for VAR-based similarity by formulating a VAR-related covariance kernel using the Z and W matrices. 
Using a Bayesian linear regression framework with a non-informative prior, they dene the following kernel function: VAR (X;X 0 ) = jW > W +I c j 1 +jW > W +Z > Z +I c j P=2 whereW = [W W 0 ], Z = [Z Z 0 ], c = T + T 0 1, and depends on P and the degrees of freedom d of the matrix-normal inverse Wishart prior but can be treated as a tunable parameter. Note that this is the Gram matrix formulation; there is also a Variance matrix formulation. They are equivalent but give dierent computational performance, depending on the relative sizes of T and P . It is demonstrated that under certain conditions (specicallyd>P1), VAR is positive denite and innitely divisible, such that 1=n VAR is also positive denite for alln2 R [68]. The latter property is especially convenient because it means that we can substitute a dierent exponent forP=2 to make the result more numerically stable. The VAR kernel is computationally expensive. The Gram formulation above is e O(P + T 3 ), while the Variance formulation is e O(T +P 3 ). IfT orP is small (i.e., our time series are short or low dimensional), then we have a choice of formulation that will run eciently. However, if T and P are both large, the VAR kernel may be slow and even intractable. 55 3.4.0.4 Euclidean distance Euclidean distance (ED) is among the simplest of time series distance measures. We can think of ED as an alignment-based kernel with a xed specic alignment. The main open question is how to apply ED for two time series with dierent lengths; there are a variety of choices (truncation, extension, resampling, interpolation), though none seems particularly justied. We choose a simple strategy: if the shorter time series is X 0 , we appendTT 0 copies ofx 0 (T 0 ) to the end ofX 0 so that the result has the same length as X. Using the language of-alignments, we dene the ED alignment of two time series X andX 0 with T 0 <T : ED =f(1; 1);:::; (T 0 ;T 0 );:::; (T 1;T 0 ); (T;T 0 )g and then dene variable length time series ED as ED(X;X 0 ) = 1 T T X j=1 d x( ED (j; 1));x 0 ( ED (j; 2)) This seems preferable to, for example, truncating the longer time series (which discards potentially useful information) or just choosing an alignment at random. We might consider using a model (e.g., tting a line or training a VAR(L) model) to extend the shorter time series, but doing so adds computational complexity for a presumably small improvement, when the primary virtue of ED is itsO(N) speed. As in the case of MDTW, we normalize by the length of the longer time series so we that can fairly compare distances within our database. 56 To convert Euclidean distance to similarity, we use a radial basis function (RBF) kernel: ED (X;X 0 ) = expfED(X;X 0 )=(2)g where is a parameter we can choose or tune. In our experiments, we let = median X;X 0 2D ED(X;X 0 ), i.e., the median Euclidean distance between time series in our data. While this seems hopelessly naive, researchers have repeatedly shown that ED is often competitive with more sophisticated approaches, particularly for large databases of long time series [173]. Our approach seems reasonable for clinical time series: extending a shorter time series by repeating the nal measurement assumes that the patient's nal condition is relatively stable, which is true for both healthy patients who were discharged and in-hospital mortalities. 3.4.1 Kernelized hashing framework Now we provide an overview of a kernelized time series hashing framework. 
Suppose that we have a databaseD ofP -dimensional, variable length MVTs, wherejDj =N and that we want to represent each with a B bit binary code. We can apply KLSH by performing the following steps: Choose a similarity kernel (e.g., MDTW, GA, VAR, ED). Randomly choose a sampleS =fX j g M j=1 D, withMN, and form theM-by-M kernel matrixK. FormB hash functions. For thejth function, choose a subsampleR j S ofR<M examples to estimate the jth weights vectorw j =K 1=2 e R j . Use the hash functions to generate B bit binary codes for each time series inD. 57 For a new query pointX q , generate its B-bit hash code. Search for thek-nearest neighbors ofX q using standard LSH techniques (e.g., linear scan of binary codes or a hash plus a small number of object comparisons). We can see that while the training stage (i.e., learning our hashing functions) is time- consuming, searching for the nearest neighbors of a new query should be dramatically faster. An exhaustive search over binary codes requires O(N) comparisons, but on mod- ern hardware, these are incredibly fast. If we apply the (1 +")-near neighbor indexing strategy of [104], then we use the binary codes to generate a list ofO(N 1=(1+") ) candidate nearest neighbors and then perform that many similarity comparisons to choose the K best (the choice of " trades o speed and accuracy). Our perceived speedup will also be governed by two other factors: (1) our choice and implementation of similarity (or distance) function; and (2) the \size" of our database in terms of N, P , and T . If we have a fast similarity function (e.g., ED or MDTW with a narrow Sakoe-Chiba band and early stopping heuristics) and a small number of short univariate time series, then we may notice little or no gain. Even for slower similarity functions, if N is relatively small and we use clever indexing or storage, we may perceive little or no speedup. However, for an expensive similarity function (of the kind that we use for multivariate time series) and a large database, we can expect substantial gains. 58 3.5 Experiments 3.5.1 Data We performed experiments with a number of dierent data sets from dierent domains, including two large, real world clinical databases and one EEG database. Each data set has its own unique set of properties, dynamics, and challenges, but they are all multi- variate time series data sets with moderate to large numbers of examples N, moderate dimensionality P , and medium to long (and often variable) lengths T . PICU data. The PICU data set (picu), a version of which was rst described in [187], is a fully anonymized collection of clinical MVTs recorded over a decade in the pediatric ICU at Children's Hospital LA (CHLA). The version we work with has roughly 10,000 MVTs of P = 13 variables, including vital signs (e.g., heart rate), lab results (e.g., glucose), and subjective assessments (e.g., Glascow coma score). The duration of ICU visits is highly variable, though the vast majority of stays last fewer than seven days (T = 128 for an hourly sampling rate) and over half last fewer than two days (T = 48). This data set in part motivates our interest in this problem, as doctors and clinical researchers are increasingly interested in quantifying patient similarity in terms of complex temporal patterns (rather than a priori personal facts, like age or gender) and of discovering or searching for similar cases in large historical EHR databases. This data comes with many interesting potential labels and responses; we utilize two in our experiments. 
3.5 Experiments

3.5.1 Data

We performed experiments with a number of different data sets from different domains, including two large, real-world clinical databases and one EEG database. Each data set has its own unique set of properties, dynamics, and challenges, but they are all multivariate time series data sets with moderate to large numbers of examples N, moderate dimensionality P, and medium to long (and often variable) lengths T.

PICU data. The PICU data set (picu), a version of which was first described in [187], is a fully anonymized collection of clinical MVTs recorded over a decade in the pediatric ICU at Children's Hospital LA (CHLA). The version we work with has roughly 10,000 MVTs of P = 13 variables, including vital signs (e.g., heart rate), lab results (e.g., glucose), and subjective assessments (e.g., Glasgow coma score). The duration of ICU visits is highly variable, though the vast majority of stays last fewer than seven days (T = 128 for an hourly sampling rate) and over half last fewer than two days (T = 48). This data set in part motivates our interest in this problem, as doctors and clinical researchers are increasingly interested in quantifying patient similarity in terms of complex temporal patterns (rather than a priori personal facts, like age or gender) and in discovering or searching for similar cases in large historical EHR databases.

This data comes with many interesting potential labels and responses; we utilize two in our experiments. The first is the patient's primary diagnostic category; each patient is assigned a single primary diagnosis, which comes from one of eleven different categories (there are originally fourteen categories, but we combine some of the extremely small ones). We use this category as a multiclass label. The second response we use is the Pediatric Cerebral Performance Category (PCPC) code, an integer-valued score between 1 and 5 indicating a patient's level of cognitive function [90].

Real-world clinical time series data are very challenging to work with, and standard techniques often fail dramatically. [187] describe the challenges (and potential) of this kind of data in elegant detail. These include missing time series, irregular sampling, wildly different dynamics between variables, noise, and age dependency. Like [187], we impose an hourly bucketing on the data (taking the mean of multiple measurements within the same bucket). Where this creates missing values, we propagate forward the previous measurement; variables with a sampling frequency of less than an hour typically change very slowly, so this is a fairly reasonable assumption. When a time series is missing entirely, we impute a normal value for this variable; this is also a reasonable assumption, as variables are often missing because clinical staff believed them to be normal and chose not to measure them.
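A minimal sketch of this bucketing, forward-filling, and imputation pipeline is given below using pandas. The column names and the table of clinically normal values are illustrative assumptions; they are not the exact schema of the CHLA data.

```python
import pandas as pd

def preprocess_episode(df, variables, normal_values):
    """Hourly bucketing, forward-filling, and normal-value imputation for one episode.

    df: long-format measurements with columns ['time', 'variable', 'value'],
        where 'time' is a datetime column (illustrative schema);
    normal_values: dict mapping variable name -> clinically normal value.
    """
    df = df.copy()
    df['hour'] = df['time'].dt.floor('H')
    # mean of multiple measurements falling in the same hour-long bucket
    hourly = (df.groupby(['hour', 'variable'])['value']
                .mean()
                .unstack('variable')
                .reindex(columns=variables))
    # propagate the previous measurement forward across empty buckets
    hourly = hourly.ffill()
    # entirely missing variables (and leading gaps) get a normal value
    for var in variables:
        hourly[var] = hourly[var].fillna(normal_values[var])
    return hourly  # rows = hours, columns = the P physiologic variables
```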
Surgical data. The surgical data set (surgical) includes fully anonymized MVTs extracted from an operating room anesthesia database at CHLA. It includes over 60,000 MVTs that describe patient status and treatment during surgery. This data is sampled at a much higher frequency than the picu data set (roughly one measurement per minute, versus one measurement per hour), yielding richer and longer time series. There are P = 15 variables, consisting largely of vital signs, and length varies between T = 30 samples and values of T in the thousands. The median value of T is 80. surgical exhibits many of the same phenomena as picu (missing data, irregular sampling, noise), though less extreme due to the higher sampling frequency. We apply the same kind of preprocessing to surgical as we did to picu.

Clinical researchers are interested in searching this database for adverse events (for example, a desaturation or a bronchospasm). However, only about 1% of cases have a label (all positive); the remaining cases have no label, positive or negative. This is both a limitation and an opportunity. It limits our ability to assess the performance of search algorithms on the currently labeled data set. However, it also illustrates a potential use case for time series similarity search over large medical databases: finding and labeling potential adverse events in such databases would enable both clinical research and quality assurance studies. Finding such events is a needle-in-a-haystack problem for clinicians; the current suite of software tools available at most hospitals limits them to searching for keywords in free-text notes (unreliable) or "eyeballing" thousands of cases (tedious and unreliable).

EEG data. The EEG Database data set² (eeg) is a classic machine learning benchmark from the UCI Machine Learning Repository [16] that was originally collected during a study on the correlation between EEG patterns and alcoholism. It is a reasonably large MVT data set, including nearly 11,000 MVTs with P = 64 channels and consistent length T = 256. While it is smaller than surgical, it has four times as many variables and is longer on average. It includes binary labels indicating whether the subject is an alcoholic and is reasonably class balanced (64% versus 36%).

Normalization. In the above data sets, the ranges and magnitudes of different variables vary substantially; using this data as-is could bias our time series distances. Furthermore, it is well known that both of the shape-based kernels (MDTW and GA) work best with zero-mean, unit-variance data and that VAR models make similar assumptions, including stationarity. Thus, we applied standardization, including shifting (e.g., by the mean) and scaling (e.g., by the standard deviation). We used the overall mean and standard deviation (i.e., computed across all training data), as opposed to each individual MVT's statistics. While this does not in fact ensure that each time series is zero mean and unit variance, we feel this was a reasonable compromise for our purposes.

² https://archive.ics.uci.edu/ml/datasets/EEG+Database

3.5.2 Design and goals

The goal of our experiments is to demonstrate that kernelized hashing can be utilized to speed up time series similarity search based on a variety of different similarity metrics, as well as to confirm our hypothesis that no one similarity measure is best. To that end, we evaluate the performance of our algorithms using the following three criteria, inspired by parallel work on hashing in computer vision [108]:

1. Semantic accuracy: For data sets where ground-truth labels are available, we measure the semantic accuracy sacc(X_q, knn(X_q)) of a knn search by the fraction of nearest neighbor labels that agree with X_q's label, where

sacc(X_q, {X_j}_{j=1}^{K}) = Σ_{j=1}^{K} 1{y_q = y_j} / K.

This evaluates the potential efficacy of using our search procedure for, e.g., knn classification tasks. We are less interested in the actual semantic accuracy of a particular search technique than in the gap (or difference) between the semantic accuracy of an exhaustive search using a similarity measure and the semantic accuracy of a KLSH search based on the same measure. We define this gap as

gap_κ(X_q; knn_GT, knn_KLSH) = sacc(X_q, knn_GT(X_q)) − sacc(X_q, knn_KLSH(X_q)).

If this gap is small, then our hashing procedure is doing a good job of approximating an exhaustive search; if it is large, then we face a trade-off between computational performance and accuracy. Because we use only predefined kernels, there is no reason to expect our hashing procedure to beat the exhaustive search, and such results are probably anomalous. Note that we use the same measure of accuracy for both binary and multiclass labels. For the integer-valued PCPC response in picu, we measure semantic accuracy using the average mean squared error (MSE) between the query and nearest neighbor responses.

2. Nearest neighbor recall: We can assess the efficacy of our hashing procedure by examining the degree to which it approximates the true neighborhood around a query X_q. We measure this in the form of recall: how many nearest neighbors does KLSH need to retrieve in order to discover a fixed number K of ground-truth nearest neighbors (i.e., those found using the similarity measure)? This can be plotted like a recall-precision curve, with the number of nearest neighbors returned by KLSH on the x-axis and the count of the K target nearest neighbors found on the y-axis.

3. Relative speed-up: Finally, we are interested in just how much of a speed-up we get using hashing vs. an exhaustive search.
We measure this as a ratio of the average query time for the exhaustive search (time g ) divided by the average query time for a hashing-based search (time h ): speedup = time g =time h . The larger this number, the bigger the speed-up. For each experiment, we perform stratied 10-fold or 5-fold cross-validation to esti- mate our performance statistics. We average accuracy, recall, and speed-up across queries and then take the mean of averages across folds. We use no special preprocessing of the data beyond basic standardization and do not tailor our kernels to each data set. For the VAR kernel, we used a xed lag of L = 5, as suggested by [68]. When performing hashing-based search for nearest neighbors, we perform a single linear scan in the space of binary codes, with no subsequent set of comparisons using the core similarity function. It is worth noting that including this last step would improve our accuracy. 3.5.3 Results and Discussion Table 3.1 shows the relative speed-up (time g /time h ) that we get from using KLSH for each kernel: we dene this as the ratio of search time using the true similarity function (time g ) divided by the search time using the hashing-based approach (time h ). We observe a fairly signicant gain in speed across the board, demonstrating the generality of our approach. There are two other important trends in the speed-ups. First, the more expensive the kernel, the more speed-up we receive from using hashing. Second, the speed-up also increases as our data set size grows. The alignment kernels (MDTW and GA) receive over two orders of magnitude boost in their speed for the huge surgical data set. These 64 results conrm our intuition that hashing is a very useful tool for time series similarity search, if our search accuracy is acceptable. The bottom right bar plot in Figure 3.1 shows the actual average query times on the picu data set. The ground truth VAR kernel proved intractable for the surgical and eeg data sets; in this case, hashing is the only viable solution. We present semantic accuracy and gap results in Table 3.2. We see that for all kernels, the gap in accuracy between the ground truth similarity metric and KLSH search is usually small and always less than 0.05. This is pretty remarkable given that we are retrieving nearest neighbors using a pure linear scan of the binary hash codes and performing no distance-based comparisons to rene our results. Also worth noting is that, as we anticipated, there is not a decisive winner among the kernels, though GA tends to have the best performance (in terms of both semantic accuracy and gap) on average. ED performs surprisingly well on picu; we see in Figure 3.1 that it has the best gap for both diagnosis accuracy and PCPC MSE loss. One curious result in Figure 3.1 is the discrepancy between the semantic accuracy results in the top left plot and the 10-nearest neighbor (10nn) recall in the bottom left. KLSH does a much worse job of capturing the true neighborhood structure of MDTW and VAR than it does with GA, but there is relatively small dierence in terms of se- mantic accuracy. This can be explained by the labels that we chose to use for our picu experiments. Notice in Table 3.2 that all ground truth similarities have low 10nn seman- tic accuracy (around 0.2), suggesting that there is little correlation between the structure of any local similarity neighborhood and the labels. This is not surprising, given what we know about the source of our labels. 
We used patient primary diagnosis codes from a custom database for our classification task. Patients typically receive multiple diagnostic codes, but only one is designated as the primary diagnosis, based on a variety of factors not limited to patient status and physiology. For that reason, we feel that the neighborhood recall results are more significant for assessing the success of our time series hashing framework.

Figure 3.1 and Figure 3.2 show the 10nn recall results for picu and surgical, respectively. We see that GA does a modest job of recalling the true neighborhood structure for query points. ED works well for picu but is terrible for surgical. MDTW does not appear to work well at all; this confirms what theory suggests, namely that MDTW is a poor kernel without significant modifications. GA appears to be the superior alignment-based similarity for kernel methods. Again, we note that no one similarity works best for all data sets.

It is disappointing that VAR does not perform well on the picu data set (the only one for which the ground-truth search completed). There are several possible explanations, the main one being that it makes strong modeling assumptions about the data (linear correlations, stationarity, etc.) that are likely not true for these real-world data sets. Linear models seem especially poorly suited for the PICU data, where the sampling rate is so low. Additionally, VAR has many tunable parameters with which we did not experiment, foremost among them the lag parameter L. We omit eeg results because the ground-truth VAR search failed to terminate; this represents an extreme but important case where hashing is the only feasible option.

Kernel     picu     surgical    eeg
ED          9.78      76.50     18.70
MDTW       19.13     163.00     31.73
GA         18.17     175.02     32.08
VAR        20.34        --        --

Table 3.1: Relative speed-up (time_g/time_h) for each time series kernel and data set, relative to an exhaustive search. The ground-truth VAR kernel was intractable for surgical and eeg.

                picu (diagnosis)    eeg      picu (PCPC L2 loss)
ED     sacc          0.205         0.561           2.36
       gap           0.010         0.046           0.053
MDTW   sacc          0.204         0.565           2.53
       gap           0.047         0.049           0.41
GA     sacc          0.210         0.579           2.45
       gap           0.032         0.036           0.193
VAR    sacc          0.181           --            2.41
       gap           0.040           --            0.121

Table 3.2: Semantic (label) accuracy and gap for 10nn retrieval across data sets and kernels. For PCPC codes, we measure the average squared difference from the query point's score (i.e., lower is better).

Figure 3.1: Results for the picu data set. Top row: semantic accuracy gap for the diagnosis label and the PCPC code. Bottom left: 10-nearest neighbor recall. Bottom right: average query times, exhaustive search vs. KLSH.

Figure 3.2: Left: 10-nearest neighbor recall for surgical. Right: semantic accuracy gap for eeg.

3.6 Conclusion and Future Work

There is a growing need for fast and accurate similarity search over multivariate time series, particularly in healthcare. Many time series similarity measures have been proposed by researchers, but no single metric performs consistently well across all data sets and problems. What is more, most of these approaches do not scale to large time series data sets. Kernelized hashing presents a flexible, unified approach to speeding up time series search, regardless of the choice of representation and distance measure. We have described how the general framework of kernelized locality-sensitive hashing (KLSH) can be combined with arbitrary time series similarity metrics to yield efficient and accurate similarity search.
Using three large, complex medical data sets, we demonstrated empirically that this framework is orders of magnitude faster than existing approaches and provides an acceptable trade-off between speed and accuracy.

The future of this line of work is promising. A logical next step would be to incorporate other time series metrics into the KLSH framework and even combine multiple metrics to capture different notions of similarity. We would also like to incorporate supervision into the hash function learning, which has been shown to work well for images [76, 251]. We focused our attention on locality-sensitive hashing and a handful of representative similarity functions, but the kernel approach may be applied to other hashing frameworks. Finally, we feel that the major limitation of this framework is that it is non-adaptive. We would like to explore adaptive, learning-based approaches for selecting coding functions, such as deep learning [208, 251]. We feel that these will not only improve the performance of hashing-based search but may also be useful for discovering interesting structure and generating novel representations of multivariate time series.

Part II

Learning to Diagnose with Deep Neural Networks

Chapter 4

Learning and Analyzing Representations of Clinical Data

Deep learning and related techniques can extract succinct, predictive representations from diverse raw inputs ranging from images to audio signals. In this chapter, we investigate whether this success can be replicated in clinical time series. We show that simple neural networks trained on subsequences from longer clinical time series can extract lower-dimensional representations that are predictive of a variety of diagnoses, as well as in-hospital mortality, and which outperform hand-engineered features. We also propose a heuristic to visualize the temporal patterns detected by hidden units and assess whether the learned patterns are associated with recognizable symptoms. Finally, we apply off-the-shelf tools from causal inference to quantify the degree to which the learned representations are causally predictive of clinical outcomes.

4.1 Introduction

The increasing volume, detail, and availability of stored digital health data offers an unprecedented opportunity to learn richer, data-driven descriptions of health and illness [187]. This principle has driven the rapid development of a program of research known as
They are further unied by a common goal: to not only build successful predictive models but to learn interpretable, clinically meaningful descriptors of health. Computational phenotyping has attracted many re- searchers in machine learning and data mining, applying a broad set of methods (ranging from Gaussian processes to tensor factorization) to learn meaningful representations from a variety of data sources and types (e.g., clinical time series, text, event counts, etc.). [187, 321, 160, 336, 129, 257, 255] Learning robust representations of human physiology is especially challenging because the underlying causes of health and illness span body systems and physiologic processes, creating complex and nonlinear relationships among observed measurements (e.g., patients with septic shock may exhibit fever or hypother- mia). Whereas classic shallow models (e.g., cluster models) may struggle in such settings, properly trained deep neural networks can often discover, model, and disentangle these types of latent factors [23]. Deep learning (e.g., multilayer neural networks) has achieved state of the art results in speech recognition [71] and computer vision [213] and seems well-suited to modeling the complexity of human physiology. However, despite rapid advances in methods [282] and software, [28, 106, 140, 64] there has been comparatively little formal research on interpretation of deep learning architectures [87, 88]. This becomes an especially critical question when applying deep learning to health and medicine. Neural networks are classically viewed as \black box" models that may achieve high predictive accuracy but are uninterpretable by humans and 72 unsafe for use on clinical problems. One solution for unraveling the complex represen- tations produced by deep learning to apply ideas and tools from causal inference [221]. Feed-forward architectures are in fact directed acyclic graphs (DAGs), in which inputs cause higher layer activations, which in turn cause outputs. Thus, they may be thought of as causal models, which makes them amenable to causal analysis. There is a growing body of research on discovering causal relationships among vari- ables from observational data under a variety of assumptions and settings [264, 265, 136]. In particular, we can identify potential causal relationships among the variables if they are not distributed according to a Gaussian distribution. [265] This is often the case for many predictive tasks of interest. In this chapter, we present a rst step toward automatic discovery of causal phenotypes and for cracking open the black box of neural networks, making them more readily applicable to medical data. Our framework is a two stage process. First, we use a simple deep neural network architecture to learn latent represen- tations of physiology from clinical time series. Then we apply a state-of-the-art nonlinear causal inference algorithm [136] to analyze the relationships between these learned phe- notypes and patient outcomes and diseases of interest. We show that this algorithm discovers intuitive patterns of physiology known to be associated with acute illnesses. We also propose an informal causality-based framework for measuring the quality of learned representations. 73 4.2 Related Work While a young eld (at least by that name), computational phenotyping is advancing rapidly, spurred on by the increased adoption of electronic health records (EHRs) and the growing interest from data mining and machine learning researchers. 
There is already a large body of excellent research, much of it published in just the last ve years or so. One popular approach to computational phenotyping is to construct a large multi- dimensional array (i.e., tensor) view of clinical data and then apply dimensionality re- duction or feature selection techniques to learn a lower-dimensional set of latent bases that can be treated as phenotypes. Each basis can be seen as a combination of observa- tions (from the original tensor), while each patient can be represented as a sparse set of phenotypes. This approach has a strong parallel with topic models used widely in text analysis. Two primary examples of this paradigm include [336, 129], which apply such frameworks to outpatient disease data and medicare claims, respectively, with very good results. Such an approach could be applied to physiologic time series, as well. An alternative phenotyping paradigm includes probabilistic models, which assume a generative process and then t the model to data using, e.g., maximum a prior inference. Such models can be robust to uncertainty, noise, and (and some types of missing val- ues) and are often interpretable. [187] used a Gaussian mixture model with a temporal smoothness kernel prior to discover meaningful physiologic patterns (or physiomes) in multivariate time series from acute care settings similar to ours. [255] proposed a Time Series Topic Model (TSTM) that can learn bag-of-AR (autoregressive linear models) rep- resentations from dense time series (e.g., ECG waveforms) and has been used to develop 74 a novel severity of illness score for neonatal patients. [257] recently proposed the Prob- abilistic Subtyping Model, a Bayesian framework that combines splines and Gaussian processes to cluster longitudinal data from chronic Scleroderma patients and discover potentially novel subtypes. To our knowledge, [160] describes one of the rst applications of modern deep learning to clinical time series. They train stacked autoencoders on 30-day windows of uric acid readings to learn features that are competitive with expert-designed features for classify- ing gout versus leukemia. They handle irregular, biased sampling by warping their time series and then sampling from a tted Gaussian process. This framework successfully learns time series features that are both visually intuitive and useful for discerning the two diagnoses. [145] recently demonstrated that neural networks (unsupervised and su- pervised) can be used to discover and detect interpretable subsequences in multivariate physiologic time series that are useful for classifying respiratory conditions. Given that deep learning approaches have achieved breakthrough results in language modeling [196], speech recognition [71], and music transcription [34], we expect similar results (with time and eort) in health and medicine. In deep learning research, feature analysis is often secondary to, e.g., prediction perfor- mance, and focuses on visualization. Strategies include sampling from generative models and optimizing (using, e.g., stochastic gradient ascent) over inputs rather than param- eters [87]. Each method has strengths and weaknesses (e.g., simplicity, computational eciency, local optima), but they share several properties: they work best for data that are easily interpreted by human beings (e.g., images [162]); they employ heuristics and 75 approximations; and they analyze each hidden unit independently. 
However, recent research has begun to provide a more rigorous understanding of the representations learned by deep architectures. [289] showed that the semantics encoded by hidden unit activations in one layer are preserved when projected onto random bases, instead of the next layer's bases. This implies that the practice of interpreting individual units can be misleading. It has also been shown that it is often easy to create artificial inputs that either receive high-confidence predictions from the network but are unrecognizable to humans or are misclassified by the network but are indistinguishable from correctly classified examples [205, 289]. This suggests that the behavior of deep models may be more complex than previously believed.

One solution for unraveling the complex representations produced by deep learning is to apply ideas and tools from causal inference [221]. [45] recently proposed a theoretical framework that reformulates image classification as a causality problem (i.e., an image causes an agent to label it as a 7 or 9) and uses active learning to perform interventions that can separate causal features from spurious correlations. This idea offers a partial solution to the problems described above [289] but requires the ability to perform interventions that may not be possible in clinical data analysis. Alternatively, there is a growing body of research on discovering causal relationships among variables from observational data under a variety of assumptions and settings [264, 265, 136], which we discuss in detail later.

4.3 Methods

In this section we describe our two-stage framework for the discovery and analysis of causal phenotypes from clinical time series data using deep neural networks. We first describe the background of feature (i.e., phenotype) extraction from time series in Section 4.3.1. We then demonstrate how deep neural networks can be used to perform both unsupervised (Section 4.3.2) and supervised (Section 4.3.3) discovery of latent representations of physiology (i.e., phenotypes) from clinical time series. Finally, we show in Section 4.3.4 how state-of-the-art causal inference algorithms can be used to analyze the learned phenotypes and to identify potential causal relationships between phenotypes and critical illness.

4.3.1 Background: feature extraction from time series

Given a multivariate time series with P variables and length T, we can represent it as a matrix X ∈ ℝ^{P×T}. We denote the time series of the pth variable as a row vector x_{p,·} ∈ ℝ^{T} and the tth time step as a column vector x_{·,t} ∈ ℝ^{P}. A feature map for time series X is a function f : ℝ^{P×T} → ℝ^{D} that maps X into a D-dimensional feature space, which can be used for machine learning tasks like classification, segmentation, and indexing [81]. In a medical context, we can think of features as phenotypes. These can take the form of extreme measurements (as in severity of illness scores), thresholds, or important patterns. In multivariate time series, features become increasingly complex to design, so automated feature discovery is an attractive proposition. Given the recent success of deep learning in a variety of applications, it is natural to investigate its effectiveness for feature learning from clinical time series data.
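As a concrete (non-learned) example of such a feature map, the sketch below computes simple per-variable summaries of the kind used by classic severity-of-illness scores. The particular statistics chosen here (extremes, mean, standard deviation, and a linear trend) are illustrative assumptions and are not the exact hand-engineered features used later in this chapter.

```python
import numpy as np

def summary_feature_map(X):
    """Map a P x T multivariate time series to a fixed-length feature vector.

    For each variable we compute min, max, mean, standard deviation, and the
    slope of a least-squares line fit -- a simple f : R^{P x T} -> R^{5P}.
    """
    P, T = X.shape
    t = np.arange(T)
    feats = []
    for p in range(P):
        x = X[p]
        slope = np.polyfit(t, x, deg=1)[0] if T > 1 else 0.0
        feats.extend([x.min(), x.max(), x.mean(), x.std(), slope])
    return np.asarray(feats)
```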
Figure 4.1: Deep neural networks for phenotyping from clinical time series. (a) One-layer SDAE. (b) SDAE second layer. (c) Time series feature map (feature extraction followed by classification).

4.3.2 Unsupervised autoencoders for phenotyping clinical time series

We explore several deep neural network architectures for automatic discovery and detection of important physiologic patterns. We begin with a simple denoising autoencoder (DAE) [304]. This is a one-layer unsupervised model that simultaneously learns paired encoding and decoding functions; it is similar to sparse coding but easier to optimize and to incorporate into deep architectures. Figure 4.1a shows a simple illustration of the DAE. We encode and decode x using rules similar to making a prediction with a logistic function:

h = g(Wx + b),    x̂ = g′(W′h + b′)

where h ∈ [0,1]^{D′} is the latent representation, x̂ is the reconstruction, and g and g′ are elementwise nonlinearities (a common choice is the sigmoid function g(z) = 1/(1 + exp{−z})). The choice of g′ depends on the type of input x. As described later, in this work we scale all variables to fall between 0 and 1, so we can use a sigmoid for decoding as well. As is typical, we also tie the weights, letting W′ = W^⊤. Finally, in DAEs we actually add random corruption to the input before encoding, sampling x̃ ∼ P_corr(x̃ | x). In our case, P_corr applies a binary masking to x, zeroing out each entry independently with some probability p.

We train the weights by minimizing the reconstruction loss for each training example. For [0,1] inputs, we use the cross-entropy loss

L = − Σ_{d=1}^{D} ( x_d log x̂_d + (1 − x_d) log(1 − x̂_d) )

where x_d is the dth dimension of x. Note that for a DAE, x̂ = g′(W^⊤ g(W x̃ + b) + b′), i.e., the reconstruction of the corrupted input x̃. We use standard (stochastic) gradient methods to minimize the reconstruction error with respect to W, b, and b′ for all training examples.

We can construct a deep autoencoder by stacking multiple DAEs, forming a stacked denoising autoencoder (SDAE) [305]. SDAEs are typically trained using greedy layer-wise training [26], as shown in Figure 4.1b. Once the weights for layer ℓ have been trained, we can map each training example into its feature space, producing h^{(ℓ)}. This then becomes the input for training the (ℓ+1)th layer. Note that h^{(0)} = x, the input.

In our setting, x = vec(X), a vectorization of our P × T time series. We can then use an SDAE of any number of layers as a feature map f for time series, as shown in Figure 4.1c. Each element of h^{(ℓ)} is a nonlinear function of the SDAE's inputs, meaning that it can capture complex correlations across both time and variables. This makes it a well-suited tool for phenotyping from clinical time series, especially when working with relatively small data sets with few or unreliable labels [160].
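A minimal NumPy sketch of this denoising autoencoder appears below. The masking probability, learning rate, and initialization scheme are illustrative assumptions; a practical implementation would rely on an automatic-differentiation library (our experiments used Theano), but the hand-derived gradients make the tied-weight update explicit.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class DenoisingAutoencoder:
    """One-layer DAE with tied weights: h = g(W x_tilde + b), x_hat = g(W^T h + b')."""

    def __init__(self, n_in, n_hidden, corruption=0.3, lr=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = self.rng.normal(scale=0.01, size=(n_hidden, n_in))
        self.b = np.zeros(n_hidden)
        self.b_prime = np.zeros(n_in)
        self.p, self.lr = corruption, lr

    def encode(self, x):
        return sigmoid(self.W @ x + self.b)

    def sgd_step(self, x):
        # corrupt the input by zeroing entries independently with probability p
        x_tilde = x * (self.rng.random(x.shape) > self.p)
        h = self.encode(x_tilde)
        x_hat = sigmoid(self.W.T @ h + self.b_prime)
        # gradients of the cross-entropy reconstruction loss w.r.t. pre-activations
        d_out = x_hat - x                          # output layer
        d_hid = (self.W @ d_out) * h * (1.0 - h)   # hidden layer
        # tied weights: W receives contributions from both encoder and decoder paths
        grad_W = np.outer(d_hid, x_tilde) + np.outer(h, d_out)
        self.W -= self.lr * grad_W
        self.b -= self.lr * d_hid
        self.b_prime -= self.lr * d_out
        # return the reconstruction loss for monitoring
        return -np.sum(x * np.log(x_hat + 1e-7) + (1 - x) * np.log(1 - x_hat + 1e-7))
```

Stacking amounts to training one such model, mapping every example through `encode`, and using the resulting codes as the inputs to the next layer's DAE.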
4.3.3 Supervised neural networks for phenotyping clinical time series

We can convert a deep autoencoder into a deep feed-forward neural network by adding an additional output layer to make predictions, as shown in Figure 4.1c. For binary classification, we typically use a sigmoid nonlinearity applied to a linear activation, i.e., a logistic regression:

y_k = σ(β_k^⊤ h^{(ℓ)})

where y_k is the kth output unit and h^{(ℓ)} are the hidden unit activations (we omit the bias for brevity). Neural networks lend themselves naturally to multi-output prediction problems (also called multi-label or multi-task learning), and training such networks can often improve prediction performance by enabling the network to discover shared features that are useful across a range of tasks. In a medical context, this approach can be used to train a single model to predict multiple outcomes or diagnoses.

A neural network with L hidden layers and an output layer has hidden layer parameters {(W^{(ℓ)}, b^{(ℓ)})}_{ℓ=1}^{L} and output parameters {β_k}_{k=1}^{K} for D^{(L)} hidden units in the Lth layer and K outputs. For K binary classification tasks, the loss function during supervised training also uses cross-entropy but with the true labels (vs. the reconstruction, as in the SDAE):

L = − Σ_{k=1}^{K} ( y_k log σ(β_k^⊤ h^{(L)}) + (1 − y_k) log(1 − σ(β_k^⊤ h^{(L)})) )

where h^{(L)} = g(W^{(L)} h^{(L−1)} + b^{(L)}) and h^{(0)} = x. Again, we minimize the loss with respect to all model parameters using (stochastic) gradient descent and backpropagation.

4.3.4 Discovery of causal phenotypes from clinical time series

One of the main advantages of deep learning is its ability to disentangle factors of variation that are present in the data but unobserved [23]. This makes subsequent learning (i.e., training a classifier) much easier since it counteracts the curse of dimensionality [27]. In addition, knowledge about one factor usually improves estimation of another [244]. However, it can often be difficult to analyze and understand the learned latent representations and to determine whether the model is learning truly important relationships or spurious correlations. One way to explore and demonstrate this is to perform a causal analysis of the features extracted by the hidden layers of a deep neural network. Disentangled representations should have clearer and stronger causal relationships with other variables of interest (e.g., mortality) than raw outputs and other choices of features. Additionally, causality is of primary interest in medicine and health, especially if analytics will contribute to decisions about treatment and care, which can significantly impact patient lives and outcomes. Thus, discerning correlation from true causal relationships is of vital importance.

Given a set of features denoted by h ∈ ℝ^{D^{(L)}} and a response variable y, we investigate the causal relationship between each feature h_j, j = 1, ..., D^{(L)}, and the response variable y. There are two options: either the direction of the edge is from feature to response variable, h_j → y, or vice versa, h_j ← y. We are interested only in the former case, where the features are causally predictive of the response variable. Thus, we need a causal discovery procedure to find the direction of causation between the features and the response variable.

Classic causal inference algorithms require a set of causal priors to be available for the variables in order to cancel out the impact of spurious causation paths [221]. While we often do not have such priors available for our outputs, we can still identify causation among the variables if they are not distributed according to a Gaussian distribution [264, 265, 136]. Binary labels (e.g., mortality prediction) satisfy the requirements of many causal inference frameworks. We apply a state-of-the-art causal inference algorithm, Pairwise LiNGAM [136], based on DirectLiNGAM, in order to discover the causal edges between each feature and the response variable. The key idea of this algorithm is to compute the likelihood ratio of the two models h_j → y and h_j ← y for j = 1, ..., D^{(L)} and select the direction that makes the log-likelihood ratio positive. In particular, we have

R = (1/n) L(h_j → y) − (1/n) L(h_j ← y),

and we conclude h_j → y if R > 0 and h_j ← y if R < 0, where n denotes the number of observations. The log-likelihood values are computed using non-parametric entropy estimation techniques. Pairwise LiNGAM requires that the two variables be non-Gaussian distributed, which makes it especially useful for analyzing deep neural networks with nonlinear activation functions in hidden layers and logistic outputs. This is the case in our setting.
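The following sketch illustrates the per-feature direction test. Rather than the non-parametric entropy estimator used in our experiments, it uses a simple tanh-based first-order approximation of the pairwise likelihood ratio that appears in the pairwise LiNGAM literature; this choice, along with the function names, is an illustrative assumption.

```python
import numpy as np

def standardize(v):
    return (v - v.mean()) / v.std()

def pairwise_direction(h_j, y):
    """Decide between h_j -> y and h_j <- y for one feature/response pair.

    Uses a tanh-based approximation of the pairwise likelihood-ratio statistic
    (intended for non-Gaussian variables); a positive statistic favors h_j -> y.
    """
    x, z = standardize(np.asarray(h_j, float)), standardize(np.asarray(y, float))
    rho = np.mean(x * z)
    R = rho * np.mean(x * np.tanh(z) - np.tanh(x) * z)
    return 'h_j -> y' if R > 0 else 'h_j <- y'

def select_causal_features(H, y):
    """Return indices j of hidden-unit features H[:, j] judged to cause y."""
    return [j for j in range(H.shape[1])
            if pairwise_direction(H[:, j], y) == 'h_j -> y']
```

The selected features can then be passed to the logistic regression described next, whose coefficient norm serves as our informal measure of causal power.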
The log-likelihood values are computed using the non-parametric entropy estimation techniques. Pairwise LiNGAM requires that the two variables be non-Gaussian distributed, which makes it especially useful for analyzing deep neural networks with nonlinear activation functions in hidden layers and logistic outputs. This is the case in our setting. It is important to emphasize that causal inference algorithms do not necessarily select those features that are most correlated with the response or most useful in predicting it. Our goal in causal analysis, rather, is to discover a subset of features that are the best candidates to be true causes of the response and which may provide insight (not necessarily more predictive power). Next we propose an informal method for quantifying the causal power of features (derived or learned). After learning the causal features, we follow the recommendation recommendation[136], we t a logistic regression model to the features selected by the causality discovery algorithm as follows: b ;b 0 = argmax ; 0 ( n X i=1 h y i log( > ~ h i + 0 ) + (1y i ) log(1( > ~ h i + 0 ) i ) where ~ h represent the set of features selected by the causality discovery algorithm and ; 0 denote the prediction vector and the intercept, respectively. We treat the resulting weights as the magnitude of each variable's causal relationship. Finally, we use the L 2 norm of the regression coecient vectorkk 2 to quantify the overall causal power of the features being analyzed. We can use this to compare the causal power of dierent representations. 83 4.4 Experiments In order to demonstrate the eectiveness of our framework, we performed a series of phe- notype experiments using two clinical time series data sets collected during the delivery of care in intensive care units (ICUs) at large hospitals. After describing our data and ex- perimental set up, we brie y present quantitative results for several classication tasks, in order to demonstrate the predictive power of features discovered by neural networks (Sec- tion 4.4.1). Then in Section 4.4.2, we apply causal inference tools to the learned features in order to discover the most clinically meaningful features and to analyze the quality of the learned phenotypes. We also provide example visualizations of causal features learned by neural networks that capture clinically signicant physiologic patterns. Physionet Challenge 2012 Data. The rst data set comes from PhysioNet Chal- lenge 2012 website [273] which is a publicly available collection of 8000 multivariate clinical time series from one ICU and three specialty units, including coronary care and cardiac and general surgery recovery units. As with the competition, we focus on mor- tality prediction (from the rst 48 hours of each episode) as our main prediction task. This is a challenging problem: no competition entry scored precision or recall higher than 0.53. We used both Training Subsets (A and B) for any unsupervised training but only the labeled Training Subset A for supervised training and evaluation (to our knowl- edge, labels are not available for Subset B). Each episode is a multivariate time series of roughly 48 hours and containing over 30 variables. While each episode also has a variety of static variables available (e.g., age, weight, gender), we focus our experiments on just 84 the time series. In all supervised learning experiments, we use label stratied 10-folds cross validation when estimating performance scores. PICU Data. 
The second data set consists of ICU clinical time series extracted from the electronic health records (EHRs) system from Children's Hospital LA (CHLA), previously described in [187, 146]. The original data set includes roughly ten thousand episodes of varying lengths, but we exclude episodes shorter than 24 hours, yielding a data set of roughly 8500 multivariate time series of thirteen physiologic variables. Each episode has zero or more associated diagnostic codes from the Ninth Revision of the International Classication of Diseases (ICD-9) [214]. We aggregate the ve-digit ICD- 9 codes according to the standard seventeen broad category codes (e.g., 460-519 for respiratory diseases) and supplementary V and E groups. We then treat predicting each category code as a distinct binary classication task. The sparse multi-label nature of these data prevents us from applying cross-validation; we instead create ve 80/20 random splits of the data, ensuring that each split has a minimum number of positives examples for each ICD-9 label. Preprocessing. We perform three steps of preprocessing to both data sets before analysis. First, we scale each variable to a [0; 1] range. Where variables have known ranges (e.g., Total Glasgow Coma Scale or binary variables), we use those. Otherwise, we treat the 1st and 99th percentiles of all measurements of a variable as its minimum and maximum values. Outliers are truncated to 0 or 1. This is applied to both time series and static variables. Next, we resample all time series to a xed hourly sampling rate using a simple bucketing procedure: we divide each time series into 48 non-overlapping hour-long windows. When a window includes more than one measurement, we take the 85 mean. Where this creates missing values, we propagate forward the previous measure- ment. This makes a reasonable assumption that each time series is relatively stable and that important changes are observed and recorded. Finally, we handle entirely missing time series (e.g., a patient may have zero measurements of end-tidal CO 2 if she is not ventilated) by imputing a \normal" value. For variables without known normals, we use the median of all measurements in the data set. This strategy the fact that missing time series are typically not missing-at-random but rather are missing because clinical sta decided not to measure a particular variable. Often this is because they also assume it is normal. Neural network training. We implemented all neural networks in Theano [28] as variations of a multilayer perceptron with 3-5 hidden layers (of the same size) of sigmoid units. The input layer has PT input units for P variables and T time steps, while the output layer has one sigmoid output unit per label. We initialize each neural network by training it as an unsupervised stacked denoising autoencoder (SDAE). We found this helps signicantly because our data sets are relatively small and our labels are quite sparse. We use minibatch stochastic gradient descent to minimize cross-entropy loss during unsupervised pretraining and logistic loss during supervised netuning. We use ten-fold cross validation, and both neural networks and classiers are not trained on the test folds. Additionally, we use grid search and one training fold to tune parameters (e.g., the strength of the L1 penalty). 86 4.4.1 Classication performance We rst present a quantitative evaluation of the predictive performance of dierent types of features, both hand-designed and learned, on the Physionet Challenge 2012 data set. 
To ensure a fair comparison, we use the same type of classifier in all experiments: a linear support vector machine (SVM) with hinge loss and an L1 regularization penalty [222]. We do not use our neural networks to make predictions. We select the strength of the L1 penalty by performing a grid search over the range [10^{−2}, 10^{2}] and choosing the value that maximizes the Area Under the Precision-Recall Curve (AUPRC) on a held-out subset of our training data. We report both the Area Under the Receiver Operating Characteristic curve (AUROC) and the AUPRC, as well as the precision when recall is 90%. All three metrics are more robust to the class imbalance of our label than accuracy and give us an idea of the trade-off between false negatives and false positives [273].

Our baselines include the raw data and hand-designed features that capture the extremes, central tendency, variation, and trends within the entire time series. While relatively simple from a machine learning perspective, these features are often quite effective for clinical predictive modeling and are similar to those used in classic severity of illness scores [156, 228].

Table 4.1 shows the mortality prediction performance on the Physionet Challenge 2012 data for our best-performing baselines and neural network features. We see that features learned using a 3-layer neural network beat the raw data fairly substantially and are competitive with the hand-designed features.

                               AUROC            AUPRC            Prec@90%Rec
Raw Time Series (R)        0.787 ± 0.029    0.407 ± 0.043     0.221 ± 0.017
Hand-designed Features (H) 0.829 ± 0.021    0.468 ± 0.0479    0.259 ± 0.049
NNet(R,3)                  0.821 ± 0.021    0.444 ± 0.032     0.256 ± 0.030
NNet(H,3)                  0.832 ± 0.016    0.462 ± 0.048     0.272 ± 0.026
H+R                        0.823 ± 0.018    0.438 ± 0.035     0.256 ± 0.032
H+NNet(R,3)                0.845 ± 0.017    0.487 ± 0.047     0.291 ± 0.034

Table 4.1: Classification performance on the Physionet Challenge 2012 data set. We report the mean and standard deviation (across 10 folds) for each metric. We use the following abbreviations: R: raw time series; H: hand-designed features; NNet(I,L): L-layer neural network with input I.

Given the success of neural networks in other domains, it is somewhat disappointing that the learned features do not beat the hand-engineered features soundly. However, we make several observations that temper this disappointment. First, we trained the neural networks on the full 48-hour time series, rather than on shorter subsequences (as in [145]). This both increases the dimensionality of the inputs and substantially reduces the amount of training data for the neural networks. Indeed, we found that adding additional layers beyond 3 hurt performance, suggesting that we did not have enough data to train deeper models. Second, we found that training a neural network on the hand-engineered features themselves actually improved performance (see NNet(H,3) in the table).

Finally, we experimented with combining different kinds of features, as shown in the bottom two rows of the table. We observe that while adding the raw data to the hand-designed features reduces performance slightly, adding the 3-layer neural net features achieves the best performance across all three metrics. This suggests that the neural network is discovering features that are not only useful for classification but also not redundant with the hand-designed features in terms of discriminative power. What is more, as we shall see in Section 4.4.2, they have increased causal power.
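The three metrics reported above can be computed directly from the classifier's continuous scores. A minimal sketch using scikit-learn is shown below; the exact handling of ties and interpolation on the precision-recall curve is an illustrative assumption, and average precision is used as the AUPRC estimate.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, precision_recall_curve

def evaluate_scores(y_true, y_score, recall_target=0.90):
    """AUROC, AUPRC, and precision at a fixed recall level (e.g., 90%) for one fold."""
    auroc = roc_auc_score(y_true, y_score)
    auprc = average_precision_score(y_true, y_score)
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    # highest precision achievable while keeping recall at or above the target
    prec_at_rec = precision[recall >= recall_target].max()
    return {'AUROC': auroc, 'AUPRC': auprc,
            'Prec@%d%%Rec' % int(100 * recall_target): prec_at_rec}
```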
Interestingly, we found minimal difference between the performance of features learned using unsupervised and supervised neural networks. We speculate that this has two causes. The principal reason is the class imbalance in our labels, which we did not attempt to handle in any way during neural network training. Second, we believe that mortality prediction from early admission data is a difficult problem and that it may not be possible to do substantially better (our results are quite similar to those from the competition itself).

4.4.2 Causal analysis

Next, we perform causal inference on the hand-designed and learned features using the framework described in Section 4.3.4. Table 4.2 shows the per-feature causal power (i.e., the L1 norm of the coefficient vector divided by the number of features) of the raw, hand-designed, and neural network features for Acute Respiratory Distress Syndrome (ARDS) in the PICU data. We see that the neural network features have a score orders of magnitude larger than either of the baselines.

Table 4.2: Magnitude of the causal relationships identified using the representations learned by the deep learning algorithm.

Raw Time Series    Hand-designed Features    NNet(R,3)
0.013 ± 0.0063     0.093 ± 0.058             0.25 ± 0.27

Figure 4.2 shows visualizations of two of the significant physiologic patterns learned by the neural networks. For each, we used causal inference to discover the subset of features with the strongest causal relationship with our outcome of interest. Then we found the 50 input subsequences with the highest activations in those units and plotted the mean trajectories for some or all physiologic variables. Figure 4.2a visualizes features that were found to be causally related to the ICD-9 circulatory disease category from the PICU data. We see these features detect highly elevated blood pressure and heart rate, as well as depressed pH. The features also detect elevated end-tidal CO2 (ETCO2) and fraction of inspired oxygen (FIO2), which likely indicate ventilation and severe critical illness. Interestingly, these features also detect elevated urine output, and thus it is not surprising that these features are also correlated with diagnostic labels related to urinary disorders. Figure 4.2b visualizes the first-48-hour physiologic patterns detected by features that are causal of mortality in the Physionet Challenge 2012 data.

Figure 4.2: Causal features learned from ICU time series. (a) Features selected by causal inference for the ICD-9 circulatory disease category (390-459). (b) 48-hour causal phenotype for mortality, learned from the Physionet Challenge data.

4.5 Discussion

We have presented a simple, two-stage framework for discovering latent phenotypes from clinical time series that have strong causal relationships with patient outcomes and critical illness. Our framework combines feature learning using neural networks with nonlinear causal inference tools to discover latent phenotypes that are causally predictive of clinical outcomes. While our results are preliminary, we believe that this general line of research will help us discover more clinically meaningful representations of health and illness and eventually develop tools for automatic discovery of causal phenotypes.

4.6 Acknowledgements

David Kale was supported by the Alfred E. Mann Innovation in Engineering Doctoral Fellowship, and the VPICU was supported by grants from the Laura P. and Leland K. Whittier Foundation. Mohammad Taha Bahadori was supported by NSF award number IIS-1254206. Yan Liu was supported by NSF IIS-1134990 and IIS-1254206 awards.
The views and conclusions are those of the authors and should not be interpreted as representing the ocial policies of the funding agency, or the U.S. Government. 91 Chapter 5 Multilabel Classication of Long Clinical Time Series Recurrent neural networks (RNNs) represent the state-of-the-art for a variety of sequential modeling problems owing to their ability to model a large number of hidden states with nonlinear dynamics and temporal correlations. This makes them a natural choice for analyzing clinical time series recorded hospitalized patients. However, clinical data pose a number of challenges to standard RNNs, including irregular sampling and potentially very long dependencies. In this chapter, we investigate whether RNNs can be used successfully to classify multiple acute care diagnoses from physiologic time series. We nd that they can but are outperformed by a strong baseline consisting of a multilayer perceptron using hand-engineered features. Hypothesizing that this is due to the diculty of capturing extremely long dependencies, we design a simple strategy that replicates the diagnostic labels at each time step. This approach not only improves predictive accuracy but also speeds up training and reduces overtting. The resulting model outperforms all baselines when predicting over one hundred critical illnesses. 92 5.1 Introduction Time series data comprised of clinical measurements, as recorded by caregivers in the pediatric intensive care unit (PICU), constitute an abundant and largely untapped source of medical insights. Potential uses of such data include classifying diagnoses accurately, predicting length of stay, predicting future illness, and predicting mortality. However, besides the diculty of acquiring data, several obstacles stymie machine learning research with clinical time series. Episodes vary in length, with stays ranging from just a few hours to multiple months. Observations, which include sensor data, vital signs, lab test results, and subjective assessments, are sampled irregularly and plagued by missing values [187]. Additionally, long-term time dependencies complicate learning with many algorithms. Lab results that, taken together, might imply a particular diagnosis may be separated by days or weeks. Long delays often separate onset of disease from the appearance of symptoms. For example, symptoms of acute respiratory distress syndrome may not appear until 24-48 hours after lung injury [189], while symptoms of an asthma attack may present shortly after admission but change or disappear following treatment. Recurrent Neural Networks (RNNs), in particular those based on Long Short-Term Memory (LSTM) [131], model varying-length sequential data, achieving state-of-the-art results for problems spanning natural language processing, image captioning, handwriting recognition, and genomic analysis [15, 287, 307, 147, 181, 111, 229, 308, 322]. LSTMs can capture long range dependencies and nonlinear dynamics. Some sequence models, such as Markov models, conditional random elds, and Kalman lters, deal with sequential data but are ill-equipped to learn long-range dependencies. Other models require domain 93 knowledge or feature engineering, oering less chance for serendipitous discovery. In contrast, neural networks learn representations and can discover unforeseen structure. This chapter presents the rst empirical study using LSTMs to classify diagnoses given multivariate PICU time series. 
Specically, we formulate the problem as multil- abel classication, since diagnoses are not mutually exclusive. Our examples are clinical episodes, each consisting of 13 frequently but irregularly sampled time series of clinical measurements, including body temperature, heart rate, diastolic and systolic blood pres- sure, and blood glucose, among others. Associated with each patient are a subset of 429 diagnosis codes. As some are rare, we focus on the 128 most common codes, classifying each episode with one or more diagnoses. Because LSTMs have never been used in this setting, we rst verify their utility and compare their performance to a set of strong baselines, including both a linear classier and a MultiLayer Perceptron (MLP). We train the baselines on both a xed window and hand-engineered features. We then test a straightforward target replication strategy for recurrent neural networks, inspired by the deep supervision technique of [165] for training convolutional neural networks. We compose our optimization objective as a convex combination of the loss at the nal sequence step and the mean of the losses over all sequence steps. Additionally, we evaluate the ecacy of using additional information in the patient's chart as auxiliary outputs, a technique previously used with feedforward nets [41], showing that it reduces overtting. Finally, we apply dropout to non-recurrent connections, which improves the performance further. LSTMs with target replication and dropout surpass the performance of the best baseline, namely an MLP trained on hand-engineered features, even though the LSTM has access only to raw time series. 94 5.2 Related Work Our research sits at the intersection of LSTMs, medical informatics, and multilabel clas- sication, three mature elds, each with a long history and rich body of research. While we cannot do justice to all three, we highlight the most relevant works below. 5.2.1 LSTM RNNs LSTMs were originally introduced in [131], following a long line of research into RNNs for sequence learning. Notable earlier work includes [248], which introduced backpropagation through time, and [86], which successfully trained RNNs to perform supervised machine learning tasks with sequential inputs and outputs. The design of modern LSTM mem- ory cells has remained close to the original, with the commonly used addition of forget gates [98] (which we use), and peep-hole connections [97] (which we do not use). The connectivity pattern among multiple LSTM layers in our models follows the architecture described by [110]. [217] explores other mechanisms by which an RNN could be made deep. Surveys of the literature include [109], a thorough dissertation on sequence labeling with RNNs, [79], which surveys natural language applications, and [177], which provides a broad overview of RNNs for sequence learning, focusing on modern applications. 5.2.2 Neural Networks for Medical Data Neural networks have been applied to medical problems and data for at least 20 years [41, 21], although we know of no work on applying LSTMs to multivariate clinical time series of the type we analyze here. Several papers have applied RNNs to physiologic signals, including electrocardiograms [271, 9, 294] and glucose measurements [293]. RNNs 95 have also been used for prediction problems in genomics [229, 322, 308]. Multiple recent papers apply modern deep learning techniques (but not RNNs) to modeling psychological conditions [69], head injuries [247], and Parkinson's disease [118]. 
Recently, feedforward networks have been applied to medical time series in sliding window fashion to classify cases of gout, leukemia [160], and critical illness [50]. 5.2.3 Neural Networks for Multilabel Classication Only a few published papers apply LSTMs to multilabel classication tasks, all of which, to our knowledge, are outside of the medical context. [180] formulates music composition as a multilabel classication task, using sigmoidal output units. Most recently, [328] uses LSTM networks with multilabel outputs to recognize actions in videos. While we could not locate any published papers using LSTMs for multilabel classication in the medical domain, several papers use feedforward nets for this task. One of the earliest papers to investigate multi-task neural networks modeled risk in pneumonia patients [41]. More recently, [50] formulated diagnosis as multilabel classication using a sliding window multilayer perceptron. 5.2.4 Machine Learning for Clinical Time Series Neural network methodology aside, a growing body of research applies machine learning to temporal clinical data for tasks including artifact removal [7, 235], early detection and prediction [283, 126], and clustering and subtyping [187, 257]. Many recent papers use models with latent factors to capture nonlinear dynamics in clinical time series and to discover meaningful representations of health and illness. Gaussian processes are popular 96 because they can directly handle irregular sampling and encode prior knowledge via choice of covariance functions between time steps and across variables [187, 101]. [254] combined a hierarchical dirichlet process with autoregressive models to infer latent disease \topics" in the heart rate signals of premature babies. [235] used linear dynamical systems with latent switching variables to model physiologic events like bradycardias. Seeking deeper models, [284] proposed a second \layer" of latent factors to capture correlations between latent states. 5.2.5 Target Replication In this work, we make the task of classifying entire sequences easier by replicating targets at every time step, inspired by [165], who place an optimization objective after each layer in convolutional neural network. While they have a separate set of weights to learn each intermediate objective, our model is simpler owing to the weight tying in recurrent nets, having only one set of output weights. Additionally, unlike [165], we place targets at each time step, but not following each layer between input and output in the LSTM. After nishing this manuscript, we learned that target replication strategies similar to ours have also been developed by [332] and [72] for the tasks of video classication and character- level document classication respectively. [332] linearly scale the importance of each intermediate target, emphasizing performance at later sequence steps over those in the beginning of the clip. [72] also use a target replication strategy with linearly increasing weight for character-level document classication, showing signicant improvements in accuracy. They call this technique linear gain. 97 5.2.6 Regularizing Recurrent Neural Networks Given the complexity of our models and modest scale of our data, regularization, in- cluding judicious use of dropout, is crucial to our performance. Several prior works use dropout to regularize RNNs. [224], [333], and [72] all describe an application of dropout to only the non-recurrent weights of a network. 
The former two papers establish the method and apply it to tasks with sequential outputs, including handwriting recognition, image captioning, and machine translation. The setting studied by [72] most closely resembles ours, as the authors apply it to the task of applying static labels to variable-length sequences.

5.2.7 Key Differences

Our experiments show that LSTMs can accurately classify multivariate time series of clinical measurements, a topic not addressed in any prior work. Additionally, while some papers use LSTMs for multilabel classification, ours is the first to address this problem in the medical context. Moreover, for multilabel classification of sequential clinical data with fixed-length output vectors, our work is the first, to our knowledge, to demonstrate the efficacy of a target replication strategy, achieving both faster training and better generalization.

5.3 Data Description

Our experiments use a collection of anonymized clinical time series extracted from the EHR system at Children's Hospital LA [187, 51] as part of an IRB-approved study. The data consists of 10,401 PICU episodes, each a multivariate time series of 13 variables: diastolic and systolic blood pressure, peripheral capillary refill rate, end-tidal CO2, fraction of inspired O2, Glasgow coma scale, blood glucose, heart rate, pH, respiratory rate, blood oxygen saturation, body temperature, and urine output. Episodes vary in length from 12 hours to several months.

Each example consists of irregularly sampled multivariate time series with both missing values and, occasionally, missing variables. We resample all time series to an hourly rate, taking the mean measurement within each one-hour window. We use forward- and back-filling to fill gaps created by the window-based resampling. When a single variable's time series is missing entirely, we impute a clinically normal value as defined by domain experts. These procedures make reasonable assumptions about clinical practice: many variables are recorded at rates proportional to how quickly they change, and when a variable is absent, it is often because clinicians believed it to be normal and chose not to measure it. Nonetheless, these procedures are not appropriate in all settings. Back-filling, for example, passes information from the future backwards. This is acceptable for classifying entire episodes (as we do) but not for forecasting. Finally, we rescale all variables to [0, 1], using ranges defined by clinical experts. In addition, we use published tables of normal values from large population studies to correct for differences in heart rate and respiratory rate [91] and blood pressure [201] due to age and gender.
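The resampling and filling pipeline described above is straightforward to reproduce. The following sketch, using pandas on a hypothetical long-format table with columns time, variable, and value (the column names and the range and normal-value dictionaries are our assumptions, not the exact pipeline used for this thesis), illustrates the hourly discretization, forward- and back-filling, normal-value imputation for absent variables, and rescaling to [0, 1]; the age- and gender-based corrections are omitted.

import pandas as pd

# Hypothetical expert-defined [low, high] ranges and clinically normal values,
# one entry per variable; only two variables are shown for brevity.
VARIABLE_RANGES = {"heart_rate": (30.0, 220.0), "ph": (6.8, 7.8)}
NORMAL_VALUES = {"heart_rate": 0.5, "ph": 0.55}  # already on the [0, 1] scale


def preprocess_episode(obs, variables):
    """Convert one episode's raw observations to an hourly, filled, scaled matrix.

    obs: long-format frame with columns 'time' (datetime), 'variable', 'value'.
    """
    # One column per clinical variable, one row per raw observation time.
    wide = obs.pivot_table(index="time", columns="variable", values="value", aggfunc="mean")
    # Hourly resampling: mean of all measurements within each one-hour window.
    hourly = wide.resample("1H").mean()
    # Forward- and back-fill gaps created by the window-based resampling.
    hourly = hourly.ffill().bfill()
    out = pd.DataFrame(index=hourly.index)
    for var in variables:
        lo, hi = VARIABLE_RANGES[var]
        if var in hourly.columns:
            # Rescale to [0, 1] using expert-defined ranges, clipping outliers.
            out[var] = ((hourly[var] - lo) / (hi - lo)).clip(0.0, 1.0)
        else:
            # Variable never measured in this episode: impute a normal value.
            out[var] = NORMAL_VALUES[var]
    return out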
Each episode is associated with zero or more diagnostic codes from an in-house taxonomy used for research and billing, similar to the Ninth Revision of the International Classification of Diseases (ICD-9) codes [214]. The data set contains 429 distinct labels indicating a variety of conditions, such as acute respiratory distress, congestive heart failure, seizures, renal failure, and sepsis. Because many of the diagnoses are rare, we focus on the most common 128, each of which occurs more than 50 times in the data. These diagnostic codes are recorded by attending physicians during or shortly after each patient episode and are subject to limited review afterwards. Because the diagnostic codes were assigned by clinicians, our experiments represent a comparison of an LSTM-based diagnostic system to human experts.

We note that an attending physician has access to much more data about each patient than our LSTM does, including additional tests, medications, and treatments. Additionally, the physician can access a full medical history including free-text notes, can make visual and physical inspections of the patient, and can ask questions. A fairer comparison might require asking additional clinical experts to assign diagnoses given access only to the 13 time series available to our models. However, this would be prohibitively expensive, even for just 1,000 examples, and difficult to justify to our medical collaborators, as this annotation would provide no immediate benefit to patients. Such a study will prove more feasible in the future when this line of research has matured.

5.4 Methods

In this work, we are interested in recognizing diagnoses and, more broadly, the observable physiologic characteristics of patients, a task generally termed phenotyping [209]. We cast the problem of phenotyping clinical time series as multilabel classification. Given a series of observations x^{(1)}, ..., x^{(T)}, we learn a classifier to generate hypotheses \hat{y} of the true labels y. Here, t indexes sequence steps and, for any example, T stands for the length of the sequence.

Our proposed LSTM RNN uses memory cells with forget gates [98] but without peephole connections [99]. As output, we use a fully connected layer atop the highest LSTM layer followed by an element-wise sigmoid activation function, because our problem is multilabel. We use log loss as the loss function at each output. The following equations give the update for a layer of memory cells h_l^{(t)}, where h_{l-1}^{(t)} stands for the previous layer at the same sequence step (a previous LSTM layer or the input x^{(t)}) and h_l^{(t-1)} stands for the same layer at the previous sequence step:

g_l^{(t)} = \phi(W_l^{gx} h_{l-1}^{(t)} + W_l^{gh} h_l^{(t-1)} + b_l^{g})
i_l^{(t)} = \sigma(W_l^{ix} h_{l-1}^{(t)} + W_l^{ih} h_l^{(t-1)} + b_l^{i})
f_l^{(t)} = \sigma(W_l^{fx} h_{l-1}^{(t)} + W_l^{fh} h_l^{(t-1)} + b_l^{f})
o_l^{(t)} = \sigma(W_l^{ox} h_{l-1}^{(t)} + W_l^{oh} h_l^{(t-1)} + b_l^{o})
s_l^{(t)} = g_l^{(t)} \odot i_l^{(t)} + s_l^{(t-1)} \odot f_l^{(t)}
h_l^{(t)} = \phi(s_l^{(t)}) \odot o_l^{(t)}

In these equations, \sigma stands for an element-wise application of the sigmoid (logistic) function, \phi stands for an element-wise application of the tanh function, and \odot is the Hadamard (element-wise) product. The input, output, and forget gates are denoted by i, o, and f respectively, while g is the input node and has a tanh activation.
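A minimal NumPy sketch of one step of this layer update may make the equations concrete; the parameter container and variable names below are illustrative, not the implementation used for the experiments in this chapter.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(params, h_below, h_prev, s_prev):
    """One memory-cell layer update, following the equations above.

    h_below: activation of the layer below at this sequence step (or the input x^(t)).
    h_prev, s_prev: this layer's hidden and cell state at the previous sequence step.
    params: dict holding weight matrices W_gx, W_gh, ..., W_oh and biases b_g, ..., b_o.
    """
    g = np.tanh(params["W_gx"] @ h_below + params["W_gh"] @ h_prev + params["b_g"])   # input node
    i = sigmoid(params["W_ix"] @ h_below + params["W_ih"] @ h_prev + params["b_i"])   # input gate
    f = sigmoid(params["W_fx"] @ h_below + params["W_fh"] @ h_prev + params["b_f"])   # forget gate
    o = sigmoid(params["W_ox"] @ h_below + params["W_oh"] @ h_prev + params["b_o"])   # output gate
    s = g * i + s_prev * f        # new cell state
    h = np.tanh(s) * o            # new hidden state, passed up and to the next step
    return h, s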
5.4.1 LSTM Architectures for Multilabel Classification

We explore several recurrent neural network architectures for multilabel classification of time series. The first and simplest (Figure 5.1) passes over all inputs in chronological order, generating outputs only at the final sequence step. In this approach, we only have output \hat{y} at the final sequence step, at which our loss function is the average of the losses at each output node. Thus the loss calculated at a single sequence step is the average of the log loss calculated separately on each label:

loss(\hat{y}, y) = -\frac{1}{|L|} \sum_{l=1}^{|L|} \left( y_l \log(\hat{y}_l) + (1 - y_l) \log(1 - \hat{y}_l) \right)

Figure 5.1: A simple RNN model for multilabel classification. Green rectangles represent inputs. The recurrent hidden layers separating input and output are represented with a single blue rectangle. The red rectangle represents targets.

5.4.2 Sequential Target Replication

One problem with the simple approach is that the network must learn to pass information across many sequence steps in order to affect the output. We attack this problem by replicating our static targets at each sequence step (Figure 5.2), providing a local error signal at each step. This approach is inspired by the deep supervision technique that [165] apply to convolutional nets. This technique is especially sensible in our case because we expect the model to predict accurately even if the sequence were truncated by a small amount. The approach differs from [165] because we use the same output weights to calculate \hat{y}^{(t)} for all t. Further, we use this target replication to generate output at each sequence step, but not at each hidden layer.

For the model with target replication, we generate an output \hat{y}^{(t)} at every sequence step. Our loss is then a convex combination of the final loss and the average of the losses over all steps:

\alpha \cdot \frac{1}{T} \sum_{t=1}^{T} \text{loss}(\hat{y}^{(t)}, y^{(t)}) + (1 - \alpha) \cdot \text{loss}(\hat{y}^{(T)}, y^{(T)})

where T is the total number of sequence steps and \alpha \in [0, 1] is a hyperparameter which determines the relative importance of hitting these intermediary targets. At prediction time, we take only the output at the final step. In our experiments, networks using target replication outperform those with a loss applied only at the final sequence step.

Figure 5.2: An RNN classification model with target replication. The primary target (depicted in red) at the final step is used at prediction time, but during training, the model back-propagates errors from the intermediate targets (purple) at every sequence step.
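Both the per-step log loss and the target replication objective are straightforward to compute; a small NumPy sketch follows (function and variable names are ours, not taken from the original implementation):

import numpy as np

def log_loss(y_hat, y, eps=1e-8):
    """Average binary log loss across labels for a single sequence step."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

def target_replication_loss(y_hat_seq, y, alpha=0.5):
    """Convex combination of the final-step loss and the mean of the per-step losses.

    y_hat_seq: array of shape (T, num_labels), predictions at every sequence step.
    y: array of shape (num_labels,), the static label vector replicated at each step.
    alpha: relative weight on the intermediary targets.
    """
    step_losses = [log_loss(y_hat_seq[t], y) for t in range(len(y_hat_seq))]
    return alpha * float(np.mean(step_losses)) + (1.0 - alpha) * step_losses[-1]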
5.4.3 Auxiliary Output Training

Recall that our initial data contained 429 diagnostic labels but that our task is to predict only 128. Given the well-documented successes of multitask learning with shared representations and feedforward networks, we wish to train a stronger model by using the remaining 301 labels, or other information in the patient's chart such as diagnostic categories, as auxiliary targets [41]. These additional targets serve to reduce overfitting as the model aims to minimize the loss on the labels of interest while also minimizing loss on the auxiliary targets (Figure 5.3).

Figure 5.3: Our data set contains many labels. For our task, a subset of 128 are of interest (depicted in red). Our auxiliary output neural network makes use of extra labels as additional training targets (depicted in purple). At inference time we generate predictions for only the labels of interest.

5.4.4 Regularization

Because we have only 10,401 examples, overfitting is a considerable obstacle. Our experiments show that both target replication and auxiliary outputs improve performance and reduce overfitting. In addition to these less common techniques, we deploy ℓ2^2 weight decay and dropout. Following the example of [333] and [224], we apply dropout to the non-recurrent connections only. We first compute each hidden layer's sequence of activations in the left-to-right direction and then apply dropout before computing the next layer's activations. In our experiments, we find that dropout decreases overfitting, enabling us to double the size of each hidden layer.

5.5 Experiments

All models are trained on 80% of the data and tested on 10%. The remaining 10% is used as a validation set. We train each LSTM for 100 epochs using stochastic gradient descent (SGD) with momentum. To combat exploding gradients, we scale the norm of the gradient and use ℓ2^2 weight decay of 10^-6, both hyperparameters chosen using validation data. Our final networks use 2 hidden layers and either 64 memory cells per layer with no dropout or 128 cells per layer with dropout of 0.5. These architectures are also chosen based on validation performance. Throughout training, we save the model and compute three performance metrics (micro AUC, micro F1, and precision at 10) on the validation set for each epoch. We then test the model that scores best on at least two of the three validation metrics. To break ties, we choose the earlier epoch.

We evaluate a number of baselines as well as LSTMs with various combinations of target replication (TR), dropout (DO), and auxiliary outputs (AO), using either the additional 301 diagnostic labels or 12 diagnostic categories. To explore the regularization effects of each strategy, we record and plot both training and validation performance after each epoch. Additionally, we report the performance of a target replication model (Linear Gain) that scales the weight of each intermediate target linearly, as opposed to our proposed approach. Finally, to show that our LSTM learns a model complementary to the baselines, we evaluate an ensemble of the best LSTM with the best baseline.

5.5.1 Multilabel Evaluation Methodology

We report micro- and macro-averaged versions of Area Under the ROC Curve (AUC). By micro AUC, we mean a single AUC computed on the flattened \hat{Y} and Y matrices, whereas we calculate macro AUC by averaging the per-label AUCs. The blind classifier achieves 0.5 macro AUC but can exceed 0.5 on micro AUC by predicting labels in descending order by base rate. Additionally, we report micro- and macro-averaged F1 scores, computed in similar fashion to the respective micro and macro AUCs. F1 metrics require a thresholding strategy, and here we select thresholds based upon validation set performance. We refer to [176] for an analysis of the strengths and weaknesses of each type of multilabel F-score and a characterization of optimal thresholds. Finally, we report precision at 10, which captures the fraction of true diagnoses among the model's top 10 predictions, with a best possible score of 0.2281 on the test split of this data set because there are on average 2.281 diagnoses per patient.

While F1 and AUC are both useful for determining the relative quality of a classifier's predictions, neither is tailored to a real-world application. Thus, we consider a medically plausible use case to motivate this more interpretable metric: generating a short list of the 10 most probable diagnoses. If we could create a high-recall, moderate-precision list of 10 likely diagnoses, it could be a valuable hint-generation tool for differential diagnosis. Testing for only the 10 most probable conditions is much more realistic than testing for all conditions.
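As a reference point for how these multilabel metrics can be computed, here is one possible NumPy/scikit-learn sketch; it assumes binary label matrix Y_true and predicted-probability matrix Y_score of shape (num_examples, num_labels), and is not necessarily identical to the evaluation code behind the reported numbers.

import numpy as np
from sklearn.metrics import roc_auc_score

def micro_macro_auc(Y_true, Y_score):
    """Micro AUC on the flattened matrices; macro AUC as the mean of per-label AUCs."""
    micro = roc_auc_score(Y_true.ravel(), Y_score.ravel())
    per_label = [roc_auc_score(Y_true[:, k], Y_score[:, k])
                 for k in range(Y_true.shape[1])
                 if len(np.unique(Y_true[:, k])) == 2]  # skip labels absent from the split
    return micro, float(np.mean(per_label))

def precision_at_k(Y_true, Y_score, k=10):
    """Fraction of true diagnoses among each patient's top-k predictions, averaged."""
    top_k = np.argsort(-Y_score, axis=1)[:, :k]
    hits = np.take_along_axis(Y_true, top_k, axis=1)
    return float(hits.mean())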
5.5.2 Baseline Classifiers

We provide results for a base rate model that predicts diagnoses in descending order by incidence to provide a minimum performance baseline for micro-averaged metrics. We also report the performance of logistic regression, which is widely used in clinical research. We train a separate classifier for each diagnosis but choose an overall ℓ2^2 penalty for all individual classifiers based on validation performance. For a much stronger baseline, we train a multilabel MLP with 3 hidden layers of 300 hidden units each, rectified linear activations, and dropout of 0.5. All MLPs were trained for 1000 epochs, with hyperparameters chosen based on validation set performance.

Each baseline is tested with two sets of inputs: raw time series and hand-engineered features. For raw time series, we use the first and last six hours. This provides classifiers with temporal information about changes in patient state from admission to discharge within a fixed-size input, as required by all baselines. We find this works better than providing the first or last 12 hours alone. Our hand-engineered features are inspired by those used in state-of-the-art severity of illness scores [228]: for each variable, we compute the first and last measurements and their difference scaled by episode length, mean and standard deviation, median and quartiles, minimum and maximum, and the slope of a line fit with least squares. These 143 features capture many of the indicators that clinicians look for in critical illness, including admission and discharge state, extremes, central tendencies, variability, and trends. They previously have been shown to be effective for these data [187, 51]. Our strongest baseline is an MLP using these features.

5.5.3 Results

Our best performing LSTM (LSTM-DO-TR) used two layers of 128 memory cells, dropout of probability 0.5 between layers, and target replication, and outperformed the MLP with hand-engineered features. Moreover, simple ensembles of the best LSTM and MLP outperformed both on all metrics. Table 5.1 shows summary results for all models. Table 5.2 shows the LSTM's predictive performance for the six diagnoses with the highest F1 scores. Full per-diagnosis results can be found in Section 5.10.

Target replication improves performance on all metrics, accelerating learning and reducing overfitting (Figure 5.4). We also find that the LSTM with target replication learns to output correct diagnoses earlier in the time series, a virtue that we explore qualitatively in Section 5.8. As a comparison, we trained an LSTM-DO-TR variant using the linear gain strategy of [332, 72]. In general, this model did not perform as well as our simpler target replication strategy, but it did achieve the highest macro F1 score among the LSTM models.

Auxiliary outputs improved performance for most metrics and reduced overfitting. While the performance improvement is not as dramatic as that conferred by target replication, the regularizing effect is greater. These gains came at the cost of slower training: the auxiliary output models required more epochs (Figure 5.4 and Section 5.9), especially when using the 301 remaining diagnoses. This may be due in part to severe class imbalance in the extra labels. For many of these labels it may take an entire epoch just to learn that they are occasionally nonzero.
Classification performance for 128 ICU phenotypes
Model  mAUC  MAUC  mF1  MF1  Prec. at 10
Base Rate  0.7128  0.5  0.1346  0.0343  0.0788
Log. Reg., First 6 + Last 6  0.8122  0.7404  0.2324  0.1081  0.1016
Log. Reg., Expert features  0.8285  0.7644  0.2502  0.1373  0.1087
MLP, First 6 + Last 6  0.8375  0.7770  0.2698  0.1286  0.1096
MLP, Expert features  0.8551  0.8030  0.2930  0.1475  0.1170
LSTM Models with two 64-cell hidden layers
LSTM  0.8241  0.7573  0.2450  0.1170  0.1047
LSTM, AuxOut (Diagnoses)  0.8351  0.7746  0.2627  0.1309  0.1110
LSTM-AO (Categories)  0.8382  0.7748  0.2651  0.1351  0.1099
LSTM-TR  0.8429  0.7870  0.2702  0.1348  0.1115
LSTM-TR-AO (Diagnoses)  0.8391  0.7866  0.2599  0.1317  0.1085
LSTM-TR-AO (Categories)  0.8439  0.7860  0.2774  0.1330  0.1138
LSTM Models with Dropout (probability 0.5) and two 128-cell hidden layers
LSTM-DO  0.8377  0.7741  0.2748  0.1371  0.1110
LSTM-DO-AO (Diagnoses)  0.8365  0.7785  0.2581  0.1366  0.1104
LSTM-DO-AO (Categories)  0.8399  0.7783  0.2804  0.1361  0.1123
LSTM-DO-TR  0.8560  0.8075  0.2938  0.1485  0.1172
LSTM-DO-TR-AO (Diagnoses)  0.8470  0.7929  0.2735  0.1488  0.1149
LSTM-DO-TR-AO (Categories)  0.8543  0.8015  0.2887  0.1446  0.1161
LSTM-DO-TR (Linear Gain)  0.8480  0.7986  0.2896  0.1530  0.1160
Ensembles of Best MLP and Best LSTM
Mean of LSTM-DO-TR & MLP  0.8611  0.8143  0.2981  0.1553  0.1201
Max of LSTM-DO-TR & MLP  0.8643  0.8194  0.3035  0.1571  0.1218

Table 5.1: Results on performance metrics calculated across all labels. mAUC and mF1 refer to the micro-averaged metrics, MAUC and MF1 to the macro-averaged metrics. DO, TR, and AO indicate dropout, target replication, and auxiliary outputs, respectively. AO (Diagnoses) uses the extra diagnosis codes and AO (Categories) uses diagnostic categories as additional targets during training.

Top 6 diagnoses measured by F1 score
Label  F1  AUC  Precision  Recall
Diabetes mellitus with ketoacidosis  0.8571  0.9966  1.0000  0.7500
Scoliosis, idiopathic  0.6809  0.8543  0.6957  0.6667
Asthma, unspecified with status asthmaticus  0.5641  0.9232  0.7857  0.4400
Neoplasm, brain, unspecified  0.5430  0.8522  0.4317  0.7315
Delayed milestones  0.4751  0.8178  0.4057  0.5733
Acute Respiratory Distress Syndrome (ARDS)  0.4688  0.9595  0.3409  0.7500

Table 5.2: LSTM-DO-TR performance on the 6 diagnoses with the highest F1 scores.

The LSTMs appear to learn models complementary to the MLP trained on hand-engineered features. Supporting this claim, simple ensembles of the LSTM-DO-TR and MLP (taking the mean or maximum of their predictions) outperform the constituent models significantly on all metrics (Table 5.1). Further, there are many diseases for which one model substantially outperforms the other, e.g., intracranial hypertension for the LSTM, septic shock for the MLP (Section 5.10).

5.6 Discussion

Our results indicate that LSTM RNNs, especially with target replication, can successfully classify diagnoses of critical care patients given clinical time series data. The best LSTM beat a strong MLP baseline using hand-engineered features as input, and an ensemble combining the MLP and LSTM improves upon both. The success of target replication accords with results by both [332] and [72], who observed similar benefits on their respective tasks. However, while they saw improvement using a linearly increasing weight on each target from start to end, this strategy performed worse in our diagnostic classification task than our uniform weighting of intermediate targets. We believe this may owe to the peculiar nature of our data. Linear gain emphasizes evidence from later in the sequence, an assumption which often does not match the progression of symptoms in critical illnesses.
Asthma patients, for example, are often admitted to the ICU severely symptomatic, but once treatment begins, patient physiology stabilizes and observable signs of disease may abate or change. Further supporting this idea, we observed that when training fixed-window baselines, using the first 6 and last 6 hours outperformed using the last 12 hours only.

Figure 5.4: Training curves showing the impact of the DO, AO, and TR strategies on overfitting.

While our data is of large scale by clinical standards, it is small relative to data sets found in deep learning tasks like vision and speech recognition. At this scale, regularization is critical. Our experiments demonstrate that target replication, auxiliary outputs, and dropout all work to reduce the generalization gap, as shown in Figure 5.4 and Section 5.9. However, some of these techniques are complementary while others seem to cancel each other out. For example, our best model combined target replication with dropout. This combination significantly improved upon the performance using target replication alone, and enabled the effective use of larger-capacity models. In contrast, the benefits of dropout and auxiliary output training appear to wash each other out. This may be because target replication confers more than regularization, mitigating the difficulty of learning long-range dependencies by providing local objectives.

5.7 Conclusion

While our LSTMs produce promising results, this is only a first step in this line of research. Recognizing diagnoses given full time series of sensor data demonstrates that LSTMs can capture meaningful signal, but ultimately we would like to predict developing conditions and events, outcomes such as mortality, and treatment responses. In this chapter we used diagnostic labels without timestamps, but we are obtaining timestamped diagnoses, which will enable us to train models to perform early diagnosis by predicting future conditions. In addition, we are extending this work to a larger PICU data set with 50% more patients and hundreds of variables, including treatments and medications.

On the methodological side, we would like to both better exploit and improve the capabilities of LSTMs. Results from speech recognition have shown that LSTMs shine in comparison to other models using raw features, minimizing the need for preprocessing and feature engineering. In contrast, our current data preparation pipeline removes valuable structure and information from clinical time series that could be exploited by an LSTM. For example, our forward- and back-filling imputation strategies discard useful information about when each observation is recorded. Imputing normal values for missing time series ignores the meaningful distinction between truly normal and missing measurements. Also, our window-based resampling procedure reduces the variability of more frequently measured vital signs (e.g., heart rate).

In future work, we plan to introduce indicator variables to allow the LSTM to distinguish actual from missing or imputed measurements. Additionally, the flexibility of the LSTM architecture should enable us to eliminate age-based corrections and to incorporate non-sequential inputs, such as age, weight, and height (or even hand-engineered features), into predictions. Other next steps in this direction include developing LSTM architectures to directly handle missing values and irregular sampling.
We also are encouraged by the success of target replication and plan to explore other variants of this technique and to apply it to other domains and tasks. Additionally, we acknowledge that there remains a debate about the interpretability of neural networks when applied to complex medical problems. We are developing methods to interpret the representations learned by LSTMs in order to better expose patterns of health and illness to clinical users. We also hope to make practical use of the distributed representations of patients for tasks such as patient similarity search.

5.8 Hourly Diagnostic Predictions

Our LSTM networks predict 128 diagnoses given sequences of clinical measurements. Because each network is connected left-to-right, i.e., in chronological order, we can output predictions at each sequence step. Ultimately, we imagine that this capability could be used to make continuously updated real-time alerts and diagnoses. Below, we explore this capability qualitatively. We choose examples of patients with a correctly classified diagnosis and visualize the probabilities assigned by each LSTM model at each sequence step. In addition to improving the quality of the final output, the LSTMs with target replication (LSTM-TR) arrive at correct diagnoses quickly compared to the simple multilabel LSTM model (LSTM-Simple). When auxiliary outputs are also used (LSTM-TR,AO), the diagnoses appear to be generally more confident.

Our LSTM-TR,AO effectively predicts status asthmaticus and acute respiratory distress syndrome, likely owing to the several measures of pulmonary function among our inputs. Diabetic ketoacidosis also proved easy to diagnose, likely because glucose and pH are included among our clinical measurements. We were surprised to see that the network classified scoliosis reliably, but a deeper look into the medical literature suggests that scoliosis often results in respiratory symptoms. This analysis of step-by-step predictions is preliminary and informal, and we note that for a small number of examples our data preprocessing introduces a target leak by back-filling missing values. In future work, when we explore this capability in greater depth, we will reprocess the data.

5.9 Learning Curves

We present visualizations of the performance of LSTM, LSTM-DO (with dropout probability 0.5), LSTM-AO (using the 301 additional diagnoses), and LSTM-TR (with \alpha = 0.5) during training. These charts are useful for examining the effects of dropout, auxiliary outputs, and target replication on both the speed of learning and the regularization they confer. Specifically, for each of the four models, we plot the training and validation micro AUC and F1 score every five epochs in Figure 5.6. Additionally, we plot a scatter of the performance on the training set vs. the performance on the validation set. The LSTM with target replication learns more quickly than a simple LSTM and also suffers less overfitting. With both dropout and auxiliary outputs, the LSTM trains more slowly than a simple LSTM but suffers considerably less overfitting.

5.10 Per Diagnosis Results

While averaged statistics provide an efficient way to check the relative quality of various models, considerable information is lost by reducing performance to a single scalar quantity. For some labels, our classifier makes classifications with surprisingly high accuracy, while for others, our features are uninformative and thus the classifier would not be practically useful.
To facilitate a more granular investigation of our model's predictive power, we present individual test set F1 and AUC scores for each individual diagnostic label in Table 5.3. We compare the performance of our best LSTM, which combines two 128-cell hidden layers with dropout of probability 0.5 and target replication, against the strongest baseline, an MLP trained on the hand-engineered features, and an ensemble that predicts the maximum probability of the two. The results are sorted in descending order using the F1 performance of the LSTM, providing insight into the types of conditions that the LSTM can successfully classify.

Figure 5.5 (panels: (a) Asthma with Status Asthmaticus; (b) Acute Respiratory Distress Syndrome; (c) Diabetic Ketoacidosis; (d) Brain Neoplasm, Unspecified Nature; (e) Septic Shock; (f) Scoliosis): Each chart depicts the probabilities assigned by each of four models at each (hourly re-sampled) time step. LSTM-Simple uses only targets at the final time step. LSTM-TR uses target replication. LSTM-AO uses auxiliary outputs (diagnoses), and LSTM-TR,AO uses both techniques. LSTMs with target replication learn to make accurate diagnoses earlier.

Figure 5.6 (panels: (a) AUC learning curves; (b) F1 learning curves; (c) AUC training vs. validation; (d) F1 training vs. validation): Training and validation performance plotted for the simple multilabel network (LSTM-Simple), LSTM with target replication (LSTM-TR), and LSTM with auxiliary outputs (LSTM-AO). Target replication appears to increase the speed of learning and confers a small regularizing effect. Auxiliary outputs slow down the speed of learning but impart a strong regularizing effect.

Classifier Performance on Each Diagnostic Code, Sorted by F1
Condition  LSTM F1  LSTM AUC  MLP F1  MLP AUC  Ensemble F1  Ensemble AUC
Diabetes mellitus with ketoacidosis  0.8571  0.9966  0.8571  0.9966  0.8571  0.9966
Scoliosis, idiopathic  0.6809  0.8543  0.6169  0.8467  0.6689  0.8591
Asthma, unspecified with status asthmaticus  0.5641  0.9232  0.6296  0.9544  0.6667  0.9490
Neoplasm, brain, unspecified nature  0.5430  0.8522  0.5263  0.8463  0.5616  0.8618
Developmental delay  0.4751  0.8178  0.4023  0.8294  0.4434  0.8344
Acute respiratory distress syndrome (ARDS)  0.4688  0.9595  0.3913  0.9645  0.4211  0.9650
Hypertension, unspecified  0.4118  0.8593  0.3704  0.8637  0.3636  0.8652
Arteriovenous malformation of brain  0.4000  0.8620  0.3750  0.8633  0.3600  0.8684
End stage renal disease on dialysis  0.3889  0.8436  0.3810  0.8419  0.3902  0.8464
Acute respiratory failure  0.3864  0.7960  0.4128  0.7990  0.4155  0.8016
Renal transplant status post  0.3846  0.9692  0.4828  0.9693  0.4800  0.9713
Epilepsy, unspecified, not intractable  0.3740  0.7577  0.3145  0.7265  0.3795  0.7477
Septic shock  0.3721  0.8182  0.3210  0.8640  0.3519  0.8546
Other respiratory symptom  0.3690  0.8088  0.3642  0.7898  0.3955  0.8114
Biliary atresia  0.3636  0.9528  0.5000  0.9338  0.4444  0.9541
Acute lymphoid leukemia, without remission  0.3486  0.8601  0.3288  0.8293  0.3175  0.8441
Congenital hereditary muscular dystrophy  0.3478  0.8233  0.0000  0.8337  0.2727  0.8778
Liver transplant status post  0.3448  0.8431  0.3333  0.8104  0.3846  0.8349
Respiratory complications after procedure  0.3143  0.8545  0.2133  0.8614  0.3438  0.8672
Grand mal status  0.3067  0.8003  0.3883  0.7917  0.3529  0.8088
Intracranial injury, closed  0.3048  0.8589  0.3095  0.8621  0.3297  0.8820
Diabetes insipidus  0.2963  0.9455  0.3774  0.9372  0.4068  0.9578
Acute renal failure, unspecified  0.2553  0.8806  0.2472  0.8698  0.2951  0.8821
Other diseases of the respiratory system  0.2529  0.7999  0.1864  0.7920  0.2400  0.8131
Croup syndrome  0.2500  0.9171  0.1538  0.9183  0.0000  0.9263
Bronchiolitis due to other infection  0.2466  0.9386  0.2353  0.9315  0.2712  0.9425
Congestive heart failure  0.2439  0.8857  0.0000  0.8797  0.0000  0.8872
Infantile cerebral palsy, unspecified  0.2400  0.8538  0.1569  0.8492  0.2083  0.8515
Congenital hydrocephalus  0.2393  0.7280  0.2247  0.7337  0.1875  0.7444
Cerebral edema  0.2222  0.8823  0.2105  0.9143  0.2500  0.9190
Craniosynostosis  0.2222  0.8305  0.5333  0.8521  0.6154  0.8658
Anoxic brain damage  0.2222  0.8108  0.1333  0.8134  0.2500  0.8193
Pneumonitis due to inhalation  0.2222  0.6547  0.0326  0.6776  0.0462  0.6905
Acute and subacute necrosis of the liver  0.2182  0.8674  0.2778  0.9039  0.2381  0.8964
Respiratory syncytial virus  0.2154  0.9118  0.1143  0.8694  0.1622  0.9031
Unspecified disorder of kidney and ureter  0.2069  0.8367  0.1667  0.8496  0.1667  0.8559
Craniofacial malformation  0.2059  0.8688  0.4444  0.8633  0.3158  0.8866
Pulmonary hypertension, secondary  0.2000  0.9377  0.0870  0.8969  0.2105  0.9343
Bronchopulmonary dysplasia  0.1905  0.8427  0.1404  0.8438  0.1333  0.8617
Drowning and non-fatal submersion  0.1905  0.8341  0.1538  0.8905  0.1429  0.8792
Genetic abnormality  0.1828  0.6727  0.1077  0.6343  0.1111  0.6745
Other and unspecified coagulation defects  0.1818  0.7081  0.0000  0.7507  0.1600  0.7328
Vehicular trauma  0.1778  0.8655  0.2642  0.8505  0.2295  0.8723
Other specified cardiac dysrhythmia  0.1667  0.7698  0.1250  0.8411  0.0800  0.8179
Acute pancreatitis  0.1622  0.8286  0.1053  0.8087  0.1379  0.8440
Esophageal reflux  0.1515  0.8236  0.0000  0.7774  0.1739  0.8090
Cardiac arrest, outside hospital  0.1500  0.8562  0.1333  0.9004  0.1765  0.8964
Unspecified pleural effusion  0.1458  0.8777  0.1194  0.8190  0.1250  0.8656
Mycoplasma pneumoniae  0.1429  0.8978  0.1067  0.8852  0.1505  0.8955
Unspecified immunologic disorder  0.1429  0.8481  0.1000  0.8692  0.1111  0.8692
Congenital alveolar hypoventilation  0.1429  0.6381  0.0000  0.7609  0.0000  0.7246
Septicemia, unspecified  0.1395  0.8595  0.1695  0.8640  0.1905  0.8663
Pneumonia due to adenovirus  0.1379  0.8467  0.0690  0.9121  0.1277  0.8947
Insomnia with sleep apnea  0.1359  0.7892  0.0752  0.7211  0.0899  0.8089
Defibrination syndrome  0.1333  0.9339  0.1935  0.9461  0.2500  0.9460
Unspecified injury, unspecified site  0.1333  0.8749  0.0000  0.7673  0.1250  0.8314
Pneumococcal pneumonia  0.1290  0.8706  0.1149  0.8664  0.1461  0.8727
Genetic or other unspecified anomaly  0.1277  0.7830  0.0870  0.7812  0.1429  0.7905
Other spontaneous pneumothorax  0.1212  0.8029  0.0972  0.8058  0.1156  0.8122
Bone marrow transplant status  0.1176  0.8136  0.0000  0.8854  0.2353  0.8638
Other primary cardiomyopathies  0.1176  0.6862  0.0000  0.6371  0.1212  0.6635
Intracranial hemorrhage  0.1071  0.7498  0.1458  0.7306  0.1587  0.7540
Benign intracranial hypertension  0.1053  0.9118  0.0909  0.7613  0.1379  0.8829
Encephalopathy, unspecified  0.1053  0.8466  0.0909  0.7886  0.0000  0.8300
Ventricular septal defect  0.1053  0.6781  0.0741  0.6534  0.0833  0.6667
Crushing injury, unspecified  0.1017  0.9183  0.0952  0.8742  0.1200  0.9111
Malignant neoplasm, disseminated  0.0984  0.7639  0.0588  0.7635  0.0667  0.7812
Orthopaedic surgery, post status  0.0976  0.7605  0.1290  0.8234  0.0845  0.8106
Thoracic surgery, post status  0.0930  0.9160  0.0432  0.7401  0.0463  0.9137
Ostium secundum type atrial septal defect  0.0923  0.7876  0.1538  0.8068  0.1154  0.7998
Malignant neoplasm (gastrointestinal)  0.0853  0.8067  0.1111  0.7226  0.1412  0.7991
Coma  0.0833  0.7255  0.1111  0.6542  0.1250  0.7224
Pneumonia due to inhalation  0.0800  0.8282  0.0923  0.8090  0.0952  0.8422
Extradural hemorrhage from injury  0.0769  0.7829  0.0000  0.8339  0.0988  0.8246
Prematurity (less than 37 weeks gestation)  0.0759  0.7542  0.1628  0.7345  0.1316  0.7530
Asthma without status asthmaticus  0.0734  0.6679  0.0784  0.6914  0.0678  0.6867
Gastrointestinal surgery, post status  0.0714  0.7183  0.0984  0.6999  0.0851  0.7069
Nervous disorder, not elsewhere classified  0.0708  0.7127  0.1374  0.7589  0.1404  0.7429
Unspecified gastrointestinal disorder  0.0702  0.6372  0.0348  0.6831  0.0317  0.6713
Pulmonary congestion and hypostasis  0.0678  0.8359  0.0000  0.8633  0.0000  0.8687
Thrombocytopenia, unspecified  0.0660  0.7652  0.0000  0.7185  0.0000  0.7360
Lung contusion, no open wound  0.0639  0.9237  0.0000  0.9129  0.2222  0.9359
Acute pericarditis, unspecified  0.0625  0.8601  0.0000  0.9132  0.0000  0.9089
Nervous system complic. from implant  0.0597  0.6727  0.0368  0.7082  0.0419  0.7129
Heart disease, unspecified  0.0588  0.8372  0.0000  0.8020  0.0000  0.8264
Suspected infection in newborn or infant  0.0588  0.6593  0.0000  0.7090  0.0606  0.6954
Anemia, unspecified  0.0541  0.7782  0.0488  0.7019  0.0727  0.7380
Muscular disorder, not classified  0.0536  0.6996  0.0000  0.7354  0.1000  0.7276
Malignant neoplasm, adrenal gland  0.0472  0.6960  0.0727  0.6682  0.0548  0.6846
Hematologic disorder, unspecified  0.0465  0.7315  0.1194  0.7404  0.0714  0.7446
Hematemesis  0.0455  0.8116  0.0674  0.7887  0.0588  0.8103
Dehydration  0.0435  0.7317  0.1739  0.7287  0.0870  0.7552
Unspecified disease of spinal cord  0.0432  0.7153  0.0571  0.7481  0.0537  0.7388
Neurofibromatosis, unspecified  0.0403  0.7494  0.0516  0.7458  0.0613  0.7671
Intra-abdominal injury  0.0333  0.7682  0.1569  0.8602  0.0690  0.8220
Thyroid disorder, unspecified  0.0293  0.5969  0.0548  0.5653  0.0336  0.6062
Hereditary hemolytic anemia, unspecified  0.0290  0.7474  0.0000  0.6182  0.0000  0.6962
Subdural hemorrhage, no open wound  0.0263  0.7620  0.1132  0.7353  0.0444  0.7731
Unspecified intestinal obstruction  0.0260  0.6210  0.2041  0.7684  0.0606  0.7277
Hyposmolality and/or hyponatremia  0.0234  0.6999  0.0000  0.7565  0.0000  0.7502
Primary malignant neoplasm, thorax  0.0233  0.6154  0.0364  0.6086  0.0323  0.5996
Supraventricular premature beats  0.0185  0.8278  0.0190  0.7577  0.0299  0.8146
Injury to intrathoracic organs, no wound  0.0115  0.8354  0.0000  0.8681  0.0000  0.8604
Child abuse, unspecified  0.0000  0.9273  0.3158  0.9417  0.1818  0.9406
Acidosis  0.0000  0.9191  0.1176  0.9260  0.0000  0.9306
Infantile spinal muscular atrophy  0.0000  0.9158  0.0000  0.8511  0.0000  0.9641
Fracture, femoral shaft  0.0000  0.9116  0.0000  0.9372  0.0513  0.9233
Cystic fibrosis with pulm. symptoms  0.0000  0.8927  0.0000  0.8086  0.0571  0.8852
Panhypopituitarism  0.0000  0.8799  0.2222  0.8799  0.0500  0.8872
Blood in stool  0.0000  0.8424  0.0000  0.8443  0.0000  0.8872
Sickle-cell anemia, unspecified  0.0000  0.8268  0.0000  0.7317  0.0000  0.7867
Cardiac dysrhythmia, unspecified  0.0000  0.8202  0.0702  0.8372  0.0000  0.8523
Agranulocytosis  0.0000  0.8157  0.1818  0.8011  0.1667  0.8028
Malignancy of bone, no site specified  0.0000  0.8128  0.0870  0.7763  0.0667  0.8318
Pneumonia, organism unspecified  0.0000  0.8008  0.0952  0.8146  0.0000  0.8171
Unspecified metabolic disorder  0.0000  0.7914  0.0000  0.6719  0.0000  0.7283
Urinary tract infection, no site specified  0.0000  0.7867  0.0840  0.7719  0.2286  0.7890
Obesity, unspecified  0.0000  0.7826  0.0556  0.7550  0.0000  0.7872
Apnea  0.0000  0.7822  0.2703  0.8189  0.0000  0.8083
Respiratory arrest  0.0000  0.7729  0.0000  0.8592  0.0000  0.8346
Hypovolemic shock  0.0000  0.7686  0.0000  0.8293  0.0000  0.8296
Hemophilus meningitis  0.0000  0.7649  0.0000  0.7877  0.0000  0.7721
Diabetes mellitus, type I, stable  0.0000  0.7329  0.0667  0.7435  0.0833  0.7410
Tetralogy of Fallot  0.0000  0.7326  0.0000  0.6134  0.0000  0.6738
Congenital heart disease, unspecified  0.0000  0.7270  0.1333  0.7251  0.0000  0.7319
Mechanical complication of V-P shunt  0.0000  0.7173  0.0000  0.7308  0.0000  0.7205
Respiratory complic. due to procedure  0.0000  0.7024  0.0000  0.7244  0.0000  0.7323
Teenage cerebral occlusion/infarction  0.0000  0.6377  0.0000  0.5982  0.0000  0.6507

Table 5.3: F1 and AUC scores for individual diagnoses.

Chapter 6

Modeling Missing Values in Clinical Time Series

Clinical time series are rife with missing values that are seldom missing at random and that may provide clues about a patient's condition and clinical decision-making. For example, lab results are present only when the corresponding tests are ordered by clinicians suspecting an abnormal finding. Other missing values result from the fact that some variables are measured more frequently than others, providing information about how closely clinical staff are monitoring a patient. In this chapter, we investigate the impact of missing data when learning to diagnose. By explicitly encoding whether a particular input was measured or missing, we are able to boost the predictive accuracy of an RNN trained to predict diagnoses in critically ill patients. We also show that some diseases can be reliably predicted from which tests are run while ignoring the actual measurements. Finally, we analyze our findings and discuss the larger implications for clinical machine learning.

6.1 Introduction

For each admitted patient, hospital intensive care units record large amounts of data in electronic health records (EHRs). Clinical staff routinely chart vital signs during hourly rounds and when patients are unstable. EHRs record lab test results and medications as they are ordered or delivered by physicians and nurses. As a result, EHRs contain rich sequences of clinical observations depicting both patients' health and the care they received.

We would like to mine these time series to build accurate predictive models for diagnosis and other applications. Recurrent neural networks (RNNs) are well-suited to learning sequential or temporal relationships from such time series. RNNs offer unprecedented predictive power in myriad sequence learning domains, including natural language processing, speech, video, and handwriting. Recently, [178] demonstrated the efficacy of RNNs for multilabel classification of diagnoses in clinical time series data. However, medical time series data present modeling problems not found in the clean academic data sets on which most RNN research focuses.
Clinical observations are recorded irregularly, with measurement frequency varying between patients, across variables, and even over time. In one common modeling strategy, we represent these observations as a sequence with discrete, fixed-width time steps. Problematically, the resulting sequences often contain missing values [187]. These values are typically not missing at random, but reflect decisions by caregivers. Thus, the pattern of recorded measurements contains potential information about the state of the patient. However, most often, researchers fill missing values using heuristic or unsupervised imputation [160], ignoring the potential predictive value of the missingness itself.

Figure 6.1 (rows: HR, DIA BP, SYS BP, TEMP, RESP RATE, FRAC O2, O2 SAT, END CO2, CAP RATE, PH, URINE OUT, GLUCOSE, GLASGOW): Missingness artifacts created by discretization.

In this work we extend the methodology of [178] for RNN-based multilabel prediction of diagnoses. We focus on data gathered from the Children's Hospital Los Angeles pediatric intensive care unit (PICU). Unlike [178], who approach missing data via heuristic imputation, we directly model missingness as a feature, achieving superior predictive performance. RNNs can realize this improvement using only simple binary indicators for missingness. However, linear models are unable to use indicator features as effectively. While RNNs can learn arbitrary functions, capturing the interactions between the missingness indicators and the sequence of observation inputs, linear models can only learn substitution values. For linear models, we introduce an alternative strategy to capture this signal, using a small number of simple hand-engineered features.

Our experiments demonstrate the benefit of modeling missing data as a first-class feature. Our methods improve the performance of RNNs, multilayer perceptrons (MLPs), and linear models. Additionally, we analyze the predictive value of missing data information by training models on the missingness indicators only. We show that for several diseases, which tests are run can be as predictive as the actual measurements. While we focus on classifying diagnoses, our methods can be applied to any predictive modeling problem involving sequence data and missing values, such as early prediction of sepsis [126] or real-time risk modeling [317].

It is worth noting that we may not want our predictive models to rely upon patterns of treatment, as argued by [42]. Once deployed, our models may influence treatment protocols, shifting the distribution of future data and thus invalidating their predictions. Nonetheless, doctors at present often utilize knowledge of past care, and treatment signal can leak into the actual measurements themselves in ways that sufficiently powerful models can exploit. As a final contribution of this chapter, we present a critical discussion of these practical and philosophical issues.

6.2 Data

Our data set consists of patient records extracted from the EHR system at CHLA [187, 51] as part of an IRB-approved study. In all, the data set contains 10,401 PICU episodes. Each episode describes the stay of one patient in the PICU for a period of at least 12 hours. In addition, each patient record contains a static set of diagnostic codes, annotated by physicians either during or after each PICU visit.
6.2.1 Inputs

In their rawest representation, episodes consist of irregularly spaced measurements of 13 variables: diastolic and systolic blood pressure, peripheral capillary refill rate, end-tidal CO2 (ETCO2), fraction of inspired O2 (FiO2), total Glasgow coma scale, blood glucose, heart rate, pH, respiratory rate, blood oxygen saturation, body temperature, and urine output. To render our data suitable for learning with RNNs, we convert them to discrete sequences of hourly time steps, where time step t covers the interval between hours t and t + 1, closed on the left but open on the right. Because actual admission times are not recorded reliably, we use the time of the first recorded observation as time step t = 0. We combine multiple measurements of the same variable within the same hour window by taking their mean.

Vital signs, such as heart rate, are typically measured about once per hour, while lab tests requiring a blood draw (e.g., glucose) are measured on the order of once per day (see appendix B for measurement frequency statistics). In addition, the timing of and time between observations varies across patients and over time. The resulting sequential representations have many missing values, and some variables are missing altogether.

Note that our methods can be sensitive to the duration of our discrete time step. For example, halving the duration would double the length of the sequences, making learning by backpropagation through time more challenging [25]. For our data, such cost would not be justified because the most frequently measured variables (vital signs) are only recorded about once per hour. For higher frequency recordings of variables with faster dynamics, a shorter time step might be warranted.

To better condition our inputs, we scale each variable to the [0, 1] interval, using expert-defined ranges. Additionally, we correct for differences in heart rate and respiratory rate [91] and blood pressure [201] due to age and gender, using tables of normal values from large population studies.

6.2.2 Diagnostic labels

In this work, we formulate phenotyping [209] as multilabel classification of sequences. Our labels include 429 distinct diagnosis codes from an in-house taxonomy at CHLA, similar to the ICD-9 codes [214] commonly used in medical informatics research. These labels include a wide range of acute conditions, such as acute respiratory distress, congestive heart failure, and sepsis. A full list is given in appendix A. We focus on the 128 most frequent, each having at least 50 positive examples in our data set. Naturally, the diagnoses are not mutually exclusive. In our data set, the average patient is associated with 2.24 diagnoses. Additionally, the base rates of the diagnoses vary widely (see appendix A).

6.3 Recurrent Neural Networks for Multilabel Classification

While our focus in this chapter is on missing data, for completeness, we review the LSTM RNN architecture for performing multilabel classification of diagnoses introduced by [178]. Formally, given a series of observations x^{(1)}, ..., x^{(T)}, we desire a classifier to generate hypotheses \hat{y} of the true labels y, where each input x^{(t)} \in R^D and the output \hat{y} \in [0, 1]^K. Here, D denotes the input dimension, K denotes the number of labels, t indexes sequence steps, and for any example, T denotes the length of that sequence. Our proposed RNN uses LSTM memory cells [131] with forget gates [98] but without peephole connections [99]. As output, we use a fully connected layer followed by an element-wise logistic activation function \sigma.
We apply log loss (binary cross-entropy) as the loss function at each output node. The following equations give the update for a layer of memory cells h_l^{(t)}, where h_{l-1}^{(t)} stands for the previous layer at the same sequence step (a previous LSTM layer or the input x^{(t)}) and h_l^{(t-1)} stands for the same layer at the previous sequence step:

g_l^{(t)} = \phi(W_l^{gx} h_{l-1}^{(t)} + W_l^{gh} h_l^{(t-1)} + b_l^{g})
i_l^{(t)} = \sigma(W_l^{ix} h_{l-1}^{(t)} + W_l^{ih} h_l^{(t-1)} + b_l^{i})
f_l^{(t)} = \sigma(W_l^{fx} h_{l-1}^{(t)} + W_l^{fh} h_l^{(t-1)} + b_l^{f})
o_l^{(t)} = \sigma(W_l^{ox} h_{l-1}^{(t)} + W_l^{oh} h_l^{(t-1)} + b_l^{o})
s_l^{(t)} = g_l^{(t)} \odot i_l^{(t)} + s_l^{(t-1)} \odot f_l^{(t)}
h_l^{(t)} = \phi(s_l^{(t)}) \odot o_l^{(t)}

In these equations, \sigma stands for an element-wise application of the logistic function, \phi stands for an element-wise application of the tanh function, and \odot is the Hadamard (element-wise) product. The input, output, and forget gates are denoted by i, o, and f respectively, while g is the input node and has a tanh activation. The loss at a single sequence step is the average log loss calculated across all labels:

loss(\hat{y}, y) = -\frac{1}{K} \sum_{l=1}^{K} \left( y_l \log(\hat{y}_l) + (1 - y_l) \log(1 - \hat{y}_l) \right)

To overcome the difficulty of learning to pass information across long sequences, we use the target replication strategy proposed by [178], in which we replicate the static targets at each sequence step, providing a local error signal. This technique is also motivated by our problem: we desire to make accurate predictions even if the sequence were truncated (as in early-warning systems). To calculate loss, we take a convex combination of the final-step loss and the average of the losses over predictions \hat{y}^{(t)} at all steps t:

\alpha \cdot \frac{1}{T} \sum_{t=1}^{T} \text{loss}(\hat{y}^{(t)}, y^{(t)}) + (1 - \alpha) \cdot \text{loss}(\hat{y}^{(T)}, y^{(T)})

where \alpha \in [0, 1] is a hyperparameter determining the relative importance of performance on the intermediary vs. final targets. At inference time, we consider only the output at the final step.

6.4 Missing Data

In this section, we explain our procedures for imputation, missing data indicator sequences, and engineering features of missing data patterns.

6.4.1 Imputation

To address the missing data problem, we consider two different imputation strategies (forward-filling and zero imputation), as well as direct modeling via indicator variables. Because imputation and direct modeling are not mutually exclusive, we also evaluate them in combination. Suppose that x_i^{(t)} is "missing." In our zero-imputation strategy, we simply set x_i^{(t)} := 0 whenever it is missing. In our forward-filling strategy, we impute x_i^{(t)} as follows: if there is at least one previously recorded measurement of variable i at a time t' < t, we perform forward-filling by setting x_i^{(t)} := x_i^{(t')}; if there is no previous recorded measurement (or if the variable is missing entirely), then we impute the median estimated over all measurements in the training data. This strategy is motivated by the intuition that clinical staff record measurements at intervals proportional to the rate at which they are believed or observed to change. Heart rate, which can change rapidly, is monitored much more frequently than blood pH. Thus it seems reasonable to assume that a value has changed little since the last time it was measured.

Figure 6.2: (top left) no imputation or indicators, (bottom left) imputation absent indicators, (top right) indicators but no imputation, (bottom right) indicators and imputation. Time flows from left to right.
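For concreteness, a minimal NumPy sketch of the two imputation strategies and the accompanying indicator sequences follows; the array layout and function name are our assumptions, and the forward-filling variant carries the most recent earlier value forward, falling back to the training median.

import numpy as np

def impute_with_indicators(X, train_medians, strategy="zero"):
    """Fill missing entries of one episode and return the missingness indicators.

    X: array of shape (T, D), hourly measurements with np.nan where missing.
    train_medians: length-D array of medians estimated on the training set.
    strategy: "zero" sets missing entries to 0; "forward" carries the most
        recent observed value forward and falls back to the training median.
    Returns (X_filled, M) where M[t, i] = 1 if X[t, i] was imputed.
    """
    M = np.isnan(X).astype(float)
    X_filled = X.copy()
    if strategy == "zero":
        X_filled[np.isnan(X_filled)] = 0.0
    else:
        T, D = X.shape
        for i in range(D):
            last = np.nan
            for t in range(T):
                if np.isnan(X_filled[t, i]):
                    X_filled[t, i] = last if not np.isnan(last) else train_medians[i]
                else:
                    last = X_filled[t, i]
    return X_filled, M

When indicators are used, the model input at each step is simply the concatenation of the filled measurements and the indicators, e.g., np.concatenate([X_filled, M], axis=1).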
6.4.2 Learning with Missing Data Indicators

Our indicator variable approach to missing data consists of augmenting our inputs with binary variables m_i^{(t)} for every x_i^{(t)}, where m_i^{(t)} := 1 if x_i^{(t)} is imputed and 0 otherwise. Through their hidden state computations, RNNs can use these indicators to learn arbitrary functions of the past observations and missingness patterns. However, given the same data, linear models can only learn hard substitution rules. To see why, consider a linear model that outputs prediction f(z), where z = \sum_i w_i x_i. With indicator variables, we might say that z = \sum_i w_i x_i + \sum_i \theta_i m_i, where \theta_i are the weights for each m_i. If x_i is set to 0 and m_i to 1 whenever the feature x_i is missing, then the impact on the output, \theta_i m_i = \theta_i, is exactly equal to the contribution w_i x_i for some x_i = \theta_i / w_i. In other words, the linear model can only use the indicator in a way that depends neither on the previously observed values (x_i^{(1)}, ..., x_i^{(t-1)}) nor on any other evidence in the inputs.

Figure 6.3: Depiction of RNN zero-filled inputs and missing data indicators.

Note that for a linear model, the impact of a missing data indicator on predictions must be monotonic. In contrast, the RNN might infer that for one patient heart rate is missing because they went for a walk, while for another it might signify an emergency. Also note that even without indicators, the RNN might learn to recognize filled-in vs. real values. For example, with forward-filling, the RNN could learn to recognize exact repeats. With zero-filling, the RNN could recognize that values set to exactly 0 were likely missing measurements.

6.4.3 Hand-engineered missing data features

To overcome the limits of the linear model, we also designed features from the indicator sequences. As much as possible, we limited ourselves to features that are simple to calculate, intuitive, and task-agnostic. The first is a binary indicator for whether a variable was measured at all. Additionally, we compute the mean and standard deviation of the indicator sequence. The mean captures the frequency with which each variable is measured, which carries information about the severity of a patient's condition. The standard deviation, on the other hand, computes a non-monotonic function of frequency that is maximized when a variable is missing exactly 50% of the time. We also compute the frequency with which a variable switches from measured to missing or vice versa across adjacent sequence steps. Finally, we add features that capture the relative timing of the first and last measurements of a variable, computed as the number of hours until the measurement divided by the length of the full sequence.
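Most of these indicator-sequence summaries reduce to a few lines of code per variable; the sketch below (our own naming and ordering, covering the main features described above rather than the exact feature set used in the experiments) illustrates how they might be computed for a single variable:

import numpy as np

def missingness_features(m):
    """Summarize one variable's indicator sequence m, where m[t] = 1 if missing at hour t."""
    m = np.asarray(m, dtype=float)
    T = len(m)
    observed = 1.0 - m
    measured_at_all = float(observed.any())            # binary: ever measured in this episode
    mean_missing = float(m.mean())                      # measurement frequency (inverted)
    std_missing = float(m.std())                        # non-monotonic in frequency, peaks at 50%
    switches = float(np.abs(np.diff(m)).mean()) if T > 1 else 0.0  # measured <-> missing switches
    if measured_at_all:
        obs_idx = np.where(observed > 0)[0]
        first_rel = obs_idx[0] / T                      # relative timing of first measurement
        last_rel = obs_idx[-1] / T                      # relative timing of last measurement
    else:
        first_rel = last_rel = 1.0
    return np.array([measured_at_all, mean_missing, std_missing, switches, first_rel, last_rel])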
6.5 Experiments

We now present the training details and empirical findings of our experiments. Our LSTM RNNs each have 2 hidden layers of 128 LSTM cells, non-recurrent dropout of 0.5, and ℓ2^2 weight decay of 10^-6. We train on 80% of the data, setting aside 10% each for validation and testing. We train each RNN for 100 epochs, retaining the parameters corresponding to the epoch with the lowest validation loss.

We compare the performance of RNNs against logistic regression and multilayer perceptrons (MLPs). We apply ℓ2 regularization to the logistic regression model. The MLP has 3 hidden layers with 500 nodes each, rectified linear unit activations, and dropout (with probability 0.5), choosing the number of layers and nodes by validation performance. We train the MLP using stochastic gradient descent with momentum.

We evaluate each baseline with two sets of features: raw and hand-engineered. Note that our baselines cannot be applied directly to variable-length inputs. For the raw features, we concatenate three 12-hour subsequences, one each from the beginning, middle, and end of the time series. For shorter time series, these intervals may overlap. Thus raw representations contain 2 × 3 × 12 × 13 = 936 features. We train each baseline on five different combinations of raw inputs: (1) measurements with zero-filling, (2) measurements with forward-filling, (3) measurements with zero-filling plus missing data indicators, (4) measurements with forward-filling plus missing data indicators, and (5) missing data indicators only.

Our hand-engineered features capture central tendencies, variability, extremes, and trends. These include the first and last measurements and their difference, maximum and minimum values, mean and standard deviation, median and 25th and 75th percentiles, and the slope and intercept of a least squares line fit. We also computed the 8 missing data features described in Section 6.4. We improve upon the baselines in [178] by computing the hand-engineered features over different windows of time, giving them access to greater temporal information and enabling them to better model patterns of missingness. We extract hand-engineered features from the entire time series and from three possibly overlapping intervals: the first and last 12 hours and the interval between them (for shorter sequences, we instead use the middle 12 hours). This yields a total of 4 × 12 × 13 = 624 and 4 × 8 × 13 = 416 hand-engineered measurement and missing data features, respectively. We train baseline models on three different combinations of hand-engineered features: (1) measurement-only, (2) indicator-only, and (3) measurement and indicator.

We evaluate all models on the same training, validation, and test splits. Our evaluation metrics include area under the ROC curve (AUC) and F1 score (with thresholds chosen based on validation performance). We report both micro-averaged (calculated across all predictions) and macro-averaged (calculated separately on each label, then averaged) measures to mitigate the weaknesses in each [176]. Finally, we also report precision at 10, whose maximum is 0.2238 because we have on average 2.238 diagnoses per patient. This metric seems appropriate because we could imagine this technology being integrated into a diagnostic assistant. In that case, its role might be to suggest the most likely diagnoses, among which a professional doctor would choose. Precision at 10 evaluates the quality of the top 10 suggestions.

6.5.1 Results

The best overall model by all metrics (micro AUC of 0.8730) is an LSTM with zero-imputation and missing data indicators. It outperforms both the strongest MLP baseline and LSTMs absent missing data indicators. For the LSTMs using either imputation strategy, adding the missing data indicators improves performance on all metrics. While all models improve with access to missing data indicators, this information confers less benefit to the raw-input linear baselines, consistent with the theory discussed in Section 6.4.2. The results achieved by logistic regression with hand-engineered features indicate that our simple hand-engineered missing data features do a reasonably good job of capturing important information that neural networks are able to mine automatically. We also find that LSTMs (with or without indicators) appear to perform better with zero-filling than with imputed values.
6.5.1 Results

The best overall model by all metrics (micro AUC of 0.8730) is an LSTM with zero-imputation and missing data indicators. It outperforms both the strongest MLP baseline and LSTMs without missing data indicators. For the LSTMs using either imputation strategy, adding the missing data indicators improves performance in all metrics. While all models improve with access to missing data indicators, this information confers less benefit to the raw-input linear baselines, consistent with the theory discussed in Section 6.4.2. The results achieved by logistic regression with hand-engineered features indicate that our simple hand-engineered missing data features do a reasonably good job of capturing important information that neural networks are able to mine automatically. We also find that LSTMs (with or without indicators) appear to perform better with zero-filling than with imputed values.

Interestingly, this is not true for either baseline. It suggests that the LSTM may be learning to recognize missing values implicitly, by recognizing a tight range about the value zero and inferring that these are missing values. If this is true, perhaps imputation interferes with the LSTM's ability to implicitly recognize missing values. Overall, the ability to implicitly infer missingness may have broader implications. It suggests that we might never completely hide this information from a sufficiently powerful model.

Classification performance for 128 ICU phenotypes
Model mAUC MAUC mF1 MF1 P@10
Base Rate 0.7128 0.5 0.1346 0.0343 0.0788
Best Possible 1.0 1.0 1.0 1.0 0.2281
Logistic Regression
Log Reg - Zeros 0.8108 0.7244 0.2149 0.0999 0.1014
Log Reg - Impute 0.8201 0.7455 0.2404 0.1189 0.1038
Log Reg - Zeros & Indicators 0.8143 0.7269 0.2239 0.1082 0.1017
Log Reg - Impute & Indicators 0.8242 0.7442 0.2467 0.1234 0.1045
Log Reg - Indicators Only 0.7929 0.6924 0.1952 0.0889 0.0939
Multilayer Perceptron
MLP - Zeros 0.8263 0.7502 0.2344 0.1072 0.1048
MLP - Impute 0.8376 0.7708 0.2557 0.1245 0.1031
MLP - Zeros & Indicators 0.8381 0.7705 0.2530 0.1224 0.1067
MLP - Impute & Indicators 0.8419 0.7805 0.2637 0.1296 0.1082
MLP - Indicators Only 0.8112 0.7321 0.1962 0.0949 0.0947
LSTMs
LSTM - Zeros 0.8662 0.8133 0.2909 0.1557 0.1176
LSTM - Impute 0.8600 0.8062 0.2967 0.1569 0.1159
LSTM - Zeros & Indicators 0.8730 0.8250 0.3041 0.1656 0.1215
LSTM - Impute & Indicators 0.8689 0.8206 0.3027 0.1609 0.1196
LSTM - Indicators Only 0.8409 0.7834 0.2403 0.1291 0.1074
Models using Hand-Engineered Features
Log Reg HE 0.8396 0.7714 0.2708 0.1327 0.1118
Log Reg HE Indicators 0.8472 0.7752 0.2841 0.1376 0.1165
Log Reg HE Indicators Only 0.8187 0.7322 0.2287 0.1001 0.1020
MLP HE 0.8599 0.8052 0.2953 0.1556 0.1168
MLP HE Indicators 0.8669 0.8160 0.2954 0.1610 0.1202
MLP HE Indicators Only 0.8371 0.7682 0.2351 0.1179 0.1028
Table 6.1: Results on aggregate performance metrics calculated across all labels, for logistic regression (Log Reg), MLP, and LSTM classifiers with and without imputation and missing data indicators. mAUC and mF1 refer to the micro-averaged metrics, MAUC and MF1 to the macro-averaged metrics.

6.6 Related Work

This work builds upon research relating to missing values and machine learning for medical informatics. The basic RNN methodology for phenotyping derives from [178], addressing a data set and problem described by [50]. The methods rely upon LSTM RNNs [131, 98] trained by backpropagation through time [128, 316]. A comprehensive perspective on the history and modern applications of RNNs is provided by [177], while [178] list many of the previous works that have applied neural networks to digital health data.

While a long and rich literature addresses pattern recognition with missing data [61, 8], most of this literature addresses fixed-length feature vectors [96, 225]. Indicator variables for missing data were first proposed by [61], but we could not find papers that combine missing data indicators with RNNs. Only a handful of papers address missing data in the context of RNNs. [24] demonstrate a scheme by which the RNN learns to fill in the missing values such that the filled-in values minimize output error. In 2001, [216] built upon this method to improve automatic speech recognition. [19] suggests using a mask of indicators in a scheme for weighting the contribution of reliable vs. corrupted data in the final prediction.
[293] address missing values by combining an RNN with a linear state space model to handle uncertainty. This chapter may be one of the first to engineer explicit features of missingness patterns in order to improve discriminative performance. Also, to our knowledge, we are the first to harness patterns of missing data to improve the classification of critical care phenotypes.

6.7 Discussion

Data processing and discriminative learning have often been regarded as separate disciplines. Through this separation of concerns, the complementarity of missing data indicators and training RNNs for classification has been overlooked. This chapter proposes that patterns of missing values are an underutilized source of predictive power and that RNNs, unlike linear models, can effectively mine this signal from sequences of indicator values. Our hypotheses are confirmed by empirical evidence. Additionally, we introduce and confirm the utility of a simple set of features, engineered from the sequence of missingness indicators, that can improve the performance of linear models. These techniques are simple to implement and broadly applicable and seem likely to confer similar benefits on other sequential prediction tasks when data is missing not at random. One example might be financial data, where failures to report accounting details could suggest internal problems at a company.

6.7.1 The Perils and Inevitability of Modeling Treatment Patterns

For medical applications, the predictive power of missing data raises important philosophical concerns. We train models with supervised learning and verify their utility by assessing the accuracy of their classifications on held-out test data. However, in practice, we hope to make treatment decisions based on these predictions, exposing a fundamental incongruity between the problem on which our models are trained and the one for which they are ultimately deployed. As articulated in [175], these supervised models, trained offline, cannot account for changes that their deployment might confer upon the real world, possibly invalidating their predictions. [42] present a compelling case in which a pneumonia risk model predicted a lower risk of death for patients who also have asthma. The better outcomes of the asthma patients, as it turns out, owed to the more aggressive treatment they received. The model, if deployed, might be used to choose less aggressive treatment for the patients with both pneumonia and asthma, clearly a sub-optimal course of action.

On the other hand, to some degree, learning from treatment signal may be inevitable. Any imputation might leak some information about which values are likely imputed and which are not. Thus any sufficiently powerful supervised model might catch on to some amount of missingness signal, as was the case in our experiments with the LSTM using zero-filled missing values. Even physiologic measurements contain information owing to patterns of treatment, possibly reflecting the medications patients receive and the procedures they undergo. Sometimes the patterns of treatment may be a reasonable and valuable source of information. Doctors already rely on this kind of signal habitually: they read through charts, noting which other doctors have seen a patient, inferring what their opinions might have been from which tests they ordered. While, in some circumstances, it may be problematic for learning models to rely on this signal, removing it entirely may be both difficult and undesirable.

6.7.2 Complex Models or Complex Features?
Our work also shows that, using only simple features, RNNs can achieve state-of-the-art performance classifying clinical time series. The RNNs far outperform linear models. Still, in our experience, there is a strong bias among practitioners toward more familiar models even when they require substantial feature engineering. In our experiments, we undertook extensive efforts to engineer features to boost the performance of both linear models and MLPs. Ultimately, while RNNs performed best on raw data, we could approach their performance with an MLP and significantly improve the linear model by using hand-engineered features and windowing. A question then emerges: how should we evaluate the trade-off between more complex models and more complex features? To the extent that linear models are believed to be more interpretable than neural networks, most popular notions of interpretability hinge upon the intelligibility of the features [175]. When the performance of the linear model comes at the price of this intelligibility, we might ask whether this trade-off undermines the linear model's chief advantage. Additionally, such a model, while still inferior to the RNN, relies on application-specific features that are less likely to be useful on other data sets and tasks. In contrast, RNNs seem better equipped to generalize to different tasks. While the model may be complex, the inputs remain intelligible, opening the possibility of various post-hoc interpretations [175].

6.7.3 Future Work

We see several promising next steps following this work. First, we would like to validate this methodology on tasks with more immediate clinical impact, such as predicting sepsis, mortality, or length of stay. Second, we would like to extend this work toward predicting clinical decisions. Called policy imitation in the reinforcement learning literature, such work could pave the way to providing real-time decision support. Finally, we see machine learning as cooperating with a human decision-maker. Thus a machine learning model needn't always make a prediction/classification; it could also abstain. We hope to make use of the latest advances in mining uncertainty information from neural networks to make confidence-rated predictions.

6.8 Per Diagnosis Classification Performance

In this appendix, we provide per-diagnosis AUC and F1 scores for three representative LSTM models trained with imputed measurements, with imputation plus missing indicators, and with indicators only. By comparing performance on individual diagnoses, we can gain some insight into the relationship between missing values and different conditions. Rows are sorted in descending order based on the F1 score of the imputation plus indicators model. It is worth noting that F1 scores are sensitive to the threshold, which we chose in order to optimize per-disease validation F1, sometimes based on a very small number of positive cases. Thus, there are cases where one model will have superior AUC but worse F1.

6.9 Missing Value Statistics

In this appendix, we present information about the sampling rates and missingness characteristics of our 13 variables. The first column lists the average number of measurements per hour in all episodes with at least one measurement (excluding episodes where the variable is missing entirely). The second column lists the fraction of episodes in which the variable is missing completely (there are zero measurements). The third column lists the missing rate in the resulting discretized sequences.
Classifier Performance on Each Diagnostic Code, Sorted by F1
Columns: Condition, Base rate, then AUC and F1 for the measurements-only (Msmt.), measurements plus indicators (Msmt./indic.), and indicators-only (Indic.) models.
Diabetes mellitus w/ketoacidosis 0.0125 1.0000 0.8889 0.9999 0.9333 0.9906 0.7059
Asthma with status asthmaticus 0.0202 0.9384 0.6800 0.8907 0.6383 0.8652 0.5417
Scoliosis (idiopathic) 0.1419 0.9143 0.6566 0.8970 0.6174 0.8435 0.5235
Tumor, cerebral 0.0917 0.8827 0.5636 0.8799 0.5560 0.8312 0.4627
Renal transplant, status post 0.0122 0.9667 0.2963 0.9544 0.4762 0.9490 0.5600
Liver transplant, status post 0.0106 0.7534 0.3158 0.8283 0.4762 0.8271 0.2581
Acute respiratory distress syn. 0.0193 0.9696 0.3590 0.9705 0.4557 0.9361 0.3333
Developmental delay 0.0876 0.8108 0.4382 0.8382 0.4331 0.6912 0.2366
Diabetes insipidus 0.0127 0.9220 0.2727 0.9486 0.4286 0.9266 0.4000
End stage renal disease 0.0241 0.8548 0.2778 0.8800 0.4186 0.9043 0.4255
Seizure disorder 0.0816 0.7610 0.3694 0.7937 0.4059 0.6431 0.1957
Acute respiratory failure 0.0981 0.8414 0.4128 0.8391 0.3835 0.8358 0.4542
Cystic fibrosis 0.0076 0.8628 0.2353 0.8740 0.3810 0.8189 0.0000
Septic shock 0.0316 0.8296 0.3363 0.8911 0.3750 0.8506 0.1429
Respiratory distress 0.0716 0.8411 0.3873 0.8502 0.3719 0.7857 0.2143
Intracranial injury, closed 0.0525 0.8886 0.2817 0.9002 0.3711 0.8442 0.3208
Arteriovenous malformation 0.0223 0.8620 0.3590 0.8716 0.3704 0.8494 0.2857
Seizures, status epilepticus 0.0348 0.8381 0.4158 0.8505 0.3704 0.8440 0.3226
Pneumonia due to adenovirus 0.0123 0.8604 0.1250 0.9065 0.3077 0.8792 0.1818
Acute leukemia 0.0287 0.8585 0.2783 0.8845 0.3059 0.8551 0.2703
Dissem. intravasc. coagulopathy 0.0099 0.9556 0.5000 0.9532 0.2857 0.9555 0.2500
Septicemia, other 0.0240 0.8586 0.2400 0.8870 0.2812 0.7593 0.0000
Bronchiolitis 0.0162 0.9513 0.2667 0.9395 0.2703 0.8826 0.1778
Congestive heart failure 0.0133 0.8748 0.1429 0.8756 0.2703 0.8326 0.1364
Upper airway obstruc. (UAO) 0.0378 0.8206 0.2564 0.8573 0.2542 0.8350 0.1964
Diabetes mellitus type I, stable 0.0064 0.7105 0.0000 0.9625 0.2500 0.9356 0.3333
Cerebral palsy (infantile) 0.0262 0.8230 0.2609 0.8359 0.2500 0.6773 0.0980
Coagulopathy 0.0131 0.7501 0.1111 0.8098 0.2449 0.8548 0.1667
UAO, ENT surgery, post-status 0.0302 0.9059 0.4058 0.8733 0.2400 0.8364 0.1975
Hypertension, systemic 0.0169 0.8740 0.2105 0.8831 0.2388 0.8216 0.2857
Acute renal failure, unspecified 0.0191 0.9242 0.2381 0.9510 0.2381 0.9507 0.3291
Trauma, vehicular 0.0308 0.8673 0.2105 0.8649 0.2381 0.8022 0.1395
Hepatic failure 0.0176 0.8489 0.2222 0.9239 0.2308 0.8598 0.1935
Craniosynostosis 0.0064 0.7824 0.0000 0.9267 0.2286 0.8443 0.0315
Prematurity 0.0321 0.7520 0.1548 0.7542 0.2245 0.7042 0.1266
Congenital hydrocephalus 0.0381 0.7118 0.2099 0.7500 0.2241 0.7065 0.1961
Pneumothorax 0.0134 0.8220 0.1176 0.7957 0.2188 0.7552 0.3243
Congenital muscular dystrophy 0.0121 0.8427 0.2500 0.8491 0.2143 0.7460 0.0800
Cardiomyopathy (primary) 0.0191 0.7508 0.1290 0.6057 0.2143 0.6372 0.1818
Pulmonary edema 0.0076 0.8839 0.0769 0.8385 0.2105 0.8071 0.0870
(Acute) pancreatitis 0.0106 0.8712 0.0769 0.9512 0.2000 0.8182 0.0571
Tumor, dissem. or metastatic 0.0180 0.7178 0.0938 0.7415 0.1967 0.6837 0.1062
Hematoma, intracranial 0.0299 0.7724 0.2278 0.8249 0.1892 0.7518 0.1474
Neutropenia (agranulocytosis) 0.0108 0.8285 0.0000 0.8114 0.1852 0.8335 0.2609
Arrhythmia, other 0.0087 0.8536 0.0000 0.8977 0.1818 0.8654 0.0000
Child abuse, suspected 0.0065 0.9544 0.2222 0.8642 0.1818 0.8227 0.0870
Encephalopathy 0.0116 0.8242 0.1429 0.8571 0.1818 0.8009 0.0800
Epidural hematoma 0.0098 0.7389 0.0455 0.8233 0.1818 0.7936 0.1000
Tumor, gastrointestinal 0.0100 0.8112 0.1429 0.8636 0.1778 0.8732 0.0984
Craniofacial malformation 0.0133 0.8707 0.2667 0.8514 0.1778 0.6928 0.2286
Gastroesophageal reflux 0.0182 0.7571 0.1818 0.8554 0.1690 0.7739 0.1600
Pneumonia, bacterial 0.0186 0.8876 0.1333 0.8806 0.1600 0.8616 0.0000
Pneumonia, undetermined 0.0179 0.8323 0.1481 0.8269 0.1583 0.7772 0.0947
Cerebral edema 0.0059 0.8275 0.0000 0.9469 0.1538 0.9195 0.1500
Pneumonia due to inhalation 0.0078 0.7917 0.1111 0.8602 0.1538 0.8268 0.0566
Metabolic or endocrine disorder 0.0095 0.7718 0.0364 0.6929 0.1538 0.6319 0.2000
Disorder of kidney and ureter 0.0204 0.8486 0.2857 0.8650 0.1500 0.8238 0.2500
Urinary tract infection 0.0137 0.7478 0.1154 0.7402 0.1481 0.7229 0.0588
Subdural hematoma 0.0147 0.8270 0.1449 0.8884 0.1429 0.8190 0.0476
Near drowning 0.0068 0.8296 0.0741 0.7917 0.1404 0.6897 0.0397
Cardiac arrest, outside hospital 0.0118 0.8932 0.0976 0.8791 0.1379 0.8881 0.0556
Pleural effusion 0.0113 0.8549 0.1081 0.8186 0.1351 0.7605 0.1151
Bronchopulmonary dysplasia 0.0252 0.8309 0.1915 0.7952 0.1304 0.8503 0.1203
Hyponatremia 0.0056 0.5707 0.0187 0.7398 0.1176 0.8775 0.0000
Suspected septicemia, rule out 0.0143 0.7378 0.0923 0.7402 0.1029 0.6769 0.0000
Thrombocytopenia 0.0112 0.7381 0.0822 0.7857 0.1026 0.8585 0.0800
Intracranial hypertension 0.0099 0.8494 0.0000 0.9018 0.1020 0.8586 0.1224
Pericardial effusion 0.0055 0.8997 0.0870 0.9085 0.1017 0.9000 0.0714
Pulmonary contusion 0.0068 0.9029 0.0606 0.8831 0.0984 0.8197 0.0225
Surgery, gastrointestinal 0.0104 0.6705 0.0714 0.6666 0.0976 0.5545 0.0233
Respiratory Arrest 0.0062 0.8404 0.0000 0.8741 0.0952 0.8127 0.0444
Trauma, abdominal 0.0105 0.7426 0.1667 0.8623 0.0930 0.6991 0.0426
Atrial septal defect 0.0107 0.7766 0.0727 0.7765 0.0909 0.7197 0.0000
Genetic abnormality 0.0557 0.6629 0.1324 0.6470 0.0876 0.5705 0.1165
Arrhythmia, ventricular 0.0062 0.8532 0.0303 0.8703 0.0870 0.8182 0.1250
Hematologic disorder 0.0114 0.6736 0.0800 0.6898 0.0870 0.8074 0.0800
Asthma, stable 0.0171 0.7010 0.0925 0.6607 0.0870 0.5907 0.0741
Neurofibromatosis 0.0079 0.8022 0.0469 0.7984 0.0816 0.7388 0.0160
Tumor, bone 0.0090 0.8830 0.0727 0.8174 0.0800 0.7649 0.0417
Shock, hypovolemic 0.0088 0.7703 0.0000 0.8433 0.0741 0.8040 0.0000
Gastrointestinal bleed, other 0.0064 0.8325 0.0541 0.7974 0.0741 0.7996 0.0909
Chromosomal abnormality 0.0173 0.8047 0.1034 0.7197 0.0714 0.6300 0.1600
Encephalopathy, other 0.0093 0.8265 0.1250 0.8736 0.0688 0.8335 0.1250
Respiratory syncytial virus 0.0130 0.8876 0.2857 0.9145 0.0645 0.8716 0.0930
Hereditary hemolytic anemia 0.0088 0.7582 0.0548 0.8544 0.0645 0.9125 0.0513
Obstructive sleep apnea 0.0185 0.7564 0.0613 0.8200 0.0645 0.8087 0.1111
Apnea, central 0.0142 0.7871 0.1600 0.8134 0.0625 0.8051 0.0000
Neuromuscular, other 0.0132 0.7163 0.0452 0.7069 0.0619 0.6484 0.0392
Anemia, acquired 0.0056 0.7378 0.1017 0.7596 0.0615 0.8129 0.0519
Meningitis, bacterial 0.0070 0.4431 0.0000 0.7676 0.0606 0.5480 0.0000
Trauma, long bone injury 0.0096 0.8757 0.0952 0.9085 0.0597 0.7946 0.1176
Bowel (intestinal) obstruction 0.0104 0.7512 0.0984 0.6559 0.0597 0.6936 0.0424
Neurologic disorder, other 0.0288 0.7628 0.1481 0.6978 0.0588 0.5971 0.0769
Panhypopituitarism 0.0057 0.7763 0.0000 0.7724 0.0571 0.6415 0.0000
Thyroid dysfunction 0.0072 0.6310 0.0369 0.6420 0.0541 0.6661 0.0000
Coma 0.0056 0.6483 0.1250 0.6823 0.0513 0.7155 0.0000
Spinal cord lesion 0.0133 0.7298 0.0585 0.7052 0.0488 0.8168 0.0414
Pneumonia (mycoplasma) 0.0188 0.8589 0.1613 0.8792 0.0476 0.8424 0.1164
Trauma, blunt 0.0065 0.9156 0.0513 0.8138 0.0469 0.7426 0.0177
Surgery, thoracic 0.0058 0.7405 0.0000 0.6948 0.0469 0.6087 0.0909
Neuroblastoma 0.0059 0.6526 0.0306 0.7268 0.0360 0.7775 0.0346
Obesity 0.0098 0.7503 0.0365 0.6814 0.0351 0.6647 0.0667
Ventriculoperitoneal shunt 0.0073 0.6824 0.0267 0.7114 0.0331 0.7516 0.0667
Ventricular septal defect 0.0119 0.6641 0.1081 0.5680 0.0294 0.5593 0.0444
Croup Syndrome, UAO 0.0069 0.9418 0.2222 0.9834 0.0000 0.9682 0.2222
Sickle-cell anemia, unspecified 0.0080 0.6262 0.0000 0.9627 0.0000 0.8661 0.1250
Biliary atresia 0.0063 0.9383 0.2667 0.9164 0.0000 0.7589 0.0714
Metabolic acidosis (<7.1) 0.0083 0.9475 0.1818 0.9046 0.0000 0.9143 0.1538
Immunologic disorder 0.0094 0.9539 0.1500 0.8868 0.0000 0.8969 0.1212
Pulmonary hypertension 0.0112 0.9259 0.2500 0.8826 0.0000 0.8098 0.0000
Trauma, chest 0.0051 0.9261 0.0000 0.8818 0.0000 0.7820 0.0000
Spinal muscular atrophy 0.0052 0.9666 0.0000 0.8658 0.0000 0.8362 0.0000
Trauma, unspecified 0.0065 0.7153 0.1481 0.8657 0.0000 0.8224 0.0594
Bone marrow transplant 0.0097 0.8161 0.5217 0.8562 0.0000 0.8505 0.1695
Surgery, orthopaedic 0.0180 0.7839 0.1029 0.8192 0.0000 0.7331 0.0000
Gastrointestinal bleed, upper 0.0063 0.8388 0.0000 0.8078 0.0000 0.7256 0.0000
Arrhythmia, supravent. tachy. 0.0055 0.8178 0.0385 0.7867 0.0000 0.8199 0.0000
Congenital alveolar hypovent. 0.0057 0.7067 0.0000 0.7716 0.0000 0.7282 0.0000
Tetralogy of Fallot 0.0061 0.5759 0.0000 0.7614 0.0000 0.7637 0.0000
Cardiac disorder 0.0071 0.7229 0.0519 0.7552 0.0000 0.6287 0.0000
Hydrocephalus, shunt failure 0.0083 0.7715 0.0000 0.7542 0.0000 0.7986 0.0635
Cerebral infarction (CVA) 0.0058 0.6766 0.0000 0.7495 0.0000 0.7148 0.1333
Congenital heart disorder 0.0084 0.7590 0.0000 0.7277 0.0000 0.7803 0.0583
Gastrointestinal disorder 0.0139 0.6755 0.0336 0.6821 0.0000 0.6465 0.1026
Aspiration 0.0072 0.6727 0.0533 0.6734 0.0000 0.6792 0.0333
Dehydration 0.0105 0.7356 0.0690 0.6636 0.0000 0.5899 0.0000
Tumor, thoracic 0.0077 0.6931 0.0513 0.6249 0.0000 0.6815 0.0292
UAO, extubation, status post 0.0085 0.8295 0.0672 0.6063 0.0000 0.6128 0.0000
Table 6.2: AUC and F1 scores for individual diagnostic codes.
Variable Msmt./hour Missing entirely Frac. missing
Diastolic blood pressure 0.5162 0.0135 0.1571
Systolic blood pressure 0.5158 0.0135 0.1569
Peripheral capillary refill rate 1.0419 0.0140 0.5250
End-tidal CO2 0.9318 0.5710 0.5727
Fraction inspired O2 1.3004 0.1545 0.7873
Total Glasgow coma scale 1.0394 0.0149 0.5250
Glucose 1.4359 0.1323 0.9265
Heart rate 0.2477 0.0133 0.0329
pH 1.4580 0.3053 0.9384
Respiratory rate 0.2523 0.0147 0.0465
Pulse oximetry 0.1937 0.0022 0.0326
Temperature 1.0210 0.0137 0.5235
Urine output 1.1160 0.0353 0.5980
Table 6.3: Sampling rates and missingness statistics for all 13 features.

Chapter 7
Heterogeneous Multitask Learning for Clinical Time Series

To date, most efforts to develop machine learning tools for medicine have proceeded in silos, with researchers developing solutions for one task (e.g., mortality prediction) at a time. This approach is detached from the realities of clinical decision making, in which clinical staff perform many tasks simultaneously. Equally important, there is evidence that many clinical problems are interrelated and that discarding the correlations between them may yield suboptimal performance. In this chapter, we formulate a heterogeneous multitask learning problem that involves jointly learning four high priority clinical prediction tasks: in-hospital mortality, decompensation, length of stay, and diagnosis classification. The heterogeneous nature of these tasks requires modeling correlations between different types of outputs distributed in time. To do this, we design a multitask recurrent neural network architecture that learns all four tasks jointly and can capture potential long-term dependencies between tasks. Our proposed architecture equals or outperforms individual models trained to optimize performance on each task.

7.1 Introduction

"Big clinical data" represents a significant opportunity for researchers in data mining and machine learning to transform healthcare. [20] outline six use cases where big data has the greatest potential to impact the US healthcare system, including early triage and risk assessment [338], prediction of physiologic decompensation [318], identification of high cost patients [70], and characterization of complex, multi-system diseases [253]. None of these problems is new to the era of big clinical data, but the rapid adoption of electronic health record (EHR) systems [125] and recent high profile machine learning successes [89, 274] have sparked renewed interest.

While there has been steady growth in machine learning research on these and related topics, the efforts to develop machine learning tools for clinical applications have for the most part proceeded in individual silos, with researchers developing new methods for one clinical prediction task at a time, e.g., mortality prediction [38] or condition monitoring [235]. This approach is detached from the realities of clinical decision making, in which all the above tasks are often performed simultaneously by clinical staff [161]. Perhaps more importantly, there is accumulating evidence that those prediction tasks are interrelated. For instance, the highest risk and highest cost patients are often those with complex co-morbidities [133], while decompensating patients have a higher risk for poor outcomes [318]. Discarding the correlations in the predicted clinical outcomes will likely yield suboptimal results.
In this chapter, we instead formulate a heterogeneous multitask learning problem that involves jointly learning four prediction tasks motivated by the use cases highlighted in [20]: (1) modeling risk of in-hospital mortality early in a patient admission (triage); (2) predicting mortality within a fixed horizon after assessment (decompensation); (3) forecasting patient length of stay (high cost patients); and (4) classification of clinical phenotypes from physiologic time series (joint modeling of complex diseases). These tasks vary in not only output type but also temporal structure: LOS involves a regression at each time step, while in-hospital mortality risk is predicted once early in admission. Their heterogeneous nature requires a modeling solution that can not only handle sequence data but also model correlations between tasks distributed in time.

To address these challenges, we propose a multitask recurrent neural network architecture that learns these four tasks jointly and can capture potential long-term dependencies between tasks. The proposed architecture includes one or more LSTM layers shared across tasks, followed by task-specific output layers that are temporally aligned to their respective tasks. This modular architecture can be trained jointly and generalizes to larger numbers of diverse prediction problems.

7.2 Related Work

Multitask learning has its roots in clinical prediction: [41] used future lab values as auxiliary targets during training to improve prediction of mortality among pneumonia patients. Subsequent work on multitask learning also found applications in healthcare, where correlations and shared hidden structure between related tasks can improve prediction performance, especially in the presence of limited training data. [207] and [50] both encode predefined similarity between diseases (based on an ontology) into an explicit graph Laplacian regularizer for mortality prediction and phenotype classification. [178] and [243] formulate phenotyping as multi-label classification, using neural networks to implicitly capture co-morbidities in hidden layers. [241] address the problem of small drug discovery data sets by training a single neural net on 259 separate data sets, with a separate output layer for each task, and show that performance improves steadily as the number of tasks increases. [324] introduce a general framework for learning a mixture of regression and classification models with joint feature selection and apply it to gene expression data. [204] apply a similar framework to jointly solving thirteen related clinical tasks, including predicting mortality and length of stay. In computer vision, [170] and [84] both showed that using a single model to perform joint detection (classification) and pose estimation (regression) improves generalization on both tasks. However, none of this work addresses problem settings where sequential or temporal structure varies across tasks. The closest work in spirit to ours is by [63], who use a single convolutional network to perform a variety of natural language tasks (part-of-speech tagging, named entity recognition, and language modeling) with diverse sequential structure.

Several recent works have considered multiple clinical tasks. [233] use an adversarial recurrent framework to perform domain adaptation between different patient age groups. They apply this model to both mortality prediction and diagnostic classification, but use separate data sets for each task.
[234] evaluate RNN-based architectures on a similar set of clinical prediction tasks to ours but do not investigate multitask learning. Similarly, [238] showed that RNN architectures perform well for a set of tasks similar to ours (inpatient mortality, 30-day unplanned readmission, long length of stay, and discharge diagnoses), but there are key differences. First, they use longitudinal EHRs consisting primarily of codes and text, whereas we work with moderately dense time series from hospitalized patients. Next, unlike our work, their tasks are not heterogeneous: each is posed as binary classification, and all are predicted at the same time points early in admission. Finally, they do not train a multitask architecture.

7.3 Multitask Clinical Prediction Data Set

In our experiments, we use a benchmark data set derived from the MIMIC-III database and described in detail in Chapter 10. It includes 42,276 ICU stays from the Beth Israel Deaconess Hospital, 17 physiologic time series representing a subset of the variables from the PhysioNet/CinC Challenge 2012 [273], and patient characteristics such as height and age. The overall benchmark has a fixed test set of 15% (5,070) of patients, comprising 6,328 ICU stays. The benchmark includes four heterogeneous prediction tasks:

In-hospital mortality involves quantifying a patient's risk of dying before being discharged from the hospital. This is a binary classification problem, predicted once at 48 hours after admission. Because this task requires at least 48 hours of data, the mortality benchmark uses only a subset of the full data set. The training and test sets have 17,903 and 3,236 ICU stays, respectively. We report area under the receiver operating characteristic curve (AUC-ROC), the most commonly used metric in the literature, and area under the precision-recall curve (AUC-PR), which is often more informative when dealing with imbalanced classes.

Decompensation involves detecting patients whose physiologic condition is rapidly deteriorating and who are at risk of imminent death. This is a binary classification problem, predicted once per hour starting at four hours after admission. The training and test splits comprise 2,908,414 and 523,208 instances (hours eligible for predicting decompensation), respectively. We use the same metrics for decompensation as for mortality, i.e., AUC-ROC and AUC-PR, micro-averaged across all time points.

Length of stay (LOS) involves forecasting the total duration of a patient's stay at the hospital. This is originally a regression problem, but we also experiment with a multiclass classification version in which we bucket LOS into ten windows ranging from very short (less than one day) to very long (longer than two weeks). Like decompensation, we predict LOS once per hour starting at four hours after admission. We use mean absolute deviation (MAD) and Cohen's linear weighted kappa [60, 36] as metrics for the regression and classification versions of the problem, respectively.

Phenotype classification is structurally equivalent to the learning to diagnose problem described in Chapter 5, but with fewer disease classes, each representing a coherent group of ICD-9 codes in the HCUP CCS taxonomy [4]. This is a multilabel classification problem predicted at the end of each ICU stay, and like [178], we report macro- and micro-averaged AUC-ROC.
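The ten-class LOS discretization described above is simple to reproduce. The sketch below assumes the common convention of one bucket for stays under a day, one per day through the first week, one for one to two weeks, and one for longer stays; the exact bucket edges are defined in the released benchmark code, so treat these edges as illustrative.

import numpy as np

# Assumed bucket edges in days: <1, 1-2, ..., 7-8, 8-14, and >=14 (10 classes).
BUCKET_EDGES_DAYS = [1, 2, 3, 4, 5, 6, 7, 8, 14]

def los_bucket(remaining_los_hours):
    """Map a remaining length of stay (in hours) to one of ten classes."""
    days = remaining_los_hours / 24.0
    return int(np.digitize(days, BUCKET_EDGES_DAYS))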
7.4 Methods

In this section, we describe a heterogeneous multitask LSTM architecture tailored to solve four clinical prediction tasks that vary with respect to output type and time. We begin by reviewing the basic definition of the LSTM architecture, our data preprocessing, and the loss function for each task. We then describe an atypical channel-wise variant of the LSTM that processes each variable separately, a deep supervision training strategy, and finally our heterogeneous multitask architecture.

We begin by briefly revisiting the fundamentals of long short-term memory recurrent neural networks (LSTM RNNs) [131] and introducing notation for the benchmark prediction tasks in order to describe our LSTM based models. The LSTM is a type of RNN designed to capture long term dependencies in sequential data. The LSTM takes a sequence {x_t}_{t=1}^T of length T as its input and outputs a T-long sequence {h_t}_{t=1}^T of hidden state vectors using the following equations:

i_t = σ(x_t W_xi + h_{t-1} W_hi)
f_t = σ(x_t W_xf + h_{t-1} W_hf)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(x_t W_xc + h_{t-1} W_hc + b_c)
o_t = σ(x_t W_xo + h_{t-1} W_ho + b_o)
h_t = o_t ⊙ tanh(c_t)

The σ (sigmoid) and tanh functions are applied element-wise. Like [178], we do not use peephole connections [97]. The W matrices and b vectors are the trainable parameters of the LSTM. Later we will use h_t = LSTM(x_t, h_{t-1}) as a shorthand for the above equations. Following [224], we apply dropout on non-recurrent connections between LSTM layers and before outputs.

For LSTM based models we re-sample the time series into regularly spaced intervals. If there are multiple measurements of the same variable in the same interval, we use the value of the last measurement. We impute the missing values using the previous value if it exists and a pre-specified "normal" value otherwise. (The defined normal values we used can be found at our project GitHub repository.) In addition, we also provide a binary mask input for each variable indicating the time steps that contain a true (vs. imputed) measurement [179]. Categorical variables are encoded using a one-hot vector. Numeric inputs are standardized by subtracting the mean and dividing by the standard deviation. The statistics are calculated per variable after imputation of missing values.

After the discretization and standardization steps we get 17 pairs of time series for each ICU stay: ({μ_t^(i)}_{t=1}^T, {c_t^(i)}_{t=1}^T), where μ_t^(i) is a binary variable indicating whether variable i was observed at time step t and c_t^(i) is the value (observed or imputed) of variable i at time step t. By {x_t}_{t=1}^T we denote the concatenation of all {μ_t^(i)}_{t=1}^T and {c_t^(i)}_{t=1}^T time series, where concatenation is done across the axis of variables. In all our experiments x_t becomes a vector of length 76.

We also have a set of targets for each stay: {d_t}_{t=1}^T, where d_t ∈ {0, 1} is a list of T binary labels for decompensation, one for each hour; m ∈ {0, 1} is a single binary label indicating whether the patient died in-hospital; {ℓ_t}_{t=1}^T, where ℓ_t ∈ R is a list of real-valued numbers indicating remaining length of stay (hours until discharge) at each time step; and p_{1:K} ∈ {0, 1}^K is a vector of K binary phenotype labels. When training our models to predict length of stay, we instead use a set of categorical labels {l_t}_{t=1}^T, where l_t ∈ {1, ..., 10} indicates in which of the ten length-of-stay buckets ℓ_t belongs.
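The discretization and imputation step described above is illustrated for a single variable in the sketch below. Variable names, the hourly bin length argument, and the normal_value fallback are illustrative stand-ins; the released benchmark code remains the authoritative implementation.

import numpy as np

def discretize_variable(times, values, n_hours, normal_value, bin_hours=1.0):
    """Resample one variable onto a regular grid with imputation and a mask.

    times, values: raw measurement timestamps (hours from admission) and values.
    n_hours: length of the ICU stay to cover.
    normal_value: fallback used before the first observed measurement.
    Returns (c, mu): per-bin imputed values and a 0/1 mask of true measurements.
    """
    n_bins = int(np.ceil(n_hours / bin_hours))
    c = np.empty(n_bins)
    mu = np.zeros(n_bins, dtype=int)
    last = normal_value
    for t in range(n_bins):
        lo, hi = t * bin_hours, (t + 1) * bin_hours
        in_bin = [v for ts, v in zip(times, values) if lo <= ts < hi]
        if in_bin:
            last = in_bin[-1]        # keep the last measurement in the bin
            mu[t] = 1
        c[t] = last                  # forward-fill, or the normal value initially
    return c, mu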
When used in the context of equations (as the output of a softmax or in a loss function, for example), we will interpret l_t as a one-of-ten (one-hot) binary vector, indexing the i-th entry as l_{ti}. Note that because task-specific filters are applied in the creation of the benchmark tasks, we may have situations where, for any given stay, one or more targets (m or d_t) are missing for some time steps. Without abusing the notation in our equations, we will assume that all targets are present. In the code, missing targets are discarded.

We next detail the notation for samples (or instances) in each benchmark task, as this notation will be useful in describing how different baselines work. Each instance of the in-hospital mortality prediction task is a pair ({x_t}_{t=1}^{48}, m), where {x_t}_{t=1}^{48} is the matrix of clinical observations of the first 48 hours of the ICU stay and m is the label. An instance of decompensation or LOS is a pair ({x_t}_{t=1}^τ, y), where {x_t}_{t=1}^τ is the matrix of clinical observations of the first τ hours of the stay and y is either d_τ, ℓ_τ or l_τ. Each instance of phenotype classification is a pair ({x_t}_{t=1}^T, p_{1:K}), where {x_t}_{t=1}^T is the matrix of observations of the whole ICU stay and p_{1:K} are the phenotype labels.

7.4.1 Per-task output layers

Our first LSTM based architecture takes an instance ({x_t}_{t=1}^T, y) of a prediction task and uses a single LSTM layer to process the input: h_t = LSTM(x_t, h_{t-1}). We design an LSTM architecture for each clinical prediction task by adding a task-specific output layer:

d̂_T = σ(w^(d) h_T + b^(d))
m̂ = σ(w^(m) h_T + b^(m))
ℓ̂_T = relu(w^(ℓ) h_T + b^(ℓ))
l̂_T = softmax(W^(l) h_T + b^(l))
p̂_k = σ(W^(p)_{k,:} h_T + b^(p)_k)

where y is d_T, m, ℓ_T, l_T or p_{1:K}, respectively. The corresponding loss functions we use to train each task-specific model are listed below:

L_d = CE(d_T, d̂_T)
L_m = CE(m, m̂)
L_ℓ = (ℓ̂_T − ℓ_T)^2
L_l = MCE(l_T, l̂_T)
L_p = (1/K) Σ_{k=1}^K CE(p_k, p̂_k)

where CE(y, ŷ) is the binary cross entropy and MCE(y, ŷ) is the multiclass cross entropy defined over C classes:

CE(y, ŷ) = −(y log(ŷ) + (1 − y) log(1 − ŷ))
MCE(y, ŷ) = −Σ_{k=1}^C y_k log(ŷ_k)

7.4.2 Channel-wise LSTM

We also investigate a modified LSTM architecture that we call a channel-wise LSTM. While the standard LSTM network works directly on the concatenation {x_t}_{t=1}^T of the time series, the channel-wise LSTM processes the measurement values and indicators ({μ_t^(i)}_{t=1}^T, {c_t^(i)}_{t=1}^T) for each variable i independently, using a separate bidirectional LSTM layer [258] for each variable. The outputs of these LSTM layers are then concatenated and fed to another LSTM layer:

p_t^(i) = LSTM([μ_t^(i); c_t^(i)], p_{t-1}^(i))
q_t^(i) = LSTM([μ̃_t^(i); c̃_t^(i)], q_{t+1}^(i))
u_t = [p_t^(1); q_t^(1); ...; p_t^(17); q_t^(17)]
h_t = LSTM(u_t, h_{t-1})

where x̃_t denotes the t-th element of the reversed sequence {x_t}_{t=T}^{1}. The output layers and loss functions for each task are the same as those in the standard LSTM baseline.
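To make the task-specific heads of Section 7.4.1 concrete, here is a small numpy sketch of the five output layers applied to a final hidden state h_T. Parameter names and shapes are illustrative; in practice these are trainable layers inside the network rather than standalone functions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def binary_ce(y, y_hat, eps=1e-7):
    """Binary cross entropy CE(y, y_hat) as defined above."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def task_heads(h_T, params):
    """Per-task outputs from the last hidden state; params holds one
    weight matrix/vector pair per head."""
    d_hat = sigmoid(params["w_d"] @ h_T + params["b_d"])            # decompensation
    m_hat = sigmoid(params["w_m"] @ h_T + params["b_m"])            # in-hospital mortality
    los_hat = np.maximum(0.0, params["w_l"] @ h_T + params["b_l"])  # LOS regression (relu)
    los_cls = softmax(params["W_lc"] @ h_T + params["b_lc"])        # 10 LOS buckets
    p_hat = sigmoid(params["W_p"] @ h_T + params["b_p"])            # K phenotype probabilities
    return d_hat, m_hat, los_hat, los_cls, p_hat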
The intuition behind the channel-wise architecture is two-fold. First, it separates the learning of single-variable temporal correlations from the learning of multivariate relationships. This restricts the hypothesis space we search during learning by excluding models that try to capture multivariate and temporal correlations concurrently. Moreover, the channel-wise layer facilitates incorporation of missing data information by explicitly specifying which indicators correspond to which variables. In contrast, a full LSTM must search over a factorial number of possible matches between indicators and variables. Second, it also imposes a useful form of prior knowledge, encouraging the model to look for patterns similar to the kinds of higher-level features that practitioners frequently use to represent clinical time series, e.g., max and average heart rate and blood pressure over a window of time. This way the model can learn to store some useful information related to only that particular variable. For example, it can learn to store the maximum heart rate or the average blood pressure from earlier time steps. Note that this channel-wise module can be used as a replacement for the input layer in any neural architecture which takes the concatenation of time series of different variables as its input.

7.4.3 Deep supervision

We have thus far discussed only models that predict the target at the last step. During learning this means that supervision comes exclusively from the last time step, requiring the model to pass information across many time steps. We propose two alternative methods where we provide supervision at each time step. We refer broadly to both approaches as "deep supervision."

For the in-hospital mortality and phenotype prediction tasks we use target replication [178], which requires the model to accurately predict the sequence target at every time step. The target replication loss functions for mortality and phenotype classification become:

L_m = (1 − α) CE(m, m̂_T) + (α/T) Σ_{t=1}^T CE(m, m̂_t)
L_p = (1/K) Σ_{k=1}^K [ (1 − α) CE(p_k, p̂_{T,k}) + (α/T) Σ_{t=1}^T CE(p_k, p̂_{t,k}) ]

where α ∈ [0, 1] is a hyperparameter that controls the strength of the target replication term in the loss functions, m̂_t is the mortality prediction at time step t, and p̂_{t,k} is the prediction of the k-th phenotype at time step t.

For decompensation and LOS, each time step has a corresponding prediction target indexed by t, e.g., d_t or l_t. To perform deep supervision, rather than create a separate training instance for each time step in a single ICU stay, we can instead group these samples and predict them in a single pass. The loss functions for this approach to deep supervision become:

L_d = (1/T) Σ_{t=1}^T CE(d_t, d̂_t)
L_ℓ = (1/T) Σ_{t=1}^T (ℓ̂_t − ℓ_t)^2
L_l = (1/T) Σ_{t=1}^T MCE(l_t, l̂_t)

Note that whenever we group the instances of a single ICU stay, we use simple left-to-right LSTMs instead of bidirectional LSTMs, so that data from future time steps is not used.

7.4.4 Multitask learning LSTM

Figure 7.1: Correlations between task labels.

Another natural question to ask is whether we can train a single model to predict mortality, decompensation, LOS, and phenotypes simultaneously, rather than having a separate model for each. This is called multitask learning. Correlations between the targets of the different tasks are presented in Figure 7.1. The combined multitask loss is a weighted sum of task-specific losses, i.e.,

L_mt = λ_d L_d + λ_l L_l + λ_m L_m + λ_p L_p

where the task weights λ are non-negative. When working with raw LOS, we replace L_l with L_ℓ in the multitask loss function.

Figure 7.2: LSTM-based network architecture for multitask learning.

In multitask learning, we group the instances coming from a single ICU stay and predict all targets associated with a single ICU stay jointly. This means we use deep supervision for decompensation and LOS. We are free to choose whether we want to use target replication for the in-hospital mortality and phenotype prediction tasks.
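The combined objective is simply the weighted sum just defined. A minimal sketch follows; the default weights shown are one of the configurations tried in our grid, and the handling of stays whose mortality or decompensation targets are filtered out is simplified here to skipping the missing task.

def multitask_loss(losses, weights={"d": 1.0, "m": 0.2, "l": 1.5, "p": 1.0}):
    """Weighted sum of per-task losses (Section 7.4.4).

    losses: dict with keys "d" (decompensation), "m" (mortality),
            "l" (LOS), "p" (phenotypes); a missing key means the target was
            filtered out for this stay and that task is skipped.
    weights: non-negative task weights (lambda_d, lambda_m, lambda_l, lambda_p).
    """
    return sum(weights[k] * v for k, v in losses.items())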
For in-hospital mortality, we consider only the first 48 time steps {x_t}_{t=1}^{t_m} and predict in-hospital mortality m̂ at t_m = 48 by adding a multilayer perceptron output layer with sigmoid activation that takes only h_{t_m} as its input. For decompensation, we consider the full sequence {x_t}_{t=1}^T and generate a sequence of mortality predictions {d̂_t}_{t=1}^T by adding an MLP output at every step. For phenotyping, we consider the full sequence but predict the phenotypes p̂ only at the last time step T by adding a multi-label MLP with sigmoid activations at the last time step. Similar to decompensation, we predict LOS by adding an MLP output at each time step, but we experimented with two different types of output layers. The first is a dense linear output MLP that makes a real-valued prediction ℓ̂_t. The second is a softmax output MLP with ten neurons that outputs a distribution over the ten LOS buckets l̂_t. The full multitask LSTM architecture is illustrated in Fig. 7.2.

7.5 Experiments

7.5.1 Logistic regression baseline

For our logistic regression baselines, we use a more elaborate version of the hand-engineered features described in [178]: for each variable, we compute six different sample statistic features on seven different subsequences of a given time series. For mortality, this includes only the first 48 hours, while for phenotyping this includes the entire ICU stay. For each instance of decompensation and LOS, this includes the entire timespan of the instance. The per-subsequence features include minimum, maximum, mean, standard deviation, skew and number of measurements. The seven subsequences include the full time series, the first 10% of time, first 25% of time, first 50% of time, last 50% of time, last 25% of time, and last 10% of time. In total, we obtain 17 × 7 × 6 = 714 features per time series. We train a separate logistic regression classifier for each of mortality, decompensation, and the 25 phenotypes. For LOS, we trained a softmax regression model to solve the 10-class bucketed LOS problem.

7.5.2 Experimental setup

During training, we split the benchmark training set into training and validation splits containing 85% and 15% of ICU stays, respectively. We used a grid search to tune all hyperparameters based on validation set performance. The best model for each baseline is chosen according to performance on the validation set. The final scores are reported on the benchmark test set, which we used sparingly during model development in order to avoid unintentional test set leakage. The only hyperparameters of the logistic/linear regression models are the coefficients of the L1 and L2 regularizations.

When discretizing the data for LSTM-based models, we set the length of the regularly spaced intervals to 1 hour. This gives a reasonable balance between the amount of missing data and the number of measurements of the same variable that fall into the same interval. This choice also agrees with the rate of sampling prediction instances for the decompensation and LOS prediction tasks. We also tried intervals of length 0.8 hours, but there was no improvement in the results. For LSTM-based models, hyperparameters include the number of memory cells in the LSTM layers, the dropout rate, and whether to use one or two LSTM layers. Channel-wise LSTM models have one more hyperparameter: the number of units in the channel-wise LSTMs (all 17 LSTMs having the same number of units). Whenever target replication is enabled we set α = 0.5 in the corresponding loss function.
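For readers who prefer code to equations, here is a minimal sketch of the channel-wise module of Section 7.4.2, written in PyTorch for concreteness (the layer sizes, class name, and framework are illustrative choices, not the implementation used in our experiments). Each variable contributes a (mask, value) pair per time step, is processed by its own small bidirectional LSTM, and the concatenated per-variable outputs feed a shared LSTM.

import torch
import torch.nn as nn

class ChannelWiseLSTM(nn.Module):
    def __init__(self, n_vars=17, per_var_units=16, top_units=128):
        super().__init__()
        # One small bidirectional LSTM per variable; its input is [mask, value].
        self.per_var = nn.ModuleList([
            nn.LSTM(input_size=2, hidden_size=per_var_units,
                    bidirectional=True, batch_first=True)
            for _ in range(n_vars)
        ])
        self.top = nn.LSTM(input_size=2 * per_var_units * n_vars,
                           hidden_size=top_units, batch_first=True)

    def forward(self, masks, values):
        # masks, values: tensors of shape (batch, time, n_vars)
        outs = []
        for i, lstm in enumerate(self.per_var):
            xi = torch.stack([masks[..., i], values[..., i]], dim=-1)
            oi, _ = lstm(xi)            # (batch, time, 2 * per_var_units)
            outs.append(oi)
        u = torch.cat(outs, dim=-1)     # concatenate across variables
        h, _ = self.top(u)              # shared layer over all variables
        return h                        # per-step hidden states for the task heads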
For multitask models we have 4 more hyperparameters: the weights λ_d, λ_m, λ_l and λ_p in the loss function. We did not do a full grid search for tuning these hyperparameters. Instead we tried 5 different values of (λ_d, λ_m, λ_l, λ_p): (1, 1, 1, 1), (4, 2.5, 0.3, 1), (1, 0.4, 3, 1), (1, 0.2, 1.5, 1) and (0.1, 0.1, 0.5, 1). The first assigns the same weight to each task, while the second tries to make the 4 summands of the loss function approximately equal. The three remaining combinations were selected by looking at the learning speed of each task. All LSTM-based models were trained using ADAM [154] with a 10^-3 learning rate and β_1 = 0.9.

Model AUC-ROC AUC-PR
SAPS 0.7200 (0.7197, 0.7203) 0.3013 (0.3008, 0.3018)
APS-III 0.7500 (0.7497, 0.7503) 0.3568 (0.3563, 0.3573)
OASIS 0.7603 (0.7601, 0.7606) 0.3115 (0.3110, 0.3119)
SAPS-II 0.7768 (0.7765, 0.7770) 0.3762 (0.3757, 0.3767)
LR 0.8485 (0.8279, 0.8682) 0.4744 (0.4188, 0.5293)
S 0.8547 (0.8349, 0.8732) 0.4848 (0.4308, 0.5372)
S + DS 0.8558 (0.8362, 0.8750) 0.4928 (0.4379, 0.5486)
C 0.8623 (0.8436, 0.8807) 0.5153 (0.4640, 0.5680)
C + DS 0.8543 (0.8340, 0.8734) 0.5023 (0.4472, 0.5544)
MS 0.8607 (0.8416, 0.8784) 0.4933 (0.4388, 0.5482)
MC 0.8702 (0.8523, 0.8872) 0.5328 (0.4797, 0.5835)
Figure 7.3: Results for in-hospital mortality prediction task.

After model selection we evaluate the selected models on the test set of the corresponding tasks. Since the test score is an estimate of model performance on unseen examples, we use bootstrapping to estimate confidence intervals of the score. To estimate a 95% confidence interval we resample the test set K times, calculate the score on the resampled sets, and use the 2.5 and 97.5 percentiles of these scores as our confidence interval estimate. For in-hospital mortality and phenotype prediction K is 10,000, while for decompensation and LOS prediction K is 1,000, since the test sets of these tasks are much bigger.

In all figures and tables we use the following abbreviations. LR stands for logistic regression and LinR stands for linear regression. Standard LSTM models are denoted with the letter S, while channel-wise LSTM models are denoted with the letter C. Multitask versions of these LSTM models are denoted with MS and MC, respectively. DS stands for deep supervision.

Model AUC-ROC AUC-PR
LR 0.8700 (0.8666, 0.8734) 0.2138 (0.2054, 0.2227)
S 0.8917 (0.8886, 0.8950) 0.3235 (0.3135, 0.3326)
S + DS 0.9039 (0.9008, 0.9069) 0.3247 (0.3142, 0.3349)
C 0.9056 (0.9026, 0.9086) 0.3334 (0.3233, 0.3440)
C + DS 0.9106 (0.9076, 0.9133) 0.3445 (0.3341, 0.3541)
MS 0.9043 (0.9016, 0.9071) 0.3212 (0.3120, 0.3307)
MC 0.9050 (0.9021, 0.9079) 0.3172 (0.3071, 0.3278)
Figure 7.4: Results for decompensation prediction task.
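The bootstrap procedure used for the confidence intervals above is straightforward; a sketch follows, where the metric function, K, and the random seed are parameters and the exact evaluation script may differ in details.

import numpy as np

def bootstrap_ci(y_true, y_pred, metric, K=1000, seed=0):
    """Estimate a 95% CI for `metric` by resampling the test set K times."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = []
    for _ in range(K):
        idx = rng.integers(0, n, size=n)        # resample with replacement
        scores.append(metric(y_true[idx], y_pred[idx]))
    return np.percentile(scores, 2.5), np.percentile(scores, 97.5)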
The results for each of the mortality, decompensation, LOS, and phenotyping tasks are reported in Figures 7.3, 7.4, 7.5, and 7.6, respectively. Each of the figures consists of three parts.

1. The first subfigure is a table that lists the values of the metrics for all models along with 95% confidence intervals obtained by bootstrapping the test set.

2. The second subfigure visualizes the confidence intervals for one of the metrics. The black circle corresponds to the mean of the K values. (The difference between the mean of the K values and the value on the original test set is not more than 0.01% for all measured metrics.) The thick black line shows the standard deviation and the narrow grey line shows the 95% confidence interval.

3. The third subfigure shows the significance of the differences between the models. We resample the test set K times with repetition and count the number of times the i-th model performed better than the j-th model (denoted by c_{i,j}). The cell in the i-th row and the j-th column of the table shows the percentage of c_{i,j} in K. We say that the i-th model is significantly better than the j-th model if c_{i,j}/K > 0.95 and highlight the corresponding cell of the table.

We first note that LSTM-based models outperformed linear models by substantial margins across all metrics on every task. The difference is significant in every case except three out of six LSTM models for in-hospital mortality. This is consistent with previous research comparing neural networks to linear models for mortality prediction [58] and phenotyping [178], but it is nonetheless noteworthy because questions still remain about the potential effectiveness of deep learning for health data, especially given the often modest size of the data relative to their complexity. Our results provide further evidence that complex architectures can be effectively trained on non-Internet scale health data and that while challenges like overfitting persist, they can be mitigated with careful regularization schemes, including dropout and multitask learning.

Our experiments also show that channel-wise LSTMs and multitask training act as regularizers for almost all tasks. Channel-wise LSTMs perform significantly better than standard LSTMs for all four tasks, while multitasking helps for all tasks except phenotyping (the difference is significant for the decompensation and LOS prediction tasks). We hypothesize that this is because phenotype classification is itself a multitask problem and already benefits from regularization by sharing LSTM layers across the 25 different phenotypes. The addition of further tasks with loss weighting may limit the multitask LSTM's ability to effectively learn to recognize individual phenotypes. (Note that the hyperparameter search for multitask models did not include zero coefficients for any of the four tasks, which is why the best multitask models sometimes perform worse than single-task models.)

The combination of the channel-wise layer and multitasking is also useful. Multitask versions of channel-wise LSTMs perform significantly better than the corresponding single-task versions for the in-hospital mortality prediction and phenotyping tasks.

Deep supervision with replicated targets did not help for in-hospital mortality prediction. For phenotyping, it helped for the standard LSTM model (as discovered in earlier work [178]), but did not help for channel-wise models. On the other hand, we see significant improvements from deep supervision for the decompensation and LOS prediction tasks (except for the standard LSTM model for LOS prediction). For both of these tasks the winning models are channel-wise LSTMs with deep supervision. For decompensation, the winner is significantly better than all other models, and for LOS the winner is significantly better than all others except the runner-up model, which is a multitask standard LSTM.
7.5.3 Hyperparameters

For in-hospital mortality and decompensation prediction, the best performing logistic regression used L2 regularization with C = 0.001. For phenotype prediction, the best performing logistic regression used L1 regularization with C = 0.1. For LOS prediction, the best performing logistic regression used L2 regularization with C = 10^-5.

The best values of the hyperparameters of LSTM-based models vary across the tasks. (Full information about the hyperparameters of all baselines can be found at https://github.com/YerevaNN/mimic3-benchmarks/blob/master/mimic3models/pretrained_models.md.) Generally, we noticed that dropout helps a lot to reduce overfitting. In fact, all LSTM-based baselines for the in-hospital mortality prediction task (where the problem of overfitting is the most severe) use 30% dropout.

In the multitasking setup overfitting was a serious problem, with mortality and decompensation prediction validation performance degrading faster than the others. All of the best multitask baselines use either (1, 0.2, 1.5, 1) or (0.1, 0.1, 0.5, 1) for (λ_d, λ_m, λ_l, λ_p). The first configuration performed best for the in-hospital mortality, decompensation and LOS prediction tasks, whereas the second configuration was better for the phenotype prediction task. The fact that λ_d, λ_m and λ_l of the best multitask baselines for the phenotype prediction task are relatively small supports the hypothesis that additional multitasking hurts performance on the phenotype prediction task.

7.6 Conclusion

In this chapter, we proposed a novel multitask RNN architecture designed for tasks that have both heterogeneous types and varied temporal structure. We then applied it to four varied clinical problems that include a mixture of classification and regression tasks made early in a hospital stay, at discharge, and continuously. We also studied additional RNN variants that used a form of deep supervision and channel-wise LSTM cells for processing input channels independently.

Model Kappa MAD
LR 0.4024 (0.4006, 0.4043) 162.32 (161.83, 162.85)
S 0.4382 (0.4365, 0.4400) 123.10 (122.65, 123.52)
S + DS 0.4315 (0.4297, 0.4334) 110.91 (110.46, 111.36)
C 0.4421 (0.4403, 0.4438) 136.59 (136.11, 137.10)
C + DS 0.4508 (0.4490, 0.4527) 143.14 (142.65, 143.62)
MS 0.4503 (0.4486, 0.4520) 112.00 (111.55, 112.46)
MC 0.4496 (0.4478, 0.4514) 122.83 (122.34, 123.27)
Figure 7.5: Results for length of stay prediction task.

Our results demonstrate that the phenotyping and length-of-stay prediction tasks are more challenging and require larger model architectures than the mortality and decompensation prediction tasks. Even small LSTM models easily overfit the latter two problems. We also demonstrated that the proposed multitask learning architecture allows us to extract certain useful information from the input sequence that single-task models could not leverage, which explains the better performance of the multitask LSTM in some settings.
We did not, however, find any significant benefit in using multitask learning for the phenotyping task.

We are interested in further investigating the practical challenges of multitask training. In particular, for our four very different tasks, the model converges and then overfits at very different rates during training. This is often addressed through the use of heuristics, including a multitask variant of early stopping, in which we identify the best epoch for each task based on individual task validation loss. We proposed the use of per-task loss weighting, which reduced the problem but did not fully mitigate it. One promising direction is to dynamically adapt these coefficients during training, similar to the adaptation of learning rates in optimizers.

Model Macro AUC-ROC Micro AUC-ROC
LR 0.7385 (0.7338, 0.7431) 0.7995 (0.7961, 0.8030)
S 0.7702 (0.7657, 0.7747) 0.8212 (0.8177, 0.8246)
S + DS 0.7737 (0.7692, 0.7781) 0.8231 (0.8196, 0.8265)
C 0.7764 (0.7719, 0.7806) 0.8251 (0.8217, 0.8284)
C + DS 0.7730 (0.7688, 0.7773) 0.8224 (0.8191, 0.8257)
MS 0.7678 (0.7633, 0.7723) 0.8183 (0.8148, 0.8218)
MC 0.7741 (0.7697, 0.7784) 0.8227 (0.8192, 0.8261)
Figure 7.6: Results for phenotyping task.

Part III
Learning to Diagnose Without Cleanly Labeled Data

Chapter 8
Accelerating Active Learning with Transfer Learning

Training supervised machine learning models to diagnose requires the availability of large labeled data sets, but ground truth diagnoses are not routinely recorded during care. A common solution to this problem is active learning, in which we interactively use a prediction model to select a limited number of records to annotate. However, active learning performs no better than choosing at random when starting with no labeled data, a problem we refer to as the cold start problem. In this chapter, we address cold starts by combining active learning with transfer learning, which utilizes labeled data from a different but related problem to help learn our primary task. We propose a principled framework for accelerating active learning with transfer learning and perform a theoretical analysis that provides insights into the algorithm's behavior and the problem in general. We then present experimental results using several well-known transfer learning data sets that confirm our theoretical analysis.

8.1 Introduction

In the age of the "data tsunami," we are confronted with a central challenge: how do we efficiently and effectively learn from massive amounts of data? Supervised learning remains the dominant learning paradigm for many practical problems, and many supervised learning problems can be formulated as classification. Learning a classifier requires class labels, which can be difficult or expensive to acquire in large quantities. In response to this dilemma, researchers have developed active learning. An active learner is given access to an (often human) oracle that can label data, a limited budget to spend on acquiring labels, and the freedom to choose which observations to label [259]. The goal of active learning is to build an effective classifier with as few label queries as possible.
Recent theoretical breakthroughs have produced active learning algorithms that are practical and have strong statistical consistency and unbiased sampling guarantees [29]. Nevertheless, there remain significant barriers to wider adoption of active learning. One challenge that has both practical and theoretical implications is the cold start phenomenon. Active learning requires a good classifier to generate useful label queries; training a good classifier requires labeled data. If the active learner begins with zero labeled data, then it must query labels at random until it has enough to train a good classifier. Thus, early in the query process or when the labeling budget is small, active learning offers little or no advantage over passive learning [74]. What is more, classifier performance (e.g., test set error) often improves slowly as a function of the number of label queries. The cold start problem has not been studied in earnest, although a number of approaches (e.g., cluster-based active learning) offer potential remedies [337].

Another promising solution to the cold start problem is transfer learning. The intuition behind transfer learning is that learning a new task should be easier if we transfer knowledge from previously learned tasks [290]. Related (or source) tasks often take the form of labeled data sets that are "similar" to our target task data. Examples include product reviews from different categories [31] or clinical trials data from different hospitals [323]. In these settings, straightforward supervised learning (train a model on source data, then apply it to the target task) often produces models that perform poorly. However, with a proper transfer learning framework, we can use source data to improve our ability to learn the new task, especially when little or no labeled target data is available.

This suggests a strategy for addressing the cold start problem in active learning: use transfer learning to initialize the active learner with data from a related task. In this way, the active learner begins with a classifier to guide early label queries, eliminating the need to query at random. If the transfer from the source task is effective, then the active learner should begin with a good classifier and require many fewer target label queries to improve it. This would mitigate the cold start problem. If transfer learning produces a poor classifier, the active learner may be forced to query many more target labels in order to recover. In this sense, transfer learning provides an initial bias to the active learner. A good framework for combining transfer and active learning should provide a way to measure the impact of this transfer-based bias on the active learner's behavior and performance.

In this chapter, we describe a simple, principled approach to transfer-initialized active learning, based on two relatively new frameworks for transfer learning [32] and active learning [29]. This approach is easy to implement and efficient, and it permits a theoretical analysis that provides insight into the interaction between these two learning paradigms. We derive a bound on the generalization error that relates target task performance to the similarity between the source and target tasks. We identify a trade-off between potential sources of error that can be exploited to produce effective transfer active learners. We present experimental results that confirm our theory and show that this approach accelerates active learning.
We conclude by identifying the most fruitful directions for future research.

8.2 Related Work

To our knowledge, only a handful of papers, most of them quite recent, have explored the combination of transfer learning and active learning [46, 261]. [249] combines uncertainty region sampling with several transfer learning concepts, including the use of a domain separation classifier trained to distinguish between unlabeled source and target samples. The authors provide convincing empirical results on a number of standard transfer learning tasks, as well as a simple analysis of label complexity and error rates. [49] describes a novel active transfer learning framework that combines sample re-weighting with batch mode active learning, which chooses all of its label queries simultaneously. What makes this approach especially interesting is that it uses a different set of criteria to select queries: diversity among labeled samples and distributional similarity between labeled and unlabeled target data. Their empirical results indicate that this approach can be used to build effective classifiers with a small number of target label queries.

Unfortunately, most of these approaches are heuristic in nature and lack consistency and sampling bias guarantees. The most notable exception is [323], which presents a theoretically rigorous Bayesian framework for active transfer learning based on prior-dependent learning. Assuming a prior distribution over target concepts (i.e., classifiers) greatly accelerates active learning, and the authors show that the prior is identifiable from a finite number of labeled examples in sequential multitask settings. The empirical effectiveness of this approach remains an open question.

8.3 Methods

Our approach to transfer active learning combines two principled learning frameworks. For transfer learning, we use a convex combination of source and target empirical risks [32]. For active learning, we use the importance weighted consistent active learner (IWAL CAL) algorithm [29]. We provide a brief overview of each and then describe how to combine them in order to address the cold start problem.

8.3.1 Transfer learning framework

Formally, we define a task or domain as a distribution $\mathcal{D}$ on a set of points $\mathcal{X}$ paired with a labeling function $f : \mathcal{X} \mapsto \mathcal{Y}$, where $\mathcal{Y} = \{\pm 1\}$. In transfer learning, we seek to transfer knowledge from a source domain $\langle \mathcal{D}_S, f_S \rangle$ to a target domain $\langle \mathcal{D}_T, f_T \rangle$ of interest. When learning, we search over a hypothesis space $\mathcal{H}$ for a function $h : \mathcal{X} \mapsto \mathcal{Y}$ that does a good job of predicting the true $f(x)$ for any point $x \in \mathcal{X}$. We measure the quality of a hypothesis $h$ by its risk relative to a domain (e.g., the target domain $T$): $\epsilon_T(h, f_T) = \mathbb{E}_{x \sim \mathcal{D}_T}[\mathbf{1}\{h(x) \neq f_T(x)\}]$, where $\mathbf{1}$ is the indicator function. The empirical risk of a hypothesis, relative to a finite sample $\{x_1, \ldots, x_n\}$, is defined as $\hat{\epsilon}_T(h, f_T, (x)_{1:n}) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{h(x_i) \neq f_T(x_i)\}$, where $(x)_{1:n}$ is a notational convenience. When it is clear from context, we will use shorthand such as $\epsilon_T(h)$ and $\hat{\epsilon}_T(h, n)$.

Our goal is to choose a hypothesis that minimizes the target risk ($h^* = \arg\min_{h \in \mathcal{H}} \epsilon_T(h)$), though this is impossible in practice. Instead we minimize a weighted sum of empirical risks

$$\hat{\epsilon}_\alpha(h) = \alpha \, \hat{\epsilon}_T(h, f, (x)_{1:n}) + (1 - \alpha) \, \hat{\epsilon}_S(h, f, (x)_{1:m})$$

with scalar weight $\alpha \in [0, 1]$. We assume access to $m > 0$ labeled source examples and $n \geq 0$ labeled target examples. This approach to transfer learning is attractive because of its simplicity and elegant theoretical properties.
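As a concrete illustration, this $\alpha$-weighted empirical risk can be minimized by handing per-instance weights to any learner that supports them. The sketch below is a minimal example assuming scikit-learn; the function name and weighting scheme are ours, not a prescribed implementation.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def fit_alpha_combined(X_src, y_src, X_tgt, y_tgt, alpha):
    """Minimize alpha * target empirical risk + (1 - alpha) * source empirical risk
    by weighting each source example by (1 - alpha) / m and each target example
    by alpha / n, then fitting any learner that accepts per-instance weights."""
    m, n = len(X_src), len(X_tgt)
    w_src = np.full(m, (1.0 - alpha) / m)
    w_tgt = np.full(n, alpha / max(n, 1))
    X = np.vstack([X_src, X_tgt])
    y = np.concatenate([y_src, y_tgt])
    w = np.concatenate([w_src, w_tgt])
    clf = SGDClassifier(loss="hinge")  # a linear model with hinge loss, as used later in our experiments
    clf.fit(X, y, sample_weight=w)
    return clf
```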
[32] derives an upper bound on the target generalization error of $\hat{h} = \arg\min_{h \in \mathcal{H}} \hat{\epsilon}_\alpha(h)$, the classifier that minimizes the combined empirical risk. This bound includes two particularly interesting terms that quantify the similarity between domains. The first is a hypothesis-dependent measure of the similarity between the source and target data distributions $\mathcal{D}_S$ and $\mathcal{D}_T$. Even if the domains share the same labeling function (i.e., $f_S = f_T$), training examples drawn from different distributions may produce different classifiers. We define the $d_{\mathcal{H}}$ distance between two distributions as

$$d_{\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) = 2 \sup_{h \in \mathcal{H}} \left| P_{\mathcal{D}_S}\{A_h\} - P_{\mathcal{D}_T}\{A_h\} \right|, \qquad A_h = \{x \in \mathcal{X} : h(x) = +1\}.$$

This is the maximum possible difference between the probability masses assigned by the two domains to a set $A_h$ of points classified as $+1$ by any hypothesis $h \in \mathcal{H}$. Now let $\mathcal{H}\Delta\mathcal{H} = \{g : g(x) = +1 \text{ if } h(x) \neq h'(x) \text{ for given } h, h' \in \mathcal{H}\}$ be the symmetric difference hypothesis space. Additionally, let $\epsilon_S(h, h')$ denote the disagreement between two hypotheses $h, h' \in \mathcal{H}$ about the labels of points drawn from $\mathcal{D}_S$ (and likewise $\epsilon_T$ for $\mathcal{D}_T$). Then we can define a distance $d_{\mathcal{H}\Delta\mathcal{H}}$ for which the following inequality holds for all $h, h' \in \mathcal{H}$:

$$\left| \epsilon_S(h, h') - \epsilon_T(h, h') \right| \leq \tfrac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T).$$

This distance upper bounds the difference between source-label and target-label disagreement between any two hypotheses $h, h' \in \mathcal{H}$. The $d_{\mathcal{H}\Delta\mathcal{H}}$ distance has two useful properties: first, for any $\mathcal{H}$ with finite VC dimension, it can be computed from finite unlabeled samples $\mathcal{U}_S \sim \mathcal{D}_S$ and $\mathcal{U}_T \sim \mathcal{D}_T$ [153]. Second, it can be approximated using a domain separator hypothesis, i.e., a classifier from $\mathcal{H}$ trained to separate $\mathcal{U}_S$ from $\mathcal{U}_T$ [22].

The second term of interest is the combined source and target risk $\epsilon_{ST} = \min_{h \in \mathcal{H}} \epsilon_S(h) + \epsilon_T(h)$. This can be thought of as a general measure of the similarity between the source and target domains. A small $\epsilon_{ST}$ implies the existence of a hypothesis $h \in \mathcal{H}$ that simultaneously minimizes source and target risk, which in turn implies minimal differences between the data distributions and labeling functions. This corresponds to the traditional transfer learning assumption that the domains are "sufficiently similar" [32]. We assume that $\epsilon_{ST}$ is negligible but acknowledge that this may not hold in real applications.

8.3.2 Active learning framework

IWAL CAL is an importance weighted mellow active learner designed for online settings: rather than choosing from a pool, it waits as points "arrive" in streaming fashion and queries each label with some probability. When a point's label is queried, it is assigned an importance weight inversely proportional to its query probability. Importance weights correct for the bias that accrues during selective sampling. After seeing $t$ points, we choose a classifier $h_t$ to minimize the importance weighted empirical risk $\epsilon(h, t) = \epsilon(h, f, (x, w)_{1:t}) = \frac{1}{t} \sum_{i=1}^{t} w_i \mathbf{1}\{h(x_i) \neq f(x_i)\}$. [29] proves that this is an unbiased estimate of the true risk and provides a deviation bound for it. Compared with aggressive active learners (e.g., uncertainty region sampling), mellow active learners often exhibit a slower rate of improvement in performance as a function of the number of queries. However, they have sounder theoretical properties and are more conducive to analysis [74].

Formally, for the $t$th unlabeled point $x_t$, IWAL CAL queries the label $y_t = f(x_t)$ with probability $p_t$, computed using a rejection threshold function $p((x, q, w)_{1:t-1}, x_t)$.
Here $q_i$ is a binary indicator of whether the $i$th label was queried, and $w_i = 1/q_i$ is the importance weight (bounded from above since $p_i > 0$). $(x, q, w)_{1:t}$ is a notational convenience for $\{(x_1, q_1, w_1), \ldots, (x_t, q_t, w_t)\}$. We can now redefine the risk over all points seen:

$$\epsilon(h, t) = \epsilon(h, f, (x, q, p)_{1:t}) = \frac{1}{t} \sum_{i=1}^{t} \frac{q_i}{p_i} \mathbf{1}\{h(x_i) \neq f(x_i)\}.$$

Unlabeled points have $q_i = 0$ and so are ignored in the error. This risk estimator is unbiased; notice that $\mathbb{E}[Q_i / p_i] = \mathbb{E}[Q_i] / p_i = p_i / p_i = 1$ [29].

After seeing $t - 1$ samples, IWAL CAL uses them to implicitly maintain a space $\mathcal{H}_{t-1}$ of candidate hypotheses that with high probability contains $h^*$, the optimal classifier in $\mathcal{H}$. The probability $p_t$ of querying the label for $x_t$ is inversely proportional to the level of disagreement in $\mathcal{H}_{t-1}$. The difference between the importance weighted empirical risk $\epsilon(h, t-1)$ and the true risk $\epsilon(h)$ is bounded, giving us a method to compute $p_t$. Let $h_{t-1} = \arg\min_{h \in \mathcal{H}} \epsilon(h, t-1)$ be the hypothesis that minimizes the importance weighted empirical error. Next let $h'_{t-1}$ be the hypothesis that minimizes this error but disagrees with $h_{t-1}$ on $x_t$'s label: $h'_{t-1}(x_t) \neq h_{t-1}(x_t)$. Then $G_t = \epsilon(h'_{t-1}, t-1) - \epsilon(h_{t-1}, t-1)$ is an estimate of the disagreement within $\mathcal{H}_{t-1}$ about the label of $x_t$. If $G_t$ exceeds the upper bound on disagreement, then $h_{t-1}$ likely agrees with $h^*$ on $x_t$, and so it is probably unnecessary to query $x_t$'s label. Thus, the label for $x_t$ is queried with probability

$$p_t \approx \min\left\{1, \left(\frac{1}{G_t^2} + \frac{1}{G_t}\right) \cdot \frac{C_0 \log t}{t-1}\right\}$$

where $C_0 = O(\log(|\mathcal{H}|/\delta))$ [29].

8.3.3 Transfer active learning

We now assume that an IWAL CAL active learner has access to $m$ labeled points from the source domain and has seen $t$ points from the target domain. We can define a new weighted empirical risk over these $m + t$ points, where the weights depend on $m$ and $t$, the $\alpha$ parameter, and the IWAL CAL importance weights $q_i / p_i$:

Definition 1. We define a combined weighted empirical risk for transfer-accelerated active learning as

$$\epsilon_\alpha(h, m, t) \triangleq \alpha\, \epsilon_T(h, t) + (1 - \alpha)\, \hat{\epsilon}_S(h, m)$$

or equivalently

$$\epsilon_\alpha(h, m, t) \triangleq \frac{1}{m+t} \sum_{i=1}^{m+t} w_i \mathbf{1}\{h(x_i) \neq f(x_i)\}, \qquad
w_i = \begin{cases}
\dfrac{(1-\alpha)(m+t)}{m} & i \leq m \text{ (source)} \\[1ex]
\dfrac{\alpha(m+t)}{t\, p_i} & i > m,\; q_i = 1 \text{ (labeled target)} \\[1ex]
0 & i > m,\; q_i = 0 \text{ (unlabeled target)}
\end{cases}$$

These two forms can be shown to be equivalent with a simple derivation. The first form is easier to analyze, allowing us to leverage the results from [29] for IWAL CAL and from [32] for transfer learning. The second form is easier to implement because it permits the use of any supervised learning routine that accepts individually weighted training data. Algorithm 1 shows pseudocode for our Transfer IWAL CAL (TIWAL CAL) algorithm. It uses the combined weighted empirical risk in Steps 3-6 of the algorithm. The upper bound on the disagreement within $\mathcal{H}_{t-1}$ is

$$Gbound_t = \alpha \left( \sqrt{\frac{C_0 \log t}{t-1}} + \frac{C_0 \log t}{t-1} \right) + (1 - \alpha) \sqrt{\frac{C_0 \log 2}{2m}}.$$

To obtain $p_t$, we solve the quadratic equation

$$G_t = \alpha \left[ \left( \frac{c_1}{\sqrt{p_t}} - c_1 + 1 \right) \sqrt{\frac{C_0 \log t}{t-1}} + \left( \frac{c_2}{p_t} - c_2 + 1 \right) \frac{C_0 \log t}{t-1} \right] + (1 - \alpha) \sqrt{\frac{C_0 \log 2}{2m}}.$$

The constants $C_0$, $c_1$, and $c_2$ can be treated as tunable parameters but are defined for the analysis as $C_0 = O(\log(|\mathcal{H}|/\delta))$, $c_1 = 5 + 2\sqrt{2}$, and $c_2 = 5$. This algorithm uses labeled source data to provide the active learner with a transfer-based bias that can be improved by labeling target data. In the next section, we show that TIWAL CAL's behavior and performance depend upon the similarity between the source and target domains and the value of $\alpha$.

8.3.4 Implementation

Here we provide a brief note on implementation, which learning theory papers far too often gloss over.
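Before turning to those practical details, the sketch below makes the weights from Definition 1 and the query rule concrete. It is an illustrative simplification with our own function names, using $c_1 = c_2 = 1$ and dropping the $\log t$ factors (as in the experiments of Section 8.5), not a verbatim implementation of Algorithm 1.

```python
import numpy as np

def combined_weights(m, t, alpha, p, q):
    """Per-instance weights from Definition 1 for m source points followed by
    t target points with query indicators q and query probabilities p."""
    w_src = np.full(m, (1.0 - alpha) * (m + t) / m)
    w_tgt = np.where(q == 1, alpha * (m + t) / (t * p), 0.0)
    return np.concatenate([w_src, w_tgt])

def query_probability(G_t, t, m, alpha, C0=0.25):
    """Approximately solve the threshold equation for p_t, assuming 0 < alpha <= 1.
    With c1 = c2 = 1 and log(t) dropped, the equation reduces to a quadratic in
    x = 1 / sqrt(p_t)."""
    eps_t = C0 / max(t - 1, 1)
    eps_s = np.sqrt(C0 * np.log(2) / (2 * m))
    gbound = alpha * (np.sqrt(eps_t) + eps_t) + (1 - alpha) * eps_s
    if G_t <= gbound:
        return 1.0
    # G_t = alpha * (sqrt(eps_t) * x + eps_t * x**2) + (1 - alpha) * eps_s
    a = alpha * eps_t
    b = alpha * np.sqrt(eps_t)
    c = G_t - (1 - alpha) * eps_s
    x = (-b + np.sqrt(b ** 2 + 4 * a * c)) / (2 * a)   # positive root
    return min(1.0, 1.0 / (x ** 2))
```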
This algorithm (and its related transfer and active learning algorithms) is relatively straightforward to implement. Definition 1 shows that the learning step can be implemented using any supervised learning method that allows individual instance weighting. Getting IWAL CAL to work in practice may require dropping various terms (mostly constants). Also, implementing step (5) of Algorithm 1 requires an approximation, since it defines a constrained empirical risk minimization. [29] use a decision tree for $\bar{h}_t$ and flip the label of the node to which $x_t$ is assigned to get $h'_t$. When using a support vector machine (SVM), we could add a constraint, assign $x_t$ a significantly higher weight, or adjust the decision threshold. The important thing to note is that these approximations add up: any implementation may not exhibit optimal behavior.

Algorithm 1 Transfer IWAL CAL
 1: for t = 1, 2, ... until target samples exhausted do
 2:   Receive unlabeled x_t
 3:   Compute weights (w)_{1:(t-1)} as in Definition 1
 4:   Choose h_{t-1} = argmin_{h in H} eps_alpha(h, t-1)
 5:   Choose h'_{t-1} = argmin_{h in H : h(x_t) != h_{t-1}(x_t)} eps_alpha(h, t-1)
 6:   Set G_t = eps_alpha(h'_{t-1}, t-1) - eps_alpha(h_{t-1}, t-1)
 7:   if G_t <= Gbound_t then
 8:     Set p_t = 1
 9:   else
10:     Solve for p_t (see above)
11:   end if
12:   Sample q_t ~ Bernoulli(p_t)
13:   if q_t = 1 then
14:     Query label y_t
15:     Set w_t = 1/p_t
16:   end if
17: end for
18: return h_t = argmin_{h in H} eps_alpha(h, t) computed over (x, y, q, 1/p)_{1:t}

8.4 Theory

For any active learning algorithm, we would like to answer two fundamental questions:

1. How will a classifier trained after our active learner has seen $t$ data points perform? Performance here usually means prediction error on held-out test data. Can we place upper and lower bounds on the expected error? A related question concerns performance after $\ell$ label queries.

2. How many labels do we expect our algorithm to have queried after seeing $t$ data points? This is called label complexity.

In this section, we provide a detailed analysis of TIWAL CAL that answers the first question formally (concluding with Theorem 1) and offer some informal thoughts about the second question. We begin by providing two useful bounds that codify our intuition from Section 8.3.3 and guide the application of our algorithm.

8.4.1 Deviation bound

We begin by proving an upper bound on the deviation of the combined weighted empirical risk $\epsilon_\alpha(h, m, t)$ from the true combined risk $\epsilon_\alpha(h)$. This directly motivates Steps 6-10 of Algorithm 1, in which we compute $Gbound_t$ and $p_t$ and decide whether to query $x_t$'s label.

Lemma 1. With probability at least $1 - \delta$, the following holds for all $t \geq 1$ and all $h \in \mathcal{H}$:

$$\left| \left( \epsilon_\alpha(h, m, t) - \epsilon_\alpha(h^*, m, t) \right) - \left( \epsilon_\alpha(h) - \epsilon_\alpha(h^*) \right) \right|
\leq \alpha \left( \sqrt{\frac{\varepsilon_t}{p_{\min,t}(h)}} + \frac{\varepsilon_t}{p_{\min,t}(h)} \right) + (1 - \alpha) \sqrt{\varepsilon_S}$$

where $\varepsilon_t = O(\log(t |\mathcal{H}| / \delta)/t)$, $\varepsilon_S = O(\log(2 |\mathcal{H}| / \delta)/(2m))$, and $p_{\min,t}(h)$ is the minimum query probability assigned to a target point about whose label $h$ and $h^*$ disagree.

The proof of this lemma involves a decomposition of the combined empirical risk, followed by an application of the triangle inequality and Hoeffding's inequality.

Proof.
We start by decomposing and reordering the error terms into their constituent $T$ and $S$ terms, which in turn enables us to apply the triangle inequality:

$$\begin{aligned}
\mathrm{dev}(\epsilon_\alpha(h, h^*))
&= \left| \left( \epsilon_\alpha(h, m, t) - \epsilon_\alpha(h^*, m, t) \right) - \left( \epsilon_\alpha(h) - \epsilon_\alpha(h^*) \right) \right| \\
&= \big| \left( \alpha \epsilon_T(h, t) + (1-\alpha) \hat{\epsilon}_S(h, m) - \alpha \epsilon_T(h^*, t) - (1-\alpha) \hat{\epsilon}_S(h^*, m) \right) \\
&\qquad - \left( \alpha \epsilon_T(h) + (1-\alpha) \epsilon_S(h) - \alpha \epsilon_T(h^*) - (1-\alpha) \epsilon_S(h^*) \right) \big| \\
&= \big| \alpha \left( (\epsilon_T(h, t) - \epsilon_T(h^*, t)) - (\epsilon_T(h) - \epsilon_T(h^*)) \right)
 + (1-\alpha) \left( (\hat{\epsilon}_S(h, m) - \hat{\epsilon}_S(h^*, m)) - (\epsilon_S(h) - \epsilon_S(h^*)) \right) \big| \\
&\leq \alpha \left| (\epsilon_T(h, t) - \epsilon_T(h^*, t)) - (\epsilon_T(h) - \epsilon_T(h^*)) \right|
 + (1-\alpha) \left| (\hat{\epsilon}_S(h, m) - \hat{\epsilon}_S(h^*, m)) - (\epsilon_S(h) - \epsilon_S(h^*)) \right| \\
&= \alpha \, \mathrm{dev}(\epsilon_T(h, h^*)) + (1-\alpha) \, \mathrm{dev}(\hat{\epsilon}_S(h, h^*))
\end{aligned}$$

The overall deviation is bounded from above by a convex combination of the deviation of the importance weighted empirical target error and that of the empirical source error. Lemma 1 from [29] tells us that $\mathrm{dev}(\epsilon_T(h, h^*)) \leq \sqrt{\varepsilon_t / p_{\min,t}(h)} + \varepsilon_t / p_{\min,t}(h)$. A straightforward application of the union bound and Hoeffding's inequality gives us the bound on $\mathrm{dev}(\hat{\epsilon}_S(h, h^*))$:

$$\begin{aligned}
\delta &= P\{\exists h \in \mathcal{H} : \mathrm{dev}(\hat{\epsilon}_S(h, h^*)) \geq \varepsilon\}
 = P\Big\{ \bigcup_{h \in \mathcal{H}} \left( \mathrm{dev}(\hat{\epsilon}_S(h, h^*)) \geq \varepsilon \right) \Big\} \\
&\leq \sum_{h \in \mathcal{H}} P\{\mathrm{dev}(\hat{\epsilon}_S(h, h^*)) \geq \varepsilon\}
 = |\mathcal{H}| \, P\{\mathrm{dev}(\hat{\epsilon}_S(h, h^*)) \geq \varepsilon\}
 \leq 2 |\mathcal{H}| \exp\{-2 m \varepsilon^2\}
\end{aligned}$$

Solving $\delta \leq 2 |\mathcal{H}| \exp\{-2 m \varepsilon^2\}$ for $\varepsilon$ gives $\varepsilon = \sqrt{\log(2 |\mathcal{H}| / \delta)/(2m)}$. We substitute into the convex combination of deviations above to arrive at the bound in the lemma. Note: the constant $C_0 = O(\log(|\mathcal{H}|/\delta))$ is used above and below in place of the $\varepsilon$ terms.

8.4.2 Generalization bound

Next we give an upper bound on the true combined error $\epsilon_\alpha(\bar{h}_t)$ of $\bar{h}_t$, the hypothesis we can learn after seeing $t$ target points:

Lemma 2. This lemma takes a form similar to Theorem 2 from [29]. With probability at least $1 - \delta$, the following holds for all $t \geq 1$:

$$0 \leq \epsilon_\alpha(\bar{h}_t) - \epsilon_\alpha(h^*)
\leq \epsilon_\alpha(\bar{h}_t, m, t) - \epsilon_\alpha(h^*, m, t)
+ \alpha \left( \sqrt{\frac{2 C_0 \log(t+1)}{t}} + \frac{2 C_0 \log(t+1)}{t} \right)
+ (1 - \alpha) \sqrt{\frac{C_0 \log 2}{2m}}$$

which implies

$$\epsilon_\alpha(\bar{h}_t) \leq \epsilon_\alpha(h^*)
+ \alpha \left( \sqrt{\frac{2 C_0 \log(t+1)}{t}} + \frac{2 C_0 \log(t+1)}{t} \right)
+ (1 - \alpha) \sqrt{\frac{C_0 \log 2}{2m}}$$

where $h^* = \arg\min_{h \in \mathcal{H}} \epsilon_\alpha(h)$ and $C_0 = O(\log(|\mathcal{H}|/\delta))$.

Proof. We provide only a sketch of the proof of this lemma. It proceeds via induction on $t$, very similarly to the proof of Theorem 2 in the appendix of [29]. The main difference is the presence of several $(1-\alpha)\sqrt{\varepsilon_S}$ terms that eventually cancel out, at which point the proofs become identical apart from an $\alpha$ scaling term. First, let $p_{\min} = \min(\{p_i : m < i \leq m+t \text{ and } \bar{h}_t(x_i) \neq h^*(x_i)\} \cup \{1\})$, i.e., the minimum query probability over all seen target instances $x_i$ on whose label $\bar{h}_t$ disagrees with $h^*$. Next, let $t_{\min} = \max(\{i : m < i \leq m+t \text{ and } p_i = p_{\min} \text{ and } \bar{h}_t(x_i) \neq h^*(x_i)\})$, i.e., the maximum index $i$ from the set of disagreement points with query probability $p_{\min}$. We then proceed via an inductive argument, eventually showing that $\epsilon_\alpha(h'_{t_{\min}-1}, m, t_{\min}-1) - \epsilon_\alpha(h^*, m, t_{\min}-1) > 0$; to do so, we consider the difference between the respective importance weighted empirical errors $\epsilon_\alpha(\cdot, m, t_{\min}-1)$ of $h^*$ and $h'_{t_{\min}-1}$. Our strong inductive hypothesis is that

$$0 \leq \epsilon_\alpha(\bar{h}_k) - \epsilon_\alpha(h^*) \leq \epsilon_\alpha(\bar{h}_k, m, k) - \epsilon_\alpha(h^*, m, k) + \alpha\left( \sqrt{2\varepsilon_k} + 2\varepsilon_k \right) + (1-\alpha)\sqrt{\varepsilon_S}$$

holds for all $1 \leq k < t$, where the $\varepsilon$ terms are defined as before and the factor of 2 is chosen based on the $p_{\min,t}(h)$ terms from our deviation bound. We want to show that it also holds for $k = t$. The inductive hypothesis holds trivially when $p_{\min} \geq 1/2$, so we assume that $p_{\min} < 1/2$, with $p_{\min}$ and $t_{\min}$ as defined above.
We can then show that:

$$\begin{aligned}
&\epsilon_\alpha(h'_{t_{\min}-1}, m, t_{\min}-1) - \epsilon_\alpha(h^*, m, t_{\min}-1) \\
&\quad = \epsilon_\alpha(h'_{t_{\min}-1}, m, t_{\min}-1) - \epsilon_\alpha(\bar{h}_{t_{\min}-1}, m, t_{\min}-1)
 + \epsilon_\alpha(\bar{h}_{t_{\min}-1}, m, t_{\min}-1) - \epsilon_\alpha(h^*, m, t_{\min}-1) \\
&\quad = \alpha \left[ \left( \tfrac{c_1}{\sqrt{p_{\min}}} - c_1 + 1 \right) \sqrt{\varepsilon_{t_{\min}-1}}
 + \left( \tfrac{c_2}{p_{\min}} - c_2 + 1 \right) \varepsilon_{t_{\min}-1} \right] + (1-\alpha)\sqrt{\varepsilon_S}
 + \epsilon_\alpha(\bar{h}_{t_{\min}-1}, m, t_{\min}-1) - \epsilon_\alpha(h^*, m, t_{\min}-1) \\
&\quad \geq \alpha \left[ \left( \tfrac{c_1}{\sqrt{p_{\min}}} - c_1 + 1 \right) \sqrt{\varepsilon_{t_{\min}-1}}
 + \left( \tfrac{c_2}{p_{\min}} - c_2 + 1 \right) \varepsilon_{t_{\min}-1} \right] + (1-\alpha)\sqrt{\varepsilon_S}
 + \epsilon_\alpha(\bar{h}_{t_{\min}-1}) - \epsilon_\alpha(h^*)
 - \alpha\left( \sqrt{2\varepsilon_{t_{\min}-1}} + 2\varepsilon_{t_{\min}-1} \right) - (1-\alpha)\sqrt{\varepsilon_S} \\
&\quad = \alpha \left[ \left( \tfrac{c_1}{\sqrt{p_{\min}}} - c_1 + 1 \right) \sqrt{\varepsilon_{t_{\min}-1}}
 + \left( \tfrac{c_2}{p_{\min}} - c_2 + 1 \right) \varepsilon_{t_{\min}-1} \right]
 + \epsilon_\alpha(\bar{h}_{t_{\min}-1}) - \epsilon_\alpha(h^*)
 - \alpha\left( \sqrt{2\varepsilon_{t_{\min}-1}} + 2\varepsilon_{t_{\min}-1} \right)
\end{aligned}$$

where the second step is by the definition of $p_{t_{\min}-1}$ in our algorithm, the third step is by our inductive hypothesis, and the remaining steps use the fact that $\epsilon_\alpha(\bar{h}_{t_{\min}-1}) - \epsilon_\alpha(h^*) > 0$ together with our assumption that $p_{\min} < 1/2$, which enables us to establish a lower bound via substitution. At this point our proof matches precisely the proof of Theorem 2 in the appendix of [29] (apart from the scalar factor of $\alpha$), which concludes with a contradiction. Thus, our assumption that $p_{\min} < 1/2$ must be false, and so our inductive hypothesis must hold for $k = t$.

The fact that $\epsilon_\alpha(h'_{t_{\min}-1}, m, t_{\min}-1) - \epsilon_\alpha(h^*, m, t_{\min}-1) > 0$ means that $h'_{t_{\min}-1}$ disagrees with $h^*$ while $\bar{h}_{t_{\min}-1}$ must agree with $h^*$ on $x_{t_{\min}}$. In turn, this implies that $\bar{h}_t$ agrees with $h'_{t_{\min}-1}$ on $x_{t_{\min}}$ and that $\epsilon_\alpha(\bar{h}_t, m, t_{\min}-1) \geq \epsilon_\alpha(h'_{t_{\min}-1}, m, t_{\min}-1)$, because $\epsilon_\alpha$ satisfies the conditions for an importance weighted empirical risk.

From Lemma 2 we know that

$$0 \leq \epsilon_\alpha(\bar{h}_t) - \epsilon_\alpha(h^*) \leq \epsilon_\alpha(\bar{h}_t, m, t) - \epsilon_\alpha(h^*, m, t)
+ \sqrt{\frac{2 C_0 \log(m+t+1)}{m+t}} + \frac{2 C_0 \log(m+t+1)}{m+t}$$

which implies that

$$\epsilon_\alpha(\bar{h}_t) \leq \epsilon_\alpha(h^*) + \epsilon_\alpha(\bar{h}_t, m, t) - \epsilon_\alpha(h^*, m, t)
+ \sqrt{\frac{2 C_0 \log(m+t+1)}{m+t}} + \frac{2 C_0 \log(m+t+1)}{m+t}
\leq \epsilon_\alpha(h^*) + \sqrt{\frac{2 C_0 \log(m+t+1)}{m+t}} + \frac{2 C_0 \log(m+t+1)}{m+t}$$

because $\epsilon_\alpha(\bar{h}_t, m, t) - \epsilon_\alpha(h^*, m, t) \leq 0$, as $\bar{h}_t = \arg\min_{h \in \mathcal{H}} \epsilon_\alpha(h, m, t)$.

Now we want to connect the combined risk to the target risk, which is inextricably related to the feasibility of our transfer learning task (i.e., the "adaptability" of our source and target tasks, to borrow from [32]). To formally "measure" this, we consider the combined source and target risk of a hypothesis, $\epsilon_{ST}(h) = \epsilon_S(h) + \epsilon_T(h)$, and the hypothesis $h^* \in \mathcal{H}$ that minimizes it. When $\epsilon_{ST} = \epsilon_{ST}(h^*)$ is low, then both $\epsilon_S(h^*)$ and $\epsilon_T(h^*)$ must be low; this in turn implies that a classifier trained on a mixture of source and target data should perform well on new target data, i.e., the domains and tasks are "similar enough" for transfer learning to work well. $\epsilon_{ST}$ plays a significant role in the analysis of [32] and here as well.

To compare the source and target distributions, we will use the hypothesis class-dependent $d_{\mathcal{H}}$ distance: $d_{\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) = 2 \sup_{h \in \mathcal{H}} \left( P_{\mathcal{D}_S}\{A_h\} - P_{\mathcal{D}_T}\{A_h\} \right)$, where $A_h = \{x \in \mathcal{X} : h(x) = 1\}$. This is the maximum difference between the probability masses assigned to a set of points classified as "positive" ($A_h$) over hypotheses $h \in \mathcal{H}$. The $d_{\mathcal{H}}$ distance has two very useful properties. First, it can be computed from finite unlabeled samples $\mathcal{U}_S \sim \mathcal{D}_S$ and $\mathcal{U}_T \sim \mathcal{D}_T$ for $\mathcal{H}$ with finite VC dimension [153]. Second, even more practically, it can be approximated using a domain separator hypothesis $h_{sep} \in \mathcal{H}$, i.e., a classifier from $\mathcal{H}$ trained to separate $\mathcal{U}_S$ from $\mathcal{U}_T$ [22]. We can in turn define a distance $d_{\mathcal{H}\Delta\mathcal{H}}$ that yields the following bound on the difference between source and target disagreement for any two hypotheses $h, h' \in \mathcal{H}$:

$$\left| \epsilon_S(h, h') - \epsilon_T(h, h') \right| \leq \tfrac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T)$$

where $\mathcal{H}\Delta\mathcal{H} = \{h(x) \oplus h'(x) : h, h' \in \mathcal{H}\}$ is the symmetric difference hypothesis space, i.e., the space of classifiers that positively label points on which a pair of hypotheses $h, h' \in \mathcal{H}$ disagree.
With these tools in hand, we can bound the target error $\epsilon_T(\bar{h}_t)$ of the classifier $\bar{h}_t$ that can be learned after $t$ iterations of the modified TIWAL CAL algorithm in terms of the optimal target error.

Theorem 1. For $\bar{h}_t = \arg\min_{h \in \mathcal{H}} \epsilon_\alpha(h, m, t)$ (i.e., the hypothesis learned by TIWAL CAL after seeing $t$ target instances), the following holds with probability at least $1 - \delta$:

$$\epsilon_T(\bar{h}_t) \leq \epsilon_T(h^*_T)
+ \alpha\left( \sqrt{\frac{2 C_0 \log(t+1)}{t}} + \frac{2 C_0 \log(t+1)}{t} \right)
+ 2(1-\alpha)\left( \sqrt{\frac{C_0 \log 2}{2m}} + \tfrac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) + \epsilon_{ST} \right)$$

noting that the $d_{\mathcal{H}\Delta\mathcal{H}}$ term can further be upper bounded by a computationally tractable estimate,

$$d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) \leq \hat{d}_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{U}_S, \mathcal{U}_T)
+ 4 \sqrt{\frac{2 d \log(2 \min(m, t)) + \log(4/\delta)}{\min(m, t)}}$$

where $d$ is the VC dimension of $\mathcal{H}$.

Proof. This proof is quite similar to the proof of Theorem 2 from the appendix of [32] and uses several lemmas and a theorem from that paper. Steps one, four, and five use Lemma 2 (which bounds the difference between the empirical and true $\alpha$-combined risk), Lemma 1 (which bounds the difference between the $\alpha$-combined and target risk), and Theorem 1 (which bounds the target risk in terms of the source risk and the empirical $\hat{d}_{\mathcal{H}\Delta\mathcal{H}}$ distance), respectively, from [32]. Step two uses our Lemma 2, while step three uses the fact that $\epsilon_\alpha(h^*) \leq \epsilon_\alpha(h^*_T)$.

$$\begin{aligned}
\epsilon_T(\bar{h}_t)
&\leq \epsilon_\alpha(\bar{h}_t) + (1-\alpha)\left( \tfrac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) + \epsilon_{ST} \right) \\
&\leq \epsilon_\alpha(h^*) + \alpha\left( \sqrt{\tfrac{2 C_0 \log(t+1)}{t}} + \tfrac{2 C_0 \log(t+1)}{t} \right)
 + (1-\alpha)\left( \sqrt{\tfrac{C_0 \log 2}{2m}} + \tfrac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) + \epsilon_{ST} \right) \\
&\leq \epsilon_\alpha(h^*_T) + \alpha\left( \sqrt{\tfrac{2 C_0 \log(t+1)}{t}} + \tfrac{2 C_0 \log(t+1)}{t} \right)
 + (1-\alpha)\left( \sqrt{\tfrac{C_0 \log 2}{2m}} + \tfrac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) + \epsilon_{ST} \right) \\
&\leq \epsilon_T(h^*_T) + \alpha\left( \sqrt{\tfrac{2 C_0 \log(t+1)}{t}} + \tfrac{2 C_0 \log(t+1)}{t} \right)
 + (1-\alpha)\left( \sqrt{\tfrac{C_0 \log 2}{2m}} + d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) + 2\epsilon_{ST} \right)
\end{aligned}$$

Implications: Because the above proof uses a number of theorems and lemmas that are themselves fairly "loose," we expect our bound to be quite loose in turn (i.e., the actual generalization error may be significantly smaller, at least in some cases). This is fairly typical of such generalization bounds and does not invalidate the bound or reduce its usefulness. Indeed, its greatest utility is to guide strategy when applying this algorithm to a new active transfer learning scenario: the behavior of the algorithm is governed by the choice of $\alpha$, which trades off two sources of error, one due to the active learning query process and one due to transfer learning. The active learning error terms (of order $C_0 \log t / t$) decrease as $t$ grows large but may be significant early in the algorithm (for small values of $t$). The transfer learning error term does not depend on $t$ and so can be viewed as constant throughout the query process. It depends upon the number of source points $m$ and upon the dissimilarity of our source and target domains, which in turn depends primarily upon the $d_{\mathcal{H}\Delta\mathcal{H}}$ distance, since we assume $\epsilon_{ST}$ is small enough to be ignored. The $C_0 \log 2 / (2m)$ term is negligible for any reasonably sized source data set. Thus, the optimal $\alpha$ depends on domain similarity, the number of source instances, and the number of target instances we expect to see. If the domains and tasks are too different and the transfer error is high, then an injudicious choice of $\alpha$ can introduce a large and constant source of error that will not decrease even as $t$ grows large. This will harm overall accuracy while increasing the total number of label queries. On the other hand, if the transfer error is reasonably small, then a careful choice of $\alpha$ should significantly improve the early performance of active learning and reduce the total number of queries without introducing too much additional error. [32] gives a detailed analysis of the relationship between these factors and a discussion of how to choose $\alpha$.
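As a concrete illustration of this trade-off, the bound's two $\alpha$-dependent terms can simply be evaluated over a grid of candidate values. The sketch below is our own helper, treating $C_0$ as a known constant and ignoring the shared $\epsilon_T(h^*_T)$ term; it is only an illustration of how the theorem can guide the choice of $\alpha$, not a procedure used in our experiments.

```python
import numpy as np

def bound_terms(alpha, t, m, d_hdh, eps_st=0.0, C0=1.0):
    """Evaluate the alpha-dependent terms of the Theorem 1 bound: the active
    learning term, which shrinks as t grows, plus the constant transfer term."""
    active = alpha * (np.sqrt(2 * C0 * np.log(t + 1) / t) + 2 * C0 * np.log(t + 1) / t)
    transfer = 2 * (1 - alpha) * (np.sqrt(C0 * np.log(2) / (2 * m)) + 0.5 * d_hdh + eps_st)
    return active + transfer

# Pick the alpha minimizing the bound for an expected number of target points t,
# m labeled source examples, and an estimated d_hat_{H delta H} distance.
alphas = np.linspace(0.0, 1.0, 11)
best_alpha = min(alphas, key=lambda a: bound_terms(a, t=500, m=1000, d_hdh=0.3))
print(best_alpha)
```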
The general rule of thumb is that lower values of $\alpha$ work when the domains are similar and we have a lot of source data; if not, we should use a higher value of $\alpha$ or even consider plain active learning. In practice, we should begin by assessing the similarity between our domains, using, e.g., a domain separator hypothesis to approximate the $d_{\mathcal{H}\Delta\mathcal{H}}$ distance from unlabeled points. Unfortunately, we cannot directly measure the difference between the labeling functions (quantified by $\epsilon_{ST}$), though if we happen to have a few labeled target points available, we could estimate it.

8.4.3 Label complexity

We do not give a rigorous analysis of label complexity here. The label complexity of IWAL CAL is known to be roughly $\tilde{O}(\theta \sqrt{C_0\, t \log t} + C_0 \log^3 t)$, where $\theta$ is the disagreement coefficient [119]. This depends on the probability of querying a label at each step, i.e., $\mathbb{E}[Q_t]$, which in turn depends on the deviation bound. Because our deviation bound is so similar to that of IWAL CAL, we expect our label complexity to be quite similar as well. This suggests that transfer learning may not necessarily reduce the overall label complexity of active learning. In other words, for a fixed number of iterations $t$ (i.e., the learner "sees" $t$ target instances), transfer learning IWAL CAL may still query the same number of labels. However, if by using transfer learning we construct high performing classifiers while seeing fewer target instances, then we may still reduce the total number of queries made by an active learner. Speculation aside, we found in our experiments that transfer active learning does in practice make far fewer queries than plain IWAL CAL.

8.5 Experiments

We compare IWAL CAL and TIWAL CAL using two publicly available transfer learning data sets. For each, we choose a target domain and divide it into a test set and two training sets. The first is treated as unlabeled to start and is used for active learning. The second is treated as a labeled source domain. We also choose two additional labeled source domains. We then compare the test set error and query rates of TIWAL CAL against IWAL CAL.

Figure 8.1: 20 Newsgroups (BvP) results comparing IWAL CAL, TIWAL CAL with the AvC, HvR, and BvP2 source domains, and a fully supervised baseline. The lefthand plots show test set error versus the number of points seen by the active learner. The upper right shows test set error versus the number of label queries. The bottom right shows the query rate.
The details of our data sets are shown in the table in Figure 8.3, including approximate $\hat{d}_{\mathcal{H}\Delta\mathcal{H}}$ distances computed using a linear domain separator with hinge loss. Our base learner is a linear model with hinge loss and L2 regularization. For the free parameters in TIWAL CAL, we follow [29] by setting $c_1 = c_2 = 1$ and dropping the $\log(t)$ terms when computing $Gbound_t$ and $p_t$. We use a heuristic to learn the constrained hypothesis $h'_t$: we set the instance weight for $x_t$ equal to the sum of the weights of the rest of the training data. The above changes yield an approximation to the abstract algorithm that works in practice. We use $C_0 = 0.25$ for IWAL CAL and $25 \leq C_0 \leq 100$ for TIWAL CAL.

Figure 8.2: Sentiment results comparing IWAL CAL, TIWAL CAL with the dvd, electronics, and kitchen2 source domains, and a fully supervised baseline. The lefthand plots show test set error versus the number of points seen by the active learner. The upper right shows test set error versus the number of label queries. The bottom right shows the query rate.

8.5.1 Data

20 Newsgroups: Our first data set is the 20 Newsgroups data set. We create a handful of "category versus category" classification tasks. Our target domain is a subset of rec.sport.baseball vs. talk.politics.misc (BvP). Our source domains include a second subset of rec.sport.baseball vs. talk.politics.misc (BvP2), rec.sport.hockey vs. talk.religion.misc (HvR), and rec.autos vs. soc.religion.christian (AvC). The original 20 Newsgroups data has 61,188 word counts, which we convert to log term frequency. We then reduce the number of features by keeping only the 250 words with the top term-frequency inverse document-frequency (TF-IDF) scores across all categories. This is an efficient way to choose a small number of interesting features without using labels [325]. It also changed the $\hat{d}_{\mathcal{H}\Delta\mathcal{H}}$ distance between domains, making for interesting experiments.

Sentiment: Our second data set is the sentiment classification data set [31]. We use the preprocessed binary ("positive" vs. "negative" review) version, with a subset of kitchen as our target domain and a second subset of kitchen (kitchen2), dvd, and electronics as our source domains. The preprocessed version of the data includes 1,110,352 unigram and bigram count features. As with 20 Newsgroups, we convert these to log counts and then keep only the 1000 features with the top TF-IDF scores across all domains.
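The $\hat{d}_{\mathcal{H}\Delta\mathcal{H}}$ values reported in Figure 8.3 can be approximated by such a domain separator. The sketch below, assuming scikit-learn and the common proxy $2(1 - 2\,\mathrm{err})$ based on the separator's held-out error, illustrates the idea; the function name and cross-validation setup are ours and not necessarily the exact procedure used for Figure 8.3.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

def proxy_d_hdh(U_src, U_tgt):
    """Approximate the d_{H delta H} distance between two unlabeled samples by
    training a linear domain separator (hinge loss) to distinguish source from
    target points and converting its cross-validated error into a distance."""
    X = np.vstack([U_src, U_tgt])
    d = np.concatenate([np.zeros(len(U_src)), np.ones(len(U_tgt))])  # domain labels
    sep = SGDClassifier(loss="hinge")
    acc = cross_val_score(sep, X, d, cv=5, scoring="accuracy").mean()
    err = 1.0 - acc
    return max(0.0, 2.0 * (1.0 - 2.0 * err))  # chance-level separability maps to ~0
```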
Source       m      Target    d̂_HΔH     α
BvP2         360    BvP       0.0981    0.3
HvR          974    BvP       0.3072    0.6
AvC          1191   BvP       0.4368    0.9
kitchen2     1001   kitchen   0.1521    0.3
electronics  5760   kitchen   0.2573    0.3
dvd          4189   kitchen   0.6659    0.9

Figure 8.3: Summary of experiments, including the number of labeled source examples $m$, the approximate $\hat{d}_{\mathcal{H}\Delta\mathcal{H}}$ distances, and the values of $\alpha$ used.

8.5.2 Results

Typical results are shown in Figures 8.1 and 8.2. The first thing to observe is that basic IWAL CAL falls prey to the cold start phenomenon. For sentiment, IWAL CAL requires nearly 400 queries to reach an error of 0.20 or less and thousands of queries before it matches the performance of fully supervised learning. On the easier 20 Newsgroups data set, IWAL CAL still needs 200 queries to achieve supervised-level performance (error of 0.10).

The results for TIWAL CAL are consistent with our analysis: the transfer learning bias drastically improves test set error early in the query process (with early error rates near optimal) and reduces the overall number of queries by as much as 50%. Further, TIWAL CAL often converges to nearly the same error rate as IWAL CAL, suggesting little or no negative bias. The exceptions to this pattern are the AvC and dvd source domains. Both yield less early improvement in test set error, and dvd actually increases the overall number of queries. These results are explained by our theory: each has a relatively high $\hat{d}_{\mathcal{H}\Delta\mathcal{H}}$ distance to its respective target domain. Nonetheless, even for these sources, there is still an early 30-40% reduction in error, while the "penalty" on future test set error is relatively small (TIWAL CAL converges to a 5-10% higher error).

8.6 Discussion

Researchers are increasingly interested in how to introduce useful biases into active learning without compromising consistency guarantees (see "The End of the Beginning of Active Learning" by Daniel Hsu and John Langford at http://hunch.net/?p=1800). This will allow active learners to produce good classifiers faster, mitigating the cold start problem. In this chapter, we presented a principled framework that addresses cold starts by using transfer learning to leverage data from related tasks. Our framework is straightforward to analyze and apply. We proved a generalization bound that provides intuition into the problem and helps trade off different sources of error. We demonstrated empirical results suggesting that this approach significantly improves classifier performance early in the query process and reduces the overall number of target label queries. In other words, we can accelerate active learning with transfer learning. Our work establishes a sound foundation that will facilitate future research on this topic and empower practitioners to apply these ideas to real world problems.

Our empirical results are modest; their primary virtue is consistency with our theoretical analysis. Clearly further experimentation and evaluation are warranted. In particular, it sometimes appears as though transfer learning is doing most of the work and that active learning plays little role beyond managing the label query budget. This is due to the conservative nature of IWAL CAL; the results in [29] are similarly modest. The natural question is whether this framework can be used with more aggressive approaches to active learning, such as uncertainty region sampling, while avoiding the usual sampling bias problems. We are pursuing this line of work.

One unsatisfying property of our framework is the persistent nature of the transfer-based bias.
This introduces a constant source of error into our generalization bound and may prevent us from learning an optimal classifier, even with a large number of labeled target examples. Intuitively, with enough target data, we should de-emphasize the source data when training classifiers. One simple strategy that we are investigating is to gradually increase $\alpha$ (the weight on the target risk) as we query more target labels. Another promising approach would be to combine active learning with an adaptive transfer learning framework that re-weights or transforms the source data to reduce the difference between domains [18].

Chapter 9

Weakly Supervised Learning to Diagnose

As with many prediction tasks, diagnoses are defined by the decisions of human experts. Directly encoding the knowledge underlying these decisions in an expert system is time consuming and expensive and does not scale to a large number of prediction problems. Moreover, we cannot directly model these decisions using supervised machine learning because diagnoses are rarely recorded in medical records in a structured form that is easy to extract. In this chapter, we investigate an alternative approach that requires neither complete expert knowledge nor labeled data. Instead, it requires only informal hints, for example, that a prescription for metformin makes a patient more likely to have type 2 diabetes. We propose a simple way to use such hints to provide weak supervision to a latent factor model based on Total Correlation Explanation (CorEx). Called Anchored CorEx, this approach combines CorEx with the information bottleneck. We apply Anchored CorEx to a corpus of clinical discharge summaries and show that it discovers coherent topics associated with known diseases present in the patient population.

9.1 Introduction

A clinician can look at a patient's electronic health record (EHR) and not only decide whether the patient has diabetes but also produce a succinct summary of the clinical evidence. Replicating this feat with computational tools has been the focus of much research in clinical informatics. There are major initiatives underway to codify clinical knowledge into formal representations, most often as deterministic rules that can be applied in a semi-automated fashion [203]. However, representing the intuitive judgments of human experts can be challenging, particularly when the formal system does not match the expert's knowledge. For example, many deterministic disease classifiers used in clinical informatics rely heavily upon administrative codes not available at the time of diagnosis. Further, developing and testing such systems is time- and labor-intensive.

We propose instead a lightweight information theoretic framework for codifying informal human knowledge and use it to extract interpretable latent topics from text corpora. For example, to discover patients with diabetes in a set of clinical notes, a doctor can begin by specifying disease-specific anchor terms [14, 114], such as "diabetes" or "insulin." Our framework then uses these to help discover both latent topics associated with diabetes and records in which diabetes-related topics occur. The user can then add (or remove) anchor terms (e.g., "metformin") to improve the quality of the learned (diabetes) topics.

In this chapter, we introduce a simple approach to anchored information theoretic topic modeling using a novel combination of Correlation Explanation (CorEx) [300]
This exible framework enables the user to lever- age domain knowledge to guide exploration of a collection of documents and to impose semantics onto latent factors learned by CorEx. We present preliminary experimental results on two text corpora (including a corpus of clinical notes), showing that anchors can be used to discover topics that are more specic and relevant. What is more, we demonstrate the potential for this framework to perform weakly supervised learning in settings where labeling documents is prohibitively expensive [54, 2]. With respect to interpretable machine learning, our contributions are twofold. First, our framework provides a way for human users to share domain knowledge with a statis- tical learning algorithm that is both convenient for the human user and easily digestible by the machine. Second, our experimental results conrm that the introduction of simple anchor words can improve the coherence and human interpretability of topics discovered from data. Both are essential to successful and interactive collaboration between machine learning and human users. 9.2 Methods Anchored Correlation Explanation can be understood as a combination of Total Corre- lation Explanation (CorEx) [300, 301] and the multivariate information bottleneck [291, 278]. We search for a set of probabilistic functions of the inputs p(y j jx) for j = 1;:::;m that optimize the following information theoretic objective: max p(y j jx);j=1;:::;m TC(X;Y ) + X i;j2R I(X i ;Y j ) 200 The rst term is the CorEx objective TC(X;Y ) TC(X)TC(XjY ), which aims to construct latent variables Y that best explain multivariate dependencies in the data X. Here the data consist of n-dimensional binary vectors [X 1 ;:::;X n ]. Total cor- relation, or multivariate mutual information [312], is specied as TC(X 1 ;:::;X n ) = D KL (p(x 1 ;:::;x n )jj Q i p(x i )) where D KL is the KL divergence. Maximizing TC(X;Y ) over latent factorsfY j g m j=1 amounts to minimizingTC(XjY ), which measures how much dependence in X is explained by Y . At the global optimum, TC(XjY ) is zero and the observations are independent conditioned on the latent factors. Several papers have explored CorEx for unsupervised hierarchical topic modeling [300, 53, 132]. The second term involves the mutual information between pairs of latent factors Y j ) and anchor variables X i specied in the setR =f(i;j)g. This is inspired by the infor- mation bottleneck [291, 278], a supervised information-theoretic approach to discovering latent factors. The bottleneck objective max p(yjx) I(X;Y ) +I(Z;Y ) constructs latent factorsY that trade o compression ofX against preserving information about relevance variables Z. Anchored CorEx preserves information about anchors while also explaining as much multivariate dependence between observations in X as possible. This framework is exi- ble: we can attach multiple anchors to one factor or one anchor to multiple factors. We have found empirically that = 1 works well and does not need to be tuned. Anchors allow us to both seed CorEx and impose semantics on latent factors: when analyzing medical documents, for example, we can anchor a diabetes latent factor to the word \diabetes." TheTC objective then discovers other words associated with \diabetes" and includes them in this topic. 201 While there is not space here for a full description of the optimization, it is similar in principle to the approaches in [300, 301]. 
Two points are worth noting: first, the TC objective is replaced by a lower bound to make optimization feasible [301]. Second, we impose a sparse connection constraint (each word appears in only one topic) to speed up computation. Open source code implementing CorEx is available on GitHub [299].

Figure 9.1: A hierarchical topic model learned by Anchored CorEx on 20 Newsgroups. Anchored latent factors are labeled in red, with anchor words marked with a "*" (including topics anchored on *israel/*politics, *jesus/*christ, and *cryptography).

9.3 Related Work

There is a large body of work on integrating domain knowledge into topic models and other unsupervised latent variable models, often in the form of constraints [310], prior distributions [12], and token labels [11]. Like Anchored CorEx, seeded latent Dirichlet allocation (SeededLDA) allows the specification of word-topic relationships [139]. However, SeededLDA assumes a more complex latent structure, in which each topic is a mixture of two distributions, one unseeded and one seeded.

[14] first proposed anchors in the context of topic modeling: words that are high precision indicators of underlying topics. In contrast to our approach, such anchors are typically selected automatically, constrained to appear in only one topic, and used primarily to aid optimization [206]. In our information theoretic framework, anchors are specified manually and are more loosely defined as words having high mutual information with one or more latent factors. The effects of anchors on the interpretability of traditional topic models are often mixed [167], but our experiments suggest that our approach yields more coherent topics.

In health informatics, "anchor" features chosen based on domain knowledge have been used to guide statistical learning [114]. In [2], anchors are used as a source of distant supervision [65, 198] for classifiers in the absence of ground truth labels. While Anchored CorEx can be used for discriminative tasks, it is essentially unsupervised.
Recent work by [115] is perhaps most similar in spirit to ours: they exploit predefined anchors to help learn and impose semantics on a discrete latent factor model with a directed acyclic graph structure. We utilize an information theoretic approach that makes no generative modeling assumptions.

9.4 Results and Discussion

To demonstrate the utility of Anchored CorEx, we run experiments on two document collections: 20 Newsgroups and the i2b2 2008 Obesity Challenge [295] data set. Both corpora provide ground truth labels for latent classes that may be thought of as topics.

9.4.1 20 Newsgroups

The 20 Newsgroups data set is suitable for a straightforward evaluation of anchored topic models. The latent classes represent mutually exclusive categories, and each document is known to originate from a single category. We find that the correlation structure among the latent classes is less complex than in the Obesity Challenge data. Further, each category tends to exhibit some specialized vocabulary not used extensively in other categories (thus satisfying the anchor assumption from [14]).

To prepare the data, we removed headers, footers, and quotes and reduced the vocabulary to the most frequent 20,000 words. Each document was represented as a binary bag-of-words vector. In all experiments, we used the standard training/test split. All CorEx models used three layers of 40, 3, and 1 factors. Figure 9.1 shows an example hierarchical topic model extracted by Anchored CorEx.

Anchors: none (unsupervised)
  Obesity topic (AUC 0.600): not fever, not chill, not diarrhea, not dysuria, not cough, not abdominal pain, not guarding, not rebound, not palpitation, not night sweats
  OSA topic (AUC 0.686): use, drug, complication, allergy, sodium, infection, furosemide, docusate, shortness of breath, esomeprazole
Anchors: one anchor per disease
  Obesity topic (AUC 0.762): obesity, sleep apnea, morbid obese, obese, labor, acebutolol, vaginal bleeding, klonopin, valproic acid, bacteruria
  OSA topic (AUC 0.546): use, complication, drug, allergy, sodium, infection, furosemide, docusate, shortness of breath, obstructive sleep apnea
Anchors: add second OSA anchor
  Obesity topic (AUC 0.757): obesity, morbid obese, obese, labor, not non-compliant, acebutolol, vaginal bleeding, problem, not deep venous thrombosis, overweight
  OSA topic (AUC 0.826): sleep apnea, obstructive sleep apnea, oxygen, duoneb, desaturation, singulair, pulmonary hypertension, hypoxemia, pap smear, vicodin

Table 9.1: Evolution of the Obesity and Obstructive Sleep Apnea (OSA) topics as anchors are added. Colors and font weight indicate anchors, spurious terms, and intruder terms from other known topics. Multiword and negated terms are the result of the preprocessing pipeline.

9.4.2 i2b2 Obesity Challenge 2008

The Obesity Challenge 2008 data set (https://www.i2b2.org/NLP/DataSets/Main.php) includes 1237 deidentified clinical discharge summaries from the Partners HealthCare Research Patient Data Repository. All summaries have been labeled by clinical experts with obesity and 15 other conditions commonly comorbid with obesity, ranging from Coronary Artery Disease (663 positives) to Depression (247) to Hypertriglyceridemia (62).

We preprocessed each document with a standard biomedical text pipeline that extracts common medical terms and phrases (grouping neighboring words where appropriate) and detects negation ("not" is prepended to negated terms) [73, 47]. We converted each document to a binary bag-of-words representation with a vocabulary of 4114 (possibly negated) medical phrases. We used the 60/40 training/test split from the competition.
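Models like the ones analyzed below can be fit with the open-source CorEx topic modeling package. The sketch below is a minimal illustration assuming the corextopic package's interface; the toy data, variable names, and anchor strength are ours, and the exact API may differ across versions.

```python
import numpy as np
import scipy.sparse as ss
from corextopic import corextopic as ct

# Toy binary document-term matrix standing in for the 1237 x 4114 matrix of
# (possibly negated) medical phrases described above.
words = ["obesity", "morbid obese", "obstructive sleep apnea",
         "sleep apnea", "hypoxemia", "furosemide"]
X = ss.csr_matrix(np.random.binomial(1, 0.3, size=(200, len(words))))

# Anchor one factor to "obesity" and one to "obstructive sleep apnea",
# mirroring the per-disease anchoring strategy explored in the next subsection.
anchors = [["obesity"], ["obstructive sleep apnea"]]

topic_model = ct.Corex(n_hidden=4, seed=0)  # 32 first-layer factors in our experiments
topic_model.fit(X, words=words, anchors=anchors, anchor_strength=1)

# Top terms (ranked by mutual information) for each anchored factor.
for j in range(len(anchors)):
    print(topic_model.get_topics(topic=j, n_words=5))
```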
We are primarily interested in the ability of Anchored CorEx to extract latent topics that are unambiguously associated with the 16 known conditions. We train a series of CorEx models with 32 latent topics in the first layer, each using a different anchor strategy. Table 9.1 shows the Obesity and Obstructive Sleep Apnea (OSA) topics for three iterations of Anchored CorEx, with the ten most important terms (those with the highest weighted connections to the latent factor) listed for each topic. Unsupervised CorEx (first row) does not discover any topics obviously related to obesity or OSA, so we choose the topics to which the terms obesity and obstructive sleep apnea are assigned. No unambiguous Obesity or OSA topics emerge even as the number of latent factors is decreased or increased.

In the second iteration (second row), we add the common name of each of the 16 diseases as an anchor to one factor (16 total). Adding obesity as an anchor produces a clear Obesity topic, including several medications known to cause weight gain (e.g., acebutolol, klonopin). The anchored OSA topic, however, is quite poor and in fact resembles the rather generic topic to which obstructive sleep apnea is assigned by Unsupervised CorEx. It includes many spurious or non-specific terms like drug. This is likely because obesity is a major risk factor for OSA, so OSA symptoms are highly correlated with obesity and its other symptoms. Thus, the total correlation objective will attempt to group obesity- and OSA-related terms together under a single latent factor, and the sparse connection constraint mentioned in Section 9.2 prevents them from being connected to multiple factors. Indeed, sleep apnea appears in the Obesity topic, suggesting the two topics are competing to explain OSA terms. In the third iteration, we correct this by adding sleep apnea as a second anchor to the OSA topic, and the resulting topic is clearly associated with OSA, including terms related to respiratory problems and medications used to treat (or believed to increase risk for) OSA. There is no noticeable reduction in the quality of the Obesity topic.

9.4.3 Anchored CorEx for Discriminative Tasks

In a series of follow-up experiments, we investigate the suitability of Anchored CorEx for weakly supervised classification. We interpret each anchored latent factor as a classifier for an associated class label and then compute test set F1 (using a threshold of 0.5) and area under the curve (AUC) scores (Obesity Challenge only).

Anchors            F1 Anchored   F1 Unsupervised
Jesus              0.42          0.45
God                0.49          0.43
Jesus, Christian   0.55          0.45
Naive Bayes        0.75

Table 9.2: F1 scores on soc.religion.christianity.

Table 9.2 compares the classification performance of Unsupervised and Anchored CorEx on the soc.religion.christianity category from 20 Newsgroups for different choices of anchors. For both types of CorEx, the topic containing the corresponding terms is used as the classifier, but for Anchored CorEx those terms are also used as anchors when estimating the latent factor. Unsupervised CorEx does a reasonable job of discovering a coherent religion topic that already contains the terms God, Christian, and Jesus. However, using the terms Jesus and Christian as anchors yields a topic that better predicts the actual soc.religion.christianity category.

Classifier       Macro-AUC   Macro-F1
Naive Bayes      0.7120      0.4638
Anchored CorEx   0.7445      0.5328

Table 9.3: Classification performance on Obesity 2008.
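The scores in Tables 9.2 and 9.3 come from treating each anchored factor's posterior as a classification score for its associated label. The sketch below illustrates that evaluation; the function and variable names are ours, and factor_probs stands for the fitted model's per-document, per-factor probabilities.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

def evaluate_anchored_factors(factor_probs, labels, threshold=0.5):
    """Score each anchored latent factor as a classifier for its disease label.

    factor_probs: (n_docs, n_diseases) array of posterior probabilities that the
    corresponding anchored factor is "on" for each document.
    labels: (n_docs, n_diseases) binary ground truth (e.g., Obesity Challenge labels).
    Returns macro-averaged AUC and F1 at the given threshold.
    """
    aucs = [roc_auc_score(labels[:, j], factor_probs[:, j])
            for j in range(labels.shape[1])]
    f1s = [f1_score(labels[:, j], (factor_probs[:, j] >= threshold).astype(int))
           for j in range(labels.shape[1])]
    return float(np.mean(aucs)), float(np.mean(f1s))
```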
Table 9.3 shows the Macro-AUC and Macro-F1 scores (averaged across all diseases) on the Obesity Challenge data for the final Anchored CorEx model and a Naive Bayes (NB) baseline in which we train a separate classifier for each disease. Surprisingly, Anchored CorEx outperforms Naive Bayes by a large margin. Of course, Anchored CorEx is not a replacement for supervised learning: NB beats Anchored CorEx on 20 Newsgroups and does not represent a "strong" baseline for Obesity 2008 (teams scored above 0.7 in Macro-F1 during the competition). It is nonetheless remarkable that Anchored CorEx performs as well as it does given that it is fundamentally unsupervised.

9.5 Conclusion

We have introduced a simple information theoretic approach to topic modeling that can leverage domain knowledge specified informally as anchors. Our framework uses a novel combination of CorEx and the information bottleneck. Preliminary results suggest that it can extract more precise, interpretable topics through a lightweight interactive process. We next plan to perform further empirical evaluations and to extend the algorithm to handle the complex latent structures present in healthcare data.

Part IV

Conclusion

Chapter 10

A Learning to Diagnose Benchmark

While there has been steady growth in machine learning research for healthcare, the absence of widely accepted benchmarks for evaluating competing models has slowed progress in harnessing digital health data. Such benchmarks accelerate progress in machine learning by focusing the community and facilitating reproducibility and competition. Public benchmarks also lower the barrier to entry by enabling new researchers to start without having to negotiate data access or recruit expert collaborators. To address these problems, we propose a clinical prediction benchmark suite using data derived from the publicly available Medical Information Mart for Intensive Care (MIMIC-III) database. The suite includes four high priority clinical prediction problems (modeling risk of mortality, forecasting length of stay, detecting physiologic decline, and learning to diagnose) that span a range of classic machine learning problems and are suitable for research on a variety of topics. Because our code is freely available online, anyone with access to MIMIC-III can build and use our benchmarks without extensive knowledge of medicine or clinical data.

10.1 Introduction

In the United States alone, more than 30 million patients visit hospitals each year [5], 83% of which use an electronic health record (EHR) system [125]. This trove of digital clinical data presents a significant opportunity for data mining and machine learning researchers to solve pressing healthcare problems, such as early detection and triage of at-risk or high-cost patients [20]. These problems are not new (the word triage dates back to at least World War I and possibly earlier [138], and the Apgar risk score was first published in 1952 [13]), but the growing availability of clinical data and the success of machine learning [89][274] have sparked widespread interest.

While there has been steady growth in machine learning research for healthcare, several obstacles have slowed progress in harnessing digital health data. One of the largest is the absence of widely accepted benchmarks to evaluate competing models. Such benchmarks accelerate progress in machine learning by focusing the community and facilitating reproducibility and competition. For example, the winning error rate in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) plummeted an order of magnitude from 2010 (0.2819) to 2016 (0.02991) [137].
In contrast, practical progress in clinical machine learning has been difficult to measure due to variability in data sets and task definitions [38][101][183][166][141]. Public benchmarks also lower the barrier to entry by enabling new researchers to start without having to negotiate data access or recruit expert collaborators.

In this chapter, we propose a public clinical prediction benchmark suite and describe initial experimental results on four different clinical problems: in-hospital mortality, physiologic decompensation, length of stay (LOS), and phenotype classification. These tasks span a range of classic machine learning problems from multilabel time series classification to regression. Our benchmark data sets are derived from the public Medical Information Mart for Intensive Care (MIMIC-III) database [142] and include rich multivariate time series from over 40,000 adult intensive care unit (ICU) stays, suitable for research on topics as diverse as multitask learning and non-random missing data. Because our code is freely available online, anyone with access to MIMIC-III can build and use our benchmarks without extensive knowledge of medicine or clinical data.

10.2 The Case for a Clinical Benchmark

[Figure 10.1: Progress on the ImageNet classification task from 2010 until the competition was retired in 2016. The plot shows the error rates of all entries along with the annual average and the annual winner.]

Public benchmarks serve a vital role in accelerating machine learning research. For one, they lower the barrier to entry for new researchers and increase participation in related research. Benchmarks also focus the community on specific research questions, fostering competition and collaboration to advance the state of the art. Finally, they facilitate reproducibility by minimizing or even eliminating the effort required to replicate published results.

We can consider successful benchmarks, such as the ImageNet database [80], to better understand what constitutes a good machine learning benchmark. ImageNet is publicly available from the official ILSVRC website [137], making it easy for researchers and practitioners to obtain and get started. This has led to an explosion in the number of people working on image classification and other tasks supported by ImageNet. The simple structure of the data and curated labels make it straightforward to reproduce previous research using ImageNet. Indeed, any work that uses the pre-defined data splits can be directly compared against established results without the need to reproduce any experiments. The result is that since its introduction in 2009, ImageNet has driven breakthroughs and accelerated progress in deep learning and computer vision, as illustrated in Figure 10.1.

Contrast ImageNet with the MIMIC-III database [142], one of the most widely researched digital health databases. MIMIC-III and its predecessors have been used in hundreds of publications and discoveries, but MIMIC-III does not by itself constitute a machine learning benchmark. For each project, researchers must define an outcome or task, choose a patient cohort, create training and test splits, engineer features, and perform other significant data preprocessing.
As a result, even research addressing the same question, e.g., patient mortality, is often difficult or impossible to reproduce or compare [38][101][183][166][141][262]. Moreover, understanding and using MIMIC-III is challenging for researchers without prior experience with EHR data or access to medical experts.

Our benchmark suite addresses these obstacles. First, we clearly define four prediction problems that span a range of machine learning tasks from classification to regression. This enables machine learning researchers to work on familiar abstract problems that are nonetheless critical to the delivery of healthcare. We also transparently choose a large patient cohort with a fixed test set to make it easy to reproduce our data set and to facilitate comparison of competing methods on each task. We then identify and extract relevant clinical events and perform extensive data cleaning, e.g., detecting and removing outliers based on pre-defined clinically normal ranges. This provides researchers with reasonably clean clinical data, freeing them to focus on modeling. For preprocessing that is closely tied to modeling, e.g., imputation, we provide utilities to be used at the discretion of the researcher. The result is a data set of rich multivariate time series from over 40,000 adult intensive care unit (ICU) stays, suitable for research on topics as diverse as multitask learning and missing data. The code to build our benchmark data sets from MIMIC-III is available online and has already been adopted by other researchers [35][280][236][17]. We will work with the community to curate and expand the benchmarks over time.

10.3 Related Work

The PhysioNet Challenge 2012 dataset [273] is a curated, public subset of the MIMIC-III database designed to support a competition involving the development of innovative new risk of mortality scores. It is significantly smaller than our benchmark (8,000 ICU stays), and all stays have been truncated to 48 hours of data. In addition, the only outcomes available are date of death and length of stay, making these data most appropriate for research on length of stay and in-hospital mortality.

In parallel work, [234] introduced an alternative MIMIC-III benchmark suite with similar motivation to ours. Their benchmark includes wider coverage of data types and variables, including some interventions, but truncates time series to 48 hours or shorter and omits the decompensation task. What is more, their codebase appears more complicated than ours, requiring the raw MIMIC-III data to be loaded into a locally installed Postgres database and the user to perform nearly twice as many steps. In contrast, our code operates directly on the MIMIC-III flat files and requires running only half a dozen scripts.

10.4 Benchmark Data Set and Tasks

In this section we describe our benchmark data and tasks. We first define some terminology: in MIMIC-III, patients are often referred to as subjects. Each patient has one or more hospital admissions. Within one admission, a patient may have one or more ICU stays, which we also refer to as episodes. A clinical event is an individual measurement, observation, or treatment.
In the context of our final task-specific data sets, we use the word sample to refer to an individual record processed by a machine learning model. As a rule, we have one sample for each prediction. For tasks like phenotyping, a sample consists of an entire ICU stay. For tasks requiring hourly predictions, e.g., LOS, a sample includes all events that occur before a specific time, and so a single ICU stay yields multiple samples.

[Figure 10.2: Benchmark generation process. The scripts extract_subjects.py, validate_events.py, extract_episodes_from_subjects.py, and split_train_and_test.py successively reduce the raw MIMIC-III database (46,476 patients, 61,532 ICU stays, and over 360 million event rows) to a root cohort of 33,798 patients, 42,276 ICU stays, and roughly 31.9 million events, from which the create_*.py scripts build the five task-specific data sets.]

Our benchmark preparation workflow, illustrated in Figure 10.2, begins with the full MIMIC-III critical care database, which includes over 60,000 ICU stays across 40,000 critical care patients. In the first step (extract_subjects.py), we extract relevant data from the raw MIMIC-III tables and organize them by patient. We also apply exclusion criteria to admissions and ICU stays. First, we exclude any hospital admission with multiple ICU stays or transfers between different ICU units or wards. This reduces the ambiguity of outcomes associated with hospital admissions rather than ICU stays. Second, we exclude all ICU stays where the patient is younger than 18 due to the substantial differences between adult and pediatric physiology. The resulting root cohort has 33,798 unique patients with a total of 42,276 ICU stays and over 250 million clinical events.

In the next two steps, we process the clinical events. First (validate_events.py), we filter out 45 million events that cannot be reliably matched to an ICU stay in our cohort.
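As an illustration of the admission-level exclusions applied by extract_subjects.py, the following simplified pandas sketch reproduces the two rules described above directly from the MIMIC-III v1.4 CSV files. It is not the benchmark code itself: the real script organizes its output by subject and handles many more details, and the file paths are placeholders, but the column names follow the public MIMIC-III schema.

    import pandas as pd

    patients = pd.read_csv("PATIENTS.csv", parse_dates=["DOB"])
    stays = pd.read_csv("ICUSTAYS.csv", parse_dates=["INTIME", "OUTTIME"])

    # Exclude any hospital admission with a between-unit/ward transfer or more than
    # one ICU stay, so each remaining admission maps to exactly one ICU stay.
    transferred = ((stays["FIRST_CAREUNIT"] != stays["LAST_CAREUNIT"]) |
                   (stays["FIRST_WARDID"] != stays["LAST_WARDID"]))
    excluded_adm = set(stays.loc[transferred, "HADM_ID"])
    stay_counts = stays.groupby("HADM_ID")["ICUSTAY_ID"].nunique()
    excluded_adm |= set(stay_counts[stay_counts > 1].index)
    stays = stays[~stays["HADM_ID"].isin(excluded_adm)]

    # Exclude pediatric patients. The year arithmetic is deliberately crude; MIMIC-III
    # shifts DOB for patients over 89 far into the past, so they still register as adults.
    stays = stays.merge(patients[["SUBJECT_ID", "DOB"]], on="SUBJECT_ID", how="left")
    age = stays["INTIME"].dt.year - stays["DOB"].dt.year
    stays = stays[age >= 18]

    print("Root cohort: %d ICU stays from %d patients"
          % (len(stays), stays["SUBJECT_ID"].nunique()))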
Then (extract_episodes_from_subjects.py) we compile a time series of events for each episode, retaining only the variables from a predefined list and performing further cleaning, such as rectifying units of measurement and removing extreme outliers. At the time of writing, we use 17 physiologic variables representing a subset from the PhysioNet/CinC Challenge 2012 [273], as well as patient characteristics like height and age. The resulting data have over 31 million events from 42,276 ICU stays. Finally (split_train_and_test.py), we fix a test set of 15% (5,070) of patients, including 6,328 ICU stays and 4.7 million events. We encourage researchers to follow best practices by interacting with the test data as infrequently as possible. Finally, we prepare the task-specific data sets. Code for generating our benchmark data set from MIMIC-III is available at our GitHub repository.

Our benchmark includes four in-hospital clinical prediction tasks: modeling risk of mortality shortly after admission [338], real-time prediction of physiologic decompensation [318], continuous forecasting of patient LOS [70], and phenotype classification [178]. Each of these tasks is of interest to clinicians and hospitals and is directly related to one or more opportunities for transforming healthcare using big data [20]. These clinical problems also encompass a range of machine learning tasks, including binary and multilabel classification, regression, and time series modeling, and so are of interest to data mining researchers.

10.4.1 In-hospital mortality

Our first benchmark task involves prediction of in-hospital mortality from observations recorded early in an ICU admission. Mortality is a primary outcome of interest in acute care: ICU mortality rates are the highest among hospital units (10% to 29%, depending on age and illness), and early detection of at-risk patients is key to improving outcomes, so building high accuracy predictive models for in-hospital mortality is a grand challenge in healthcare.

Interest in modeling risk of mortality in hospitalized patients dates back over half a century: the Apgar score [13] for assessing risk in newborns, still used throughout the world today, was first published in 1952. The widely used Simplified Acute Physiology Score (SAPS) [163] and the Pediatric Risk of Mortality (PRISM) score [227] for children predict in-hospital mortality from data available within the first 24 hours of admission. The Acute Physiology and Chronic Health Evaluation (APACHE) score [309] can also be converted into a mortality risk estimate. Designed to be computed by hand, such scores usually require as few inputs as possible and focus on individual abnormal observations rather than trends over time. Viewed collectively, empirical comparisons of these scores have been inconclusive (see [303] for a thorough review). However, in the pursuit of increased accuracy, such scores have grown steadily more complex: the APACHE IV score requires nearly twice as many clinical variables as APACHE II, as well as extensive information about patient demographics, pre-existing conditions, and pre-ICU hospitalization [338].
Recent research has used machine learning techniques like state space models and time series mining to integrate more detailed data about the patient into mortality prediction. [187] predicted mortality among pediatric ICU (PICU) patients using a mixture of experts based on a clustering of physiologic time series, improving upon a PRISM-like regression model. Much of this work aims to make predictions based on complex temporal patterns of physiology instead of individual measurements [187, 183]. Others leverage information from clinical notes, extracted using topic models [38, 101]. Existing risk scores can also be improved simply by retraining them on new data [166], a simple form of transfer learning. These approaches outperform traditional baselines but have not been compared on standardized benchmarks. Feedforward neural networks nearly always outperform logistic regression and severity of illness scores in modeling mortality risk among hospitalized patients [83, 41, 58, 44]. To our knowledge, there has been little-to-no work using recurrent neural networks to predict either mortality (eventual or imminent) or decompensation.

Risk of mortality is most often formulated as binary classification using observations recorded from a limited window of time following admission. The target label indicates whether the patient died before hospital discharge. Typical models include only the first 12-24 hours, but we use a wider 48-hour window to enable the detection of patterns that may indicate changes in patient acuity, similar to the PhysioNet/CinC Challenge 2012 [273]. In contrast to our other tasks, the fixed time window in this problem allows the use of models that accept fixed-length inputs without extensive feature engineering.

To prepare our in-hospital mortality data set, we begin with the root cohort and further exclude all ICU stays for which LOS is unknown or less than 48 hours, or for which there are no observations in the first 48 hours. This yields final training and test sets of 17,903 and 3,236 ICU stays, respectively. We determined in-hospital mortality by comparing the patient's date of death (DOD) with hospital admission and discharge times. The resulting mortality rate is 13.23% (2,797 of 21,139 ICU stays).

10.4.2 Physiologic Decompensation

Our second benchmark task involves the detection of patients who are physiologically decompensating, or whose conditions are deteriorating rapidly. Such patients are the focus of "track-and-trigger" initiatives [20], which aim to improve outcomes by rapidly delivering specialized care to the sickest patients. In such programs, patients with abnormal physiology trigger an alert, summoning a rapid response from a team of specialists who assume care of the triggering patient.

These programs are typically implemented using early warning scores, which summarize patient state with a composite score and trigger alerts based on abnormally low values. Examples include the Modified Early Warning Score (MEWS) [286], the VitalPAC Early Warning Score (ViEWS) [232], and the National Early Warning Score (NEWS) [318], which is being deployed throughout the United Kingdom. Like risk scores, most early warning scores are designed to be computed manually and so are based on simple thresholds and a small number of common vital signs. There has been a great deal of machine learning research on the related problem of condition monitoring, but most of this work formulates the task as anomaly detection [7, 235, 185], rather than continuous mortality prediction.
In contrast, decompensation has seen relatively little research: one notable exception is [59], who used a Gaussian process to impute missing values, enabling the continuous application of early warning scores even when vitals are not recorded. Other related works model the risk of individual conditions such as Clostridium difficile infection [317] or sepsis [126] over time using traditional supervised models with hand-engineered features. [311] train a Cox proportional hazards model to predict time-to-mortality in ICU patients but do not directly apply it to the patient deterioration task. [38] and [100] both train temporal risk models but use them to predict eventual, rather than imminent, mortality.

There are many ways to define decompensation, but most objective evaluations of early warning scores are based on accurate prediction of mortality within a fixed time window, e.g., 24 hours, after assessment [318]. Following suit, we formulate our decompensation benchmark task as a binary classification problem, in which the target label indicates whether the patient dies within the next 24 hours.

To prepare the root cohort for decompensation detection, we define a binary label that indicates whether the patient's DOD falls within the next 24 hours of the current time point. We then assign these labels to each hour, starting at four hours after admission to the ICU and ending when the patient dies or is discharged. This yields 2,908,414 and 523,208 instances (individual time points with a label) in the training and test sets, respectively. The decompensation rate is 2.06% (70,696 out of 3,431,622 instances).

10.4.3 Forecasting length of stay

Our third benchmark task involves forecasting hospital LOS, one of the most important drivers of overall hospital cost [3][70]. Hospitals use patient LOS both as a measure of a patient's acuity and for scheduling and resource management [3]. Patients with extended LOS utilize more hospital resources and often have complex, persistent conditions that may not be immediately life threatening but are nonetheless difficult to treat. Reducing healthcare spending requires early identification and treatment of such patients.

[Figure 10.3: Distribution of LOS. Plot (a) shows the distribution of LOS for full ICU stays and of remaining LOS per hour; the rightmost 5% of both distributions are not shown to keep the plot informative. Plot (b) shows a histogram of bucketed patient and hourly remaining LOS (less than one day, one each for 1-7 days, between 7 and 14 days, and over 14 days).]

Most LOS research has focused on identifying factors that influence LOS [127] rather than predicting it. Both severity of illness scores [215] and early warning scores [218] have been used to predict LOS but with mixed success. In a comparison, recent revisions of MPM and SAPS were found to be only weakly correlated with LOS [297], while the best performing model was a variant of APACHE IV that was retrained to predict LOS directly [339]. There has been limited machine learning research concerned with LOS, most of it focused on specific conditions [226] and cohorts [112, 334]. None of this work has addressed continuous prediction of LOS over time.
There is a great deal of early research that uses neural networks to predict LOS in hospitalized patients [112, 199]. However, rather than regression, much of this work formulates the task as binary classification aimed at identifying patients at risk for long stays [37]. Recently, novel deep learning architectures have been proposed for survival analysis [329, 330, 242, 148, 149], a similar time-to-event regression task with right censoring.

LOS is naturally formulated as a regression task. Traditional research focuses on accurate prediction of LOS early in admission, but in our benchmark we predict the remaining LOS once per hour for every hour after admission, similar to decompensation. Such a model can be used to help hospitals and care units make decisions about staffing and resources on a regular basis, e.g., at the beginning of each day or at shift changes.

We prepare the root cohort for LOS forecasting in a manner similar to decompensation: we assign a remaining-LOS target to each time point in sliding window fashion, beginning four hours after admission to the ICU and ending when the patient dies or is discharged. We compute remaining LOS by subtracting total time elapsed from the existing LOS field in MIMIC-III. After filtering, there remain 2,925,434 and 525,912 instances (individual time points) in the training and test sets, respectively. Figure 10.3a shows the distributions of patient LOS and hourly remaining LOS in our final cohort.

In practice, hospitals round to the nearest day when billing, and stays over 1-2 weeks are considered extreme outliers which, if predicted, would trigger special interventions [70]. Thus, we also design a second evaluation that captures how LOS is measured and studied in practice. First, we divide the range of values into ten buckets: one bucket for extremely short visits (less than one day), seven day-long buckets for each day of the first week, and two "outlier" buckets, one for stays of over one week but less than two, and one for stays of over two weeks. This converts length-of-stay prediction into an ordinal multiclass classification problem. Figure 10.3b shows the distribution of bucketed LOS and hourly remaining LOS.

10.4.4 Acute care phenotype classification

Our final benchmark task is phenotyping, i.e., classifying which acute care conditions are present in a given patient record. (Note that we perform "retrospective" phenotype classification, in which we observe a full ICU stay before predicting which diseases are present. This is due in part to a limitation of MIMIC-III: the source of our disease labels, ICD-9 codes, do not have timestamps, so we do not know with certainty when the patient was diagnosed or first became symptomatic. Rather than attempt to assign timestamps using a heuristic, we decided instead to embrace this limitation.) Phenotyping has applications in cohort construction for clinical studies, comorbidity detection and risk adjustment, quality improvement and surveillance, and diagnosis [209]. Traditional research phenotypes are identified via chart review based on criteria predefined by experts, while surveillance phenotypes use simple definitions based primarily on billing (e.g., ICD-9) codes. The adoption of EHRs has led to increased interest in machine learning approaches to phenotyping that treat it as classification [2, 116] or clustering [187, 129]. Early work focused on using mature techniques, such as regularized regression [40] and active learning [54], to train statistical classifiers to recognize clinical phenotypes like rheumatoid arthritis. Recent work has explored ways to make the development of phenotype classifiers more scalable by reducing the need for large numbers of labeled training records [2, 116].

Phenotyping has been a popular application for deep learning researchers in recent years, though model architecture and problem definition vary widely.
[160] and [50] each applied feedforward architectures to clinical time series in sliding window fashion, while [243] applied a temporal convolutional network. Miotto et al. applied a simpler feedforward architecture to a similar problem. [55] used a long short-term memory network to model sequences of diagnostic codes, a proxy task for disease progression. Our work is closely related to [178], who first showed that recurrent neural networks could classify dozens of acute care diagnoses in variable length clinical time series.

In this task we classify 25 conditions that are common in adult ICUs, including 12 critical (and sometimes life-threatening) conditions, such as respiratory failure and sepsis; eight chronic conditions that are common comorbidities and risk factors in critical care, such as diabetes and metabolic disorders; and five conditions considered "mixed" because they are recurring or chronic with periodic acute episodes. To identify these conditions, we use the single-level definitions from the Healthcare Cost and Utilization Project (HCUP) Clinical Classifications Software (CCS) [4]. These definitions group ICD-9 billing and diagnostic codes into mutually exclusive, largely homogeneous disease categories, reducing some of the noise, redundancy, and ambiguity in the original ICD-9 codes. HCUP CCS code groups are used for reporting to state and national agencies, so they constitute sensible phenotype labels.

We determined phenotype labels based on the MIMIC-III ICD-9 diagnosis table. First, we mapped each code to its HCUP CCS category, retaining only the 25 categories from Table 10.1. We then matched diagnoses to ICU stays using the hospital admission identifier, since ICD-9 codes in MIMIC-III are associated with hospital visits, not ICU stays. By excluding hospital admissions with multiple ICU stays, we reduced some of the ambiguity in these labels: there is only one ICU stay per hospital admission with which the diagnosis can be associated. We apply no additional filtering to the phenotyping cohort, so there are 35,621 and 6,281 ICU stays in the training and test sets, respectively. The full list of phenotypes is shown in Table 10.1, along with prevalence within the benchmark data set.
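The label construction just described amounts to two joins. The pandas sketch below illustrates it; the ICD-9-to-CCS mapping file, its BENCHMARK column, and the root-cohort file name are hypothetical stand-ins for whatever form those inputs take in a given pipeline, and the real benchmark code performs additional bookkeeping.

    import pandas as pd

    # Hypothetical inputs: the MIMIC-III diagnosis table, an ICD-9 -> CCS map whose
    # BENCHMARK column flags the 25 categories of Table 10.1, and the root cohort stays.
    diagnoses = pd.read_csv("DIAGNOSES_ICD.csv", dtype={"ICD9_CODE": str})
    ccs_map = pd.read_csv("icd9_to_ccs.csv", dtype={"ICD9_CODE": str})
    stays = pd.read_csv("root_cohort_stays.csv")  # one row per ICU stay in the root cohort

    # Map each ICD-9 code to its CCS category and keep only the benchmark phenotypes.
    diag = diagnoses.merge(ccs_map[ccs_map["BENCHMARK"] == 1], on="ICD9_CODE")

    # ICD-9 codes attach to hospital admissions, so join on HADM_ID; because the root
    # cohort keeps one ICU stay per admission, each diagnosis maps to a unique stay.
    labels = (diag.merge(stays[["HADM_ID", "ICUSTAY_ID"]], on="HADM_ID")
                  .assign(present=1)
                  .pivot_table(index="ICUSTAY_ID", columns="CCS_NAME",
                               values="present", fill_value=0, aggfunc="max"))

    # `labels` is a binary (ICU stay x phenotype) matrix suitable for multi-label training.
    print(labels.shape, labels.mean().sort_values(ascending=False).head())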
Table 10.1: Prevalence of ICU phenotypes in the benchmark data set.

    Phenotype                                    Type      Prevalence (train)   Prevalence (test)
    Acute and unspecified renal failure          acute     0.214                0.212
    Acute cerebrovascular disease                acute     0.075                0.066
    Acute myocardial infarction                  acute     0.103                0.108
    Cardiac dysrhythmias                         mixed     0.321                0.323
    Chronic kidney disease                       chronic   0.134                0.132
    Chronic obstructive pulmonary disease        chronic   0.131                0.126
    Complications of surgical/medical care       acute     0.207                0.213
    Conduction disorders                         mixed     0.072                0.071
    Congestive heart failure; nonhypertensive    mixed     0.268                0.268
    Coronary atherosclerosis and related         chronic   0.322                0.331
    Diabetes mellitus with complications         mixed     0.095                0.094
    Diabetes mellitus without complication       chronic   0.193                0.192
    Disorders of lipid metabolism                chronic   0.291                0.289
    Essential hypertension                       chronic   0.419                0.423
    Fluid and electrolyte disorders              acute     0.269                0.265
    Gastrointestinal hemorrhage                  acute     0.072                0.079
    Hypertension with complications              chronic   0.133                0.130
    Other liver diseases                         mixed     0.089                0.089
    Other lower respiratory disease              acute     0.051                0.057
    Other upper respiratory disease              acute     0.040                0.043
    Pleurisy; pneumothorax; pulmonary collapse   acute     0.087                0.091
    Pneumonia                                    acute     0.139                0.135
    Respiratory failure; insufficiency; arrest   acute     0.181                0.177
    Septicemia (except in labor)                 acute     0.143                0.139
    Shock                                        acute     0.078                0.082

Because diseases can co-occur (in fact, 99% of patients in our benchmark data set have more than one diagnosis), we formulate phenotyping as a multi-label classification problem.

10.5 Evaluation

Here we detail the performance metrics that we support for each task and provide some notes about evaluating model accuracy. The most commonly reported metric in mortality prediction research is area under the receiver operating characteristic curve (AUC-ROC). We also report the area under the precision-recall curve (AUC-PR), since it can be more informative when dealing with highly skewed data sets [77]. We use the same metrics for decompensation as for mortality, i.e., AUC-ROC and AUC-PR. However, because we care about per-instance (vs. per-patient) accuracy, we compute the micro-average AUC over all predictions, regardless of patient.

There is no widely accepted evaluation metric for LOS predictions, so we use a standard regression metric, namely mean absolute deviation (MAD). To evaluate prediction accuracy for the bucketed version of LOS, we use Cohen's linear weighted kappa [60, 36], which measures correlation between ordered items. For phenotype classification, we report macro- and micro-averaged AUC-ROC, similar to [178]. We also add a weighted average AUC-ROC metric that takes disease prevalence into account.

Although we have not officially added measures of calibration to our benchmark, recent research suggests that calibration is critical for clinical impact [202, 39]. Thus, we encourage researchers using our benchmark to assess the calibration of their binary classification models, either visually (see Figure 10.4 for an example) or formally with, for example, the Hosmer-Lemeshow test.

10.5.1 Adding error bars to performance metrics

One of the disadvantages of using a fixed test set is that it complicates the estimation of uncertainty or confidence around performance. This could be critical for assessing whether small improvements are in fact significant. There are three potential sources of uncertainty that might affect performance:

1. Random initialization (when using, e.g., neural networks)
2. Training distribution
3. Test distribution
The first two are straightforward but time consuming to account for, using multiple runs with different random seeds and cross-validation, respectively. With a fixed test set, however, the only option we have for measuring the impact of the test distribution is bootstrapping. To estimate a 95% confidence interval, we resample the test set M times with replacement, measure performance on each resampled test set, and then compute, e.g., the mean and standard deviation of performance and derive confidence intervals. This procedure has been used throughout the literature: [56] report estimates of the standard deviations of their evaluation measures, [279] apply bootstrap resampling on the test set to compute statistically significant differences between models, and [239] and [238] report 95% bootstrap confidence intervals for their prediction models.

10.6 Discussion

10.6.1 In-hospital mortality prediction

In our own experiments with the benchmark, our range of AUC-ROC scores for in-hospital mortality (shown in Figure 7.3) is comparable to those of other published work using similar subsets of MIMIC-III [234, 141]. This inspires some confidence that published work using MIMIC-III may be less sensitive to data preprocessing decisions than we had initially believed.

For risk-related tasks like mortality and decompensation, we are also interested in how reliable the probabilities estimated by our predictive models are. This is known as calibration and is a common method for evaluating predictive models in the clinical research literature. In a well calibrated model, 10% of all patients who receive a predicted 0.1 probability of decompensation do in fact decompensate.

Following our own recommendation, we informally visualize calibration for mortality and decompensation predictions using reliability plots. These are scatter plots of predicted probability (computed by creating decile bins of predictions and then taking the mean value within each bin) vs. actual probability (the rate of, e.g., mortality, within each bin). Better calibrated predictions will fall closer to the diagonal. Figure 10.4a shows the calibration of several in-hospital mortality prediction models. We see that the LSTM-based models look reasonably calibrated, while logistic regression consistently overestimates the actual probability of mortality.

[Figure 10.4: Calibration of in-hospital mortality (a) and decompensation (b) prediction by the best linear, non-multitask, and multitask LSTM-based models. Each panel plots observed frequency against predicted probability.]

One interesting finding is that in-hospital mortality becomes increasingly difficult to predict as LOS increases. As shown in Figure 10.5, for LOS greater than two weeks, our best mortality model is only slightly better than random. The long-tailed distribution of LOS shown in Figure 10.3a suggests a likely explanation: during training, we observe very few examples of patients with long LOS who also die in-hospital. What is more, forecasting events far into the future is an intrinsically difficult task.

[Figure 10.5: In-hospital mortality prediction performance (AUC-ROC) vs. length-of-stay bucket. The confidence intervals and standard deviations are estimated with bootstrapping on the data of each bucket.]
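The bootstrap procedure of Section 10.5.1, which produced the per-bucket intervals in Figure 10.5, is simple enough to sketch in a few lines. The helper below is a generic version (not our exact evaluation code) that works for any metric mapping labels and scores to a scalar, such as AUC-ROC.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def bootstrap_metric(y_true, y_score, metric=roc_auc_score,
                         n_resamples=1000, alpha=0.05, seed=0):
        """Resample the test set with replacement and summarize the metric's spread."""
        rng = np.random.RandomState(seed)
        y_true, y_score = np.asarray(y_true), np.asarray(y_score)
        values = []
        for _ in range(n_resamples):
            idx = rng.randint(0, len(y_true), size=len(y_true))  # sample with replacement
            if np.unique(y_true[idx]).size < 2:                  # skip degenerate resamples
                continue
            values.append(metric(y_true[idx], y_score[idx]))
        lo, hi = np.percentile(values, [100 * alpha / 2, 100 * (1 - alpha / 2)])
        return float(np.mean(values)), float(np.std(values)), (float(lo), float(hi))

    # Example: mean AUC-ROC, its standard deviation, and a 95% interval on the test set.
    # mean_auc, std_auc, (lo, hi) = bootstrap_metric(y_test, predicted_probabilities)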
[Figure 10.6: Predictions of the C + DS baseline over time for deceased patients (left) and living patients (right). Each row shows the last days of a single ICU stay, with remaining length-of-stay (in days) on the horizontal axis. Darker colors mean higher predicted probability. Red and blue colors indicate that the ground-truth label is negative and positive, respectively. Ideally, the right image should be all white, and the left image should be all white except the right-most 24 hours, which should be all dark blue.]

10.6.2 Decompensation

For decompensation, our AUC-ROC scores (Figure 7.4) look extremely high (0.89 or higher), but our much more modest AUC-PR scores suggest that this may be misleading. Decompensation has severe class imbalance (2% of examples have positive labels), which can inflate absolute AUC-ROC. Similarly, we see in Figure 10.4b that the decompensation models are in general more poorly calibrated.

To better understand decompensation, we visualize the predictions of the best decompensation model over time in Figure 10.6. The figure on the left shows a random subset of 100 patients from the test set who died in the ICU. The figure on the right shows another random subset of 100 test set patients who survived until discharge. Every row shows the predictions for the last 100 hours of a single ICU stay. Darker colors indicate higher predicted probability of mortality in the upcoming 24 hours. Red and blue colors indicate ground truth labels (blue is positive mortality). A visual inspection indicates that the model frequently fails to predict decompensation within 24 hours of death (false negatives) and predicts a reasonably large number of false alarms (false positives). This is consistent with the lower AUC-PR score.

10.6.3 Length-of-stay prediction

In general, our results for LOS forecasting are the worst among the four tasks. Our intuition is that this is due in part to the intrinsic difficulty of the task, especially distinguishing between stays of, e.g., 3 and 4 days. Supporting this conclusion, we found that all of our models were far more accurate when predicting short (1-2 days) or long (two weeks or longer) stays than predicting moderately long stays.
To investigate this intuition further, we considered a task formulation similar to [238], where the goal was to predict whether a patient would have an extended LOS (longer than seven days) from only the first 24 hours of data. In order to evaluate our models in a similar manner, we summed the predicted probabilities from our multiclass LOS model for all buckets corresponding to a LOS of seven days or longer. For our best LOS model, this yielded an AUC-ROC of 0.84 for predicting extended LOS at 24 hours after admission. This is comparable to the results of [238], who reported AUC-ROCs of 0.86 and 0.85 on two larger private data sets using an ensemble of several neural architectures. This is especially noteworthy since our models were not trained to solve this particular problem, and it suggests that the extended LOS problem is more tractable than the regression or multiclass versions. Nonetheless, solving the more difficult fine-grained LOS problem remains an important goal for clinical machine learning researchers.

To determine whether our multiclass formulation of the problem might be harming performance, we also trained regression models that directly predict the number of days. These models consistently performed worse than classification models in terms of kappa score, but achieved a higher MAD in general, as shown in Table 10.2.

Table 10.2: Results for the length of stay prediction task (regression).

    Model     Kappa                      MAD
    LinR      0.3361 (0.3345, 0.3376)    116.45 (115.82, 117.03)
    S         0.4332 (0.4316, 0.4346)    94.66 (94.24, 95.10)
    S + DS    0.4125 (0.4110, 0.4140)    94.54 (94.09, 94.98)
    C         0.4235 (0.4220, 0.4251)    94.34 (93.87, 94.79)
    C + DS    0.4260 (0.4244, 0.4276)    94.00 (93.56, 94.45)

10.6.4 Phenotype classification

Our modest phenotype classification results (best score 0.7741, shown in Figure 7.6) are almost identical to those reported in [234] (0.7772), even though they use a different set of aggregated ICD-9 code groups. This raises the question of whether it will be possible to improve upon these results. Table 10.3 shows the individual phenotype AUC-ROCs for the best phenotype classifier. We observe that performance on the individual diseases varies widely, from 0.6834 (essential hypertension) to 0.9089 (acute cerebrovascular disease). Unsurprisingly, chronic diseases are harder to predict than acute ones (0.7964 vs 0.7475). There is no positive correlation between disease prevalence and AUC-ROC score. Moreover, the worst performance is observed for the most common phenotype (essential hypertension).

10.6.5 Published work using our benchmark

Since the release of the preliminary version of the benchmark codebase, several researchers have published notable work utilizing our benchmark. [280] applied a transformer network architecture, which uses attention instead of recurrent layers, to our full benchmark suite and even tested a multitask version of their model. They achieved results comparable to the early results we published with the preliminary release of our benchmark. We have since outperformed them, so it remains to be seen whether the transformer network really is competitive with LSTMs. [17] applied a novel variation of capsule networks to just the phenotyping task of the benchmark. [57] proposed a hybrid RNN-Gaussian process architecture and evaluated it on the LOS task in our benchmark. Finally, [236] derived a brand new prediction task (readmission risk) using our benchmark data.

10.7 Conclusion

In this chapter, we proposed four first-of-their-kind standardized benchmarks for machine learning researchers interested in clinical data problems, including in-hospital mortality, decompensation, length of stay, and phenotyping. Our benchmark data set is similar to other MIMIC-III patient cohorts described in machine learning publications but makes use of a larger number of patients and is immediately accessible to other researchers who wish to replicate our experiments or build upon our work.

We plan to continue improving and expanding our benchmark data set by adding further observational variables, inputs and outputs, and medications and treatments. This should not only enable improved performance on the existing benchmarks but also facilitate interesting new research problems, such as treatment recommendation and causal inference. We will track published results using our benchmark at our project repository on GitHub, where code to construct the benchmark and train and run our models can be found. We also plan to refine our evaluations based on community feedback. For example, we would like to include formal measures of calibration (visualized informally in Figure 10.4) for mortality and decompensation based on, e.g., the Hosmer-Lemeshow test.
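As a starting point for that refinement, both the decile-based reliability points behind Figure 10.4 and a Hosmer-Lemeshow-style statistic are easy to compute. The sketch below is a standard formulation rather than our exact evaluation code; it assumes binary labels and predicted probabilities, and the synthetic data at the end exist only to make the example runnable.

    import numpy as np
    from scipy.stats import chi2
    from sklearn.calibration import calibration_curve

    def hosmer_lemeshow(y_true, y_prob, n_bins=10):
        """Hosmer-Lemeshow goodness-of-fit statistic and p-value (chi^2, n_bins - 2 df)."""
        y_true = np.asarray(y_true, dtype=float)
        y_prob = np.asarray(y_prob, dtype=float)
        groups = np.array_split(np.argsort(y_prob), n_bins)  # decile groups by predicted risk
        stat = 0.0
        for g in groups:
            n = len(g)
            o1, e1 = y_true[g].sum(), y_prob[g].sum()   # observed / expected positives
            o0, e0 = n - o1, n - e1                     # observed / expected negatives
            stat += (o1 - e1) ** 2 / max(e1, 1e-12) + (o0 - e0) ** 2 / max(e0, 1e-12)
        return stat, float(chi2.sf(stat, df=n_bins - 2))

    # Tiny synthetic example: well-calibrated scores should yield a large p-value.
    rng = np.random.RandomState(0)
    y_prob = rng.uniform(0.0, 0.3, size=5000)
    y_true = rng.binomial(1, y_prob)
    # Reliability points (observed frequency vs. mean predicted probability per decile),
    # i.e., the quantities plotted in Figure 10.4.
    obs_freq, mean_pred = calibration_curve(y_true, y_prob, n_bins=10, strategy="quantile")
    print(hosmer_lemeshow(y_true, y_prob))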
Table 10.3: Per-phenotype classification performance for the best overall multitask LSTM.

    Phenotype                                    AUC-ROC
    Acute and unspecified renal failure          0.8057
    Acute cerebrovascular disease                0.9089
    Acute myocardial infarction                  0.7760
    Cardiac dysrhythmias                         0.6870
    Chronic kidney disease                       0.7706
    Chronic obstructive pulmonary disease        0.6951
    Complications of surgical/medical care       0.7239
    Conduction disorders                         0.7371
    Congestive heart failure; nonhypertensive    0.7632
    Coronary atherosclerosis and related         0.7967
    Diabetes mellitus with complications         0.8719
    Diabetes mellitus without complication       0.7966
    Disorders of lipid metabolism                0.7281
    Essential hypertension                       0.6834
    Fluid and electrolyte disorders              0.7390
    Gastrointestinal hemorrhage                  0.7507
    Hypertension with complications              0.7497
    Other liver diseases                         0.7781
    Other lower respiratory disease              0.6941
    Other upper respiratory disease              0.7853
    Pleurisy; pneumothorax; pulmonary collapse   0.7085
    Pneumonia                                    0.8088
    Respiratory failure; insufficiency; arrest   0.9064
    Septicemia (except in labor)                 0.8535
    Shock                                        0.8921
    All acute diseases (macro-averaged)          0.7964
    All mixed (macro-averaged)                   0.7675
    All chronic diseases (macro-averaged)        0.7457
    All diseases (macro-averaged)                0.7764

Chapter 11
Concluding Remarks

Dr. Jim Fackler of Johns Hopkins University has been a mainstay of the annual Machine Learning for Healthcare Conference (MLHC) and, before that, the Meaningful Use of Complex Medical Data (MUCMD) Symposium. At an early MUCMD meeting, Dr. Fackler participated in a panel-led discussion about problems that clinicians want computer scientists to solve. Dr. Fackler made the following observation, which has stuck with me throughout my PhD: doctors serve four primary roles in the practice of medicine, namely,

1. Memorizers of facts
2. Recorders and interpreters of patient data
3. Clinical decision-makers
4. Patient advocates and communicators

Dr. Fackler went on to proclaim his confidence that computers should be able to do (1)-(3) better than he can, and once they do, he will be able to focus on the most important and rewarding part of the job: communicating with patients. In his own no-nonsense way, Dr.
A bevy of EHR windows to click through, long lists of variables to track and interpret, and a growing cognitive burden to manage with even less time to do so [62, 285]. It is no wonder, then, that so many stakeholders, from doctors to patients to ad- ministrators, rest their hopes on machine learning. Decades of research and successful commercial applications strongly suggest that machine learning should be able to deliver what Dr. Fackler is looking for: an automated computer system that can ingest digital patient data and then provide actionable insights and predictions. Then, rather than pore over charted data presented in Excel-like tables, Dr. Fackler will instead look at predicted 239 outcomes, likely diagnoses, signicant physiologic patterns, and common treatments for past similar patients [182]. The research described in this thesis suggests that Dr. Fackler's vision may closer than he thinks. For example, our best performing models in Chapters 5 and 6 consistently ranked the correct diagnosis in the top ten most likely diagnoses out of over a hundred possibilities. This is certainly a long way o from an automated diagnostician, but it might be feasible to build a Net ix-style recommender system around such a model, with the goal of accelerating dierential diagnosis. Likewise, in Chapter 7 we predicted in- hospital mortality far more accurately than SOI scores that are in common use today. Fast and accurate patient similarity searches like those described in Chapter 3 could provide bedside clinicians with historical context in the absence of guidelines or evidence from clinical research [182]. 11.1 Open Challenges While this thesis describes real advances in the application of machine learning to health- care data, but we do not wish to overstate our progress. Learning to diagnose systems are not quite ready for the bedside, and there remain open problems for clinical machine learning researchers to tackle. As many of these challenges have been discussed at length elsewhere [102], so we will restrict our attention to challenges that are most relevant for learning to diagnose. 240 11.1.1 Representing heterogeneous clinical data Our research demonstrated the advantages of learning, rather than engineering, represen- tations of clinical data. Our most eective diagnosis models used RNNs applied to clinical time series with preprocessing limited to normalizing numerical values, resampling sparse time series, and ecoding missing values. Recent work suggests that resampling can be avoided entirely by simply encoding time as another input [130, 55] or adding an input layer that approximates a Gaussian process [94]. In addition, our results suggest that it may not always be necessary to encode missing values if missingness is obvious from the values themselves. As promising as these ndings are, they are nonetheless restricted to relatively ho- mogeneous clinical time series. The picture becomes more complicated when we consider other data types: for example, how would we jointly model time series and clinical notes? One possibility is to preprocess notes into a continuous representation using, e.g., topic proportions from a topic model [101]. However, this adds an additional preprocessing step that we would rather avoid. An intriguing alternative is to encode all data as semi- structured text (using, for example, the HL-7 FHIR standard) and then apply well-known methods from natural language processing [238]. 
11.1.2 Learning from small data with less compute Learning eective representations often involves training complicated models with large amounts of labeled training data, which may not be feasible if we have train a new model from scratch for each task. In an extreme example, [238] spent over 200,000 GPU hours training on hundreds of thousands of records to create a handful of models for only four 241 clinical prediction tasks. What is more, large labeled datasets are not available for many clinical tasks, making it dicult to learn eective representations. What we would like is an ability to learn a single representation (perhaps from a large amount of unlabeled data) that can then be used across a variety of tasks for which very little labeled data may be available [275]. Our research suggests this should be possible: the hidden layers in our multitask RNN learn intermediate representations that are useful for predicting four dierent clinical outcomes. Nonetheless, it remains an open question whether these representations might be useful for a new task, e.g., readmission. In work under preparation [82], we oer an armative answer for clinical text: we eectively train simple linear models for half a dozen prediction tasks using clinical term embeddings as features. The embeddings themselves were trained in a fully unsupervised manner and used as-is in the prediction models, making them task agnostic. What is more, the prediction models were trained on representatively small data sets, with as few as a hundred labeled examples. It remains to be seen whether this nding generalizes to other data types and tasks. 11.1.3 Incorporating domain knowledge into data-driven models Another solution to the small data problem involves incorporating the wealth of existing medical knowledge into otherwise data-driven models. Recall our motivating example of bradycardia from Chapter 1: we know that bradycardia is characterized by an abnormally low heart rate. There is no reason to re-discover this fact when training a machine learning model. Rather we should constrain our search to only those models that are consistent in some sense with this knowledge. By restricting our hypothesis space in 242 this way, we should be able to speed up learning and reduce the amount of labeled training data needed. On the other hand, what happens when domain knowledge is wrong or incomplete? The Pathnder expert system provides a cautionary tale: a major source of errors in Pathnder II was that it assigned zero probability to events that clinical experts deemed impossible [157]. Returning to our bradycardia example, doctors generally consider a heart rate below 60 beats-per-minute (BPM) to be bradycardia but not in endurance athletes or someone sleeping. How do we incorporate these sorts of \rules of thumb" into machine learning so that the model learns when to use them and when to disregard them? Our research on weakly supervised topic modeling indicates one path forward: multi- variate mutual information provides a principled and exible framework to indicate that two or more phenomena are associated without having to specify the precise relation- ship [312]. Our results and subsequent work [95] have demonstrated that this approach can be used to place weak constraints on the latent factors in an otherwise unsupervised model. 
The exibility of mutual information-based approaches comes with a weakness, however: it can be dicult to enforce specic relationships between variables, for exam- ple, a lower heart rate during hospitalization should on average increase the likelihood of a bradycardia. These sorts of relationships can be encoded using monotonicity con- straints [272] on input-output relationships. Resarch on learning with hints [1] goes back several decades but has been underutilized in healthcare applications. 243 Bibliography [1] Yaser S Abu-Mostafa. Hints. Neural Computation, 7(4):639{671, 1995. [2] Vibhu Agarwal, Tanya Podchiyska, Juan M Banda, Veena Goel, Tiany I Leung, Evan P Minty, Timothy E Sweeney, Elsie Gyang, and Nigam H Shah. Learning statistical models of phenotypes using noisy labeled training data. Journal of the American Medical Informatics Association, page ocw028, 2016. [3] Agency for Healthcare Research and Quality. Selecting Quality and Re- source Use Measures: A Decision Guide for Community Quality Collaboratives, 2014. URL https://www.ahrq.gov/professionals/quality-patient-safety/ quality-resources/tools/perfmeasguide/perfmeaspt3.html. Last accessed February 9, 2017. [4] Agency for Healthcare Research and Quality: Healthcare Cost and Utilization Project. Clinical Classications Software for ICD-9-CM Fact Sheet, 2012. URL https://www.hcup-us.ahrq.gov/toolssoftware/ccs/ccsfactsheet.jsp. Last accessed February 9, 2017. [5] Agency for Healthcare Research and Quality: Healthcare Cost and Uti- lization Project. Introduction to the HCUP National Inpatient Sam- ple, 2014, 2014. URL https://www.hcup-us.ahrq.gov/db/nation/nis/NIS_ Introduction_2014.jsp. Last accessed August 3, 2018. [6] Agency for Healthcare Research and Quality Patient Safety Network. Diagnostic errors, June 2017. Last accesseed 2018-08-03. [7] Norm Aleks, Stuart J Russell, Michael G Madden, Diane Morabito, Kristan Stau- denmayer, Mitchell Cohen, and Georey T Manley. Probabilistic detection of short events, with application to critical care monitoring. In Advances in Neural Infor- mation Processing Systems 21, 2009. [8] Paul D Allison. Missing data, volume 136. Sage publications, 2001. [9] Shun-ichi Amari and Andrzej Cichocki. Adaptive blind signal processing-neural network approaches. Proceedings of the IEEE, 1998. [10] American Medical Assocation. 2017 AMA Prior Authorization Physician Sur- vey, March 2018. URL https://www.ama-assn.org/sites/default/files/ media-browser/public/arc/prior-auth-2017.pdf. Last accessed August 3, 2018. 244 [11] David Andrzejewski and Xiaojin Zhu. Latent dirichlet allocation with topic-in-set knowledge. In Proceedings of the NAACL HLT 2009 Workshop on Semi-Supervised Learning for Natural Language Processing, pages 43{48, 2009. [12] David Andrzejewski, Xiaojin Zhu, and Mark Craven. Incorporating domain knowl- edge into topic modeling via dirichlet forest priors. In Proceedings of the 26th International Conference on Machine Learning, pages 25{32, 2009. [13] Virgina Apgar. A proposal for a new method of evaluation of the newborn. Curr. Res. Anesth. Analg., 32(449):260{267, 1952. [14] Sanjeev Arora, Rong Ge, and Ankur Moitra. Learning topic models { going beyond SVD. In Proceedings of the 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science, FOCS '12, pages 1{10, 2012. [15] Michael Auli, Michel Galley, Chris Quirk, and Georey Zweig. Joint language and translation modeling with recurrent neural networks. In EMNLP, 2013. [16] Kevin Bache and Moshe Lichman. 
UCI Machine Learning Repository, 2013. Last accessed August 3, 2018.

[17] Mohammad Taha Bahadori. Spectral capsule networks. In Proceedings of the Sixth International Conference on Learning Representations Workshop Track, 2018.

[18] Mohammad Taha Bahadori, Yan Liu, and Dan Zhang. Learning with Minimum Supervision: A General Framework for Transductive Transfer Learning. In Proceedings of the 11th IEEE International Conference on Data Mining, pages 61-70, 2011.

[19] Jon Barker, Phil Green, and Martin Cooke. Linking auditory scene analysis and robust ASR by missing data techniques. 2001.

[20] David W Bates, Suchi Saria, Lucila Ohno-Machado, Anand Shah, and Gabriel Escobar. Big data in health care: using analytics to identify and manage high-risk and high-cost patients. Health Affairs, 33(7):1123-1131, 2014.

[21] W. G. Baxt. Application of artificial neural networks to clinical medicine. The Lancet, 346(8983):1135-1138, 1995.

[22] Shai Ben-David, John Blitzer, Koby Crammer, and O Pereira. Analysis of Representations for Domain Adaptation. In Advances in Neural Information Processing Systems 19, pages 137-144. 2007.

[23] Y Bengio, A Courville, and P Vincent. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1798-1828, 2013.

[24] Yoshua Bengio and Francois Gingras. Recurrent neural networks for missing or asynchronous data. In Advances in Neural Information Processing Systems 11, pages 395-401, 1996.

[25] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157-166, 1994.

[26] Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. In B Schölkopf, J C Platt, and T Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 153-160. MIT Press, 2007.

[27] Yoshua Bengio, Grégoire Mesnil, Yann Dauphin, and Salah Rifai. Better mixing via deep representations. In Proceedings of the 30th International Conference on Machine Learning, pages 552-560, 2013.

[28] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010. URL http://deeplearning.net/software/theano/.

[29] Alina Beygelzimer, John Langford, Daniel Hsu, and Zhang Tong. Agnostic Active Learning Without Constraints. In Advances in Neural Information Processing Systems 23, pages 199-207. 2011.

[30] Biomarkers Definitions Working Group. Biomarkers and surrogate endpoints: Preferred definitions and conceptual framework. Clinical Pharmacology & Therapeutics, 69(3):89-95, 2001.

[31] John Blitzer, Mark Dredze, and Fernando Pereira. Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, 2007.

[32] John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman. Learning Bounds for Domain Adaptation. In Advances in Neural Information Processing Systems 21, pages 1-12. 2008.

[33] Marsden S Blois, Mark S Tuttle, and David D Sherertz. Reconsider: A program for generating differential diagnoses. In Proceedings of the Symposium on Computer Applications in Medical Care, pages 263-268, November 1981.
[34] Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Modeling temporal dependencies in high-dimensional sequences: application to polyphonic music generation and transcription. In Proceedings of the 29th International Conference on Machine Learning, pages 1881–1888, 2012.
[35] Michael Bowie, Edmon Begoli, Byung Park, and Jeevith Bopaiah. Towards an LSTM-based approach for detection of temporally anomalous data in medical datasets. In MIT International Conference on Information Quality, 2017.
[36] Robert L Brennan and Dale J Prediger. Coefficient kappa: Some uses, misuses, and alternatives. Educational and psychological measurement, 41(3):687–699, 1981.
[37] Timothy G Buchman, Kenneth L Kubos, Alexander J Seidler, and Michael J Siegforth. A comparison of statistical and connectionist models for the prediction of chronicity in a surgical intensive care unit. Critical care medicine, 22(5):750–762, 1994.
[38] Karla L Caballero Barajas and Ram Akella. Dynamically modeling patient's health state from electronic medical records: A time series approach. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 69–78. ACM, 2015.
[39] Ben Van Calster and Andrew J Vickers. Calibration of risk prediction models: Impact on decision-analytic performance. Medical Decision Making, 35(2):162–169, 2015. PMID: 25155798.
[40] Robert J Carroll, Will K Thompson, Anne E Eyler, Arthur M Mandelin, Tianxi Cai, Raquel M Zink, Jennifer A Pacheco, Chad S Boomershine, Thomas A Lasko, Hua Xu, et al. Portability of an algorithm to identify rheumatoid arthritis in electronic health records. Journal of the American Medical Informatics Association, 19(e1):e162–e169, 2012.
[41] Rich Caruana, Shumeet Baluja, Tom Mitchell, et al. Using the future to "sort out" the present: Rankprop and multitask learning for medical risk evaluation. In Advances in Neural Information Processing Systems 8, pages 959–965, 1996.
[42] Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noémie Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1721–1730. ACM, 2015.
[43] Joan A Casey, Brian S Schwartz, Walter F Stewart, and Nancy E Adler. Using electronic health records for population health research: A review of methods and applications. Annu Rev Public Health, 37:61–81, 2016.
[44] Leo Anthony Celi, Sean Galvin, Guido Davidzon, Joon Lee, Daniel Scott, and Roger Mark. A database-driven decision support system: customized mortality prediction. Journal of personalized medicine, 2(4):138–148, 2012.
[45] Krzysztof Chalupka, Pietro Perona, and Frederick Eberhardt. Visual causal feature learning. pages 181–190, 2015.
[46] Yee Seng Chan and Hwee Tou Ng. Domain Adaptation with Active Learning for Word Sense Disambiguation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, 2007.
[47] Wendy W Chapman, Will Bridewell, Paul Hanbury, Gregory F Cooper, and Bruce G Buchanan. A simple algorithm for identifying negated findings and diseases in discharge summaries. Journal of Biomedical Informatics, 34(5):301–310, 2001.
[48] Moses S Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing, pages 380–388, 2002.
[49] Rita Chattopadhyay, S Mill Ave, Wei Fan, and Ian Davidson.
Joint Transfer and Batch-mode Active Learning. In Proceedings of the 30th Annual International Con- ference on Machine Learning, 2013. [50] Zhengping Che, David C Kale, Wenzhe Li, Mohammad Taha Bahadori, and Yan Liu. Deep computational phenotyping. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015. [51] Zhengping Che, David C Kale, Wenzhe Li, Mohammad Taha Bahadori, and Yan Liu. Deep computational phenotyping. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 507{516. ACM, 2015. [52] Huanhuan Chen, Fengzhen Tang, Peter Tino, and Xin Yao. Model-based kernel for ecient time series analysis. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 392{400, 2013. [53] Peixian Chen, Nevin L Zhang, Leonard KM Poon, and Zhourong Chen. Progressive EM for latent tree models and hierarchical topic detection, 2016. [54] Yukun Chen, Robert J Carroll, Eugenia R McPeek Hinz, Anushi Shah, Anne E Eyler, Joshua C Denny, and Hua Xu. Applying active learning to high-throughput phenotyping algorithms for electronic health records data. Journal of the American Medical Informatics Association, 20(e2):e253{e259, 2013. [55] Edward Choi, Mohammad Taha Bahadori, Andy Schuetz, Walter F Stewart, and Jimeng Sun. Doctor AI: Predicting clinical events via recurrent neural networks. In Proceedings of the 1st Machine Learning for Healthcare Conference, 2016. [56] Edward Choi, Mohammad Taha Bahadori, Jimeng Sun, Joshua Kulas, Andy Schuetz, and Walter Stewart. Retain: An interpretable predictive model for health- care using reverse time attention mechanism. In Advances in Neural Information Processing Systems, pages 3504{3512, 2016. [57] Ingyo Chung, Saehoon Kim, Juho Lee, Sung Ju Hwang, and Eunho Yang. Mixed eect composite RNN-GP: A personalized and reliable prediction model for health- care. arXiv:1806.01551, 2018. [58] Gilles Clermont, Derek C Angus, Stephen M DiRusso, Martin Grin, and Walter T Linde-Zwirble. Predicting hospital mortality for patients in the intensive care unit: 248 a comparison of articial neural networks with logistic regression models. Critical care medicine, 29(2):291{296, 2001. [59] Lei Clifton, David A Clifton, Marco AF Pimentel, Peter J Watkinson, and Lionel Tarassenko. Gaussian process regression in vital-sign early warning systems. In Engineering in Medicine and Biology Society (EMBC), 2012 Annual International Conference of the IEEE, pages 6161{6164. IEEE, 2012. [60] Jacob Cohen. A coecient of agreement for nominal scales. Educational and psy- chological measurement, 20(1):37{46, 1960. [61] Patricia Cohen, Stephen G West, and Leona S Aiken. Applied multiple regres- sion/correlation analysis for the behavioral sciences. Psychology Press, 3 edition, 2003. [62] Roger Collier. Electronic health records contributing to physician burnout. CMAJ, 189(45):E1405{E1406, 2017. [63] Ronan Collobert and Jason Weston. A unied architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages 160{167. ACM, 2008. [64] Ronan Collobert, Koray Kavukcuoglu, and Cl ement Farabet. Torch7: A matlab-like environment for machine learning. In NIPS BigLearn Workshop, 2011. [65] Mark Craven and Johan Kumlien. Constructing biological knowledge bases by ex- tracting information from text sources. 
In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, pages 77{86, 1999. [66] Dana C Crawford, David R Crosslin, Gerard Tromp, Iftikhar J Kullo, Helena Kuiv- aniemi, M Georey Hayes, Joshua C Denny, William S Bush, Jonathan L Haines, Dan M Roden, et al. eMERGEing progress in genomics|the rst seven years. Frontiers in genetics, 5:184, 2014. [67] Marco Cuturi. Fast global alignment kernels. In Proceedings of the 28th Interna- tional Conference on Machine Learning, pages 929{936, 2011. [68] Marco Cuturi and Arnaud Doucet. Autoregressive kernels for time series. arXiv:1101.0673, 2011. [69] Filip Dabek and Jesus J Caban. A neural network based model for predicting psychological conditions. In International Conference on Brain Informatics and Health, pages 252{261. 2015. [70] Deborah Dahl, Greg G Wojtal, Michael J Breslow, Debra Huguez, David Stone, and Gloria Korpi. The high cost of low-acuity ICU outliers. Journal of Healthcare Management, 57(6):421{433, 2012. [71] George E Dahl, Dong Yu, Li Deng, and Alex Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on audio, speech, and language processing, 20(1):30{42, 2012. 249 [72] Andrew M Dai and Quoc V Le. Semi-supervised sequence learning. In Advances in neural information processing systems 30, pages 3079{3087, 2015. [73] Manhong Dai, Nigam H Shah, Wei Xuan, Mark A Musen, Stanley J Watson, Brian D Athey, Fan Meng, et al. An ecient solution for mapping free text to ontology terms. volume 21, 2008. [74] Sanjoy Dasgupta. Two Faces of Active Learning. Theoretical Computer Science, 412(19):1767{1781, 2011. [75] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S Mirrokni. Locality- sensitive hashing scheme based on p-stable distributions. In Proceedings of the Twentieth Annual Symposium on Computational Geometry, pages 253{262, 2004. [76] Jason V Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S Dhillon. Information-theoretic metric learning. In Proceedings of the 24th International Con- ference on Machine Learning, pages 209{216, 2007. [77] Jesse Davis and Mark Goadrich. The relationship between precision-recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning, pages 233{240, 2006. [78] N M De Mey, Joost Wauters, Alexander Wilmer, and Philippe Meersseman. Autopsy-detected diagnostic errors in critically ill patients with cirrhosis. Criti- cal Care, 18(1):P37, Mar 2014. [79] Wim De Mulder, Steven Bethard, and Marie-Francine Moens. A survey on the application of recurrent neural networks to statistical language modeling. Computer Speech & Language, 30(1):61{98, 2015. [80] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248{255, 2009. [81] Pedro Domingos. A few useful things to know about machine learning. Commun. ACM, 55(10):78{87, Oct 2012. [82] Sebastien Dubois, Nathanael Romano, David C Kale, Nigam Shah, and Kenneth Jung. Eective representations of clinical notes, 2017. [83] Richard Dybowski, Vanya Gant, P Weller, and R Chang. Prediction of outcome in critically ill patients using articial neural network synthesised by genetic algorithm. The Lancet, 347(9009):1146{1150, 1996. [84] Mohamed Elhoseiny, Tarek El-Gaaly, Amr Bakry, and Ahmed Elgammal. 
A com- parative analysis and study of multiview cnn models for joint object categorization and pose estimation. In Proceedings of The 33rd International Conference on Ma- chine Learning, pages 888{897, 2016. 250 [85] Anne Elixhauser, Claudia Steiner, D Robert Harris, and Rosanna M Coey. Co- morbidity measures for use with administrative data. Medical care, 36(1):8{27, 1998. [86] Jerey L Elman. Finding structure in time. Cognitive science, 14(2):179{211, 1990. [87] Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent. Visualizing higher-layer features of a deep network. Dept. IRO, Universit e de Montr eal, Tech. Rep, 2009. [88] Dumitru Erhan, Aaron Courville, and Yoshua Bengio. Understanding represen- tations learned in deep architectures. Dept. Inf. Res. Oper., Univ. Montr eal, Montr eal, QC, Canada, Tech. Rep, 1355, 2010. [89] David Ferrucci, Anthony Levas, Sugato Bagchi, David Gondek, and Erik T Mueller. Watson: beyond jeopardy! Articial Intelligence, 199:93{105, 2013. [90] Debra Henry Fiser. Assessing the outcome of pediatric intensive care. Pediatrics, 121(1):68{74, 1992. [91] Susannah Fleming, Matthew Thompson, Richard Stevens, Carl Heneghan, Annette Pl uddemann, Ian Maconochie, Lionel Tarassenko, and David Mant. Normal ranges of heart rate and respiratory rate in children from birth to 18 years of age: a systematic review of observational studies. The Lancet, 377(9770):1011{1018, 2011. [92] Elizabeth Ford, John A Carroll, Helen E Smith, Donia Scott, and Jackie A Cas- sell. Extracting information from the text of electronic medical records to improve case detection: a systematic review. Journal of the American Medical Informatics Association, 23(5):1007, 2016. [93] Joseph Futoma, Sanjay Hariharan, and Katherine Heller. Learning to detect sepsis with a multitask gaussian process rnn classier. In Proceedings of the 34th Inter- national Conference on Machine Learning, pages 1174{1182, 2017. [94] Joseph Futoma, Sanjay Hariharan, Katherine Heller, Mark Sendak, Nathan Brajer, Meredith Clement, Armando Bedoya, and Cara O'Brien. An improved multi-output gaussian process rnn with real-time validation for early sepsis detection. In Pro- ceedings of the 2nd Machine Learning for Healthcare Conference, pages 243{254, 2017. [95] Ryan J Gallagher, Kyle Reing, David C Kale, and Greg Ver Steeg. Anchored corre- lation explanation: Topic modeling with minimal domain knowledge. Transactions of the Association for Computational Linguistics, 5:529{542, 2017. [96] Pedro J Garc a-Laencina, Jos e-Luis Sancho-G omez, and An bal R Figueiras-Vidal. Pattern classication with missing data: a review. Neural Computing and Applica- tions, 2010. 251 [97] Felix A Gers and J urgen Schmidhuber. Recurrent nets that time and count. In Neural Networks, 2000. IJCNN 2000, Proceedings of the IEEE-INNS-ENNS Inter- national Joint Conference on, volume 3, pages 189{194. IEEE, 2000. [98] Felix A Gers, J urgen Schmidhuber, and Fred Cummins. Learning to forget: Con- tinual prediction with LSTM. Neural Computation, 2000. [99] Felix A Gers, Nicol N Schraudolph, and J urgen Schmidhuber. Learning precise timing with lstm recurrent networks. Journal of machine learning research, 3(Aug): 115{143, 2002. [100] Marzyeh Ghassemi, Tristan Naumann, Finale Doshi-Velez, Nicole Brimmer, Rohit Joshi, Anna Rumshisky, and Peter Szolovits. Unfolding physiological state: Mor- tality modelling in intensive care units. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 75{84. 
ACM, 2014. [101] Marzyeh Ghassemi, Marco AF Pimentel, Tristan Naumann, Thomas Brennan, David A Clifton, Peter Szolovits, and Mengling Feng. A multivariate timeseries modeling approach to severity of illness assessment and forecasting in icu with sparse, heterogeneous clinical data. In Proceedings of the Twenty-Ninth AAAI Con- ference on Articial Intelligence, volume 2015, page 446, 2015. [102] Marzyeh Ghassemi, Tristan Naumann, Peter Schulam, Andrew L Beam, and Rajesh Ranganath. Opportunities in machine learning for healthcare, 2018. [103] Mic ol E Gianinazzi, Corina S Rueegg, Karin Zimmerman, Claudia E Kuehni, Gisela Michel, and the Swiss Paediatric Oncology Group (SPOG). Intra-rater and inter- rater reliability of a medical record abstraction study on transition of care after childhood cancer. PLoS ONE, 10(5):1{13, 05 2015. [104] Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Similarity search in high di- mensions via hashing. In Proceedings of the 25th International Conference on Very Large Data Bases, pages 518{529, 1999. [105] Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio. Deep learn- ing, volume 1. MIT press Cambridge, 2016. [106] Ian J Goodfellow, David Warde-Farley, Pascal Lamblin, Vincent Dumoulin, Mehdi Mirza, Razvan Pascanu, James Bergstra, Fr ed eric Bastien, and Yoshua Bengio. Pylearn2: a machine learning research library. arXiv:1308.4214, 2013. [107] G Anthony Gorry, Jerome P Kassirer, Alvin Essig, and William B Schwartz. De- cision analysis as the basis for computer-aided management of acute renal failure. The American Journal of Medicine, 55(4):473 { 484, 1973. [108] Kristen Grauman and Rob Fergus. Learning binary hash codes for large-scale image search. In Machine Learning for Computer Vision, volume 411, pages 49{87. Springer Berlin Heidelberg, 2013. 252 [109] Alex Graves. Supervised sequence labelling with recurrent neural networks. volume 385 of Studies in Computational Intelligence. Springer-Verlag Berlin Heidelberg, 2012. [110] Alex Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013. [111] Alex Graves, Marcus Liwicki, Santiago Fern andez, Roman Bertolami, Horst Bunke, and J urgen Schmidhuber. A novel connectionist system for unconstrained handwrit- ing recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009. [112] Jim Grigsby, Robert Kooken, and John Hershberger. Simulated neural networks to predict outcomes, costs, and length of stay among orthopedic rehabilitation patients. Archives of physical medicine and rehabilitation, 75(10):1077{1081, 1994. [113] Jerome Groopman. How Doctors Think. Houghton Miin, Boston, MA, 2007. [114] Yoni Halpern, Youngduck Choi, Steven Horng, and David Sontag. Using anchors to estimate clinical state without labeled data. In AMIA Annual Symposium Pro- ceedings, pages 606{615, 2014. [115] Yoni Halpern, Steven Horng, and David Sontag. Anchored discrete factor analysis. arXiv:1511.03299, 2015. [116] Yoni Halpern, Steven Horng, Youngduck Choi, and David Sontag. Electronic med- ical record phenotyping using the anchor and learn framework. Journal of the American Medical Informatics Association, 23(4):731, 2016. [117] Yoni Halpern, Steven Horng, and David Sontag. Clinical tagging with joint proba- bilistic models. In Proceedings of the 1st Machine Learning for Healthcare Confer- ence, 2016. [118] Nils Y Hammerla, James M Fisher, Peter Andras, Lynn Rochester, Richard Walker, and Thomas Pl otz. 
PD disease state assessment in naturalistic environments using deep learning. In Proceedings of the Twenty-Ninth AAAI Conference on Articial Intelligence, 2015. [119] Steve Hanneke. A bound on the label complexity of agnostic active learning. In Proceedings of the 24th International Conference on Machine learning, pages 353{ 360, New York, New York, USA, 2007. ACM Press. [120] Rave Harpaz, Alison Callahan, Suzanne Tamang, Yen Low, David Odgers, Sam Finlayson, Kenneth Jung, Paea LePendu, and Nigam H Shah. Text mining for adverse drug events: the promise, challenges, and state of the art. Drug safety, 37 (10):777{790, 2014. [121] David E Heckerman. Probabilistic similarity networks. Networks, August 1990. 253 [122] David E Heckerman and Bharat N Nathwani. An evaluation of the diagnostic accuracy of pathnder. Computers and Biomedical Research, 25(1):56 { 74, 1992. [123] David E Heckerman and Bharat N Nathwani. Toward normative expert systems part II. Methods of Information in Medicine, 31:106{116, August 1992. [124] David E Heckerman, Eric J Horvitz, and Bharat N Nathwani. Toward normative expert systems part I. Methods of Information in Medicine, 31:90{105, June 1992. [125] JaWanna Henry, Yuriy Pylypchuk, MPA Talisha Searcy, and Vaishali Patel. Adop- tion of Electronic Health Record Systems Among US Non-Federal Acute Care Hos- pitals: 2008-2015. ONC Data Brief, 35, 2015. [126] Katharine E Henry, David N Hager, Peter J Pronovost, and Suchi Saria. A targeted real-time early warning score (TREWScore) for septic shock. Science Translational Medicine, 7(299 299ra122):1{9, 2015. [127] Thomas L Higgins, William T McGee, Jay S Steingrub, John Rapoport, Stanley Lemeshow, and Daniel Teres. Early indicators of prolonged intensive care unit stay: Impact of illness severity, physician stang, and pre{intensive care unit length of stay. Critical care medicine, 31(1):45{51, 2003. [128] Georey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527{1554, 2006. [129] Joyce C Ho, Joydeep Ghosh, and Jimeng Sun. Marble: High-throughput pheno- typing from electronic health records via sparse nonnegative tensor factorization. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 115{124. [130] Long V Ho, David Ledbetter, Melissa Aczon, and Randall Wetzel. The depen- dence of machine learning on electronic medical record quality. In AMIA Annual Symposium Proceedings, page 883, 2017. [131] Sepp Hochreiter and J urgen Schmidhuber. Long short-term memory. Neural com- putation, 9(8):1735{1780, 1997. [132] Nathan Hodas, Greg Ver Steeg, Joshua Harrison, Satish Chikkagoudar, Eric Bell, and Courtney Corley. Disentangling the lexicons of disaster response in twit- ter. In The 3rd International Workshop on Social Web for Disaster Management (SWDM'15), 2015. [133] Susan Dadakis Horn, Phoebe D Sharkey, June M Buckle, Joanne E Backofen, Richard F Averill, and Roger A Horn. The relationship between severity of ill- ness and hospital length of stay and mortality. Medical care, 29(4):305{317, 1991. [134] George Hripcsak, Jon D Duke, Nigam H Shah, Christian G Reich, Vojtech Huser, Martijn J Schuemie, Marc A Suchard, Rae Woong Park, Ian Chi Kei Wong, Pe- ter R Rijnbeek, et al. Observational health data sciences and informatics (OHDSI): 254 opportunities for observational researchers. Studies in health technology and infor- matics, 216:574, 2015. 
[135] Joy Hsu, Jennifer A Pacheco, Whitney W Stevens, Maureen E Smith, and Pe- dro C Avila. Accuracy of phenotyping chronic rhinosinusitis in the electronic health record. American journal of rhinology & allergy, 28(2):140{144, 2014. [136] Aapo Hyv arinen and Stephen M Smith. Pairwise likelihood ratios for estimation of non-gaussian structural equation models. Journal of Machine Learning Research, 14(Jan):111{152, 2013. [137] ImageNet. Imagenet large scale visual recognition challenge website, 2016. URL http://imagenet.org. Last accessed August 3, 2018. [138] Kenneth V Iserson and John C Moskop. Triage in medicine, part I: concept, history, and types. Annals of emergency medicine, 49(3):275{281, 2007. [139] Jagadeesh Jagarlamudi, Hal Daum e III, and Raghavendra Udupa. Incorporating lexical priors into topic models. In Proceedings of the 13th Conference of the Eu- ropean Chapter of the Association for Computational Linguistics, pages 204{213, 2012. [140] Yangqing Jia, Evan Shelhamer, Je Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Cae: Convolutional architecture for fast feature embedding. arXiv:1408.5093, 2014. [141] Alistair Johnson, Tom Pollard, and Roger Mark. Reproducibility in critical care: a mortality prediction case study. In Proceedings of the 2nd Machine Learning for Healthcare Conference, pages 361{376, 2017. [142] Alistair E W Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. MIMIC-III, a freely accessible critical care database. Scientic Data, 3:160035 EP {, 05 2016. [143] Kenneth Jung, Paea LePendu, and Nigam Shah. Automated detection of systematic o-label drug use in free text of electronic medical records. In AMIA Summits on Translational Science Proceedings, page 94, 2013. [144] David C Kale and Yan Liu. Accelerating active learning with transfer learning. In 2013 IEEE 13th International Conference on Data Mining, pages 1085{1090. IEEE, 2013. [145] David C Kale, Zhengping Che, Yan Liu, and Randall Wetzel. Computational dis- covery of physiomes in critically ill children using deep learning. In AMIA Data Mining and Medical Informatics Workshop, 2014. [146] David C Kale, Dian Gong, Zhengping Che, Yan Liu, Gerard Medioni, Randall Wetzel, and Patrick Ross. An examination of multivariate time series hashing with 255 applications to health care. In Data Mining (ICDM), 2014 IEEE international conference on, pages 260{269. IEEE, 2014. [147] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128{3137, June 2015. [148] Jared Katzman, Uri Shaham, Jonathan Bates, Alexander Cloninger, Tingting Jiang, and Yuval Kluger. Deep survival: A deep cox proportional hazards net- work. arXiv:1606.00931, 2016. [149] Jared L. Katzman, Uri Shaham, Alexander Cloninger, Jonathan Bates, Tingting Jiang, and Yuval Kluger. Deepsurv: personalized treatment recommender system using a cox proportional hazards deep neural network. BMC Medical Research Methodology, 18(1):24, Feb 2018. [150] Eamonn Keogh and Shruti Kasetty. On the need for time series data mining bench- marks: A survey and empirical demonstration. Data Min. Knowl. Discov., 7(4): 349{371, October 2003. [151] Eamonn Keogh and Chotirat Ann Ratanamahatana. Exact indexing of dynamic time warping. Knowledge and information systems, 7(3):358{386, 2005. 
[152] Shojania KG, Burton EC, McDonald KM, and Goldman L. Changes in rates of autopsy-detected diagnostic errors over time: A systematic review. JAMA, 289 (21):2849{2856, 2003. [153] Daniel Kifer, Shai Ben-David, and Johannes Gehrke. Detecting Change in Data Streams. In Proceedings of the 30th international Conference on Very Large Databases, pages 180{191, 2004. [154] Diederik P Kingma and Jimmy Lei Ba. Adam: Amethod for stochastic optimiza- tion. 2014. [155] Sue E Kirby, Sarah M Dennis, Upali W Jayasinghe, and Mark F Harris. Patient related factors in frequent readmissions: the in uence of condition, access to services and patient choice. BMC Health Services Research, 10(1):216, Jul 2010. [156] William A Knaus, Elizabeth A Draper, Douglas P Wagner, and Jack E Zimmerman. APACHE II: a severity of disease classication system. Critical care medicine, 13 (10):818{829, 1985. [157] Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009. [158] Alex Krizhevsky, Ilya Sutskever, and Georey E Hinton. Imagenet classication with deep convolutional neural networks. In Advances in neural information pro- cessing systems, pages 1097{1105, 2012. 256 [159] Brian Kulis and Kristen Grauman. Kernelized locality-sensitive hashing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(6):1092{1104, June 2012. [160] Thomas A Lasko, Joshua C Denny, and Mia A Levy. Computational phenotype dis- covery using unsupervised feature learning over noisy, sparse, and irregular clinical data. PLoS ONE, 8(6):e66341, 06 2013. [161] Archana Laxmisan, Forogh Hakimzada, Osman R Sayan, Robert A Green, Jiajie Zhang, and Vimla L Patel. The multitasking clinician: Decision-making and cog- nitive demand during and after team handos in emergency care. International Journal of Medical Informatics, 76(11{12):801 { 811, 2007. [162] Quoc V Le. Building high-level features using large scale unsupervised learning. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 8595{8598. IEEE, 2013. [163] Jean-Roger Le Gall, Philippe Loirat, Annick Alperovitch, Paul Glaser, Claude Granthil, Daniel Mathieu, Philippe Mercier, Remi Thomas, and Daniel Villers. A simplied acute physiology score for ICU patients. Critical care medicine, 12 (11):975{977, 1984. [164] Yann LeCun, L eon Bottou, Yoshua Bengio, and Patrick Haner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278{ 2324, 1998. [165] Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In Proceedings of the Eighteenth International Conference on Articial Intelligence and Statistics, 2015. [166] Joon Lee and David M Maslove. Customization of a severity of illness score using local electronic medical record data. Journal of intensive care medicine, 32(1): 38{47, 2017. [167] Moontae Lee and David Mimno. Low-dimensional embeddings for interpretable anchor-based topic inference. In Proceedings of Empirical Methods in Natural Lan- guage Processing, 2014. [168] Li-wei H Lehman, Mohammed Saeed, George B Moody, and Roger G Mark. Similarity-based searching in multi-parameter time series databases. Comput Car- diol, 35(4749126):653{656, 2008. [169] Paea LePendu, Srinivasan V Iyer, Anna Bauer-Mehren, Rave Harpaz, Jonathan M Mortensen, Tanya Podchiyska, Todd A Ferris, and Nigam H Shah. Pharmacovig- ilance using clinical notes. Clinical pharmacology & therapeutics, 93(6):547{555, 2013. 
[170] Sijin Li, Zhi-Qiang Liu, and Antoni B Chan. Heterogeneous multi-task learning for human pose estimation with deep convolutional neural network. In Proceedings of 257 the IEEE Conference on Computer Vision and Pattern Recognition, pages 482{489, 2014. [171] Jessica Lin, Eamonn Keogh, Li Wei, and Stefano Lonardi. Experiencing sax: a novel symbolic representation of time series. Data Mining and Knowledge Discovery, 15 (2):107{144, 2007. [172] Jessica Lin, Rohan Khade, and Yuan Li. Rotation-invariant similarity in time series using bag-of-patterns representation. J. Intell. Inf. Syst., 39(2):287{315, October 2012. [173] Jessica Lin, Sheri Williamson, Kirke D Borne, and David De Barr. Pattern recog- nition in time series. In Advances in Machine Learning and Data Mining for As- tronomy, Data Mining and Knowledge Discovery, chapter 28. CRC Press, 2012. [174] Donald A Lindbeg, L R Rowland, C R Buchs, W F Morse, and S S Morse. Consider: A computer program for medical instruction. In Proceedings of the Ninth IBM Medical Symposium, volume 69, page 54, 1968. [175] Zachary C Lipton. The mythos of model interpretability. ICML Workshop on Human Interpretability in Machine Learning, 2016. [176] Zachary C Lipton, Charles Elkan, and Balakrishnan Naryanaswamy. Optimal thresholding of classiers to maximize F1 measure. In Machine Learning and Knowledge Discovery in Databases. 2014. [177] Zachary C Lipton, John Berkowitz, and Charles Elkan. A critical review of recurrent neural networks for sequence learning. arXiv:1506.00019, 2015. [178] Zachary C Lipton, David C Kale, Charles Elkan, and Randall Wetzell. Learning to diagnose with LSTM recurrent neural networks. In Proceedings of the Fourth International Conference on Learning Representations, 2016. [179] Zachary C Lipton, David C Kale, and Randall Wetzel. Modeling missing data in clinical time series with rnns. In Proceedings of the 1st Machine Learning for Healthcare Conference, 2016. [180] I-Ting Liu and Bhiksha Ramakrishnan. Bach in 2014: Music composition with recurrent neural network. arXiv:1412.3191, 2014. [181] Marcus Liwicki, Alex Graves, Horst Bunke, and J urgen Schmidhuber. A novel ap- proach to on-line handwriting recognition based on bidirectional long short-term memory networks. In Proceedings of the Ninth International Conference on Docu- ment Analysis and Recognition, 2007. [182] Christopher A Longhurst, Robert A Harrington, and Nigam H Shah. A `green button' for using aggregate patient data at the point of care. Health Aairs, 33(7): 1229{1235, 2014. 258 [183] Yuan Luo, Yu Xin, Rohit Joshi, Leo Celi, and Peter Szolovits. Predicting icu mortality risk by grouping temporal trends from a multivariate panel of physio- logic measurements. In Proceedings of the Thirtieth AAAI Conference on Articial Intelligence, 2016. [184] Helmut L utkepohl. New Introduction to Multiple Time Series Analysis. Springer, 2005. [185] Yi Mao, Wenlin Chen, Yixin Chen, Chenyang Lu, Marin Kollef, and Thomas Bailey. An integrated data mining approach to real-time clinical monitoring and deterio- ration warning. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1140{1148. ACM, 2012. [186] Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of english: The penn treebank. Computational linguistics, 19(2):313{330, 1993. [187] Benjamin M Marlin, David C Kale, Robinder G Khemani, and Randall C Wetzel. 
Unsupervised pattern discovery in electronic health care data using probabilistic clustering models. In Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium, pages 389{398. ACM, 2012. [188] Rebecca A Marmor, Brian Clay, Marlene Millen, Thomas J Savides, and Christo- pher A Longhurst. The impact of physician EHR usage on patient satisfaction. Applied clinical informatics, 9(01):011{014, 2018. [189] Robert J Mason, V Courtney Broaddus, Thomas Martin, Talmadge E King, Dean Schraufnagel, John F Murray, and Jay A Nadel. Murray and Nadel's textbook of respiratory medicine: 2-volume set. 2010. [190] Vassar Matt and Holzmann Matthew. The retrospective chart review: important methodological considerations. Journal of educational evaluation for health profes- sions, 10, 2013. [191] Alf Meberg. Transient bradycardia in early infancy. Archives of Pediatrics and Adolescent Medicine, 148(11):1231{1232, 1994. [192] Merriam-Webster. Denition of \diagnosis", 2018. https://www. merriam-webster.com/dictionary/diagnosis. Last accessed August 3, 2018. [193] Ashley ND Meyer, Velma L Payne, Derek W Meeks, Radha Rao, and Hardeep Singh. Physicians' diagnostic accuracy, condence, and resource requests: A vi- gnette study. JAMA Internal Medicine, 173(21):1952{1958, 2013. [194] B Middleton, DF Sittig, and A Wright. Clinical decision support: a 25 year retro- spective and a 25 year vision. Yearbook of medical informatics, 25(S 01):S103{S116, 2016. 259 [195] Blackford Middleton, Michael A Shwe, David E Heckerman, Max Henrion, Eric J Horvitz, Harold P Lehmann, and Gregory F Cooper. Probabilistic diagnosis using a reformulation of the internist-1/qmr knowledge base part II. Methods of Information in Medicine, 30:256{267, August 1991. [196] T. Mikolov, A. Deoras, S. Kombrink, L. Burget, and J. Cernock y. Empirical eval- uation and combination of advanced language modeling techniques. In INTER- SPEECH, 2011. [197] Randolph A Miller. Computer-assisted diagnostic decision support: history, chal- lenges, and possible paths forward. Advances in Health Sciences Education, 14(1): 89{106, Sep 2009. [198] Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, volume 2, pages 1003{1011, 2009. [199] Bert A Mobley, Renee Leasure, and Lynda Davidson. Articial neural network predictions of lengths of stay on a post-coronary care unit. Heart & Lung: The Journal of Acute and Critical Care, 24(3):251{256, 1995. [200] Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Ravikumar, and Ambuj Tewari. Learning with noisy labels. In Advances in Neural Information Processing Systems, pages 1196{1204, 2013. [201] National High Blood Pressure Education Program Working Group on Children and Adolescents. The fourth report on the diagnosis, evaluation, and treatment of high blood pressure in children and adolescents. Pediatrics, 2004. [202] Shah ND, Steyerberg EW, and Kent DM. Big data and predictive analytics: Re- calibrating expectations. JAMA, 320(1):27{28, 2018. [203] Katherine M Newton, Peggy L Peissig, Abel Ngo Kho, Suzette J Bielinski, Richard L Berg, Vidhu Choudhary, Melissa Basford, Christopher G Chute, Iftikhar J Kullo, Rongling Li, Jennifer A Pacheco, Luke V Rasmussen, Leslie Span- gler, and Joshua C Denny. 
Validation of electronic medical record-based phenotyp- ing algorithms: results and lessons learned from the emerge network. Journal of the American Medical Informatics Association, 20(e1):e147{e154, 2013. [204] Che Ngufor, Sudhindra Upadhyaya, Dennis Murphree, Daryl Kor, and Jyotishman Pathak. Multi-task learning with selective cross-task transfer for predicting bleeding and other important patient outcomes. In International Conference on Data Science and Advanced Analytics, pages 1{8. IEEE, 2015. [205] Anh Nguyen, Jason Yosinski, and Je Clune. Deep neural networks are easily fooled: High condence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 427{436, 2015. 260 [206] Thang Nguyen, Yuening Hu, and Jordan L Boyd-Graber. Anchors regularized: Adding robustness and extensibility to scalable topic-modeling algorithms. In As- socation for Computational Linguistics, pages 359{369, 2014. [207] Nozomi Nori, Hisashi Kashima, Kazuto Yamashita, Hiroshi Ikai, and Yuichi Imanaka. Simultaneous modeling of multiple diseases for mortality prediction in acute hospital care. In Proceedings of the 21st ACM SIGKDD International Con- ference on Knowledge Discovery and Data Mining, pages 855{864. ACM, 2015. [208] Mohammad Norouzi, David Fleet, and Ruslan Salakhutdinov. Hamming distance metric learning. In Advances in Neural Information Processing Systems 25, pages 1070{1078. 2012. [209] Anika Oellrich, Nigel Collier, Tudor Groza, Dietrich Rebholz-Schuhmann, Nigam Shah, Olivier Bodenreider, Mary Regina Boland, Ivo Georgiev, Hongfang Liu, Kevin Livingston, Augustin Luna, Ann-Marie Mallon, Prashanti Manda, Peter N Robinson, Gabriella Rustici, Michelle Simon, Liqin Wang, Rainer Winnenburg, and Michel Dumontier. The digital revolution in phenotyping. Briengs in Bioinfor- matics, 17(5):819{830, 2016. [210] Institute of Medicine, National Academies of Sciences Engineering, and Medicine. Improving Diagnosis in Health Care. The National Academies Press, Washington, DC, 2015. [211] Cody S Olsen, Nathan Kuppermann, David M Jae, Kathleen Brown, Lynn Bab- cock, Prashant V Mahajan, Julie C Leonard, and The Pediatric Emergency Care Applied Research Network (PECARN) Cervical Spine Injury Study Group. Interob- server agreement in retrospective chart reviews for factors associated with cervical spine injuries in children. Academic Emergency Medicine, 22(4):487{491, 2015. [212] Kimberly J O'Malley, Karon F Cook, Matt D Price, Kimberly Raiford Wildes, John F Hurdle, and Carol M Ashton. Measuring diagnoses: ICD code accuracy. Health services research, 40(5p2):1620{1639, 2005. [213] Maxime Oquab, Leon Bottou, Ivan Laptev, and Josef Sivic. Learning and trans- ferring mid-level image representations using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014. [214] World Health Organization. International statistical classication of diseases and related health problems, 2004. URL https://www.cdc.gov/nchs/icd/icd9cm. htm. Last accessed August 3, 2018. [215] Turner M Osler, Frederick B Rogers, Laurent G Glance, Myra Cohen, Robert Rut- ledge, and Steven R Shackford. Predicting survival, length of stay, and cost in the surgical intensive care unit: APACHE II versus ICISS. Journal of Trauma and Acute Care Surgery, 45(2):234{238, 1998. 261 [216] Shahla Parveen and P Green. Speech recognition with missing data using recurrent neural nets. In Advances in Neural Information Processing Systems 13, 2001. 
[217] Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. How to construct deep recurrent neural networks. In Proceedings of the Second International Conference on Learning Representations, 2014. [218] R Paterson, DC MacLeod, D Thetford, A Beattie, C Graham, S Lam, and D Bell. Prediction of in-hospital mortality and length of stay using an early warning scoring system: clinical audit. Clinical Medicine, 6(3):281{284, 2006. [219] Stephen G Pauker, G Anthony Gorry, Jerome P Kassirer, and William B. Schwartz. Towards the Simulation of Clinical Cognition: Taking a Present Illness by Com- puter, pages 108{138. Springer New York, New York, NY, 1985. [220] Chris Paxton, Alexandru Niculescu-Mizil, and Suchi Saria. Developing predictive models using electronic medical records: challenges and pitfalls. In AMIA Annual Symposium Proceedings, page 1109, 2013. [221] Judea Pearl. Causality: models, reasoning and inference. Cambridge Univ Press, 2009. [222] F Pedregosa, G Varoquaux, A Gramfort, V Michel, B Thirion, O Grisel, M Blon- del, P Prettenhofer, R Weiss, V Dubourg, J Vanderplas, A Passos, D Courna- peau, M Brucher, M Perrot, and E Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825{2830, 2011. URL http://scikit-learn.org/. [223] Adam Perer and Jimeng Sun. Matrix ow: Temporal network visual analytics to track symptom evolution during disease progression. AMIA Annual Symposium Proceedings, pages 716{25, 2012. [224] Vu Pham, Th eodore Bluche, Christopher Kermorvant, and J er^ ome Louradour. Dropout improves recurrent neural networks for handwriting recognition. In 14th International Conference on Frontiers in Handwriting Recognition (ICFHR), pages 285{290. IEEE, 2014. [225] Therese D Pigott. A review of methods for missing data. Educational research and evaluation, 2001. [226] Walter E Pofahl, Steven M Walczak, Ethan Rhone, and Seth D Izenberg. Use of an articial neural network to predict length of stay in acute pancreatitis. The American Surgeon, 64(9):868, 1998. [227] Murray M Pollack, Urs E Ruttimann, and Pamela R Getson. Pediatric risk of mortality (PRISM) score. Critical care medicine, 16(11):1110{1116, 1988. [228] Murray M Pollack, Kantilal M Patel, and Urs E Ruttimann. PRISM III: an updated pediatric risk of mortality score. Critical Care Medicine, 24(5):743{752, 1996. 262 [229] Gianluca Pollastri, Darisz Przybylski, Burkhard Rost, and Pierre Baldi. Improv- ing the prediction of protein secondary structure in three and eight classes using recurrent neural networks and proles. Proteins: Structure, Function, and Bioin- formatics, 47(2):228{235, 2002. [230] Harry E Pople. Heuristic methods for imposing structure on ill-structured problems: The structuring of medical diagnostics. Articial intelligence in medicine, 51:119{ 190, 1982. [231] Malcolm Pradhan, Gregory Provan, Blackford Middleton, and Max Henrion. Knowledge engineering for large belief networks. In Proceedings of the Tenth Inter- national Conference on Uncertainty in Articial Intelligence, pages 484{490, 1994. [232] David R Prytherch, Gary B Smith, Paul E Schmidt, and Peter I Featherstone. ViEWS: towards a national early warning score for detecting adult inpatient dete- rioration. Resuscitation, 81(8):932{937, 2010. [233] Sanjay Purushotham, Wilka Carvalho, Tanachat Nilanon, and Yan Liu. Vartional recurrent adversarial deep domain adaptation. In Proceedings of the Fifth Interna- tional Conference on Learning Representations, 2017. 
[234] Sanjay Purushotham, Chuizheng Meng, Zhengping Che, and Yan Liu. Bench- marking deep learning models on large healthcare datasets. Journal of Biomedical Informatics, 83:112 { 134, 2018. [235] John Quinn, Christopher KI Williams, Neil McIntosh, et al. Factorial switch- ing linear dynamical systems applied to physiological condition monitoring. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009. [236] Parvez Ra, Arash Pakbin, and Shiva Kumar Pentyala. Interpretable deep learn- ing framework for predicting all-cause 30-day ICU readmissions. Technical report, Texas A&M University, 2018. [237] Maxim Raginsky and Svetlana Lazebnik. Locality-sensitive binary codes from shift- invariant kernels. In Advances in Neural Information Processing Systems 22, pages 1509{1517. 2009. [238] Alvin Rajkomar, Eyal Oren, Kai Chen, Andrew M Dai, Nissan Hajaj, Michaela Hardt, Peter J Liu, Xiaobing Liu, Jake Marcus, Mimi Sun, et al. Scalable and accurate deep learning with electronic health records. npj Digital Medicine, 1(18), May 2018. [239] Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis Langlotz, Katie Shpanskaya, Matthew P Lungren, and Andrew Y Ng. CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning. arXiv:1711.05225, 2017. [240] Thanawin Rakthanmanon, Bilson Campana, Abdullah Mueen, Gustavo Batista, Brandon Westover, Qiang Zhu, Jesin Zakaria, and Eamonn Keogh. Searching and 263 mining trillions of time series subsequences under dynamic time warping. In Proceed- ings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 262{270, 2012. [241] Bharath Ramsundar, Steven Kearnes, Patrick Riley, Dale Webster, David Kon- erding, and Vijay Pande. Massively multitask networks for drug discovery. arXiv:1502.02072, 2015. [242] Rajesh Ranganath, Adler Perotte, No emie Elhadad, and David Blei. Deep survival analysis. In Proceedings of the 1st Machine Learning for Healthcare Conference, 2016. [243] Narges Razavian, Jake Marcus, and David Sontag. Multi-task prediction of disease onsets from longitudinal lab tests. In Proceedings of the 1st Machine Learning for Healthcare Conference, 2016. [244] Scott Reed, Kihyuk Sohn, Yuting Zhang, and Honglak Lee. Learning to disen- tangle factors of variation with manifold interaction. In Proceedings of the 31st International Conference on Machine Learning, 2014. [245] Scott Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Er- han, and Andrew Rabinovich. Training deep neural networks on noisy labels with bootstrapping. Proceedings of the Third International Conference on Learning Rep- resentations Workshop Track, 2015. [246] Rachel L Richesson, Jimeng Sun, Jyotishman Pathak, Abel N Kho, and Joshua C Denny. Clinical phenotyping in selected national networks: demonstrating the need for high-throughput, portable, and computational methods. Articial Intelligence in Medicine, 71:57 { 61, 2016. [247] Anand I Rughani, Travis M Dumont, Zhenyu Lu, Josh Bongard, Michael A Horgan, Paul L Penar, and Bruce I Tranmer. Use of an articial neural network to predict head injury outcome: clinical article. Journal of Neurosurgery, 113(3):585{590, 2010. [248] David E Rumelhart, Georey E Hinton, and Ronald J Williams. Learning internal representations by error propagation. Technical report, DTIC Document, 1985. [249] Avishek Saha, Piyush Rai, Hal Daum e, Suresh Venkatasubramanian, and Scott L DuVall. Active Supervised Domain Adaptation. 
In Proceedings of the 2011 Eu- ropean Conference on Machine Learning and Knowledge Discovery in Databases, pages 97{112, 2011. [250] Hiroaki Sakoe and Seibi Chiba. Readings in speech recognition. chapter Dynamic Programming Algorithm Optimization for Spoken Word Recognition, pages 159{ 165. 1990. [251] Ruslan Salakhutdinov and Georey Hinton. Semantic hashing. Int. J. Approx. Reasoning, 50(7):969{978, July 2009. 264 [252] Stan Salvador and Philip Chan. Toward accurate dynamic time warping in linear time and space. Intell. Data Anal., 11(5):561{580, October 2007. [253] Suchi Saria and Anna Goldenberg. Subtyping: What it is and its role in precision medicine. IEEE Intelligent Systems, 30(4):70{75, 2015. [254] Suchi Saria, Daphne Koller, and Anna Penn. Learning individual and population level traits from clinical temporal data. In Advances in Neural Information Pro- cessing Systems 22, 2010. [255] Suchi Saria, Anand K Rajani, Jerey Gould, Daphne Koller, and Anna A Penn. Integration of early physiological responses predicts later illness severity in preterm infants. Science translational medicine, 2(48):48ra65{48ra65, 2010. [256] Florian Schro, Dmitry Kalenichenko, and James Philbin. Facenet: A unied em- bedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815{823, 2015. [257] Peter Schulam, Fredrick Wigley, and Suchi Saria. Clustering longitudinal clinical marker trajectories from electronic health data: Applications to phenotyping and endotype discovery. 2015. [258] Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 1997. [259] Burr Settles. Active learning. Synthesis Lectures on Articial Intelligence and Machine Learning, 6(1):1{114, 2012. [260] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, New York, NY, USA, 2004. [261] Xiaoxiao Shi, Wei Fan, and Jiangtao Ren. Actively Transfer Domain Knowledge. In Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases, pages 342{357, 2008. [262] Benjamin Shickel, Patrick James Tighe, Azra Bihorac, and Parisa Rashidi. Deep EHR: A survey of recent advances in deep learning techniques for electronic health record (EHR) analysis. IEEE Journal of Biomedical and Health Informatics, 22(5): 1589{1604, Sept 2018. [263] Jin Shieh and Eamonn Keogh. iSAX: Indexing and mining terabyte sized time series. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 623{631, 2008. [264] Shohei Shimizu, Patrik O Hoyer, Aapo Hyv arinen, and Antti Kerminen. A lin- ear non-gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 2006. 265 [265] Shohei Shimizu, Takanori Inazumi, Yasuhiro Sogawa, Aapo Hyv arinen, Yoshinobu Kawahara, Takashi Washio, Patrik O Hoyer, and Kenneth Bollen. DirectLiNGAM: A direct method for learning a linear non-gaussian structural equation model. Jour- nal of Machine Learning Research, 2011. [266] Hiroshi Shimodaira, Kenichi Noma, Mitsuru Nakai, and Shigeki Sagayama. Dy- namic time-alignment kernel in support vector machine. In Advances in Neural Information Processing Systems 14, pages 921{928. 2002. [267] H Shin, H R Roth, M Gao, L Lu, Z Xu, I Nogues, J Yao, D Mollura, and R M Summers. Deep convolutional neural networks for computer-aided detection: Cnn architectures, dataset characteristics and transfer learning. 
IEEE Transactions on Medical Imaging, 35(5):1285{1298, May 2016. [268] Kilho Shin and Tetsuji Kuboyama. A generalization of haussler's convolution kernel | mapping kernel and its application to tree kernels. Journal of Computer Science and Technology, 25(5):1040{1054, 2010. [269] Edward Shortlie. Computer-based medical consultations: MYCIN, volume 2. El- sevier, 2012. [270] Michael A Shwe, Blackford Middleton, David E Heckerman, Max Henrion, Eric J Horvitz, Harold P Lehmann, and Gregory F Cooper. Probabilistic diagnosis using a reformulation of the internist-1/qmr knowledge base. Methods of information in Medicine, 30(4):241{255, 1991. [271] Rosaria Silipo and Carlo Marchesi. Articial neural networks for automatic ecg analysis. IEEE Transactions on Signal Processing, 1998. [272] Joseph Sill and Yaser S Abu-Mostafa. Monotonicity hints. In Advances in Neural Information Processing Systems. MIT Press, 1997. [273] I Silva, G Moody, D J Scott, L A Celi, and R G Mark. Predicting in-hospital mortality of icu patients: The physionet/computing in cardiology challenge 2012. In 2012 Computing in Cardiology, pages 245{248, Sept 2012. [274] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershel- vam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484{489, 2016. [275] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems 27, pages 568{576, 2014. [276] Hardeep Singh, Traber Davis Giardina, Ashley N D Meyer, Samuel N Forjuoh, Michael D Reis, and Eric J Thomas. Types and origins of diagnostic errors in primary care settings. JAMA Internal Medicine, 173(6):418{425, 2013. 266 [277] Hardeep Singh, Ashley N D Meyer, and Eric J Thomas. The frequency of diag- nostic errors in outpatient care: estimations from three large observational studies involving us adult populations. BMJ Quality & Safety, 2014. [278] Noam Slonim, Nir Friedman, and Naftali Tishby. Multivariate information bottle- neck. Neural Computation, 18(8):1739{1789, 2006. [279] Larry Smith, Lorraine K. Tanabe, Rie Johnson nee Ando, Cheng-Ju Kuo, I-Fang Chung, Chun-Nan Hsu, Yu-Shi Lin, Roman Klinger, Christoph M. Friedrich, Kuz- man Ganchev, Manabu Torii, Hongfang Liu, Barry Haddow, Craig A. Struble, Richard J. Povinelli, Andreas Vlachos, William A. Baumgartner, Lawrence Hunter, Bob Carpenter, Richard Tzong-Han Tsai, Hong-Jie Dai, Feng Liu, Yifei Chen, Chengjie Sun, Sophia Katrenko, Pieter Adriaans, Christian Blaschke, Rafael Tor- res, Mariana Neves, Preslav Nakov, Anna Divoli, Manuel Ma~ na-L opez, Jacinto Mata, and W John Wilbur. Overview of biocreative II gene mention recognition. Genome Biology, 9(2):S2, Sep 2008. [280] Huan Song, Deepta Rajan, Jayaraman J Thiagarajan, and Andreas Spanias. Attend and diagnose: Clinical time series analysis using attention models. In Proceedings of the Thirty-Second AAAI Conference on Articial Intelligence, 2018. [281] David J Spiegelhalter, Rodney C G Franklin, and Kate Bull. Assessment, criti- cism and improvement of imprecise subjective probabilities for a medical expert system. In Proceedings of the Fifth Annual Conference on Uncertainty in Articial Intelligence, pages 285{294, 1990. [282] Nitish Srivastava, Georey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overtting. 
Abstract
With the widespread adoption of electronic health records (EHRs), US hospitals now digitally record millions of patient encounters each year. At the same time, we have seen high-profile successes of machine learning, including superhuman performance in complex games. These factors have driven speculation that similar breakthroughs in healthcare are just around the corner, but there are major obstacles to replicating these successes. In this thesis, we study these challenges in the context of learning to diagnose: building software that recognizes diseases by analyzing historical data rather than encoding expert knowledge. Our central hypothesis is that we can build such systems while minimizing the burden on clinical experts. We demonstrate one of the first successful applications of recurrent neural networks to the classification of multivariate clinical time series. We then extend this framework to model non-random missing values and heterogeneous prediction tasks. We next examine the scenario in which labeled data are scarce, proposing practical solutions based on active learning and weak supervision. Finally, we describe a public benchmark for clinical prediction and multitask learning that promotes reproducibility and lowers the barrier to entry for new researchers. We conclude by considering the broader impact of information technology on healthcare and how machine learning can help fulfill the vision of a learning healthcare system.
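To make the modeling recipe summarized above concrete, the sketch below shows a recurrent (LSTM) classifier applied to multivariate clinical time series, with missing measurements zero-imputed and a binary observation mask concatenated to each time step so the network can exploit informative missingness. This is a minimal illustrative sketch assuming a PyTorch environment; the layer sizes, the 48-step episode length, and the imputation-plus-indicator scheme are assumptions for exposition, not the exact architecture evaluated in the thesis.

import torch
import torch.nn as nn

class ClinicalLSTM(nn.Module):
    """LSTM classifier for multivariate clinical time series.

    Missing measurements are zero-imputed and a binary observation mask is
    concatenated to each time step, letting the model learn from the pattern
    of missingness itself (an illustrative choice, not the thesis's exact design).
    """

    def __init__(self, n_features, n_labels, hidden_size=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=2 * n_features,
                            hidden_size=hidden_size,
                            batch_first=True)
        self.out = nn.Linear(hidden_size, n_labels)

    def forward(self, x, mask):
        # x, mask: (batch, time, n_features); mask is 1 where a value was observed.
        inputs = torch.cat([x * mask, mask], dim=-1)
        _, (h_n, _) = self.lstm(inputs)
        return self.out(h_n[-1])  # one logit per label (multilabel diagnosis)

# Toy usage: 8 episodes, 48 hourly steps, 20 vitals/labs, 10 candidate diagnoses.
x = torch.randn(8, 48, 20)
mask = (torch.rand(8, 48, 20) > 0.5).float()
y = torch.randint(0, 2, (8, 10)).float()
model = ClinicalLSTM(n_features=20, n_labels=10)
loss = nn.BCEWithLogitsLoss()(model(x, mask), y)
loss.backward()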
Asset Metadata
Creator: Kale, David Charles (author)
Core Title: Learning to diagnose from electronic health records data
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Publication Date: 10/11/2018
Defense Date: 08/14/2018
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: active learning, benchmark, deep learning, Diagnosis, digital health, EHR, electronic health records, electronic medical records, EMR, health care, healthcare, machine learning, missing values, recurrent neural network, multilabel classification, neural network, OAI-PMH Harvest, time series, transfer learning
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Ver Steeg, Greg (committee chair), Galstyan, Aram (committee member), Raghavendra, Cauligi (committee member), Sukhatme, Gaurav (committee member)
Creator Email: davekale@gmail.com, dkale@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c89-87112
Unique identifier: UC11675165
Identifier: etd-KaleDavidC-6814.pdf (filename), usctheses-c89-87112 (legacy record id)
Legacy Identifier: etd-KaleDavidC-6814.pdf
Dmrecord: 87112
Document Type: Dissertation
Format: application/pdf (imt)
Rights: Kale, David Charles
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA