DEEP LEARNING MODELS FOR TEMPORAL DATA IN HEALTH CARE

by

Zhengping Che

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

December 2018

Copyright 2018 Zhengping Che

Acknowledgments

During my five-year Ph.D. study at the University of Southern California, I have felt deeply fortunate and grateful to be surrounded by many people who have helped and supported me and my research.

First and foremost, I would like to thank my advisor, Prof. Yan Liu, for her great guidance and support throughout my Ph.D. studies. She always gave me timely encouragement and advice and has set an excellent example as a researcher with strong passion and keen insight. I would also like to thank the other members of my dissertation committee, Prof. Kevin Knight and Prof. Shinyi Wu, and the other members of my qualifying exam committee, Prof. Ram Nevatia and Prof. Minlan Yu, for their constructive suggestions and valuable help with my research and thesis.

I am very thankful to have had great research mentors: Yu Cheng and Zhaonan Sun during and after my internship at IBM Research, and Hongfang Liu and Jennifer St. Sauver during and after my visit to Mayo Clinic. I would also like to thank my collaborators, without whom it would have been impossible to complete this work: Mohammad Taha Bahadori, Kyunghyun Cho, Bo Jiang, David C. Kale, Robinder Khemani, Guangyu Li, Wenzhe Li, Sanjay Purushotham, David Sontag, and Shuangfei Zhai. I am also thankful to other members and former members of the USC Melady Lab for our shared thoughts and memories: Wilka Carvalho, Dehua Cheng, Marjan Ghazvininejad, Umang Gupta, Michael Hankin, Xinran He, He Jiang, Nitin Kamra, Yaguang Li, Hanpeng Liu, Chuizheng Meng, Danial Moyer, Tanachat Nilanon, Natali Ruchansky, Sungyong Seo, Michael Tsang, and Rose Yu.
Last but not least, I would like to thank my family for their constant love, support, and encouragement.

Contents

Acknowledgments
Contents
List of Tables
List of Figures
Abstract

1 Introduction
  1.1 Summary of Thesis Work
  1.2 Thesis Statement
  1.3 Thesis Outline
  1.4 Related Publications

2 Preliminary
  2.1 Notations
  2.2 Basic Temporal Data Analytic Models
  2.3 Basic Deep Learning Models for Temporal Data

3 Temporal Datasets in Health Care
  3.1 Medical Time Series Data
  3.2 Patient History Data

4 Utilizing Data Heterogeneity
  4.1 Exploiting Missingness in Multivariate Time Series
    4.1.1 Methodology
      4.1.1.1 Notations for Time Series with Missing Values
      4.1.1.2 GRU-RNN for Time Series with Missing Values
      4.1.1.3 GRU-D: Model with Trainable Decays
      4.1.1.4 Baseline Imputation Methods
      4.1.1.5 Baseline Prediction Methods
    4.1.2 Experiments
      4.1.2.1 Dataset and Task Descriptions
      4.1.2.2 Implementation Details
      4.1.2.3 Evaluations on Synthetic Datasets
      4.1.2.4 Mortality Prediction Evaluations
      4.1.2.5 Multi-Task Prediction Evaluations
    4.1.3 Discussions
      4.1.3.1 Investigations of Relation between Missingness and Labels
      4.1.3.2 Validating and Interpreting the Learned Decays
      4.1.3.3 Early Prediction Capacity
      4.1.3.4 Model Scalability with Growing Data Size
      4.1.3.5 Comparison to Existing Studies on Mortality Prediction
      4.1.3.6 Limitations
  4.2 Modeling Multi-Rate Multivariate Time Series
    4.2.1 Related Work
    4.2.2 Methodology
      4.2.2.1 Notations for MRMTS and MR-HDMM
      4.2.2.2 Sketchy Illustrations of MR-HDMM
      4.2.2.3 Generation Model
      4.2.2.4 Inference Network
      4.2.2.5 Derivations for Model Parameterization
      4.2.2.6 Learning the Parameters
    4.2.3 Experiments
      4.2.3.1 Experiment Design
      4.2.3.2 Baseline Descriptions
      4.2.3.3 Evaluation and Implementation Details
      4.2.3.4 Quantitative Results
      4.2.3.5 Discussions on Learnt Switches

5 Handling Data Scarcity
  5.1 Boosting Performance with a Small Set of Labeled Data
    5.1.1 Related Work
    5.1.2 Methodology
      5.1.2.1 Basic Deep Prediction Model
      5.1.2.2 ehrGAN: Modified GAN Model for EHR Data
      5.1.2.3 Semi-Supervised Learning with GANs
    5.1.3 Experiments
      5.1.3.1 Problem Settings
      5.1.3.2 Risk Prediction Comparison on Basic Models
      5.1.3.3 Analysis of the Generated Data
      5.1.3.4 Evaluation of the Boosted Model
      5.1.3.5 Selections of Parameters
  5.2 Incorporating Prior-Knowledge and Incremental Training
    5.2.1 Related Work
    5.2.2 Methodology
      5.2.2.1 Prior-Based Regularization
      5.2.2.2 Incremental Training
    5.2.3 Experiments
      5.2.3.1 Benefits of Prior-Based Regularization
      5.2.3.2 Efficacy of Incremental Training
      5.2.3.3 Qualitative Analysis of Features

6 Improving Model Interpretability and Usability
  6.1 Interpretable Mimic Learning Framework
    6.1.1 Methodology
      6.1.1.1 Knowledge Distillation
      6.1.1.2 Gradient Boosting Trees
      6.1.1.3 Interpretable Mimic Learning Framework
    6.1.2 Experiments
      6.1.2.1 Experimental Design
      6.1.2.2 Methods and Implementation Details
      6.1.2.3 Overall Classification Performance
    6.1.3 Interpretations
      6.1.3.1 Feature Influence
      6.1.3.2 Partial Dependence Plots
      6.1.3.3 Top Decision Rules
  6.2 Showcase on Deep Learning for Opioid Usage Study
    6.2.1 Study Design
      6.2.1.1 Cohort Selection
      6.2.1.2 Group Identification
    6.2.2 Methodology
      6.2.2.1 Feature Extraction
      6.2.2.2 Temporal Data Processing
      6.2.2.3 Implementation and Training Details of Deep Learning Models
      6.2.2.4 Comparison to Other Machine Learning Baselines
      6.2.2.5 Investigating Important Features
    6.2.3 Experiments
      6.2.3.1 Classification Result Comparison
      6.2.3.2 Feature Analysis

7 Conclusion and Future Work
  7.1 Contributions and Limitations
  7.2 Future Work

Index
Reference List

List of Tables

4.1 Size comparison of GRU models used in the GRU-D experiments.
4.2 Model performances measured by AUROC score (mean ± std) for the mortality prediction task on multivariate time series with missing values.
4.3 Model performances measured by average AUROC score (mean ± std) for multi-task predictions on real datasets with missing values.
4.4 Comparison of structured inference networks used in MR-HDMM.
4.5 Descriptions of the MIMIC-III dataset used for MR-HDMM evaluations.
4.6 Forecasting results (MSE) on the MIMIC-III dataset with MRMTS data.
4.7 Interpolation results (MSE) on the MIMIC-III dataset with MRMTS data.
4.8 Lower bound of log-likelihood of generative models on the MIMIC-III dataset with MRMTS.
5.1 Prediction performance comparison of the basic predictive models on the Claim dataset.
5.2 Top 10 most frequent ICD-9 diagnosis codes of the heart failure cohort group in the generated data.
5.3 Top 10 most frequent ICD-9 diagnosis codes of the diabetes cohort group in the generated data.
5.4 Performance comparison of different CNN and SSL prediction models on four sub-datasets from the Claim dataset.
5.5 AUROC scores with different values of ρ in the SSL-GAN method.
5.6 AUROC scores for classification on the PICU dataset.
5.7 AUROC scores for incremental training on the PICU dataset.
6.1 Interpretable mimic learning classification results (mean ± 95% confidence interval) for two tasks on the Vent dataset.
6.2 Top features and their corresponding importance scores from the GBT and GBTmimic models on the Vent dataset.
6.3 Data characteristics of different patient groups identified from the REP dataset.
6.4 Record table descriptions and statistics of the selected data from the REP dataset.
6.5 Model size comparison in terms of the saved binary files on disk for the opioid study.
6.6 Long-term opioid patient prediction (ST-LT) results (mean ± 95% confidence interval) on the REP dataset.
6.7 Opioid-dependent patient prediction (LT-OD) results (mean ± 95% confidence interval) on the REP dataset.
6.8 Most important features for long-term opioid patients (ST-LT, left) and opioid-dependent patients (LT-OD, right) identified from the DNN-3hl model for the opioid study.

List of Figures

1.1 Summary of thesis work.
1.2 Examples of missing data and irregularly sampled multivariate time series (left; orange parts indicate missing observations, and pH values are irregularly sampled) and multi-rate multivariate time series (right) in the MIMIC-III dataset.
4.1 Demonstration of informative missingness on the MIMIC-III dataset: the absolute values of Pearson correlations between variable missing rates (bottom) and ICD-9 diagnosis categories (top) and mortality (middle).
4.2 An example of measurement vectors x_t, time stamps s_t, masking m_t, and time intervals δ_t.
4.3 Graphical illustrations of the original GRU (top-left), GRU-D (bottom-left), and the proposed network architecture (right).
4.4 Classification performance on the Gesture synthetic datasets with different correlation values and random missing values.
4.5 Demonstrations of informative missingness on the PhysioNet dataset: the absolute values of Pearson correlations between variable missing rates (bottom) and 4 labels.
4.6 Plots of input decay γ_x for all variables (top) and histograms of hidden state decay weights W_γh for 10 variables (bottom) in the GRU-D model for the mortality prediction task on the PhysioNet dataset.
4.7 Early prediction capacity (left) and model scalability comparisons (right) of GRU-D and other RNN baselines on the MIMIC-III dataset.
4.8 Generation model (left) and structured inference network (right, with the filtering setting) of our proposed MR-HDMM for MRMTS.
4.9 The switch mechanism (left) for updating the latent states z_t^l in MR-HDMM and illustrations of the switch-on (middle, s_t^l = 1) and switch-off (right, s_t^l = 0) conditions.
4.10 Generation model of MR-DMM (left) and that of HDMM (right), simplified from the proposed MR-HDMM.
4.11 Interpretable latent structure learned by the MR-HDMM model in the first 48 hours of an admission from the MIMIC-III dataset by switch states.
5.1 Illustration of the structure of the proposed ehrGAN model (left) and its generator (right).
5.2 The length distributions of original and generated datasets from the Claim dataset in the heart failure cohort (left) and diabetes cohort (right).
5.3 The frequency of the top 100 features in original and generated datasets from the Claim dataset in the heart failure cohort (left) and diabetes cohort (right).
5.4 The co-occurrence frequency of the top 20 diagnosis features from the Claim dataset in the heart failure cohort (left two) and diabetes cohort (right two).
5.5 AUROC scores with different values of μ in SSL-GAN for heart failure (left) and diabetes (right).
5.6 An illustration of the deep network (left) with the regularization on categorical structures (middle) applied to the output layer of the network.
5.7 The co-occurrence matrix of the PICU dataset.
5.8 Illustrations of how adding various units by incremental training changes the weights W.
5.9 Weight distributions for three layers of a neural network after pretraining (left three) and finetuning (right three) the model on the PICU dataset.
5.10 Example priors for the PICU dataset (leftmost: ICD-9 tree prior; middle-left: ICD-9 shared category prior; middle-right: co-occurrence prior) and the PhysioNet dataset (rightmost: co-occurrence prior).
5.11 Classification performance comparison for the prior-based regularizer on the PhysioNet dataset.
5.12 Training time of different neural networks under full/incremental training strategies on the PICU dataset.
5.13 Example features learned from the PICU dataset (top two rows) for the ICD-9 circulatory disease category (ICD-9 codes 390–459) and (bottom two rows) for conditions related to septic shock (ICD-9 codes 990–995).
6.1 Illustration of mimic method training Pipeline 1.
6.2 Illustration of mimic method training Pipeline 2.
6.3 Individual (left y-axis) and cumulative (right y-axis) feature importance for MOR (top) and VFD (bottom) tasks on the Vent dataset.
6.4 Feature importance for static features and temporal features on each day for two tasks on the Vent dataset.
6.5 One-way partial dependence plots of the top features from GBTmimic for MOR (top) and VFD (bottom) tasks on the Vent dataset.
6.6 Pairwise partial dependence plots of the top features from GBTmimic for MOR (top) and VFD (bottom) tasks on the Vent dataset.
6.7 Sample decision trees from the best GBTmimic models for MOR (top) and VFD (bottom) tasks on the Vent dataset.
6.8 Illustrations of the proposed pipelines from raw cohort data to final prediction for the opioid study, with a DNN prediction model for data with temporal sum-pooling (left) and an RNN prediction model for data with temporal segmentation (right).

Abstract

The worldwide push for electronic health records has resulted in an exponential surge in the volume, detail, and availability of digital health data, which offers an unprecedented opportunity to solve many difficult and important problems in health care. Clinicians are collaborating with computer scientists to use this opportunity to improve the state of data-driven and personalized health care services. Meanwhile, the recent success and development of deep learning is revolutionizing many domains and provides promising solutions to the problems of prediction and feature discovery on health care data, bringing us closer than ever to the goals of improving health quality, reducing cost, and, most importantly, saving lives. However, the recent rise of this research field, with more data and new applications, has also introduced several challenges which have not been answered well.

My thesis work focuses on providing deep learning-based solutions to three major challenges in health care tasks on temporal data: data heterogeneity, data scarcity, and model interpretability and usability for health care applications in practice.

In terms of utilizing data heterogeneity, we develop recurrent neural network models which exploit missingness in time series to improve prediction performance.
We then introduce hierarchical deep generative models for multi-rate multivariate time series which capture multi-scale temporal dependencies and the complex underlying generation mechanisms of temporal data in health care.

To handle data scarcity, we introduce a semi-supervised learning framework with modified generative adversarial networks to boost prediction performance with a limited amount of labeled data. We also incorporate prior knowledge and incremental learning to train deep neural networks more efficiently.

To improve the interpretability and usability of deep learning models for clinicians and doctors in practical applications, we propose an interpretable mimic learning framework to obtain models with both strong performance and good interpretability, and describe our deep learning solutions to an opioid usage and addiction study on a large-scale dataset, demonstrating how deep learning can be applied to important and urgent health care tasks in the real world.

All of the proposed methods are evaluated on real-world health care datasets from different applications. Important findings are also examined and validated by our collaborators from hospitals.

Chapter 1

Introduction

The worldwide push [4, 221] for electronic health records (EHRs) has resulted in an exponential surge in the volume, detail, and availability of digital health data. A variety of valuable health data from different sources, along with the need for better health care quality, is beginning to overwhelm health care providers and researchers. The increasing deployment of EHR systems in modern hospitals and the popularity and impact of personal smart devices and medical apps have led to the collection and aggregation of massive unstructured health care data, which offers an unprecedented opportunity to solve difficult and important problems and enhance each stage of the health care chain.
This field of research and applications, commonly referred to as data-driven health care (DDH) [144], is under rapid development and is attracting many researchers and institutions. Researchers have attempted to apply state-of-the-art machine learning and statistical models to a broad set of clinical tasks which are difficult or even impossible to solve with traditional methods and/or without large-scale data [186, 242]. Deep learning is one of the most effective choices among all proposed methods, mainly because of its capacity for feature learning and other downstream tasks. By learning multi-level representations in a hierarchical way, it has achieved significant successes in many areas with supervised [80], unsupervised [66], or semi-supervised [116] strategies, especially since the first breakthrough in training deep learning models more than a decade ago [94]. Nowadays, deep learning models can recognize and distinguish thousands of human faces at a time [172, 218], understand, translate, and generate human languages almost flawlessly [15, 232], and master games and beat top human professional players [208]. An even more fascinating aspect of deep learning is that new ideas and models are proposed every year, continually pushing the state of the art. Researchers have used generative adversarial learning to introduce a discriminator and a generator for learning the real data distribution [83], examined the possible use of external memory coupled with standard neural networks in neural Turing machines [85], and applied deep reinforcement learning to learn by trial and error from the interaction between an agent and an environment [157]. In addition, plenty of flexible and fast-developing open-source software libraries, such as TensorFlow [1], PyTorch [175], Theano [5], Caffe [104], MXNet [37], and Keras [44], make it easy for everyone to experience the strength of deep learning in both research and production, leading to broader impact and further success of deep learning.

Beyond revolutionizing traditional domains and applications, the recent success and development of deep learning is also demonstrating great potential in health care. A series of excellent work has been conducted in search of novel deep learning solutions to different health care applications, which has been reviewed and summarized from different perspectives [105, 154, 206, 245]. Unsupervised deep learning methods are often used for learning computational phenotypes and concept embeddings from various EHR data [128, 149]. Disease classification, risk prediction, and outcome prediction are other major health care analytic tasks [38, 153], which are usually handled by supervised deep learning models on disease-specific datasets [20, 36] or general EHR datasets [106, 141]. Following its success in computer vision and natural language processing, deep learning is also widely applied in medical image analysis, including classification, feature learning, and segmentation for 2D/3D magnetic resonance imaging scans and other types of images [11, 67, 152, 247], as well as in clinical natural language processing [166, 223, 236]. Other promising application directions of deep learning for health care include genomic medicine [130], mobile data modeling [188], and patient de-identification [56]. People offer different explanations of why deep learning works, from theoretical analysis to empirical validation and even intuition, and attempt to figure out in which cases it works [79, 134, 252].
In practice, however, applying deep learning models is not always smooth sailing, and we face many open questions when bringing them to real-world problems in health care. This is especially true when dealing with temporal data in health care, which is widely available and in which complex temporal dependencies are the key to accurate predictive analysis yet are difficult to capture; several unique and critical challenges lie in its practical aspects.

The first challenge of modeling temporal data in health care comes from its inherent heterogeneity. Data in health care are not as clean and well-organized as those in other research areas [143] where deep learning models stand out. Temporal data in health care usually come with unique properties, such as longitudinal irregularity, inherent noise, missing values, and a multi-rate, multi-source nature. For example, for patients in the intensive care unit (ICU), vital signs such as heart rate are sampled frequently, while lab results such as pH are measured infrequently, and sometimes the nurse may not take measurements of some variables at all. Meanwhile, previous studies have also revealed statistical differences between manual and electronic recordings of the same variables [35]. This data heterogeneity makes it extremely difficult to apply most existing mature models in many domains [75, 111], and especially in health care [124]. At the same time, such heterogeneity may also provide additional useful information from the data-driven perspective [88, 181], though it is difficult to verify or exploit the underlying intuitions and assumptions. In a word, the heterogeneous temporal data commonly observed in health care is a double-edged sword for predictive analysis.
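To make this heterogeneity concrete, missing observations can be encoded explicitly rather than discarded. The sketch below is illustrative only (the variable values are invented), but it follows the masking/time-interval representation developed for such data in Chapter 4: a masking matrix records which entries were observed, and a time-interval matrix records how long each variable has gone unmeasured.

```python
import numpy as np

# Hypothetical ICU excerpt: two variables (heart rate, pH) at five hourly
# steps, with np.nan marking measurements that were never taken.
X = np.array([[88.0,   np.nan, 91.0,   np.nan, 90.0],    # heart rate
              [np.nan, 7.31,   np.nan, np.nan, 7.28]])   # pH
timestamps = np.array([0.0, 1.0, 2.0, 3.0, 4.0])  # hours since admission

# Masking: M[d, t] = 1 iff variable d was observed at step t.
M = (~np.isnan(X)).astype(float)

# Time interval: D[d, t] = time since variable d was last observed,
# accumulated across consecutive missing steps (D[d, 0] = 0 by convention).
D = np.zeros_like(X)
for d in range(X.shape[0]):
    for t in range(1, X.shape[1]):
        gap = timestamps[t] - timestamps[t - 1]
        D[d, t] = gap if M[d, t - 1] == 1 else gap + D[d, t - 1]
```

Here the heart-rate intervals come out as [0, 1, 2, 1, 2] and the pH intervals as [0, 1, 1, 2, 3], so a downstream model sees not just the values but how stale each one is.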
Data scarcity might not appear to be an issue at first glance, since health care data are collected from all around the world. However, data collections are scarce for rare diseases and rare cases [97]. The diagnoses of these diseases can also be difficult, costly, delayed, or even inaccurate. Many people doubt how well deep learning and machine learning models perform under such data scarcity [180]. Meanwhile, we possess rich structured domain knowledge from various medical ontologies, such as the Systematized Nomenclature of Medicine-Clinical Terms (SNOMED-CT) [49], the Unified Medical Language System (UMLS) [26], and the International Classification of Diseases, 9th revision (ICD-9) [168], as well as knowledge about medical feature similarity. It is promising but nontrivial to utilize such information, which is particularly available in health care, to help train deep learning models.

Furthermore, it is quite challenging to make deep learning methods easier to use, understand, and interpret for doctors and clinicians in real-world health care applications. While computer scientists have provided ever more sophisticated deep learning models and achieved better scores on almost every evaluation metric, doctors are becoming interested in, but remain cautious about, the use of these novel methods [137, 138]. There are several possible explanations for the gap between these two communities. First, most deep learning methods are simply unable to express why they achieved the results that they did [203]. Not being able to explain oneself might be a minor issue in some cases, but it is definitely a big problem in many applications [230], including health care, where interpretability is not only important but necessary. Doctors will not let a "black box" determine their actions and decisions just because the "black box" says so.
Therefore, good interpretability is a prerequisite for adopting novel deep learning methods and should be a merit of every model proposed for health care problems. Second, designing new deep learning solutions, or borrowing existing ones, for new clinical studies requires contributions from both machine learning researchers and clinicians, and neither can be dispensed with [9]. Clinicians provide the necessary domain knowledge of the tasks and pose the problems that really matter, while machine learning researchers should be capable of thinking beyond syntax and delivering suitable models swiftly. Examples of such effective collaborations can be quite helpful for both communities.

1.1 Summary of Thesis Work

Figure 1.1: Summary of thesis work.
- Utilizing data heterogeneity: [Nature Scientific Reports] GRU-D: exploiting missingness in multivariate time series; [ICML 2018] MR-HDMM: modeling multi-rate multivariate time series.
- Handling data scarcity: [ICDM 2017] ehrGAN: boosting performance with a small set of labeled data; [KDD 2015] incorporating prior-knowledge and incremental training.
- Improving model interpretability and usability: [AMIA 2016] interpretable mimic learning framework; [AMIA 2017] deep learning solutions to opioid usage study.

The primary work of this thesis is three-fold, addressing the above challenges and concerns as shown in Figure 1.1: first, to utilize two types of heterogeneous temporal data (i.e., missing data and multi-rate data, as illustrated in Figure 1.2) to improve deep learning model performance; second, to handle deep learning model training issues that come from data scarcity; and third, to make deep learning models easier for clinicians and doctors to understand and use.
Figure 1.2: Examples of missing data and irregularly sampled multivariate time series (left; orange parts indicate missing observations, and pH values are irregularly sampled) and multi-rate multivariate time series (right) in the MIMIC-III dataset.

To exploit missing patterns for improved prediction performance on multivariate time series data in health care, we develop a novel deep learning model, GRU-D, as one of the early attempts to utilize the correlations between the missing values and missing patterns in time series and the target labels, a.k.a. informative missingness. GRU-D is based on the gated recurrent unit (GRU) network; it takes two representations of missing patterns, i.e., masking and time interval, and effectively incorporates a decay mechanism into the deep model architecture. GRU-D not only captures the long-term temporal dependencies in time series but also utilizes the missing patterns to achieve better prediction results.

To address the challenge of modeling multi-rate multivariate time series data in health care, which come with various sampling rates and encode multiple temporal dependencies, we propose the multi-rate hierarchical deep Markov model (MR-HDMM), a novel deep generative model which learns the latent hierarchical structure of the underlying data generation process. MR-HDMM is equipped with learnable switches and auxiliary connections. The learnable switches capture multiple temporal dependencies in multiple latent states by controlling the update-and-reuse mechanism of the states. The auxiliary connections help the model effectively capture multi-scale dependencies from time series with different sampling rates and bring more flexibility into the model. With a systematically designed inference network, MR-HDMM can be effectively learned by stochastic backpropagation and ancestral sampling.
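The input-decay idea behind GRU-D can be illustrated with a minimal NumPy sketch. In the full model the decay parameters are trained jointly with the GRU; here `w_gamma` and `b_gamma` are fixed toy values, and the example is a single scalar variable rather than a full multivariate step.

```python
import numpy as np

def input_decay_impute(x_t, m_t, delta_t, x_last, x_mean, w_gamma, b_gamma):
    """One step of GRU-D-style input decay (simplified, scalar sketch).

    When the input is unobserved (m_t = 0), it is replaced by a value that
    decays from the last observation toward the empirical mean as the time
    gap delta_t grows. w_gamma, b_gamma are trainable in the real model.
    """
    gamma = np.exp(-np.maximum(0.0, w_gamma * delta_t + b_gamma))  # in (0, 1]
    x_imputed = gamma * x_last + (1.0 - gamma) * x_mean
    return m_t * x_t + (1.0 - m_t) * x_imputed

# Toy example: a variable missing at this step (m = 0), last seen value 90,
# training mean 80; a longer gap pulls the imputation toward the mean.
short_gap = input_decay_impute(0.0, 0.0, 1.0, 90.0, 80.0, 0.5, 0.0)
long_gap  = input_decay_impute(0.0, 0.0, 8.0, 90.0, 80.0, 0.5, 0.0)
```

With these toy parameters, the short gap yields an imputation near the last observation (about 86.1), while the long gap yields one near the mean (about 80.2), which is exactly the "fade toward the default" behavior the decay mechanism is meant to provide.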
MR-HDMM is able to achieve better forecasting and interpolation performance compared with single-rate or basic imputation methods.

To address the challenge of not having massive labeled temporal EHR data to train powerful deep learning risk prediction models, we propose a general framework to boost the risk prediction performance with a limited amount of labeled temporal EHR data. The proposed method is a general semi-supervised learning framework equipped with a modified generative adversarial network, namely ehrGAN, and a convolutional neural network prediction model. The ehrGAN model is able to provide plausible labeled EHR data by mimicking real patient records and taking the labels of the real records. The generated data are then used to augment the training dataset in a semi-supervised learning manner and enhance the predictive capacity of the convolutional neural network prediction model.

To address the challenge of utilizing domain knowledge and data characteristics in health care time series data, we propose a general framework with two novel modifications to standard deep learning models to speed up neural network training and improve the prediction performance. First, we use a Laplacian regularizer in the topmost layer to leverage priors of any form, ranging from medical ontologies to data-derived similarity. Second, we describe a scalable procedure for training neural networks incrementally with partially shared architectures. Both of these innovations are well-suited to health care applications, where data for rare diagnoses may not be large-scale, but exploitable structures, such as temporal order and relationships between labels and features, exist in the temporal data.

We propose an interpretable mimic learning framework to address the challenge that deep learning models lack interpretability, which is crucial for their wide adoption in medical research and clinical decision-making.
The interpretable mimic learning framework takes a simple yet powerful knowledge-distillation approach which uses gradient boosting trees to learn interpretable models while achieving prediction performance as strong as that of deep learning models. This is done by first training a state-of-the-art deep learning model as a teacher model, and then training gradient boosting trees as the student model by taking the output of the teacher model as the target labels. The student model is able to mimic the teacher model and achieve similar or even better performance. We also utilize different methods to interpret the proposed models, including computing feature influence scores, visualizing partial dependency plots, and investigating the top decision trees. The identified important features and rules are also validated by the clinicians.

To demonstrate how to adopt novel deep learning methods for urgent and important real-world health care problems with large-scale data, we propose our deep learning solutions for classifying patients on opioid use. We work closely with doctors to build powerful deep learning predictive models and discover the factors leading to opioid long-term use, dependence, and abuse. We design and build the entire pipeline from cohort selection and group identification to feature extraction and temporal data processing, compare several commonly used baselines with deep learning methods, and investigate important features from deep learning models. Our study shows the strengths and limitations of different deep learning methods for
Though our models are mainly designed for temporal data in health care, the application domain of the proposed models goes far beyond, as we believe the key ideas behind the models for handling complex structured and unstructured data could be generalized to real-world problems with other types of data and/or in other domains. The studies in this thesis work showcase how we collaborate with domain experts to express domain-specific problems with the machine learning language and attempt to solve them together. 1.2 Thesis Statement In this thesis, we provide effective deep learning solutions to many unique and critical challenges in temporal data from real-world health care applications, in terms of data heterogeneity, data scarcity, and model interpretability. To be more specific, based on the thesis work, we conclude the following statements hold true in general. • Data heterogeneity in health care domain is a fruitful resource for development and improvement of deep learning solutions. By modeling the decay mecha- nism based on the missing patterns, recurrent neural network models exploit the missingness to achieve better prediction performance. Hierarchical deep generative models work better on multi-rate time series by capturing intra- and inter-rate multi-scale temporal dependencies. 10 • The data scarcity issue may cause many training issues especially for rare diseases and because of lacking high quality labels and annotations. Semi- supervised learning frameworks with generative adversarial networks and other prior-knowledge-based training strategies mitigate such problems and provide more efficient solutions. • The interpretability is at least as important as predictive performance in health care research. The interpretable mimic learning framework is an effective way to provide interpretable and powerful models. 
• Deep learning models can be used as efficient and effective solutions to many urgent and important real-world health care problems on large-scale datasets. This relies on the collaboration between computer scientists with expert knowledge in deep learning and clinicians and doctors with domain knowledge in health care.

1.3 Thesis Outline

We first introduce preliminary concepts for this thesis in Chapter 2. In Chapter 3, we introduce several temporal datasets in health care, on which all experiments in this thesis work are conducted. In Chapter 4, we present our work on utilizing two types of data heterogeneity: time series with missing values and time series with multiple sampling rates. We present two works on handling data scarcity in Chapter 5. In Chapter 6, we present the interpretable mimic learning framework and the deep learning solutions to the opioid usage study. We conclude the thesis and summarize the future work in Chapter 7.

1.4 Related Publications

Parts of this thesis have been published in machine learning, data mining, and health care conferences and journals. The list includes:

• Related to Chapter 4:
– Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu. Recurrent Neural Networks for Multivariate Time Series with Missing Values. Nature Scientific Reports (SREP), 8(1):6085, 2018.
– Zhengping Che*, Sanjay Purushotham*, Guangyu Li*, Bo Jiang, and Yan Liu. Hierarchical Deep Generative Models for Multi-Rate Multivariate Time Series. Proceedings of the 35th International Conference on Machine Learning (ICML), 2018. (*equal contributions)

• Related to Chapter 5:
– Zhengping Che*, Yu Cheng*, Shuangfei Zhai, Zhaonan Sun, and Yan Liu. Boosting Deep Learning Risk Prediction with Generative Adversarial Networks for Electronic Health Records. Proceedings of the IEEE 17th International Conference on Data Mining (ICDM), 2017. (*equal contributions)
– Zhengping Che*, David C. Kale*, Wenzhe Li, Mohammad Taha Bahadori, and Yan Liu. Deep Computational Phenotyping.
Proceedings of the 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD), 2015. (*equal contributions)

• Related to Chapter 6:
– Zhengping Che, Sanjay Purushotham, Robinder Khemani, and Yan Liu. Interpretable Deep Models for ICU Outcome Prediction. Proceedings of the American Medical Informatics Association Annual Symposium (AMIA), 2016.
– Zhengping Che, Jennifer St. Sauver, Hongfang Liu, and Yan Liu. Deep Learning Solutions for Classifying Patients on Opioid Use. Proceedings of the American Medical Informatics Association Annual Symposium (AMIA), 2017.

Chapter 2 Preliminary

In this chapter, we provide necessary preliminary concepts and descriptions used throughout the rest of this thesis. We first introduce the basic notations in Section 2.1 and review some basic machine learning methods for temporal data analysis in Section 2.2. Then we review several standard and commonly used deep learning models in Section 2.3, which is also the core technical background of this thesis. In addition, application-specific background and notations are respectively presented within each chapter.

2.1 Notations

We list the notations which are used to describe the temporal data and the models in this thesis. We use bold capital letters (e.g., $W$ and $X$) to refer to matrix variables, bold lowercase letters (e.g., $b$ and $y$) for vector variables, and unbold letters (e.g., $l$, $t$, $D$, and $T$) for scalars, unless otherwise specified. The element $(i,j)$ of a matrix $A$ is denoted by $A_{i,j}$ or $A[i,j]$. The $i$-th entry of a vector $b$ is denoted by $b_i$. We use $X^\top$ and $y^\top$ to denote the transpose of a matrix $X$ and a vector $y$, respectively.

We denote a multivariate time series (MTS, e.g., measurements of a patient in the ICU) with $D$ variables of length $T$ as $X = (x_1, x_2, \ldots, x_T)^\top \in \mathbb{R}^{T \times D}$, where $x_t \in \mathbb{R}^D$ represents the $t$-th observations (a.k.a. measurements) of all variables for each $t \in \{1, 2, \ldots, T\}$, and $x_t^d$ denotes the measurement of the $d$-th variable of $x_t$.
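To make this indexing concrete, here is a tiny illustration in plain Python (the values are made up for illustration; Python indices start at 0 while the notation above starts at 1):

```python
# Toy multivariate time series with T = 4 time steps and D = 2 variables.
# X[t][d] plays the role of x_{t+1}^{d+1} above, since Python is 0-indexed.
X = [
    [4.0, 0.5],  # x_1
    [4.9, 0.7],  # x_2
    [5.1, 0.6],  # x_3
    [4.3, 0.8],  # x_4
]
T, D = len(X), len(X[0])
x_2 = X[1]       # the second observation vector x_2
x_2_1 = X[1][0]  # x_2^1: the measurement of the first variable at t = 2
```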
Similarly, a temporal embedding matrix can be represented by $X \in \mathbb{R}^{T \times M}$, where each row $x_t \in \mathbb{R}^M$ of the matrix is the $t$-th embedding vector in the order of time of occurrence. For classification or regression tasks, we usually use $y$ to denote the labels or the target values. We use letters in script typeface such as $\mathcal{D}$ to denote datasets. For the $n$-th sample in the dataset, we use a superscript with parentheses (e.g., $X^{(n)}$ or $y^{(n)}$) to indicate variables of that sample. The weights of deep learning models are represented by matrices and vectors as well. To represent the weight and bias in the $l$-th layer of a network, we use superscripts like $W^{[l]}$ and $b^{[l]}$, or simply $W_l$ and $b_l$ if there is no contextual ambiguity. We use $\odot$ to represent the element-wise multiplication operator. We use $\|\cdot\|$ to represent the matrix/vector norm, which is the $L_2$ norm if not specified.

2.2 Basic Temporal Data Analytic Models

Temporal data analysis is the process of modeling and explaining data points which have time dependency. In this section, we review several common machine learning and statistical models for temporal data related to this thesis work, namely time series classification, prediction, and forecasting.

Autoregressive integrated moving average (ARIMA) model The autoregressive integrated moving average (ARIMA) model [30], often denoted as ARIMA($p$,$d$,$q$), is a popular model fitted to time series data to better understand the data or to predict future points in the series. It can be seen as the combination of the auto-regression (AR) part, the moving average (MA) part, and the integrated (I) part, and is designed for non-stationary data. The $p$, $d$, and $q$ in the notation are the parameters for the order of the AR model, the number of differencings, and the order of the MA model, respectively. ARIMA attempts to forecast $x_t$, the value of the time series at time $t$, from its past value(s).
Firstly, the AR part indicates that the evolving variable of interest is regressed on its own lagged (i.e., prior) values, and the AR model of order $p$ (i.e., AR($p$)) can be written as
\[ x_t = \sum_{i=1}^{p} \phi_i x_{t-i} + \varepsilon_t + c \]
where $\phi_i$ are the parameters, $\varepsilon_t$ are the error terms, and $c$ is a constant. Secondly, the MA part indicates that the regression error is actually a linear combination of error terms whose values occurred contemporaneously and at various times in the past. The MA model of order $q$ (i.e., MA($q$)) can be written as
\[ x_t = \sum_{i=1}^{q} \theta_i \varepsilon_{t-i} + \varepsilon_t + c \]
where $\theta_i$ are the parameters of the MA model. Thirdly, the I part indicates that the data values have been replaced with the differences between their values and the previous values. By combining all three parts together, the full model ARIMA($p$,$d$,$q$) can be written as
\[ \left(1 - \sum_{i=1}^{p} \phi_i L^i\right)(1-L)^d x_t = c + \left(1 + \sum_{i=1}^{q} \theta_i L^i\right) \varepsilon_t \]
where $L$ is the lag operator defined as $L^k x_t = x_{t-k}$. One way to choose a proper order is to minimize the Akaike information criterion (AIC) or the Bayesian information criterion (BIC) [48]. These models can be straightforwardly generalized from univariate to multivariate time series, which leads to the vector autoregression (VAR) models.

Kalman filters (KF) Kalman filters (KF) [110] are one kind of state space models for time series data estimation, which assume linear dynamics of the system and Gaussian (white) noise in the temporal data observations. To be more specific, the Kalman filter model for states $x$, control vectors $u$, and observations $z$ can be represented by the following two equations
\[ x_t = A x_{t-1} + B u_t + w_t \]
\[ z_t = H x_t + v_t \]
where $A$, $B$, and $H$ are the state transition model, the control input model, and the emission model, respectively, and $w \sim \mathcal{N}(0, Q_t)$ and $v \sim \mathcal{N}(0, R_t)$ are the process noise and the observation noise, respectively, both drawn from zero-mean multivariate normal distributions. Kalman filters can be fitted by taking five update equations.
The first two equations are for the time update (prediction):
\[ \hat{x}^-_t = A \hat{x}_{t-1} + B u_{t-1} \]
\[ P^-_t = A P_{t-1} A^\top + Q \]
The following three equations are for the measurement correction:
\[ K_t = P^-_t H^\top \left(H P^-_t H^\top + R\right)^{-1} \]
\[ \hat{x}_t = \hat{x}^-_t + K_t \left(z_t - H \hat{x}^-_t\right) \]
\[ P_t = (I - K_t H) P^-_t \]
Here $P$ is the error covariance, $K$ is the Kalman gain, and $\star^-$ and $\hat{\star}$ denote the prior and the estimate of a variable $\star$, respectively. By applying the update equations iteratively, Kalman filters converge to the true values after enough steps.

Dynamic time warping (DTW) Measuring the similarity between two temporal sequences, which may have different lengths, is a fundamental problem in time series analysis and serves as the basis for many other tasks. Dynamic time warping (DTW) [25] is one of the most popular similarity measures, which aims to take the speed difference between the two time series into consideration. DTW attempts to find the optimal alignment between two temporal sequences $X = (x_1, \ldots, x_T)$ and $Y = (y_1, \ldots, y_{T'})$ with certain restrictions and rules. First, $x_1$ and $y_1$ must be aligned, and so must $x_T$ and $y_{T'}$. Second, each $x_t$ and $y_{t'}$ must be included in the alignment, and the order of indices should be non-decreasing. If we use $d_{t,t'}$ to indicate the (partial) dynamic time warping distance between $(x_1, \ldots, x_t)$ and $(y_1, \ldots, y_{t'})$, then DTW can be calculated by dynamic programming as
\[ d_{t,t'} = \min\{d_{t-1,t'},\ d_{t,t'-1},\ d_{t-1,t'-1}\} + \|x_t - y_{t'}\| \]
DTW can also be generalized to multivariate time series in practice [78] by taking the Euclidean distance as the local distance for all $(x_t, y_{t'})$.

2.3 Basic Deep Learning Models for Temporal Data

Deep learning models, or deep neural networks, have become a successful approach for the automated extraction of complex data representations with end-to-end training. Deep learning models consist of layered and hierarchical architectures of neurons for learning and representing data.
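As a brief aside before the deep models, the DTW recursion from Section 2.2 can be sketched in a few lines of Python (univariate case, with the absolute difference as the local distance; a toy sketch, not an implementation used in this thesis):

```python
def dtw(x, y):
    """Dynamic time warping distance between two univariate sequences,
    using |x_t - y_t'| as the local distance, via the recursion above."""
    T, U = len(x), len(y)
    INF = float("inf")
    d = [[INF] * (U + 1) for _ in range(T + 1)]
    d[0][0] = 0.0
    for t in range(1, T + 1):
        for u in range(1, U + 1):
            cost = abs(x[t - 1] - y[u - 1])
            d[t][u] = cost + min(d[t - 1][u], d[t][u - 1], d[t - 1][u - 1])
    return d[T][U]
```

For example, `dtw([1, 2, 3], [1, 2, 2, 3])` returns 0, since the repeated 2 can be aligned to the single 2 at no extra cost, even though the two sequences have different lengths.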
The hierarchical learning architecture is motivated by artificial intelligence emulating the deep and layered learning process of the primary sensorial areas of the neocortex in the human brain, which automatically extracts features and abstractions from the underlying data [22, 127]. In deep learning models, each neuron receives one or more inputs and sums them to produce an output (or activation). Each neuron in the hidden layers is assigned a weight that is considered for the outcome classification, but this weight is itself learned from its previous layers. The hidden layers thus can use multidimensional input data and introduce progressively non-linear weight combinations to the learning algorithm.

Deep feed-forward neural network (DNN) model The deep feed-forward neural network (DNN) [99] is one of the most basic deep neural networks. It is composed of multiple non-linear transformation layers to extract features from the data and possibly one prediction layer on the top to solve a classification/regression task. The output of each layer is fed to the next layer as input. For a DNN model with $L$ layers (i.e., $L-1$ hidden layers and one final output layer), the input vector for the $l$-th layer is denoted as $x^{[l]} \in \mathbb{R}^{D^{[l]}}$, and the transformation of each layer $l$ can be written as
\[ x^{[l+1]} = h^{[l]} = f^{[l]}(x^{[l]}) = f^{[l]}\left(W^{[l]} x^{[l]} + b^{[l]}\right) \]
where $W^{[l]}$ and $b^{[l]}$ are respectively the weight matrix and bias vector of layer $l$, and $f^{[l]}$ is a non-linear activation function, which usually takes one of sigmoid (logistic sigmoid, $\sigma(x) = \frac{1}{1 + \exp(-x)}$), tanh (hyperbolic tangent, $\tanh(x) = 2\sigma(2x) - 1$), and ReLU (rectified linear unit, $\mathrm{ReLU}(x) = \max(0, x)$) [160]. To conduct binary classification tasks, we apply the sigmoid function in the last layer and set the output dimension to 1. The outputs of the other layers $x^{[l]}$ are also usually treated as the features extracted by the neural network. The weights of deep learning models can be learned via backpropagation during training.
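As a minimal illustration of the layer transformation above, here is a toy two-layer DNN forward pass in plain Python (all weights are made-up values, not learned parameters):

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def dense(x, W, b, act):
    """One layer x -> act(W x + b), with W given as a list of rows."""
    z = [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W, b)]
    return act(z)

# A tiny 2-layer DNN for binary classification: a ReLU hidden layer,
# then a 1-unit sigmoid output layer (all weights are toy values).
x = [1.0, -2.0]
h = dense(x, W=[[0.5, -0.5], [1.0, 1.0]], b=[0.0, 0.0], act=relu)
p = dense(h, W=[[1.0, 1.0]], b=[0.0], act=sigmoid)[0]  # P(y = 1 | x)
```

With these toy weights, the hidden activation is `[1.5, 0.0]` and the output probability is `sigmoid(1.5)`, roughly 0.82.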
Taking the binary classification task as an example, we can learn the weights by optimizing the binary cross-entropy loss function during training, which is
\[ \ell_{\mathrm{loss}} = -\sum_{n=1}^{N} \left[ y^{(n)} \log h^{[L],(n)} + \left(1 - y^{(n)}\right) \log\left(1 - h^{[L],(n)}\right) \right] \]
where $y^{(n)}$ is the binary label for the $n$-th sample in the training dataset, and $h^{[L],(n)}$ is the output of the DNN model. It is worth noting that the ideas of using non-linear activation functions, training by backpropagation, and learning layer-wise features (representations) are used not only in simple DNN models but also in other deep learning models.

Vanilla recurrent neural network (RNN) model In order to handle sequential or temporal data of arbitrary length and capture temporal information from the data, recurrent neural network (RNN) [82] models are widely used. Unlike a feed-forward neural network, an RNN performs the same operation at each time step of the input sequence and feeds the output to the next time step as part of the input. Thus, RNN models are able to memorize what they have seen before and benefit from model weights (parameters) shared across all time steps. For example, at time step $t$, the activation of a vanilla RNN can be calculated as
\[ h_t = f(W_x x_t + W_h h_{t-1} + b) \]
where $W_x$, $W_h$, and $b$ are the network parameters, $f(\cdot)$ is a non-linear function, and the initial value of the hidden state $h_0$ is usually set to $0$. Similar to DNN models, an RNN can also have multiple stacked hidden layers at each time step. Additional feed-forward layers can also be applied on top of the last output $h_T$ or each output $h_t$. The backpropagation through time algorithm [193] is usually used to train RNN models.

Long short-term memory (LSTM) In order to capture complex long-term temporal dependencies and avoid vanishing gradient problems, some modified RNN models have been proposed with state-of-the-art performance. Long short-term memory (LSTM) [96] is one of the earliest and most commonly used RNN models.
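Before detailing LSTM, the vanilla recurrence above can be sketched in a few lines of Python (toy weights, tanh activation, and the same parameters reused at every time step):

```python
import math

def rnn_step(x_t, h_prev, Wx, Wh, b):
    """One step of the vanilla recurrence h_t = tanh(Wx x_t + Wh h_{t-1} + b)."""
    h_t = []
    for j in range(len(h_prev)):
        z = b[j]
        z += sum(Wx[j][i] * x_t[i] for i in range(len(x_t)))
        z += sum(Wh[j][k] * h_prev[k] for k in range(len(h_prev)))
        h_t.append(math.tanh(z))
    return h_t

# Run a length-3 univariate sequence through a 2-unit RNN; note that the
# same (toy) weights are reused at every time step, and h_0 = 0.
Wx = [[0.5], [-0.5]]
Wh = [[0.1, 0.0], [0.0, 0.1]]
b = [0.0, 0.0]
h = [0.0, 0.0]
for x_t in ([1.0], [2.0], [0.5]):
    h = rnn_step(x_t, h, Wx, Wh, b)
```

Because the two hidden units here have opposite input weights and zero bias, their activations stay mirrored (`h[1] == -h[0]`) at every step, and both remain bounded in (-1, 1) by the tanh.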
At each time step $t$, LSTM takes the input at that time step $x_t$ and the output at the previous time step $h_{t-1}$ to update its inner cell state $c_t$ and produce the current output $h_t$ as
\[ f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \]
\[ i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \]
\[ o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \]
\[ c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c) \]
\[ h_t = o_t \odot \tanh(c_t) \]
where $W_\ast$, $U_\ast$, and $b_\ast$ are all model parameters. The initial values of $h_0$ and $c_0$ are both set to $0$.

Gated recurrent unit (GRU) The gated recurrent unit (GRU) [39] is another RNN model which has a simpler architecture compared to LSTM and has been shown to achieve state-of-the-art performance among RNN models for modeling sequential data [47]. Let $x_t \in \mathbb{R}^P$ denote the variables at time $t$, where $1 \le t \le T$. At each time $t$, GRU has a reset gate $r_t^j$ and an update gate $z_t^j$ for each hidden state $h_t^j$. The update functions of GRU are
\[ z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z) \]
\[ r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r) \]
\[ \tilde{h}_t = \tanh(W x_t + U(r_t \odot h_{t-1}) + b) \]
\[ h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \]
where the matrices $W_\ast$, $U_\ast$ and vectors $b_\ast$ are model parameters.

Convolutional neural network (CNN) model Another basic neural network is the convolutional neural network (CNN) [129], which is good at capturing local structure and is especially commonly applied to image or video data. In order to handle temporal data, we can use CNN models with one-dimensional convolution over the temporal dimension. CNN models are usually a combination of three types of layers: convolutional layers, pooling layers, and fully connected layers. The convolutional layer is the core building block of CNN models; it learns spatial activation maps via a set of filters over the inputs with parameter sharing. The pooling layer is mainly used to reduce the size of the representations and the number of parameters. For example, a one-dimensional max-pooling layer outputs the maximum value of its input along one dimension.
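For example, valid 1-D convolution and non-overlapping 1-D max-pooling over the temporal dimension can be sketched as follows (a toy sketch in plain Python, with a single made-up difference filter):

```python
def conv1d(x, kernel):
    """Valid 1-D convolution (cross-correlation, as in CNNs) along time."""
    k = len(kernel)
    return [sum(kernel[j] * x[t + j] for j in range(k))
            for t in range(len(x) - k + 1)]

def max_pool1d(x, size):
    """Non-overlapping 1-D max-pooling with the given window size."""
    return [max(x[t:t + size]) for t in range(0, len(x) - size + 1, size)]

signal = [0.0, 1.0, 3.0, 2.0, 5.0, 4.0]
feat = conv1d(signal, [1.0, -1.0])  # local differences along time
pooled = max_pool1d(feat, 2)        # halves the temporal resolution
```

A real CNN learns many such filters jointly with the rest of the network; here the single filter simply computes local differences, and pooling keeps only the largest response in each window.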
The fully connected layer is the same as a regular layer in a DNN, which connects to all outputs from its previous layer. A variety of CNN architectures made up of the above layers have been proposed to solve different problems.

Deep Markov model (DMM) Unlike the other deep learning models introduced above, the deep Markov model (DMM) [122, 123] is a class of generative models which empower Gaussian state space models to leverage the representational capacity of deep learning models. Deep Markov models keep the Markov property of hidden Markov models and replace the classic linear emission and transition distributions with multiple neural network layers. Deep Markov models can be learned by stochastic gradient ascent on a variational lower bound of the likelihood.

Combination of deep learning models One significant advantage of deep learning models is their structural flexibility with end-to-end training, which allows us to combine different types of layers and networks into one model. For example, we can use an RNN model to take the temporal inputs and a DNN model to take the non-temporal inputs, and then add additional layers which take the outputs from both networks as input. All components of the entire model can then be trained jointly by backpropagation.

Chapter 3 Temporal Datasets in Health Care

Having evaluations on real-world health care datasets is an essential step towards building and validating effective deep learning models for health care applications. In every work in this thesis, we conduct experiments and provide discussions on one or more private or public health care datasets with temporal records. Based on the settings in our experiments, these datasets can be categorized into two groups by their sources and characteristics.

3.1 Medical Time Series Data

The first group of data is the time series data in health care, which mainly come from modern hospitals.
For example, the monitoring data of patients in the ICU or measurements after hospital admission fall into this group. Most available features are vital signs, lab tests, and other similar variables, which are recorded as multivariate time series with continuous or discrete values. Each time series in these datasets usually lasts from hours to days.

MIMIC-III dataset The MIMIC-III dataset is a public dataset [106] which contains deidentified clinical care data with over 58,000 hospital admission records collected at Beth Israel Deaconess Medical Center from 2001 to 2012. In our work, we mainly use admission records collected during 2008-2012 by the MetaVision data management system, which is still employed at the hospital. The major tables used in our work are input events (fluids into the patient, e.g., insulin), output events (fluids out of the patient, e.g., urine), lab events (lab test results, e.g., pH and platelet count), and prescription events (drugs prescribed by doctors, e.g., aspirin and potassium chloride). We use this dataset in the studies in Section 4.1 and Section 4.2.

PhysioNet dataset The PhysioNet dataset, from the PhysioNet Challenge 2012 [207], is a publicly available collection of multivariate clinical time series from 8,000 ICU records. Each record is a multivariate time series of roughly 48 hours and contains 33 variables such as albumin, heart rate, glucose, etc. We use Training Set A from this dataset in our experiments since outcomes (such as in-hospital mortality labels) are publicly available only for this subset. We use this dataset in the studies in Section 4.1 and Section 5.2.

PICU dataset The PICU dataset consists of ICU clinical time series extracted from the EHR system of a major hospital. The original dataset includes roughly ten thousand episodes of varying lengths, but we exclude episodes shorter than 12 hours or longer than 128 hours, yielding a dataset of 8,500 multivariate time series of a dozen physiologic variables with hourly sampling.
Each episode also has zero or more associated diagnostic codes from ICD-9. We use this dataset in the work in Section 5.2.

Vent dataset The Vent dataset [114] is a pediatric ICU dataset collected at the Children's Hospital Los Angeles. This dataset consists of health records from 398 patients with acute lung injury in the hospital. It contains a set of 27 static features, such as demographic information and admission diagnoses, and another set of 21 temporal features (recorded daily), such as monitoring features and discretized scores made by experts, during the initial 4 days of mechanical ventilation. We use this dataset in the study in Section 6.1.

3.2 Patient History Data

The second type of temporal data in health care is the electronic health records or claims data from hospitals or companies, with sequential codings of patients' medical history. This type of data usually includes medications, diagnoses, procedures, and other types of records in common coding systems. The time range of this type of data can be as long as years, while infrequent or sparse records are common.

Claim dataset The Claim dataset is from a real-world longitudinal EHR database of 218,680 patients and 14,969,489 observations of 14,690 unique medical events between 2011 and 2015, from an anonymous health insurance company. In this dataset, a set of disease-related ICD-9 codes is recorded to indicate medical conditions as well as drug prescriptions. Part of this dataset is used in other related works [38, 257]. We use this dataset in the work in Section 5.1.

REP dataset The REP dataset is extracted from the Rochester Epidemiology Project (REP) [216] with a cohort of 142,377 patients. The total number of people identified by REP [215] covers about 98.7% of the population residing in Olmsted
Thislarge-scaledataset, withmorethan50yearsofhistory, is well-representative for this population-based study and suitable for powerful and complex deep learning models. Records of diagnoses, procedures, prescriptions and surgeries, as well as patient demographics are mainly used in the study on this dataset. We use this dataset in the work in Section 6.2. In addition, the details about data processing and task descriptions for each dataset are discussed within each chapter as they are specific to each study. 28 Chapter 4 Utilizing Data Heterogeneity 29 In this chapter, we present two works on effectively utilizing data heterogeneity in deep learning models. We mainly consider two forms of data heterogeneity in temporal data in health care. First, data may have different numbers of missing values and various missing patterns. We introduce a modified recurrent neural network namely GRU-D, which is designed for multivariate time series with missing values, in Section 4.1. Second, data may come with multiple sampling rates, and we then present our hierarchical deep generative model, MR-HDMM, for multi-rate multivariate time series in Section 4.2. 4.1 Exploiting Missingness in Multivariate Time Series Multivariate time series, which are ubiquitous in practical applications in health care, often inevitably carry missing observations due to various reasons, such as medical events, saving costs, anomalies, inconvenience and so on. It has been noted that these missing values are usually informative missingness [195], i.e., the missing values and patterns provide rich information about target labels in supervised learning tasks (e.g, time series classification). To illustrate this idea, we show some examples from the MIMIC-III dataset [106] in Figure 4.1. 
We plot the Pearson correlation coefficients between the variable missing rates, which indicate how often each variable is missing in the time series, and the labels of our interest, which are mortality and ICD-9 diagnosis categories. We observe that the missing rates are correlated with the labels, and the missing rates of variables with low overall missing rate are usually highly (either positively or negatively) correlated with the labels. In other words, the missing rate of variables for each patient is useful, and this information is more useful for the variables which are observed more often in the dataset. These findings demonstrate the usefulness of missingness patterns in solving a prediction task.

In the past decades, various approaches have been developed to address missing values in time series [202]. A simple solution is to omit the missing data and to perform analysis only on the observed data, but it does not provide good performance when the missing rate is high and inadequate samples are kept. Another solution is to fill in the missing values with substituted values, which is known as data imputation. Smoothing, interpolation [121], and spline [51] methods are simple and efficient, and thus widely applied in practice. However, these methods do not capture variable correlations and may not capture the complex patterns required to perform imputation. A variety of imputation methods have been developed to better estimate missing data. These include spectral analysis [158], kernel methods [190], the expectation-maximization algorithm [76], matrix completion [148], and matrix factorization [119]. Additionally, multiple imputation [12, 240] can be further applied with these imputation methods to reduce the uncertainty, by repeating the imputation procedure multiple times and averaging the results. Combining the imputation methods with prediction models often results in a two-step process where imputation and prediction models are separated.
By doing this, the missing patterns are not effectively explored in the prediction model, thus leading to suboptimal analysis results [239]. In addition, most imputation methods have other requirements which may not be satisfied in real applications; for example, many of them work only on data with small missing rates, assume the data are missing at random or missing completely at random, or cannot handle time series data with varying lengths. Moreover, training and applying these imputation methods are usually computationally expensive.

Recently, recurrent neural networks (RNN), such as long short-term memory (LSTM) [96] and gated recurrent units (GRU) [40], have been shown to achieve state-of-the-art results in many applications with time series or sequential data, including machine translation [15, 219] and speech recognition [92]. RNN models enjoy several nice properties such as strong prediction performance as well as the ability to capture long-term temporal dependencies and variable-length observations. RNN models for missing data have been studied in earlier works [23, 173, 228] and applied to speech recognition and blood-glucose prediction. Recent works [41, 136] tried to handle missingness in RNN models by concatenating missing entries or timestamps with the input or performing simple imputations. However, there have not been works which design RNN structures to incorporate the patterns of missingness for time series classification problems. Exploiting the power of customized RNN models along with capturing the informativeness of missing patterns is a promising new avenue to effectively model multivariate time series and is the main motivation behind our work.

In this chapter, we develop a novel deep learning model based on GRU, namely GRU-D, to effectively exploit two representations of informative missingness patterns, i.e., masking and time interval.
Masking informs the model which inputs are observed (or missing), while time interval encapsulates the input observation patterns. Our model captures the observations and their dependencies by applying masking and time interval (using a decay term) to the inputs and network states of GRU, and jointly trains all model components using backpropagation. Thus, our model not only captures the long-term temporal dependencies of time series observations but also utilizes the missing patterns to improve the prediction results. Empirical experiments on real-world clinical datasets as well as synthetic datasets demonstrate that our proposed model outperforms strong deep learning models built on GRU with imputation as well as other strong baselines.

Figure 4.1: Demonstration of informative missingness on the MIMIC-III dataset: the absolute values of Pearson correlations between variable missing rates (bottom) and ICD-9 diagnosis categories (top) and mortality (middle).

These experiments show that our proposed method is suitable for many time series classification problems with missing data, and in particular is readily applicable to the predictive tasks in emerging health care applications. Moreover, our method provides useful insights into more general research challenges of time series analysis with missing data beyond classification tasks, including 1) a general deep learning framework to handle time series with missing data, 2) an effective solution to characterize the missing patterns of not-missing-completely-at-random time series data with masking and time interval, and 3) an insightful approach to study the impact of variable missingness on the prediction labels by decay analysis.
X: input time series (2 variables); s: time stamps for X; M: masking for X; Δ: time interval for X.

    X = [[ 47,  49,  NA,  40,  NA,  43,  55],
         [ NA,  15,  14,  NA,  NA,  NA,  15]]
    s =  [0.0, 0.1, 0.6, 1.6, 2.2, 2.5, 3.1]
    M = [[  1,   1,   0,   1,   0,   1,   1],
         [  0,   1,   1,   0,   0,   0,   1]]
    Δ = [[0.0, 0.1, 0.5, 1.5, 0.6, 0.9, 0.6],
         [0.0, 0.1, 0.5, 1.0, 1.6, 1.9, 2.5]]

Figure 4.2: An example of measurement vectors x_t, time stamps s_t, masking m_t, and time interval δ_t.

4.1.1 Methodology

4.1.1.1 Notations for Time Series with Missing Values

Given a multivariate time series (MTS) X ∈ R^{T×D} which possibly has missing values, we use s_t ∈ R to denote the time stamp at which the t-th observation is obtained (we assume the first observation is made at time stamp 0, i.e., s_1 = 0), introduce a masking vector m_t ∈ {0, 1}^D to denote which variables are missing at time step t, and also maintain the time interval δ^d_t ∈ R for each variable d since its last observation. To be more specific, we have

    m^d_t = 1, if x^d_t is observed; 0, otherwise    (4.1)

and

    δ^d_t = s_t − s_{t−1} + δ^d_{t−1},  if t > 1 and m^d_{t−1} = 0
            s_t − s_{t−1},              if t > 1 and m^d_{t−1} = 1
            0,                          if t = 1    (4.2)

An example of the notations is illustrated in Figure 4.2. In this chapter, we are interested in the time series classification problem, where we predict the labels l_n ∈ {1, ..., L} given the time series data D = {(X_n, s_n, M_n)}_{n=1}^N, where X_n = [x^(n)_1, ..., x^(n)_{T_n}], s_n = [s^(n)_1, ..., s^(n)_{T_n}], and M_n = [m^(n)_1, ..., m^(n)_{T_n}].

4.1.1.2 GRU-RNN for Time Series with Missing Values

We investigate the common use of RNNs for time series classification with missing values. Among different variants of the RNN, we specifically consider GRU-RNN, a recurrent neural network with gated recurrent units, but similar discussions and modifications are also valid for other RNN models such as LSTM. We describe three commonly used straightforward ways to handle missing values without applying any imputation approaches or making any modifications to the GRU network architecture.
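As a concrete reference before discussing those approaches, the masking and time-interval matrices of Equations (4.1) and (4.2) can be computed directly from the raw timestamps and a NaN-coded series. The sketch below (plain NumPy; the function name is our own, not from the chapter) reproduces the Figure 4.2 example:

```python
import numpy as np

def masking_and_intervals(x, s):
    """Compute the masking matrix M (Eq. 4.1) and time-interval matrix
    Delta (Eq. 4.2) for a D x T array `x` (np.nan marks missing entries)
    with timestamps `s`, where s[0] == 0."""
    x = np.asarray(x, dtype=float)
    s = np.asarray(s, dtype=float)
    m = (~np.isnan(x)).astype(int)           # 1 if observed, 0 if missing
    delta = np.zeros_like(x, dtype=float)    # delta at t = 1 is 0
    for t in range(1, x.shape[1]):
        gap = s[t] - s[t - 1]
        # carry the previous interval forward when the last value was missing
        delta[:, t] = gap + np.where(m[:, t - 1] == 0, delta[:, t - 1], 0.0)
    return m, delta

# The two-variable example from Figure 4.2:
NA = np.nan
X = [[47, 49, NA, 40, NA, 43, 55],
     [NA, 15, 14, NA, NA, NA, 15]]
s = [0.0, 0.1, 0.6, 1.6, 2.2, 2.5, 3.1]
M, Delta = masking_and_intervals(X, s)
```

Running this reproduces the M and Δ matrices shown in Figure 4.2, e.g., Δ for the second variable accumulates to 2.5 by the final step because it is unobserved between time stamps 0.6 and 3.1.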
The first approach is simply to replace each missing observation with the mean of the variable across the training examples. In the context of GRU, we have

    x^d_t ← m^d_t x^d_t + (1 − m^d_t) x̃^d    (4.3)

where x̃^d = (Σ_{n=1}^N Σ_{t=1}^{T_n} m^d_{t,n} x^d_{t,n}) / (Σ_{n=1}^N Σ_{t=1}^{T_n} m^d_{t,n}). x̃^d is calculated on the training dataset and used for both the training and testing datasets. We refer to this approach as GRU-Mean. A second approach is to exploit the temporal structure. For example, we may assume any missing value is the same as its last measurement and use forward imputation (GRU-Forward), i.e.,

    x^d_t ← m^d_t x^d_t + (1 − m^d_t) x^d_{t′}    (4.4)

where t′ < t is the last time the d-th variable was observed. Instead of explicitly imputing missing values, the third approach simply indicates which variables are missing and how long they have been missing as a part of the input, by concatenating the measurement, masking, and time interval vectors as

    x^(n)_t ← [x^(n)_t; m^(n)_t; δ^(n)_t]

where x^(n)_t can be either from Equation (4.3) or (4.4). We later refer to this approach as GRU-Simple. Besides the three common strategies, several recent works [41, 136, 179] also use different RNN models on EHR data to model diseases and to predict patient diagnoses from health care time series data with irregular time stamps or missing values, but none of them have explicitly attempted to capture and utilize the missing patterns in their RNN units via systematically modified network architectures. For example, one model [41] feeds medical codes along with their time stamps into a GRU model to predict the next medical event. This idea of feeding time stamps along with the input values is equivalent to the baseline GRU-Simple with only the interval δ but not the masking m, which we denote as GRU-Simple w/o m. Another model [179] uses an LSTM model and extends the forget gate in LSTM to a logarithmic or cubic decay function of the time intervals between two time stamps.
Their model is essentially similar to GRU-Simple w/o m. Neither of them considers missing values in time series medical records. In addition, they use the same time stamps for all variables, while our model keeps track of the time stamps at which measurements were made for each variable separately and thus can be more precise. In another work [136], the authors achieve their best performance on diagnosis prediction by filling missing values with zeros and feeding masking vectors into the recurrent neural network. Their model is equivalent to the GRU-Simple model without feeding the time interval (δ), given that the input features are normalized to have mean value 0 before being fed into the RNN model.

Figure 4.3: Graphical illustrations of the original GRU (top-left), GRU-D (bottom-left), and the proposed network architecture (right). The inputs at each step are the variable (x), masking (m), and time interval (δ); parts in cyan refer to the modifications.

We denote it as GRU-Simple w/o δ. In summary, none of the related works mentioned above modify the RNN model structure to further capture and utilize missingness, and our GRU-Simple baseline can be considered as a generalization of all these related RNN models. These approaches solve the missing value issue to a certain extent; however, imputing the missing value with the mean or forward imputation cannot distinguish whether missing values are imputed or truly observed, simply concatenating masking and time interval vectors fails to exploit the temporal structure of missing values, and none of them fully utilize missingness in the data to achieve desirable performance.
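To make the three baseline strategies concrete, the following sketch (our own illustrative code, not the chapter's implementation) prepares a D×T series for GRU-Mean, GRU-Forward, and GRU-Simple; the fallback to the training mean before a variable's first observation is our assumption, as the text leaves that case unspecified:

```python
import numpy as np

def prepare_inputs(x, m, delta, train_mean, mode):
    """Baseline input preparations for a D x T series with masking m,
    intervals delta, and per-variable training means.
    mode: "mean" (Eq. 4.3), "forward" (Eq. 4.4), or "simple" ([x; m; delta])."""
    x = np.where(m == 1, x, 0.0)               # zero out missing entries first
    if mode == "mean":                         # GRU-Mean: training-set mean
        return m * x + (1 - m) * train_mean[:, None]
    if mode == "forward":                      # GRU-Forward: last observation
        filled = np.empty_like(x)
        last = train_mean.copy()               # before the first observation we
        for t in range(x.shape[1]):            # fall back to the mean (our choice)
            last = np.where(m[:, t] == 1, x[:, t], last)
            filled[:, t] = last
        return filled
    if mode == "simple":                       # GRU-Simple: concatenate [x; m; delta]
        x_mean = m * x + (1 - m) * train_mean[:, None]
        return np.concatenate([x_mean, m, delta], axis=0)
    raise ValueError(mode)

# One variable observed at steps 0 and 2, missing at step 1:
m = np.array([[1, 0, 1]])
x = np.array([[1.0, np.nan, 3.0]])
delta = np.array([[0.0, 1.0, 2.0]])
mean = np.array([2.0])
```

On this toy input, "mean" fills the gap with 2.0, "forward" carries 1.0 forward, and "simple" stacks the mean-filled values with the masking and interval rows into a 3×T input.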
4.1.1.3 GRU-D: Model with Trainable Decays

To fundamentally address the issue of missing values in time series, we note two important properties of missing values in time series, especially in the health care domain. First, the value of a missing variable tends to be close to some default value if its last observation happened a long time ago. This property usually exists in health care data for the human body due to homeostasis mechanisms and is considered to be critical for disease diagnosis and treatment [233]. Second, the influence of an input variable will fade away over time if the variable has been missing for a while. For example, one medical feature in EHRs is only significant in a certain temporal context [258]. Therefore, based on the original GRU shown in the top-left of Figure 4.3, we propose a GRU-based model called GRU-D, shown in the bottom-left of Figure 4.3, in which a decay mechanism is designed for the input variables and the hidden states to capture the aforementioned properties. We introduce decay rates in the model to control the decay mechanism by considering the following important factors. First, each input variable in health care time series has its own meaning and importance in medical applications. The decay rates should therefore differ from variable to variable based on the underlying properties associated with the variables. Second, since many missing patterns are informative and potentially useful in prediction tasks but unknown and possibly complex, we aim to learn the decay rates from the training data rather than fixing them a priori. That is, we model a vector of decay rates as

    γ_t = exp{−max(0, W_γ δ_t + b_γ)}    (4.5)

where W_γ and b_γ are model parameters that we train jointly with all the other parameters of the GRU. We choose the exponentiated negative rectifier in order to keep each decay rate monotonically decreasing in a reasonable range between 0 and 1.
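As a numeric illustration of Equation (4.5), the sketch below evaluates the decay for short and long intervals; the randomly initialized positive weights stand in for learned parameters and are purely illustrative:

```python
import numpy as np

# Eq. (4.5): gamma_t = exp{-max(0, W_gamma @ delta_t + b_gamma)}.
rng = np.random.default_rng(0)
D = 4
W_gamma = np.diag(rng.uniform(0.5, 2.0, size=D))  # diagonal, as for the input decay
b_gamma = np.zeros(D)

def decay_rate(delta_t):
    # The exponentiated negative rectifier keeps each rate in (0, 1].
    return np.exp(-np.maximum(0.0, W_gamma @ delta_t + b_gamma))

gamma_recent = decay_rate(np.full(D, 0.1))  # short time since last observation
gamma_stale = decay_rate(np.full(D, 5.0))   # long time since last observation
```

With positive weights, the rate stays near 1 for recently observed variables and decays toward 0 as the interval grows; training W_γ and b_γ lets each variable learn its own decay speed.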
Note that other formulations such as a sigmoid function can be used instead, as long as the resulting decay is monotonic and lies in the same range. Our proposed GRU-D model incorporates two different trainable decay mechanisms to utilize the missingness directly with the input feature values and implicitly in the network hidden states. First, for a missing variable, we use an input decay γ_x to decay it over time toward the empirical mean (which we take as a default configuration), instead of using the last observation as it is. Under this assumption, the trainable decay scheme can be readily applied to the measurement vector by

    x̂^d_t = m^d_t x^d_t + (1 − m^d_t)(γ^d_{x_t} x^d_{t′} + (1 − γ^d_{x_t}) x̃^d)    (4.6)

where x^d_{t′} is the last observation of the d-th variable (t′ < t) and x̃^d is the empirical mean of the d-th variable. When decaying the input variable directly, we constrain W_{γ_x} to be diagonal, which effectively makes the decay rate of each variable independent from the others. Sometimes the input decay may not fully capture the missing patterns, since not all missingness information can be represented in decayed input values. In order to capture richer knowledge from missingness, we also have a hidden state decay γ_h in GRU-D. Intuitively, this has the effect of decaying the extracted features (recurrent hidden states) rather than the raw input variables directly. This is implemented by decaying the previous hidden state h_{t−1} before computing the new hidden state h_t as

    ĥ_{t−1} = γ_{h_t} ⊙ h_{t−1}    (4.7)

in which case we do not constrain W_{γ_h} to be diagonal. In addition, we feed the masking vectors (m_t) directly into the model. The update functions of GRU-D are

    r_t = σ(W_r x̂_t + U_r ĥ_{t−1} + V_r m_t + b_r)    (4.8)
    z_t = σ(W_z x̂_t + U_z ĥ_{t−1} + V_z m_t + b_z)    (4.9)
    h̃_t = tanh(W x̂_t + U(r_t ⊙ ĥ_{t−1}) + V m_t + b)    (4.10)
    h_t = (1 − z_t) ⊙ ĥ_{t−1} + z_t ⊙ h̃_t    (4.11)

It is worth noting the main differences between the formulas of GRU-D and those of the standard GRU.
First, x_t and h_{t−1} are replaced by x̂_t and ĥ_{t−1} from Equation (4.6) and Equation (4.7), respectively. Second, the masking vector m_t is fed into the model, and V_z, V_r, V are new parameters for it. In our final prediction model, we use the proposed GRU-D component at each time step, and apply a fully connected prediction layer, which has a sigmoid activation for binary classification tasks or a soft-max activation for multi-class classification tasks, on top of the last GRU component. The network architecture is shown in the right of Figure 4.3. For all datasets in our experiments, the same network structure is used with different settings of network size, including the input, hidden state, and output dimensions and the temporal lengths. The idea of the decay term can be generalized to LSTM straightforwardly, and it can be generalized to other domains where time series data come with missing patterns that contain useful information, in a variety of practical applications.

4.1.1.4 Baseline Imputation Methods

A common way to solve a classification task with missing values is to first fill in the missing values and then apply predictive models to the imputed data. This usually requires training additional models with extra running cost, and the quality of the imputed data cannot be guaranteed. Our model avoids relying on external imputation methods, but to have a fair and complete comparison, we test several interpolation and imputation methods and apply them to other prediction baselines in our experiments. We include the following interpolation and imputation methods:

• Mean, Forward, Simple: We take mean-imputation (Mean), forward-imputation (Forward), and concatenating the measurement with masking and time interval (Simple) as three imputation baselines. These strategies are described in Section 4.1.1.2 and can be applied directly to all predictive models without training any imputation models.
• SoftImpute [148]: This method uses matrix completion via iterative soft-thresholded singular value decomposition (SVD) to impute missing values.

• KNN [19]: This method uses k-nearest neighbors to find similar samples and imputes unobserved data with a weighted average of similar observations.

• CubicSpline [51]: In this method, we use cubic splines to interpolate each feature at different time steps.

• MICE [12]: The multiple imputation by chained equations (MICE) method is widely used in practice; it uses chained equations to create multiple imputations for variables of different types.

• MF [119]: We use matrix factorization (MF) to fill the missing items in the incomplete matrix by factorizing the matrix into two low-rank matrices.

• PCA [52]: We impute the missing values with a principal component analysis (PCA) model.

• MissForest [217]: This is a non-parametric imputation method which uses random forests trained on the observed values to predict the missing values.

For the MICE, MF, and PCA methods, we treat a multivariate time series X ∈ R^{T×D} as T data samples and impute them independently, so that these methods can be applied to time series with different lengths. However, for the SoftImpute and KNN methods, taking each time step as one sample is unaffordable in terms of running time and space. We instead treat each time series X as one data point in these two imputation methods, and therefore cannot use them on time series with different lengths. We implemented these models in Python based on the fancyimpute [196], predictive_imputer [64], and SciPy [108] libraries. We followed their original code and papers for hyperparameter setting and tuning strategies.

4.1.1.5 Baseline Prediction Methods

We categorize all evaluated prediction models used in our experiments into three groups:

• Non-RNN Baselines (Non-RNN): We evaluate logistic regression (LR), support vector machines (SVM), and random forests (RF), which are widely used in health care applications.
We use all the imputation methods described in Section 4.1.1.4 to fill in the missing values before applying these prediction methods.

• RNN Baselines (RNN): We take the three models described before (GRU-Mean, GRU-Forward, GRU-Simple) and LSTM-Mean (an LSTM model with mean-imputation on the missing measurements) as the RNN baselines. As mentioned before, similar models are widely used in existing work [41, 136, 179] applying RNNs to health care time series data with missing values or irregular time stamps. We include two variations, GRU-Simple w/o δ [136] and GRU-Simple w/o m [41, 179]. We also test the GRU prediction models on the imputed data.

• Proposed Methods (Proposed): This is our proposed GRU-D model.

4.1.2 Experiments

4.1.2.1 Dataset and Task Descriptions

We demonstrate the performance of our proposed models on the MIMIC-III dataset and the PhysioNet dataset by comparing them to several strong machine learning and deep learning approaches on classification tasks. We evaluate our models in different settings, such as early prediction and different training sizes, and investigate the impact of missing values.

Tasks on the PhysioNet dataset We take the 4,000 records from Training Set A, which are 48-hour multivariate time series with 33 variables. We conduct the following two prediction tasks on this dataset:

• Mortality prediction task: Predict whether the patient dies in the hospital. There are 554 patients with a positive mortality label. We treat this as a binary classification problem.

• All-4 task: Predict the following four tasks altogether: in-hospital mortality, length-of-stay less than 3 days, whether the patient had a cardiac condition, and whether the patient was recovering from surgery. We treat this as a multi-task classification problem.

Tasks on the MIMIC-III dataset We extract 99 time series features from 19,714 admission records.
We only include patients who are alive during the first 48 hours after admission, and we only use the data from those first 48 hours. We perform the following two predictive tasks on this dataset:

• Mortality task: Predict whether the patient dies in the hospital after 48 hours. There are 1,716 patients with a positive mortality label, and we perform binary classification.

• ICD-9 task: Predict 20 ICD-9 diagnosis categories (e.g., respiratory system diagnosis) for each admission as a multi-task classification problem.

We also have some evaluations on synthetic datasets (Gesture datasets) derived from a gesture phase segmentation dataset.

4.1.2.2 Implementation Details

The non-RNN models cannot directly handle time series of different lengths. We carefully design the experiments for them to capture the informative missingness as much as possible, to have a fair comparison with the RNN methods. We regularly sample the time series data to get fixed-length inputs and perform all baseline imputation methods to fill in the missing values. For the concatenation method (Simple) with the non-RNN methods, we concatenate the masking vector along with the measurements of the regularly sampled time series. For the PhysioNet dataset we sample the time series on an hourly basis and propagate measurements forward (or backward) in time to fill gaps, and for the MIMIC-III dataset we take two-hourly samples (in the first 48 hours) and do forward (or backward) imputation. Our preliminary experiments showed that 2-hourly samples obtain better performance than one-hourly samples for MIMIC-III. We choose the Gaussian radial basis function (RBF) kernel for SVM since it performs better than other kernels. We use scikit-learn [177] for the non-RNN model implementations and tune the parameters by cross-validation.
For the RNN models, we use one recurrent layer to model the sequence unless otherwise stated, and then apply a soft-max regressor on top of the last hidden state h_T to do classification, as shown in the right of Figure 4.3. We use 100 and 64 hidden units in GRU-Mean for the MIMIC-III and PhysioNet datasets, respectively. In order to fairly compare the capacity of all GRU-RNN models, we build each model at a proper size so that they share a similar number of parameters. We show the statistics of our GRU-based models for the three datasets in Table 4.1. For the two real datasets, we show the numbers for mortality prediction; the numbers for the multi-task classifications are also close for all the compared models. In Table 4.1, Vars., Size, and Pars. represent the number of input features/variables in the dataset, the number of hidden states (h) in one recurrent unit, and the number of all parameters in the prediction model, respectively.

Table 4.1: Size comparison of GRU models used in the GRU-D experiments.

    Dataset                   Other GRU Models   GRU-Simple    GRU-D
    Gesture     # of Vars.          18               18          18
                Size                64               50          55
                # of Pars.      16,281           16,025      16,561
    MIMIC-III   # of Vars.          99               99          99
                Size               100               56          67
                # of Pars.      60,105           59,533      60,436
    PhysioNet   # of Vars.          33               33          33
                Size                64               43          49
                # of Pars.      18,885           18,495      18,838

In addition, having a comparable number of parameters also keeps the number of iterations and training time of all the prediction models in the same scale in all the experiments. Batch normalization [103] and dropout [212] with rate 0.5 are applied to the top regressor layer. We train all the RNN models with the Adam optimization method [115] and use early stopping to find the best weights on the validation dataset. All RNN models are implemented with Keras [44] and Theano [5] in Python. All the input variables are normalized to have mean 0 and standard deviation 1. We report the results from 5-fold cross validation in terms of the area under the receiver operating characteristic curve (AUROC) score. To further evaluate the proposed models, we also provide more detailed comparisons and evaluations on multilayer RNN models and with different model sizes.

4.1.2.3 Evaluations on Synthetic Datasets

As illustrated in Figure 4.1, missing patterns can be useful in solving prediction tasks. A robust model should exploit informative missingness properly and avoid introducing nonexistent relations between missingness and predictions. To evaluate the impact of modeling missingness, we conduct experiments on synthetic datasets derived from the gesture phase segmentation dataset (Gesture) [142]. This dataset, which is from the UCI machine learning repository [57], has multivariate time series features, regularly sampled and with no missing values, for 5 different gesticulations. We extract 378 time series and generate 4 synthetic datasets for the purpose of understanding model behaviors under different missing patterns, and conduct multi-class classification tasks.

Generating synthetic datasets on the Gesture dataset The Gesture dataset is composed of features extracted from 7 videos of people gesticulating. It contains time series with 18 numerical attributes and timestamps, and a set of 32 processed features with no timestamps. No missing values exist in this original dataset. Since the processed features are not exactly mapped to the timestamps, we use only the 18 raw features and their corresponding timestamps. A phase label from 5 possible phases (rest, preparation, hold, stroke, and retraction) is assigned at each time step. Noticing that the labels for neighbouring time steps are usually the same, we first generate non-overlapping time series with the same label from the original data, which has 9,900 time stamps in total. From the beginning of each new phase, we truncate the time series into segments of 30 time steps until the end of the phase segment. We ignore the last extracted time series if it is shorter than 7 time steps. After this step, we get 378 time series with different lengths.
The numbers of time series for the 5 labels are 65, 115, 76, 49, and 73, respectively. We then generate several synthetic datasets by manually introducing missing values. Our goal is that all the synthetic datasets have the same overall average missing rate (50%), while the correlations between the missing rates and the labels differ across datasets. We generate datasets with the desired properties in the following way. First, for each feature d ∈ {1, ..., 18}, we randomly sample a number s_d ∈ {−1, 1} with equal probabilities to indicate whether the missing rate of that feature has a positive or negative correlation with the labels. Then for each sample i, we randomly choose a missing rate r_{i,d} from a uniform distribution

    r_{i,d} ∼ U[0.3 + C · s_d · y_i, 0.7 + C · s_d · y_i]

where y_i ∈ {1, ..., 5} is the label of the i-th sample, and C is a constant parameter for that synthetic dataset. We select a proper value of C to control the average absolute value of the Pearson correlation between the missing rate of each feature r_d and the label y. We then randomly introduce missing values based on the corresponding missing rates. We repeat the above steps to generate 4 synthetic datasets, with average absolute correlation values of 0, 0.2, 0.5, and 0.8. A setting with higher correlation implies more informative missingness. We evaluate the model performance on the phase classification task.

Figure 4.4: Classification performance on the Gesture synthetic datasets with different correlation values and random missing values (AUROC scores of GRU-Mean, GRU-Forward, GRU-Simple, and GRU-D at average absolute correlations of 0, 0.2, 0.5, and 0.8).

Figure 4.4 shows the AUROC score comparison of the three GRU baseline models (GRU-Mean, GRU-Forward, GRU-Simple) and the proposed GRU-D in the 4 different settings with similar missing rates but different correlations between the missing rates and the label.
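The label-correlated missing-rate sampling described above can be sketched as follows (illustrative code; the value of C and the clipping of the rates into [0, 1] are our assumptions, since the chapter only says C is tuned per dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
D, C = 18, 0.03                        # C is illustrative; the chapter tunes it
s_sign = rng.choice([-1, 1], size=D)   # per-feature correlation sign s_d

def introduce_missing(x, y):
    """Mask entries of a D x T series x with label y in {1,...,5} by drawing
    r_{i,d} ~ U[0.3 + C*s_d*y, 0.7 + C*s_d*y] and dropping at that rate."""
    lo = np.clip(0.3 + C * s_sign * y, 0.0, 1.0)   # clipping is our guard,
    hi = np.clip(0.7 + C * s_sign * y, 0.0, 1.0)   # not stated in the text
    r = rng.uniform(lo, hi)                        # one missing rate per feature
    missing = rng.random(x.shape) < r[:, None]     # True -> drop this entry
    return np.where(missing, np.nan, x), missing

x_missing, mask = introduce_missing(np.ones((D, 40)), y=3)
```

Averaged over features, the signed offsets C·s_d·y cancel out, so the overall missing rate stays near 50% while each feature's rate becomes correlated with the label.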
First, GRU-Mean and GRU-Forward do not utilize any missingness information (i.e., masking or time interval) and perform similarly across all 4 settings. GRU-Simple and GRU-D benefit from utilizing the missingness, so they obtain better performance when the correlation increases. They achieve similar, and the best, performance on the dataset with the highest correlation. However, when the correlation is low or non-existent, simply feeding the missingness representations may introduce irrelevant information. As shown in Figure 4.4, GRU-Simple fails when the correlation is low. On the other hand, GRU-D has stable performance and achieves the best AUROC scores in all the settings. These empirical findings validate our assumption that GRU-D utilizes the missing patterns only when the correlations are high and relies on the observed values when the correlations between labels and missing rates are low. Further, these results on synthetic datasets demonstrate that GRU-D can model the missing patterns properly and does not introduce any non-existent relations.

Table 4.2: Model performances measured by AUROC score (mean±std) for the mortality prediction task on multivariate time series with missing values.
(LR, SVM, and RF are non-RNN models; the LSTM and GRU variants are RNN models.)

Mortality Prediction on the MIMIC-III Dataset
    LSTM-Mean 0.8142±0.014
    LR-Mean 0.7589±0.015          SVM-Mean 0.7908±0.006          RF-Mean 0.8293±0.004          GRU-Mean 0.8252±0.011
    LR-Forward 0.7792±0.018       SVM-Forward 0.8010±0.004       RF-Forward 0.8303±0.003       GRU-Forward 0.8192±0.013
    LR-Simple 0.7715±0.015        SVM-Simple 0.8146±0.008        RF-Simple 0.8294±0.007        GRU-Simple w/o δ 0.8367±0.009
    LR-SoftImpute 0.7598±0.017    SVM-SoftImpute 0.7540±0.012    RF-SoftImpute 0.7855±0.011    GRU-Simple w/o m 0.8266±0.009
    LR-KNN 0.6877±0.011           SVM-KNN 0.7200±0.004           RF-KNN 0.7135±0.015           GRU-Simple 0.8380±0.008
    LR-CubicSpline 0.7270±0.005   SVM-CubicSpline 0.6376±0.018   RF-CubicSpline 0.8339±0.007   GRU-CubicSpline 0.8180±0.011
    LR-MICE 0.6965±0.019          SVM-MICE 0.7169±0.012          RF-MICE 0.7159±0.005          GRU-MICE 0.7527±0.015
    LR-MF 0.7158±0.018            SVM-MF 0.7266±0.017            RF-MF 0.7234±0.011            GRU-MF 0.7843±0.012
    LR-PCA 0.7246±0.014           SVM-PCA 0.7235±0.012           RF-PCA 0.7747±0.009           GRU-PCA 0.8236±0.007
    LR-MissForest 0.7279±0.016    SVM-MissForest 0.7482±0.016    RF-MissForest 0.7858±0.010    GRU-MissForest 0.8239±0.006
    Proposed GRU-D 0.8527±0.003

Mortality Prediction on the PhysioNet Dataset
    LSTM-Mean 0.8025±0.013
    LR-Mean 0.7423±0.011          SVM-Mean 0.8131±0.018          RF-Mean 0.8183±0.015          GRU-Mean 0.8162±0.014
    LR-Forward 0.7479±0.012       SVM-Forward 0.8140±0.018       RF-Forward 0.8219±0.017       GRU-Forward 0.8195±0.004
    LR-Simple 0.7625±0.004        SVM-Simple 0.8277±0.012        RF-Simple 0.8157±0.014        GRU-Simple 0.8226±0.010
    LR-SoftImpute 0.7386±0.007    SVM-SoftImpute 0.8057±0.019    RF-SoftImpute 0.8100±0.016    GRU-SoftImpute 0.8125±0.005
    LR-KNN 0.7146±0.011           SVM-KNN 0.7644±0.018           RF-KNN 0.7567±0.012           GRU-KNN 0.8155±0.004
    LR-CubicSpline 0.6913±0.022   SVM-CubicSpline 0.6364±0.015   RF-CubicSpline 0.8151±0.015   GRU-CubicSpline 0.7596±0.020
    LR-MICE 0.6828±0.015          SVM-MICE 0.7690±0.016          RF-MICE 0.7618±0.007          GRU-MICE 0.8153±0.013
    LR-MF 0.6513±0.014            SVM-MF 0.7515±0.022            RF-MF 0.7355±0.022            GRU-MF 0.7904±0.012
    LR-PCA 0.6890±0.019           SVM-PCA 0.7741±0.014           RF-PCA 0.7561±0.025           GRU-PCA 0.8116±0.007
    LR-MissForest 0.7010±0.018    SVM-MissForest 0.7779±0.008    RF-MissForest 0.7890±0.016    GRU-MissForest 0.8244±0.012
    Proposed GRU-D 0.8424±0.012

4.1.2.4 Mortality Prediction Evaluations

We evaluate all methods on the MIMIC-III and PhysioNet datasets. We notice that dropout in the recurrent layer helps considerably for all RNN models on both datasets. We apply recurrent dropout [73] with rate 0.3 and the same dropout samples at each time step on the weights W, U, V. Table 4.2 shows the prediction performance of all the models on the mortality prediction task. With the simple imputation methods (Mean, Forward, Simple), all the prediction models except RF show improved performance when they concatenate the missingness indicators along with the inputs. The two-step imputation-prediction methods did not improve the prediction performance on these two datasets, and in many cases these methods produce worse predictions. This is probably due to the high missing rates in both datasets (> 80%); those imputation methods are not designed for such high missing rates. For example, datasets with missing rates of 10% to 30% are reported in the related work [217]. Among all these imputation methods, SoftImpute performs the best with LR and SVM. CubicSpline, which captures the temporal structure of the data, performs the best with RF, but fails with SVM and GRU. MissForest provides slightly better performance with the GRU models than the other additional imputation baselines. It is worth noting that all these imputation baselines, especially MICE, MF, PCA, and MissForest, generally require a substantial amount of time to train and tune hyperparameters, which makes the two-step procedure quite inefficient. Our proposed GRU-D achieves the best AUROC score on both datasets compared with all the other baselines.
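All scores in Table 4.2 are AUROC, which reduces to the Mann-Whitney statistic: the probability that a randomly chosen positive case is scored above a randomly chosen negative one. A self-contained reference implementation (pure NumPy; the pairwise form is ours and is fine for small samples):

```python
import numpy as np

def auroc(y_true, y_score):
    """AUROC as the Mann-Whitney statistic: P(score_pos > score_neg),
    counting ties as 1/2. O(n_pos * n_neg), for illustration only."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]             # scores of positive cases
    neg = y_score[y_true == 0]             # scores of negative cases
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties
```

For example, `auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` gives 0.75: three of the four positive-negative pairs are ranked correctly.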
4.1.2.5 Multi-Task Prediction Evaluations

In the remainder of the experiments, we use GRU-Simple as a representative for all GRU-Simple variant models [41, 136, 179], since it obtains the best or comparable performance among them. The RNN model for multi-task learning with n tasks is almost the same as that for binary classification, except that 1) the soft-max prediction layer is replaced by a fully connected layer with n sigmoid logistic functions, and 2) a data-driven prior regularizer, parameterized by comorbidity (co-occurrence) counts in the training data, is applied to the prediction layer to improve the classification performance. We conduct multi-task classification experiments for the All-4 task on PhysioNet and the ICD-9 task on MIMIC-III using all the GRU models. As shown in Table 4.3, the comparison of all methods is quite similar to that for the mortality prediction task. GRU-D performs best in terms of average AUROC score across all tasks and in most of the single tasks. On the MIMIC-III dataset, GRU-MissForest and GRU-Simple provide the best performance among all baselines, while all simple imputations perform better than the additional imputation baselines on the PhysioNet dataset.

Table 4.3: Model performances measured by average AUROC score (mean±std) for multi-task predictions on real datasets with missing values.

    Models            ICD-9 Task on MIMIC-III   All-4 Task on PhysioNet
    GRU-Mean          0.7070±0.001              0.8099±0.011
    GRU-Forward       0.7077±0.001              0.8091±0.008
    GRU-Simple        0.7105±0.001              0.8249±0.010
    GRU-CubicSpline   0.6372±0.005              0.7451±0.011
    GRU-MICE          0.6717±0.005              0.7955±0.003
    GRU-MF            0.6805±0.004              0.7727±0.003
    GRU-PCA           0.7040±0.002              0.8042±0.006
    GRU-MissForest    0.7115±0.003              0.8076±0.009
    Proposed GRU-D    0.7123±0.003              0.8370±0.012

4.1.3 Discussions

4.1.3.1 Investigations of the Relation between Missingness and Labels

We provide more details about Figure 4.1 and empirically confirm this claim on real health care datasets by investigating the correlation between the missingness and the prediction labels (mortality and ICD-9 diagnosis categories). For each patient and the corresponding time series X, we denote the missing rate for a variable d as p^d_X and calculate it by

    p^d_X = 1 − (1/T) Σ_{t=1}^T m^d_t

Note that p^d_X depends on the masking vectors (m^d_t) of that patient and the number of time steps T. Then for each patient, we have a vector p_X = (p^1_X, ..., p^D_X)^T representing the missing rates of all D time series features for that patient. For each prediction task denoted by label ℓ and each d-th feature, we compute the Pearson correlation coefficient between the variable p^d and ℓ over the entire dataset. Figure 4.1 and Figure 4.5 respectively show this analysis on the MIMIC-III and PhysioNet datasets. In these two figures, the bottom part shows the missing rate of each input variable, and the top parts show the absolute values of the Pearson correlation coefficients between the missing rate of each variable and selected labels to predict, indicated by different colors.

Figure 4.5: Demonstration of informative missingness on the PhysioNet dataset: the absolute values of Pearson correlations between variable missing rates (bottom) and 4 labels (mortality, length-of-stay < 3 days, surgery, and cardiac condition).

As shown in Figure 4.1, we observe that in the MIMIC-III dataset the variables with low missing rates usually have missing rates that are highly (either positively or negatively) correlated with the labels. The distinct correlation between missingness and labels demonstrates the usefulness of missingness patterns in solving prediction tasks.
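The per-patient missing-rate statistic above can be computed in a few lines (illustrative code; `masks` stands for a list of per-patient D×T_n masking arrays, whose lengths may differ):

```python
import numpy as np

def missing_rate_label_corr(masks, labels):
    """Per-patient missing rates p_X^d = 1 - (1/T) sum_t m_t^d, then the
    Pearson correlation of each variable's rate with a binary label."""
    p = np.stack([1.0 - m.mean(axis=1) for m in masks])  # N x D missing rates
    y = np.asarray(labels, dtype=float)
    yc = y - y.mean()                                    # center the labels
    pc = p - p.mean(axis=0)                              # center the rates
    return (pc * yc[:, None]).sum(axis=0) / (
        np.sqrt((pc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()))

# Toy check: variable 0 is missing exactly for the positive patients,
# while variable 1's missing rate is unrelated to the label.
masks = [np.array([[1, 1, 1, 1], [1, 1, 1, 0]]),
         np.array([[1, 1, 1, 1], [1, 0, 0, 0]]),
         np.array([[0, 0, 0, 0], [1, 1, 1, 0]]),
         np.array([[0, 0, 0, 0], [1, 0, 0, 0]])]
corr = missing_rate_label_corr(masks, [0, 0, 1, 1])
```

In this toy case, the correlation is 1 for the label-tracking variable and 0 for the unrelated one; on the real datasets, taking absolute values of such coefficients produces the top panels of Figures 4.1 and 4.5.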
Similar meaningful correlations in the PhysioNet dataset are also validated in Figure 4.5.

4.1.3.2 Validating and Interpreting the Learned Decays

To validate the GRU-D model and demonstrate how it utilizes informative missing patterns, we take the PhysioNet mortality prediction task as a study case, and show the input decay (γ_x) plots and hidden state decay weight (W_γh) histograms for each input variable.

[Figure 4.6: Plots of the input decay γ_{x_t} for all variables (top; x-axis: time interval between 0 and 24 hours, y-axis: value of the decay rate) and histograms of the hidden state decay weights W_γh for 10 variables (bottom; x-axis: value of the decay parameters, y-axis: frequency; MR: missing rate) in the GRU-D model for the mortality prediction task on the PhysioNet dataset.]

From the top of Figure 4.6, which plots the learned input decay rate, we notice that the decay rate is almost constant for the majority of variables. However, a few variables have large decays, which means that the value of the observation at the current time step is very important for prediction, and the model relies less on the previous observations. For example, the changes in the values of a patient's weight (missing rate 0.5452), arterial pH (missing rate 0.9118), temperature (missing rate 0.6915), and respiration rate (missing rate 0.8053) are known to impact ICU patients' health condition.

In the bottom of Figure 4.6, we plot the histograms of the hidden state decay parameters (weights W_γh) corresponding to the input variables with the highest and the lowest missing rates among all features (5 subfigures each). We find that the absolute parameter values are larger for variables with lower missing rates. For example, heart rate and cholesterol have the lowest and highest missing rates among all variables in this dataset. Our plot shows that the hidden decay weights corresponding to heart rate have a much larger scale than those of cholesterol.
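The decays being inspected here follow the form γ_t = exp(−max(0, W_γ δ_t + b_γ)), where δ_t is the time interval since each variable was last observed. The following is a minimal numpy sketch of this mechanism, not the full GRU-D implementation: the per-variable (diagonal) parameterization, the weight values, and the intervals are all illustrative.

```python
import numpy as np

def decay_rate(delta, W_gamma, b_gamma):
    """Learnable decay: gamma = exp(-max(0, W_gamma * delta + b_gamma)).

    A larger weight or a longer interval gives a smaller gamma, i.e. the
    model trusts a stale observation less as time since it was measured grows.
    """
    return np.exp(-np.maximum(0.0, W_gamma * delta + b_gamma))

# Toy setup: 3 variables observed 2 hours ago, with made-up decay weights.
W_gamma = np.array([0.5, 0.05, 0.0])   # per-variable decay weights
b_gamma = np.zeros(3)
delta = np.array([2.0, 2.0, 2.0])      # hours since last observation

gamma = decay_rate(delta, W_gamma, b_gamma)
# A variable with a larger weight decays faster: gamma[0] < gamma[1] < gamma[2].

# Input decay pulls the last observation toward the training mean.
x_last, x_mean = np.ones(3), np.zeros(3)
x_decayed = gamma * x_last + (1.0 - gamma) * x_mean
```

With a zero training mean, the decayed input equals γ itself; the near-constant decay rates seen for most variables in Figure 4.6 correspond to weights close to zero in this parameterization.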
The larger weights for variables with lower missing rates indicate that the time intervals of these variables have more impact on the hidden state decay in our model. Notice that this is consistent with our preliminary investigation (in Figure 4.1) that the mortality and the missing rate have larger correlations for variables with lower missing rates. These findings show that our model successfully recognizes useful missing patterns directly from the data.

4.1.3.3 Early Prediction Capacity

Although our model is trained on the first 48 hours of data and makes predictions at the last time step, it can make predictions on the fly with partial observations. This is useful in applications such as health care, where early decision making is beneficial and critical for patient care. The left of Figure 4.7 shows the online prediction results for the MIMIC-III mortality prediction task. We compare the RNN models with three widely used non-RNN models in practice, which are LR-Simple, SVM-Simple, and RF-Simple. Since these RNN models only take the statistical mean from the training examples or use forward imputation on the time series, no future information of the time series is used when we make predictions at each time step for time series in the test dataset.

As we can see, the AUROC score is around 0.7 in the first 12 hours for all the GRU models, and it keeps increasing with further observations. GRU-D and GRU-Simple, which explicitly handle missingness, perform consistently better than the other two RNN methods. In addition, GRU-D outperforms GRU-Simple when making predictions given time series of more than 24 hours, and has an at least 2.5% higher AUROC score after 30 hours. This indicates that GRU-D is able to capture and utilize long-range temporal missing patterns. Furthermore, GRU-D achieves similar prediction performance (i.e., the same AUROC score) as the best non-RNN baseline model with less time series data.
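The on-the-fly evaluation used here can be sketched as follows: truncate each test series to its first h hours and score the predictions made from those partial observations. The toy `MeanRiskModel` and the data below are stand-ins invented for illustration; any predictor that accepts a truncated series could be plugged in.

```python
import numpy as np

def auroc(labels, scores):
    """Pairwise AUROC: fraction of (positive, negative) pairs ranked
    correctly, counting ties as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

class MeanRiskModel:
    """Toy stand-in predictor: risk score = mean of values observed so far."""
    def predict(self, x):
        return float(x[:, 1].mean())

def online_auroc(model, series, labels, horizons):
    """Early-prediction evaluation: each series has columns (hour, value);
    truncate to the first h hours, predict, and score against the labels."""
    return {h: auroc(labels, [model.predict(x[x[:, 0] <= h]) for x in series])
            for h in horizons}

# Two toy patients: the positive case only separates from the negative one
# after hour 12, so the score improves as more of the window is observed.
pos = np.array([[6.0, 0.4], [24.0, 0.9], [42.0, 0.95]])
neg = np.array([[6.0, 0.5], [24.0, 0.2], [42.0, 0.1]])
scores = online_auroc(MeanRiskModel(), [pos, neg], [1, 0], [12, 48])
```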
As shown in the figure, GRU-D has the same AUROC performance at 36 hours as the best non-RNN baseline model (RF-Simple) at 48 hours. This 12-hour improvement of GRU-D over the two commonly used non-RNN baselines is highly significant in hospital settings such as the ICU, where accurate early prediction is necessary for making time-saving critical decisions.

4.1.3.4 Model Scalability with Growing Data Size

In many practical applications with large datasets, model scalability is very important. To evaluate the model performance with different training dataset sizes, we subsample two smaller datasets of 2,000 and 10,000 admissions from the entire MIMIC-III dataset while keeping the same mortality rate. We compare our proposed models with all GRU baselines and the two most competitive non-RNN baselines (SVM-Simple, RF-Simple) and present the prediction results in the right of Figure 4.7. We observe that all models achieve improved performance given more training samples. However, the improvements of the non-RNN baselines are quite limited compared to the GRU models, and our GRU-D model achieves the best results on the two larger datasets. These results indicate that the performance gap between GRU-D and the non-RNN baselines continues to grow as more data become available.

[Figure 4.7: Early prediction capacity (left) and model scalability comparisons (right) of GRU-D and other RNN baselines on the MIMIC-III dataset.]

4.1.3.5 Comparison to Existing Studies on Mortality Prediction

A series of works comparing and benchmarking the prediction performance of existing machine learning and deep learning models on the MIMIC-III dataset has been conducted recently [107, 182].
In the recent reproducibility summary [107], the authors summarized the results of recently published methods for the MIMIC-III mortality prediction task, and the results of our method are among the best, as shown in Table 4 of their paper. It is worth noting that there are no standard cohorts (i.e., no standard patient and variable inclusion criteria) in the MIMIC-III dataset for prediction analysis. The sample size and mortality rate are quite different among studies, and therefore the quantitative results are difficult to compare directly among all studies mentioned in the reproducibility summary. Our model outperforms the model with similar data settings [140] by 1.27% AUROC score. To make a fair and accurate comparison in our experiments, we choose the most competitive and relevant prediction baselines, which are the RNN methods [41, 136, 179]. Similar to existing work [107], which compared results across different cohorts using logistic regression and gradient boosting trees, we use logistic regression, support vector machines, and random forests as baseline prediction models and show a relative improvement of 2.2% AUROC score on the MIMIC-III dataset from our proposed models over the best of these baselines. In addition, to demonstrate the usefulness of modeling missing patterns, we show the results of all predictive methods which use the imputed data from various imputation approaches.

4.1.3.6 Limitations

Our proposed model focuses on the goal of making accurate and robust predictions on multivariate time series data with missing values. The model relies on the information related to the prediction tasks, which is represented in the missing patterns, to improve the prediction performance over the original GRU-RNN baselines. If the missingness is not informative at all, or the inherent correlation between the missing patterns and the prediction tasks is not clear, our model may gain limited improvements or even fail.
This requires a good understanding of the applied domains. Though our proposed model can be used in many health care applications and other application domains such as traffic and climate informatics, where informative missingness is present, the decay mechanism needs to be explicitly redesigned.

The proposed method is not explicitly designed for filling in the missing values in the data, and cannot be directly used in unsupervised settings without prediction labels. Though the proposed model's structure can be modified for data imputation tasks, this requires additional evaluation and study, which is beyond the scope of this work.

Our proposed models are only evaluated in retrospective observational study settings, due to the inherent limitation of the publicly available datasets used in our study. However, in general, the proposed model can be used in other applications in practice. First, we can deploy the proposed method in a prospective observational study and validate the findings of the retrospective study. Second, by investigating the decay terms learnt from our model, doctors can assess the impact of missing data for each variable, and improve data collection strategies to acquire more important variables. In addition, the prediction from our model can be used as a surrogate to the scoring systems used in the ICU, and thus it can be used to ensure similar baseline risks between comparative groups in clinical trials or to decide what kind of intervention needs to be given. Finally, the proposed model can also be used for real-time mortality risk assessment for ICU patients and can indicate how the health status of a patient evolves over time.

4.2 Modeling Multi-Rate Multivariate Time Series

Multivariate time series (MTS) analysis [87, 191] has attracted a lot of attention in health care applications.
State-space models such as Kalman filters [110] and hidden Markov models [184] have been developed to model MTS and have shown promising results on prediction tasks such as forecasting and interpolation. However, in many applications, the MTS observations usually come from multiple sources and are often characterized by various sampling rates. For example, in data from the ICU, vital signs such as heart rate are sampled frequently, while lab results such as pH are measured infrequently. Such time series observations with either regular or irregular sampling rates are termed multi-rate multivariate time series (MRMTS).

Modeling MRMTS using state-space models is challenging since MRMTS naturally comes with multiple temporal dependencies, and these dependencies may not have a direct relationship to the sampling rates. That is, the long- and short-term temporal dependencies may be associated with a few or all of the time series with different sampling rates. Capturing these temporal dependencies is important as they model the underlying data generation mechanism, and they impact the interpolation and forecasting tasks. Upsampling or downsampling MRMTS to a single-rate time series cannot address this challenge, since these simple techniques may artificially introduce or remove some naturally occurring dependencies present in MRMTS. For example, forward/backward imputation will introduce long-term dependencies. Therefore, building models which can capture multiple temporal dependencies directly from the MRMTS data is still an open problem in the time series analysis field.

Deep learning models such as recurrent neural networks [96] have emerged as successful models for time series analysis [84, 150] and sequence modeling applications [210, 246].
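The resampling pitfall above can be made concrete with a toy example (all numbers illustrative): converting MRMTS to a single rate by forward-filling a slowly sampled lab replicates a stale value across many high-rate steps, injecting a long-term dependency that is not present in the raw measurements.

```python
import numpy as np

# Toy 12-hour MRMTS: a vital sign sampled hourly, a lab every 6 hours.
vitals = np.linspace(70.0, 81.0, 12)   # rate 1: 12 hourly samples
labs = np.array([7.35, 7.40])          # rate 2: 2 six-hourly samples

def to_single_rate(low_rate, factor):
    """Naive upsampling by forward fill: repeat each low-rate value
    `factor` times. Every repeated step now carries an observation that
    may be hours old -- an artificial long-term dependency."""
    return np.repeat(low_rate, factor)

labs_hourly = to_single_rate(labs, 6)
# Hours 0-5 all "see" the hour-0 lab value, although nothing was measured then.
```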
While deep discriminative models [46, 90, 146, 174] have been shown to model the complex non-linear temporal dependencies present in MTS, deep generative models [74, 192] have become more popular since they are intuitive, interpretable, more powerful than their discriminative counterparts [63], and able to capture the data generation process. Despite their success with single-rate time series data, the existing deep generative models are not suitable for modeling MRMTS as they are not designed to capture multiple temporal dependencies from different sampling rates.

Recently, latent hierarchical structure learning based on deep learning models has led to remarkable advances in capturing temporal dependencies from sequential data [46, 91, 120]. Motivated by these models, we propose a novel deep generative model termed the Multi-Rate Hierarchical Deep Markov Model (MR-HDMM), which learns multiple temporal dependencies directly from MRMTS by jointly modeling time series with different sampling rates. MR-HDMM learns the latent hierarchical structures along with learnable switches and captures the data generation process of MRMTS. It simultaneously learns an inference network and a generative model by leveraging a structured variational approximation parameterized by recurrent neural networks to mimic the posterior distribution. The data generation process of MR-HDMM can automatically infer the hierarchical structures directly from data, which is extremely helpful for downstream tasks such as interpolation and forecasting.

In this chapter, we introduce a first-of-its-kind deep generative model called MR-HDMM to systematically capture the multiple temporal dependencies present in MRMTS by using hierarchical latent structures and learnable switches. In addition, we propose a new structured inference network for MR-HDMM.
A comprehensive and systematic evaluation of the MR-HDMM model is conducted on the MIMIC-III dataset to demonstrate its state-of-the-art performance in forecasting and interpolation tasks. Finally, we interpret the learnt latent hierarchies from MR-HDMM to study the captured temporal dependencies.

4.2.1 Related Work

State-space models such as Kalman filters (KF) [110] and hidden Markov models (HMMs) [184] have been widely used in various time series applications such as speech recognition [184], atmospheric monitoring [100], and robotic control [162]. These approaches successfully model regularly sampled (i.e., sampled at the same frequency/rate) time series data; however, they cannot be directly used for MRMTS as they cannot simultaneously capture the multiple temporal dependencies present in MRMTS. To handle MRMTS with state-space models, researchers have extended KF models and proposed multi-rate Kalman filters (MR-KF) [8, 199]. MR-KF approaches either fuse the data with different sampling rates or fuse the estimates of Kalman filters trained on each sampling rate. Many of these MR-KF approaches aim to improve the estimates for the data with the highest sampling rate and do not focus on capturing the multiple temporal dependencies present in MRMTS. Moreover, the linear transition and emission functionality of the MR-KF models limits their usability on complex real-world data.

Recently, researchers have resorted to deep learning models [46, 74, 122] to model the non-linear temporal dynamics of real-world sequential data. Discriminative models such as the hierarchical recurrent neural network [91], the hierarchical multiscale recurrent neural network (HM-RNN) [46], and the phased long short-term memory (PLSTM) [163] have been proposed to capture temporal dependencies of sequential data. However, these discriminative models do not capture the underlying data generation process and therefore are not suited for forecasting and interpolation tasks.
Deep generative models [74, 122, 192] have been developed to model the data generation process of complex time series data. The deep Kalman filter [122] is a nonlinear state-space model built by marrying the ideas of deep neural networks with Kalman filters. The stochastic recurrent neural network (SRNN) [69] glues a recurrent neural network and a state-space model together to form a stochastic and sequential neural generative model. Even though these deep generative models are the state-of-the-art approaches for modeling the underlying data generation process, they are not designed to capture all the temporal dependencies of MRMTS.

None of the existing deep learning models or state-space models can be directly used for modeling MRMTS. Thus, in this work, we develop a deep generative model which leverages the properties of the above discriminative and generative models to model the data generation process of MRMTS while also capturing the multiple temporal dependencies using a latent hierarchical structure.

4.2.2 Methodology

4.2.2.1 Notations for MRMTS and MR-HDMM

Given an MRMTS with L different sampling rates and length T, we use a vector x^l_t ∈ R^{D_l} to represent the time series observations of the l-th rate at time t. Here l = 1, ..., L, t = 1, ..., T, and D_l is the dimension of the time series with the l-th rate. The L sampling rates are in descending order, i.e., l = 1 and l = L refer to the highest and lowest sampling rates. To make the notations succinct, we use x^{l:l'}_{t:t'} to denote all observed time series of the l-th to l'-th rates from time t to t'. We use θ(·) and φ(·) to denote the parameter sets for the generation model p_θ and the inference network q_φ, respectively. We use L layers of recurrent neural networks in the inference network to model MRMTS with L different sampling rates. We use L_HS, the number of hidden layers in both the generation model and the inference network, to control the depth of the learnt hierarchical structures.
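In code, these notations amount to one array per rate. The sketch below is only an illustration of the bookkeeping: the shapes are made up, and the mapping from base time steps to each rate's own clock in `slice_rates` is our assumed convention, not a prescribed part of MR-HDMM.

```python
import numpy as np

# L = 3 sampling rates over T = 8 base (highest-rate) time steps.
# x[l-1] stores x^l_{1:T_l} as a (T_l, D_l) array; T_l shrinks with the rate.
x = [np.zeros((8, 5)),   # l = 1: highest rate, D_1 = 5
     np.zeros((4, 3)),   # l = 2: one sample per 2 base steps
     np.zeros((2, 2))]   # l = 3: lowest rate, one sample per 4 base steps

def slice_rates(x, l, l2, t, t2):
    """x^{l:l'}_{t:t'}: observations of rates l..l2 restricted to base
    steps t..t2, with each rate indexed on its own clock."""
    out = []
    for k in range(l - 1, l2):
        factor = x[0].shape[0] // x[k].shape[0]   # base steps per sample
        out.append(x[k][(t - 1) // factor: t2 // factor])
    return out

window = slice_rates(x, 1, 2, 1, 4)   # rates 1-2 over base steps 1..4
```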
In the rest of this chapter we take L_HS = L for model simplicity, but in practice they are not tied. The latent states or variables are denoted by z, s, and h. Their superscripts and subscripts respectively indicate the corresponding layer(s) and time step(s) (e.g., z^{1:L}_{1:T}, s^{2:L}_t, h^l_t).

4.2.2.2 Sketchy Illustrations of MR-HDMM

Figure 4.8 illustrates our proposed MR-HDMM, which consists of the generation model and the inference network. The switches on the incoming edges to a node (z^l_t) are the same, which is shown as s^l_t in Figure 4.9. MR-HDMM captures the underlying data generation process by using variational inference methods [117, 192] and learns the latent hierarchical structures using learnable switches and auxiliary connections to adaptively encode the dependencies across the hierarchies and the timestamps. In particular, the switches use an update-and-reuse mechanism to control the updates of the latent states of a layer based on their previous states (i.e., utilizing temporal information) and the lower latent layers (i.e., utilizing the hierarchy). A switch triggers an update of the current states if it gets enough information from the lower-level states; otherwise, it reuses the previous states. Thus, the higher-level states act as summarized representations of the lower-level states, and the switches help to propagate the temporal dependencies.
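The update-and-reuse rule can be sketched numerically as follows. In MR-HDMM the transition is a learned stochastic function, so the simple average used here is only a stand-in to show the gating behavior of a switch s^l_t.

```python
import numpy as np

def switch_step(z_prev, z_below, s):
    """Update-and-reuse gating: when the switch fires (s = 1), the layer
    refreshes its state from its previous state and the layer below; when
    it does not (s = 0), the previous state is copied forward, so higher
    layers change more slowly and summarize the lower ones."""
    update = 0.5 * (z_prev + z_below)   # stand-in for the learned transition
    return s * update + (1 - s) * z_prev

z_prev = np.array([1.0, -1.0])
z_below = np.array([3.0, 1.0])

z_on = switch_step(z_prev, z_below, 1)    # update: new information flows up
z_off = switch_step(z_prev, z_below, 0)   # reuse: state held constant
```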
[Figure 4.8: The generation model and inference network of MR-HDMM, drawn over latent states z^l_t, observations x^l_t, and RNN hidden states h^l_t for layers l = 1, 2, 3 and time steps t = 1, ..., 5.]
sha1_base64="bcn5ArE3vHbK1EJAOOANxhcLvGI=">AAAB73icbVBNS8NAEJ3Ur1q/qh69LBbBU0lEUW9FLx4rGFtpY9lsN+3S3U3Y3Qgl5Fd48aDi1b/jzX/jts1BWx8MPN6bYWZemHCmjet+O6Wl5ZXVtfJ6ZWNza3unurt3r+NUEeqTmMeqHWJNOZPUN8xw2k4UxSLktBWOrid+64kqzWJ5Z8YJDQQeSBYxgo2VHoa97Cx/zLy8V625dXcKtEi8gtSgQLNX/er2Y5IKKg3hWOuO5yYmyLAyjHCaV7qppgkmIzygHUslFlQH2fTgHB1ZpY+iWNmSBk3V3xMZFlqPRWg7BTZDPe9NxP+8TmqiiyBjMkkNlWS2KEo5MjGafI/6TFFi+NgSTBSztyIyxAoTYzOq2BC8+ZcXiX9Sv6y7t6e1xlWRRhkO4BCOwYNzaMANNMEHAgKe4RXeHOW8OO/Ox6y15BQz+/AHzucPF3mQIg==</latexit> h 2 1 h 2 3 h 2 5 h 3 1 h 3 5 x 1 1 x 1 2 <latexit sha1_base64="VrLPiSrAzYSUT45RPsMOAO+Li3w=">AAAB73icbVBNS8NAEJ34WetX1aOXxSJ4KkkR1FvRi8cKxlbaWDbbTbt0dxN2N2IJ+RVePKh49e9489+4bXPQ1gcDj/dmmJkXJpxp47rfztLyyuraemmjvLm1vbNb2du/03GqCPVJzGPVDrGmnEnqG2Y4bSeKYhFy2gpHVxO/9UiVZrG8NeOEBgIPJIsYwcZK90+9rJ4/ZF7eq1TdmjsFWiReQapQoNmrfHX7MUkFlYZwrHXHcxMTZFgZRjjNy91U0wSTER7QjqUSC6qDbHpwjo6t0kdRrGxJg6bq74kMC63HIrSdApuhnvcm4n9eJzXReZAxmaSGSjJbFKUcmRhNvkd9pigx fGwJJorZWxEZYoWJsRmVbQje/MuLxK/XLmruzWm1cVmkUYJDOIIT8OAMGnANTfCBgIBneIU3RzkvzrvzMWtdcoqZA/gD5/MHK5GQLw==</latexit> <latexit sha1_base64="VrLPiSrAzYSUT45RPsMOAO+Li3w=">AAAB73icbVBNS8NAEJ34WetX1aOXxSJ4KkkR1FvRi8cKxlbaWDbbTbt0dxN2N2IJ+RVePKh49e9489+4bXPQ1gcDj/dmmJkXJpxp47rfztLyyuraemmjvLm1vbNb2du/03GqCPVJzGPVDrGmnEnqG2Y4bSeKYhFy2gpHVxO/9UiVZrG8NeOEBgIPJIsYwcZK90+9rJ4/ZF7eq1TdmjsFWiReQapQoNmrfHX7MUkFlYZwrHXHcxMTZFgZRjjNy91U0wSTER7QjqUSC6qDbHpwjo6t0kdRrGxJg6bq74kMC63HIrSdApuhnvcm4n9eJzXReZAxmaSGSjJbFKUcmRhNvkd9pigx fGwJJorZWxEZYoWJsRmVbQje/MuLxK/XLmruzWm1cVmkUYJDOIIT8OAMGnANTfCBgIBneIU3RzkvzrvzMWtdcoqZA/gD5/MHK5GQLw==</latexit> <latexit sha1_base64="VrLPiSrAzYSUT45RPsMOAO+Li3w=">AAAB73icbVBNS8NAEJ34WetX1aOXxSJ4KkkR1FvRi8cKxlbaWDbbTbt0dxN2N2IJ+RVePKh49e9489+4bXPQ1gcDj/dmmJkXJpxp47rfztLyyuraemmjvLm1vbNb2du/03GqCPVJzGPVDrGmnEnqG2Y4bSeKYhFy2gpHVxO/9UiVZrG8NeOEBgIPJIsYwcZK90+9rJ4/ZF7eq1TdmjsFWiReQapQoNmrfHX7MUkFlYZwrHXHcxMTZFgZRjjNy91U0wSTER7QjqUSC6qDbHpwjo6t0kdRrGxJg6bq74kMC63HIrSdApuhnvcm4n9eJzXReZAxmaSGSjJbFKUcmRhNvkd9pigx 
fGwJJorZWxEZYoWJsRmVbQje/MuLxK/XLmruzWm1cVmkUYJDOIIT8OAMGnANTfCBgIBneIU3RzkvzrvzMWtdcoqZA/gD5/MHK5GQLw==</latexit> x 1 3 x 1 4 x 1 5 x 2 5 x 3 5 x 3 1 x 2 1 x 2 3 Auxiliary connections Latent variable z <latexit sha1_base64="WIlbTbBFLLcqOvt81zBc03GagJU=">AAAB53icbVBNS8NAEJ3Ur1q/qh69LBbBU0lFUG9FLx5bMLbQhrLZTtq1m03Y3Qg19Bd48aDi1b/kzX/jts1BWx8MPN6bYWZekAiujet+O4WV1bX1jeJmaWt7Z3evvH9wr+NUMfRYLGLVDqhGwSV6hhuB7UQhjQKBrWB0M/Vbj6g0j+WdGSfoR3QgecgZNVZqPvXKFbfqzkCWSS0nFcjR6JW/uv2YpRFKwwTVulNzE+NnVBnOBE5K3VRjQtmIDrBjqaQRaj+bHTohJ1bpkzBWtqQhM/X3REYjrcdRYDsjaoZ60ZuK/3md1ISXfsZlkhqUbL4oTAUxMZl+TfpcITNibAllittb CRtSRZmx2ZRsCLXFl5eJd1a9qrrN80r9Ok+jCEdwDKdQgwuowy00wAMGCM/wCm/Og/PivDsf89aCk88cwh84nz9XU4zR</latexit> <latexit sha1_base64="WIlbTbBFLLcqOvt81zBc03GagJU=">AAAB53icbVBNS8NAEJ3Ur1q/qh69LBbBU0lFUG9FLx5bMLbQhrLZTtq1m03Y3Qg19Bd48aDi1b/kzX/jts1BWx8MPN6bYWZekAiujet+O4WV1bX1jeJmaWt7Z3evvH9wr+NUMfRYLGLVDqhGwSV6hhuB7UQhjQKBrWB0M/Vbj6g0j+WdGSfoR3QgecgZNVZqPvXKFbfqzkCWSS0nFcjR6JW/uv2YpRFKwwTVulNzE+NnVBnOBE5K3VRjQtmIDrBjqaQRaj+bHTohJ1bpkzBWtqQhM/X3REYjrcdRYDsjaoZ60ZuK/3md1ISXfsZlkhqUbL4oTAUxMZl+TfpcITNibAllittb CRtSRZmx2ZRsCLXFl5eJd1a9qrrN80r9Ok+jCEdwDKdQgwuowy00wAMGCM/wCm/Og/PivDsf89aCk88cwh84nz9XU4zR</latexit> <latexit sha1_base64="WIlbTbBFLLcqOvt81zBc03GagJU=">AAAB53icbVBNS8NAEJ3Ur1q/qh69LBbBU0lFUG9FLx5bMLbQhrLZTtq1m03Y3Qg19Bd48aDi1b/kzX/jts1BWx8MPN6bYWZekAiujet+O4WV1bX1jeJmaWt7Z3evvH9wr+NUMfRYLGLVDqhGwSV6hhuB7UQhjQKBrWB0M/Vbj6g0j+WdGSfoR3QgecgZNVZqPvXKFbfqzkCWSS0nFcjR6JW/uv2YpRFKwwTVulNzE+NnVBnOBE5K3VRjQtmIDrBjqaQRaj+bHTohJ1bpkzBWtqQhM/X3REYjrcdRYDsjaoZ60ZuK/3md1ISXfsZlkhqUbL4oTAUxMZl+TfpcITNibAllittb CRtSRZmx2ZRsCLXFl5eJd1a9qrrN80r9Ok+jCEdwDKdQgwuowy00wAMGCM/wCm/Og/PivDsf89aCk88cwh84nz9XU4zR</latexit> Observation x Inference RNN h <latexit sha1_base64="lt6MdrejMYgtNBMJ4G/0FERlp4M=">AAAB53icbVBNS8NAEJ34WetX1aOXxSJ4KokI6q3oxWMLxhbaUDbbSbt2swm7G6GE/gIvHlS8+pe8+W/ctjlo64OBx3szzMwLU8G1cd1vZ2V1bX1js7RV3t7Z3duvHBw+6CRTDH2WiES1Q6pRcIm+ 
4UZgO1VI41BgKxzdTv3WEyrNE3lvxikGMR1IHnFGjZWaw16l6tbcGcgy8QpShQKNXuWr209YFqM0TFCtO56bmiCnynAmcFLuZhpTykZ0gB1LJY1RB/ns0Ak5tUqfRImyJQ2Zqb8nchprPY5D2xlTM9SL3lT8z+tkJroKci7TzKBk80VRJohJyPRr0ucKmRFjSyhT3N5K2JAqyozNpmxD8BZfXib+ee265jYvqvWbIo0SHMMJnIEHl1CHO2iADwwQnuEV3pxH58V5dz7mrStOMXMEf+B8/gA8HYy/</latexit> <latexit sha1_base64="lt6MdrejMYgtNBMJ4G/0FERlp4M=">AAAB53icbVBNS8NAEJ34WetX1aOXxSJ4KokI6q3oxWMLxhbaUDbbSbt2swm7G6GE/gIvHlS8+pe8+W/ctjlo64OBx3szzMwLU8G1cd1vZ2V1bX1js7RV3t7Z3duvHBw+6CRTDH2WiES1Q6pRcIm+ 4UZgO1VI41BgKxzdTv3WEyrNE3lvxikGMR1IHnFGjZWaw16l6tbcGcgy8QpShQKNXuWr209YFqM0TFCtO56bmiCnynAmcFLuZhpTykZ0gB1LJY1RB/ns0Ak5tUqfRImyJQ2Zqb8nchprPY5D2xlTM9SL3lT8z+tkJroKci7TzKBk80VRJohJyPRr0ucKmRFjSyhT3N5K2JAqyozNpmxD8BZfXib+ee265jYvqvWbIo0SHMMJnIEHl1CHO2iADwwQnuEV3pxH58V5dz7mrStOMXMEf+B8/gA8HYy/</latexit> <latexit sha1_base64="lt6MdrejMYgtNBMJ4G/0FERlp4M=">AAAB53icbVBNS8NAEJ34WetX1aOXxSJ4KokI6q3oxWMLxhbaUDbbSbt2swm7G6GE/gIvHlS8+pe8+W/ctjlo64OBx3szzMwLU8G1cd1vZ2V1bX1js7RV3t7Z3duvHBw+6CRTDH2WiES1Q6pRcIm+ 4UZgO1VI41BgKxzdTv3WEyrNE3lvxikGMR1IHnFGjZWaw16l6tbcGcgy8QpShQKNXuWr209YFqM0TFCtO56bmiCnynAmcFLuZhpTykZ0gB1LJY1RB/ns0Ak5tUqfRImyJQ2Zqb8nchprPY5D2xlTM9SL3lT8z+tkJroKci7TzKBk80VRJohJyPRr0ucKmRFjSyhT3N5K2JAqyozNpmxD8BZfXib+ee265jYvqvWbIo0SHMMJnIEHl1CHO2iADwwQnuEV3pxH58V5dz7mrStOMXMEf+B8/gA8HYy/</latexit> Switches s Figure 4.8: Generation model (left) and structured inference network (right, with the filtering setting) of our proposed MR-HDMM for MRMTS. The auxiliary connections (dashed lines in the left of Figure 4.8) between MRMTS of different sampling rates and different latent layers help the model effec- tively capture the short-term and long-term temporal dependencies. Without the auxiliary connections, the higher-rate time series may mask the multi-scale depen- dencies present in the lower-rate time series data while propagating dependencies through bottom-up connections. 
Note that the auxiliary connections do not constrain the sampling rates of MRMTS: the sampling rate of a higher-rate variable need not be a multiple of that of a lower-rate variable. Owing to this flexibility, MR-HDMM can also handle irregularly sampled time series or missing data. We can a) zero out the missing data points in the inference network and remove the corresponding auxiliary connections in the generation model during training, and b) interpolate missing values by adding auxiliary connections to the trained model.
4.2.2.3 Generation Model
The left of Figure 4.8 shows the generation model of our MR-HDMM. The generation process follows the transition-and-emission framework, obtained by applying deep recurrent neural networks to non-linear continuous state-space models. The generation model is carefully designed to incorporate the switching mechanism and the auxiliary connections in order to capture the multiple temporal dependencies present in MRMTS.
[Figure 4.9 omitted: switch diagram.]
Figure 4.9: The switch mechanism (left) for updating the latent states z_t^l in MR-HDMM, with illustrations of the switch-on (middle, s_t^l = 1) and switch-off (right, s_t^l = 0) conditions.
Transition  We design the transition process of the latent states z to capture the hierarchical structure of multiple temporal dependencies with learnable binary switches s. For each non-bottom layer l > 1 and time step t ≥ 1, a binary switch state s_t^l controls the update of the corresponding latent state z_t^l, as shown in Figure 4.9. s_t^l is obtained from the previous latent state z_{t-1}^l and the lower-layer latent state z_t^{l-1} by the deterministic mapping

    s_t^l = I[ g_{θ_s}(z_{t-1}^l, z_t^{l-1}) ≥ 0 ]

It is worth noting that during training we instead use a sharp sigmoid, s_t^l ≈ σ( c · g_{θ_s}(z_{t-1}^l, z_t^{l-1}) ) with a large constant c, so that the switch remains differentiable.
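The two forms of the switch can be sketched as follows (a minimal sketch, assuming g is the scalar output of g_{θ_s} on (z_{t-1}^l, z_t^{l-1})):

```python
import math

def switch_hard(g: float) -> int:
    """Test-time switch: s = I[ g >= 0 ]."""
    return 1 if g >= 0.0 else 0

def switch_soft(g: float, c: float = 50.0) -> float:
    """Training-time relaxation: s ~= sigmoid(c * g), differentiable in g."""
    return 1.0 / (1.0 + math.exp(-c * g))

# The relaxation approaches the hard indicator as c grows.
print(switch_hard(0.2), round(switch_soft(0.2), 3))    # 1 1.0
print(switch_hard(-0.2), round(switch_soft(-0.2), 3))  # 0 0.0
```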
When the switch is on (i.e., the update operation, s_t^l = 1), z_t^l is updated based on z_{t-1}^l and z_t^{l-1} through a learnt transition distribution. We use a multivariate Gaussian transition distribution N(μ_t^l, Σ_t^l) whose mean and covariance are given by (μ_t^l, Σ_t^l) = g_{θ_z}(z_{t-1}^l, z_t^{l-1}). When the switch is off (i.e., the reuse operation, s_t^l = 0), z_t^l is drawn from the same distribution as its previous state z_{t-1}^l, namely N(μ_{t-1}^l, Σ_{t-1}^l). Unlike HM-RNN [46], the proposed MR-HDMM does not copy the previous state, since the latent states are stochastic. The latent states of the first layer (z_{1:T}^1) are always updated at each time step. In MR-HDMM, g_{θ_s} is parameterized by a multilayer perceptron (MLP), and g_{θ_z} is parameterized by a GRU [47] to capture the temporal dependencies. With this update-or-reuse transition mechanism, higher latent layers tend to capture longer-term temporal dependencies through the bottom-up connections in the latent layers.
Emission  The MRMTS observation x is generated from z in the emission process. In order to embed the multiple temporal dependencies in the generated MRMTS, we introduce auxiliary connections (the red dashed lines in Figure 4.8) from the higher latent layers to the lower-rate time series. That is, the time series of the l-th rate at time t (i.e., x_t^l) is generated from all latent states up to the l-th layer, z_t^{1:l}, through the emission distribution Π(x_t^l | z_t^{1:l}; θ_x). The choice of the emission distribution Π is flexible and depends on the data type: a multinomial distribution for categorical data, and a Gaussian distribution for continuous data. Since all the data in our tasks are continuous, we use a Gaussian distribution whose mean μ_{x,t}^l and covariance Σ_{x,t}^l are determined by g_{θ_x}(z_t^{1:l}), which is parameterized by an MLP.
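The update-or-reuse transition and the Gaussian emission above can be combined into a minimal generative loop. This is only a sketch with scalar latent states; the g_theta_* functions are hypothetical stand-ins for the learnt MLP/GRU maps:

```python
import random

L, T = 3, 5
random.seed(0)

# Hypothetical stand-ins for the learnt maps g_{theta_s}, g_{theta_z}, g_{theta_x}.
def g_theta_s(z_prev, z_lower):        # switch score
    return z_prev + z_lower
def g_theta_z(z_prev, z_lower=0.0):    # transition (mean, std)
    return 0.5 * (z_prev + z_lower), 1.0
def g_theta_x(z_up_to_l):              # emission (mean, std) from z_t^{1:l}
    return sum(z_up_to_l), 0.1

z = [random.gauss(0, 1) for _ in range(L)]            # z_0^{1:L}
mu_sig = [g_theta_z(z[l]) for l in range(L)]          # stats kept for possible reuse
x = []
for t in range(T):
    mu_sig[0] = g_theta_z(z[0])                       # layer 1 always updates
    z[0] = random.gauss(*mu_sig[0])
    for l in range(1, L):
        if g_theta_s(z[l], z[l - 1]) >= 0:            # update: new (mu, sigma)
            mu_sig[l] = g_theta_z(z[l], z[l - 1])
        # else: reuse -- keep the previous (mu, sigma) unchanged
        z[l] = random.gauss(*mu_sig[l])
    x.append([random.gauss(*g_theta_x(z[: l + 1])) for l in range(L)])

print(len(x), len(x[0]))  # T time steps, L emitted values each: 5 3
```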
Algorithm 4.1: Generation model of MR-HDMM
Input: The generation model with parameters θ_z, θ_s, θ_x
Output: Generated MRMTS x_{1:T}^{1:L}
 1: Initialize z_0^{1:L} ~ N(0, I)
 2: for t = 1, ..., T do
 3:   (μ_t^1, Σ_t^1) = g_{θ_z}(z_{t-1}^1)
 4:   z_t^1 ~ N(μ_t^1, Σ_t^1)    {Transition of the first layer.}
 5:   for l = 2, ..., L do
 6:     s_t^l = I[ g_{θ_s}(z_{t-1}^l, z_t^{l-1}) ≥ 0 ]
 7:     (μ_t^l, Σ_t^l) = g_{θ_z}(z_{t-1}^l, z_t^{l-1}) if s_t^l = 1, else (μ_{t-1}^l, Σ_{t-1}^l)
 8:     z_t^l ~ N(μ_t^l, Σ_t^l)    {Transition of other layers.}
 9:   end for
10:   for l = 1, ..., L do
11:     (μ_{x,t}^l, Σ_{x,t}^l) = g_{θ_x}(z_t^{1:l})
12:     x_t^l ~ N(μ_{x,t}^l, Σ_{x,t}^l)    {Emission.}
13:   end for
14: end for

To summarize, the overall generation process is described in Algorithm 4.1. The parameter set of the generation model is θ = {θ_x, θ_z, θ_s}. Given this, the joint probability of the MRMTS and the latent states/switches factorizes as

p_θ(x_{1:T}^{1:L}, z_{1:T}^{1:L}, s_{1:T}^{2:L} | z_0^{1:L})
  = p_θ(x_{1:T}^{1:L} | z_{1:T}^{1:L}) · p_θ(z_{1:T}^{1:L}, s_{1:T}^{2:L} | z_0^{1:L})
  = ∏_{t=1}^{T} p_θ(x_t^{1:L} | z_t^{1:L}) · ∏_{t=1}^{T} p_θ(z_t^{1:L}, s_t^{2:L} | z_{t-1}^{1:L})
  = ∏_{t=1}^{T} ∏_{l=1}^{L} p_{θ_x}(x_t^l | z_t^{1:l}) · ∏_{t=1}^{T} p_{θ_z}(z_t^1 | z_{t-1}^1) · ∏_{t=1}^{T} ∏_{l=2}^{L} p_{θ_s}(s_t^l | z_{t-1}^l, z_t^{l-1}) p_{θ_z}(z_t^l | z_{t-1}^l, z_t^{l-1}, s_t^l)    (4.12)

In order to obtain the parameters of MR-HDMM, we need to maximize the log marginal likelihood over all MRMTS data points, i.e., the summation of the log marginal likelihood L(θ) = log p_θ(x_{1:T}^{1:L} | z_0^{1:L}) of each MRMTS data point x_{1:T}^{1:L}. The log marginal likelihood of one data point is obtained by integrating out all possible z and s in Equation (4.12). Since the s are deterministic binary variables, integrating them out can be done straightforwardly by plugging in their values in the likelihood. However, the stochastic variables z cannot be integrated out analytically. Thus, we resort to the well-known variational principle [109] and introduce our inference network below.
4.2.2.4 Inference Network
We design our inference network to mimic the structure of the generative model.
The goal is to obtain an objective that can be optimized easily and that makes the model parameter learning amenable. Instead of directly maximizing L(θ) with respect to θ, we build an inference network with a tractable distribution q_φ and maximize the variational evidence lower bound (ELBO) F(θ, φ) ≤ L(θ) with respect to both θ and φ. Note that φ is the parameter set of the inference network, which is formally defined at the end of this section. The lower bound can be written as

F(θ, φ) = E_{q_φ}[ log p_θ(x_{1:T}^{1:L} | z_{0:T}^{1:L}) ] − D_KL( q_φ(z_{1:T}^{1:L}, s_{1:T}^{2:L} | x_{1:T}^{1:L}, z_0^{1:L}) ‖ p_θ(z_{1:T}^{1:L}, s_{1:T}^{2:L} | z_0^{1:L}) )    (4.13)

where the expectation of the first term is under q_φ(z_{1:T}^{1:L} | x_{1:T}^{1:L}, z_0^{1:L}). To get a tight bound and an accurate estimate from our MR-HDMM, we need to properly design a new inference network, as the existing inference networks from SRNN [69] or DMM [122] are not applicable to MRMTS. In the following, we show how we design the inference network (shown on the right of Figure 4.8) to obtain a good structured approximation to the posterior. First, we maintain the Markov properties of z in the inference network, which leads to the factorization

q_φ(z_{1:T}^{1:L}, s_{1:T}^{2:L} | x_{1:T}^{1:L}, z_0^{1:L}) = ∏_{t=1}^{T} q_φ(z_t^{1:L}, s_t^{2:L} | z_{t-1}^{1:L}, x_{1:T}^{1:L})    (4.14)

We then leverage the hierarchical structure and inherit the switches from the generation model into the inference network.
That is, the same g_{θ_s} from the generation model is used in the inference network, i.e.,

q_φ(s_t^l | z_{t-1}^l, z_t^{l-1}, x_{1:T}^{1:L}) = q_{φ_s}(s_t^l | z_{t-1}^l, z_t^{l-1}) = p_{θ_s}(s_t^l | z_{t-1}^l, z_t^{l-1})

Then, for each term on the right-hand side of Equation (4.14) and for all t = 1, ..., T, we have

q_φ(z_t^{1:L}, s_t^{2:L} | z_{t-1}^{1:L}, x_{1:T}^{1:L})
  = q_φ(z_t^1 | z_{t-1}^1, x_{1:T}^{1:L}) · ∏_{l=2}^{L} q_φ(s_t^l | z_{t-1}^l, z_t^{l-1}, x_{1:T}^{1:L}) q_φ(z_t^l | z_{t-1}^l, z_t^{l-1}, s_t^l, x_{1:T}^{1:L})
  = q_φ(z_t^1 | z_{t-1}^1, x_{1:T}^{1:L}) · ∏_{l=2}^{L} p_{θ_s}(s_t^l | z_{t-1}^l, z_t^{l-1}) q_φ(z_t^l | z_{t-1}^l, z_t^{l-1}, s_t^l, x_{1:T}^{1:L})    (4.15)

Thus, the inference network can be factorized by Equations (4.14) and (4.15). Note that we can likewise factorize the generative model based on Equation (4.12). Given these, we further factorize the ELBO in Equation (4.13) as a summation of expectations of conditional log-likelihood and Kullback–Leibler (KL) divergence terms over time steps and hierarchical layers:

F(θ, φ) = ∑_{t=1}^{T} ∑_{l=1}^{L} E_{Q*(z_t^{1:l})}[ log p_{θ_x}(x_t^l | z_t^{1:l}) ]
  − ∑_{t=1}^{T} E_{Q*(z_{t-1}^1)}[ D_KL( q_φ(z_t^1 | x_{1:T}^{1:L}, z_{t-1}^1) ‖ p_θ(z_t^1 | z_{t-1}^1) ) ]
  − ∑_{t=1}^{T} ∑_{l=2}^{L} E_{Q*(z_{t-1}^l, z_t^{l-1})}[ D_KL( q_φ(z_t^l | x_{1:T}^{1:L}, z_{t-1}^l, z_t^{l-1}) ‖ p_θ(z_t^l | z_{t-1}^l, z_t^{l-1}) ) ]    (4.16)

where Q*(·) denotes the marginal distribution of (·) under q_φ.

Table 4.4: Comparison of structured inference networks used in MR-HDMM.

Inference network                 | RNN output               | Captured in h_t^l | Variational approx. for z_t^l
filtering (forward RNN)           | h_forward                | x_{1:t}^l         | q_φ(z_t^l | z_{t-1}^l, z_t^{l-1}, s_t^l, x_{1:t}^{1:L})
smoothing (backward RNN)          | h_backward               | x_{t:T}^l         | q_φ(z_t^l | z_{t-1}^l, z_t^{l-1}, s_t^l, x_{t:T}^{1:L})
bi-direction (bi-directional RNN) | [h_forward, h_backward]  | x_{1:T}^l         | q_φ(z_t^l | z_{t-1}^l, z_t^{l-1}, s_t^l, x_{1:T}^{1:L})

Parameterization of inference network  We parameterize the inference network and construct the variational approximation q_φ(z_t^l | z_{t-1}^l, z_t^{l-1}, s_t^l, x_{1:T}^{1:L}) used in Equation (4.16) with deep learning models.
First, we use L recurrent neural networks to capture the MRMTS with L different sampling rates, such that each rate is modeled by a separate RNN. Second, to obtain the l-th latent state z_t^l of the inference network at time step t, we use not only the previous latent state z_{t-1}^l and the lower-layer latent state z_t^{l-1} but also the l-th RNN output h_t^l as an input. Third, we reuse the same latent-state distribution and switch mechanism from the generation model to generate the z of the inference network. More specifically, z_t^l is drawn from a multivariate normal distribution: the mean and covariance are reused from those of z_{t-1}^l if s_t^l = 0 and l > 1, and otherwise they are modeled by a GRU with input [h_t^l, z_{t-1}^l, z_t^{l-1}]. The choice of the RNN model for h_t^l determines what information from other time steps enters the approximation at time t, i.e., the form of q_φ(z_t^l | z_{t-1}^l, z_t^{l-1}, s_t^l, x_{1:T}^{1:L}). Inspired by SRNN [123], we construct the variational approximation in three settings (filtering, smoothing, bi-direction) for the forecasting and interpolation tasks. In the filtering setting, we only consider the information up to time t (i.e., x_{1:t}^{1:L}) using forward recurrent neural networks. By doing so, we have h_t^l = h_{t,forward}^l = RNN_forward(h_{t-1,forward}^l, x_t^l), and thus q_φ(z_t^l | z_{t-1}^l, z_t^{l-1}, s_t^l, x_{1:T}^{1:L}) = q_φ(z_t^l | z_{t-1}^l, z_t^{l-1}, s_t^l, x_{1:t}^{1:L}). The filtering setting does not use future information, so it is suitable for forecasting at future time steps t' > T. For interpolation tasks, we can use backward recurrent neural networks to utilize the information after time t (i.e., x_{t:T}^{1:L}) with h_t^l = h_{t,backward}^l = RNN_backward(h_{t+1,backward}^l, x_t^l), or bi-directional recurrent neural networks to utilize information across all time steps (x_{1:T}^{1:L}) at any time t, with h_t^l = [h_{t,forward}^l, h_{t,backward}^l].
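The forward and backward recurrences that produce h_t^l can be sketched as follows (a scalar sketch; rnn_cell is a hypothetical stand-in for the recurrent unit, which is a GRU in the model):

```python
def rnn_cell(h, x):
    """Hypothetical recurrent cell; the model uses a GRU here."""
    return 0.5 * h + 0.5 * x

def run_forward(xs):
    h, out = 0.0, []
    for x in xs:                  # h_t summarizes x_{1:t}
        h = rnn_cell(h, x)
        out.append(h)
    return out

def run_backward(xs):
    h, out = 0.0, []
    for x in reversed(xs):        # h_t summarizes x_{t:T}
        h = rnn_cell(h, x)
        out.append(h)
    return list(reversed(out))

xs = [1.0, 2.0, 3.0, 4.0]
hf, hb = run_forward(xs), run_backward(xs)
bi = list(zip(hf, hb))            # bi-direction: h_t = [h_forward, h_backward]
print(hf[1], hb[1])               # 1.25 2.25
```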
These two models lead to the smoothing and bi-direction settings, respectively. We summarize the three inference networks in Table 4.4. We use φ_h and φ_z to denote the parameter sets related to h and z respectively, and use φ = {φ_h, φ_z, φ_s = θ_s} to denote the parameter set of the inference network.
4.2.2.5 Derivations for Model Parameterization
Getting the ELBO in Equation (4.13)  To utilize the variational principle and obtain the ELBO, we introduce another distribution Q = q_φ(z_{1:T}^{1:L}, s_{1:T}^{2:L} | x_{1:T}^{1:L}, z_0^{1:L}) to approximate P = p_θ(z_{1:T}^{1:L}, s_{1:T}^{2:L} | x_{1:T}^{1:L}, z_0^{1:L}). Starting with the KL divergence from Q to P, we have

0 ≤ D_KL(Q ‖ P)
  = E_Q[ log ( Q · p_θ(x_{1:T}^{1:L} | z_0^{1:L}) / ( p_θ(x_{1:T}^{1:L} | z_{1:T}^{1:L}, s_{1:T}^{2:L}, z_0^{1:L}) · p_θ(z_{1:T}^{1:L}, s_{1:T}^{2:L} | z_0^{1:L}) ) ) ]
  = E_Q[ log ( Q / p_θ(z_{1:T}^{1:L}, s_{1:T}^{2:L} | z_0^{1:L}) ) − log p_θ(x_{1:T}^{1:L} | z_{0:T}^{1:L}, s_{1:T}^{2:L}) ] + log p_θ(x_{1:T}^{1:L} | z_0^{1:L})
  = D_KL( Q ‖ p_θ(z_{1:T}^{1:L}, s_{1:T}^{2:L} | z_0^{1:L}) ) − E_Q[ log p_θ(x_{1:T}^{1:L} | z_{0:T}^{1:L}) ] + log p_θ(x_{1:T}^{1:L} | z_0^{1:L})

Two things should be noted about this equation. First, the expectations are under Q = q_φ(z_{1:T}^{1:L}, s_{1:T}^{2:L} | x_{1:T}^{1:L}, z_0^{1:L}), and p_θ(x_{1:T}^{1:L} | z_0^{1:L}) does not depend on it. Second, in our generation model, x is by design independent of s given z. Then we have

L(θ) = log p_θ(x_{1:T}^{1:L} | z_0^{1:L}) ≥ E_Q[ log p_θ(x_{1:T}^{1:L} | z_{0:T}^{1:L}) ] − D_KL( Q ‖ p_θ(z_{1:T}^{1:L}, s_{1:T}^{2:L} | z_0^{1:L}) ) = F(θ, φ)

The tightness of this bound depends on how well Q approximates P, the true posterior of the latent variables given the inputs and initial parameters, because equality holds if and only if D_KL(Q ‖ P) = 0. This requirement is carefully taken into consideration in our inference model design.
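A one-dimensional sanity check of the bound L(θ) ≥ F(θ, φ) on a toy model (not MR-HDMM itself): with z ~ N(0, 1) and x | z ~ N(z, 1), both log p(x) and the ELBO are available in closed form, and the gap closes exactly when q equals the true posterior N(x/2, 1/2).

```python
import math

def log_marginal(x):
    # z ~ N(0,1), x|z ~ N(z,1)  =>  x ~ N(0,2)
    return -0.5 * math.log(2 * math.pi * 2.0) - x * x / 4.0

def elbo(x, m, v):
    # F = E_q[log p(x|z)] + E_q[log p(z)] + H(q), with q(z) = N(m, v)
    e_lik = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + v)
    e_pri = -0.5 * math.log(2 * math.pi) - 0.5 * (m * m + v)
    ent = 0.5 * math.log(2 * math.pi * math.e * v)
    return e_lik + e_pri + ent

x = 1.3
for m, v in [(0.0, 1.0), (0.5, 0.8), (x / 2, 0.5)]:   # last = true posterior
    assert elbo(x, m, v) <= log_marginal(x) + 1e-12   # ELBO never exceeds log p(x)
print(round(abs(log_marginal(x) - elbo(x, x / 2, 0.5)), 10))  # gap at posterior: 0.0
```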
Factorizing the ELBO in Equation (4.16)  The first part of F(θ, φ) can be factorized similarly to Equation (4.12):

E_Q[ log p_θ(x_{1:T}^{1:L} | z_{0:T}^{1:L}) ] = E_Q[ ∑_{t=1}^{T} ∑_{l=1}^{L} log p_{θ_x}(x_t^l | z_t^{1:l}) ] = ∑_{t=1}^{T} ∑_{l=1}^{L} E_{Q*(z_t^{1:l})}[ log p_{θ_x}(x_t^l | z_t^{1:l}) ]    (4.17)

where Q*(z_t^{1:l}) is the marginal distribution of z_t^{1:l} in the variational approximation to the posterior q_φ(z_{1:t}^{1:l} | x_{1:T}^{1:L}, z_0^{1:L}), defined as

Q*(z_t^{1:l}) = ∫ q_φ(z_{1:t}^{1:L} | x_{1:T}^{1:L}, z_0^{1:L}) dz_{1:t-1}^{1:L}
  = ∫ q_φ(z_t^{1:L} | x_{1:t}^{1:L}, z_{t-1}^{1:L}) q_φ(z_{1:t-1}^{1:L} | x_{1:T}^{1:L}, z_0^{1:L}) dz_{1:t-1}^{1:L}
  = E_{Q*(z_{t-1}^{1:l})}[ q_φ(z_t^{1:L} | x_{1:t}^{1:L}, z_{t-1}^{1:L}) ]

where Q*(z_{t-1}^{1:l}) denotes the marginal distribution of z_{t-1}^{1:l} in the same way. The second part of F(θ, φ) can be factorized based on the principle of minimum discrimination information. First, we have

D_KL( q_φ(z_{1:T}^{1:L}, s_{1:T}^{2:L} | x_{1:T}^{1:L}, z_0^{1:L}) ‖ p_θ(z_{1:T}^{1:L}, s_{1:T}^{2:L} | z_0^{1:L}) )
  = E_{q_φ}[ D_KL( q_φ(z_T^{1:L}, s_T^{2:L} | x_{1:T}^{1:L}, z_{T-1}^{1:L}) ‖ p_θ(z_T^{1:L}, s_T^{2:L} | z_{T-1}^{1:L}) ) ]
    + D_KL( q_φ(z_{1:T-1}^{1:L}, s_{1:T-1}^{2:L} | x_{1:T}^{1:L}, z_0^{1:L}) ‖ p_θ(z_{1:T-1}^{1:L}, s_{1:T-1}^{2:L} | z_0^{1:L}) )    (4.18)

where q_φ in the expectation is shorthand for q_φ(z_{1:T-1}^{1:L}, s_{1:T-1}^{2:L} | x_{1:T}^{1:L}, z_0^{1:L}). Notice that q_{φ_s}(·) and p_{θ_s}(·) are the same, and s is deterministic given the other variables, so it can be integrated out (by plugging in its actual value).
Then, the first KL divergence term can be further factorized as

D_KL( q_φ(z_T^{1:L}, s_T^{2:L} | x_{1:T}^{1:L}, z_{T-1}^{1:L}) ‖ p_θ(z_T^{1:L}, s_T^{2:L} | z_{T-1}^{1:L}) )
  = D_KL( q_φ(z_T^1 | x_{1:T}^{1:L}, z_{T-1}^1) ‖ p_θ(z_T^1 | z_{T-1}^1) )
    + ∑_{l=2}^{L} E_{q_φ}[ D_KL( q_φ(z_T^l | x_{1:T}^{1:L}, z_{T-1}^l, z_T^{l-1}, s_T^l) ‖ p_θ(z_T^l | z_{T-1}^l, z_T^{l-1}, s_T^l) ) ]
    + ∑_{l=2}^{L} E_{q_φ}[ D_KL( q_φ(s_T^l | x_{1:T}^{1:L}, z_{T-1}^l, z_T^{l-1}) ‖ p_θ(s_T^l | z_{T-1}^l, z_T^{l-1}) ) ]
  = D_KL( q_φ(z_T^1 | x_{1:T}^{1:L}, z_{T-1}^1) ‖ p_θ(z_T^1 | z_{T-1}^1) )
    + ∑_{l=2}^{L} E_{q_φ}[ D_KL( q_φ(z_T^l | x_{1:T}^{1:L}, z_{T-1}^l, z_T^{l-1}) ‖ p_θ(z_T^l | z_{T-1}^l, z_T^{l-1}) ) ]

where the switch terms vanish because q_{φ_s} and p_{θ_s} are identical. By recursively factorizing the last term of Equation (4.18), we have

D_KL( q_φ(z_{1:T}^{1:L}, s_{1:T}^{2:L} | x_{1:T}^{1:L}, z_0^{1:L}) ‖ p_θ(z_{1:T}^{1:L}, s_{1:T}^{2:L} | z_0^{1:L}) )
  = ∑_{t=1}^{T} E_{Q*(z_{t-1}^1)}[ D_KL( q_φ(z_t^1 | x_{1:T}^{1:L}, z_{t-1}^1) ‖ p_θ(z_t^1 | z_{t-1}^1) ) ]
    + ∑_{t=1}^{T} ∑_{l=2}^{L} E_{Q*(z_{t-1}^l, z_t^{l-1})}[ D_KL( q_φ(z_t^l | x_{1:T}^{1:L}, z_{t-1}^l, z_t^{l-1}) ‖ p_θ(z_t^l | z_{t-1}^l, z_t^{l-1}) ) ]    (4.19)

where Q* denotes the marginal distributions defined as before. Finally, combining Equations (4.17) and (4.19) yields the factorized ELBO in Equation (4.16).
4.2.2.6 Learning the Parameters
We jointly learn the parameters (θ, φ) of the generative model p_θ and the inference network q_φ by maximizing the ELBO in Equation (4.16).
The main challenge in the optimization is obtaining the gradients of all the terms under the correct expectations, i.e., E_{Q*}. We use stochastic backpropagation [117] to estimate all these gradients, and train the model with stochastic gradient descent (SGD). We employ ancestral sampling to obtain the samples of z: all samples of z are drawn sequentially from time 1 to T and from layer 1 to L. Given the samples from the previous layer l − 1 or the previous time step t − 1, the new samples at time t and layer l are distributed according to the marginal distribution Q*.

Algorithm 4.2: Learning MR-HDMM with stochastic backpropagation and stochastic gradient descent
Input: X: a set of MRMTS of L sampling rates; initial (θ, φ)
Output: Updated (θ, φ)
 1: while not converged do
 2:   Choose a random minibatch of MRMTS X' ⊂ X
 3:   for each sample x_{1:T}^{1:L} ∈ X' do
 4:     Compute h_{1:T}^{1:L} by the inference network φ_h on input x_{1:T}^{1:L}
 5:     Sample ẑ_0^{1:L} ~ N(0, I)
 6:     for t = 1, ..., T do
 7:       Estimate (μ_t^{1(φ)}, Σ_t^{1(φ)}) by φ_z, and (μ_t^1, Σ_t^1) by θ_z, given the sample ẑ_{t-1}^1 and h_t^1
 8:       From (μ_t^{1(φ)}, Σ_t^{1(φ)}) and (μ_t^1, Σ_t^1), compute the gradient of D_KL( q_φ(z_t^1 | ·) ‖ p_θ(z_t^1 | ·) )
 9:       Sample ẑ_t^1 ~ N(μ_t^{1(φ)}, Σ_t^{1(φ)})
10:       for l = 2, ..., L do
11:         Compute s_t^l by θ_s from the samples ẑ_{t-1}^l and ẑ_t^{l-1}
12:         Estimate (μ_t^{l(φ)}, Σ_t^{l(φ)}) by φ_z, and (μ_t^l, Σ_t^l) by θ_z, given ẑ_{t-1}^l, ẑ_t^{l-1}, s_t^l, and h_t^l
13:         From (μ_t^{l(φ)}, Σ_t^{l(φ)}) and (μ_t^l, Σ_t^l), compute the gradient of D_KL( q_φ(z_t^l | ·) ‖ p_θ(z_t^l | ·) )
14:         Sample ẑ_t^l ~ N(μ_t^{l(φ)}, Σ_t^{l(φ)})
15:       end for
16:       Compute the gradient of log p_{θ_x}(x_t^l | ẑ_t^{1:l})
17:     end for
18:   end for
19:   Update (θ, φ) using all gradients
20: end while

Notice that all terms D_KL( q_φ(z_t^l | ·) ‖ p_θ(z_t^l | ·) ) in Equation (4.16) are KL divergences between two multivariate Gaussian distributions, and p_{θ_x}(x_t^l | z_t^{1:l}) is also a multivariate Gaussian distribution.
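These Gaussian KL terms have a closed form; a minimal sketch for diagonal covariances (plain lists stand in for the mean and variance vectors produced by φ_z and θ_z):

```python
import math

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """Closed-form KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) )."""
    kl = 0.0
    for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p):
        kl += 0.5 * (math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0)
    return kl

# KL is zero iff the two Gaussians coincide, and positive otherwise.
print(kl_diag_gauss([0.0, 1.0], [1.0, 0.5], [0.0, 1.0], [1.0, 0.5]))  # 0.0
print(round(kl_diag_gauss([0.0], [1.0], [1.0], [2.0]), 4))            # 0.3466
```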
Thus, all the required gradients can be estimated analytically from the samples drawn in our proposed way. Algorithm 4.2 shows the overall learning procedure.
4.2.3 Experiments
We conduct experiments on the MIMIC-III dataset and attempt to answer the following questions: (a) How does our proposed model perform when compared to the existing state-of-the-art approaches? (b) To what extent are the proposed learnable hierarchical latent structure and auxiliary connections useful for modeling the data generation process? (c) How do we interpret the hierarchy learned by the proposed model?
4.2.3.1 Experiment Design
For our experiments, we choose 10,709 adult admission records and extract 62 temporal features from the first 72 hours in the MIMIC-III dataset. These features have one of three sampling rates: 1 hour, 4 hours, or 12 hours. We use HSR/MSR/LSR to denote the high/medium/low sampling rate, respectively. Table 4.5 shows the statistics of the processed dataset. To fill in missing entries and obtain complete time series at each sampling rate, we use forward or linear imputation; for fair comparison, the same imputed data are used by all methods, and all models are evaluated only on the original (non-imputed) time series. Our choice of imputation may not be optimal, and finding better imputation methods is another important research direction beyond the scope of this work.

Table 4.5: Description of the MIMIC-III dataset used for MR-HDMM evaluations.

Number of samples | High/Medium/Low sampling rates | Number of variables of each rate | Time series length
10,709            | 1/4/12 hours                   | 7/11/44                          | 72 hours

Our two main tasks on the MIMIC-III dataset are forecasting on time series of all rates and interpolation of the low-rate time series values, described as follows:
• Forecasting: Predict the future multivariate time series based on its history.
We predict the last 24 hours of the time series based on the first 48 hours of data.
• Interpolation: Fill in the low-rate time series based on co-evolving higher-rate time series data. We down-sample 8 features from MSR to LSR and then perform the interpolation task by up-sampling these 8 features back to MSR. We demonstrate both in-sample interpolation (i.e., interpolation within the training dataset) and out-sample interpolation (i.e., interpolation on the testing dataset).
In addition, we also compare all generative baseline methods in terms of their statistical inference results.
4.2.3.2 Baseline Descriptions
We compare MR-HDMM with several strong baselines on these two tasks. Additionally, to show the advantage of the learnable hierarchical latent structure and the auxiliary connections, we simplify MR-HDMM into two other models for comparison: (a) the multi-rate deep Markov model (MR-DMM), which removes the hierarchical structure in the latent space, and (b) the hierarchical deep Markov model (HDMM), which drops the auxiliary connections between the lower-rate time series and the higher latent layers. All other parts of the two models remain the same as MR-HDMM. The generation models of MR-DMM and HDMM are shown on the left and right of Figure 4.10, respectively.
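The forward and linear imputation used in Section 4.2.3.1 to complete each rate's series can be sketched as follows (None marks a missing entry; the helper names are hypothetical):

```python
def forward_impute(xs):
    """Replace each missing entry (None) with the last observed value."""
    out, last = [], None
    for v in xs:
        last = v if v is not None else last
        out.append(last)
    return out

def linear_impute(xs):
    """Linearly interpolate missing entries between observed neighbours."""
    out = list(xs)
    obs = [i for i, v in enumerate(xs) if v is not None]
    for a, b in zip(obs, obs[1:]):
        for i in range(a + 1, b):
            w = (i - a) / (b - a)
            out[i] = (1 - w) * xs[a] + w * xs[b]
    return out

series = [1.0, None, None, 4.0, None, 6.0]
print(forward_impute(series))                          # [1.0, 1.0, 1.0, 4.0, 4.0, 6.0]
print([round(v, 6) for v in linear_impute(series)])    # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```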
[Figure 4.10 omitted: generation models of MR-DMM (left) and HDMM (right).]
N8xw2k4UxSLktBWOriZ+65EqzWJ5a8YJDQQeSBYxgo2V7p962Wn+kHl5r1pz6+4UaJF4BalBgWav+tXtxyQVVBrCsdYdz01MkGFlGOE0r3RTTRNMRnhAO5ZKLKgOsunBOTqySh9FsbIlDZqqvycyLLQei9B2CmyGet6biP95ndRE50HGZJIaKslsUZRyZGI0+R71maLE 8LElmChmb0VkiBUmxmZUsSF48y8vEv+kflF3b05rjcsijTIcwCEcgwdn0IBraIIPBAQ8wyu8Ocp5cd6dj1lrySlm9uEPnM8fLqGQMQ==</latexit> x 1 5 x 2 5 x 3 5 x 3 1 x 2 1 <latexit sha1_base64="G3m/UR1RUtAa1iIxEKqXv36PKjM=">AAAB73icbVBNS8NAEJ34WetX1aOXxSJ4KkkR1FvRi8cKxlbaWDbbTbt0dxN2N2IJ+RVePKh49e9489+4bXPQ1gcDj/dmmJkXJpxp47rfztLyyuraemmjvLm1vbNb2du/03GqCPVJzGPVDrGmnEnqG2Y4bSeKYhFy2gpHVxO/9UiVZrG8NeOEBgIPJIsYwcZK90+9zMsfsnreq1TdmjsFWiReQapQoNmrfHX7MUkFlYZwrHXHcxMTZFgZRjjNy91U0wSTER7QjqUSC6qDbHpwjo6t0kdRrGxJg6bq74kMC63HIrSdApuhnvcm4n9eJzXReZAxmaSGSjJbFKUcmRhNvkd9pigxfGwJJorZWxEZYoWJsRmVbQje/MuLxK/XLmruzWm1cVmkUYJDOIIT8OAMGnANTfCBgIBneIU3RzkvzrvzMWtdcoqZA/gD5/MHK42QLw==</latexit> <latexit sha1_base64="G3m/UR1RUtAa1iIxEKqXv36PKjM=">AAAB73icbVBNS8NAEJ34WetX1aOXxSJ4KkkR1FvRi8cKxlbaWDbbTbt0dxN2N2IJ+RVePKh49e9489+4bXPQ1gcDj/dmmJkXJpxp47rfztLyyuraemmjvLm1vbNb2du/03GqCPVJzGPVDrGmnEnqG2Y4bSeKYhFy2gpHVxO/9UiVZrG8NeOEBgIPJIsYwcZK90+9zMsfsnreq1TdmjsFWiReQapQoNmrfHX7MUkFlYZwrHXHcxMTZFgZRjjNy91U0wSTER7QjqUSC6qDbHpwjo6t0kdRrGxJg6bq74kMC63HIrSdApuhnvcm4n9eJzXReZAxmaSGSjJbFKUcmRhNvkd9pigxfGwJJorZWxEZYoWJsRmVbQje/MuLxK/XLmruzWm1cVmkUYJDOIIT8OAMGnANTfCBgIBneIU3RzkvzrvzMWtdcoqZA/gD5/MHK42QLw==</latexit> <latexit sha1_base64="G3m/UR1RUtAa1iIxEKqXv36PKjM=">AAAB73icbVBNS8NAEJ34WetX1aOXxSJ4KkkR1FvRi8cKxlbaWDbbTbt0dxN2N2IJ+RVePKh49e9489+4bXPQ1gcDj/dmmJkXJpxp47rfztLyyuraemmjvLm1vbNb2du/03GqCPVJzGPVDrGmnEnqG2Y4bSeKYhFy2gpHVxO/9UiVZrG8NeOEBgIPJIsYwcZK90+9zMsfsnreq1TdmjsFWiReQapQoNmrfHX7MUkFlYZwrHXHcxMTZFgZRjjNy91U0wSTER7QjqUSC6qDbHpwjo6t0kdRrGxJg6bq74kMC63HIrSdApuhnvcm4n9eJzXReZAxmaSGSjJbFKUcmRhNvkd9pigxfGwJJorZWxEZYoWJsRmVbQje/MuLxK/XLmruzWm1cVmkUYJDOIIT8OAMGnANTfCBgIBneIU3RzkvzrvzMWtdcoqZA/gD5/MHK42QLw==</latexit> x 2 3 z 3 5 <latexit 
sha1_base64="xo/E3ptk+QaeVilWfelxQwJBNeI=">AAAB73icbVBNTwIxEJ3FL8Qv1KOXRmLiiSx+RL0RvXjExBUMrKRbutDQdjdt1wQ3+yu8eFDj1b/jzX9jgT0o+JJJXt6bycy8IOZMG9f9dgoLi0vLK8XV0tr6xuZWeXvnTkeJItQjEY9UK8Caciap Z5jhtBUrikXAaTMYXo395iNVmkXy1oxi6gvclyxkBBsr3T9109PsIT3OuuWKW3UnQPOklpMK5Gh0y1+dXkQSQaUhHGvdrrmx8VOsDCOcZqVOommMyRD3adtSiQXVfjo5OEMHVumhMFK2pEET9fdEioXWIxHYToHNQM96Y/E/r52Y8NxPmYwTQyWZLgoTjkyExt+jHlOU GD6yBBPF7K2IDLDCxNiMSjaE2uzL88Q7ql5U3ZuTSv0yT6MIe7APh1CDM6jDNTTAAwICnuEV3hzlvDjvzse0teDkM7vwB87nDzZHkDY=</latexit> <latexit sha1_base64="xo/E3ptk+QaeVilWfelxQwJBNeI=">AAAB73icbVBNTwIxEJ3FL8Qv1KOXRmLiiSx+RL0RvXjExBUMrKRbutDQdjdt1wQ3+yu8eFDj1b/jzX9jgT0o+JJJXt6bycy8IOZMG9f9dgoLi0vLK8XV0tr6xuZWeXvnTkeJItQjEY9UK8Caciap Z5jhtBUrikXAaTMYXo395iNVmkXy1oxi6gvclyxkBBsr3T9109PsIT3OuuWKW3UnQPOklpMK5Gh0y1+dXkQSQaUhHGvdrrmx8VOsDCOcZqVOommMyRD3adtSiQXVfjo5OEMHVumhMFK2pEET9fdEioXWIxHYToHNQM96Y/E/r52Y8NxPmYwTQyWZLgoTjkyExt+jHlOU GD6yBBPF7K2IDLDCxNiMSjaE2uzL88Q7ql5U3ZuTSv0yT6MIe7APh1CDM6jDNTTAAwICnuEV3hzlvDjvzse0teDkM7vwB87nDzZHkDY=</latexit> <latexit sha1_base64="xo/E3ptk+QaeVilWfelxQwJBNeI=">AAAB73icbVBNTwIxEJ3FL8Qv1KOXRmLiiSx+RL0RvXjExBUMrKRbutDQdjdt1wQ3+yu8eFDj1b/jzX9jgT0o+JJJXt6bycy8IOZMG9f9dgoLi0vLK8XV0tr6xuZWeXvnTkeJItQjEY9UK8Caciap Z5jhtBUrikXAaTMYXo395iNVmkXy1oxi6gvclyxkBBsr3T9109PsIT3OuuWKW3UnQPOklpMK5Gh0y1+dXkQSQaUhHGvdrrmx8VOsDCOcZqVOommMyRD3adtSiQXVfjo5OEMHVumhMFK2pEET9fdEioXWIxHYToHNQM96Y/E/r52Y8NxPmYwTQyWZLgoTjkyExt+jHlOU GD6yBBPF7K2IDLDCxNiMSjaE2uzL88Q7ql5U3ZuTSv0yT6MIe7APh1CDM6jDNTTAAwICnuEV3hzlvDjvzse0teDkM7vwB87nDzZHkDY=</latexit> z 3 4 <latexit sha1_base64="Ddk1wsUUpdGdorNP0fQBk4UuLfw=">AAAB73icbVBNSwMxEJ2tX7V+VT16CRbBU9nVgnorevFYwbWVdi3ZNNuGJtklyQp12V/hxYOKV/+ON/+N6cdBWx8MPN6bYWZemHCmjet+O4Wl5ZXVteJ6aWNza3unvLt3p+NUEeqTmMeqFWJNOZPU N8xw2koUxSLktBkOr8Z+85EqzWJ5a0YJDQTuSxYxgo2V7p+6WS1/yE7zbrniVt0J0CLxZqQCMzS65a9OLyapoNIQjrVue25iggwrwwineamTappgMsR92rZUYkF1kE0OztGRVXooipUtadBE/T2RYaH1SIS2U2Az0PPeWPzPa6cmOg8yJpPUUEmmi6KUIxOj8feoxxQl 
ho8swUQxeysiA6wwMTajkg3Bm395kfgn1Yuqe1Or1C9naRThAA7hGDw4gzpcQwN8ICDgGV7hzVHOi/PufExbC85sZh/+wPn8ATS/kDU=</latexit> <latexit sha1_base64="Ddk1wsUUpdGdorNP0fQBk4UuLfw=">AAAB73icbVBNSwMxEJ2tX7V+VT16CRbBU9nVgnorevFYwbWVdi3ZNNuGJtklyQp12V/hxYOKV/+ON/+N6cdBWx8MPN6bYWZemHCmjet+O4Wl5ZXVteJ6aWNza3unvLt3p+NUEeqTmMeqFWJNOZPU N8xw2koUxSLktBkOr8Z+85EqzWJ5a0YJDQTuSxYxgo2V7p+6WS1/yE7zbrniVt0J0CLxZqQCMzS65a9OLyapoNIQjrVue25iggwrwwineamTappgMsR92rZUYkF1kE0OztGRVXooipUtadBE/T2RYaH1SIS2U2Az0PPeWPzPa6cmOg8yJpPUUEmmi6KUIxOj8feoxxQl ho8swUQxeysiA6wwMTajkg3Bm395kfgn1Yuqe1Or1C9naRThAA7hGDw4gzpcQwN8ICDgGV7hzVHOi/PufExbC85sZh/+wPn8ATS/kDU=</latexit> <latexit sha1_base64="Ddk1wsUUpdGdorNP0fQBk4UuLfw=">AAAB73icbVBNSwMxEJ2tX7V+VT16CRbBU9nVgnorevFYwbWVdi3ZNNuGJtklyQp12V/hxYOKV/+ON/+N6cdBWx8MPN6bYWZemHCmjet+O4Wl5ZXVteJ6aWNza3unvLt3p+NUEeqTmMeqFWJNOZPU N8xw2koUxSLktBkOr8Z+85EqzWJ5a0YJDQTuSxYxgo2V7p+6WS1/yE7zbrniVt0J0CLxZqQCMzS65a9OLyapoNIQjrVue25iggwrwwineamTappgMsR92rZUYkF1kE0OztGRVXooipUtadBE/T2RYaH1SIS2U2Az0PPeWPzPa6cmOg8yJpPUUEmmi6KUIxOj8feoxxQl ho8swUQxeysiA6wwMTajkg3Bm395kfgn1Yuqe1Or1C9naRThAA7hGDw4gzpcQwN8ICDgGV7hzVHOi/PufExbC85sZh/+wPn8ATS/kDU=</latexit> z 3 3 <latexit sha1_base64="i1uYpfDc4ed45wYwxlYYKFN7NCc=">AAAB73icbVBNTwIxEJ3FL8Qv1KOXRmLiiewqiXojevGIiSsYWEm3dKGh7W7arglu+BVePKjx6t/x5r+xwB4UfMkkL+/NZGZemHCmjet+O4Wl5ZXVteJ6aWNza3unvLt3p+NUEeqTmMeqFWJNOZPUN8xw2koUxSLktBkOryZ+85EqzWJ5a0YJDQTuSxYxgo2V7p+62en4wVa3XHGr7hRokXg5qUCORrf81enFJBVUGsKx1m3PTUyQYWUY4XRc6qSaJpgMcZ+2LZVYUB1k04PH6MgqPRTFypY0aKr+nsiw0HokQtspsBnoeW8i/ue1UxOdBxmTSWqoJLNFUcqRidHke9RjihLD R5Zgopi9FZEBVpgYm1HJhuDNv7xI/JPqRdW9qVXql3kaRTiAQzgGD86gDtfQAB8ICHiGV3hzlPPivDsfs9aCk8/swx84nz8zN5A0</latexit> <latexit 
sha1_base64="i1uYpfDc4ed45wYwxlYYKFN7NCc=">AAAB73icbVBNTwIxEJ3FL8Qv1KOXRmLiiewqiXojevGIiSsYWEm3dKGh7W7arglu+BVePKjx6t/x5r+xwB4UfMkkL+/NZGZemHCmjet+O4Wl5ZXVteJ6aWNza3unvLt3p+NUEeqTmMeqFWJNOZPUN8xw2koUxSLktBkOryZ+85EqzWJ5a0YJDQTuSxYxgo2V7p+62en4wVa3XHGr7hRokXg5qUCORrf81enFJBVUGsKx1m3PTUyQYWUY4XRc6qSaJpgMcZ+2LZVYUB1k04PH6MgqPRTFypY0aKr+nsiw0HokQtspsBnoeW8i/ue1UxOdBxmTSWqoJLNFUcqRidHke9RjihLD R5Zgopi9FZEBVpgYm1HJhuDNv7xI/JPqRdW9qVXql3kaRTiAQzgGD86gDtfQAB8ICHiGV3hzlPPivDsfs9aCk8/swx84nz8zN5A0</latexit> <latexit sha1_base64="i1uYpfDc4ed45wYwxlYYKFN7NCc=">AAAB73icbVBNTwIxEJ3FL8Qv1KOXRmLiiewqiXojevGIiSsYWEm3dKGh7W7arglu+BVePKjx6t/x5r+xwB4UfMkkL+/NZGZemHCmjet+O4Wl5ZXVteJ6aWNza3unvLt3p+NUEeqTmMeqFWJNOZPUN8xw2koUxSLktBkOryZ+85EqzWJ5a0YJDQTuSxYxgo2V7p+62en4wVa3XHGr7hRokXg5qUCORrf81enFJBVUGsKx1m3PTUyQYWUY4XRc6qSaJpgMcZ+2LZVYUB1k04PH6MgqPRTFypY0aKr+nsiw0HokQtspsBnoeW8i/ue1UxOdBxmTSWqoJLNFUcqRidHke9RjihLD R5Zgopi9FZEBVpgYm1HJhuDNv7xI/JPqRdW9qVXql3kaRTiAQzgGD86gDtfQAB8ICHiGV3hzlPPivDsfs9aCk8/swx84nz8zN5A0</latexit> z 3 2 <latexit sha1_base64="m3Jvsvgon0SW2+ZapQttsZmXaeE=">AAAB73icbVBNTwIxEJ3FL8Qv1KOXRmLiiSxoot6IXjxi4goGVtItXWhou5u2a4Kb/RVePKjx6t/x5r+xwB4UfMkkL+/NZGZeEHOmjet+O4Wl5ZXVteJ6aWNza3unvLt3p6NEEeqRiEeqHWBNOZPUM8xw2o4VxSLgtBWMriZ+65EqzSJ5a8Yx9QUeSBYygo2V7p96aT17SE+yXrniVt0p0CKp5aQCOZq98le3H5FEUGkIx1p3am5s/BQrwwinWambaBpjMsID2rFUYkG1n04PztCRVfoojJQtadBU/T2RYqH1WAS2U2Az1PPeRPzP6yQmPPdTJuPEUElmi8KEIxOhyfeozxQl ho8twUQxeysiQ6wwMTajkg2hNv/yIvHq1Yuqe3NaaVzmaRThAA7hGGpwBg24hiZ4QEDAM7zCm6OcF+fd+Zi1Fpx8Zh/+wPn8ATGvkDM=</latexit> <latexit sha1_base64="m3Jvsvgon0SW2+ZapQttsZmXaeE=">AAAB73icbVBNTwIxEJ3FL8Qv1KOXRmLiiSxoot6IXjxi4goGVtItXWhou5u2a4Kb/RVePKjx6t/x5r+xwB4UfMkkL+/NZGZeEHOmjet+O4Wl5ZXVteJ6aWNza3unvLt3p6NEEeqRiEeqHWBNOZPUM8xw2o4VxSLgtBWMriZ+65EqzSJ5a8Yx9QUeSBYygo2V7p96aT17SE+yXrniVt0p0CKp5aQCOZq98le3H5FEUGkIx1p3am5s/BQrwwinWambaBpjMsID2rFUYkG1n04PztCRVfoojJQtadBU/T2RYqH1WAS2U2Az1PPeRPzP6yQmPPdTJuPEUElmi8KEIxOhyfeozxQl 
ho8twUQxeysiQ6wwMTajkg2hNv/yIvHq1Yuqe3NaaVzmaRThAA7hGGpwBg24hiZ4QEDAM7zCm6OcF+fd+Zi1Fpx8Zh/+wPn8ATGvkDM=</latexit> <latexit sha1_base64="m3Jvsvgon0SW2+ZapQttsZmXaeE=">AAAB73icbVBNTwIxEJ3FL8Qv1KOXRmLiiSxoot6IXjxi4goGVtItXWhou5u2a4Kb/RVePKjx6t/x5r+xwB4UfMkkL+/NZGZeEHOmjet+O4Wl5ZXVteJ6aWNza3unvLt3p6NEEeqRiEeqHWBNOZPUM8xw2o4VxSLgtBWMriZ+65EqzSJ5a8Yx9QUeSBYygo2V7p96aT17SE+yXrniVt0p0CKp5aQCOZq98le3H5FEUGkIx1p3am5s/BQrwwinWambaBpjMsID2rFUYkG1n04PztCRVfoojJQtadBU/T2RYqH1WAS2U2Az1PPeRPzP6yQmPPdTJuPEUElmi8KEIxOhyfeozxQl ho8twUQxeysiQ6wwMTajkg2hNv/yIvHq1Yuqe3NaaVzmaRThAA7hGGpwBg24hiZ4QEDAM7zCm6OcF+fd+Zi1Fpx8Zh/+wPn8ATGvkDM=</latexit> z 3 1 z 2 1 z 2 2 z 2 3 <latexit sha1_base64="9XdrESdfp8cYMi1TvEaWZTfR8F0=">AAAB73icbVBNTwIxEJ3FL8Qv1KOXRmLiiSxoot6IXjxi4goGVtItXWhou5u2a4Kb/RVePKjx6t/x5r+xwB4UfMkkL+/NZGZeEHOmjet+O4Wl5ZXVteJ6aWNza3unvLt3p6NEEeqRiEeqHWBNOZPUM8xw2o4VxSLgtBWMriZ+65EqzSJ5a8Yx9QUeSBYygo2V7p966Un2kNazXrniVt0p0CKp5aQCOZq98le3H5FEUGkIx1p3am5s/BQrwwinWambaBpjMsID2rFUYkG1n04PztCRVfoojJQtadBU/T2RYqH1WAS2U2Az1PPeRPzP6yQmPPdTJuPEUElmi8KEIxOhyfeozxQl ho8twUQxeysiQ6wwMTajkg2hNv/yIvHq1Yuqe3NaaVzmaRThAA7hGGpwBg24hiZ4QEDAM7zCm6OcF+fd+Zi1Fpx8Zh/+wPn8ATGzkDM=</latexit> <latexit sha1_base64="9XdrESdfp8cYMi1TvEaWZTfR8F0=">AAAB73icbVBNTwIxEJ3FL8Qv1KOXRmLiiSxoot6IXjxi4goGVtItXWhou5u2a4Kb/RVePKjx6t/x5r+xwB4UfMkkL+/NZGZeEHOmjet+O4Wl5ZXVteJ6aWNza3unvLt3p6NEEeqRiEeqHWBNOZPUM8xw2o4VxSLgtBWMriZ+65EqzSJ5a8Yx9QUeSBYygo2V7p966Un2kNazXrniVt0p0CKp5aQCOZq98le3H5FEUGkIx1p3am5s/BQrwwinWambaBpjMsID2rFUYkG1n04PztCRVfoojJQtadBU/T2RYqH1WAS2U2Az1PPeRPzP6yQmPPdTJuPEUElmi8KEIxOhyfeozxQl ho8twUQxeysiQ6wwMTajkg2hNv/yIvHq1Yuqe3NaaVzmaRThAA7hGGpwBg24hiZ4QEDAM7zCm6OcF+fd+Zi1Fpx8Zh/+wPn8ATGzkDM=</latexit> <latexit 
sha1_base64="9XdrESdfp8cYMi1TvEaWZTfR8F0=">AAAB73icbVBNTwIxEJ3FL8Qv1KOXRmLiiSxoot6IXjxi4goGVtItXWhou5u2a4Kb/RVePKjx6t/x5r+xwB4UfMkkL+/NZGZeEHOmjet+O4Wl5ZXVteJ6aWNza3unvLt3p6NEEeqRiEeqHWBNOZPUM8xw2o4VxSLgtBWMriZ+65EqzSJ5a8Yx9QUeSBYygo2V7p966Un2kNazXrniVt0p0CKp5aQCOZq98le3H5FEUGkIx1p3am5s/BQrwwinWambaBpjMsID2rFUYkG1n04PztCRVfoojJQtadBU/T2RYqH1WAS2U2Az1PPeRPzP6yQmPPdTJuPEUElmi8KEIxOhyfeozxQl ho8twUQxeysiQ6wwMTajkg2hNv/yIvHq1Yuqe3NaaVzmaRThAA7hGGpwBg24hiZ4QEDAM7zCm6OcF+fd+Zi1Fpx8Zh/+wPn8ATGzkDM=</latexit> z 2 4 z 2 5 z 1 5 z 1 4 <latexit sha1_base64="VlqRj/VmgQsEenEgUFbg5hw/LEw=">AAAB73icbVBNS8NAEJ3Ur1q/qh69LBbBU0lEUG9FLx4rGFtpY9lsN+3S3U3Y3Qg15Fd48aDi1b/jzX/jts1BWx8MPN6bYWZemHCmjet+O6Wl5ZXVtfJ6ZWNza3unurt3p+NUEeqTmMeqHWJNOZPU N8xw2k4UxSLktBWOriZ+65EqzWJ5a8YJDQQeSBYxgo2V7p962Wn+kHl5r1pz6+4UaJF4BalBgWav+tXtxyQVVBrCsdYdz01MkGFlGOE0r3RTTRNMRnhAO5ZKLKgOsunBOTqySh9FsbIlDZqqvycyLLQei9B2CmyGet6biP95ndRE50HGZJIaKslsUZRyZGI0+R71maLE 8LElmChmb0VkiBUmxmZUsSF48y8vEv+kflF3b05rjcsijTIcwCEcgwdn0IBraIIPBAQ8wyu8Ocp5cd6dj1lrySlm9uEPnM8fMbeQMw==</latexit> <latexit sha1_base64="VlqRj/VmgQsEenEgUFbg5hw/LEw=">AAAB73icbVBNS8NAEJ3Ur1q/qh69LBbBU0lEUG9FLx4rGFtpY9lsN+3S3U3Y3Qg15Fd48aDi1b/jzX/jts1BWx8MPN6bYWZemHCmjet+O6Wl5ZXVtfJ6ZWNza3unurt3p+NUEeqTmMeqHWJNOZPU N8xw2k4UxSLktBWOriZ+65EqzWJ5a8YJDQQeSBYxgo2V7p962Wn+kHl5r1pz6+4UaJF4BalBgWav+tXtxyQVVBrCsdYdz01MkGFlGOE0r3RTTRNMRnhAO5ZKLKgOsunBOTqySh9FsbIlDZqqvycyLLQei9B2CmyGet6biP95ndRE50HGZJIaKslsUZRyZGI0+R71maLE 8LElmChmb0VkiBUmxmZUsSF48y8vEv+kflF3b05rjcsijTIcwCEcgwdn0IBraIIPBAQ8wyu8Ocp5cd6dj1lrySlm9uEPnM8fMbeQMw==</latexit> <latexit sha1_base64="VlqRj/VmgQsEenEgUFbg5hw/LEw=">AAAB73icbVBNS8NAEJ3Ur1q/qh69LBbBU0lEUG9FLx4rGFtpY9lsN+3S3U3Y3Qg15Fd48aDi1b/jzX/jts1BWx8MPN6bYWZemHCmjet+O6Wl5ZXVtfJ6ZWNza3unurt3p+NUEeqTmMeqHWJNOZPU N8xw2k4UxSLktBWOriZ+65EqzWJ5a8YJDQQeSBYxgo2V7p962Wn+kHl5r1pz6+4UaJF4BalBgWav+tXtxyQVVBrCsdYdz01MkGFlGOE0r3RTTRNMRnhAO5ZKLKgOsunBOTqySh9FsbIlDZqqvycyLLQei9B2CmyGet6biP95ndRE50HGZJIaKslsUZRyZGI0+R71maLE 
8LElmChmb0VkiBUmxmZUsSF48y8vEv+kflF3b05rjcsijTIcwCEcgwdn0IBraIIPBAQ8wyu8Ocp5cd6dj1lrySlm9uEPnM8fMbeQMw==</latexit> z 1 3 z 1 2 <latexit sha1_base64="O/VLvuvmxm+Xdo1dLHt2aRfZCqs=">AAAB73icbVBNS8NAEJ34WetX1aOXxSJ4KkkR1FvRi8cKxlbaWDbbTbt0dxN2N0IN+RVePKh49e9489+4bXPQ1gcDj/dmmJkXJpxp47rfztLyyuraemmjvLm1vbNb2du/03GqCPVJzGPVDrGmnEnqG2Y4bSeKYhFy2gpHVxO/9UiVZrG8NeOEBgIPJIsYwcZK90+9rJ4/ZF7eq1TdmjsFWiReQapQoNmrfHX7MUkFlYZwrHXHcxMTZFgZRjjNy91U0wSTER7QjqUSC6qDbHpwjo6t0kdRrGxJg6bq74kMC63HIrSdApuhnvcm4n9eJzXReZAxmaSGSjJbFKUcmRhNvkd9pigx fGwJJorZWxEZYoWJsRmVbQje/MuLxK/XLmruzWm1cVmkUYJDOIIT8OAMGnANTfCBgIBneIU3RzkvzrvzMWtdcoqZA/gD5/MHLqeQMQ==</latexit> <latexit sha1_base64="O/VLvuvmxm+Xdo1dLHt2aRfZCqs=">AAAB73icbVBNS8NAEJ34WetX1aOXxSJ4KkkR1FvRi8cKxlbaWDbbTbt0dxN2N0IN+RVePKh49e9489+4bXPQ1gcDj/dmmJkXJpxp47rfztLyyuraemmjvLm1vbNb2du/03GqCPVJzGPVDrGmnEnqG2Y4bSeKYhFy2gpHVxO/9UiVZrG8NeOEBgIPJIsYwcZK90+9rJ4/ZF7eq1TdmjsFWiReQapQoNmrfHX7MUkFlYZwrHXHcxMTZFgZRjjNy91U0wSTER7QjqUSC6qDbHpwjo6t0kdRrGxJg6bq74kMC63HIrSdApuhnvcm4n9eJzXReZAxmaSGSjJbFKUcmRhNvkd9pigx fGwJJorZWxEZYoWJsRmVbQje/MuLxK/XLmruzWm1cVmkUYJDOIIT8OAMGnANTfCBgIBneIU3RzkvzrvzMWtdcoqZA/gD5/MHLqeQMQ==</latexit> <latexit sha1_base64="O/VLvuvmxm+Xdo1dLHt2aRfZCqs=">AAAB73icbVBNS8NAEJ34WetX1aOXxSJ4KkkR1FvRi8cKxlbaWDbbTbt0dxN2N0IN+RVePKh49e9489+4bXPQ1gcDj/dmmJkXJpxp47rfztLyyuraemmjvLm1vbNb2du/03GqCPVJzGPVDrGmnEnqG2Y4bSeKYhFy2gpHVxO/9UiVZrG8NeOEBgIPJIsYwcZK90+9rJ4/ZF7eq1TdmjsFWiReQapQoNmrfHX7MUkFlYZwrHXHcxMTZFgZRjjNy91U0wSTER7QjqUSC6qDbHpwjo6t0kdRrGxJg6bq74kMC63HIrSdApuhnvcm4n9eJzXReZAxmaSGSjJbFKUcmRhNvkd9pigx fGwJJorZWxEZYoWJsRmVbQje/MuLxK/XLmruzWm1cVmkUYJDOIIT8OAMGnANTfCBgIBneIU3RzkvzrvzMWtdcoqZA/gD5/MHLqeQMQ==</latexit> z 1 1 x 1 1 <latexit 
sha1_base64="EStZdaUJfryJ9i2RU7Qci1owMMw=">AAAB73icbVBNS8NAEJ3Ur1q/qh69LBbBU0lEUG9FLx4rGFtpY9lsN+3S3U3Y3Ygl5Fd48aDi1b/jzX/jts1BWx8MPN6bYWZemHCmjet+O6Wl5ZXVtfJ6ZWNza3unurt3p+NUEeqTmMeqHWJNOZPUN8xw2k4UxSLktBWOriZ+65EqzWJ5a8YJDQQeSBYxgo2V7p96mZc/2OpVa27dnQItEq8gNSjQ7FW/uv2YpIJKQzjWuuO5iQkyrAwjnOaVbqppgskID2jHUokF1UE2PThHR1bpoyhWtqRBU/X3RIaF1mMR2k6BzVDPexPxP6+Tmug8yJhMUkMlmS2KUo5MjCbfoz5TlBg+tgQTxeytiAyxwsTYjCo2BG/+5UXin9Qv6u7Naa1xWaRRhgM4hGPw4AwacA1N8IGAgGd4hTdHOS/Ou/Mxay05xcw+/IHz+QMqCZAu</latexit> <latexit sha1_base64="EStZdaUJfryJ9i2RU7Qci1owMMw=">AAAB73icbVBNS8NAEJ3Ur1q/qh69LBbBU0lEUG9FLx4rGFtpY9lsN+3S3U3Y3Ygl5Fd48aDi1b/jzX/jts1BWx8MPN6bYWZemHCmjet+O6Wl5ZXVtfJ6ZWNza3unurt3p+NUEeqTmMeqHWJNOZPUN8xw2k4UxSLktBWOriZ+65EqzWJ5a8YJDQQeSBYxgo2V7p96mZc/2OpVa27dnQItEq8gNSjQ7FW/uv2YpIJKQzjWuuO5iQkyrAwjnOaVbqppgskID2jHUokF1UE2PThHR1bpoyhWtqRBU/X3RIaF1mMR2k6BzVDPexPxP6+Tmug8yJhMUkMlmS2KUo5MjCbfoz5TlBg+tgQTxeytiAyxwsTYjCo2BG/+5UXin9Qv6u7Naa1xWaRRhgM4hGPw4AwacA1N8IGAgGd4hTdHOS/Ou/Mxay05xcw+/IHz+QMqCZAu</latexit> <latexit sha1_base64="EStZdaUJfryJ9i2RU7Qci1owMMw=">AAAB73icbVBNS8NAEJ3Ur1q/qh69LBbBU0lEUG9FLx4rGFtpY9lsN+3S3U3Y3Ygl5Fd48aDi1b/jzX/jts1BWx8MPN6bYWZemHCmjet+O6Wl5ZXVtfJ6ZWNza3unurt3p+NUEeqTmMeqHWJNOZPUN8xw2k4UxSLktBWOriZ+65EqzWJ5a8YJDQQeSBYxgo2V7p96mZc/2OpVa27dnQItEq8gNSjQ7FW/uv2YpIJKQzjWuuO5iQkyrAwjnOaVbqppgskID2jHUokF1UE2PThHR1bpoyhWtqRBU/X3RIaF1mMR2k6BzVDPexPxP6+Tmug8yJhMUkMlmS2KUo5MjCbfoz5TlBg+tgQTxeytiAyxwsTYjCo2BG/+5UXin9Qv6u7Naa1xWaRRhgM4hGPw4AwacA1N8IGAgGd4hTdHOS/Ou/Mxay05xcw+/IHz+QMqCZAu</latexit> x 1 2 <latexit sha1_base64="VrLPiSrAzYSUT45RPsMOAO+Li3w=">AAAB73icbVBNS8NAEJ34WetX1aOXxSJ4KkkR1FvRi8cKxlbaWDbbTbt0dxN2N2IJ+RVePKh49e9489+4bXPQ1gcDj/dmmJkXJpxp47rfztLyyuraemmjvLm1vbNb2du/03GqCPVJzGPVDrGmnEnqG2Y4bSeKYhFy2gpHVxO/9UiVZrG8NeOEBgIPJIsYwcZK90+9rJ4/ZF7eq1TdmjsFWiReQapQoNmrfHX7MUkFlYZwrHXHcxMTZFgZRjjNy91U0wSTER7QjqUSC6qDbHpwjo6t0kdRrGxJg6bq74kMC63HIrSdApuhnvcm4n9eJzXReZAxmaSGSjJbFKUcmRhNvkd9pigx 
fGwJJorZWxEZYoWJsRmVbQje/MuLxK/XLmruzWm1cVmkUYJDOIIT8OAMGnANTfCBgIBneIU3RzkvzrvzMWtdcoqZA/gD5/MHK5GQLw==</latexit> <latexit sha1_base64="VrLPiSrAzYSUT45RPsMOAO+Li3w=">AAAB73icbVBNS8NAEJ34WetX1aOXxSJ4KkkR1FvRi8cKxlbaWDbbTbt0dxN2N2IJ+RVePKh49e9489+4bXPQ1gcDj/dmmJkXJpxp47rfztLyyuraemmjvLm1vbNb2du/03GqCPVJzGPVDrGmnEnqG2Y4bSeKYhFy2gpHVxO/9UiVZrG8NeOEBgIPJIsYwcZK90+9rJ4/ZF7eq1TdmjsFWiReQapQoNmrfHX7MUkFlYZwrHXHcxMTZFgZRjjNy91U0wSTER7QjqUSC6qDbHpwjo6t0kdRrGxJg6bq74kMC63HIrSdApuhnvcm4n9eJzXReZAxmaSGSjJbFKUcmRhNvkd9pigx fGwJJorZWxEZYoWJsRmVbQje/MuLxK/XLmruzWm1cVmkUYJDOIIT8OAMGnANTfCBgIBneIU3RzkvzrvzMWtdcoqZA/gD5/MHK5GQLw==</latexit> <latexit sha1_base64="VrLPiSrAzYSUT45RPsMOAO+Li3w=">AAAB73icbVBNS8NAEJ34WetX1aOXxSJ4KkkR1FvRi8cKxlbaWDbbTbt0dxN2N2IJ+RVePKh49e9489+4bXPQ1gcDj/dmmJkXJpxp47rfztLyyuraemmjvLm1vbNb2du/03GqCPVJzGPVDrGmnEnqG2Y4bSeKYhFy2gpHVxO/9UiVZrG8NeOEBgIPJIsYwcZK90+9rJ4/ZF7eq1TdmjsFWiReQapQoNmrfHX7MUkFlYZwrHXHcxMTZFgZRjjNy91U0wSTER7QjqUSC6qDbHpwjo6t0kdRrGxJg6bq74kMC63HIrSdApuhnvcm4n9eJzXReZAxmaSGSjJbFKUcmRhNvkd9pigx fGwJJorZWxEZYoWJsRmVbQje/MuLxK/XLmruzWm1cVmkUYJDOIIT8OAMGnANTfCBgIBneIU3RzkvzrvzMWtdcoqZA/gD5/MHK5GQLw==</latexit> x 1 3 x 1 4 <latexit sha1_base64="k1kLVZRqhvvH9GxRlKQiu39y/xo=">AAAB73icbVBNS8NAEJ3Ur1q/qh69LBbBU0lEUG9FLx4rGFtpY9lsN+3S3U3Y3Ygl5Fd48aDi1b/jzX/jts1BWx8MPN6bYWZemHCmjet+O6Wl5ZXVtfJ6ZWNza3unurt3p+NUEeqTmMeqHWJNOZPU N8xw2k4UxSLktBWOriZ+65EqzWJ5a8YJDQQeSBYxgo2V7p962Wn+kHl5r1pz6+4UaJF4BalBgWav+tXtxyQVVBrCsdYdz01MkGFlGOE0r3RTTRNMRnhAO5ZKLKgOsunBOTqySh9FsbIlDZqqvycyLLQei9B2CmyGet6biP95ndRE50HGZJIaKslsUZRyZGI0+R71maLE 8LElmChmb0VkiBUmxmZUsSF48y8vEv+kflF3b05rjcsijTIcwCEcgwdn0IBraIIPBAQ8wyu8Ocp5cd6dj1lrySlm9uEPnM8fLqGQMQ==</latexit> <latexit sha1_base64="k1kLVZRqhvvH9GxRlKQiu39y/xo=">AAAB73icbVBNS8NAEJ3Ur1q/qh69LBbBU0lEUG9FLx4rGFtpY9lsN+3S3U3Y3Ygl5Fd48aDi1b/jzX/jts1BWx8MPN6bYWZemHCmjet+O6Wl5ZXVtfJ6ZWNza3unurt3p+NUEeqTmMeqHWJNOZPU 
N8xw2k4UxSLktBWOriZ+65EqzWJ5a8YJDQQeSBYxgo2V7p962Wn+kHl5r1pz6+4UaJF4BalBgWav+tXtxyQVVBrCsdYdz01MkGFlGOE0r3RTTRNMRnhAO5ZKLKgOsunBOTqySh9FsbIlDZqqvycyLLQei9B2CmyGet6biP95ndRE50HGZJIaKslsUZRyZGI0+R71maLE 8LElmChmb0VkiBUmxmZUsSF48y8vEv+kflF3b05rjcsijTIcwCEcgwdn0IBraIIPBAQ8wyu8Ocp5cd6dj1lrySlm9uEPnM8fLqGQMQ==</latexit> <latexit sha1_base64="k1kLVZRqhvvH9GxRlKQiu39y/xo=">AAAB73icbVBNS8NAEJ3Ur1q/qh69LBbBU0lEUG9FLx4rGFtpY9lsN+3S3U3Y3Ygl5Fd48aDi1b/jzX/jts1BWx8MPN6bYWZemHCmjet+O6Wl5ZXVtfJ6ZWNza3unurt3p+NUEeqTmMeqHWJNOZPU N8xw2k4UxSLktBWOriZ+65EqzWJ5a8YJDQQeSBYxgo2V7p962Wn+kHl5r1pz6+4UaJF4BalBgWav+tXtxyQVVBrCsdYdz01MkGFlGOE0r3RTTRNMRnhAO5ZKLKgOsunBOTqySh9FsbIlDZqqvycyLLQei9B2CmyGet6biP95ndRE50HGZJIaKslsUZRyZGI0+R71maLE 8LElmChmb0VkiBUmxmZUsSF48y8vEv+kflF3b05rjcsijTIcwCEcgwdn0IBraIIPBAQ8wyu8Ocp5cd6dj1lrySlm9uEPnM8fLqGQMQ==</latexit> x 1 5 x 2 5 x 3 5 x 3 1 x 2 1 <latexit sha1_base64="G3m/UR1RUtAa1iIxEKqXv36PKjM=">AAAB73icbVBNS8NAEJ34WetX1aOXxSJ4KkkR1FvRi8cKxlbaWDbbTbt0dxN2N2IJ+RVePKh49e9489+4bXPQ1gcDj/dmmJkXJpxp47rfztLyyuraemmjvLm1vbNb2du/03GqCPVJzGPVDrGmnEnqG2Y4bSeKYhFy2gpHVxO/9UiVZrG8NeOEBgIPJIsYwcZK90+9zMsfsnreq1TdmjsFWiReQapQoNmrfHX7MUkFlYZwrHXHcxMTZFgZRjjNy91U0wSTER7QjqUSC6qDbHpwjo6t0kdRrGxJg6bq74kMC63HIrSdApuhnvcm4n9eJzXReZAxmaSGSjJbFKUcmRhNvkd9pigx fGwJJorZWxEZYoWJsRmVbQje/MuLxK/XLmruzWm1cVmkUYJDOIIT8OAMGnANTfCBgIBneIU3RzkvzrvzMWtdcoqZA/gD5/MHK42QLw==</latexit> <latexit sha1_base64="G3m/UR1RUtAa1iIxEKqXv36PKjM=">AAAB73icbVBNS8NAEJ34WetX1aOXxSJ4KkkR1FvRi8cKxlbaWDbbTbt0dxN2N2IJ+RVePKh49e9489+4bXPQ1gcDj/dmmJkXJpxp47rfztLyyuraemmjvLm1vbNb2du/03GqCPVJzGPVDrGmnEnqG2Y4bSeKYhFy2gpHVxO/9UiVZrG8NeOEBgIPJIsYwcZK90+9zMsfsnreq1TdmjsFWiReQapQoNmrfHX7MUkFlYZwrHXHcxMTZFgZRjjNy91U0wSTER7QjqUSC6qDbHpwjo6t0kdRrGxJg6bq74kMC63HIrSdApuhnvcm4n9eJzXReZAxmaSGSjJbFKUcmRhNvkd9pigx fGwJJorZWxEZYoWJsRmVbQje/MuLxK/XLmruzWm1cVmkUYJDOIIT8OAMGnANTfCBgIBneIU3RzkvzrvzMWtdcoqZA/gD5/MHK42QLw==</latexit> <latexit 
sha1_base64="G3m/UR1RUtAa1iIxEKqXv36PKjM=">AAAB73icbVBNS8NAEJ34WetX1aOXxSJ4KkkR1FvRi8cKxlbaWDbbTbt0dxN2N2IJ+RVePKh49e9489+4bXPQ1gcDj/dmmJkXJpxp47rfztLyyuraemmjvLm1vbNb2du/03GqCPVJzGPVDrGmnEnqG2Y4bSeKYhFy2gpHVxO/9UiVZrG8NeOEBgIPJIsYwcZK90+9zMsfsnreq1TdmjsFWiReQapQoNmrfHX7MUkFlYZwrHXHcxMTZFgZRjjNy91U0wSTER7QjqUSC6qDbHpwjo6t0kdRrGxJg6bq74kMC63HIrSdApuhnvcm4n9eJzXReZAxmaSGSjJbFKUcmRhNvkd9pigx fGwJJorZWxEZYoWJsRmVbQje/MuLxK/XLmruzWm1cVmkUYJDOIIT8OAMGnANTfCBgIBneIU3RzkvzrvzMWtdcoqZA/gD5/MHK42QLw==</latexit> x 2 3 Latent variable z <latexit sha1_base64="WIlbTbBFLLcqOvt81zBc03GagJU=">AAAB53icbVBNS8NAEJ3Ur1q/qh69LBbBU0lFUG9FLx5bMLbQhrLZTtq1m03Y3Qg19Bd48aDi1b/kzX/jts1BWx8MPN6bYWZekAiujet+O4WV1bX1jeJmaWt7Z3evvH9wr+NUMfRYLGLVDqhGwSV6 hhuB7UQhjQKBrWB0M/Vbj6g0j+WdGSfoR3QgecgZNVZqPvXKFbfqzkCWSS0nFcjR6JW/uv2YpRFKwwTVulNzE+NnVBnOBE5K3VRjQtmIDrBjqaQRaj+bHTohJ1bpkzBWtqQhM/X3REYjrcdRYDsjaoZ60ZuK/3md1ISXfsZlkhqUbL4oTAUxMZl+TfpcITNibAllittbCRtSRZmx2ZRsCLXFl5eJd1a9qrrN80r9Ok+jCEdwDKdQgwuowy00wAMGCM/wCm/Og/PivDsf89aCk88cwh84nz9XU4zR</latexit> <latexit sha1_base64="WIlbTbBFLLcqOvt81zBc03GagJU=">AAAB53icbVBNS8NAEJ3Ur1q/qh69LBbBU0lFUG9FLx5bMLbQhrLZTtq1m03Y3Qg19Bd48aDi1b/kzX/jts1BWx8MPN6bYWZekAiujet+O4WV1bX1jeJmaWt7Z3evvH9wr+NUMfRYLGLVDqhGwSV6 hhuB7UQhjQKBrWB0M/Vbj6g0j+WdGSfoR3QgecgZNVZqPvXKFbfqzkCWSS0nFcjR6JW/uv2YpRFKwwTVulNzE+NnVBnOBE5K3VRjQtmIDrBjqaQRaj+bHTohJ1bpkzBWtqQhM/X3REYjrcdRYDsjaoZ60ZuK/3md1ISXfsZlkhqUbL4oTAUxMZl+TfpcITNibAllittbCRtSRZmx2ZRsCLXFl5eJd1a9qrrN80r9Ok+jCEdwDKdQgwuowy00wAMGCM/wCm/Og/PivDsf89aCk88cwh84nz9XU4zR</latexit> <latexit sha1_base64="WIlbTbBFLLcqOvt81zBc03GagJU=">AAAB53icbVBNS8NAEJ3Ur1q/qh69LBbBU0lFUG9FLx5bMLbQhrLZTtq1m03Y3Qg19Bd48aDi1b/kzX/jts1BWx8MPN6bYWZekAiujet+O4WV1bX1jeJmaWt7Z3evvH9wr+NUMfRYLGLVDqhGwSV6 
hhuB7UQhjQKBrWB0M/Vbj6g0j+WdGSfoR3QgecgZNVZqPvXKFbfqzkCWSS0nFcjR6JW/uv2YpRFKwwTVulNzE+NnVBnOBE5K3VRjQtmIDrBjqaQRaj+bHTohJ1bpkzBWtqQhM/X3REYjrcdRYDsjaoZ60ZuK/3md1ISXfsZlkhqUbL4oTAUxMZl+TfpcITNibAllittbCRtSRZmx2ZRsCLXFl5eJd1a9qrrN80r9Ok+jCEdwDKdQgwuowy00wAMGCM/wCm/Og/PivDsf89aCk88cwh84nz9XU4zR</latexit> Observation x Switches s Figure 4.10: Generation model of MR-DMM (left) and that of HDMM (right) simplified from the proposed MR-HDMM. For forecasting task, we compare MR-HDMM with the following baselines: • Single-rate models: Kalman filters (KF), vector auto-regression (VAR), long- short term memory (LSTM) [96], phased LSTM (PLSTM) [163], deep Markov 78 models (DMM) [122], and hierarchical multiscale recurrent neural networks (HM-RNN) [46]. • Multi-rate models: Multiple Kalman filters (MKF) [60], multi-rate Kalman filters (MR-KF) [199], multi-rate deep Markov models (MR-DMM), and hier- archical deep Markov models (HDMM). For interpolation task, we compare MR-HDMM with the following baselines: • Imputation methods: Mean imputation (Simple-Mean), cubic spline (Cubic- Spline) [51], multiple imputations by chained equations (MICE) [240], Miss- Forest [217], and SoftImpute [148]. • Deep learning models: Deep Markov models (DMM), multi-rate deep Markov models (MR-DMM), and hierarchical deep Markov models (HDMM). 4.2.3.3 Evaluation and Implementation Details The single-rate baseline models cannot handle multi-rate data directly, and we up-sample all the lower rate data into higher rate data using linear interpolation. We use the stats-toolbox [204] in python for the VAR model implementation. We use pykalman [62] to implement all the KF-based models. Implementation comparison of KF-based models • Kalman filters (KF): We first up-sample all the MSR and LSR features to make their sampling rate the same as the HSR features by forward imputation to the single-rate multivariate time series. 
Then we train the KF on the single-rate MTS data using the expectation–maximization algorithm to obtain the forecasting results.
• Multiple Kalman filters (MKF): We train three different KF models on the HSR/MSR/LSR time series separately, and then concatenate the outputs (i.e., the forecasting results) of these three KF models to obtain the final results.
• Multi-rate Kalman filters (MR-KF): Three different KF models are trained on the HSR/MSR/LSR time series separately to obtain their state estimations. A neural network (e.g., an MLP) is then employed to fuse the estimated state vectors from each Kalman filter and produce the final prediction results. As with the other deep learning methods, MR-KF is trained on the training set, the best weights are chosen based on performance on the validation set, and the results are reported on the held-out test set.

Implementation details of deep learning models

For the generation model in MR-HDMM, we use multivariate Gaussian distributions with diagonal covariance for both the emission distribution and the transition distribution. We parameterize the emission mapping g_{θ_x} by a 3-layer MLP with ReLU activations, the transition mapping g_{θ_z} by a GRU, and the mapping g_{θ_s} by a 3-layer MLP with ReLU activations on the hidden layers and linear activations on the output layer. For the inference networks, we adopt the filtering setting for forecasting and the bidirectional setting for interpolation from Table 4.4, with 3-layer GRUs. To update θ_s, we replace the sign function with a sharp sigmoid function during training, and use the indicator function during validation. For the LSTM and PLSTM models, we use one layer with 100 neurons to model the time series, and then apply a softmax regressor on top of the last hidden state to perform regression.
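The sign-to-sigmoid relaxation used when training the switch parameters θ_s can be sketched as follows (a minimal NumPy illustration; the `slope` value and function names are illustrative assumptions, not the dissertation's exact implementation):

```python
import numpy as np

def sharp_sigmoid(u, slope=20.0):
    # Differentiable surrogate for the binary switch used during training;
    # as the slope grows, it approaches the hard indicator function.
    return 1.0 / (1.0 + np.exp(-slope * u))

def hard_switch(u):
    # Indicator function used during validation/testing.
    return (u > 0).astype(np.float64)

u = np.array([-0.5, -0.01, 0.01, 0.5])
soft_states = sharp_sigmoid(u)   # near 0/1 away from the boundary
hard_states = hard_switch(u)     # exact binary switch states
```

With a large slope, the soft switch is close to 0 or 1 away from the decision boundary, so gradients can still flow through it during training, while validation uses the exact binary states.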
We show the evaluation results of our MR-HDMM on the following tasks: (a) Forecasting: we generate the next latent state using the learned transition distribution and then generate observations from these new latent states; (b) Interpolation: we use the mode of the approximate posterior in the generation model to generate the unseen data in the low-rate time series; (c) Inference: we take the multi-rate time series as input to obtain the approximate posterior of the latent states. To ensure a fair comparison, we use roughly the same number of parameters for all models. We use 5-fold cross validation (train on 3 folds, validate on another fold, and test on the remaining fold) and report the average mean squared error (MSE) over 5 runs for both the forecasting and interpolation tasks. Note that we train all the deep learning models with the Adam optimization method [115], use the validation set to find the best weights, and report the results on the held-out test set. All input variables are normalized to zero mean and unit standard deviation.

4.2.3.4 Quantitative Results

Forecasting
Table 4.6 shows the forecasting results on the MIMIC-III dataset in terms of MSE. Our proposed MR-HDMM outperforms all the competing multi-rate latent space models by at least 5% and beats the single-rate models by at least 15% on all features. Specifically, our model performs best on the HSR and MSR forecasting tasks and performs well on the LSR forecasting task. LSR variables receive lower weight in the objective function of MR-HDMM because they occur less frequently than the MSR and HSR variables, which may explain why MR-HDMM does not beat some baselines on LSR forecasting that are either trained within each single rate (VAR, HM-RNN) or first trained within each single rate and then trained jointly (MR-KF).
However, HSR and MSR variables are sampled more frequently and are usually more important in health care applications, and the improvement in forecasting such variables by MR-HDMM can be quite useful in practice. Among the three KF-based methods, only MR-KF, which trains a Kalman filter for each sampling rate separately with a prediction layer, provides reasonable results, while KF and MKF cannot converge properly. These two models are probably misled by some outlier points with extreme values in the dataset.

Table 4.6: Forecasting results (MSE) on the MIMIC-III dataset with MRMTS data.

          All         HSR         MSR        LSR
KF        1.91×10^18  3.34×10^18  8.38×10^9  1.22×10^6
VAR       1.233       1.735       0.779      0.802
DMM       1.530       1.875       1.064      1.070
HM-RNN    1.388       1.846       0.904      0.713
LSTM      1.512       1.876       1.006      1.036
PLSTM     1.244       1.392       1.030      1.056
MKF       2.05×10^18  3.58×10^18  3.63×10^4  9.54×10^2
MR-KF     1.691       2.289       0.944      0.860
MR-DMM    1.061       1.192       0.723      1.065
HDMM      1.047       1.168       0.702      1.076
MR-HDMM   0.996       1.148       0.678      0.911

Interpolation
Table 4.7 shows the interpolation results on the MIMIC-III dataset. Since VAR and LSTM cannot be directly used for the interpolation task, we focus on evaluating the generative models and imputation methods. From the table, we observe that MR-HDMM outperforms the baselines and the competing multi-rate latent space models by a large margin on all the interpolation tasks in both the in-sample and out-sample settings. CubicSpline and MICE perform considerably worse than the other methods because they are misled by outliers, similar to the KF-based methods in the forecasting tasks.

Table 4.7: Interpolation results (MSE) on the MIMIC-III dataset with MRMTS data.

             In-sample  Out-sample
Simple-Mean  3.812      3.123
CubicSpline  3.713      3.212×10^4
MICE         3.747      7.580×10^2
MissForest   3.863      3.027
SoftImpute   3.715      3.086
DMM          3.714      3.027
MR-DMM       3.710      3.021
HDMM         3.790      3.100
MR-HDMM      3.582      2.921

Inference
We also compare the lower bound of the log-likelihood of all the generative models in Table 4.8.
A higher lower bound indicates a better-fitted model given the training data. Our MR-HDMM model achieves the best performance compared with the other deep generative models, including DMM and the two simplifications of the proposed model.

Table 4.8: Lower bound of log-likelihood of generative models on the MIMIC-III dataset with MRMTS.

DMM    MR-DMM  HDMM   MR-HDMM
−1.54  2.62    10.54  15.27

4.2.3.5 Discussions on Learnt Switches

In all our experiments, MR-HDMM outperforms the other generative models by a significant margin. Considering that all the deep generative models have the same number of parameters, this improvement empirically demonstrates the effectiveness of our proposed learnable latent hierarchical structure and auxiliary connections. In Figure 4.11, we visualize the latent hierarchical structure of MR-HDMM learned from the first 48 hours of an admission in the MIMIC-III dataset. A colored block (blue or pink) indicates that the latent state z_t^l is updated from z_{t-1}^l and z_t^{l-1} (update), while a white block indicates that z_t^l is generated from the same distribution as z_{t-1}^l (reuse). As expected, the higher latent layers tend to update less frequently and capture the long-term and more important temporal dependencies. We also notice that the higher latent layers tend to update more often within the first few hours after admission, which is reasonable since the patient may be in danger right after admission and tends to stabilize once proper actions are taken. These observations on the learnt switches indicate that MR-HDMM better captures the latent structure of the data and can provide insightful information in practice.

[Figure 4.11: Interpretable latent structure learned by the MR-HDMM model in the first 48 hours of an admission from the MIMIC-III dataset, shown by the switch states of latent layers 2 and 3 over the 0–48 hour axis.]

Chapter 5

Handling Data Scarcity

In this chapter, we present two works on handling data scarcity.
First, we provide a semi-supervised learning framework with generative adversarial networks to boost prediction performance with limited labeled data in Section 5.1. Second, we propose two training strategies to improve and speed up deep learning model training by incorporating prior knowledge and incremental training in Section 5.2.

5.1 Boosting Performance with a Small Set of Labeled Data

Properly designed deep neural networks have the prospect of handling these issues if equipped with massive data, but the amount of clinical data, especially accurately labeled data for rare diseases and conditions, is limited and far from most models' requirements [118]. This stems from several causes: the number of cases for rare diseases and conditions is inherently small; the diagnosis and patient labeling process relies heavily on experienced human experts and is usually very time-consuming; obtaining detailed results of lab tests and other medical features, though more feasible than ever with modern facilities, remains quite costly; and privacy issues and regulations make it even harder to collect and retain medical data with the desired level of detail. These unique challenges in the health care domain prevent existing deep learning models from exerting their full strength, which would require enough available, high-quality labeled data.

One way to overcome the challenges arising from limited data in the machine learning field is semi-supervised learning (SSL) [224]. Semi-supervised learning is a class of techniques that makes use of unlabeled or augmented data together with a relatively small set of labeled data to achieve better performance. Though some previous work has applied semi-supervised learning methods to EHR data [253], most of it focuses on clinical text data [77, 241], and only limited work attempts semi-supervised learning on structured quantitative EHR data [21]. Generative models are also considered a promising solution.
As a class of probabilistic models widely used in semi-supervised learning [260], generative models aim to learn the joint probability distribution over observations and labels from the training data, and can be further used for downstream algorithms and applications such as data modeling [222], classifier and predictor training [165], and data augmentation [259]. Though generative model approaches have been explored for years, deep generative models had not attracted much attention, due to their complexity and computational cost, until the recent development of the generative adversarial network (GAN) [83]. A GAN simultaneously trains a deep generative model and a deep discriminative model as a mini-max game, where the former captures the data distribution and the latter distinguishes generated data from original data. GAN models have mainly been used on image, video, and text data to learn useful features with better understandings [147, 185] or to sample data for specific demands [102, 235]. In the health care domain, large amounts of reliable data, either from real datasets or from data augmentation, are in great demand for training powerful predictive models. Applying GAN models to generate sequential or time series EHR data may bring significant benefits, yet only a few such works have been proposed.

In this chapter, we investigate and propose general deep learning solutions to the challenges of high-dimensional temporal EHR data with limited labels. We propose a generative model, ehrGAN, for EHR data via adversarial training, which is able to generate convincing and useful samples similar to realistic patient records in existing data. We further propose a semi-supervised learning framework which achieves boosted risk prediction performance by utilizing the augmented data and representations from the proposed generative model. We conduct experiments on two real clinical tasks and demonstrate the efficacy of both the generative model and the prediction framework.
5.1.1 Related Work

In this section we briefly review existing works closely related to the proposed method of this chapter, including related semi-supervised and unsupervised learning methods on electronic health records, recent research on adversarial training and generative adversarial networks, and combinations of the two.

Supervised information is limited for many EHR applications because labeling or scoring patient records for modeling requires expensive human effort. Thus, unsupervised and semi-supervised learning schemes are in great demand. Some works [42, 43] learn medical feature and concept embedding representations with unsupervised methods, and the learned representations incorporate both co-occurrence information and visit sequence information from the EHR data. To the best of our knowledge, no existing work attempts to build a semi-supervised deep learning model for applications with time series EHR data, and our work is the first of its kind. In the health informatics domain, there are some other works targeting semi-supervised problems in both deep and non-deep settings, but most of them focus on clinical natural language processing problems, including learning from structured quantitative EHR data [21], building graph-based models for clinical text classification [77], and handling question-answering tasks in health care forum data [241]. They are not directly related to our work since we consider different data types or tasks.

The idea of adversarial learning is to include a set of machines which learn together by pursuing competing goals.
In a generative adversarial network (GAN) [83], a generator function (usually formed as a deep neural network) learns to synthesize samples that best resemble a dataset, while a discriminator function (also usually a deep neural network) learns to distinguish between samples drawn from the dataset and samples synthesized by the generator. Many deep learning works on GAN models have appeared recently, and some of them have emerged as promising frameworks for unsupervised learning. For instance, the generators are able to produce images of unprecedented visual quality [131], while the discriminators learn features with rich semantics that can benefit other learning paradigms such as semi-supervised learning and transfer learning [139, 200].

Semi-supervised learning frameworks with GANs are used to solve classification tasks and learn a generative model simultaneously. The representations learned by the discriminator, the classifications from the semi-supervised classifier, and the sampled data from the generator improve each other. Several works build theoretical semi-supervised learning frameworks with GANs and apply them to classification tasks. Generally speaking, existing methods include feature augmentation [211], virtual adversarial training [156], and joint training [170, 200]. Our proposed semi-supervised paradigm belongs to the family of data augmentation methods. It is worth noting that most of these related models are designed for and applied only to computer vision or natural language processing domains. Extending existing semi-supervised learning and generative adversarial network frameworks to EHR data is not straightforward; moreover, coupling GAN models with semi-supervised learning for onset prediction is also difficult. These two unsolved challenges are addressed by our proposed framework in the remainder of this chapter.
5.1.2 Methodology

In this section, we first introduce the basic deep learning risk prediction model used both as a strong baseline and as a component of the proposed framework. Then we describe ehrGAN, a modified generative adversarial network specifically designed for EHR data. Finally, we present the data-augmented semi-supervised learning scheme with ehrGAN, which is able to perform boosted onset prediction.

5.1.2.1 Basic Deep Prediction Model

The basic model used in this work is a CNN model with 1-dimensional convolutional layers over the temporal dimension and a max-over-time pooling layer, which was commonly used in previous work [38, 189]. The input to the model is the EHR record of patient p, represented as a temporal embedding matrix x_p ∈ R^{T_p×M}, where T_p, which can differ among patients, is the number of medical events in patient p's record, and M is the dimension of the learned embedding. The rows of x_p are the embedding vectors for the medical events, arranged in the order of time of occurrence. The embedding for medical events is trained by the Word2vec model [151] on the same corpus of EHR data. We apply the convolutional operation only over the temporal dimension, not over the embedding dimension. We use a combination of filters with different lengths to capture temporal dependencies at multiple levels, and our preliminary experiments validated the performance improvement from this strategy. After the convolutional step, we apply a max-pooling operation along the temporal dimension to keep the most important features across time. This temporal pooling converts inputs with different temporal lengths into a fixed-length output vector. Finally, a fully connected softmax layer is used to produce prediction probabilities.
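To make the convolution-plus-pooling idea above concrete, here is a minimal numpy sketch (not the thesis implementation; `conv1d_maxpool` and all dimensions are illustrative): filters slide over the time axis only, and max-over-time pooling maps records of any length T to a fixed-length vector.

```python
import numpy as np

def conv1d_maxpool(x, filters):
    """Max-over-time features from 1-D temporal convolutions.

    x:       (T, M) embedding matrix of one record (T events, M-dim
             event embeddings); T may differ between patients.
    filters: list of (h, M) kernels that slide over the time axis only
             and span the full embedding dimension.
    Returns one scalar per filter, so any T maps to a fixed-length vector.
    """
    feats = []
    for w in filters:
        h = w.shape[0]
        # valid convolution along the temporal dimension
        acts = [float(np.sum(x[t:t + h] * w)) for t in range(x.shape[0] - h + 1)]
        feats.append(max(acts))  # max-over-time pooling
    return np.array(feats)

rng = np.random.default_rng(0)
kernels = [rng.normal(size=(h, 8)) for h in (3, 4, 5)]  # filter lengths 3, 4, 5
short = conv1d_maxpool(rng.normal(size=(50, 8)), kernels)
long = conv1d_maxpool(rng.normal(size=(250, 8)), kernels)
print(short.shape, long.shape)  # both (3,): fixed length despite different T
```

The mixed filter lengths (3, 4, 5) mirror the multi-level temporal dependency strategy described in the text.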
This CNN-based deep prediction model described above is shown to be the most competitive baseline among all other tested baselines in our experiments and serves as the basic prediction component in our proposed work.

5.1.2.2 ehrGAN: Modified GAN Model for EHR Data

The original GAN [83] is trained by solving the following mini-max game:

  min_G max_D  E_{x ~ p_data(x)} [log D(x)] + E_{z ~ p_z(z)} [log(1 − D(G(z)))]

where p_data(x) is the true data distribution; D(x) is the discriminator that takes a sample as input and outputs a scalar in [0, 1] as the probability of the sample being drawn from the real dataset; and G(z) is the generator that maps a noise variable z ∈ R^d, drawn from a given distribution p_z(z), back to the input space. The training procedure consists of two loops optimizing the generator G and the discriminator D iteratively. After the mini-max game reaches its Nash equilibrium [161], G defines an implicit distribution p_g(x) that recovers the data distribution, i.e., p_g(x) = p_data(x). Generally, both D and G are parameterized as deep neural networks. In the context of EHR data, similar to the basic prediction model, our choices of G and D fall into the families of 1-dimensional convolutional and 1-dimensional deconvolutional neural networks.

We now describe the proposed ehrGAN model, which shares a similar structure with the original GAN but has the ability to generate plausible labeled data. The overview of the proposed model is shown on the left of Figure 5.1. M and M̃ represent the models used in the discriminator (D) and generator (G), respectively. s and s̃ represent real and synthetic samples, respectively. z and m are drawn randomly. x̄ and x̃ are the reconstructed sample and the generated synthetic sample based on the real sample x, respectively.

Figure 5.1: Illustration of the structure of the proposed ehrGAN model (left) and its generator (right).
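As a toy illustration of the mini-max objective above (not the thesis implementation: `D` and `G` below are hypothetical closed-form stand-ins for the deep networks), the value V(D, G) can be estimated by Monte Carlo over real samples and noise:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins: D maps a sample to a probability in (0, 1),
# G maps noise to the sample space; in the text both are deep networks.
def D(x):
    return 1.0 / (1.0 + np.exp(-x.sum(axis=-1)))

def G(z):
    return 0.5 * z + 1.0

x_real = rng.normal(loc=1.0, size=(1024, 4))   # samples from p_data
z = rng.normal(size=(1024, 4))                 # noise from p_z

# Monte-Carlo estimate of the mini-max value
# V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))]
value = np.mean(np.log(D(x_real))) + np.mean(np.log(1.0 - D(G(z))))
print(np.isfinite(value))  # a finite scalar; D ascends V while G descends it
```

Since both log terms are logs of probabilities, the estimated value is always negative; training alternates gradient steps on D (ascent) and G (descent).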
In the following parts, we discuss the model details in terms of the design of the discriminator and the generator, and some specific training techniques.

Discriminator We adopt the structure of the basic prediction model for the discriminator, due to its simplicity and excellent classification performance. We replace the top prediction layer with a single sigmoid unit to output the probability of the input data being drawn from the real dataset.

Generator The goal of the generator in a GAN is to translate a latent vector z into the synthetic sample x̃. The generator is encoded by a de-convolutional neural network with two consecutive fully connected layers, the latter of which is reshaped and followed by two de-convolutional layers performing upsampling convolution. Empirically, this generator is able to generate good samples. However, this version of the generator cannot be directly used in the semi-supervised learning setting, as the model is trained only to differentiate real from synthetic data instead of the classes. To solve this problem, we introduce a variational version of the generator, which also provides some new understandings of GAN models.

Generator with variational contrastive divergence The design of the variational generator is based on the recently proposed variational contrastive divergence [251]. Instead of directly learning a generator distribution defined by G(z), we learn a transition distribution of the form p(x̃|x) for the generated sample x̃, with x ∼ p_data(x). The marginal distribution of the generator is then given by p_g(x̃) = E_{x ~ p_data(x)} p(x̃|x). Intuitively, the transition distribution p(x̃|x) encodes a generating process in which, based on an example drawn from the training data distribution, a neighboring sample x̃ is generated. The generator is equipped with encoder-decoder CNN networks. For each real sample x, we get the representation h from the encoder and the reconstruction x̄ by feeding h into the decoder.
h is mixed with a random noise vector z of the same dimensionality by a random binary mask vector m to obtain h̃ = m ⊙ z + (1 − m) ⊙ h, where ⊙ represents element-wise multiplication. The synthetic sample x̃ is obtained by feeding h̃ to the same decoder. An illustration of this generator with variational contrastive divergence is shown on the right of Figure 5.1. The generator attempts to minimize the objective

  E_{x ~ p_data(x)} [ ρ · E_{x̃ ~ p_g(x̃|x)} [−log D(x̃)] + (1 − ρ) · ‖x̄ − x‖_2^2 ]        (5.1)

where D is the discriminator function and the hyperparameter ρ controls how close the synthetic sample should be to the corresponding real sample. The proposed ehrGAN brings two benefits. First, while original GAN models are known to have mode-collapsing issues, i.e., G is encouraged to generate only a few modes, ehrGAN eliminates the mode-collapsing issue by design, as the diversity of the generated samples inherently approximates that of the training data. Second, and more importantly, the learned transition distribution p(x̃|x) captures rich structure of the data manifold around the training examples x, which can be quite useful when incorporated into our semi-supervised learning framework to obtain effective classification models.

Training techniques We train the proposed ehrGAN by optimizing the generator and discriminator iteratively with SGD. The training procedure (shown in Algorithm 5.1) is similar to that of a standard GAN. We adopt several techniques, similar to those in [147, 200], to stabilize the training procedure and relieve the training instability and sensitivity to hyper-parameters. First, we switch the order of discriminator and generator training and perform k = 5 optimization steps for the generator for every single step for the discriminator. Second, we add an l2-norm regularizer to the cost function of the discriminator. Finally, batch normalization and label smoothing techniques are used.
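The mask-based mixing step above, h̃ = m ⊙ z + (1 − m) ⊙ h, can be sketched in a few lines of numpy (a toy illustration with hypothetical dimensions, not the thesis code):

```python
import numpy as np

rng = np.random.default_rng(2)

def mix_latent(h, p=0.5):
    """Mix the encoder representation h with noise through a random
    binary mask: h_tilde = m * z + (1 - m) * h, element-wise."""
    z = rng.normal(size=h.shape)        # noise of the same dimensionality
    m = rng.random(h.shape) < p         # binary mask, P(m_i = 1) = p
    return np.where(m, z, h), m

h = np.ones(8)                          # toy encoder representation
h_tilde, m = mix_latent(h)
# unmasked coordinates keep h exactly; masked ones carry fresh noise
print(np.allclose(h_tilde[~m], h[~m]))  # True
```

Decoding h̃ then yields a "neighboring" sample of x, which is the intuition behind the transition distribution p(x̃|x).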
5.1.2.3 Semi-Supervised Learning with GANs

We next introduce our method of conducting semi-supervised learning with a learned ehrGAN, in a way similar to our previous model for images [251]. The basic idea is to use the learned transition distribution to perform data augmentation. Concretely, in the semi-supervised learning setting we minimize the following loss function:

  (1/N) Σ_{i=1}^{N} L(x^(i), y^(i)) + μ · (1/N) Σ_{i=1}^{N} E_{x̃^(i) ~ p(x̃|x^(i))} [L(x̃^(i), y^(i))]        (5.2)

where L refers to the binary cross-entropy loss on each data sample and label pair (x^(i), y^(i)), and μ balances the ratio between the training data and the augmented data from the GAN. In other words, this model assumes that a well-trained generator with distribution p(x̃|x) should be able to generate samples that are likely to fall within the same class as x, which in turn provides valuable information to the classifier as additional training data. We name this method SSL-GAN (semi-supervised learning with a learned ehrGAN) in this chapter.

Algorithm 5.1: The optimization procedure of ehrGAN
  Input: Dataset with N samples {x^(1), ..., x^(N)}
  Output: ehrGAN with generator G and discriminator D
  1: for enough iterations until convergence do
  2:   for k inner steps do
  3:     Sample N noise variables {z^(1), ..., z^(N)} and N binary mask vectors {m^(1), ..., m^(N)};
  4:     Update generator G by one step of gradient ascent on (1/N) Σ_{i=1}^{N} log D(G(z^(i), m^(i)))
  5:   end for
  6:   Sample N training data {x^(1), ..., x^(N)}, N noise variables {z^(1), ..., z^(N)}, and N binary mask vectors {m^(1), ..., m^(N)};
  7:   Update discriminator D with one step of gradient descent on −(1/N) Σ_{i=1}^{N} log D(x^(i)) − (1/N) Σ_{i=1}^{N} log(1 − D(G(z^(i), m^(i))))
  8: end for

5.1.3 Experiments

In this section we apply our models to two real clinical case studies with data extracted from heart failure and diabetes cohorts. It is particularly interesting to investigate how well a GAN can generate EHR samples that resemble the real ones.
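As a concrete illustration of the SSL-GAN objective in Equation (5.2), here is a minimal numpy sketch (toy probabilities and labels; `ssl_gan_loss` is an illustrative name, not the thesis implementation): augmented samples inherit the labels of the real samples they were generated from, and μ weights their contribution.

```python
import numpy as np

def bce(p, y):
    """Binary cross-entropy per (prediction, label) pair."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def ssl_gan_loss(p_real, p_aug, y, mu=0.6):
    """Eq. (5.2): supervised loss on real samples plus a mu-weighted
    loss on augmented samples that inherit the real samples' labels."""
    return float(np.mean(bce(p_real, y)) + mu * np.mean(bce(p_aug, y)))

y = np.array([1, 0, 1, 1])                  # toy labels
p_real = np.array([0.9, 0.2, 0.8, 0.7])     # classifier output on x_i
p_aug = np.array([0.8, 0.3, 0.7, 0.6])      # classifier output on x_tilde_i
print(round(ssl_gan_loss(p_real, p_aug, y), 4))  # → 0.4442
```

Setting μ = 0 recovers purely supervised training; larger μ leans more heavily on the generated samples.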
Also, understanding how the proposed method can boost the performance of onset prediction is crucial for many health care applications. We start this section by introducing the datasets and experimental settings, provide the evaluation analysis, and follow with discussions on the selection of parameters.

5.1.3.1 Problem Settings

We identify the two following cohorts from the Claim dataset, and predict whether a patient is from the case or control group as a binary classification task. The labels of both case and control groups are identified by domain experts according to ICD-9 codes.

• Congestive heart failure (Heart Failure), which contains 3,357 confirmed patients in the case group and 6,714 patients in the control group;
• Diabetes (Diabetes), with 2,248 patients in the case group and 4,496 patients in the control group.

We import ICD-9 diagnoses and medications as the input features, eliminate those that appear fewer than 5 times in the dataset, and obtain 8,627 unique medical features. We segment the time dimension into disjoint 90-day windows and combine all the observations within each window. We split the datasets into training, validation, and test sets with ratio 7:1:2, limit the length of each record sequence to between 50 and 250, and form the embedding matrix. All sequences are 0-padded to match the longest sequence. The embedding is trained by Word2vec [151] on the entire dataset with a dimension of 200. The ehrGAN is trained on only the training subset. For the CNN discriminator, we employ filters of sizes {3, 4, 5} with 100 feature maps. For the generator, the dimension of the latent variable z is 100. It is first projected and reshaped by the generator and then up-sampled by two one-dimensional CNN layers with filter sizes 100 and 3. The output of the generator is an embedding matrix of size 200 × 150. These hyperparameters are selected based on preliminary experimental results.
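The 90-day windowing step described above can be sketched with the standard library (a toy version with hypothetical dates; `window_index` is an illustrative name):

```python
from datetime import date

def window_index(event_date, first_date, days=90):
    """Assign an event to one of the disjoint 90-day windows counted
    from the patient's first observed event: window 0 covers days
    0-89, window 1 covers days 90-179, and so on."""
    return (event_date - first_date).days // days

first = date(2010, 1, 1)
events = [date(2010, 1, 15), date(2010, 4, 2), date(2010, 9, 30)]
print([window_index(d, first) for d in events])  # → [0, 1, 3]
```

Events falling in the same window would then be combined into a single observation, as described in the text.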
To generate samples with different lengths, we pad a special embedding marker at the end of each training record. The masks m in the generator with variational contrastive divergence are uniformly sampled with probability 0.5. The Adam algorithm [115] with learning rate 0.001 for both the discriminator and the generator is utilized for optimization. Gradients are clipped if the norm of the parameter vector exceeds 5 [219]. After we obtain the generated data, we map it back to an EHR record by finding, for each feature, the nearest neighbor in cosine distance. The selection of optimal values for the hyper-parameters μ and ρ will be discussed later in Section 5.1.3.5.

Table 5.1: Prediction performance comparison of the basic predictive models on the Claim dataset.

         Heart failure          Diabetes
         Accuracy   AUROC       Accuracy   AUROC
  CNN    0.8630     0.9329      0.9644     0.9789
  GRU    0.8578     0.9129      0.9304     0.9659
  LSTM   0.8511     0.9103      0.9448     0.9553
  LR     0.8494     0.9052      0.9066     0.9681
  SVM    0.8443     0.9017      0.8944     0.9462
  RF     0.8571     0.9225      0.9476     0.9658

5.1.3.2 Risk Prediction Comparison on Basic Models

First, we show the performance of our basic prediction model (CNN), which uses the CNN model with pre-trained medical feature embeddings and is a strong baseline even before boosting. We compare it with logistic regression (LR), linear support vector machine (SVM), random forest (RF), and two other deep models: recurrent neural networks using gated recurrent units (GRU [39]) and long short-term memory (LSTM [96]). We apply L2 regularizers in LR and SVM, and we use early stopping for RF with at most 50 trees. We follow similar settings from existing work [135] for GRU and LSTM. Table 5.1 shows the classification accuracy and AUROC (area under the receiver operating characteristic curve) score of all basic baseline models on the two prediction tasks. CNN is among the best methods on the Heart Failure task and significantly outperforms the baselines on the Diabetes task.
The performance improvement mainly comes from the learned embeddings in the heart failure task, and from the CNN model structure in the diabetes task. The other two deep models, GRU and LSTM, work well but cannot beat CNN.

5.1.3.3 Analysis of the Generated Data

Before testing our semi-supervised prediction models with augmented data, we need to inspect whether the generated data from ehrGAN are able to simulate the original data well enough, especially for the patient records in the case cohorts. Generated data that closely resemble the original data are an important precondition for improving, rather than hurting, model performance. First, we compare the record lengths of the original data (D_o) and the generated data (D_g) for the two case groups. The length is the number of medical events in one record. The results are shown in Figure 5.2, where the x-axis indicates record length and the y-axis represents the frequencies (probabilities). As shown in this figure, most heart failure patients have no more than 500 medical event records, while diabetes patients have fewer medical event records (no more than 200). The generated datasets have length distributions similar to the original datasets.

Figure 5.2: The length distributions of the original and generated datasets from the Claim dataset in the heart failure cohort (left) and diabetes cohort (right).

Then we check the most frequent input features from the two datasets and show the results in Figure 5.3.

Figure 5.3: The frequency of the top 100 features in the original and generated datasets from the Claim dataset in the heart failure cohort (left) and diabetes cohort (right).
Here, the x-axis shows the list of features sorted by their frequencies in the original dataset, and the y-axis shows the frequencies. As shown in this figure, the generated data keep similar frequencies for the 100 most frequent features. Interestingly, we find that in the original heart failure dataset, the top 2 to 5 features have almost the same frequencies, and the generated heart failure dataset also keeps a similar distribution pattern for these features.

Figure 5.4: The co-occurrence frequencies of the top 20 diagnosis features from the Claim dataset in the heart failure cohort (left two) and diabetes cohort (right two).

Comorbidities (co-occurrences) in patient records are quite useful in clinical prediction tasks. Therefore, the comorbidity properties should be preserved in our generated data as well. We select the 20 most frequent diagnosis features from these two case cohorts and show the comorbidity heatmaps in Figure 5.4. Similarly, we use blue to represent the original datasets and green to represent the generated datasets. A darker/lighter (blue/green) square indicates a higher/lower co-occurrence frequency (in the original/generated dataset). We sort the features by their frequencies in the original dataset and list the top 20 features on the x-axis and y-axis. We find that both the feature frequencies and the comorbidity clusters are well simulated in our generated datasets. The top 10 diagnosis features for the two cohorts are listed in Table 5.2 and Table 5.3. Most of them are common diagnoses in patient records, but with slightly different occurrence frequencies in the different cohorts. Our generative models are able to capture the occurrence patterns of the different case cohorts and keep those patterns very similar to those in the corresponding original datasets.
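The comorbidity heatmaps above are built from pairwise co-occurrence counts, which can be computed with a short sketch (toy integer code indices, not the actual ICD-9 pipeline; `cooccurrence` is an illustrative name):

```python
import numpy as np

def cooccurrence(records, n_codes):
    """Symmetric comorbidity counts: C[i, j] = number of records in
    which diagnosis codes i and j (toy integer indices) both appear."""
    C = np.zeros((n_codes, n_codes), dtype=int)
    for rec in records:
        codes = sorted(set(rec))
        for a in range(len(codes)):
            for b in range(a + 1, len(codes)):
                C[codes[a], codes[b]] += 1
                C[codes[b], codes[a]] += 1
    return C

records = [[0, 1, 2], [0, 2], [1, 2, 3]]    # three toy patient records
C = cooccurrence(records, 4)
print(C[0, 2], C[1, 2], C[0, 3])  # → 2 2 0
```

Comparing such a matrix computed on D_o against the one computed on D_g is one way to check that the generated data preserve comorbidity structure.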
These analyses not only verify the quality of our generated data, but also help us gain a better understanding of the patterns in the cohorts for different tasks.

Table 5.2: Top 10 most frequent ICD-9 diagnosis codes of the heart failure cohort group in the generated data.

  Rank in D_g   Rank in D_o   ICD-9 code   Diagnosis description
  1             2             250.0        Diabetes mellitus, unspecified
  2             1             401.1        Hypertension, benign
  3             3             427.31       Atrial fibrillation
  4             4             401.9        Hypertension, unspecified
  5             5             272.4        Other and unspecified hyperlipidemia
  6             6             496          Chronic airway obstruction
  7             14            585.6        End stage renal disease
  8             7             272.0        Pure hypercholesterolemia
  9             9             285.9        Anemia, unspecified
  10            8             244.9        Hypothyroidism, unspecified

Table 5.3: Top 10 most frequent ICD-9 diagnosis codes of the diabetes cohort group in the generated data.

  Rank in D_g   Rank in D_o   ICD-9 code   Diagnosis description
  1             1             401.1        Hypertension, benign
  2             2             401.9        Hypertension, unspecified
  3             3             272.4        Other and unspecified hyperlipidemia
  4             4             427.31       Atrial fibrillation
  5             6             244.9        Hypothyroidism, unspecified
  6             5             272.0        Pure hypercholesterolemia
  7             10            V57.1        Care involving other physical therapy
  8             7             285.9        Anemia, unspecified
  9             8             599.0        Urinary tract infection, unspecified
  10            11            496          Chronic airway obstruction

5.1.3.4 Evaluation of the Boosted Model

To evaluate the performance of the boosted model in the semi-supervised learning setting, we conduct extensive experiments on the following six approaches.

• CNN-Basic: The basic model described in Section 5.1.2.1, trained only on the training subset;
• CNN-Full: The basic model trained with the same amount of labeled data as SSL-GAN;
• CNN-Random: The basic model trained with the same amount of data as SSL-GAN, with random labels for the additional data and true labels for the training subset;
• SSL-SMIR: Squared-loss mutual information regularization [167];
• SSL-LGC: Semi-supervised learning approach with local and global consistency [255];
• SSL-GAN: The proposed method with ehrGAN-based data augmentation.
It is notable that SSL-SMIR and SSL-LGC are strong and robust semi-supervised learning baselines. CNN-Full, SSL-SMIR, and SSL-LGC are trained with additional samples from a held-off subset. The parameter settings of SSL-SMIR and SSL-LGC follow those in the original papers [167, 255], and bag-of-words features are used. We choose the values of ρ and μ in SSL-GAN with the best performance, as discussed in Section 5.1.3.5.

Table 5.4: Performance comparison of different CNN and SSL prediction models on four sub-datasets from the Claim dataset.

              HF50               Dia50              HF67               Dia67
              Accuracy  AUROC    Accuracy  AUROC    Accuracy  AUROC    Accuracy  AUROC
  CNN-Basic   0.8096    0.8784   0.8990    0.9156   0.8347    0.8953   0.9129    0.9386
  CNN-Random  0.7418    0.7856   0.7734    0.8011   0.7788    0.8117   0.7969    0.8486
  CNN-Full    0.8631    0.9212   0.9335    0.9528   0.8749    0.9329   0.9486    0.9714
  SSL-SMIR    0.8207    0.8842   0.9089    0.9197   0.8466    0.9102   0.9038    0.9277
  SSL-LGC     0.8119    0.8767   0.8844    0.9102   0.8325    0.9011   0.8815    0.9128
  SSL-GAN     0.8574    0.9075   0.9135    0.9354   0.8662    0.9246   0.9330    0.9563

We summarize the classification performance in Table 5.4 under different settings with different amounts of labeled data. For example, HF50 means 50% of the training set of Heart Failure is used, and Dia67 means 2/3 of the training set of Diabetes is used. First of all, our model consistently beats CNN-Basic and CNN-Random. On HF50, SSL-GAN achieves 0.8574 accuracy and a 0.9075 AUROC score, which are much better than the 0.8096 and 0.8784 of CNN-Basic. On Dia50 and Dia67, SSL-GAN also improves 3%–4% over the baseline in both measurements. Due to the corrupted label information, the performance of CNN-Random is even worse than that of CNN-Basic. Second, compared with the CNN-Full method, our model achieves comparable results. Measured by AUROC score, SSL-GAN is about 2% lower than CNN-Full on both the Dia50 and Dia67 sets. On HF50 and HF67, the margins are even smaller.
The two standard semi-supervised learning methods, SSL-SMIR and SSL-LGC, do not perform very well, only achieving performance similar to CNN-Basic, and our method easily outperforms them. Overall, these evaluations show the strong boosting power of the proposed SSL-GAN model.

5.1.3.5 Selections of Parameters

We use the two balanced datasets, HF50 and Dia50, to show the effects of the values of the two hyperparameters ρ and μ.

Table 5.5: AUROC scores with different values of ρ in the SSL-GAN method.

  Model                      Heart failure   Diabetes
  No augmentation (μ = 0)    0.8784          0.9156
  SSL-GAN (ρ = 0)            0.8654          0.8754
  SSL-GAN (ρ = 0.001)        0.8823          0.9188
  SSL-GAN (ρ = 0.01)         0.8911          0.9237
  SSL-GAN (ρ = 0.1)          0.9075          0.9354
  SSL-GAN (ρ = 0.2)          0.8876          0.9025
  SSL-GAN (ρ = 1)            0.7503          0.7603

The effectiveness of ρ In this part, we discuss how the selection of ρ in Equation (5.1) affects the performance. We fix the other parameters, vary ρ from 0 to 1, and report the AUROC scores for the different settings in Table 5.5. We see that on both datasets, with a properly chosen ρ, the generator is able to provide good generations that improve learning. ρ = 0.1 is the optimal selection for the model (results for ρ > 0.2 are no better than ρ = 0.2 and are thus omitted here). On the other hand, ρ = 0, which corresponds to sampling from an autoencoder, hurts performance. ρ = 1 completely disrupts training, as the generated samples are then not guaranteed to
In our task, this is controlled by the parameter μ in Equation (5.2), which leverages the ratio of labeled data and augmented data. Generally, including more augmented data will help while too many augmented data may even hurt the performance. With a fixed value of ρ, we vary μ from 0.2 to 1.4 and test the prediction performance on two datasets. The prediction AUROC scores of different methods are shown in Figure 5.5. We also include the setting with fully labeled data (referred to as FULL in the figure), and μ represents the number of real labeled data used instead of from GAN models in this setting. This can be seen as the reference of the upper bound of the prediction performance. It is obvious for the method with fully labeled data that the prediction performance continues improving with μ increased. For SSL-GAN, 105 when ρ = 0.1 (the optimal setting), we see it achieves the best performance when μ = 0.6. After that point, the performance decreases a little, which indicates that more augmented data can not help further. For the setting with ρ = 1 and ρ = 0, the prediction power continues falling as including more augmented data is harmful for both cases. Similar trend are also observed under the measure of accuracy. 5.2 Incorporating Prior-Knowledge and Incre- mental Training The success of deep learning is often associated with massive datasets of millions of examples [53, 227], but in medicine “big data” often means EHR databases [81, 106, 145] with only tens of thousands of cases, and in which only limited data are available for some rare cases. This is a far cry from millions of images in computer vision or Internet-scale corpus data in natural language processing. The model architectures and the training procedures need to be well designed to answer these questions and this can easily lead to time-consuming trial and mediocre performance. In addition, there is a wealth of existing medical knowledge that can inform ana- lytics. 
Medicine is replete with ontologies, including SNOMED-CT [49], UMLS [26], and ICD-9 [168], whose structured forms are readily exploited by computational methods. Such ontologies have proven very useful for search, data mining, and decision support [2, 58]. For machine learning, such resources represent a source of potentially useful biases that can accelerate learning. Combining structured knowledge with data-driven methods like deep learning presents a major challenge but also a significant opportunity for medical data mining.

In this chapter we explore and propose solutions to some of the data scarcity challenges that researchers face when utilizing deep learning to discover and detect significant physiologic patterns in critically ill patients, by exploiting unique properties of both our domain (e.g., ontologies) and our data (e.g., temporal order in time series). We show that our framework improves the performance of our neural networks and makes the training process more efficient. To be more specific, we first formulate a prior-based regularization framework for guiding the training of multi-label neural networks using medical ontologies and other structured knowledge. Our formulation is based on graph Laplacian priors [6, 14, 237, 254], which can represent any graph structure and incorporate arbitrary relational information, including data-driven (e.g., comorbidity patterns) and hybrid priors. This helps neural networks on diagnostic classification tasks with many labels and severe class imbalance. Second, we propose an efficient incremental training procedure for building a series of neural networks that detect physiologic patterns of increasing length using more input features. We use the parameters of an existing neural network to initialize the training of a new neural network to speed up the training procedure.
This technique exploits both the well-known low rank structure of neural network weight matrices [54] and structure in our data domain, including temporal smoothness and feature similarity. We demonstrate the empirical efficacy of our deep learning training framework using the PICU and PhysioNet datasets. We show that our prior-based regularization framework improves performance on a very challenging multi-label classification task (predicting ICD-9 diagnostic codes) and that it is beneficial to incorporate both domain knowledge and data-driven similarity. We demonstrate that our incremental training procedure leads to faster convergence during training and learns features that are useful for classification and competitive with classically trained neural networks.

5.2.1 Related Work

Classic learning theory results show that the amount of training data required increases with the complexity of the learning algorithm [18, 229]. Thus, the high complexity and flexibility of neural networks pose challenges for domains without massive datasets. This is especially true for medical applications, where many predictive tasks suffer from severe class imbalance since most conditions are inherently rare. One possible remedy is to use side information, such as a class hierarchy, as a rich prior to prevent overfitting and improve predictive performance. However, there is still limited work in the deep learning community exploring the utility of such priors for training neural networks. For example, the first work [213] to combine a deep architecture with a tree-based prior encoding relations among different labels and label categories is limited to modeling a restricted class of side information. Our work also has clear and interesting connections to ongoing research into efficient methods for training deep architectures. The renaissance of neural networks was launched by unsupervised pretraining [24, 93, 187].
The classic pretraining procedure can be viewed as a simple greedy method for building a deep architecture vertically, one layer at a time. Our incremental training method can be viewed as a greedy method for building deep architectures horizontally, by adding units to one or more existing layers. Our incremental training framework is also connected to two recent methods: an incremental approach to feature learning in an online setting [256], and an approach for predicting parameters of neural networks by exploiting the smoothness of input data and the low rank structure of weight matrices [54]. In the first work, the authors use a two-step process to train new features: first, they train only the weights of the new features using a subset of training samples; then they retrain all weights on the full data. This approach outperforms fixed neural networks in streaming settings where the data and label distributions drift. There is an obvious parallel with our work, but we focus on changing input size rather than data drift. In addition, they do not analyze the convergence properties of their training procedure. In the second work, each weight matrix is decomposed into a product of two low rank matrices. One represents the learned weights for a subset of parameters. The other is a kernel similarity matrix, either designed using domain knowledge or estimated from data (using, e.g., covariance). In this way, parameter learning becomes a kernel regression problem. We use a related idea in our parameter initialization scheme: we exploit the similarity between new inputs and old inputs to estimate initial parameter values prior to training.

5.2.2 Methodology

In this section, we describe our framework for performing effective deep learning on clinical time series data. We begin by discussing the graph Laplacian-based prior framework that we use to perform regularization when training multi-label neural networks.
This allows us to effectively train neural networks, even with smaller datasets, and to exploit structured domain knowledge, such as ontologies. We then describe our incremental neural network procedure, which we developed in order to rapidly train a collection of neural networks that detect physiologic patterns of increasing length.

5.2.2.1 Prior-Based Regularization

Deep neural networks are known to work best in big data scenarios with many training examples. When we have access to only a few examples of each class label, incorporating prior knowledge can improve learning. A tree-based prior, constructed from a hierarchy over image labels, was used in recent work [213] to improve classification performance on smaller datasets with rare labels. However, the tree-based prior can only model a very restricted class of side information. In practice, we might have other types of prior information as well, such as pairwise similarity or co-occurrence. Thus, it is useful to have a more general framework able to incorporate a wider range of prior information in a unified way. Graph Laplacian-based regularization [6, 14, 237, 254] provides one such framework and is able to incorporate any relational information that can be represented as a (weighted) graph, including the tree-based prior as a special case.

Given a matrix A ∈ R^{K×K} representing pairwise connections or similarities, the Laplacian matrix is defined as L = C − A, where C is a diagonal matrix with k-th diagonal element C_{k,k} = Σ_{k'=1}^{K} A_{k,k'}. L has the following property that makes it interesting for regularization. Given a set of K parameter vectors θ_k ∈ R^{D^(L)} stacked as the columns of Θ, we have

    tr(Θ⊤ L Θ) = (1/2) Σ_{1≤k,k'≤K} A_{k,k'} ‖θ_k − θ_{k'}‖₂²,    (5.3)

where tr(·) denotes the trace operator. According to Equation (5.3), the graph Laplacian regularizer enforces the parameters θ_k and θ_{k'} to be similar, in proportion to A_{k,k'}.
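As a concrete check of the identity in Equation (5.3), the following sketch (illustrative sizes; NumPy assumed) builds a small symmetric adjacency matrix A, forms the Laplacian L = C − A, and verifies that the trace form equals the pairwise penalty:

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 5, 3                           # K labels, D parameters per label

# A toy symmetric adjacency/similarity matrix A with zero diagonal.
A = rng.random((K, K))
A = (A + A.T) / 2
np.fill_diagonal(A, 0.0)

C = np.diag(A.sum(axis=1))            # degree matrix: C[k,k] = sum_k' A[k,k']
L = C - A                             # graph Laplacian

Theta = rng.normal(size=(D, K))       # column k is the parameter vector theta_k

lhs = np.trace(Theta @ L @ Theta.T)   # tr(Theta^T L Theta), up to transposition
rhs = 0.5 * sum(
    A[k, kp] * np.sum((Theta[:, k] - Theta[:, kp]) ** 2)
    for k in range(K)
    for kp in range(K)
)
assert np.isclose(lhs, rhs)           # the identity in Equation (5.3) holds
```

Because the penalty is a quadratic form in Θ, its gradient is simply 2 L Θ⊤ (up to the ρ/2 factor), so adding it to backpropagation is cheap.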
We use θ to represent all parameters in the model; the Laplacian regularizer can be combined with other regularizers R(θ) (e.g., the Frobenius norm ‖W‖²_F to keep hidden layer weights small). Given inputs X and corresponding multi-label outputs y, this yields the regularized loss function

    L = − Σ_{i=1}^{N} log p(y^(i) | X^(i), θ) + λ R(θ) + (ρ/2) tr(Θ⊤ L Θ)

where ρ, λ > 0 are the Laplacian and other regularization hyperparameters, respectively. Note that the graph Laplacian regularizer is a quadratic function of the parameters, so it does not add significantly to the computational cost. The network architecture, with the regularization applied to the top prediction layer, is shown on the left of Figure 5.6.

Figure 5.6: An illustration of the deep network (left) with the regularization on categorical structures (middle) applied to the output layer of the network.

The graph Laplacian regularizer can represent any pairwise relationship between parameters. Here we discuss how to use different types of priors and the corresponding Laplacian regularizers to incorporate both structured domain knowledge (e.g., label hierarchies based on medical ontologies) and empirical similarities.

Structured domain knowledge as a tree-based prior. The graph Laplacian regularizer can represent a tree-based prior based on hierarchical relationships found in medical ontologies. In our experiments, we use diagnostic codes from the ICD-9 system, which are widely used for classifying diseases and coding hospital data. The three digits (and two optional decimal digits) in each code form a natural hierarchy, including broad body system categories (e.g., respiratory), individual diseases (e.g., pneumonia), and subtypes (e.g., viral vs. pneumococcal pneumonia). The right part of Figure 5.6 illustrates two levels of the hierarchical structure of the ICD-9 codes.
When using ICD-9 codes as labels, we can treat their ontological structure as prior knowledge: if two diseases belong to the same category, we add an edge between them in the adjacency graph A.

Figure 5.7: The co-occurrence matrix of the PICU dataset.

Data-driven similarity as a prior. Laplacian regularization is not limited to prior knowledge in the form of trees or ontologies. It can also incorporate empirical priors, in the form of similarity matrices estimated from data. For example, we can use the co-occurrence matrix A ∈ R^{K×K} whose elements are defined as

    A_{k,k'} = (1/N) Σ_{i=1}^{N} I(y_k^(i) · y_{k'}^(i) = 1)

where N is the total number of training data points and I(·) is the indicator function. Since A_{k,k'} is the maximum likelihood estimate of the joint probability p(y_k^(i) = 1, y_{k'}^(i) = 1), regularization with the Laplacian constructed from the co-occurrence similarity matrix encourages the learning algorithm to find a solution for the deep network that accurately predicts the pairwise joint probabilities of the labels. The co-occurrence similarity matrix of the 19 categories in the PICU dataset is shown in Figure 5.7.

5.2.2.2 Incremental Training

Next we describe our algorithm for efficiently training a series of deep models to discover and detect physiologic patterns of varying lengths. This framework utilizes a simple and robust strategy for incrementally learning larger neural networks from smaller ones by iteratively adding new units to one or more layers. Our strategy is founded upon intelligent initialization of the larger network's parameters using those of the smaller network. Given a multivariate time series X ∈ R^{P×T}, there are two ways to use feature maps of varying or increasing sizes in deep learning structures with fixed-size inputs, such as multilayer perceptrons and autoencoders.
The first is to perform time series classification in an online setting, in which we want to regularly re-classify a time series based on all available data. For example, we might want to re-classify (or diagnose) a patient after each new observation while also including all previous data. Second, we can apply a feature map g designed for a shorter time series of length T_S to a longer time series of length T > T_S using a sliding window approach: we apply g as a filter to subsequences of size T_S with stride R_S (i.e., there will be (T − T_S + 1)/R_S filters). Proper choice of window size T_S and stride R_S is critical for producing effective features. However, there is often no way to choose the right T_S and R_S beforehand without a priori knowledge, which is often unavailable. What is more, in many applications we are interested in multiple tasks (e.g., patient diagnosis and risk quantification), for which different values of T_S and R_S may work best. Thus, generating and testing features for many values of T_S and R_S is useful and often necessary, but it can also be computationally expensive and time-consuming for deep learning models.

To address this problem, we propose an incremental training procedure that leverages a deep learning model trained on windows of size T_S to initialize and accelerate the training of a new neural network that detects patterns of length T' = T_S + ΔT_S (i.e., ΔT_S additional time steps). That is, the input size of the first layer changes from D = P·T_S to D' = D + d = P·T_S + P·ΔT_S, i.e., x ∈ R^{P·T_S + P·ΔT_S}. Suppose that the existing and new networks have D^[1] and D^[1] + d^[1] hidden units in their first hidden layers, respectively. Recall that we compute the activations in our first hidden layer according to the formula h^[1] = σ(W^[1] x + b^[1]). This makes W^[1] a D^[1] × D matrix and b^[1] a D^[1]-vector, with a row for each feature (hidden unit) in h^[1] and a column for each input in x.
From here on, we will treat the bias b^[1] as a column in W^[1] corresponding to a constant input and omit it from our notation. The larger neural network has a (D^[1] + d^[1]) × (D + d) weight matrix W'^[1]. The first D columns of W'^[1] correspond exactly to the D columns of W^[1] because they take the same D inputs. In time series data, these inputs are the observations in the same T_S × P matrix. We cannot guarantee the same identity for the first D^[1] rows of W'^[1], which produce the first D^[1] hidden units of h'^[1]; nonetheless, we can make a reasonable assumption that these hidden units are highly similar to h^[1]. Thus, we can think of constructing W'^[1] by adding d new columns and d^[1] new rows to W^[1]. As illustrated in Figure 5.8, the new weights can be divided into three categories.

Figure 5.8: Illustration of how adding various units in incremental training changes the weights W.

• ΔW_ne: D^[1] × d weights that connect new inputs to existing features.
• ΔW_en: d^[1] × D weights that connect existing inputs to new features.
• ΔW_nn: d^[1] × d weights that connect new inputs to new features.

We now describe strategies for using W^[1] to choose initial parameter values in each category.

Algorithm 5.2: Similarity-based initialization for incremental training
Input: Training data X ∈ R^{N×(D+d)}; existing weights W^[1] ∈ R^{D^[1]×D}; kernel function k(·,·)
Output: Initialized weights ΔW_ne ∈ R^{D^[1]×d}
 1: for each new input dimension i ∈ [1, d] do
 2:   for each existing input dimension i' ∈ [1, D] do
 3:     Let K[D+i, i'] := k(X[·, D+i], X[·, i'])
 4:   end for
 5:   Normalize K (if necessary)
 6:   for each existing feature j ∈ [1, D^[1]] do
 7:     Let ΔW_ne[j, i] := Σ_{k=1}^{D} K[D+i, k] · W^[1][j, k]
 8:   end for
 9: end for

Similarity-based initialization for new inputs. To initialize ΔW_ne, we leverage the fact that we can compute or estimate the similarity among inputs.
Let K be a (D+d) × (D+d) kernel similarity matrix over the inputs to the larger neural network that we want to learn. We can estimate the weight between the i-th new input (i.e., input D+i) and the j-th hidden unit as a linear combination of the parameters for the existing inputs, weighted by each existing input's similarity to the i-th new input. This is shown in Algorithm 5.2. The choice of K is a matter of preference and input type. A time series-specific similarity measure might assign a zero to each pair of inputs that represent different variables (i.e., different univariate time series). Alternatively, the similarity measure can emphasize temporal proximity using, e.g., a squared exponential kernel. A more general approach might estimate similarity empirically, using sample covariance or cosine similarity. We find that the latter works well, for both time series inputs and arbitrary hidden layers.

Algorithm 5.3: Gaussian sampling-based initialization for incremental training
Input: Existing weights W^[1] ∈ R^{D^[1]×D}
Output: Initialized weights ΔW_en ∈ R^{d^[1]×D}, ΔW_nn ∈ R^{d^[1]×d}
 1: Let w = (1 / (D·D^[1])) Σ_{i,j} W^[1][i, j]
 2: Let s = (1 / (D·D^[1] − 1)) Σ_{i,j} (W^[1][i, j] − w)²
 3: for each new feature j ∈ [1, d^[1]] do
 4:   for each existing input dimension i ∈ [1, D] do
 5:     Sample ΔW_en[j, i] ∼ N(w, s)
 6:   end for
 7:   for each new input dimension i ∈ [1, d] do
 8:     Sample ΔW_nn[j, i] ∼ N(w, s)
 9:   end for
10: end for

Figure 5.9: Weight distributions for three layers of a neural network after pretraining (left three) and finetuning (right three) the model on the PICU dataset.

Sampling-based initialization for new features. When initializing the weights in ΔW_en, we do not have a similarity structure to guide us, but the weights in W^[1]
provide similarity information. A simple but reasonable strategy is to sample random weights from the empirical distribution of the entries in W^[1]. We have several choices here. The first question is whether to assume and estimate a parametric distribution (e.g., fit a Gaussian) or use a nonparametric approach, such as a kernel density estimator or histogram. The second question is whether to use a single distribution over all weights or a separate distribution for each input. In our experiments, we found that the existing weights often had recognizable distributions (e.g., Gaussian; see Figure 5.9) and that it was simplest to estimate and sample from a parametric distribution. We also found that using a single distribution over all weights worked as well as, if not better than, a separate distribution for each input. For initializing the weights in ΔW_nn, which connect new inputs to new features, we could apply either strategy, as long as we have already initialized ΔW_en and ΔW_ne. We found that estimating all new feature weights (for existing or new inputs) from the same simple distribution (based on W^[1]) worked best. Our full Gaussian sampling initialization strategy is shown in Algorithm 5.3.

Initializing other layers. This framework generalizes beyond the input and first layers. Adding d' new hidden units to h'^[l−1] is equivalent to adding d' new inputs to h'^[l]. If we compute the activations in h'^[l−1] for a given dataset, these become the new inputs for h'^[l], and we can apply both the similarity- and sampling-based strategies to initialize the new entries in the expanded weight matrix W'^[l]. The same holds for all layers. While we can no longer design special similarity matrices to exploit known structure in the inputs, we can still estimate empirical similarity from training data activations in, e.g., h'^[l−1].
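Algorithms 5.2 and 5.3 can be sketched together as follows (a minimal NumPy illustration with made-up dimensions; `K_sim` stands in for the kernel similarity between each new input and the existing inputs, e.g., cosine similarity):

```python
import numpy as np

def init_new_inputs(W1, K_sim):
    """Algorithm 5.2 (similarity-based): weights for d new inputs as
    similarity-weighted combinations of the existing inputs' weights."""
    # Delta_W_ne[j, i] = sum_k K_sim[i, k] * W1[j, k]
    return W1 @ K_sim.T

def init_new_features(W1, d1, d, rng):
    """Algorithm 5.3 (sampling-based): draw weights for d1 new hidden
    units from a Gaussian fit to the entries of the existing W1."""
    w, s = W1.mean(), W1.var(ddof=1)
    dW_en = rng.normal(w, np.sqrt(s), size=(d1, W1.shape[1]))
    dW_nn = rng.normal(w, np.sqrt(s), size=(d1, d))
    return dW_en, dW_nn

rng = np.random.default_rng(0)
D1, D, d, d1 = 4, 6, 2, 3              # illustrative sizes
W1 = rng.normal(size=(D1, D))          # existing D1 x D weight matrix
K_sim = rng.random((d, D))             # similarity of each new input to old inputs

dW_ne = init_new_inputs(W1, K_sim)     # D1 x d block for new inputs
dW_en, dW_nn = init_new_features(W1, d1, d, rng)

# Assemble the expanded (D1 + d1) x (D + d) weight matrix of Figure 5.8.
W1_big = np.block([[W1, dW_ne], [dW_en, dW_nn]])
```

The assembled `W1_big` keeps the original W^[1] intact in its top-left block, which is what lets supervised finetuning start from the smaller network's solution.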
Intuition suggests that if our initializations from the previously pretrained values are sufficiently good, we may be able to forgo pretraining and simply perform backpropagation. Thus, we choose to initialize with the pretrained weights and then perform supervised finetuning of all weights.

5.2.3 Experiments

To evaluate our framework, we ran a series of classification and feature-learning experiments using the PICU and PhysioNet datasets. In Section 5.2.3.1, we demonstrate the benefit of using knowledge- or data-driven priors to regularize the training of multi-label neural networks. In Section 5.2.3.2, we show that incremental training both speeds up the training of larger neural networks and maintains classification performance. Finally, in Section 5.2.3.3, we perform a qualitative analysis of the learned features, showing that neural networks can learn clinically significant physiologic patterns.

We implement all neural networks in Theano [5] as variations of multilayer perceptrons with five hidden layers (of the same size) of sigmoid units. The input layer has P·T input units for P variables and T time steps, while the output layer has one sigmoid output unit per label. With the exception of our incremental training procedure, we always initialize each neural network by training it as an unsupervised stacked denoising autoencoder [231]. We find this helps significantly because our datasets are relatively small and our labels are quite sparse. We use minibatch stochastic gradient descent to minimize cross-entropy loss during unsupervised pretraining and logistic loss during supervised finetuning. For pretraining, we stop when the relative improvement in reconstruction error drops below 0.001 for a certain number of consecutive iterations, or when we hit 25 epochs. For finetuning, we apply a similar strategy using validation set error, with a threshold of 0.005 and a maximum of 2,000 epochs.
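The stopping rule just described can be sketched as a small helper (illustrative; `patience` plays the role of the unspecified number of consecutive low-improvement iterations):

```python
def should_stop(errors, threshold=0.001, patience=3, max_epochs=25):
    """Stop when the relative improvement in the monitored error stays
    below `threshold` for `patience` consecutive epochs, or at the cap."""
    if len(errors) >= max_epochs:
        return True
    if len(errors) <= patience:
        return False
    recent = errors[-(patience + 1):]
    rel_impr = [(recent[i] - recent[i + 1]) / recent[i] for i in range(patience)]
    return all(r < threshold for r in rel_impr)

# A plateaued error curve triggers stopping; steady progress does not.
assert should_stop([1.0, 0.9999, 0.9998, 0.9997]) is True
assert should_stop([1.0, 0.8, 0.6, 0.4]) is False
```

For finetuning, the same helper would monitor validation error with `threshold=0.005` and `max_epochs=2000`.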
We use 5-fold cross validation for each of our methods, so the neural network feature maps are not trained on the test folds.

5.2.3.1 Benefits of Prior-Based Regularization

Our first set of experiments demonstrates the utility of using priors to regularize the training of multi-label neural networks, especially when labels are sparse and highly correlated. From each time series, we extract all subsequences of length T = 12 in sliding window fashion, with an overlap of 50% (i.e., stride R = 0.5T), and each subsequence receives its episode's labels (e.g., diagnostic code or outcome).

Figure 5.10: Example priors for the PICU dataset (leftmost: ICD-9 tree prior; middle-left: ICD-9 shared category prior; middle-right: co-occurrence prior) and the PhysioNet dataset (rightmost: co-occurrence prior).

We use these subsequences to train a single unsupervised stacked denoising autoencoder with five layers and increasing levels of corruption (from 0.1 to 0.3), which we then use to initialize the weights of all supervised neural networks. The sparse multi-label nature of the data makes stratified k-fold cross validation difficult, so we instead generate a series of random 80/20 training/test splits of episodes and keep the first five that have at least one positive example for each label or category. At test time, we measure classification performance for both frames and episodes. We make episode-level predictions by thresholding the mean score over all subsequences from that episode. We treat the simultaneous prediction of all 86 diagnostic labels and categories in the PICU dataset as a multi-label prediction problem. This lends itself naturally to a tree-based prior because of the hierarchical structure of the labels and categories, as shown in the two leftmost plots in Figure 5.10. We also test a data-based prior based on co-occurrence, as shown in the middle-right plot of Figure 5.10.
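The subsequence extraction and episode-level prediction rule described above can be sketched as follows (illustrative NumPy code; the function names are ours, not from the original implementation):

```python
import numpy as np

def sliding_windows(X, T_S, stride):
    """Extract flattened subsequences of length T_S from a P x T series."""
    P, T = X.shape
    starts = range(0, T - T_S + 1, stride)
    return np.stack([X[:, s:s + T_S].reshape(-1) for s in starts])

def episode_prediction(subseq_scores, threshold=0.5):
    """Episode-level prediction: threshold the mean subsequence score."""
    return int(np.mean(subseq_scores) >= threshold)

X = np.arange(2 * 36, dtype=float).reshape(2, 36)   # P = 2 variables, T = 36
frames = sliding_windows(X, T_S=12, stride=6)       # 50% overlap: stride = 0.5 * T_S
# windows start at t = 0, 6, 12, 18, 24 -> 5 frames, each flattened to P * T_S = 24 inputs

scores = [0.9, 0.7, 0.4, 0.6, 0.8]                  # per-frame classifier scores
label = episode_prediction(scores)                  # mean 0.68 -> positive episode
```

Each flattened frame matches the P·T-unit input layer of the multilayer perceptrons described above, and every frame inherits its episode's labels.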
The PhysioNet dataset has no such natural label structure to leverage, so we simply test whether a data-based prior can improve performance. We create a small multi-label classification problem consisting of four binary labels with strong correlations, so that similarity-based regularization should help: in-hospital mortality (mortality), length-of-stay less than 3 days (los<3), whether the patient had a cardiac condition (cardiac), and whether the patient was recovering from surgery (surgery). The mortality rate among patients with length-of-stay less than 3 days is nearly double the overall rate. The cardiac and surgery labels are created from a single original variable indicating the type of critical care unit the patient was admitted to; nearly 60% of cardiac patients had surgery. The rightmost plot in Figure 5.10 shows the co-occurrence similarity between the labels. We impute missing time series (where a patient has no measurements of a variable) with the median value for patients in the same unit. This makes the cardiac and surgery prediction problems easier but serves to demonstrate the efficacy of our prior-based training framework. Each neural network has an input layer of 396 units and five hidden layers of 900 units each.

Figure 5.11: Classification performance comparison for the prior-based regularizer on the PhysioNet dataset.

The results for PhysioNet are shown in Figure 5.11. We observe two trends, both of which suggest that multi-label neural networks work well and that priors help. First, jointly learning features, even without regularization, can provide a significant benefit. Both multi-label neural networks dramatically improve performance on the surgery and cardiac tasks, which are strongly correlated and easy to detect because of our imputation procedure.
In addition, including the co-occurrence prior yields clear improvements on the mortality and los<3 tasks while maintaining the high performance on the other two tasks. Note that this is without tuning the regularization parameters.

Table 5.6: AUROC scores for classification on the PICU dataset.

Level        Task        No prior          Co-occurrence      ICD-9 tree
Subsequence  All         0.7079 ± 0.0089   0.7169 ± 0.0087    0.7143 ± 0.0066
Subsequence  Categories  0.6758 ± 0.0078   0.6804 ± 0.0109    0.6710 ± 0.0070
Subsequence  Labels      0.7148 ± 0.0114   0.7241 ± 0.0093    0.7237 ± 0.0081
Episode      All         0.7245 ± 0.0077   0.7348 ± 0.0064    0.7316 ± 0.0062
Episode      Categories  0.6952 ± 0.0106   0.7010 ± 0.0136    0.6902 ± 0.0118
Episode      Labels      0.7308 ± 0.0099   0.7414 ± 0.0064    0.7407 ± 0.0070

Table 5.6 shows the results on the PICU dataset. We report AUROC classification performance for both individual subsequences and episodes, computed across all outputs, as well as for labels only and categories only. The priors provide some benefit, but the improvement is not nearly as dramatic as it is for PhysioNet. Here we face a rather extreme case of class imbalance (some labels have fewer than 0.1% positive examples), multiplied across dozens of labels. In such settings, predicting all negatives yields a very low loss. We believe that even the prior-based regularization suffers from the imbalanced classes: enforcing similar parameters for equally rare labels may cause the model to make few positive predictions. However, the co-occurrence prior does provide a clear benefit, even in comparison to the ICD-9 prior. As the middle-right plot in Figure 5.10 shows, this empirical prior captures not only the category/label relationships encoded by the ICD-9 tree prior but also valuable cross-category relationships that represent commonly co-morbid conditions.

5.2.3.2 Efficacy of Incremental Training

In these experiments we show that our incremental training procedure not only produces more effective classifiers (by allowing us to combine features of different lengths) but also speeds up training.
We train a series of neural networks designed to model and detect patterns of lengths T_S = 12, 16, 20, 24. Each neural network has P·T_S inputs (for P variables) and five layers of 2·P·T_S hidden units each. We use each neural network to make an episode-level prediction as before (i.e., the mean real-valued output over all frames) and then combine those predictions to make a single episode-level prediction. We compare two training strategies:

• Full: Separately train each neural network with unsupervised pretraining followed by supervised finetuning.
• Incremental: Fully train the smallest (T_S = 12) neural network, then use its weights to initialize supervised training of the next network (T_S = 16), and repeat for subsequent networks.

We run experiments on a subset of the PICU dataset, including only the 6,200 episodes with at least 24 hours and no more than 128 hours of measurements. This dataset yields 50,000, 40,000, 30,000, and 20,000 frames of lengths 12, 16, 20, and 24, respectively. We begin by comparing the training time (in minutes) saved by incremental learning in Figure 5.12.

Figure 5.12: Training time of different neural networks under the full and incremental training strategies on the PICU dataset.

Table 5.7: AUROC scores for incremental training on the PICU dataset.

Size  Level        Full    Inc     Prior+Full  Prior+Inc
16    Subsequence  0.6928  0.6874  0.6556      0.6581
16    Episode      0.7148  0.7090  0.6668      0.6744
20    Subsequence  0.6853  0.6593  0.6674      0.6746
20    Episode      0.7022  0.6720  0.6794      0.6944
24    Subsequence  0.7002  0.6969  0.6946      0.7008
24    Episode      0.7185  0.7156  0.7136      0.7171

Incremental training provides an alternative way to initialize larger neural networks and allows us to forgo unsupervised pretraining. What is more, supervised finetuning converges just as quickly for the incrementally initialized networks as it does for the fully trained networks.
As a result, it reduces the training time for a single neural network by half. Table 5.7 shows that incremental training reaches comparable performance. Moreover, the combination of incremental training and the Laplacian prior leads to better performance than using the Laplacian prior alone.

5.2.3.3 Qualitative Analysis of Features

Figure 5.13: Example features learned from the PICU dataset for the ICD-9 circulatory disease category (ICD-9 codes 390-459, top two rows) and for conditions related to septic shock (ICD-9 codes 990-995, bottom two rows).

Figure 5.13 visualizes two of the significant physiologic patterns learned by the deep learning model with the ICD-9 tree prior. In each case, we use a feature selection procedure to identify a subset of hidden units in the topmost hidden layer that are most strongly associated with a particular label or category. We then find the 50 input subsequences with the highest activations in those units and plot the mean trajectories for 12 of the 13 physiologic variables. The top 12 plots visualize features found to be causally related to the circulatory disease category using the Pairwise LiNGAM algorithm [101] for causal inference. We see that these features detect highly elevated blood pressure and heart rate as well as depressed pH. The features also detect elevated end-tidal CO2 (ETCO2) and fraction of inspired oxygen (FIO2), which likely indicate ventilation and severe critical illness. Interestingly, these features also detect elevated urine output, and thus it is not surprising that they are also correlated with labels related to urinary disorders. The bottom 12 plots visualize the patterns detected by features that are highly correlated with septic shock. Unsurprisingly, they detect very irregular physiology, including an extremely low Glasgow Coma Scale (indicating the patient may be unconscious) as well as evidence of ventilation. These are all common symptoms of shock.
Chapter 6
Improving Model Interpretability and Usability

In this chapter, we first present the interpretable mimic learning framework in Section 6.1, which provides predictive models with both great performance and interpretability. We then demonstrate our deep learning solutions to the opioid usage study in Section 6.2.

6.1 Interpretable Mimic Learning Framework

Even though powerful, deep learning models (usually with millions of model parameters) are difficult to interpret. In today's hospitals, model interpretability is not only important but also necessary, since clinicians increasingly rely on data-driven solutions for patient monitoring and decision-making. An interpretable predictive model has been shown to result in faster adoption among clinical staff and better quality of patient care [113, 178]. Decision trees [183], due to their ease of interpretation, have been successfully employed in the health care domain [27, 68, 248], and clinicians have embraced them for predictive tasks such as disease diagnosis. However, decision trees can easily overfit and perform poorly on large heterogeneous EHR datasets. Thus, an important question naturally arises: how can we develop novel data-driven solutions that achieve state-of-the-art performance similar to that of deep learning models and at the same time can be easily interpreted by health care professionals and medical practitioners?

Recently, machine learning researchers have conducted preliminary work aiming to interpret the features learned by deep models. An early work [65] investigated visualizing the hierarchical representations learned by deep networks, while a follow-up work [250] explored feature generalizability in convolutional neural networks. More recent work [220] argued that interpreting individual units of deep models can be misleading.
This line of work has shown that interpreting deep learning features is possible, but the behavior of deep models may be more complex than previously believed, which motivates us to find alternative strategies to interpret how deep models work. Meanwhile, recent work [13] showed empirically that shallow neural networks are capable of achieving prediction performance similar to that of deep neural networks, by first training a state-of-the-art deep model and then training a shallow neural network using the predictions of the deep model as target labels. Similarly, an efficient knowledge distillation approach [95] was proposed to transfer (dark) knowledge from model ensembles into a single model, following the idea of model compression [33]. Another work [16] takes a Bayesian approach to distill knowledge from a deep neural network to a shallow neural network. Furthermore, mimic learning has also been successfully applied to multitask learning, reinforcement learning, and speech processing applications [133, 171, 198]. These works motivate us to explore the possibility of employing mimic learning to learn an interpretable model which achieves performance similar to that of a deep neural network.

In this chapter, we introduce a simple yet effective knowledge-distillation approach called interpretable mimic learning, to learn interpretable models with prediction performance as robust as that of deep learning models. Unlike standard mimic learning [13], which uses shallow neural networks or kernel methods, our interpretable mimic learning framework uses gradient boosting trees (GBT) [71] to learn interpretable models from deep learning models. GBT, as an ensemble of decision trees, provides good interpretability along with strong learning capacity. We conduct extensive experiments on several deep learning architectures, including feed-forward networks [99] and recurrent neural networks [47], for mortality and ventilator free days prediction tasks on the Vent dataset.
We demonstrate that deep learning approaches achieve state-of-the-art performance compared to several machine learning methods. Moreover, we show that our interpretable mimic learning framework can maintain the strong prediction performance of deep models while providing interpretable features and decision rules.

6.1.1 Methodology

In this section, we first briefly describe the knowledge distillation concept and the interpretable model used in this work. Then we introduce the interpretable mimic learning method, which learns interpretable models and achieves performance similar to that of deep learning models. The proposed approach is motivated by recent developments of deep learning in machine learning research and is specifically designed for the health care domain.

6.1.1.1 Knowledge Distillation

The main idea of knowledge distillation [95] is to first train a large, slow, but accurate model and then transfer its knowledge to a much smaller, faster, yet still accurate model. It is also known as mimic learning [13], which uses a complex model (i.e., a deep neural network or an ensemble of network models) as a teacher/base model to train a student/mimic model (such as a shallow neural network or a single network model). The way of distilling knowledge, a.k.a. mimicking the complex model, is to utilize the soft labels learned from the teacher/base model as the target labels while training the student/mimic model. The soft label, in contrast to the hard label from the raw data, is the real-valued output of the teacher model, whose value usually ranges in [0, 1]. It is worth noting that a shallow neural network model is usually not as accurate as a deep neural network model if trained directly on the same training data. However, with the help of the soft labels from deep models, the shallow model is capable of learning the knowledge extracted by the deep model and can achieve similar or better performance.
The reasons that the mimic learning approach works well can be explained as follows. First, potential noise and errors in the training data (input features or labels) may affect the training efficacy of simple models; the teacher model may eliminate some of these errors, thus making learning easier for the student model. Second, soft labels from the teacher model are usually more informative than the original hard labels (i.e., 0/1 in classification tasks), which further improves the student model. Moreover, the mimic approach can also be treated as an implicit form of regularization by the teacher model, which makes the student model robust and prevents it from overfitting. The parameters of the student model can be estimated by minimizing the squared loss between the soft labels from the teacher model and the predictions of the student model. That is, given a set of data {X^(i)}, i = 1, 2, ..., N, as well as the soft labels y_s^(i) from the teacher model, we estimate the student model F(X) by minimizing Σ_{i=1}^{N} ||y_s^(i) − F(X^(i))||^2.

While existing works on mimic learning focus on model compression (via shallow neural networks or kernel methods), they do not lead to more interpretable models, which is important and necessary in health care applications. To address this, we introduce a simple and effective knowledge-distillation approach called interpretable mimic learning, to learn interpretable models that mimic the performance of deep learning models. The main difference of our approach from existing mimic learning approaches is that we use gradient boosting trees (GBT) instead of another neural network as the student model, since GBT satisfies our requirements for both learning capacity and interpretability. In the following parts, we describe GBT and our proposed interpretable mimic learning in more detail.
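The squared-loss mimic objective can be sketched in a few lines of scikit-learn. This is a minimal sketch, not the thesis's actual pipeline: a small MLP stands in for the deep teacher, the data are synthetic, and all names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Teacher: stands in for the deep model (a small MLP here for brevity).
teacher = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300,
                        random_state=0).fit(X, y)
y_soft = teacher.predict_proba(X)[:, 1]        # soft labels in [0, 1]

# Student: a GBT *regressor* minimizing the squared loss
#   sum_i || y_soft_i - F(X_i) ||^2  against the teacher's soft labels.
student = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                    random_state=0).fit(X, y_soft)
y_mimic = student.predict(X)
```

Note that the student is a regressor even though the task is classification: it fits the teacher's continuous soft scores, which is exactly the squared-loss objective above.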
6.1.1.2 Gradient Boosting Trees

Gradient boosting machines [71, 72] are a family of methods which train an ensemble of weak learners to optimize a differentiable loss function in stages. The basic idea is that the prediction function F(X) can be approximated by a linear combination of several functions (under some assumptions), and these functions can be sought using gradient descent approaches. Gradient boosting trees (GBT) take a simple classification or regression tree as the weak learner, and add one weak learner to the model at every stage. At the m-th stage, assume the current model is F_m(X); then the gradient boosting method tries to find a weak model h_m(X) to fit the gradient of the loss function with respect to F(X) at F_m(X). The coefficient γ_m of the stage function is computed by a line search strategy to minimize the loss. To keep gradient boosting from overfitting, a regularization method called shrinkage is usually employed, which multiplies a small learning rate ν into the stage function at each stage. The final model with M stages can be written as:

F_M(X) = Σ_{m=1}^{M} ν γ_m h_m(X) + const

6.1.1.3 Interpretable Mimic Learning Framework

We present two general training pipelines within our interpretable mimic learning framework, which utilize the learned feature representations or the soft labels from deep learning models to help the student model. The main difference between these two pipelines is whether the soft labels are taken directly from the deep learning models or from a helper classifier trained on the features from the deep networks.

Figure 6.1: Illustration of mimic method training Pipeline 1.

In Pipeline 1 (Figure 6.1), we directly use the predicted soft labels from deep learning models. In the first step, we train a deep learning model, which can be a simple feed-forward network or a GRU, given the input X and the original target y (which is either 0 or 1 for binary classification).
Then, for each input sample X, we obtain the soft prediction score y_nn ∈ [0, 1] from the prediction layer of the neural network. Usually, the learned soft score y_nn is close to, but not exactly the same as, the original binary label y. In the second step, we train a mimic gradient boosting model, given the raw input X and the soft label y_nn as the model input and target, respectively. We train the mimic model to minimize the mean squared error of its output y_m with respect to the soft label y_nn.

Figure 6.2: Illustration of mimic method training Pipeline 2.

In Pipeline 2 (Figure 6.2), we take the learned features from deep learning models instead of the prediction scores, input them to a helper classifier, and mimic the performance based on the prediction scores from the helper classifier. For each input sample X, we obtain the activations X_nn of the highest hidden layer, which can be X^[L-1] from an L-layer feed-forward network, or the flattened output at all time steps from a GRU. These activations can be considered as the representations extracted by the neural network, and we can change their dimension by varying the size of the neural networks. We then feed X_nn into a helper classifier (e.g., logistic regression or support vector machines) to predict the original task y, and take the soft prediction score y_c from the classifier. Finally, we train a mimic gradient boosting model given X and y_c. In both pipelines, we apply the mimic model trained in the last step to predict the labels of testing examples.

Our interpretable mimic learning approach has several advantages. First, it provides models with state-of-the-art prediction performance: the teacher deep learning model outperforms the traditional methods, and the student gradient boosting tree model maintains the performance of the teacher model by mimicking its predictions.
Second, our proposed approach yields a more interpretable model than the original deep learning model, which is difficult to interpret due to its complex network structure and large number of parameters. Our student gradient boosting tree model has better interpretability, since we can study each feature's impact on prediction and can also obtain simple decision rules from the tree structures. Furthermore, our mimic learning approach uses the soft targets from the teacher deep learning model to avoid overfitting to the original data. Thus, our student model generalizes better than standard decision tree methods or other models which tend to overfit to the original data.

6.1.2 Experiments

We conduct experiments on the Vent dataset. In this section, we describe the tasks, methods, and empirical results to demonstrate how our proposed mimic learning framework performs when compared to state-of-the-art deep learning methods and other machine learning methods.

6.1.2.1 Experimental Design

We apply simple imputation to fill in missing values in the dataset, taking the majority value for binary variables and the empirical mean for other variables. Our choice of imputation may not be optimal, and finding better imputation methods is another important research direction, but it is beyond the scope of this work. For a fair comparison, we use the same imputed data for the evaluation of all methods.

We perform two binary classification (prediction) tasks on this dataset:

(1) Mortality (MOR): we aim to predict whether the patient dies within 60 days after admission and treat it as a binary classification task. 20.10% of all the patients are mortality positive (i.e., patients who died).
(2) Ventilator Free Days (VFD): we aim to evaluate a surrogate outcome of morbidity and mortality (ventilator free days, where a lower value is worse) by identifying patients who survive and are on a ventilator for longer than 14 days within 28 days after admission. As a lower number of ventilator free days is worse, we treat a value ≤ 14 as a bad outcome and a value > 14 as a good outcome, and treat this as a binary classification task. 59.05% of the patients have more than 14 ventilator free days.

6.1.2.2 Methods and Implementation Details

We categorize the methods in our experiments into the following groups:

• Baseline machine learning methods which are popular in the health care domain: linear support vector machine (SVM), logistic regression (LR), decision trees (DT), and gradient boosting trees (GBT).

• Deep network models: deep feed-forward neural network (DNN), GRU, and their combination (DNN + GRU).

• Proposed mimic learning models: for each of the deep models above, we test both mimic learning pipelines and evaluate our mimic model (GBTmimic).

We train all the baseline methods with the same input, i.e., the concatenation of the static and flattened temporal features. The DNN implementations have two hidden layers and one prediction layer, and we set the size of each hidden layer to twice the input size. For GRU, we only use the temporal features as input. The sizes of the other models are set on the same scale. We apply several strategies to avoid overfitting and to train robust deep learning models: we train for 250 epochs with an early stopping criterion based on the loss on the validation dataset; we use SGD for DNN and Adam [115] with gradient clipping for the other deep learning models; and we use weight regularizers and dropout for the deep learning models. Similarly, for the gradient boosting methods, we set the maximum number of boosting stages to 100, with early stopping based on the AUROC score on the validation dataset.
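The GBT early-stopping criterion just described (at most 100 stages, keep the stage with the best validation AUROC) can be sketched with scikit-learn's staged predictions. The data here are synthetic and the exact stopping rule is an assumption; the thesis does not spell out its implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)

# At most 100 boosting stages, as in the experimental setup.
gbt = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# "Early stopping" here = keep the stage with the best validation AUROC;
# staged_predict_proba yields the predictions after each boosting stage.
aucs = [roc_auc_score(y_va, p[:, 1]) for p in gbt.staged_predict_proba(X_va)]
best_stage = int(np.argmax(aucs)) + 1
```

Truncating the ensemble at `best_stage` trees then gives the early-stopped model.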
We implement all baseline methods using scikit-learn [177] and all deep networks using Theano [5] and Keras [44].

6.1.2.3 Overall Classification Performance

Table 6.1 shows the prediction performance (area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPRC)) of all methods. The results are averaged over 5 random trials of 5-fold cross validation. We observe that, for both tasks, all deep learning models perform better than the baseline models. The best performance among the deep learning models is achieved by the combination model, which uses DNN and GRU to handle the static and temporal input variables, respectively. Our interpretable mimic approach achieves similar, or even slightly better, performance than the deep models. We find that Pipeline 1 yields slightly better performance than Pipeline 2. For example, Pipeline 1 and Pipeline 2 obtain AUROC scores of 0.7898 and 0.7670 for the MOR task, and 0.7889 and 0.7799 for the VFD task, respectively. Therefore, we use the Pipeline 1 model in the discussions in Section 6.1.3.

Table 6.1: Interpretable mimic learning classification results (mean ± 95% confidence interval) for two tasks on the Vent dataset.
Methods            MOR AUROC        MOR AUPRC        VFD AUROC        VFD AUPRC
SVM                0.6437 ± 0.024   0.3408 ± 0.034   0.7251 ± 0.023   0.7901 ± 0.019
LR                 0.6915 ± 0.027   0.3736 ± 0.038   0.7592 ± 0.021   0.8142 ± 0.019
DT                 0.6024 ± 0.013   0.4369 ± 0.016   0.5794 ± 0.022   0.7570 ± 0.012
GBT                0.7196 ± 0.023   0.4171 ± 0.040   0.7528 ± 0.017   0.8037 ± 0.018
DNN                0.7266 ± 0.089   0.4117 ± 0.122   0.7752 ± 0.054   0.8341 ± 0.042
GRU                0.7666 ± 0.063   0.4587 ± 0.104   0.7723 ± 0.053   0.8131 ± 0.058
DNN + GRU          0.7813 ± 0.028   0.4874 ± 0.051   0.7896 ± 0.019   0.8397 ± 0.018
Best mimic model   0.7898 ± 0.030   0.4766 ± 0.050   0.7889 ± 0.018   0.8324 ± 0.016

6.1.3 Interpretations

In this section, we discuss a series of solutions for interpreting the gradient boosting trees in our mimic models, including feature importance measures, partial dependence plots, and important decision rules, which are commonly used and effective in practice. We also provide several case studies and discussions using these interpretation tools.

6.1.3.1 Feature Influence

Feature importance definition. One of the most common interpretation tools for tree-based algorithms is feature importance (influence of a variable) [71]. The influence of a variable j in a single tree T with L splits is based on the times the variable is selected to split the data samples. Formally, the influence Inf is defined as

Inf_j(T) = Σ_{l=1}^{L−1} I_l^2 · 1(S_l = j),

where I_l^2 refers to the empirical squared improvement after split l, and 1(·) is the indicator function. The importance score of GBT is defined as the average influence across all trees, normalized across all variables. Although the importance score does not describe how a feature is actually used in the model, it proves to be a useful metric for feature selection. Table 6.2 shows the most useful features for the MOR and VFD tasks, from both the GBT and the best GBTmimic models.
We find that some important features are shared by several models, e.g., MAP (mean airway pressure) at day 1, δPF (change of PaO2/FiO2 ratio) at day 1, etc. Besides, almost all the top features are temporal features. Among the static features, the PRISM (Pediatric Risk of Mortality) score, which was developed by and is commonly used among doctors and medical experts, is the most useful static variable.

Table 6.2: Top features and their corresponding importance scores from the GBT and GBTmimic models on the Vent dataset.

MOR, GBT:       PaO2-Day2 (0.0539), MAP-Day1 (0.0510), BE-Day1 (0.0349), FiO2-Day3 (0.0341), PF-Day0 (0.0324)
MOR, GBTmimic:  BE-Day0 (0.0433), δPF-Day1 (0.0431), PH-Day1 (0.0386), PF-Day0 (0.0322), MAP-Day1 (0.0309)
VFD, GBT:       MAP-Day1 (0.0423), PH-Day3 (0.0354), MAP-Day2 (0.0297), MAP-Day3 (0.0293), PRISM12 (0.0290)
VFD, GBTmimic:  MAP-Day1 (0.0384), PIM2S (0.0322), VE-Day0 (0.0309), VI-Day0 (0.0288), PaO2-Day0 (0.0275)

As our mimic method outperforms the original GBT significantly, it is worthwhile to investigate which features are considered more or less important by our method.

Figure 6.3: Individual (left y-axis) and cumulative (right y-axis) feature importance for the MOR (top) and VFD (bottom) tasks on the Vent dataset.

Figure 6.3 shows the individual (i.e., feature importance of a single feature) and cumulative (i.e., aggregated importance of features sorted by importance score) feature importance for the two tasks, with the features sorted by individual importance score along the x-axis.
From this figure, we observe that there is no dominant feature (i.e., a feature with a high importance score relative to all other features); the most dominant feature has an importance score of less than 0.05, which implies that we need many features to obtain better predictions. We also notice that the MOR task needs fewer features than the VFD task based on the cumulative feature importance scores: the number of features needed to reach a cumulative score > 0.8 is 41 for MOR and 52 for VFD.

Figure 6.4: Feature importance for static features and for temporal features on each day, for the two tasks on the Vent dataset. The aggregated scores are:

                 Static   Day 0    Day 1    Day 2    Day 3
MOR-GBT          0.136    0.172    0.285    0.185    0.221
MOR-GBTmimic     0.073    0.248    0.305    0.211    0.163
VFD-GBT          0.190    0.167    0.245    0.178    0.220
VFD-GBTmimic     0.276    0.189    0.202    0.171    0.163

We show the aggregated feature importance scores on different days in Figure 6.4. The trend of feature importance for the GBTmimic methods is Day 1 > Day 0 > Day 2 > Day 3, which means early observations are more useful for both the MOR and VFD prediction tasks. On the other hand, for the GBT methods, the trend is Day 1 > Day 3 > Day 2 > Day 0 for both tasks. Overall, Day 1 features are the most useful across all tasks and models.

6.1.3.2 Partial Dependence Plots

Visualizations provide better interpretability of our mimic models. We visualize GBTmimic by plotting the partial dependence of a specific variable or a subset of variables. The partial dependence can be treated as an approximation of the prediction function given only a set of specific variable(s). It is obtained by calculating the prediction value while marginalizing over the values of all other variables.
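The marginalization just described can be implemented directly. The sketch below computes one-way partial dependence by hand on a synthetic stand-in for the mimic model; the function name and grid choice (5th-95th percentile sweep) are illustrative assumptions, not the thesis's code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingRegressor

# Stand-in mimic model: a GBT regressor fit on (soft-label-like) targets.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
gbt = GradientBoostingRegressor(random_state=0).fit(X, y.astype(float))

def partial_dependence_1d(model, X, feat, n_grid=20):
    """Partial dependence of the prediction on one feature: sweep the feature
    over a grid and, at each grid value, average the prediction over the data
    (i.e., marginalize over the values of all other variables)."""
    grid = np.quantile(X[:, feat], np.linspace(0.05, 0.95, n_grid))
    dep = np.empty(n_grid)
    for k, v in enumerate(grid):
        Xm = X.copy()
        Xm[:, feat] = v                    # fix the feature of interest
        dep[k] = model.predict(Xm).mean()  # average over all other variables
    return grid, dep

grid, dep = partial_dependence_1d(gbt, X, feat=0)
```

Plotting `dep` against `grid` yields exactly the kind of one-way curve shown in Figure 6.5.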
Figure 6.5: One-way partial dependence plots of the top features from GBTmimic for the MOR task (BE-D0, DeltaPF-D1, PH-D1, PF-D0, MAP-D1; top) and the VFD task (MAP-D1, PIM2S, VE, VI, PaO2-D0; bottom) on the Vent dataset.

One-way partial dependence. Table 6.2 shows the list of important features selected by our model (GBTmimic) and by GBT. It is interesting to study how these features influence the model predictions. Furthermore, we can compare different mimic models by investigating the influence of the same variable in different models. Figure 6.5 shows one-way partial dependence scores from GBTmimic for the two tasks, where the variable value is shown on the x-axis and the dependence value on the y-axis. The results are easy to interpret and match existing findings. For instance, our mimic model predicts a higher chance of mortality when the patient's PH-Day0 value is below 7.325. This conforms to the existing knowledge that blood pH in healthy people stays in a very narrow range around 7.35-7.45. Blood pH can be low because of metabolic acidosis (more negative base excess values) or because of high carbon dioxide levels (ineffective ventilation). Our findings that low pH and low base excess are associated with higher mortality corroborate clinical knowledge. More useful rules can be found from our mimic models via the partial dependence plots, which provide deeper insights into the results of the deep models.
Two-way partial dependence. In practical applications, it is often helpful to understand the interactions between the most important features. One possible way is to generate two-dimensional partial dependence for important feature pairs.

Figure 6.6: Pairwise partial dependence plots of the top features from GBTmimic for the MOR (top) and VFD (bottom) tasks on the Vent dataset (shown pairs include BE-D0 vs. DeltaPF-D1, BE-D0 vs. PH-D1, and DeltaPF-D1 vs. PH-D1).

Figure 6.6 demonstrates the two-way dependence scores of the top three features used in our GBTmimic model. The x-axis and y-axis show the values of the two variables, and the dependence value is shown by the color, where red refers to a positive dependence value and blue to a negative one. From the left panel of Figure 6.6, we see that the combination of severe metabolic acidosis (low base excess) and a big reduction in the PF ratio may indicate that a patient is developing multiple organ failures, which leads to mortality (area in red). However, a big drop in the PF ratio alone, without metabolic acidosis, is not associated with mortality (light cyan). From the middle panel, we see that a low pH value from metabolic acidosis (i.e., with low base excess) may lead to mortality; however, respiratory acidosis itself may not be as harmful, since if the pH is low but not from a metabolic cause, the outcome is milder (green and yellow). The rightmost panel shows that a low pH with a falling PF ratio is a bad sign, which probably comes from a worsening disease on day 1, but a low pH without much change in oxygenation is not important in mortality prediction. These findings are clinically significant and have been corroborated by doctors.
Figure 6.7: Sample decision trees from the best GBTmimic models for the MOR (top) and VFD (bottom) tasks on the Vent dataset. The root of the MOR tree splits on LIS-D0 <= 2.8333, and the root of the VFD tree splits on OI-D1 <= 10.927.

6.1.3.3 Top Decision Rules

Another way to evaluate our mimic methods is to compare and interpret the trees obtained from our models. Figure 6.7 shows two examples of the most important trees (i.e., the trees with the highest coefficient weights in the final prediction function) built by the interpretable mimic learning methods for the MOR and VFD tasks. The % notation and the color of a leaf node represent the class distribution of the samples belonging to that node; S and V denote the number of samples reaching a node and the prediction value of that node, respectively. Some observations from these trees are as follows. Markers of lung injury, such as the lung injury score (LIS) and oxygenation index (OI), and ventilator markers, such as mean airway pressure (MAP) and PIP, are the most discriminative features for the mortality prediction task, which has been reported in previous work [114]. However, our selected trees provide more fine-grained decision rules. For example, we can study how the feature values on different admission days impact the mortality prediction outcome. Similar observations can be made for the VFD task: we notice that the most important tree includes features such as OI, LIS, and Delta-PF among its top features, which again agrees well with earlier findings [114].

6.2 Showcase on Deep Learning for Opioid Usage Study

Opioid analgesics are effective and commonly prescribed medications used for the management of both acute and chronic pain in patients with different medical conditions and following different medical procedures [59, 225]. The rate of opioid prescription in the United States is high: between 2011 and 2012, nearly 7% of the adult population was estimated to have taken an opioid within a span of thirty days [70, 176]. However, as reported in several previous studies, these medications do not effectively control pain in all patients [32, 45, 205], and many patients are at high risk of adverse effects from these medications [45, 112]. A meta-analysis of randomized trials found that 80% of patients treated with opioids for chronic, non-cancer pain experienced at least one adverse event, with symptoms ranging from mild nausea to life-threatening respiratory depression [112]. In addition, the US is experiencing an opioid epidemic. Specifically, opioids are increasingly misused and diverted from their intended recipients, and abuse and overdoses have risen alarmingly in the last ten years [34]. The rate of drug overdose deaths, driven largely by opioid overdose, increased by approximately 140% from 2000 to 2014 [197]. In 2017, one of the largest pharmaceutical distributors in the US was fined a record $150 million for failing to report suspicious orders linked to the opioid addiction epidemic [55]. Indeed, prompt and proper actions need to be taken to achieve balanced opioid usage strategies, stem the tide of this public health epidemic, and prevent further devastating consequences.
The factors that contribute to opioid use, particularly the patient factors that contribute to long-term, chronic use of these medications and/or to dependence on or abuse of these drugs, are poorly understood. Previous work found significant increases in incident opioid prescriptions for chronic, non-malignant pain between 1997 and 2005 in the Kaiser Permanente and Group Health populations [28]. Additionally, the proportion of the population receiving long-term therapy nearly doubled in the same time frame. The most common indications for long-term use in this study were chronic back pain, extremity pain, and osteoarthritis. Apart from these data, however, little is known about who receives opioid analgesic prescriptions in an average community. Additionally, with the exception of a few studies exploring the role of mental illness, depression, or previous patterns of substance abuse [31, 201, 238], the patient characteristics that might contribute to these adverse outcomes have not been described. The rapid growth in EHR adoption provides a wealth of patient information that could help identify patients at high risk of long-term opioid use or dependence. If a predictive or classifying model can leverage such data to analyze opioid usage and/or dependence, that is, if the model can identify patients likely to benefit from or become addicted to these medications and target therapy to them more appropriately, we can expect such models to extract knowledge of the clinical characteristics associated with the progression from a short-term to an episodic or long-term opioid prescribing pattern, to aid in the identification of at-risk patients, and to provide the basis for developing targeted clinical interventions.
In the era of data explosion, however, more powerful data-driven learning models are in urgent demand in order to fully utilize the large amount of EHR data, identify meaningful features for opioid dependence or abuse, provide precise information for clinicians to make early decisions, and ultimately contribute to better personalized health care quality.

In this work, we utilize state-of-the-art deep learning models on a much larger dataset for opioid usage prediction and factor investigation tasks. Deep learning models have brought significant successes to many applications and are also revolutionizing the health care domain across a variety of important and challenging tasks. It is well known that deep learning solutions equipped with ample computational resources and large-scale datasets are able to go far beyond traditional statistical methods and shed light on intriguing real-world applications in health care. However, it is rarely the case that one can simply take standard deep learning models from others and directly make them work on real-world health care applications, especially for clinicians who have little experience in training neural networks and handling the critical tricks involved. It is therefore difficult but necessary to have effective and easy-to-follow examples which deploy powerful deep neural networks and tailor them for particular tasks. In this chapter, we demonstrate our proposed deep learning solutions for identifying opioid user groups and show that they provide superior classification results and outperform other widely used baseline learning methods. We validate the important factors and risk factors identified by the deep learning models against previous clinical studies. Our work also provides a practical example of properly adopting novel deep learning methods for real-world health care problems leveraging large-scale EHR data.

6.2.1 Study Design

In this work, we select a cohort and identify patient groups from the REP dataset.
This section describes our cohort selection and group identification steps.

6.2.1.1 Cohort Selection

First, all outpatient drug prescriptions are obtained from Mayo Clinic and the Olmsted Medical Center from January 1, 2003 through March 31, 2016 for patients who authorized the use of their medical records for research purposes. The drug prescriptions are standardized using the 2016 version of RxNorm [164]. We keep the records of all patients who received at least one opioid analgesic prescription between July 1, 2013 and March 31, 2016 and did not have any opioid prescriptions in the 6 months prior to their first prescription within the study period. The analgesic prescriptions are determined by the RxNorm code, with either the national drug file reference terminology (NDF-RT) code C8834 (opioid analgesics) or the ingredient codes 10689 (tramadol) and 352362 (acetaminophen/tramadol). In order to remove incorrectly duplicated and modified prescription records, only the last prescription is kept when the same drug prescription is made for a patient multiple times within 30 minutes. A cohort of 102,166 patients is created after these data cleaning and selection steps.

Table 6.3: Data characteristics of different patient groups identified from the REP dataset.
  (columns: Short-Term | Long-Term | Opioid-Dependent | All; each entry is count (%))
  Total number of patients: 80,596 (78.89%) | 21,570 (21.11%) | 749 (-) | 102,166 (-)
  Sex
    Men: 37,981 (47.13%) | 8,447 (39.16%) | 345 (46.06%) | 46,428 (45.44%)
    Women: 42,453 (52.67%) | 13,075 (60.62%) | 402 (53.67%) | 55,528 (54.35%)
    Other/Unknown: 162 (0.20%) | 48 (0.22%) | 2 (0.27%) | 210 (0.21%)
  Age
    ≤ 18: 5,900 (7.32%) | 447 (2.07%) | 0 (0.00%) | 6,347 (6.21%)
    19-29: 13,701 (17.00%) | 1,311 (6.08%) | 55 (7.34%) | 15,012 (14.69%)
    30-49: 27,696 (34.36%) | 5,416 (25.11%) | 354 (47.26%) | 33,112 (32.41%)
    50-64: 18,027 (22.37%) | 5,570 (25.82%) | 245 (32.71%) | 23,597 (23.10%)
    ≥ 65: 15,272 (18.95%) | 8,826 (40.92%) | 95 (12.68%) | 24,098 (23.59%)
  Race
    White: 66,184 (82.12%) | 19,297 (89.46%) | 655 (87.45%) | 85,481 (83.67%)
    Hispanic: 4,151 (5.15%) | 697 (3.23%) | 34 (4.54%) | 4,848 (4.75%)
    African American: 4,131 (5.13%) | 898 (4.16%) | 49 (6.54%) | 5,029 (4.92%)
    Asian: 3,225 (4.00%) | 361 (1.67%) | 3 (0.40%) | 3,586 (3.51%)
    Other/Unknown: 2,905 (3.60%) | 317 (1.47%) | 8 (1.07%) | 3,222 (3.15%)
  Mortality
    Dead: 4,481 (5.56%) | 17,075 (79.16%) | 628 (83.85%) | 21,556 (21.10%)
    Alive/Unknown: 76,115 (94.44%) | 4,495 (20.84%) | 121 (16.15%) | 80,610 (78.90%)
  Tobacco use
    Never/Unknown: 46,264 (57.40%) | 7,159 (33.19%) | 76 (10.15%) | 53,423 (52.29%)
    Secondhand only: 746 (0.93%) | 750 (3.48%) | 25 (3.34%) | 1,496 (1.46%)
    Past/Current: 33,586 (41.67%) | 13,661 (63.33%) | 648 (86.52%) | 47,247 (46.25%)
  First time of anxiety or depression
    Never: 58,322 (72.36%) | 11,002 (51.01%) | 115 (15.35%) | 69,324 (67.85%)
    Before FOT: 10,431 (12.94%) | 3,230 (14.97%) | 207 (27.64%) | 13,661 (13.37%)
    After FOT: 11,843 (14.69%) | 7,338 (34.02%) | 427 (57.01%) | 19,181 (18.77%)
  First time of substance abuse
    Never: 70,039 (86.90%) | 15,283 (70.85%) | 45 (6.01%) | 85,322 (83.51%)
    Before FOT: 4,315 (5.35%) | 1,730 (8.02%) | 221 (29.51%) | 6,045 (5.92%)
    After FOT: 6,242 (7.74%) | 4,557 (21.13%) | 483 (64.49%) | 10,799 (10.57%)
  First time of other psychological diagnosis
    Never: 35,716 (44.31%) | 3,482 (16.14%) | 9 (1.20%) | 39,198 (38.37%)
    Before FOT: 27,627 (34.28%) | 9,253 (42.90%) | 412 (55.01%) | 36,880 (36.10%)
    After FOT: 17,253 (21.41%) | 8,835 (40.96%) | 328 (43.79%) | 26,088 (25.53%)

6.2.1.2 Group Identification

All patients are classified into three groups, namely short-term users (ST), long-term users (LT), and
opioid-dependent users (OD). The ST and LT groups are defined following the CONSORT study [234] and are the same as in our previous work [98]. Episodes of opioid prescriptions lasting longer than 90 days and having 120 or more total days supply or 10 or more prescriptions are classified as long-term (N_LT = 21,570), while the others are classified as short-term (N_ST = 80,596). N_OD = 749 opioid-dependent patients are further identified by the diagnosis of "opioid dependence" in their problem lists. It is worth noting that the relatively low identification rate might be due to the fact that only some dependent patients receive an explicit diagnosis in the problem lists from doctors. All identified dependent users are validated by clinicians. Table 6.3 shows the detailed data characteristics of each patient group, which also match the findings of a previous related study on a smaller dataset [98]. All patients in the OD group are also included in the LT group. We use FOT to refer to the time of the first opioid prescription for each patient, as relevant medical records taken before that time are usually very important. Two classification tasks are considered in our experiments: 1) whether a patient will become a long-term opioid user or remain a short-term opioid user (Task ST-LT), and 2) whether a long-term opioid user is an opioid-dependent patient or not (Task LT-OD).

6.2.2 Methodology

In this section, we first describe our feature extraction and temporal data processing steps. Next, we discuss several ways to improve model performance and obtain important features for the two deep learning models deployed in this study, followed by descriptions of several machine learning baselines.

6.2.2.1 Feature Extraction

After we retrieve the structured EHR data from the REP dataset, we extract code records with time stamps and other information from three chart tables: diagnoses (DX), procedures (PR), and prescriptions (RX). The details are shown in Table 6.4.
Instead of taking the raw records in these tables, we map all codes to a higher-level code space for two reasons. First, the coding systems used at Mayo differ and often change over time [214]. For example, three different coding systems, ICD-9, ICD-10, and HICDA (hospital international classification of diseases adapted), are used for disease records in the DX table: HICDA codes were used only before 2011, and ICD-10 codes did not come into use until 2015. This prevented us from adopting one single raw coding system, so a consistent mapping of these conceptually overlapping codes was required. Second, since there are tens of thousands of distinct raw codes in each table, the raw data tables are quite sparse and difficult to examine at the feature level. Therefore, we map all DX and PR codes into categories from the clinical classifications software (CCS) [50] and all RX codes into NDF-RT classes. In the PR table, we also record the corresponding quantity along with each code.

Table 6.4: Record table descriptions and statistics of the selected data from the REP dataset.
  DX: diagnosis records; 56,229,157 records; raw codes: ICD-9, ICD-10, HICDA (43,438 codes); mapped codes: CCS (284 categories)
  PR: procedure, service, and surgical index records; 46,386,740 records; raw codes: ICD-9, ICD-10, CPT/HCPCS [209] (18,984 codes); mapped codes: CCS (245 categories)
  RX: prescription records; 8,102,477 records; raw codes: ingredient RxNorm code (2,460 codes); mapped codes: NDF-RT class RxNorm code (307 classes)

6.2.2.2 Temporal Data Processing

We apply 1-of-K (one-hot) encoding [159] to the extracted features and use either temporal sum-pooling or segmentation of the encoded features to obtain numerical features from sparse categorical features simply yet effectively. The 1-of-K encoding converts each record line into a single binary vector of the same length, and the temporal sum-pooling or segmentation step (i.e., sum-pooling within each temporal segment) further aggregates features along the temporal direction.
For the RX table, when applying 1-of-K encoding to prescription records, we use the length in days instead of 1 in order to utilize the important quantitative information about the effective length of each prescription. For example, a prescription record with a length of 4 days is converted into a vector like [0, ..., 4, ..., 0]. Our recurrent neural network models can handle time series data directly and capture temporal information: we compute the sum-pooling vector [29] within each year and stack these vectors into a matrix, which we refer to as a yearly temporal segmentation. Since the medical records of different patients may span different lengths of time, the resulting matrices also have different lengths T_seg, so they cannot be used directly in other models, including non-recurrent deep networks and the other machine learning baselines. For those models, we instead sum-pool over all time steps and obtain a vector of fixed length, equal to the total number of mapped features from the three tables (D = D_DX + D_PR + D_RX = 836) in our dataset. The prediction models take the resulting data as input and produce predictive results. The entire pipeline is illustrated in Figure 6.8.
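As a minimal, self-contained illustration of these two encodings (the five-code vocabulary, dates, and record tuples below are hypothetical toys, not the study's 836-dimensional feature space):

```python
from collections import defaultdict
from datetime import date

# Toy vocabulary of mapped codes (hypothetical); in the study D = 836.
VOCAB = ["CCS 102", "CCS 146", "CCS 227", "C8834", "C8762"]
INDEX = {code: i for i, code in enumerate(VOCAB)}

def encode(records):
    """1-of-K encode records and aggregate them two ways: a sum-pooled
    vector over all time steps, and a per-year matrix (yearly temporal
    segmentation). Each record is (code, date, weight), where weight is
    1 for DX/PR codes and the days supply for RX codes."""
    pooled = [0] * len(VOCAB)
    yearly = defaultdict(lambda: [0] * len(VOCAB))
    for code, when, weight in records:
        d = INDEX[code]
        pooled[d] += weight
        yearly[when.year][d] += weight
    # Stack the yearly rows in chronological order -> T_seg x D matrix.
    segments = [yearly[y] for y in sorted(yearly)]
    return pooled, segments

records = [
    ("CCS 102", date(2003, 5, 27), 1),  # diagnosis record
    ("C8834", date(2003, 5, 27), 4),    # opioid prescription, 4 days supply
    ("CCS 146", date(2006, 8, 25), 1),
]
pooled, segments = encode(records)
print(pooled)    # [1, 1, 0, 4, 0]
print(segments)  # [[1, 0, 0, 4, 0], [0, 1, 0, 0, 0]]
```

The fixed-length `pooled` vector feeds the non-recurrent models, while the variable-length `segments` matrix (here T_seg = 2, for 2003 and 2006) feeds the RNN.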
Figure 6.8: Illustration of the proposed pipelines from raw cohort data to final prediction for the opioid study, with the DNN prediction model applied to temporally sum-pooled data (left) and the RNN prediction model applied to temporally segmented data (right).

6.2.2.3 Implementation and Training Details of Deep Learning Models

We select two deep learning models for this study: a deep feed-forward neural network with multiple hidden layers (referred to as DNN), and a recurrent neural network with Long Short-Term Memory (referred to as RNN), which can better model time series data. Our deep learning models are implemented with Theano [5] and Keras [44], and all the models are reproducible.

For both the DNN and RNN models, we set the dimension of each hidden layer to 256, which is chosen to balance model size and performance. We use DNN-khl to refer to a DNN model with k hidden layers and one output layer. Several training techniques are designed and used to better handle our data. First, we apply an L1 regularizer with coefficient 0.0001 to make the model robust and able to select important features.
Our preliminary experiments show that the L1 regularizer produces more compact models with performance similar to or better than L2 or no regularization. Second, dropout [212] with rate p_dr = 0.5 is used on all layers to reduce overfitting and avoid harmful co-adaptation of weights; it is implemented by randomly dropping units with probability p_dr in the neural networks at training time and rescaling all weights by W_test = p_dr * W_train at test time. Third, we apply batch normalization [103] on all non-recurrent layers. The basic idea is to normalize the activations of the previous layer so that the outputs have mean 0 and standard deviation 1 within each mini-batch during training; at test time, running averages computed on the training set are used to normalize the outputs. This strategy speeds up training and improves overall performance. Additionally, in our experiments we find that applying batch normalization before the input layer has roughly the same effect as z-normalizing the input directly: both improve model performance, but the former incurs less data preprocessing cost and is more flexible. Fourth, we use RMSprop [226] as the gradient descent optimization algorithm to train these models; RMSprop uses an adaptive learning rate that normalizes gradient values by their magnitudes. Finally, all our deep learning models are trained efficiently within several hours on a single desktop with an i5-4590S CPU and 16 GB of memory.

6.2.2.4 Comparison to Other Machine Learning Baselines

To evaluate the proposed deep models and validate the findings, we also compare against machine learning baselines commonly used in clinical research, including logistic regression (LR), a linear support vector machine with hinge loss (SVM), and random forest (RF). All baselines are implemented with the scikit-learn [177] package.
We keep most of the default settings and hyperparameters, which are shown to be effective in practice, but make several specific changes to better fit our tasks. To distinguish important features and introduce sparsity into the model coefficients, we use an L1 penalty in LR and SVM; we tune the regularization strength C by searching from 0.0001 to 10 and finally choose C = 0.1, since it usually provides the best prediction results in our experiments. For RF, using more trees usually leads to better results, but it can also make the model computationally inefficient and overfit the training samples, and the model size grows linearly with the number of trees. Since using more trees brought negligible performance improvement but drastically increased model size in our preliminary experiments, we keep the default setting (10 trees) so that the RF model has a moderate size comparable to the others. All deep learning models are serialized and saved as HDF5 files, and the other models are saved as cPickle files. As shown in Table 6.5, all tested models have comparable sizes, so the performance comparison is fair.

Table 6.5: Model size comparison in terms of the saved binary files on disk for the opioid study.
  Model:          DNN    RNN    LR   SVM   RF
  File size (KB): 1,878  9,320  21   23    2,282

6.2.2.5 Investigating Important Features

Deep learning models are often argued to be difficult to interpret and investigate, especially because of their complex structures and their thousands or even millions of parameters. Furthermore, carelessly checking and visualizing individual units in neural networks can lead to misleading conclusions [220]. However, by examining the overall model weights and structure, it is still possible to identify important features extracted by deep learning models and obtain rough quantitative evaluations. We design the feature importance score I for this purpose.
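As a preview of the score defined next, its basic computation (collapsing the per-layer weight matrices of a ReLU network into a single linear map) can be sketched as follows; the toy two-layer weights are hypothetical, and batch-normalization scaling is omitted for brevity:

```python
def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def importance_scores(layer_weights):
    """Multiply the per-layer weight matrices W[L] ... W[1] into a single
    1 x D row; entry d is the (signed) importance of input feature d in
    the linearized network (non-linearities and biases dropped)."""
    out = layer_weights[0]          # W[1], shape D[2] x D
    for W in layer_weights[1:]:     # W[2], ..., W[L]
        out = matmul(W, out)
    return out[0]

# Toy 2-layer network: 3 input features -> 2 hidden units -> 1 output.
W1 = [[1.0, 0.0, -2.0],
      [0.0, 3.0, 1.0]]
W2 = [[1.0, 0.5]]
print(importance_scores([W1, W2]))  # [1.0, 1.5, -1.5]
```

Here feature 1 contributes positively (1.0), feature 2 more strongly (1.5), and feature 3 negatively (-1.5); the real score additionally folds in the batch-normalization scaling for each layer.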
We first take the weight matrix of the first layer, W^[1] ∈ R^(D^[2] × D), of a DNN model as an example, where each column w^[1]_d of W^[1] corresponds to the d-th input feature. A simple way to quantify feature importance is to sum each column: the first importance score of the d-th input feature is formally defined as I_1(d) = Σ_{i=1}^{D^[2]} W^[1][i, d], where W[i, d] denotes the entry in the i-th row and d-th column of W. However, this score considers only the first layer, which is insufficient for a deep model. To overcome this issue, we take the weights of higher layers into consideration. Since the ReLU function is used as the transformation in all hidden layers, we multiply the weights of all layers and take the value at the corresponding index as the importance score. We also need to account for the impact of batch normalization, since it introduces different scales on the parameters, so we apply the batch normalization operation before multiplying the weight matrix of each layer. The second importance score is thus formally defined as

    I_2 = W^[L] f^[L]_BN ... W^[2] f^[2]_BN W^[1] f^[1]_BN ∈ R^(1×D),

where f^[l]_BN denotes the batch normalization operation for layer l. This process can also be viewed as building a simplified version of the original deep neural network without non-linear transformations or bias vectors. In our experiments, I_2 is used for our DNN models. To validate this way of investigating important features and to verify the selected features, we check previous clinical studies and compare the features reported there with those from our baseline models.

6.2.3 Experiments

Table 6.6: Long-term opioid patient prediction (ST-LT) results (mean ± 95% confidence interval) on the REP dataset.
  Columns: LR | SVM | RF | DNN-1hl | DNN-2hl | DNN-3hl | RNN
  Setting A
    Acc.:  0.8946±0.002 | 0.8938±0.002 | 0.8666±0.004 | 0.8960±0.002 | 0.8954±0.001 | 0.8975±0.002 | 0.8961±0.002
    AUC:   0.9074±0.002 | 0.9038±0.002 | 0.8747±0.003 | 0.9086±0.002 | 0.9082±0.002 | 0.9091±0.002 | 0.9094±0.002
    Prec.: 0.8483±0.007 | 0.8671±0.006 | 0.8213±0.009 | 0.8539±0.013 | 0.8546±0.009 | 0.8567±0.009 | 0.8719±0.008
    Rec.:  0.6099±0.006 | 0.5868±0.007 | 0.4702±0.018 | 0.6122±0.009 | 0.6082±0.006 | 0.6178±0.006 | 0.5957±0.007
    κ:     0.6473±0.007 | 0.6383±0.007 | 0.5249±0.018 | 0.6516±0.007 | 0.6489±0.004 | 0.6571±0.006 | 0.6472±0.006
  Setting B
    Acc.:  0.8385±0.002 | 0.8372±0.002 | 0.8162±0.002 | 0.8371±0.002 | 0.8340±0.002 | 0.8352±0.002 | 0.8371±0.002
    AUC:   0.8369±0.002 | 0.8366±0.002 | 0.8044±0.002 | 0.8412±0.002 | 0.8362±0.002 | 0.8362±0.003 | 0.8466±0.002
    Prec.: 0.7161±0.010 | 0.7309±0.011 | 0.6590±0.011 | 0.7319±0.013 | 0.6999±0.010 | 0.7121±0.022 | 0.6889±0.012
    Rec.:  0.3892±0.005 | 0.3623±0.006 | 0.2683±0.005 | 0.3612±0.008 | 0.3749±0.016 | 0.3712±0.018 | 0.4207±0.020
    κ:     0.4177±0.007 | 0.4005±0.009 | 0.2952±0.006 | 0.3998±0.008 | 0.3996±0.013 | 0.4005±0.009 | 0.4297±0.010
  Setting C
    Acc.:  0.7917±0.001 | 0.7908±0.001 | 0.7890±0.001 | 0.7919±0.001 | 0.7920±0.001 | 0.7915±0.001 | 0.7989±0.001
    AUC:   0.7323±0.003 | 0.7327±0.003 | 0.6936±0.003 | 0.7220±0.004 | 0.7340±0.004 | 0.7218±0.004 | 0.7536±0.003
    Prec.: 0.5366±0.019 | 0.5303±0.021 | 0.5007±0.010 | 0.5670±0.031 | 0.5943±0.012 | 0.5774±0.027 | 0.5692±0.028
    Rec.:  0.0996±0.003 | 0.0800±0.003 | 0.1279±0.004 | 0.0646±0.005 | 0.0672±0.011 | 0.0490±0.015 | 0.1991±0.002
    κ:     0.1090±0.005 | 0.0885±0.005 | 0.1289±0.006 | 0.0756±0.004 | 0.0658±0.013 | 0.0587±0.016 | 0.2076±0.007

6.2.3.1 Classification Result Comparison

As mentioned before, we conduct two classification tasks (ST-LT and LT-OD). All 102,166 patients are included in the 5-fold cross-validation for the ST-LT task. Only 3.47% of long-term users are opioid dependent, so the labels are quite imbalanced for the LT-OD task.
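The 5-fold cross-validation split can be sketched in a few lines (a toy version with 10 patient indices; the real split is over all 102,166 patients):

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Shuffle patient indices and split them into k roughly equal folds;
    each (train, test) pair holds out one fold as the test set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [(sorted(set(idx) - set(fold)), sorted(fold)) for fold in folds]

splits = k_fold_indices(10, k=5)
for train, test in splits:
    assert len(train) == 8 and len(test) == 2
# Every patient appears in exactly one test fold.
assert sorted(i for _, test in splits for i in test) == list(range(10))
```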
To get robust predictions and features, we randomly generate 14 datasets with a positive-class ratio of 1/3 by downsampling the non-opioid-dependent patients; each generated dataset contains records from 2,237 patients.

Table 6.7: Opioid-dependent patient prediction (LT-OD) results (mean ± 95% confidence interval) on the REP dataset.
  Columns: LR | SVM | RF | DNN-1hl | DNN-2hl | DNN-3hl | RNN
  Setting A
    Acc.:  0.6929±0.010 | 0.6805±0.007 | 0.7417±0.007 | 0.7441±0.010 | 0.7550±0.008 | 0.7547±0.009 | 0.7607±0.009
    AUC:   0.7119±0.010 | 0.6985±0.010 | 0.7773±0.011 | 0.7853±0.012 | 0.7975±0.010 | 0.8044±0.011 | 0.8060±0.010
    Prec.: 0.5385±0.017 | 0.5212±0.010 | 0.7049±0.022 | 0.6214±0.021 | 0.6323±0.012 | 0.6328±0.018 | 0.6896±0.020
    Rec.:  0.5924±0.022 | 0.5748±0.019 | 0.3986±0.016 | 0.6233±0.028 | 0.6457±0.031 | 0.6471±0.024 | 0.5205±0.021
    κ:     0.3262±0.017 | 0.2966±0.016 | 0.3555±0.015 | 0.4273±0.019 | 0.4520±0.021 | 0.4571±0.017 | 0.4505±0.019
  Setting B
    Acc.:  0.6763±0.007 | 0.6669±0.009 | 0.7331±0.010 | 0.7376±0.011 | 0.7406±0.009 | 0.7427±0.006 | 0.7417±0.006
    AUC:   0.6968±0.008 | 0.6898±0.010 | 0.7659±0.013 | 0.7720±0.009 | 0.7821±0.012 | 0.7829±0.008 | 0.8010±0.007
    Prec.: 0.5156±0.013 | 0.5029±0.012 | 0.6784±0.022 | 0.6214±0.023 | 0.6146±0.017 | 0.6289±0.017 | 0.7107±0.019
    Rec.:  0.5743±0.022 | 0.5600±0.018 | 0.3867±0.024 | 0.5733±0.026 | 0.6162±0.025 | 0.5810±0.039 | 0.3976±0.026
    κ:     0.2951±0.020 | 0.2734±0.020 | 0.3301±0.028 | 0.4046±0.020 | 0.4201±0.019 | 0.4098±0.018 | 0.3787±0.021
  Setting C
    Acc.:  0.6404±0.007 | 0.6332±0.012 | 0.6994±0.007 | 0.6870±0.009 | 0.6911±0.009 | 0.7065±0.008 | 0.6956±0.008
    AUC:   0.6512±0.009 | 0.6429±0.010 | 0.6999±0.011 | 0.7130±0.014 | 0.7216±0.014 | 0.7279±0.014 | 0.7144±0.011
    Prec.: 0.4639±0.030 | 0.4554±0.017 | 0.6019±0.021 | 0.5491±0.020 | 0.5485±0.023 | 0.6193±0.024 | 0.5975±0.018
    Rec.:  0.4605±0.020 | 0.4629±0.017 | 0.3067±0.018 | 0.4590±0.075 | 0.5338±0.065 | 0.3305±0.030 | 0.2895±0.028
    κ:     0.1906±0.023 | 0.1821±0.022 | 0.2342±0.020 | 0.2702±0.029 | 0.3006±0.026 | 0.2542±0.024 | 0.2542±0.032
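The downsampling scheme just described can be sketched as follows (toy patient IDs are hypothetical; negatives are sampled to twice the number of positives, giving a positive-class ratio of 1/3 per dataset):

```python
import random

def make_downsampled_datasets(pos_ids, neg_ids, n_datasets=14, seed=0):
    """Keep every positive (opioid-dependent) patient and randomly sample
    twice as many negatives, so positives form 1/3 of each dataset."""
    rng = random.Random(seed)
    return [pos_ids + rng.sample(neg_ids, 2 * len(pos_ids))
            for _ in range(n_datasets)]

# Toy stand-ins for the OD patients and the remaining long-term users.
pos = list(range(10))
neg = list(range(100, 200))
for ds in make_downsampled_datasets(pos, neg, n_datasets=3):
    assert len(ds) == 30 and sum(i < 100 for i in ds) == 10
```

Repeating the sampling (14 times in the study) and averaging results reduces the variance introduced by any single random draw of negatives.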
We further introduce three different settings (A, B, and C) to test model performance in different simulated situations. In Setting A, we take all medical records made before the date on which the patient is marked as a long-term user or before March 31, 2016, whichever is earlier. Setting B is the same as Setting A except that we exclude all opioid and non-opioid analgesic prescriptions. In Setting C, we take only records made before the patient's first opioid prescription. Setting A is the ideal case, and the best prediction results can be achieved in this setting, since all possible information is taken into consideration. After finding that analgesic usage can be a good indicator for our prediction tasks and may mask other indicators, we designed Setting B, which may impair prediction performance but helps us find hidden yet useful features. Setting C is the most practical case of the three, and we use it to demonstrate the early-prediction capability of our methods.

For all settings and tasks, we report classification accuracy (Acc.), area under the receiver operating characteristic curve (AUC), precision (Prec.), recall (Rec.), and Cohen's kappa coefficient (κ). Again, DNN-khl refers to a DNN model with k hidden layers and one output layer, where k ∈ {1, 2, 3}. Results for the ST-LT and LT-OD tasks are shown in Table 6.6 and Table 6.7, respectively. First, the deep models provide the best performance in terms of most evaluation metrics, and the improvements on LT-OD are larger than on ST-LT. Second, the RNN models, which capture temporal information, usually but not always beat the standard DNN models; they obtain the best AUC score in 5 out of 6 settings. This implies that even loosely segmented time series contain useful temporal information. However, the superiority of the RNN is smaller on LT-OD than on ST-LT, and one possible reason is the lack of training samples on LT-OD.
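The threshold-based metrics above (all except AUC) can be computed directly from the binary confusion counts; a minimal sketch, with Cohen's kappa in its standard observed-versus-chance-agreement form:

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and Cohen's kappa from binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    n = len(y_true)
    acc = (tp + tn) / n
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    # Chance agreement for kappa from the marginal label frequencies.
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (acc - p_e) / (1 - p_e) if p_e < 1 else 0.0
    return acc, prec, rec, kappa

# A constant "never positive" classifier on an imbalanced toy sample:
y_true = [1] * 3 + [0] * 7
y_pred = [0] * 10
print(binary_metrics(y_true, y_pred))  # (0.7, 0.0, 0.0, 0.0)
```

Note how a constant classifier can look good on accuracy alone while precision, recall, and κ are all zero, which is exactly why we report the full metric set under class imbalance.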
It is worth noting that, since we are most interested in identifying long-term users and opioid-dependent patients under a large class imbalance, classification accuracy by itself may not demonstrate the usefulness of a method. For example, a model that always predicts ST in the ST-LT task or LT in the LT-OD task achieves an accuracy of 0.7889 or 0.9653, respectively. Though these numbers are high, especially for LT-OD, where the best models reach accuracies of only about 0.7 to 0.75, such a trivial model is not useful at all: it has no predictive power on the problem, and both its precision and its recall are 0.

6.2.3.2 Feature Analysis

It is useful to know which features are most related to opioid use and which play the largest roles in the prediction models. We take the DNN-3hl models in Setting A and show the ten most important features, ordered by the absolute value of the importance score I_2, in Table 6.8. Features with positive/negative scores can be interpreted as positively/negatively correlated with the prediction target (long-term use in ST-LT, and opioid dependence in LT-OD). Scores should only be compared within the same model and the same task.

Table 6.8: Most important features for long-term opioid patient (ST-LT) and opioid-dependent patient (LT-OD) prediction, identified from the DNN-3hl model for the opioid study.
  ST-LT prediction (Table | Code | Feature name | I):
    RX | C8834 | Opioid analgesics | 0.2287
    RX | C8890 | Amphetamine-like stimulants | -0.0843
    RX | C8838 | Non-opioid analgesics | 0.0802
    PR | CCS 227 | Other diagnostic procedures | 0.0272
    DX | CCS 258 | Other screening | -0.0218
    RX | C4859 | Salicylates, antirheumatic | -0.0204
    DX | CCS 203 | Osteoarthritis | 0.0185
    DX | CCS 205 | Spondylosis | 0.0179
    DX | CCS 98 | Essential hypertension | 0.0126
    RX | C2728 | Vaccines/Toxoids, other | -0.0120
  LT-OD prediction (Table | Code | Feature name | I):
    RX | C8834 | Opioid analgesics | 0.7784
    DX | CCS 661 | Substance-related disorders | 0.6186
    PR | CCS 182 | Mammography | -0.3481
    DX | CCS 663 | Substance abuse/Mental health history | 0.3248
    DX | CCS 258 | Other screening | -0.2948
    PR | CCS 228 | Prophylactic vaccinations/Inoculations | -0.2796
    DX | CCS 651 | Anxiety disorders | 0.2785
    RX | C8864 | Anticonvulsants | 0.2626
    RX | C8860 | Benzodiazepine derivatives | 0.2382
    DX | CCS 670 | Miscellaneous mental health disorders | 0.2324

For both tasks, the opioid analgesics prescription is selected as the most important indicator. Non-opioid analgesics are also an important factor for long-term opioid use but are not very useful for distinguishing opioid-dependent users from long-term users. Several disorder diagnoses, such as substance-related disorders, anxiety disorders, and other mental health disorders (e.g., interview, evaluation, and consultation), are all highly related to opioid dependence. These findings are consistent with previous studies, and most of the top features are also selected by the LR and RF baselines. In addition, the scores of the top features in the LT-OD task are closer to each other than those in ST-LT, indicating that in Setting A identifying opioid-dependent users is the more challenging task, requiring the exploitation of more diverse features. The fact that all models score higher on ST-LT than on LT-OD in Setting A (Table 6.6 and Table 6.7) supports the same claim. As we have only completed preliminary investigations, more details and validations remain as future research directions.

Chapter 7
Conclusion and Future Work

In this thesis, we have studied several effective deep learning models for temporal data in health care and demonstrated their successful real-world applications, mainly addressing the challenges of data heterogeneity, data scarcity, and model interpretability and usability. We demonstrate the efficacy of the proposed deep learning-based solutions in terms of improved prediction performance, meaningful analyses, and validation from domain experts.
7.1 Contributions and Limitations

In this thesis work, we have made the following contributions.
• For data heterogeneity, we show that missingness in time series can provide additional information, and that our proposed GRU-D model improves prediction performance by modeling a decay mechanism based on the missing patterns. We also show that hierarchical deep generative models, which capture the latent hierarchical structure of the underlying data generation mechanism, model multi-rate multivariate time series better than single-rate and simple imputation baselines.
• For data scarcity, we propose a semi-supervised learning framework with a generative adversarial network specially designed for EHR data, which boosts prediction performance even with a limited amount of labeled data. We also demonstrate how to incorporate prior knowledge and use incremental training to improve and speed up the training of deep learning models for temporal data in health care.
• For model interpretability and usability, we propose an interpretable mimic learning framework that provides powerful models with good interpretability. We also collaborate with doctors and showcase our deep learning solutions in an opioid usage analysis.

These contributions confirm our thesis hypothesis that many unique and critical challenges of temporal data in health care can be effectively solved by deep learning-based methods in practice. However, the proposed solutions have a few limitations, as listed below.
• Each of our proposed methods is designed mainly to solve a particular problem. In real-world applications, however, these challenging problems are usually entangled with each other, and the underlying mechanisms may not be as simple as the assumptions in our model designs. Substantial effort may be needed to combine these models into effective solutions in practice.
• The investigations and interpretations of deep learning models are still primitive, indirect, or sometimes insufficiently accurate. As we are in the early stage of this research direction, our studies mainly resort to clinicians for validation and explanation of our findings instead of providing brand-new findings for them, although the latter would be more helpful in practice. The lack of detailed and accurate theoretical analysis of complex deep learning models also prevents us from providing more insightful interpretations and explanations to doctors.
• Many of the extracted features and findings in this thesis are ones already known or studied in the health care domain, and many results are presented to validate existing findings from previous studies with larger datasets and novel methods. Unique findings from deep learning models, such as rare medical features and novel predictive indicators, could bring huge improvements to the health care domain.
• While the majority of this thesis work was conducted or evaluated on publicly available datasets, with implementation details and/or code scripts provided, reproducibility in the interdisciplinary domain of deep learning and health care remains a big issue. Standard benchmarking datasets and tasks are in great demand.

7.2 Future Work

In this section, we discuss future research topics along these three directions.

Utilizing data heterogeneity: A promising direction is to combine temporal data with other types of health care data, such as medical images, electrocardiogram and electroencephalogram records, and social network data, to build powerful predictive models that fully utilize information from different data sources. This kind of work, which in general attempts to process and relate information from multiple data sources or modalities, is usually referred to as multi-modal learning [17] in the machine learning area.
With its wide range of applications, from audio-visual speech recognition [86] and multimedia retrieval [10] to language and vision [246], multi-modal learning promises to fully utilize temporal, signal, textual, and visual data to enhance the quality of personalized health care services. Dealing with ill-structured medical data that come with incompleteness and fragmentation, such as medical records for the same patient scattered across multiple specialized hospitals, may suggest other solutions for effective analysis and prediction on health care data. Such models can fully utilize fragmented data by exploiting the complementary information in each data fragment [89], capturing view-specific variables from a common latent space [7], or treating the data as block-wise missing [244]. This line of work requires studying non-overlapping and overlapping data fragments, with a focus on complementary-information analysis and relevant/irrelevant feature learning.

Handling data scarcity: One-shot learning [132] and zero-shot learning [243], which are attracting increasing attention in computer vision and other domains, might provide effective solutions to the data and label scarcity issue. Originally, one-shot and zero-shot learning methods aim to recognize objects whose instances have been seen only a limited number of times, or even not at all, during training [126, 194], but they have also been applied to other tasks, such as imitation learning [61] in reinforcement learning and text classification [125] on sequential data. Most early work in this direction tries to make use of attributes to infer the label information of an input from one of the rare/unseen classes. Recent advances attempt to learn a mapping from the raw input space to a semantic space [169] or to multiple latent spaces [3].
These methods may potentially capture shared information between different patients, so that the knowledge learned for common diseases can benefit rare diseases and special cases.

Improving model interpretability and usability: Incorporating domain knowledge and the experience of domain experts into model design is useful in many applications [155, 249], and the same holds for health care. Prospective studies are a powerful tool in health care and could be even more effective if computer scientists were involved to adjust their models and experimental settings on the fly. Additionally, simple and efficient deep learning platforms or libraries would further benefit doctors in real-world health care research and applications.

Index

ARIMA, autoregressive integrated moving average
AUPRC, area under precision-recall curve
AUROC, area under the ROC curve
CNN, convolutional neural network
DDH, data-driven healthcare
DMM, deep Markov model
DNN, deep feed-forward neural network
DTW, dynamic time warping
EHR, electronic health record
ehrGAN, a modified GAN model for EHR data
ELBO, evidence lower bound
GAN, generative adversarial network
GBT, gradient boosting trees
GRU, gated recurrent unit
GRU-D, GRU with trainable decays
HMM, hidden Markov model
ICD-9, international classification of disease - 9th revision
ICU, intensive care unit
KF, Kalman filters
KL divergence, Kullback–Leibler divergence, or relative entropy
LR, logistic regression
LSTM, long short-term memory
MLP, multilayer perceptron
MR-HDMM, multi-rate hierarchical deep Markov model
MRMTS, multi-rate multivariate time series
MSE, mean squared error
MTS, multivariate time series
ReLU, rectified linear unit
RF, random forest
RNN, recurrent neural network
ROC curve, receiver operating characteristic curve
SGD, stochastic gradient descent
SSL,
semi-supervised learning, 86 SVM, support vector machine, 42 168 Reference List [1] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I. J., Harp, A., Irving, G., Isard, M., Jia, Y., Józefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D. G., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P. A., Vanhoucke, V., Vasudevan, V., Viégas, F. B., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. (2016). Tensorflow: Large-scale machine learning on heterogeneous distributed systems. CoRR, abs/1603.04467. [2] Achour, S., Dojat, M., Rieux, C., Bierling, P., and Lepage, E. (2001). Application of information technology: A umls-based knowledge acquisition tool for rule-based clinical decision support system development. JAMIA, 8(4):351–360. [3] Akata, Z., Malinowski, M., Fritz, M., and Schiele, B. (2016). Multi-cue zero-shot learning with strong supervision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 59–68. [4] Al-Aswad, A. M., Brownsell, S., Palmer, R., and Nichol, J. P. (2013). A review paper of the current status of electronic health records adoption worldwide: the gap between developed and developing countries. Journal of Health Informatics in Developing Countries, 7(2). [5] Al-Rfou, R., Alain, G., Almahairi, A., Angermüller, C., Bahdanau, D., Ballas, N., Bastien, F., Bayer, J., Belikov, A., Belopolsky, A., Bengio, Y., Bergeron, A., Bergstra, J., Bisson, V., Snyder, J. B., Bouchard, N., Boulanger-Lewandowski, N., Bouthillier, X., de Brébisson, A., Breuleux, O., Carrier, P. L., Cho, K., Chorowski, J., Christiano, P. F., Cooijmans, T., Côté, M., Côté, M., Courville, A. C., Dauphin, Y. N., Delalleau, O., Demouth, J., Desjardins, G., Dieleman, S., Dinh, L., Ducoffe, M., Dumoulin, V., Kahou, S. 
E., Erhan, D., Fan, Z., Firat, O., Germain, M., Glorot, X., Goodfellow, I. J., Graham, M., Gülçehre, Ç., Hamel, P., Harlouchet, I., Heng, J., Hidasi, B., Honari, S., Jain, A., Jean, S., Jia, K., Korobov, M., Kulkarni, V., Lamb, A., Lamblin, P., Larsen, E., Laurent, C., Lee, S., Lefrançois, S., Lemieux, S., Léonard, N., Lin, Z., Livezey, J. A., Lorenz, C., Lowin, J., Ma, Q., Manzagol, P., Mastropietro, O., McGibbon, R., Memisevic, R., van Merriënboer, B., Michalski, V., Mirza, M., Orlandi, A., Pal, C. J., Pascanu, R., Pezeshki, M., Raffel, C., Renshaw, D., Rocklin, M., Romero, A., Roth, M., Sadowski, P., Salvatier, J., Savard, F., Schlüter, J., Schulman, J., Schwartz, G., Serban, I. V., Serdyuk, D., Shabanian, S., Simon, É., Spieckermann, S., Subramanyam, S. R., Sygnowski, J., Tanguay, J., van Tulder, G., Turian, J. P., Urban, S., Vincent, P., Visin, F., de Vries, H., Warde-Farley, D., Webb, D. J., Willson, M., Xu, K., Xue, L., Yao, L., Zhang, S., and Zhang, Y. (2016). Theano: A python framework for fast computation of mathematical expressions. CoRR, abs/1605.02688.
[6] Ando, R. K. and Zhang, T. (2006). Learning on graph with laplacian regularization. In Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 4-7, 2006, pages 25–32.
[7] Andrew, G., Arora, R., Bilmes, J. A., and Livescu, K. (2013). Deep canonical correlation analysis. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, pages 1247–1255.
[8] Armesto, L., Tornero, J., and Vincze, M. (2008). On multi-rate fusion for nonlinear sampled-data systems: Application to a 6d tracking system. Robotics and Autonomous Systems, 56(8):706–715.
[9] Armstrong, H. (2015). Machines that learn in the wild: Machine learning capabilities, limitations and implications. NESTA, London.
[10] Atrey, P. K., Hossain, M.
A., El-Saddik, A., and Kankanhalli, M. S. (2010). Multimodal fusion for multimedia analysis: a survey. Multimedia Syst., 16(6):345–379.
[11] Avendi, M. R., Kheradvar, A., and Jafarkhani, H. (2016). A combined deep-learning and deformable-model approach to fully automatic segmentation of the left ventricle in cardiac MRI. Medical Image Analysis, 30:108–119.
[12] Azur, M. J., Stuart, E. A., Frangakis, C., and Leaf, P. J. (2011). Multiple imputation by chained equations: what is it and how does it work? International journal of methods in psychiatric research, 20(1):40–49.
[13] Ba, J. and Caruana, R. (2014). Do deep nets really need to be deep? In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 2654–2662.
[14] Bahadori, M. T., Yu, Q. R., and Liu, Y. (2014). Fast multivariate spatio-temporal analysis via low rank tensor learning. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3491–3499.
[15] Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.
[16] Balan, A. K., Rathod, V., Murphy, K. P., and Welling, M. (2015). Bayesian dark knowledge. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 3438–3446.
[17] Baltrušaitis, T., Ahuja, C., and Morency, L.-P. (2018). Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[18] Bartlett, P. L. and Mendelson, S. (2002). Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482.
[19] Batista, G. E. A. P. A. and Monard, M. C. (2002).
A study of k-nearest neighbour as an imputation method. In Soft Computing Systems - Design, Management and Applications, HIS 2002, December 1-4, 2002, Santiago, Chile, pages 251–260.
[20] Baytas, I. M., Xiao, C., Zhang, X., Wang, F., Jain, A. K., and Zhou, J. (2017). Patient subtyping via time-aware LSTM networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, August 13 - 17, 2017, pages 65–74.
[21] Beaulieu-Jones, B. K. and Greene, C. S. (2016). Semi-supervised learning of the electronic health record for phenotype stratification. Journal of Biomedical Informatics, 64:168–178.
[22] Bengio, Y., Courville, A. C., and Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1798–1828.
[23] Bengio, Y. and Gingras, F. (1995). Recurrent neural networks for missing or asynchronous data. In Advances in Neural Information Processing Systems 8, NIPS, Denver, CO, USA, November 27-30, 1995, pages 395–401.
[24] Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2006). Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 4-7, 2006, pages 153–160.
[25] Berndt, D. J. and Clifford, J. (1994). Using dynamic time warping to find patterns in time series. In Knowledge Discovery in Databases: Papers from the 1994 AAAI Workshop, Seattle, Washington, USA, July 1994. Technical Report WS-94-03, pages 359–370.
[26] Bodenreider, O. (2004). The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32(Database-Issue):267–270.
[27] Bonner, G. (2001). Decision making for health care professionals: use of decision trees within the community mental health setting. Journal of Advanced Nursing, 35(3):349–356.
[28] Boudreau, D., Von Korff, M., Rutter, C. M., Saunders, K., Ray, G. T., Sullivan, M. D., Campbell, C. I., Merrill, J. O., Silverberg, M. J., Banta-Green, C., et al. (2009). Trends in long-term opioid therapy for chronic non-cancer pain. Pharmacoepidemiology and drug safety, 18(12):1166–1175.
[29] Boureau, Y., Ponce, J., and LeCun, Y. (2010). A theoretical analysis of feature pooling in visual recognition. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, pages 111–118.
[30] Box, G. E., Jenkins, G. M., Reinsel, G. C., and Ljung, G. M. (2015). Time series analysis: forecasting and control. John Wiley & Sons.
[31] Braden, J. B., Sullivan, M. D., Ray, G. T., Saunders, K., Merrill, J., Silverberg, M. J., Rutter, C. M., Weisner, C., Banta-Green, C., Campbell, C., et al. (2009). Trends in long-term opioid therapy for noncancer pain among persons with a history of depression. General hospital psychiatry, 31(6):564–570.
[32] Breivik, H., Collett, B., Ventafridda, V., Cohen, R., and Gallacher, D. (2006). Survey of chronic pain in europe: prevalence, impact on daily life, and treatment. European journal of pain, 10(4):287–333.
[33] Bucila, C., Caruana, R., and Niculescu-Mizil, A. (2006). Model compression. In Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, August 20-23, 2006, pages 535–541.
[34] Centers for Disease Control and Prevention (CDC and others) (2011). Vital signs: overdoses of prescription opioid pain relievers—united states, 1999–2008. MMWR. Morbidity and mortality weekly report, 60(43):1487.
[35] Chand, M. S., Sharma, S., Singh, R. S., and Reddy, S. (2014). Comparison on difference in manual and electronic recording of vital signs in patients admitted in ctvs-icu and ccu. Nursing and Midwifery Research, 10(4):157.
[36] Che, C., Xiao, C., Liang, J., Jin, B., Zho, J., and Wang, F. (2017).
An RNN architecture with dynamic temporal matching for personalized predictions of parkinson’s disease. In Proceedings of the 2017 SIAM International Conference on Data Mining, Houston, Texas, USA, April 27-29, 2017, pages 198–206.
[37] Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. (2015). Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. CoRR, abs/1512.01274.
[38] Cheng, Y., Wang, F., Zhang, P., and Hu, J. (2016). Risk prediction with electronic health records: A deep learning approach. In Proceedings of the 2016 SIAM International Conference on Data Mining, Miami, Florida, USA, May 5-7, 2016, pages 432–440.
[39] Cho, K., van Merrienboer, B., Bahdanau, D., and Bengio, Y. (2014a). On the properties of neural machine translation: Encoder-decoder approaches. In Proceedings of SSST@EMNLP 2014, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, Doha, Qatar, 25 October 2014, pages 103–111.
[40] Cho, K., van Merrienboer, B., Gülçehre, Ç., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014b). Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1724–1734.
[41] Choi, E., Bahadori, M. T., Schuetz, A., Stewart, W. F., and Sun, J. (2016a). Doctor AI: predicting clinical events via recurrent neural networks. In Proceedings of the 1st Machine Learning in Health Care, MLHC 2016, Los Angeles, CA, USA, August 19-20, 2016, pages 301–318.
[42] Choi, E., Bahadori, M. T., Searles, E., Coffey, C., Thompson, M., Bost, J., Tejedor-Sojo, J., and Sun, J. (2016b). Multi-layer representation learning for medical concepts.
In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pages 1495–1504.
[43] Choi, Y., Chiu, C. Y., and Sontag, D. (2016c). Learning low-dimensional representations of medical concepts. In Summit on Clinical Research Informatics, CRI 2016, San Francisco, CA, USA, March 21-24, 2016.
[44] Chollet, F. et al. (2015). Keras: Deep learning for humans. https://keras.io.
[45] Chou, R., Clark, E., and Helfand, M. (2003). Comparative efficacy and safety of long-acting oral opioids for chronic non-cancer pain: a systematic review. Journal of pain and symptom management, 26(5):1026–1048.
[46] Chung, J., Ahn, S., and Bengio, Y. (2016). Hierarchical multiscale recurrent neural networks. CoRR, abs/1609.01704.
[47] Chung, J., Gülçehre, Ç., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555.
[48] Claeskens, G., Hjort, N. L., et al. (2008). Model selection and model averaging. Cambridge University Press.
[49] Cornet, R. and de Keizer, N. (2008). Forty years of SNOMED: a literature review. BMC Med. Inf. & Decision Making, 8(S-1):S2.
[50] Cost, H., Project, U., et al. (2016). Clinical classifications software (CCS) for ICD-9-CM. Last modified October, 7.
[51] De Boor, C., De Boor, C., Mathématicien, E.-U., De Boor, C., and De Boor, C. (1978). A practical guide to splines, volume 27. Springer-Verlag New York.
[52] Dear, R. E. (1959). A principal-component missing-data method for multiple regression models. System Development Corporation.
[53] Deng, J., Dong, W., Socher, R., Li, L., Li, K., and Li, F. (2009). Imagenet: A large-scale hierarchical image database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA, pages 248–255.
[54] Denil, M., Shakibi, B., Dinh, L., Ranzato, M., and de Freitas, N. (2013).
Predicting parameters in deep learning. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 2148–2156.
[55] Department of Justice and Office of Public Affairs (2017). Mckesson agrees to pay record $150 million settlement for failure to report suspicious orders of pharmaceutical drugs.
[56] Dernoncourt, F., Lee, J. Y., Uzuner, Ö., and Szolovits, P. (2017). De-identification of patient notes with recurrent neural networks. JAMIA, 24(3):596–606.
[57] Dheeru, D. and Karra Taniskidou, E. (2017). UCI machine learning repository. http://archive.ics.uci.edu/ml.
[58] Díaz-Galiano, M. C., Martín-Valdivia, M. T., and López, L. A. U. (2009). Query expansion with a medical ontology to improve a multimodal information retrieval system. Comp. in Bio. and Med., 39(4):396–403.
[59] Dowell, D., Haegerich, T. M., and Chou, R. (2016). Cdc guideline for prescribing opioids for chronic pain—united states, 2016. Jama, 315(15):1624–1645.
[60] Drolet, L., Michaud, F., and Côté, J. (2000). Adaptable sensor fusion using multiple kalman filters. In IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2000, October 30 - November 5, 2000, Takamatsu, Japan, pages 1434–1439.
[61] Duan, Y., Andrychowicz, M., Stadie, B. C., Ho, J., Schneider, J., Sutskever, I., Abbeel, P., and Zaremba, W. (2017). One-shot imitation learning. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 1087–1098.
[62] Duckworth, D. (2013). pykalman: Kalman filter, smoother, and em algorithm for python. http://pykalman.github.io/.
[63] Durbin, J. and Koopman, S. J. (2012). Time series analysis by state space methods, volume 38. Oxford University Press.
[64] English, P. (2016).
Predictive imputer: Predictive imputation of missing values with sklearn interface. https://github.com/log0ymxm/predictive_imputer.
[65] Erhan, D., Bengio, Y., Courville, A., and Vincent, P. (2009). Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1.
[66] Erhan, D., Bengio, Y., Courville, A. C., Manzagol, P., Vincent, P., and Bengio, S. (2010). Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11:625–660.
[67] Esteva, A., Kuprel, B., Novoa, R. A., Ko, J., Swetter, S. M., Blau, H. M., and Thrun, S. (2017). Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639):115–118.
[68] Fan, C., Chang, P., Lin, J., and Hsieh, J. C. (2011). A hybrid model combining case-based reasoning and fuzzy decision tree for medical data classification. Appl. Soft Comput., 11(1):632–644.
[69] Fraccaro, M., Sønderby, S. K., Paquet, U., and Winther, O. (2016). Sequential neural models with stochastic layers. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 2199–2207.
[70] Frenk, S. M., Porter, K. S., and Paulozzi, L. (2015). Prescription opioid analgesic use among adults: United States, 1999-2012. Number 2015. US Department of Health and Human Services, Centers for Disease Control and Prevention, National Center for Health Statistics.
[71] Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of statistics, pages 1189–1232.
[72] Friedman, J. H. (2002). Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367–378.
[73] Gal, Y. and Ghahramani, Z. (2016). A theoretically grounded application of dropout in recurrent neural networks.
In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 1019–1027.
[74] Gan, Z., Li, C., Henao, R., Carlson, D. E., and Carin, L. (2015). Deep temporal sigmoid belief networks for sequence modeling. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 2467–2475.
[75] Gao, J. and Han, J. (2008). Research challenges for data mining in science and engineering. In Next Generation of Data Mining, pages 24–51. Chapman and Hall/CRC.
[76] García-Laencina, P. J., Sancho-Gómez, J., and Figueiras-Vidal, A. R. (2010). Pattern classification with missing data: a review. Neural Computing and Applications, 19(2):263–282.
[77] Garla, V., Taylor, C., and Brandt, C. (2013). Semi-supervised clinical text classification with laplacian svms: An application to cancer case management. Journal of Biomedical Informatics, 46(5):869–875.
[78] Giorgino, T. et al. (2009). Computing and visualizing dynamic time warping alignments in r: the dtw package. Journal of statistical Software, 31(7):1–24.
[79] Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2010, Chia Laguna Resort, Sardinia, Italy, May 13-15, 2010, pages 249–256.
[80] Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 11-13, 2011, pages 315–323.
[81] Goldberger, A. L., Amaral, L. A., Glass, L., Hausdorff, J. M., Ivanov, P. C., Mark, R. G., Mietus, J. E., Moody, G. B., Peng, C.-K., and Stanley, H. E. (2000).
Physiobank, physiotoolkit, and physionet: components of a new research resource for complex physiologic signals. Circulation, 101(23):e215–e220.
[82] Goller, C. and Kuchler, A. (1996). Learning task-dependent distributed representations by backpropagation through structure. Neural Networks, 1:347–352.
[83] Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. C., and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 2672–2680.
[84] Graves, A., Mohamed, A., and Hinton, G. E. (2013). Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2013, Vancouver, BC, Canada, May 26-31, 2013, pages 6645–6649.
[85] Graves, A., Wayne, G., and Danihelka, I. (2014). Neural turing machines. CoRR, abs/1410.5401.
[86] Gurban, M., Thiran, J., Drugman, T., and Dutoit, T. (2008). Dynamic modality weighting for multi-stream hmms in audio-visual speech recognition. In Proceedings of the 10th International Conference on Multimodal Interfaces, ICMI 2008, Chania, Crete, Greece, October 20-22, 2008, pages 237–240.
[87] Hamilton, J. D. (1994). Time series analysis, volume 2. Princeton university press Princeton, NJ.
[88] He, J. (2017). Learning from data heterogeneity: Algorithms and applications. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, pages 5126–5130.
[89] Heinrich, M. P., Jenkinson, M., Bhushan, M., Matin, T. N., Gleeson, F., Brady, M., and Schnabel, J. A. (2012). MIND: modality independent neighbourhood descriptor for multi-modal deformable registration. Medical Image Analysis, 16(7):1423–1435.
[90] Hermans, M. and Schrauwen, B. (2013).
Training and analysing deep recurrent neural networks. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 190–198.
[91] Hihi, S. E. and Bengio, Y. (1995). Hierarchical recurrent neural networks for long-term dependencies. In Advances in Neural Information Processing Systems 8, NIPS, Denver, CO, USA, November 27-30, 1995, pages 493–499.
[92] Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., et al. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag., 29(6):82–97.
[93] Hinton, G. E., Osindero, S., and Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554.
[94] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.
[95] Hinton, G. E., Vinyals, O., and Dean, J. (2015). Distilling the knowledge in a neural network. CoRR, abs/1503.02531.
[96] Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.
[97] Holzinger, A. (2016). Interactive machine learning for health informatics: when do we need the human-in-the-loop? Brain Informatics, 3(2):119–131.
[98] Hooten, W. M., St Sauver, J. L., McGree, M. E., Jacobson, D. J., and Warner, D. O. (2015). Incidence and risk factors for progression from short-term to episodic or long-term opioid prescribing: a population-based study. In Mayo Clinic Proceedings, volume 90, pages 850–856. Elsevier.
[99] Hornik, K., Stinchcombe, M. B., and White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366.
[100] Houtekamer, P. L. and Mitchell, H. L. (2001).
A sequential ensemble kalman filter for atmospheric data assimilation. Monthly Weather Review, 129(1):123–137.
[101] Hyvärinen, A. and Smith, S. M. (2013). Pairwise likelihood ratios for estimation of non-gaussian structural equation models. Journal of Machine Learning Research, 14(1):111–152.
[102] Im, D. J., Kim, C. D., Jiang, H., and Memisevic, R. (2016). Generating images with recurrent adversarial networks. CoRR, abs/1602.05110.
[103] Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 448–456.
[104] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R. B., Guadarrama, S., and Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, MM ’14, Orlando, FL, USA, November 03 - 07, 2014, pages 675–678.
[105] Jiang, F., Jiang, Y., Zhi, H., Dong, Y., Li, H., Ma, S., Wang, Y., Dong, Q., Shen, H., and Wang, Y. (2017). Artificial intelligence in healthcare: past, present and future. Stroke and vascular neurology, 2(4):230–243.
[106] Johnson, A. E., Pollard, T. J., Shen, L., Li-wei, H. L., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L. A., and Mark, R. G. (2016). Mimic-iii, a freely accessible critical care database. Scientific data, 3:160035.
[107] Johnson, A. E. W., Pollard, T. J., and Mark, R. G. (2017). Reproducibility in critical care: a mortality prediction case study. In Proceedings of the Machine Learning for Health Care, MLHC 2017, Boston, Massachusetts, USA, 18-19 August 2017, pages 361–376.
[108] Jones, E., Oliphant, T., Peterson, P., et al. (2001). SciPy: Open source scientific tools for Python. http://www.scipy.org/.
[109] Jordan, M. I. (1998). Learning in graphical models, volume 89. Springer Science & Business Media.
[110] Kalman, R. E.
(1960). A new approach to linear filtering and prediction problems. Journal of basic Engineering, 82(1):35–45.
[111] Kalra, M. and Lal, N. (2016). Data mining of heterogeneous data with research challenges. In Colossal Data Analysis and Networking (CDAN), Symposium on, pages 1–6. IEEE.
[112] Kalso, E., Edwards, J. E., Moore, R. A., and McQuay, H. J. (2004). Opioids in chronic non-cancer pain: systematic review of efficacy and safety. Pain, 112(3):372–380.
[113] Kerr, K. F., Bansal, A., and Pepe, M. S. (2012). Further insight into the incremental value of new markers: the interpretation of performance measures and the importance of clinical context. American journal of epidemiology, 176(6):482–487.
[114] Khemani, R. G., Conti, D., Alonzo, T. A., Bart, R. D., and Newth, C. J. (2009). Effect of tidal volume in children with acute hypoxemic respiratory failure. Intensive care medicine, 35(8):1428–1437.
[115] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
[116] Kingma, D. P., Mohamed, S., Rezende, D. J., and Welling, M. (2014). Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3581–3589.
[117] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes. CoRR, abs/1312.6114.
[118] Koh, H. C., Tan, G., et al. (2011). Data mining applications in healthcare. Journal of healthcare information management, 19(2):65.
[119] Koren, Y., Bell, R. M., and Volinsky, C. (2009). Matrix factorization techniques for recommender systems. IEEE Computer, 42(8):30–37.
[120] Koutník, J., Greff, K., Gomez, F. J., and Schmidhuber, J. (2014). A clockwork RNN. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 1863–1871.
[121] Kreindler, D. M. and Lumsden, C. J. (2016).
The effects of the irregular sample and missing data in time series analysis. In Nonlinear Dynamical Systems Analysis for the Behavioral Sciences Using Real Data, pages 149–172. CRC Press.
[122] Krishnan, R. G., Shalit, U., and Sontag, D. (2015). Deep kalman filters. CoRR, abs/1511.05121.
[123] Krishnan, R. G., Shalit, U., and Sontag, D. (2017). Structured inference networks for nonlinear state space models. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pages 2101–2109.
[124] Kruse, C. S., Goswamy, R., Raval, Y., and Marawi, S. (2016). Challenges and opportunities of big data in health care: a systematic review. JMIR medical informatics, 4(4).
[125] Kumar, A., Muddireddy, P. R., Dreyer, M., and Hoffmeister, B. (2017). Zero-shot learning across heterogeneous overlapping domains. In Interspeech 2017, 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, August 20-24, 2017, pages 2914–2918.
[126] Lampert, C. H., Nickisch, H., and Harmeling, S. (2014). Attribute-based classification for zero-shot visual object categorization. IEEE Trans. Pattern Anal. Mach. Intell., 36(3):453–465.
[127] Larochelle, H., Bengio, Y., Louradour, J., and Lamblin, P. (2009). Exploring strategies for training deep neural networks. Journal of Machine Learning Research, 10:1–40.
[128] Lasko, T. A., Denny, J. C., and Levy, M. A. (2013). Computational phenotype discovery using unsupervised feature learning over noisy, sparse, and irregular clinical data. PloS one, 8(6):e66341.
[129] LeCun, Y., Boser, B. E., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W. E., and Jackel, L. D. (1989). Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems 2, [NIPS Conference, Denver, Colorado, USA, November 27-30, 1989], pages 396–404.
[130] Leung, M. K. K., Delong, A., Alipanahi, B., and Frey, B. J. (2016).
Machine learning in genomic medicine: A review of computational problems and data sets. Proceedings of the IEEE, 104(1):176–197.
[131] Li, C. and Wand, M. (2016). Precomputed real-time texture synthesis with markovian generative adversarial networks. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III, pages 702–716.
[132] Li, F., Fergus, R., and Perona, P. (2006). One-shot learning of object categories. IEEE Trans. Pattern Anal. Mach. Intell., 28(4):594–611.
[133] Li, J., Zhao, R., Huang, J., and Gong, Y. (2014). Learning small-size DNN with output-distribution-based criteria. In INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association, Singapore, September 14-18, 2014, pages 1910–1914.
[134] Lin, H. W., Tegmark, M., and Rolnick, D. (2017). Why does deep and cheap learning work so well? Journal of Statistical Physics, 168(6):1223–1247.
[135] Lipton, Z. C., Kale, D. C., Elkan, C., and Wetzel, R. C. (2015). Learning to diagnose with LSTM recurrent neural networks. CoRR, abs/1511.03677.
[136] Lipton, Z. C., Kale, D. C., and Wetzel, R. C. (2016). Directly modeling missing data in sequences with rnns: Improved classification of clinical time series. In Proceedings of the 1st Machine Learning in Health Care, MLHC 2016, Los Angeles, CA, USA, August 19-20, 2016, pages 253–270.
[137] Lisboa, P. (2004). Neural networks in medical journals: current trends and implications for biopattern. In Proc. 1st European Workshop on Assessment of Diagnostic Performance (EWADP), volume 7, pages 99–112.
[138] Lisboa, P. J. G. (2002). A review of evidence of health benefit from artificial neural networks in medical intervention. Neural Networks, 15(1):11–39.
[139] Liu, M. and Tuzel, O. (2016). Coupled generative adversarial networks.
In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 469–477.
[140] Luo, Y. and Rumshisky, A. (2016). Interpretable topic features for post-icu mortality prediction. In AMIA 2016, American Medical Informatics Association Annual Symposium, Chicago, IL, USA, November 12-16, 2016.
[141] Ma, F., Chitta, R., Zhou, J., You, Q., Sun, T., and Gao, J. (2017). Dipole: Diagnosis prediction in healthcare via attention-based bidirectional recurrent neural networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, August 13 - 17, 2017, pages 1903–1911.
[142] Madeo, R. C. B., Lima, C. A. M., and Peres, S. M. (2013). Gesture unit segmentation using support vector machines: segmenting gestures from rest positions. In Proceedings of the 28th Annual ACM Symposium on Applied Computing, SAC ’13, Coimbra, Portugal, March 18-22, 2013, pages 46–52.
[143] Madnick, S. E. and Lee, Y. W. (2011). Editorial notes: Classification and assessment of large amounts of data: Examples in the healthcare industry and collaborative digital libraries. J. Data and Information Quality, 2(3):12:1–12:2.
[144] Madsen, L. B. (2014). Data-driven healthcare: How analytics and BI are transforming the industry. John Wiley & Sons.
[145] Marlin, B. M., Kale, D. C., Khemani, R. G., and Wetzel, R. C. (2012). Unsupervised pattern discovery in electronic health care data using probabilistic clustering models. In ACM International Health Informatics Symposium, IHI ’12, Miami, FL, USA, January 28-30, 2012, pages 389–398.
[146] Martens, J. and Sutskever, I. (2011). Learning recurrent neural networks with hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, pages 1033–1040.
[147] Mathieu, M., Zhao, J.
J., Sprechmann, P., Ramesh, A., and LeCun, Y. (2016). Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 5041–5049. [148] Mazumder, R., Hastie, T., and Tibshirani, R. (2010). Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11:2287–2322. [149] Mehrabi, S., Sohn, S., Li, D., Pankratz, J. J., Therneau, T. M., Sauver, J. L. S., Liu, H., and Palakal, M. J. (2015). Temporal pattern and association discovery of diagnosis codes using deep learning. In 2015 International Conference on Healthcare Informatics, ICHI 2015, Dallas, TX, USA, October 21-23, 2015, pages 408–416. [150] Mikolov, T., Karafiát, M., Burget, L., Cernocký, J., and Khudanpur, S. (2010). Recurrent neural network based language model. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, pages 1045–1048. [151] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States., pages 3111–3119. [152] Milletari, F., Navab, N., and Ahmadi, S. (2016). V-net: Fully convolutional neuralnetworksforvolumetricmedicalimagesegmentation. InFourth International Conference on 3D Vision, 3DV 2016, Stanford, CA, USA, October 25-28, 2016, pages 565–571. 183 [153] Miotto, R., Li, L., Kidd, B. A., and Dudley, J. T. (2016). Deep patient: an unsupervised representation to predict the future of patients from the electronic health records. Scientific reports, 6:26094. 
[154] Miotto, R., Wang, F., Wang, S., Jiang, X., and Dudley, J. T. (2017). Deep learning for healthcare: review, opportunities and challenges. Briefings in bioin- formatics. [155] Mirchevska, V., Lustrek, M., and Gams, M. (2014). Combining domain knowledgeandmachinelearningforrobustfalldetection. Expert Systems,31(2):163– 175. [156] Miyato, T., Dai, A. M., and Goodfellow, I. J. (2016). Virtual adversarial training for semi-supervised text classification. CoRR, abs/1605.07725. [157] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M. A., Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533. [158] Mondal, D. and Percival, D. B. (2010). Wavelet variance analysis for gappy time series. Annals of the Institute of Statistical Mathematics, 62(5):943–966. [159] Murphy, K. P. (2012). Machine learning - a probabilistic perspective. Adaptive computation and machine learning series. MIT Press. [160] Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, pages 807–814. [161] Nash, J. (1951). Non-cooperative games. Annals of mathematics, pages 286– 295. [162] Negenborn, R. (2003). Robot localization and kalman filters. Utrecht Univ., Utrecht, Netherlands, Master’s thesis INF/SCR-0309. [163] Neil, D., Pfeiffer, M., and Liu, S. (2016). Phased LSTM: accelerating recurrent network training for long or event-based sequences. In Advances in Neural Informa- tion Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 3882–3890. [164] Nelson, S. J., Zeng, K., Kilbourne, J., Powell, T., and Moore, R. (2011). 
Normalized names for clinical drugs: Rxnorm at 6 years. JAMIA, 18(4):441–448. 184 [165] Ng, A. Y. and Jordan, M. I. (2001). On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes. In Advances in Neural Infor- mation Processing Systems 14 [Neural Information Processing Systems: Natural and Synthetic, NIPS 2001, December 3-8, 2001, Vancouver, British Columbia, Canada], pages 841–848. [166] Nie, L., Wang, M., Zhang, L., Yan, S., Zhang, B., and Chua, T. (2015). Disease inference from health-related questions via sparse deep learning. IEEE Trans. Knowl. Data Eng., 27(8):2107–2119. [167] Niu, G., Jitkrittum, W., Dai, B., Hachiya, H., and Sugiyama, M. (2013). Squared-loss mutual information regularization: A novel information-theoretic approach to semi-supervised learning. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, pages 10–18. [168] Organization, W. H. (2004). International statistical classification of diseases and related health problems, volume 1. World Health Organization. [169] Palatucci, M., Pomerleau, D., Hinton, G. E., and Mitchell, T. M. (2009). Zero-shot learning with semantic output codes. In Advances in Neural Information Processing Systems 22: 23rd Annual Conference on Neural Information Processing Systems 2009. Proceedings of a meeting held 7-10 December 2009, Vancouver, British Columbia, Canada., pages 1410–1418. [170] Papernot, N., Abadi, M., Erlingsson, Ú., Goodfellow, I. J., and Talwar, K. (2016). Semi-supervised knowledge transfer for deep learning from private training data. CoRR, abs/1610.05755. [171] Parisotto, E., Ba, L. J., and Salakhutdinov, R. (2015). Actor-mimic: Deep multitask and transfer reinforcement learning. CoRR, abs/1511.06342. [172] Parkhi, O. M., Vedaldi, A., and Zisserman, A. (2015). Deep face recognition. 
In Proceedings of the British Machine Vision Conference 2015, BMVC 2015, Swansea, UK, September 7-10, 2015, pages 41.1–41.12. [173] Parveen, S. and Green, P. D. (2001). Speech recognition with missing data using recurrent neural nets. In Advances in Neural Information Processing Systems 14 [Neural Information Processing Systems: Natural and Synthetic, NIPS 2001, December 3-8, 2001, Vancouver, British Columbia, Canada], pages 1189–1195. [174] Pascanu, R., Mikolov, T., and Bengio, Y. (2013). On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, pages 1310–1318. 185 [175] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. (2017). Automatic differentiation in pytorch. [176] Paulozzi, L. J., Mack, K. A., and Hockenberry, J. M. (2014). Vital signs: variation among states in prescribing of opioid pain relievers and benzodiazepines- united states, 2012. Morbidity and Mortality Weekly Report, 63(26):563–568. [177] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine learning in python. Journal of machine learning research, 12(Oct):2825– 2830. [178] Peleg, M., Tu, S. W., Bury, J., Ciccarese, P., Fox, J., Greenes, R. A., Hall, R. W., Johnson, P. D., Jones, N., Kumar, A., Miksch, S., Quaglini, S., Seyfang, A., Shortliffe, E. H., and Stefanelli, M. (2003). Research paper: Comparing computer- interpretable guideline models: A case-study approach. JAMIA, 10(1):52–68. [179] Pham, T., Tran, T., Phung, D. Q., and Venkatesh, S. (2016). Deepcare: A deep dynamic memory model for predictive medicine. In Advances in Knowl- edge Discovery and Data Mining - 20th Pacific-Asia Conference, PAKDD 2016, Auckland, New Zealand, April 19-22, 2016, Proceedings, Part II, pages 30–41. 
[180] Pietersma, D., Lacroix, R., Lefebvre, D., and Wade, K. M. (2003). Performance analysis for machine-learning experiments using small data sets. Computers and electronics in agriculture, 38(1):1–17. [181] Pivovarov, R. (2015). Electronic Health Record Summarization over Heteroge- neous and Irregularly Sampled Clinical Data. PhD thesis, Columbia University. [182] Purushotham, S., Meng, C., Che, Z., and Liu, Y. (2018). Benchmarking deep learning models on large healthcare datasets. Journal of Biomedical Informatics, 83:112–134. [183] Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1):81– 106. [184] Rabiner, L. R. (1989). A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286. [185] Radford, A., Metz, L., and Chintala, S. (2015). Unsupervised representa- tion learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434. 186 [186] Raghupathi, W. and Raghupathi, V. (2014). Big data analytics in healthcare: promise and potential. Health information science and systems, 2(1):3. [187] Ranzato, M., Poultney, C. S., Chopra, S., and LeCun, Y. (2006). Efficient learn- ing of sparse representations with an energy-based model. In Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Con- ference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 4-7, 2006, pages 1137–1144. [188] Ravì, D., Wong, C., Lo, B., and Yang, G. (2017). A deep learning approach to on-node sensor data analytics for mobile or wearable devices. IEEE J. Biomedical and Health Informatics, 21(1):56–64. [189] Razavian, N. and Sontag, D. (2015). Temporal convolutional neural networks for diagnosis from lab tests. CoRR, abs/1511.07938. [190] Rehfeld, K., Marwan, N., Heitzig, J., and Kurths, J. (2011). Comparison of correlation analysis techniques for irregularly sampled time series. 
Nonlinear Processes in Geophysics, 18(3):389–404. [191] Reinsel, G. C. (2003). Elements of multivariate time series analysis. Springer Science & Business Media. [192] Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropa- gation and approximate inference in deep generative models. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 1278–1286. [193] Robinson, A. and Fallside, F. (1987). The utility driven dynamic error propa- gation network. University of Cambridge Department of Engineering. [194] Rohrbach, M., Stark, M., and Schiele, B. (2011). Evaluating knowledge transfer and zero-shot learning in a large-scale setting. In The 24th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2011, Colorado Springs, CO, USA, 20-25 June 2011, pages 1641–1648. [195] Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3):581–592. [196] Rubinsteyn, A. and Feldman, S. (2015). fancyimpute: Multivariate imputation and matrix completion algorithms implemented in python. https://github.com/ hammerlab/fancyimpute. [197] Rudd, R. A., Aleshire, N., Zibbell, J. E., and Gladden, R. M. (2016). Increases in drug and opioid overdose deaths—united states, 2000–2014. American Journal of Transplantation, 16(4):1323–1327. 187 [198] Rusu, A. A., Colmenarejo, S. G., Gülçehre, Ç., Desjardins, G., Kirkpatrick, J., Pascanu, R., Mnih, V., Kavukcuoglu, K., and Hadsell, R. (2015). Policy distillation. CoRR, abs/1511.06295. [199] Safari, S., Shabani, F., and Simon, D. (2014). Multirate multisensor data fusion for linear systems using kalman filters and a neural network. Aerospace Science and Technology, 39:465–471. [200] Salimans, T., Goodfellow, I. J., Zaremba, W., Cheung, V., Radford, A., and Chen, X. (2016). Improved techniques for training gans. 
In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 2226–2234. [201] Saunders, K. W., Von Korff, M., Campbell, C. I., Banta-Green, C. J., Sullivan, M. D., Merrill, J. O., and Weisner, C. (2012). Concurrent use of alcohol and sedatives among persons prescribed chronic opioid therapy: prevalence and risk factors. The Journal of Pain, 13(3):266–275. [202] Schafer, J. L. and Graham, J. W. (2002). Missing data: our view of the state of the art. Psychological methods, 7(2):147. [203] Science and Technology Committee (House of Commons) (2016). report on robotics and artificial intelligence. https://publications.parliament.uk/pa/ cm201617/cmselect/cmsctech/145/14502.htm. [204] Seabold, S. and Perktold, J. (2010). Statsmodels: Econometric and statistical modeling with python. In Proceedings of the 9th Python in Science Conference, volume 57, page 61. SciPy society Austin. [205] Shaheed, C. A., Maher, C. G., Williams, K. A., Day, R., and McLachlan, A. J. (2016). Efficacy, tolerability, and dose-dependent effects of opioid analgesics for low back pain: a systematic review and meta-analysis. JAMA internal medicine, 176(7):958–968. [206] Shickel, B., Tighe, P., Bihorac, A., and Rashidi, P. (2018). Deep EHR: A survey of recent advances in deep learning techniques for electronic health record (EHR) analysis. IEEE J. Biomedical and Health Informatics, 22(5):1589–1604. [207] Silva, I., Moody, G., Scott, D. J., Celi, L. A., and Mark, R. G. (2012). Predicting in-hospital mortality of icu patients: The physionet/computing in cardiology challenge 2012. Computing in cardiology, 39:245. [208] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, 188 S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T. 
P., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. (2016). Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489. [209] Smith, G. I. (2006). Basic CPT/HCPCS Coding 2007. American Health Information Management Association. [210] Socher, R., Lin, C. C., Ng, A. Y., and Manning, C. D. (2011). Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, pages 129–136. [211] Springenberg, J. T. (2015). Unsupervised and semi-supervised learning with categorical generative adversarial networks. CoRR, abs/1511.06390. [212] Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958. [213] Srivastava, N. and Salakhutdinov, R. (2013). Discriminative transfer learning with tree-based priors. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States., pages 2094–2102. [214] St Sauver, J., Buntrock, J., Rademacher, D., et al. (2010). Comparison of mayo clinic coding systems. [215] St. Sauver, J. L., Grossardt, B. R., Leibson, C. L., Yawn, B. P., Melton III, L. J., and Rocca, W. A. (2012). Generalizability of epidemiological findings and public health decisions: an illustration from the rochester epidemiology project. In Mayo Clinic Proceedings, volume 87, pages 151–160. Elsevier. [216] St. Sauver, J. L., Grossardt, B. R., Yawn, B. P., Melton III, L. J., and Rocca, W. A. (2011). Use of a medical records linkage system to enumerate a dynamic population over time: the rochester epidemiology project. American journal of epidemiology, 173(9):1059–1068. [217] Stekhoven, D. J. 
and Bühlmann, P. (2011). Missforest—non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1):112–118. [218] Sun, Y., Chen, Y., Wang, X., and Tang, X. (2014). Deep learning face repre- sentation by joint identification-verification. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 1988–1996. 189 [219] Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3104–3112. [220] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and Fergus, R. (2013). Intriguing properties of neural networks. CoRR, abs/1312.6199. [221] Tavares, J. and Oliveira, T. (2017). Electronic health record portal adoption: a cross country analysis. BMC Med. Inf. & Decision Making, 17(1):97:1–97:17. [222] Taylor, G. W., Hinton, G. E., and Roweis, S. T. (2006). Modeling human motion using binary latent variables. In Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 4-7, 2006, pages 1345–1352. [223] Th, M., Sahu, S. K., and Anand, A. (2015). Evaluating distributed word representations for capturing semantics of biomedical concepts. In Proceedings of the Workshop on Biomedical Natural Language Processing, BioNLP@IJCNLP 2015, Beijing, China, July 30, 2015, pages 158–163. [224] Thomas, P. (2009). Semi-supervised learning by olivier chapelle, bernhard schölkopf, and alexander zien (review). IEEE Trans. Neural Networks, 20(3):542. [225] Thorson, D., Biewen, P., Bonte, B., Epstein, H., Haake, B., Hansen, C., Hooten, M., Hora, J., Johnson, C., Keeling, F., et al. (2014). 
Acute pain assessment and opioid prescribing protocol. Institute for Clinical Systems Improvement. [226] Tieleman, T. and Hinton, G. (2012). Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31. [227] Torralba, A., Fergus, R., and Freeman, W. T. (2008). 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell., 30(11):1958–1970. [228] Tresp, V. and Briegel, T. (1997). A solution for missing data in recurrent neural networks with an application to blood glucose prediction. In Advances in Neural Information Processing Systems 10, [NIPS Conference, Denver, Colorado, USA, 1997], pages 971–977. [229] Vapnik, V. (2013). The nature of statistical learning theory. Springer science & business media. 190 [230] Vellido, A., Martín-Guerrero, J. D., and Lisboa, P. J. G. (2012). Making machine learning models interpretable. In 20th European Symposium on Artificial Neural Networks, ESANN 2012, Bruges, Belgium, April 25-27, 2012. [231] Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371– 3408. [232] Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015). Show and tell: A neural image caption generator. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 3156–3164. [233] Vodovotz, Y., An, G., and Androulakis, I. P. (2013). A systems engineering per- spective on homeostasis and disease. Frontiers in bioengineering and biotechnology, 1:6. [234] Von Korff, M., Saunders, K., Ray, G. T., Boudreau, D., Campbell, C., Merrill, J., Sullivan, M. D., Rutter, C., Silverberg, M., Banta-Green, C., et al. (2008). Defacto long-term opioid therapy for non-cancer pain. 
The Clinical journal of pain, 24(6):521. [235] Wang, D. and Liu, Q. (2016). Learning to draw samples: With application to amortized MLE for generative adversarial learning. CoRR, abs/1611.01722. [236] Wang, Y., Wang, L., Rastegar-Mojarad, M., Moon, S., Shen, F., Afzal, N., Liu, S., Zeng, Y., Mehrabi, S., Sohn, S., and Liu, H. (2018). Clinical information extraction applications: A literature review. Journal of Biomedical Informatics, 77:34–49. [237] Weinberger, K. Q., Sha, F., Zhu, Q., and Saul, L. K. (2006). Graph laplacian regularization for large-scale semidefinite programming. In Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Con- ference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 4-7, 2006, pages 1489–1496. [238] Weisner, C. M., Campbell, C. I., Ray, G. T., Saunders, K., Merrill, J. O., Banta-Green, C., Sullivan, M. D., Silverberg, M. J., Mertens, J. R., Boudreau, D., et al. (2009). Trends in prescribed opioid therapy for non-cancer pain for individuals with prior substance use disorders. Pain, 145(3):287–293. [239] Wells, B. J., Chagin, K. M., Nowacki, A. S., and Kattan, M. W. (2013). Strategies for handling missing data in electronic health record derived data. eGEMs, 1(3). 191 [240] White, I. R., Royston, P., and Wood, A. M. (2011). Multiple imputation using chained equations: issues and guidance for practice. Statistics in medicine, 30(4):377–399. [241] Wongchaisuwat, P., Klabjan, D., and Jonnalagadda, S. R. (2016). A semi- supervised learning approach to enhance health care community–based question answering: A case study in alcoholism. JMIR medical informatics, 4(3). [242] Wu, J., Roy, J., and Stewart, W. F. (2010). Prediction modeling using ehr data: challenges, strategies, and a comparison of machine learning approaches. Medical care, pages S106–S113. [243] Xian, Y., Schiele, B., and Akata, Z. (2017). Zero-shot learning - the good, the bad and the ugly. 
In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 3077–3086. [244] Xiang, S., Yuan, L., Fan, W., Wang, Y., Thompson, P. M., and Ye, J. (2013). Multi-source learning with block-wise missing data for alzheimer’s disease pre- diction. In The 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2013, Chicago, IL, USA, August 11-14, 2013, pages 185–193. [245] Xiao, C., Choi, E., and Sun, J. (2018). Opportunities and challenges in developing deep learning models using electronic health records data: a systematic review. Journal of the American Medical Informatics Association. [246] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A. C., Salakhutdinov, R., Zemel, R. S., and Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 2048–2057. [247] Xu, Y., Mo, T., Feng, Q., Zhong, P., Lai, M., and Chang, E. I. (2014). Deep learning of feature representation with multiple instance learning for medical image analysis. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2014, Florence, Italy, May 4-9, 2014, pages 1626–1630. [248] Yao, Z., Liu, P., Lei, L., and Yin, J. (2005). R-c4. 5 decision tree model and its applications to health care dataset. In Services Systems and Services Management, 2005. Proceedings of ICSSSM’05. 2005 International Conference on, volume 2, pages 1099–1103. IEEE. [249] Yu, T., Simoff, S. J., and Jan, T. (2010). VQSVM: A case study for incorpo- rating prior domain knowledge into inductive machine learning. Neurocomputing, 73(13-15):2614–2623. 192 [250] Zeiler, M. D. and Fergus, R. (2014). Visualizing and understanding convolu- tional networks. 
In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I, pages 818–833. [251] Zhai, S., Cheng, Y., Feris, R. S., and Zhang, Z. (2016). Generative adversarial networks as variational training of energy based models. CoRR, abs/1611.01799. [252] Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2016). Under- standing deep learning requires rethinking generalization. CoRR, abs/1611.03530. [253] Zhang, G., Ou, S., Huang, Y., and Wang, C. (2015). Semi-supervised learning methods for large scale healthcare data analysis. IJCIH, 2(2):98–110. [254] Zhang, T., Popescul, A., and Dom, B. (2006). Linear prediction models with graph regularization for web-page categorization. In Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, August 20-23, 2006, pages 821–826. [255] Zhou, D., Bousquet, O., Lal, T. N., Weston, J., and Schölkopf, B. (2003). Learning with local and global consistency. In Advances in Neural Informa- tion Processing Systems 16 [Neural Information Processing Systems, NIPS 2003, December 8-13, 2003, Vancouver and Whistler, British Columbia, Canada], pages 321–328. [256] Zhou, G., Sohn, K., andLee, H.(2012). Onlineincrementalfeaturelearningwith denoising autoencoders. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2012, La Palma, Canary Islands, Spain, April 21-23, 2012, pages 1453–1461. [257] Zhou, J., Wang, F., Hu, J., and Ye, J. (2014). From micro to macro: data driven phenotyping by densification of longitudinal electronic medical records. In The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, New York, NY, USA - August 24 - 27, 2014, pages 135–144. [258] Zhou, L. and Hripcsak, G. (2007). Temporal reasoning with medical data—a reviewwithemphasisonmedicalnaturallanguageprocessing. 
Journal of biomedical informatics, 40(2):183–202. [259] Zhu, J., Chen, N., Perkins, H., and Zhang, B. (2014). Gibbs max-margin topic models with data augmentation. Journal of Machine Learning Research, 15(1):1073–1110. 193 [260] Zhu, X. (2007). Semi-supervised learning tutorial. In Machine Learning, Pro- ceedings of the Twenty-Fourth International Conference (ICML 2007), Corvallis, Oregon, USA, June 20-24, 2007, pages 1–135. 194
Abstract
The worldwide push for electronic health records has resulted in a surge in the volume, detail, and availability of digital health data, offering an unprecedented opportunity to solve many difficult and important problems in health care. Clinicians are collaborating with computer scientists to seize this opportunity and improve the state of data-driven, personalized health care services. Meanwhile, the recent success of deep learning is revolutionizing many domains and provides promising solutions to prediction and feature discovery problems on health care data, bringing us closer than ever to the goals of improving health quality, reducing cost, and, most importantly, saving lives. However, the rapid rise of this research field, with more data and new applications, has also introduced several challenges that have not yet been well addressed.

My thesis work provides deep learning-based solutions to three major challenges in health care tasks on temporal data: data heterogeneity, data scarcity, and model interpretability and usability in practical applications.

To utilize data heterogeneity, we develop recurrent neural network models that exploit missingness in time series to improve prediction performance. We then introduce hierarchical deep generative models for multi-rate multivariate time series, which capture multi-scale temporal dependencies and the complex underlying generation mechanisms of temporal data in health care.

To handle data scarcity, we introduce a semi-supervised learning framework with modified generative adversarial networks to boost prediction performance with a limited amount of labeled data. We also explore incorporating prior knowledge and incremental learning to train deep neural networks more efficiently.
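The missingness-aware recurrent models mentioned above replace an unobserved measurement with a decayed combination of the last observed value and the empirical mean, where the decay grows with the time elapsed since the last observation. The NumPy sketch below illustrates that input-decay idea in simplified form; the function name, parameter values, and scalar decay weights are illustrative assumptions, not the thesis's actual implementation:

```python
import numpy as np

def decay_impute(x, m, delta, x_last, x_mean, w_gamma, b_gamma):
    """Input decay for one step of a missingness-aware RNN.

    x      : raw input at time t (missing entries may hold anything)
    m      : mask, 1.0 if observed, 0.0 if missing
    delta  : time since each variable was last observed
    x_last : last observed value of each variable
    x_mean : empirical (training-set) mean of each variable
    """
    # Decay factor shrinks from 1 toward 0 as the gap since the
    # last observation grows.
    gamma = np.exp(-np.maximum(0.0, w_gamma * delta + b_gamma))
    # Observed entries pass through unchanged; missing entries fade
    # from the last observation toward the population mean.
    return m * x + (1 - m) * (gamma * x_last + (1 - gamma) * x_mean)

# Toy example: the second variable has been missing for 5 time units.
x = np.array([0.7, 0.0])
m = np.array([1.0, 0.0])
delta = np.array([0.0, 5.0])
x_last = np.array([0.7, 1.0])
x_mean = np.array([0.5, 0.2])
out = decay_impute(x, m, delta, x_last, x_mean, w_gamma=0.5, b_gamma=0.0)
```

In a full model, the decayed input (and the mask itself) would feed a recurrent cell such as a GRU, so the network can learn from both the values and the missingness pattern.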
To improve the interpretability and usability of deep learning models for clinicians in practice, we propose an interpretable mimic learning framework that yields models with both strong predictive performance and good interpretability, and we describe deep learning solutions for a large-scale study of opioid usage and addiction, demonstrating how deep learning can be applied to important and urgent health care tasks in the real world.

All of the proposed methods are evaluated on real-world health care datasets from different applications. Important findings are also examined and validated by our collaborators from hospitals.
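Mimic learning in this spirit distills a deep "teacher" model into an interpretable "student" such as gradient-boosted trees, by regressing the student on the teacher's soft predictions rather than on the hard labels. The scikit-learn sketch below is a toy illustration under assumed synthetic data and model sizes, not the thesis's pipeline:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for a clinical prediction task.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Teacher: a neural network trained on the hard labels.
teacher = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=500,
                        random_state=0).fit(X, y)
soft_labels = teacher.predict_proba(X)[:, 1]  # soft targets in [0, 1]

# Student: gradient-boosted trees fit to the teacher's soft outputs;
# the trees expose feature importances and human-readable split rules.
student = GradientBoostingRegressor(random_state=0).fit(X, soft_labels)

# The student should closely track the teacher's predicted risk.
student_pred = student.predict(X)
agreement = np.corrcoef(student_pred, soft_labels)[0, 1]
```

The design choice is the soft targets: they carry the teacher's learned decision boundary (including its confidence), so the interpretable student typically tracks the deep model more faithfully than a tree trained directly on the original labels.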
Conceptually similar
Learning to diagnose from electronic health records data
Learning distributed representations from network data and human navigation
Scalable multivariate time series analysis
Toward robust affective learning from speech signals based on deep learning techniques
Exploring complexity reduction in deep learning
Deep generative models for time series counterfactual inference
Invariant representation learning for robust and fair predictions
Deep learning for subsurface characterization and forecasting
Deep learning techniques for supervised pedestrian detection and critically-supervised object detection
Interpretable machine learning models via feature interaction discovery
Scalable machine learning algorithms for item recommendation
Leveraging training information for efficient and robust deep learning
Neural creative language generation
3D deep learning for perception and modeling
Scaling up temporal graph learning: powerful models, efficient algorithms, and optimized systems
Advanced machine learning techniques for video, social and biomedical data analytics
Multimodal representation learning of affective behavior
Physics-aware graph networks for spatiotemporal physical systems
Graph embedding algorithms for attributed and temporal graphs
Deep representations for shapes, structures and motion
Asset Metadata
Creator: Che, Zhengping (author)
Core Title: Deep learning models for temporal data in health care
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Publication Date: 11/09/2018
Defense Date: 08/31/2018
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: deep learning, health care, neural network, OAI-PMH Harvest, temporal data, time series analysis
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Liu, Yan (committee chair), Knight, Kevin (committee member), Wu, Shinyi (committee member)
Creator Email: peterche1990@hotmail.com, zche@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c89-102258
Unique identifier: UC11675910
Identifier: etd-CheZhengpi-6951.pdf (filename), usctheses-c89-102258 (legacy record id)
Legacy Identifier: etd-CheZhengpi-6951.pdf
Dmrecord: 102258
Document Type: Dissertation
Rights: Che, Zhengping
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA