Learning from Limited and Imperfect Data for Brain Image Analysis and Other Biomedical Applications

by

Haleh Akrami

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(BIOMEDICAL ENGINEERING)

May 2024

Copyright 2024 Haleh Akrami

Dedication

This thesis is dedicated to my father, my first and best teacher. His belief in me provided the motivation and inspiration needed to undertake this journey.

Acknowledgements

I would like to express my heartfelt gratitude to Professor Richard Leahy for his unwavering support, invaluable guidance, and steadfast commitment to my academic journey. His profound expertise and extensive knowledge in the field have been a constant source of inspiration and learning for me. His insightful feedback, encouragement, and academic wisdom have been instrumental in shaping the trajectory of this thesis. I am truly grateful for the privilege of working under his mentorship and benefiting from his deep understanding of the subject matter. I could not have asked for a better advisor on this journey, and his intellectual generosity has left an indelible mark on my educational experience.

I would like to express my deepest gratitude to Professor Anand Joshi for his invaluable expertise, continuous guidance, and generous sharing of knowledge throughout this research. His insightful feedback and unwavering support have been instrumental in shaping both the direction and success of this work.

I extend my sincere appreciation to Professor Francisco Valero-Cuevas and Professor Paul Thompson, who were invaluable members of my thesis committee. Their expertise, critical insights, and constructive feedback have greatly enriched the quality of my research and broadened my intellectual horizons.

I would like to extend my deepest gratitude to Professor Constantine Sideris, whose invaluable support and provision of computational resources were essential to the completion of this thesis.

To my dear husband, Arash Fayyazi, whose unwavering belief in me has been a constant source of strength and motivation, I am forever grateful. Your love, patience, and understanding have sustained me through the challenges of this journey, and I am deeply appreciative of your unwavering support.

I would like to express my deep gratitude to my family for their unwavering love, constant support, and unyielding belief in my abilities. Their steadfast encouragement and selfless sacrifices have provided the solid foundation upon which I have forged my academic path.

Additionally, I would like to express my gratitude to all of my collaborators and lab mates who have contributed to this work. Their expertise, collaboration, and shared enthusiasm have been invaluable in shaping and refining my research. I am grateful for their camaraderie, stimulating discussions, and the sense of community we have fostered together.

In conclusion, I would like to express my deep appreciation to all those who have contributed to the completion of this thesis. Your support, guidance, and encouragement have been essential in making this endeavor a reality. This journey would not have been possible without each and every one of you. Thank you.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
Chapter 2: Robust Methods for Noisy Data in the Training Set
    2.1 A Robust Variational Autoencoder Using Beta Divergence
        2.1.1 Related Work
        2.1.2 Motivation via a Toy Example
        2.1.3 Our Contributions
        2.1.4 Mathematical Formulation
            2.1.4.1 Robust Variational Inference
            2.1.4.2 Variational Autoencoder
            2.1.4.3 Robust Variational Autoencoder
        2.1.5 Experimental Results
            2.1.5.1 Experiment 1: Effect on Latent Representation
            2.1.5.2 Experiment 2: Reconstruction and Outlier Detection
            2.1.5.3 How to Choose Robustness Parameter β?
            2.1.5.4 Experiment 3: Detecting Abnormalities in Brain Images using RVAE
            2.1.5.5 Experiment 4: RVAE for Tabular Data
            2.1.5.6 Comparison with Other Methods
            2.1.5.7 Inconsistency in Outliers Between Training and Testing
        2.1.6 Discussion and Conclusion
    2.2 Neuroanatomic Markers of Post-Traumatic Epilepsy Based on Magnetic Resonance Imaging and Machine Learning
        2.2.1 Materials and Methods
            2.2.1.1 Data
            2.2.1.2 Pre-processing
            2.2.1.3 Tensor-based Morphometry
            2.2.1.4 Lesion-based Analysis
        2.2.2 Results
            2.2.2.1 TBM-based Analysis
            2.2.2.2 Lesion-based Analysis
        2.2.3 Discussion and Conclusion
    2.3 Prediction of Post-Traumatic Epilepsy using MRI-based Imaging Markers
        2.3.1 Materials and Methods
            2.3.1.1 Data
            2.3.1.2 Methods
        2.3.2 Results
            2.3.2.1 Lesion Analysis
            2.3.2.2 PTE-related Modulations of ALFF
            2.3.2.3 Classification of PTE and non-PTE Subjects using Machine Learning
        2.3.3 Discussion
            2.3.3.1 Limitations
        2.3.4 Conclusion
    2.4 Semi-supervised Learning using Robust Loss
        2.4.1 Materials and Methods
        2.4.2 Robust Loss Functions
        2.4.3 Training using Unlabeled Data and a Robust Loss Function
        2.4.4 Experiments and Results
            2.4.4.1 Brain Segmentation
            2.4.4.2 CIFAR-10
            2.4.4.3 Brain Tumour Segmentation
        2.4.5 Conclusion
    2.5 Sequential Multi-task Learning for Histopathology-based Prediction of Genetic Mutations with Extremely Imbalanced Labels
        2.5.1 Methods
            2.5.1.1 Single-Task Training
            2.5.1.2 Multi-Task Training
            2.5.1.3 Sequential Training
            2.5.1.4 Continual Learning
        2.5.2 Experiments and Results
        2.5.3 Conclusion and Discussion
Chapter 3: Brain Lesion Detection using Robust Variational Autoencoder and Transfer Learning
    3.1 Mathematical Formulation
        3.1.1 Variational Autoencoder
        3.1.2 Robust Variational Autoencoder
    3.2 The Model and Experiments
        3.2.1 Data and Preprocessing
        3.2.2 Results
    3.3 Discussion and Conclusion
Chapter 4: Deep Quantile Regression for Uncertainty Estimation
    4.1 Deep Quantile Regression for Uncertainty Estimation in Unsupervised and Supervised Lesion Detection
        4.1.1 Background
            4.1.1.1 Variance Shrinkage Problem in Variational Autoencoders
            4.1.1.2 Conditional Quantile Regression
        4.1.2 Deep Uncertainty Estimation with Quantile Regression
            4.1.2.1 Quantile Regression Variational Autoencoder (QR-VAE)
            4.1.2.2 Binary Quantile Regression U-Net (BQR U-Net)
        4.1.3 Experiments and Results
            4.1.3.1 Simulations for VAE
            4.1.3.2 Unsupervised Lesion Detection
            4.1.3.3 Supervised Lesion Detection
        4.1.4 Conclusion
    4.2 Beta Quantile Regression for Robust Estimation of Uncertainty in the Presence of Outliers*
        4.2.1 Method
            4.2.1.1 Least Trim Quantile Regression (TQR)
            4.2.1.2 Robust Regression Based on Regularization of Case-specific Parameters (RCP)
            4.2.1.3 β-Quantile Regression (β-QR)
            4.2.1.4 Quantile Regression for Diffusion Models for Regression Tasks
        4.2.2 Experiments and Results
            4.2.2.1 Star Cluster CYG OB1
            4.2.2.2 Toy Example for Uncertainty Estimation
            4.2.2.3 Quantile Regression for Uncertainty Estimation in Diffusion Models
        4.2.3 Conclusion
Chapter 5: Using Diffusion Models for In-painting, Segmentation, and Registration of Abnormal Brains
    5.1 Introduction
    5.2 Method
        5.2.1 Datasets
        5.2.2 Preprocessing
        5.2.3 Generating a Pseudo-healthy Individualized Atlas using a Diffusion Model
        5.2.4 Registration Assistant Module
        5.2.5 Abnormality Localization and Processing
        5.2.6 The Inpainting Module
    5.3 Results and Discussion
        5.3.0.1 Evaluating Lesion Detection
        5.3.0.2 Evaluating the In-painting Module
        5.3.1 Results of BrainSuite Processing
    5.4 Conclusion
Chapter 6: Conclusion
Bibliography

*This work is in equal collaboration with Omar Zamzam from the University of Southern California.

List of Tables

2.1 Comparison of different autoencoders for the MNIST+EMNIST and Fashion-MNIST anomaly detection experiments with 10% outliers, and comparison of different autoencoders for the lesion detection experiment on the ISLES dataset, in terms of AUC.
2.2 Average lesion volumes as measured by identifying lesions using a one-class SVM on the VAE lesion maps. Red indicates cases of significant differences in the variance of lesion volume between PTE and non-PTE (F-test). The FDR-corrected p-values are shown at a significance level of α = 0.05.
2.3 PTE vs non-PTE group comparison of lesion and ALFF measures (p-values obtained using an F-test).
2.4 Classification accuracy of PTE vs. non-PTE subjects using different classifiers and feature types. Mean and standard deviation of AUC are shown for KSVM, SVM, RF and NN. The last column shows the performance obtained when the models were trained simultaneously on all three feature types.
2.5 Dice scores relative to ground-truth T1-W+T2-W SGLs for the test dataset.
2.6 Performance of the proposed semi-supervised strategy on the CIFAR-10 dataset. Results show test accuracy of models trained using only 10% of ground-truth labels.
2.7 Comparison of the mean and standard deviation of Dice scores for different methods on test subjects for different tumor classes (WT, TC, ET).
2.8 F1 score/AUROC testing WSI results for single-task training.
2.9 F1 score/AUROC testing WSI-level results for multi-task training.
2.10 F1 score/AUROC testing WSI-level results comparing sequential learning with re-sampling (sequential-resample) and continual learning with re-sampling (CL-resample).
4.1 Comparison of the performance of unsupervised lesion detection for VAE and QR-VAE, with and without conformalization. QR-VAE-conf: conformalized QR-VAE; QR-VAE-GS: Gaussian QR-VAE; QR-VAE-GS-conf: Gaussian conformalized QR-VAE.
4.2 The mean (std dev) of the Dice coefficients between estimated probability regions P(Y = 1|X) ≥ α, where α is (0.25, 0.5, 0.75, 1); GT: ground truth, DT: deterministic, P = P(Y = 1|X).
4.3 Comparison of performance of TQR, RCP and β-QR. Each entry shows the Frobenius norm of the difference between the estimated quantiles and their (outlier-free) ground truth for the star cluster CYG OB1 dataset.
4.4 Comparing performance of β-QR with the outlier-free baseline model. For the prediction error, MSE is calculated between the ground truth T2 and the median of each model.
5.1 Comparison of diffusion models on the BraTS dataset.
5.2 Comparing inpainting performance of our fine-tuned model with RePaint (SSIM/MSE).
List of Figures

2.1 Illustration of the robustness of β-divergence to outliers in comparison to KL-divergence: optimizing KL-divergence for parameter estimation of a single Gaussian distribution does not distinguish between inliers and outliers, whereas optimizing β-divergence results in an estimate that is robust to outliers, by de-emphasizing out-of-distribution data.
2.2 Comparing robustness of VAE and RVAE using the MNIST dataset contaminated with synthetic outliers generated by Gaussian noise: (a) the 2D latent space of VAE for the original MNIST dataset without outliers (colors represent class labels of MNIST); (b) the 2D latent space of VAE for the MNIST dataset with added outliers (marked by a dark red circle); (c) the 2D latent space of RVAE without outliers; (d) the 2D latent space of RVAE with outliers added to the input data; (e) examples of the reconstructed images using VAE (VAE_outlier: VAE trained on a dataset which includes outliers; VAE_Orig: VAE trained on the outlier-free dataset) and RVAE (RVAE_outlier: RVAE trained on a dataset which includes outliers). Unlike the VAE, the RVAE is minimally affected by the presence of outliers in the training data and reconstructs the outliers as a digit from the (outlier-free) training set.
2.3 Examples of reconstructed inlier images (first 4 columns in each figure) and outliers (last 4 columns in each figure) using VAE and RVAE with different βs on (a) MNIST (inliers) + EMNIST (outliers) datasets and (b, c) images from the class of shoes (inliers) and images from the class of other accessories (outliers) in the Fashion-MNIST dataset. The optimal value of β is highlighted in red.
2.4 The performance measure (the ratio between the overall absolute reconstruction error in outlier samples and their counterparts in the normal samples) as a function of the parameter β (x-axis) and the fraction of outliers present in the training data (y-axis) for the two datasets used in Experiment 2.
2.5 ROC curves showing the performance of outlier detection using VAE and RVAE with different fractions of outliers present in the training data for the two datasets used in Experiment 2.
2.6 Generating samples using the decoder with different values of β for the Fashion-MNIST experiment. The optimal value of β is 0.01, which generates different types of shoes. This optimal value also matches the maximum value achieved in the heat map in Figure 2.4-b.
2.7 Searching for the optimal β by optimizing different metrics using Brent's method: (a) MNIST-EMNIST experiment; (b) Fashion-MNIST experiment with 10% outliers. Validation method: the ratio of reconstruction error in inliers vs outliers used as the cost function. Clustering method: the Silhouette Score for k-means clustering (k = 2) used as the cost function.
2.8 Reconstructions of brain images using VAE and RVAE: (a) after randomly dropping rows with a height of 5 pixels for 10% of the Maryland MagNeTs dataset; (b) after adding simulated lesions to 10% of the Maryland MagNeTs dataset; (c) for the ISLES data with true lesions.
2.9 Performance comparison of VAE and RVAE as a function of contamination in training data for datasets: (a) KDDCup99, (b) NSL-KDD, and (c) UNSW-NB15.
2.10 Examples of reconstructed inlier images (first 4 columns in each figure) and outliers (last 4 columns in each figure) from the test set using the original VAE when outliers in the training data are qualitatively different from those in the test data.
2.11 Reconstructions of brain images using VAE and RVAE with inconsistency between outliers in training and testing. The VAE did not reconstruct the real lesions in the test (input) images, but the reconstructions are sometimes corrupted by artifacts similar to the outliers that were present in the training data (see red squares).
2.12 The VAE network and an input/output sample pair from the ISLES dataset. X denotes the input data, Z denotes its low-dimensional latent representation. The VAE consists of an encoder network that computes an approximate posterior qϕ(Z|X), and a decoder network that computes pθ(X|Z). The VAE model takes T1, T2, and FLAIR images from individual subjects (left), compresses them to generate a latent representation (Z), and regenerates three images (right). The VAE is trained on a dataset that contains few lesions. After training, when presented with a new lesioned brain, the reconstruction effectively removes the lesion from the image, resulting in a normal (lesion-free) version of the brain.
2.13 Three orthogonal views through the t-statistic map thresholded at p = 0.05 (uncorrected) for TBM analysis using Jacobian determinants. No regions in the map survived multiple-comparisons correction using FDR (q = 0.05).
2.14 Three orthogonal views through the F-statistic map thresholded at q = 0.05 (FDR-corrected) for TBM analysis using Jacobian determinants.
2.15 Reconstruction results obtained by applying the VAE to the ISLES dataset: (a) sample slices from input images; (b) slices reconstructed from the VAE; (c) difference between input and reconstructed images; (d) error maps after applying median filtering to reduce the occurrence of spurious voxels; (e) manually delineated lesion masks used as ground truth to evaluate VAE performance.
2.16 Orthogonal views through the t-statistic map, thresholded at p = 0.05 (uncorrected), comparing lesion maps for the PTE and non-PTE groups.
2.17 Orthogonal views through the F-statistic map, thresholded at q = 0.05 (FDR-corrected), comparing lesion maps for the PTE and non-PTE groups.
2.18 Voxel-based PTE vs. non-PTE group comparison of lesion maps overlaid on the USCBrain atlas. The color code depicts f-values, shown in regions where the p-value < 0.05, resulting from the F-test (with permutations). Prominent significant clusters are located in the left temporal lobe, bilateral occipital lobe, cerebellum, and right parietal lobe.
2.19 Differences in ALFF between the PTE and non-PTE groups. The results are color-coded f-statistics thresholded by FDR-corrected p-values (p < 0.05) derived using a permutation test. Significant clusters are visible in the left temporal lobe, bilateral occipital lobes, cerebellum, and right parietal lobe.
2.20 (a) Brain-wide mean lesion volume variability shown for non-PTE (upper row) and PTE subjects (lower row). (b) Feature importance map shown as color-coded ROIs overlaid on the USCBrain atlas. Both cortical-surface and volumetric ROIs are shown.
2.21 Number of samples vs AUC resulting from the KSVM (PCA) method. The blue curve shows mean AUC and shaded areas indicate std dev. in the leave-one-out stratified cross-validation. Conservative (red) and optimistic (blue) extrapolations are shown as dotted curves.
2.22 Framework: We develop a teacher-student semi-supervised framework using both manually labeled and pseudo-labeled data. We propose to first generate pseudo-labels of the unlabeled data using a model trained on manually labeled ground-truth labels. Then we train a second model using these ground-truth labels and generated pseudo-labels simultaneously, applying a robust loss to enhance model robustness to noise in the pseudo-labels.
2.23 An illustrative example of using robust loss for classification: (left) a simulated dataset with three distinct classes and mislabeled subclass data, shown as a smaller cluster on the bottom left; (middle) the decision boundary computed using a single-layer perceptron and multivariate cross-entropy loss (non-robust loss); (right) the decision boundary calculated using a single-layer perceptron, but with multivariate beta cross-entropy (robust loss). It can be seen that the decision boundary computed using the non-robust loss is affected by the mislabeled outliers, whereas the decision boundaries calculated using the robust loss are minimally impacted by the mislabeled data.
2.24 Segmentation results using different methods.
2.25 A graphical illustration of segmentation accuracy improvement using the proposed strategy for different fractions of pseudo-labels (p) in the training data. Figures (a), (b) and (c) show the average Dice scores of tumor classes WT, TC, and ET, respectively.
2.26 Comparison of brain tumor segmentation results: from left to right, the segmentation results of the lower-bound model, the model with CE loss, the model with GCE loss, and the ground-truth label. Segmentation results indicate label 1 (yellow), label 2 (blue) and label 4 (red), where we have ET (label 1), TC (labels 1 and 4), and WT (labels 1, 2, and 4).
2.27 A) Single-task model. B) Multi-task model. C) Sequential training. Non-trainable parameters are shown with dashed lines. D) ResNet50 model where the first 20 layers are frozen for sequential training and transfer learning.
2.28 A) Class distribution for the 10 genetic mutations. B) Examples of a WSI with TMB-H (left) and a patch of 512×512 pixels extracted from the WSI (right).
3.1 VAE network and input/output sample for the ISLES dataset.
3.2 (A) Original and reconstructed test images using different models. (B) Absolute reconstruction error of the test images and associated hand-delineated lesions (GTruth). VAEbr: VAE model re-trained from scratch using the initial data and the BRATS samples; RVAEbr: RVAE model re-trained from scratch using the initial data and BRATS samples; PreVAE: transfer learning of VAE from the pre-trained VAE model using additional BRATS samples; PreRVAE: transfer learning of RVAE from the pre-trained VAE model using additional BRATS samples.
3.3 ROC curves of different models. RVAE outperforms VAE both when trained from scratch using BRATS samples in addition to the initial data (RVAEbr vs VAEbr) and when updated using the pre-trained models (PreRVAE vs PreVAE).
4.1 Pairwise joint distribution of the ground truth and generated distributions. Top: v1 vs. v2 dimensions. Bottom: v2 vs. v3 dimensions. From left to right: original distribution and distributions computed using VAE, Comb-VAE and QR-VAE, respectively. We also list the KL divergence between the learned distribution and the original distribution in each case.
4.2 Model-free lesion detection for the ISLES dataset using QL = Q0.025 and QH = Q0.975. Pixels outside the [QL, QH] interval are marked as outliers. Estimated quantiles are the outputs of QR-VAE.
4.3 Estimating two quantiles in the ISLES dataset using QR-VAE. Using the Gaussian assumption for the posterior, there is a 1-1 mapping from these quantiles to mean and standard deviation.
4.4 Pixel-wise quantile image thresholds for a single test image as a function of quantile computed using the QR-VAE.
4.5 The vertical axis indicates the fraction of pixels in the entire testing set whose intensity is below the corresponding quantile for that pixel as computed using the QR-VAE. Note that, aggregated over the entire test set, the computed pixel-wise quantiles closely match the true distribution assuming anomaly-free data (in practice the fraction of anomalous pixels is a very small fraction of the total, so the presence of lesions in the data should not substantially affect this plot).
4.6 Lesion detection for the ISLES dataset. A) VAE with mean and variance estimation. B) QR-VAE. First, we normalize the error value using the model's pixel-wise estimates of mean and variance. The resulting z-score is then converted to an FDR-corrected p-value and the images are thresholded at a significance level of 0.05. The bottom rows represent ground truth based on expert manual segmentation of lesions.
4.7 Top row: results of U-Net delineation of lesion boundaries. Bottom row: results of deterministic cross-entropy U-Net. (a) The original slice of the lung image; (b) estimated probability regions corresponding to 0.125, 0.375, 0.625, 0.875 quantile levels shown in red, green, purple and yellow, respectively; (c) the estimate of thresholded lesion boundary from human raters corresponding to agreement between 1, 2, 3 and 4 raters.
4.8 Violin plots of the Dice coefficients between quantiles (0.125, 0.375, 0.625, 0.875) and rater agreement maps for the test datasets. GT: ground truth, DT: binary cross-entropy (deterministic U-Net), QR: quantile regression (BQR U-Net). The fractions of empty quantiles in the ground truth (excluded from Dice coefficient computations) were 0.07, 0.31, 0.45, 0.64, respectively. The width of the violin indicates the fraction of the dataset as a function of the Dice coefficient.
4.9 Robust linear quantile regression using TQR, RCP, and β-QR for the star cluster CYG OB1 dataset.
4.10 Robust non-linear quantile regression using TQR, RCP, and β-QR using a simple neural network for a toy example.
4.11 Estimating T2 MRI QL(0.05), QH(0.95), QM(0.5) for diffusion models from T1 MRI. Comparing the estimated quantiles using the non-robust and robust (β-QR) models with the outlier-free model.
5.1 Processing of a lesioned brain.
5.2 Registering the lesioned brain to a normal version of it reconstructed using a diffusion model, then thresholding the determinant of the Jacobian of the registration deformation field using a normative validation set of Jacobian determinants to find the anomalous part of the lesioned brain.
5.3 The input of the network is the noisy input, which is a combination of the noisy localized input, the non-masked input, and the mask. The network's task is to predict the noise.
5.4 In-painting the identified part of the lesioned brain using a masked diffusion model to get a completely normal brain that can be easily processed using any software, and mapping the generated processing back to the space of the original brain using the inverse of the deformation field from the first step of registering the lesioned brain to the new normal space.
5.5 Lesion segmentation results for SegReg (ours), diffusion model with 300-level noise (DF-300), and diffusion model with 500-level noise (DF-500). In the case of the diffusion models, the left column shows segmentation while the right column shows reconstruction error. GT: ground truth.
5.6 In-painting results comparing RePaint and our fine-tuned model.
5.7 BrainSuite processing of the lesioned brain, surrogate brain, and surrogate brain moved back to the original space. The arrows show that the right temporal lobe in the original brain has been mislabeled. The medial temporal gyrus and temporal pole have been completely misidentified in the original brain.
5.8 BrainSuite processing of the lesioned brain, surrogate brain, and surrogate brain moved back to the original space in 3D.

Abstract

This research explores the development of novel deep learning (DL) methods specifically tailored for biomedical applications, where data is often limited or imperfect. Despite the success of DL in various domains, its application in medical imaging faces unique challenges.
These include the non-conformity of real-world datasets to standard machine learning assumptions, limited generalizability to unseen datasets, and the tendency of DL methods to produce overconfident predictions, particularly with limited data. Such issues are critical in clinical settings, where accurate uncertainty assessment is crucial for disease diagnosis and treatment planning. The primary objective of this study is to create robust, generalizable, and uncertainty-aware DL models that can effectively handle the complexities of biomedical datasets. We aim to: (1) enhance the robustness of DL models for complex medical imaging data, (2) improve the generalizability of DL methods across diverse datasets, thereby increasing the statistical power of medical studies, and (3) refine DL models to better estimate uncertainties, providing a risk assessment alongside diagnostic decisions. A meaningful application of our research is in the detection of lesions and prediction of post-traumatic epilepsy (PTE) following traumatic brain injury (TBI). Given the high prevalence and long-term impact of TBI, and the challenge of identifying biomarkers for PTE, our DL methods have the potential to aid in the early identification of at-risk patients, guiding preventive care. Beyond medical imaging, the developed methods have implications for other fields suffering from poor-quality training data, such as network traffic modeling and speech recognition, illustrating the broad applicability of our research.

Chapter 1
Introduction

In recent years, deep learning (DL) methods have attracted researchers across a broad range of biomedical applications to develop automated algorithms for the analysis and interpretation of imaging and other biomedical data. These techniques can replace time-consuming repetitive tasks, such as lesion delineation, that are currently performed by technicians and physicians. Even more importantly, DL can draw on and learn from larger and richer datasets than a typical clinician may encounter in many years of practice. As a result, DL offers the potential for improvements in identifying diseases, assessing progression, and predicting outcomes. Here we mostly focus on applications of DL in scenarios where data is limited or imperfect. Despite many promising results, DL methods still suffer from the following problems that limit their applications in medical imaging:

1. Real-world datasets do not always conform to the assumptions of machine learning and DL methods, for example, that they are outlier-free. Violating these assumptions can severely degrade the learned representations and performance of these methods, disproportionately affect the training process, and lead to incorrect conclusions about the data.

2. Another barrier to applying DL models to medical imaging datasets and other real-world applications is generalizability. Using a pre-trained model on an unseen dataset is still a challenge, since new datasets may use different imaging parameters, demographics, and preprocessing techniques.

3. Additionally, despite impressive state-of-the-art performance on a wide variety of tasks in multiple applications, DL methods tend to produce over-confident predictions, particularly with limited training data. This is particularly dangerous in clinical applications, where realistic assessment of uncertainty is essential in determining disease status and appropriate treatments.
Considering these challenges, the goal of this research is to develop novel DL methods that address the difficulty of applying conventional approaches and recent DL methods to real-world datasets, such as biomedical datasets, with imperfect and limited data. More precisely, this research achieves improvements in three major ongoing research areas:

1. Robustness: we developed robust DL models that can be applied in medical imaging research with complex training data.

2. Generalizability: we built DL methods with increased generalizability, making it possible to aggregate datasets from different studies and thereby increase the statistical power of medical studies.

3. Uncertainty estimation: we improved DL models by estimating uncertainty, providing a risk assessment along with each decision.

Robustness: My research focused on developing a robust loss function for both supervised and unsupervised learning. In this process, we derived a novel Variational Autoencoder (VAE) that is robust to outliers in the training data [1, 2]. This RVAE model is based on beta-divergence rather than the standard Kullback-Leibler (KL) divergence. These DL methods have been used for lesion detection and prediction of post-traumatic epilepsy (PTE) after traumatic brain injury (TBI) [3, 4]. Despite the reported relationship between TBI and PTE, identifying biomarkers of epileptogenesis after TBI is still a fundamental challenge. Developing a DL method for automated identification of markers of PTE can help in the early identification of TBI patients at increased risk for PTE, and help to focus both resources and research development on preventive care for these subjects. Furthermore, we utilized this loss function for categorical datasets, mitigating the effect of noisy labels. We presented a semi-supervised training strategy that leverages both manually labeled data and additional unlabeled data [5, 6]. The combination of noisy labels with class imbalance has not been thoroughly investigated in the literature. During my internship at Merck, we developed a sequential multi-task learning approach to train a single model capable of predicting multiple genetic mutations while avoiding overfitting and trivial answers for imbalanced classes with noisy labels. We trained models to predict ten genetic mutations in H&E images [7] (see Chapter 2).

[1] Akrami, H., Joshi, A. A., Li, J., Aydöre, S., Leahy, R. M. (2022). A robust variational autoencoder using beta divergence. Knowledge-Based Systems, 238, 107886.
[2] Akrami, H., Aydore, S., Leahy, R. M., Joshi, A. A. (2020). Robust variational autoencoder for tabular data with beta divergence. ICML Workshop on Uncertainty and Robustness in Deep Learning, 2020.
[3] Akrami, H., Leahy, R. M., Irimia, A., Kim, P. E., Heck, C. N., Joshi, A. A. (2022). Neuroanatomic Markers of Posttraumatic Epilepsy Based on MR Imaging and Machine Learning. American Journal of Neuroradiology, 43(3), 347-353.
[4] Akrami, H., Cui, W., Kim, P. E., Heck, C. N., Irimia, A., Jebri, K., Nair, D., Leahy, R. M., Joshi, A. A. (2024). Prediction of Post Traumatic Epilepsy using MRI-based Imaging Markers. bioRxiv 2024.01.12.575454.
[5] Akrami, H., Cui, W., Joshi, A. A., Leahy, R. M. (2022). Learning from imperfect training data using a robust loss function: application to brain image segmentation. arXiv preprint arXiv:2208.04941.
[6] Akrami, H., Cui, W., Joshi, A. A., Leahy, R. M. (2022). Semi-supervised Learning using Robust Loss. Medical Imaging Meets NeurIPS Workshop, NeurIPS 2022.
[7] Akrami, H., et al. (2022). Sequential Multi-task Learning for Histopathology-Based Prediction of Genetic Mutations with Extremely Imbalanced Labels. International Workshop on Medical Optical Imaging and Virtual Microscopy Image Analysis. Cham: Springer Nature Switzerland.

Generalizability: Generalizability and bias mitigation are vital considerations when applying machine learning methods. These methods assume that training and test datasets are sampled from the same distribution, which may not hold in real-world settings. To address this, we leveraged the robustness of the RVAE and deployed a transfer-learning approach on a test dataset with different characteristics to detect outliers. Results on MRI datasets demonstrated that this approach improved the accuracy of lesion detection using a variational autoencoder model [1] (see Chapter 3). To increase generalizability, we also describe a new approach to jointly align and synchronize fMRI data in space and time across a group of subjects; since brain activity during rest is spontaneous, it is not possible to directly compare rfMRI time courses across subjects [2]. Additionally, we propose a novel knowledge transfer strategy that integrates meta-learning with self-supervised learning to tackle the heterogeneity and scarcity of fMRI data [3]. As part of my internship at Microsoft Research, we developed a semi-supervised multi-task framework to improve the performance of a blind mean opinion score (MOS) estimation model. MOS is standardized for the perceptual evaluation of speech quality and is obtained by asking listeners to rate the quality of a speech sample. Moreover, we presented preliminary results for addressing individual rater bias in the MOS labels [4]. We also tackled the problem of generalizability of brain preprocessing software for lesioned brains. The identification of anatomy in the presence of lesions poses a significant challenge. This issue is especially pronounced in atlas-based methods, which typically rely on the assumption of a one-to-one correspondence between the atlas and the subject's anatomy. Such an assumption becomes problematic in pathological states, where lesions disrupt the normal anatomical structure, leading to inaccuracies in atlas matching. To address this, we developed a pipeline using diffusion models to preprocess lesioned brain images, enabling them to be processed using conventional methods for registration, segmentation, and atlas labeling. Our pipeline includes a lesion in-painting process, wherein diffusion models are utilized to predict and reconstruct the affected regions, effectively 'filling in' the areas disrupted by lesions (see Chapter 5).

[1] Akrami, H., Joshi, A. A., Li, J., Aydore, S., Leahy, R. M. (2020, April). Brain lesion detection using a robust variational autoencoder and transfer learning. In 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI) (pp. 786-790). IEEE.
[2] Akrami, H., Joshi, A. A., Li, J., Leahy, R. M. (2019, March). Group-wise alignment of resting fMRI in space and time. In Medical Imaging 2019: Image Processing (Vol. 10949, pp. 737-744). SPIE.
[3] Cui, W., Akrami, H., Zhao, G., Joshi, A. A., Leahy, R. M. (2023). Meta Transfer of Self-Supervised Knowledge: Foundation Model in Action for Post-Traumatic Epilepsy Prediction. arXiv preprint arXiv:2312.14204.
[4] Akrami, H., Gamper, H. (2023, June). Speech MOS multi-task learning and rater bias correction. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1-5). IEEE.

Uncertainty Estimation: Despite the impressive performance of DL methods in various tasks, deep learning models can produce over-confident predictions, particularly with limited training data. To address this, we propose a novel approach using quantile regression to quantify aleatoric uncertainty in supervised and unsupervised lesion detection problems [1, 2]. The resulting confidence intervals can be used for lesion detection and segmentation. In the unsupervised setting, we combine quantile regression with the VAE: our approach tackles the problem of quantifying uncertainty in the reconstructed images by developing a Quantile-Regression VAE (QR-VAE) that directly estimates conditional quantiles for the input image. For the supervised lesion segmentation task, we developed binary quantile regression (BQR), which captures uncertainty in lesion boundaries by characterizing expert disagreement [2]. We also developed a robust quantile regression loss function for robust uncertainty estimation in diffusion models, enabling the estimation of reliable quantile intervals for image translation [3] (see Chapter 4).

[1] Akrami, H., Joshi, A., Aydore, S., Leahy, R. (2021). Quantile regression for uncertainty estimation in VAEs with applications to brain lesion detection. In Information Processing in Medical Imaging: 27th International Conference, IPMI 2021, Virtual Event, June 28-30, 2021, Proceedings (pp. 689-700). Springer International Publishing.
[2] Akrami, H., Joshi, A. A., Aydöre, S., Leahy, R. M. (2022). Deep Quantile Regression for Uncertainty Estimation in Unsupervised and Supervised Lesion Detection. The Journal of Machine Learning for Biomedical Imaging, 1.
[3] Akrami, H., Zamzam, O., Joshi, A., Aydore, S., Leahy, R. (2023). Beta quantile regression for robust estimation of uncertainty in the presence of outliers. ICASSP 2023.

Our tools and methods will also be useful in other applications that frequently suffer from poor-quality training data, such as network traffic modeling and speech recognition.

List of my Publications

[1] H. Akrami, R. M. Leahy, A. Irimia, P. E. Kim, C. N. Heck, and A. A. Joshi. "Neuroanatomic Markers of Posttraumatic Epilepsy Based on MR Imaging and Machine Learning". In: American Journal of Neuroradiology 43.3 (2022), pp. 347-353.
[2] Haleh Akrami, Sergul Aydore, Richard M. Leahy, and Anand A. Joshi. "Robust Variational Autoencoder for Tabular Data with Beta Divergence". In: arXiv preprint arXiv:2006.08204 (2020).
[3] Haleh Akrami, Wenhui Cui, Anand A. Joshi, and Richard M. Leahy. "Learning from imperfect training data using a robust loss function: application to brain image segmentation". In: CoRR abs/2208.04941 (2022). doi: 10.48550/ARXIV.2208.04941. arXiv: 2208.04941.
[4] Haleh Akrami, Wenhui Cui, Paul E. Kim, Christianne N. Heck, Andrei Irimia, Karim Jebri, Dileep Nair, Richard M. Leahy, and Anand Joshi. "Prediction of Post Traumatic Epilepsy using MRI-based Imaging Markers". In: bioRxiv (2024). doi: 10.1101/2024.01.12.575454.
[5] Haleh Akrami and Hannes Gamper. "Speech MOS Multi-Task Learning and Rater Bias Correction". In: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023, Rhodes Island, Greece, June 4-10, 2023. IEEE, 2023, pp. 1-5. doi: 10.1109/ICASSP49357.2023.10096572.
[6] Haleh Akrami, Andrei Irimia, Wenhui Cui, Anand A. Joshi, and Richard M. Leahy. "Prediction of posttraumatic epilepsy using machine learning". In: Medical Imaging 2021: Biomedical Applications in Molecular, Structural, and Functional Imaging. Vol. 11600. SPIE, 2021, pp. 424-430.
[7] Haleh Akrami, Anand Joshi, Sergul Aydore, and Richard Leahy. "Quantile Regression for Uncertainty Estimation in VAEs with Applications to Brain Lesion Detection". In: International Conference on Information Processing in Medical Imaging. Springer, 2021, pp. 689-700.
[8] Haleh Akrami, Anand A. Joshi, Sergül Aydöre, and Richard M. Leahy. "Deep Quantile Regression for Uncertainty Estimation in Unsupervised and Supervised Lesion Detection". In: The Journal of Machine Learning for Biomedical Imaging 1 (2022).
[9] Haleh Akrami, Anand A. Joshi, Jian Li, Sergul Aydore, and Richard M. Leahy. "A robust variational autoencoder using beta divergence". In: Knowledge-Based Systems 238 (2022), p. 107886.
[10] Haleh Akrami, Anand A. Joshi, Jian Li, Sergul Aydore, and Richard M. Leahy. "Brain Lesion Detection Using A Robust Variational Autoencoder and Transfer Learning". In: 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI). Iowa City, IA, USA: IEEE, Apr. 2020, pp. 786-790. doi: 10.1109/ISBI45749.2020.9098405.
[11] Haleh Akrami, Anand A. Joshi, Jian Li, and Richard M. Leahy. "Group-wise alignment of resting fMRI in space and time". In: Medical Imaging 2019: Image Processing, San Diego, California, United States, 16-21 February 2019. Vol. 10949. SPIE Proceedings. SPIE, 2019, 109492W. doi: 10.1117/12.2512564.
[12] Haleh Akrami, Tosha Shah, Amir Vajdi, Andrew Brown, Radha Krishnan, Razvan Cristescu, and Antong Chen. "Sequential Multi-task Learning for Histopathology-Based Prediction of Genetic Mutations with Extremely Imbalanced Labels". In: Medical Optical Imaging and Virtual Microscopy Image Analysis - First International Workshop, MOVI 2022, Held in Conjunction with MICCAI 2022, Singapore, September 18, 2022, Proceedings. Vol. 13578. Springer, 2022, pp. 126-135. doi: 10.1007/978-3-031-16961-8_13.
[13] Haleh Akrami, Omar Zamzam, Anand A. Joshi, Sergül Aydöre, and Richard M. Leahy. "Beta quantile regression for robust estimation of uncertainty in the presence of outliers". In: CoRR abs/2309.07374 (2023). doi: 10.48550/ARXIV.2309.07374. arXiv: 2309.07374.
[14] H. Akrami, A. A. Joshi, J. Li, and R. M. Leahy. "Average template for comparison of resting fMRI based on group synchronization of their time series". 2018.
[15] Antong Chen, Tosha Shah, Andrew Brown, Haleh Akrami, Albert Swiston, Amir Vajdi, Radha Krishnan, and Razvan Cristescu. "Prediction of tumor mutation burden from HE whole-slide images: a comparison of training strategies with convolutional neural networks". In: Medical Imaging 2022: Digital and Computational Pathology. Vol. 12039. SPIE, 2022, p. 120391D. doi: 10.1117/12.2613196.
[16] Wenhui Cui, Haleh Akrami, Anand A. Joshi, and Richard M. Leahy. "Semi-supervised Learning using Robust Loss". In: CoRR abs/2203.01524 (2022). doi: 10.48550/ARXIV.2203.01524. arXiv: 2203.01524.
[17] Wenhui Cui, Haleh Akrami, Anand A. Joshi, and Richard M. Leahy. "Toward Improved Generalization: Meta Transfer of Self-supervised Knowledge on Graphs". In: CoRR abs/2212.08217 (2022). doi: 10.48550/ARXIV.2212.08217. arXiv: 2212.08217.
[18] Wenhui Cui, Haleh Akrami, Ganning Zhao, Anand A. Joshi, and Richard M. Leahy. "Meta Transfer of Self-Supervised Knowledge: Foundation Model in Action for Post-Traumatic Epilepsy Prediction". In: arXiv preprint arXiv:2312.14204 (2023).
[19] Anand A. Joshi, Haleh Akrami, Jian Li, and Richard M. Leahy. "A Matched Filter Decomposition of fMRI into Resting and Task Components". In: Medical Image Computing and Computer Assisted Intervention - MICCAI 2019 - 22nd International Conference, Shenzhen, China, October 13-17, 2019, Proceedings, Part III. Ed. by Dinggang Shen, Tianming Liu, Terry M. Peters, Lawrence H. Staib, Caroline Essert, Sean Zhou, Pew-Thian Yap, and Ali R. Khan. Vol. 11766. Lecture Notes in Computer Science. Springer, 2019, pp. 673-681. doi: 10.1007/978-3-030-32248-9_75.
[20] Anand A. Joshi, Soyoung Choi, Haleh Akrami, and Richard M. Leahy. "fMRI-Kernel Regression: A Kernel-based Method for Pointwise Statistical Analysis of rs-fMRI for Population Studies". In: CoRR abs/2012.06972 (2020). arXiv: 2012.06972. url: https://arxiv.org/abs/2012.06972.
[21] Anand A. Joshi, Soyoung Choi, Jian Li, Haleh Akrami, and Richard M. Leahy. "A pairwise approach for fMRI group studies using the BrainSync Transform". In: Medical Imaging 2021: Image Processing. Ed. by Ivana Išgum and Bennett A. Landman. Vol. 11596. International Society for Optics and Photonics. SPIE, 2021, 115960G. doi: 10.1117/12.2580980.
[22] Anand A. Joshi, Jian Li, Haleh Akrami, and Richard M. Leahy. "Predicting cognitive scores from resting fMRI data and geometric features of the brain". In: Medical Imaging 2019: Image Processing, San Diego, California, United States, 16-21 February 2019. Ed. by Elsa D. Angelini and Bennett A. Landman. Vol. 10949. SPIE Proceedings. SPIE, 2019, 109492E. doi: 10.1117/12.2512063.
[23] Anand A. Joshi, Jian Li, Minqi Chong, Haleh Akrami, and Richard M. Leahy. "rfDemons: Resting fMRI-Based Cortical Surface Registration Using the BrainSync Transform". In: Medical Image Computing and Computer Assisted Intervention - MICCAI 2018. Vol. 11072. Cham: Springer International Publishing, 2018, pp. 198-205. doi: 10.1007/978-3-030-00931-1_23.
[24] Souvik Kundu, Saurav Prakash, Haleh Akrami, Peter A. Beerel, and Keith M. Chugg. "pSConv: A Pre-defined Sparse Kernel Based Convolution for Deep CNNs". In: 57th Annual Allerton Conference on Communication, Control, and Computing, Allerton 2019, Monticello, IL, USA, September 24-27, 2019. IEEE, 2019, pp. 100-107. doi: 10.1109/ALLERTON.2019.8919683.
[25] Shashank N. Sridhara, Haleh Akrami, Vaishnavi Krishnamurthy, and Anand A. Joshi. "Bias field correction in 3D-MRIs using convolutional autoencoders". In: Medical Imaging 2021: Image Processing. Ed. by Ivana Išgum and Bennett A. Landman. Vol. 11596. International Society for Optics and Photonics. SPIE, 2021, 115962H. doi: 10.1117/12.2582042.
[26] Omar Zamzam, Haleh Akrami, and Richard M. Leahy. "Learning From Positive and Unlabeled Data Using Observer-GAN". In: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023, Rhodes Island, Greece, June 4-10, 2023. IEEE, 2023, pp. 1-5. doi: 10.1109/ICASSP49357.2023.10094818.
[27] Omar Zamzam, Haleh Akrami, Mahdi Soltanolkotabi, and Richard M. Leahy. "Learning A Disentangling Representation For PU Learning". In: CoRR abs/2310.03833 (2023). doi: 10.48550/ARXIV.2310.03833. arXiv: 2310.03833.

Chapter 2
Robust Methods for Noisy Data in the Training Set

Learning models that are based on the maximization of the log-likelihood (e.g., autoencoders) often assume perfect training data [109]. Outliers in training data can have a disproportionate impact on learning because they have large negative log-likelihood values for a correctly trained network [91, 116]. In practice, particularly in large datasets, training data will inevitably include mislabeled data, anomalies or outliers, sometimes making up as much as 10% of the data [101]. To address this problem, we need robust techniques that explicitly account for outliers. In Section 2.1 we focus on robustness to outliers in training data in the VAE setting, using concepts from robust statistics: we modify the variational autoencoder (VAE) loss so that it can handle outliers in the training data. Then, in Section 2.2, we use a VAE for unsupervised lesion detection, without access to lesion-free data for training, to identify imaging biomarkers that distinguish PTE from non-PTE subjects among TBI survivors based on a magnetic resonance imaging (MRI) dataset. Finally, in Section 2.5, we propose a simple sequential training strategy to handle noisy labels in a multi-label class-imbalance setting for histopathology-based prediction of multiple genomic properties.

2.1 A Robust Variational Autoencoder Using Beta Divergence

In the case of autoencoders, the inclusion of outliers in the training data can result in the encoding of these outliers. As a result, the trained network may reconstruct these outliers in the testing samples. Conversely, if an encoder is robust to outliers then the outliers will not be reconstructed correctly. Therefore, a robust autoencoder can be used to detect anomalies that are presented at inference time by comparing an original image to its reconstructed version.

Here, we focus on variational autoencoders (VAEs) [142]. A VAE is a probabilistic graphical model that is comprised of an encoder and a decoder. The encoder transforms high-dimensional input data with an intractable probability distribution into a low-dimensional 'code' with an approximate tractable posterior (variational distribution). The decoder takes the code as an input and returns the parameters of the conditional distribution of the data. VAEs use the concept of variational inference [36] and re-parameterize the variational evidence lower bound (ELBO) so that it can be optimized using standard stochastic gradient descent methods. A VAE can learn latent features that best describe the distribution of the data and allows the generation of new samples using the decoder. VAEs have been successfully used for feature extraction from images, audio and text [148, 206, 111]. As noted above, when VAEs are trained on normal datasets, they can be used to detect anomalies whose characteristics differ from those of the training data [16, 278]. It has been shown that the variational form is preferable to standard autoencoders in real-world applications such as lesion detection [30, 191, 294]. For example, Chen et al. [49] reformulated pixel-wise lesion detection as an image restoration problem, using a VAE as a probabilistic model with a network-based prior as the normative distribution, resulting in a reduced false positive rate.
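To make the reconstruction-based detection procedure concrete, the following is a minimal sketch of anomaly scoring with a trained VAE. The `encode`/`decode` interface and the optional fixed threshold are illustrative assumptions for exposition, not the exact implementation used in this thesis.

```python
import torch

def anomaly_map(vae, x, threshold=None):
    """Score anomalies as the pixel-wise reconstruction error of a trained VAE.

    `vae` is assumed to expose encode(x) -> (mu, logvar) and decode(z) -> x_hat,
    the usual VAE interface; the names here are hypothetical.
    """
    vae.eval()
    with torch.no_grad():
        mu, logvar = vae.encode(x)   # parameters of the approximate posterior q(z|x)
        x_hat = vae.decode(mu)       # reconstruct from the posterior mean
    err = (x - x_hat).abs()          # pixel-wise |input - reconstruction|
    if threshold is None:
        return err                   # continuous anomaly score per pixel
    return err > threshold           # binary anomaly mask
```

Because a robust model refuses to encode outliers, this error map is large precisely where the input deviates from the training distribution, which is what allows lesions to be localized at inference time.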
A recent paper [177] suggested cautious use of the VAE likelihood for anomaly detection, showing that a VAE may assign high likelihood to out-of-distribution samples in some settings. Importantly, their conclusions concern the ability to detect anomalies from differences in likelihood between inliers and outliers; here we use a pixel-wise reconstruction error measure for anomaly detection.

The VAE effectively assigns probabilities (or likelihoods) to the data. Since deep generative models are very flexible, they are able to over-fit to outliers that are present in the training data. This in turn will result in high probabilities for outliers at inference time [78]. If the goal is to detect outliers as anomalies, characterized as not being accurately encoded by the VAE, this propensity for over-fitting will negatively impact performance. For this reason, we develop a Robust VAE (RVAE) framework that is robust to the presence of outliers in the training data; in other words, these outliers are encoded with low probability in the trained network. We note that, in contrast to the case treated here, the presence of outliers during training can be used to improve performance in unsupervised models when the uncorrupted versions of the samples are also known, which is the core idea of the denoising autoencoder (AE) [260].

It is worth noting that the definition of robustness is context-dependent. Here we are specifically interested in insensitivity to deviations from underlying assumptions in the training data. As a concrete example, in the primary application presented later we are interested in detecting lesions in magnetic resonance images of the brain. Our underlying assumption is that there are no lesions present in the training images, and outliers are defined as images that do contain lesions. Our goal is to train our RVAE so that even if some of the training data do contain lesions, these outliers are poorly encoded, so that we are still able to detect lesions using the trained network by comparing the original and decoded images. Here we show that the presence of outliers in training data can result in degraded performance of VAEs for anomaly detection. We then describe a robust VAE that overcomes this problem.

2.1.1 Related Work

In the past few years, denoising autoencoders [260], maximum correntropy autoencoders [209], and robust autoencoders [291] have been proposed to overcome the problem of noise corruption, anomalies, and outliers in the data. The denoising autoencoder [260] is trained to reconstruct 'noise-free' inputs by corrupting the input data during training and is robust to the type of corruption it learns. However, denoising autoencoders require access to clean training data, and modeling the noise can be difficult in real-world problems. An alternative approach is to replace the cost function with noise-resistant correntropy [209]. Although this approach discourages the reconstruction of outliers in the output, it may not prevent the encoding of outliers in the hidden layer. Recently, Zhou and Paffenroth [291] described a robust deep autoencoder inspired by robust principal component analysis. This encoder performs a decomposition of the input data X into two components, $X = L_D + S$, where $L_D$ is the low-rank component that we want to reconstruct and S represents a sparse component that contains outliers or noise. Despite many successful applications of these models, they do not extend well to generative models and categorical datasets.
2.1.2 Motivation via a Toy Example

To make VAE models robust to outliers in the training data, existing approaches focus on modifying network architectures, adding constraints, or modeling the outlier distribution [286, 45, 78]. In contrast, we adopt an approach based on the β-divergence from robust statistics [27]. To motivate our approach we ran the simulation illustrated in Figure 2.1. Here, the samples are generated from a distribution p that is a mixture of two Gaussian distributions, where the tall mode represents inlier samples and the short mode indicates the presence of outliers. Our goal is to learn a single-mode Gaussian distribution $p_\theta$ by minimizing either the Kullback-Leibler (KL) divergence or the β-divergence to optimize the parameters θ. The β-divergence $D_\beta$ is defined as [27]:

$$D_\beta(p(X)\,\|\,p_\theta(X)) = \frac{1}{\beta}\int_X p(X)^{\beta+1}\,dX - \frac{\beta+1}{\beta}\int_X p(X)\,p_\theta(X)^{\beta}\,dX + \int_X p_\theta(X)^{\beta+1}\,dX.$$

Figure 2.1 shows the estimated distributions found by using the two different divergence measures. While the β-divergence estimate is robust to the outliers, the estimated Gaussian distribution from the KL divergence attempts to also account for their presence and misplaces the mean and variance of the estimated distribution. We observed similar results over a range of variances for the inlier and outlier distributions: VAE estimates were consistently influenced by both distributions, while the RVAE learned only the (dominant) inlier distribution.

Figure 2.1: Illustration of the robustness of β-divergence to outliers in comparison to KL-divergence: optimizing KL-divergence for parameter estimation of a single Gaussian distribution does not distinguish between inliers and outliers, whereas optimizing β-divergence results in an estimate that is robust to outliers, by de-emphasizing out-of-distribution data.

In contrast to our toy example, in practice p(X) is unknown and is replaced with an empirical distribution. To learn this distribution using the standard KL-based approach we minimize:

$$\arg\min_\theta D_{KL}(\hat{p}(X)\,\|\,p_\theta(X)) = \text{Const} - \frac{1}{N}\sum_{i=1}^{N}\log\left(p_\theta(x^{(i)})\right),$$

where N is the number of samples and $p_\theta$ is the estimate of the empirical distribution. This is a maximum likelihood estimation (MLE) problem that is sensitive to outliers because it treats all data points equally. Replacing the KL divergence with a density power divergence can overcome this problem, with the β-divergence a popular choice. Minimizing the β-divergence is equivalent to minimizing the β-cross entropy, defined as [80]:

$$H_\beta(p(X), p_\theta(X)) = -\frac{\beta+1}{\beta}\int p(X)\left(p_\theta(X)^{\beta} - 1\right)dX + \int p_\theta(X)^{\beta+1}\,dX. \tag{2.1}$$

For empirical estimation this simplifies to:

$$L_\beta(\theta) = \text{Const} - \frac{\beta+1}{\beta}\sum_{i=1}^{N} p_\theta(x^{(i)})^{\beta} + \mathbb{E}_{p_\theta(X)}\left[p_\theta(X)^{\beta}\right].$$

Setting the derivative with respect to θ to zero results in:

$$0 = -\sum_{i=1}^{N} p_\theta(x^{(i)})^{\beta}\,\frac{\partial}{\partial\theta}\log\left(p_\theta(x^{(i)})\right) + \mathbb{E}_{p_\theta(X)}\left[p_\theta(X)^{\beta}\,\frac{\partial}{\partial\theta}\log\left(p_\theta(X)\right)\right].$$

The first term is the likelihood weighted according to the β-power of the probability density for each data point. This equation weights inliers (high-probability samples) more than outliers (low-probability samples), since the probability densities are higher for the former. Consequently, $L_\beta(\theta)$ suppresses the likelihood of outliers and can be interpreted as an M-estimate [27]. Inspired by this formulation, we derive a new reconstruction error term for the VAE using the β-divergence. We use the influence function (IF) [88] for robustness analysis of our new VAE, as described in Section 2.1.4.3.
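To make the toy example concrete, the following minimal sketch (our own illustration, not the exact simulation code; the mixture weights, modes, and β value are assumptions chosen for the example) fits a single Gaussian to contaminated samples by minimizing either the negative log-likelihood (the KL/MLE case) or the empirical β-cross entropy:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
# 90% inliers (tall mode) contaminated with 10% outliers (short mode).
x = np.concatenate([rng.normal(0, 5, 900), rng.normal(100, 10, 100)])

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def nll(params):
    # KL / maximum-likelihood objective: every sample is weighted equally.
    mu, sigma = params[0], np.exp(params[1])
    return -np.mean(np.log(gauss_pdf(x, mu, sigma) + 1e-12))

def beta_ce(params, beta=0.5):
    # Empirical beta-cross entropy, up to an additive constant that does not
    # depend on the parameters. For a Gaussian the second integral has the
    # closed form (beta + 1)^(-1/2) * (2*pi*sigma^2)^(-beta/2).
    mu, sigma = params[0], np.exp(params[1])
    p = gauss_pdf(x, mu, sigma)
    integral = (beta + 1) ** -0.5 * (2 * np.pi * sigma ** 2) ** (-beta / 2)
    return -(beta + 1) / beta * np.mean(p ** beta) + integral

print(minimize(nll, [10.0, 2.0]).x)      # mean/variance pulled toward outliers
print(minimize(beta_ce, [10.0, 2.0]).x)  # estimate stays near the inlier mode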
We note that the standard VAE has some inherent robustness and is related to robust PCA [63]. However, we show in the experiments below that the VAE's performance degrades with an increasing presence of outliers in training. Our formulation is a generalization of the VAE that converges to it as β → 0 and adds extra robustness. Note that the RVAE formulation works with any VAE setup (even when the constant posterior variance goes to zero and the VAE converges to an AE).

2.1.3 Our Contributions

We propose a novel robust VAE (RVAE) using robust variational inference [88] with a β-ELBO-based cost function. The β-ELBO cost replaces the KL-divergence (log-likelihood) term with the β-divergence. Our contributions are as follows:

• We apply concepts from robust statistics, specifically robust variational inference, to the variational autoencoder (VAE) to derive a robust variational autoencoder (RVAE) model. We also present formulations of the RVAE for Gaussian and Bernoulli models as well as for categorical and mixed-type data.

• We show, on datasets from computer vision, network traffic, and real-world brain imaging, that our approach is more robust than a standard VAE to outliers.

• We also show how the robustness of the RVAE can be exploited to perform anomaly detection even in cases where the training data also includes similar anomalies. We performed anomaly detection both for images and for tabular datasets with categorical and continuous features.

• Finally, we suggest an approach for hyperparameter tuning which makes our model completely unsupervised.

2.1.4 Mathematical Formulation

Let $x^{(i)} \in \mathbb{R}^D$ be an observed sample of the input X, where $i \in \{1, \cdots, N\}$, D is the number of features, and N is the number of samples; and let $z^{(j)}$ be an observed sample of the latent variable Z, where $j \in \{1, \cdots, S\}$. Given samples $x^{(i)}$ of the random feature vector X representing the input data, probabilistic graphical models estimate the posterior distribution $p_\theta(Z|X)$ as well as the model evidence $p_\theta(X)$, where Z represents the latent variables and θ the generative model parameters [36]. The goal of variational inference is to approximate the posterior distribution of Z given X by a tractable parametric distribution. In variational methods, the functions used as prior and posterior distributions are restricted to those that lead to tractable solutions. For any choice of a tractable q(Z), the distribution of the latent variable, the following decomposition holds:

$$\log p_\theta(X) = \mathcal{L}(q(Z), \theta) + D_{KL}(q(Z)\,\|\,p_\theta(Z|X)), \tag{2.2}$$

where

$$\mathcal{L}(q(Z), \theta) = \mathbb{E}_{q(Z)}\left[\log(p_\theta(X|Z))\right] - D_{KL}(q(Z)\,\|\,p_\theta(Z)),$$

and $D_{KL}$ represents the Kullback-Leibler (KL) divergence. Instead of maximizing the log-likelihood $p_\theta(X)$ with respect to the model parameters θ, the variational inference approach maximizes its variational evidence lower bound (ELBO) [36].

2.1.4.1 Robust Variational Inference

Here we review the robust variational inference framework [88] and explain its usage for developing variational autoencoders that are robust to outliers. The ELBO function includes a log-likelihood term which is sensitive to outliers in the data because the negative log-likelihood of low-probability samples can be arbitrarily high. It can be shown that maximizing the log-likelihood given samples $x^{(i)}$ is equivalent to minimizing the KL divergence $D_{KL}(\hat{p}(X)\,\|\,p_\theta(X|Z))$ between the empirical distribution $\hat{p}$ of the samples and the parametric distribution $p_\theta$ [285, 88].
Therefore, the ELBO function can be expressed as:

$$\mathcal{L}(q, \theta) = -N\,\mathbb{E}_q\left[D_{KL}(\hat{p}(X)\,\|\,p_\theta(X|Z))\right] - D_{KL}(q(Z)\,\|\,p_\theta(Z)) + \text{const.}, \tag{2.3}$$

where N is the number of samples of X used for computing the empirical distribution $\hat{p}(X) = \frac{1}{N}\sum_{i=1}^{N}\delta(X, x^{(i)})$ and δ is the Dirac delta function. For the robust case we replace the KL divergence with the β-divergence $D_\beta$ [27]:

$$D_\beta(\hat{p}(X)\,\|\,p_\theta(X|Z)) = \frac{1}{\beta}\int_X \hat{p}(X)^{\beta+1}\,dX - \frac{\beta+1}{\beta}\int_X \hat{p}(X)\,p_\theta(X|Z)^{\beta}\,dX + \int_X p_\theta(X|Z)^{\beta+1}\,dX.$$

In the limit as β → 0, $D_\beta$ converges to $D_{KL}$. Using the β-divergence changes the variational inference optimization problem to maximizing the β-ELBO:

$$\mathcal{L}_\beta(q, \theta) = -N\,\mathbb{E}_q\left[D_\beta(\hat{p}(X)\,\|\,p_\theta(X|Z))\right] - D_{KL}(q(Z)\,\|\,p_\theta(Z)). \tag{2.4}$$

Note that for robustness to outliers in the input data, only the divergence in the likelihood term is replaced; the divergence in the latent space is unchanged [88]. The idea behind the β-divergence is based on applying a power transform to variables with heavy-tailed distributions [51]. It can be proven that minimizing $D_\beta(\hat{p}(X)\,\|\,p_\theta(X|Z))$ is equivalent to minimizing the β-cross entropy [88], given by [80]:

$$H_\beta(\hat{p}(X), p_\theta(X|Z)) = -\frac{\beta+1}{\beta}\int \hat{p}(X)\left(p_\theta(X|Z)^{\beta} - 1\right)dX + \int p_\theta(X|Z)^{\beta+1}\,dX. \tag{2.5}$$

Replacing $D_\beta$ in equation 2.4 with $H_\beta$ results in

$$\mathcal{L}_\beta(q, \theta) = -N\,\mathbb{E}_q\left[H_\beta(\hat{p}(X), p_\theta(X|Z))\right] - D_{KL}(q(Z)\,\|\,p_\theta(Z)). \tag{2.6}$$

2.1.4.2 Variational Autoencoder

A variational autoencoder (VAE) is a directed probabilistic graphical model whose posteriors are approximated by a neural network. It has two components: the encoder network, which computes $q_\phi(Z|X)$, a tractable approximation of the intractable posterior $p_\theta(Z|X)$, and the decoder network, which computes $p_\theta(X|Z)$; together these form an autoencoder-like architecture [268]. The regularizing assumption on the latent variables is that the marginal $p_\theta(Z)$ is a standard Gaussian $\mathcal{N}(0, 1)$. For this model the marginal likelihood of individual data points can be rewritten as follows:

$$\log p_\theta(x^{(i)}) = D_{KL}(q_\phi(Z|x^{(i)})\,\|\,p_\theta(Z|x^{(i)})) + \mathcal{L}(\theta, \phi; x^{(i)}), \tag{2.7}$$

where

$$\mathcal{L}(\theta, \phi; x^{(i)}) = \mathbb{E}_{q_\phi(Z|x^{(i)})}\left[\log(p_\theta(x^{(i)}|Z))\right] - D_{KL}(q_\phi(Z|x^{(i)})\,\|\,p_\theta(Z)). \tag{2.8}$$

The first term (the log-likelihood) can be interpreted as the reconstruction loss and the second term (the KL divergence) as a regularizer. Using empirical estimates of the expectation we form the Stochastic Gradient Variational Bayes (SGVB) cost [142]:

$$\mathcal{L}(\theta, \phi; x^{(i)}) \approx \frac{1}{S}\sum_{j=1}^{S}\log(p_\theta(x^{(i)}|z^{(j)})) - D_{KL}(q_\phi(Z|x^{(i)})\,\|\,p_\theta(Z)), \tag{2.9}$$

where S is the number of samples drawn from $q_\phi(Z|X)$. We can assume either a multivariate i.i.d. Gaussian or a Bernoulli distribution for $p_\theta(X|Z)$; that is, given the latent variables, the uncertainty remaining in X is i.i.d. with these distributions. For the Bernoulli case, the log-likelihood for sample $x^{(i)}$ simplifies to:

$$\mathbb{E}_{q_\phi}\left(\log p_\theta(x^{(i)}|Z)\right) \approx \frac{1}{S}\sum_{j=1}^{S}\log p_\theta(x^{(i)}|z^{(j)}) = \frac{1}{S}\sum_{j=1}^{S}\sum_{d=1}^{D}\left[x_d^{(i)}\log p_d^{(j)} + (1 - x_d^{(i)})\log(1 - p_d^{(j)})\right],$$

where $p_\theta(x_d^{(i)}|z^{(j)}) = \text{Bernoulli}(p_d^{(j)})$ and D is the feature dimension. In practice we can choose S = 1 as long as the minibatch size is large enough. For the Gaussian case, this term simplifies to the squared error when the variance is fixed.
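For reference, a minimal PyTorch sketch of the negative SGVB cost in equation 2.9 for a Bernoulli decoder with S = 1 is given below; the diagonal-Gaussian posterior and the closed-form KL term follow standard practice [142], and the function names are ours:

import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    # z = mu + sigma * eps with eps ~ N(0, I): the reparameterization trick.
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

def vae_loss(x, p, mu, logvar):
    # x: binary inputs (batch, D); p: decoder Bernoulli means (batch, D);
    # mu, logvar parameterize q(Z|x) = N(mu, diag(exp(logvar))).
    # Reconstruction term: negative Bernoulli log-likelihood over features.
    recon = F.binary_cross_entropy(p, x, reduction="sum")
    # Closed-form KL(q(Z|x) || N(0, I)) for a diagonal Gaussian posterior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl  # minimizing this maximizes the bound in eq. 2.9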
2.1.4.3 Robust Variational Autoencoder

We now derive the robust VAE (RVAE) using the concepts discussed above. In order to derive the cost function for the RVAE, as in equation 2.6, we propose to use the β-cross entropy $H_\beta^{(i)}(\hat{p}(X), p_\theta(X|Z))$ between the empirical distribution of the data $\hat{p}(X)$ and the probability of the samples under the generative process $p_\theta(X|Z)$ for each sample $x^{(i)}$, in place of the likelihood term in equation 2.9. As in the VAE, the regularizing assumption on the latent variables is that the marginal $p_\theta(Z)$ is a standard Gaussian $\mathcal{N}(0, 1)$. The β-ELBO for the RVAE is:

$$\mathcal{L}_\beta(\theta, \phi; x^{(i)}) = -\mathbb{E}_{q_\phi(Z|x^{(i)})}\left[H_\beta^{(i)}(\hat{p}(X), p_\theta(X|Z))\right] - D_{KL}(q_\phi(Z|x^{(i)})\,\|\,p_\theta(Z)).$$

Figure 2.2: Comparing the robustness of VAE and RVAE using the MNIST dataset contaminated with synthetic outliers generated by Gaussian noise: (a) the 2D latent space of VAE for the original MNIST dataset without outliers (colors represent class labels of MNIST); (b) the 2D latent space of VAE for the MNIST dataset with added outliers (marked by a dark red circle); (c) the 2D latent space of RVAE without outliers; (d) the 2D latent space of RVAE with outliers added to the input data; (e) examples of the reconstructed images using VAE (VAE_outlier: VAE trained on a dataset which includes outliers; VAE_Orig: VAE trained on the outlier-free dataset) and RVAE (RVAE_outlier: RVAE trained on a dataset which includes outliers). Unlike the VAE, the RVAE is minimally affected by the presence of outliers in the training and reconstructs the outliers as a digit from the (outlier-free) training set.

Bernoulli Case: The Bernoulli case is used when the data are binary. For each sample, we need to calculate $H_\beta^{(i)}(\hat{p}(X), p_\theta(X|Z))$, where $x^{(i)} \in \{0, 1\}$. Using empirical estimates of the expectation we form the SGVB cost and choose S = 1. In equation 2.5 we substitute $\hat{p}(X) = \delta(X - x^{(i)})$; since $p_\theta(X|z^{(j)})$ is a Bernoulli distribution,

$$p_\theta(X|z^{(j)})^{\beta} = \left(X p^{(j)} + (1 - X)(1 - p^{(j)})\right)^{\beta} = X p^{(j)\beta} + (1 - X)\left(1 - p^{(j)}\right)^{\beta},$$

and

$$H_\beta^{(i)}(\hat{p}(X), p_\theta(X|z^{(j)})) = -\frac{\beta+1}{\beta}\sum_{X}\delta(X - x^{(i)})\left(X p^{(j)\beta} + (1 - X)(1 - p^{(j)})^{\beta} - 1\right) + p^{(j)\beta+1} + (1 - p^{(j)})^{\beta+1},$$

where the sum is over $X \in \{0, 1\}$. Therefore, for the multivariate case, the β-ELBO of the RVAE becomes:

$$\mathcal{L}_\beta(\theta, \phi; x^{(i)}) = \frac{\beta+1}{\beta}\left(\prod_{d=1}^{D}\left[x_d^{(i)} p_d^{(j)\beta} + (1 - x_d^{(i)})(1 - p_d^{(j)})^{\beta}\right] - 1\right) - \prod_{d=1}^{D}\left[p_d^{(j)\beta+1} + (1 - p_d^{(j)})^{\beta+1}\right] - D_{KL}(q_\phi(Z|x^{(i)})\,\|\,p_\theta(Z)). \tag{2.10}$$

Figure 2.3: Examples of reconstructed inlier images (first 4 columns in each panel) and outliers (last 4 columns in each panel) using VAE and RVAE with different values of β on (a) the MNIST (inliers) + EMNIST (outliers) datasets and (b, c) images from the class of shoes (inliers) and images from other accessory classes (outliers) in the Fashion-MNIST dataset, using the Gaussian and Bernoulli models respectively. The optimal value of β is highlighted in red.

Gaussian Case: When the data are continuous and unbounded we can assume that the posterior $p(X|z^{(j)})$ is Gaussian $\mathcal{N}(\hat{x}^{(j)}, \sigma)$, where $\hat{x}^{(j)}$ is the output of the decoder generated from $z^{(j)}$. Here we choose σ = 0.5 for our experiments, since intensity values were normalized between 0 and 1; we found empirically that the VAE can be made robust for any value of σ. The β-cross entropy for the i-th input sample $x^{(i)}$ is given by:

$$H_\beta^{(i)}(\hat{p}(X), p_\theta(X|z^{(j)})) = -\frac{\beta+1}{\beta}\int \delta(X - x^{(i)})\left(\mathcal{N}(\hat{x}^{(j)}, \sigma)^{\beta} - 1\right)dX + \int \mathcal{N}(\hat{x}^{(j)}, \sigma)^{\beta+1}\,dX. \tag{2.11}$$

The second term does not depend on $\hat{x}$, so the first term is minimized when $\exp\left(-\frac{\beta}{2\sigma^2}\sum_{d=1}^{D}\|\hat{x}_d^{(j)} - x_d^{(i)}\|^2\right)$ is maximized. Therefore, the β-ELBO cost for the Gaussian case for the j-th sample is given by:

$$\mathcal{L}_\beta(\theta, \phi; x^{(i)}) = \frac{\beta+1}{\beta}\left(\frac{1}{(2\pi\sigma^2)^{\beta D/2}}\exp\left(-\frac{\beta}{2\sigma^2}\sum_{d=1}^{D}\|\hat{x}_d^{(j)} - x_d^{(i)}\|^2\right) - 1\right) - D_{KL}(q_\phi(Z|x^{(i)})\,\|\,p_\theta(Z)). \tag{2.12}$$
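A corresponding sketch of the Gaussian reconstruction term in equation 2.12, written as a loss to be minimized (the KL term is unchanged from the standard VAE and is omitted; σ = 0.5 as in our experiments, and the batch averaging is an implementation choice):

import math
import torch

def beta_recon_loss_gaussian(x, x_hat, beta, sigma=0.5):
    # Negative of the reconstruction part of eq. 2.12, averaged over the batch.
    # Reduces to the usual squared-error (MSE) term as beta -> 0.
    D = x.view(x.size(0), -1).size(1)
    sq_err = ((x_hat - x) ** 2).view(x.size(0), -1).sum(dim=1)
    term = torch.exp(-beta * sq_err / (2 * sigma ** 2)) \
           / (2 * math.pi * sigma ** 2) ** (beta * D / 2)
    # NB: for high-dimensional inputs the exponential can underflow, so a
    # log-domain or per-feature variant may be preferable in practice.
    return -((beta + 1) / beta) * (term - 1).mean()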
The cost converges to the MSE in the limiting case β → 0. To verify this, write the reconstruction part of equation 2.12 as $\frac{\beta+1}{\beta}(A(\beta) - 1)$ with

$$A(\beta) = (2\pi\sigma^2)^{-\beta D/2}\exp\left(-\frac{\beta}{2\sigma^2}\sum_{d=1}^{D}\|\hat{x}_d^{(j)} - x_d^{(i)}\|^2\right).$$

Applying l'Hôpital's rule to the limit at β = 0 gives

$$\lim_{\beta\to 0}\frac{(\beta+1)\left(A(\beta) - 1\right)}{\beta} = \left[(A(\beta) - 1) + (\beta+1)A'(\beta)\right]_{\beta=0} = A'(0) = -\frac{1}{2\sigma^2}\sum_{d=1}^{D}\|\hat{x}_d^{(j)} - x_d^{(i)}\|^2 - \frac{D}{2}\log(2\pi\sigma^2). \tag{2.13}$$

Since the second term is a constant that does not depend on θ or φ, assuming constant σ we have

$$\arg\max_{\theta,\phi}\;\lim_{\beta\to 0}\mathcal{L}_\beta(\theta, \phi; x^{(i)}) = \arg\min_{\theta,\phi}\sum_{d=1}^{D}\|\hat{x}_d^{(j)} - x_d^{(i)}\|^2, \tag{2.14}$$

which shows that the cost converges to the MSE in the limiting case.

Tabular data with mixed categorical and continuous features: For categorical features we can assume that the generative distribution is a categorical distribution with K categories, $p_\theta(X|z^{(j)}) = \prod_{k=1}^{K} p_\theta(X = k\,|\,z^{(j)})^{[X=k]}$, where $[X = k]$ is an indicator. Then the first integral in equation 2.5 becomes:

$$-\frac{\beta+1}{\beta}\int \delta(X - x^{(i)})\left(p_\theta(X|z^{(j)})^{\beta} - 1\right)dX = -\frac{\beta+1}{\beta}\left(p_\theta(x^{(i)}|z^{(j)})^{\beta} - 1\right), \tag{2.15}$$

and the second integral can be written as:

$$\int p_\theta(X|z^{(j)})^{\beta+1}\,dX = \sum_{k=1}^{K} p_\theta(X = k\,|\,z^{(j)})^{\beta+1}. \tag{2.16}$$

For continuous variables we can use the Gaussian formulation derived above. The total loss for mixed-type data is then computed as the sum of the losses from the categorical and continuous features.

Influence function analysis: The influence function (IF) measures the effect of an abnormal observation on the training of the model. Futami et al. [88] give general expressions for IFs for both the original and the β-variational inference. Here, we analyze the IFs for the Bernoulli and Gaussian cases. By studying the supremum of the IF over perturbations, we can compare the robustness of the models. The IFs for the expressions in equations 2.3 and 2.4 differ only in their first terms, so it is sufficient to compare the suprema of these terms: for the non-robust case,

$$\frac{\partial}{\partial\theta} D_{KL}(\hat{p}(X), p_\theta(X|Z)), \tag{2.17}$$

and for the robust case,

$$\frac{\partial}{\partial\theta} D_\beta(\hat{p}(X), p_\theta(X|Z)). \tag{2.18}$$

The IF of the β-ELBO can be written as:

$$\frac{\partial}{\partial\theta} D_\beta(\hat{p}(X), p_\theta(X|Z)) \propto \sum_{i=1}^{N}\frac{\partial}{\partial\theta}\, p_\theta^{\beta}(x_i|Z), \tag{2.19}$$

while the IF for the original ELBO is

$$\frac{\partial}{\partial\theta} D_{KL}(\hat{p}(X), p_\theta(X|Z)) \propto \sum_{i=1}^{N}\frac{\partial}{\partial\theta}\log p_\theta(x_i|Z). \tag{2.20}$$

Both expressions approach infinity when $p_\theta(x_i|Z)$ goes to 0 (outliers have small probabilities), but the IF for the β-ELBO is upper bounded by that of the ELBO when β > 0, indicating that the β-ELBO is less affected by outliers. As a result, the RVAE will be more robust than the VAE to outliers.

We further study unbounded inputs for the Gaussian case. Assuming d = 1, σ = 0.5, and defining $\hat{x} = f_\theta(x)$, where f is a neural network parameterized by θ, the IF for our approach can be written as:

$$\frac{\partial}{\partial\theta} D_\beta(\hat{p}(X), p_\theta(X|Z)) \propto \sum_{i=1}^{N}\frac{\partial}{\partial\theta}\exp\left(-\frac{1}{2}\|f_\theta(x_i) - x_i\|^2\right), \tag{2.21}$$

which approaches 0 as $x_i$ approaches ±∞, given that the output of the network and its gradient are bounded. Similarly, the expression for the KL divergence is:

$$\frac{\partial}{\partial\theta} D_{KL}(\hat{p}(X), p_\theta(X|Z)) \propto \sum_{i=1}^{N}\frac{\partial}{\partial\theta}\,\frac{1}{2}\|\hat{x}_i - x_i\|^2, \tag{2.22}$$

which is not bounded as $x_i$ approaches ±∞. Thus, the β-ELBO, but not the ELBO, is bounded for the Gaussian case. For the Bernoulli case, we assume the sigmoid function is used as the activation function at the output layer. We can express the posterior as $p_\theta(x_i|Z) = f_\theta(x_i)^{x_i}(1 - f_\theta(x_i))^{1-x_i}$, where $f_\theta(x_i) = \frac{1}{1 + e^{-g_\theta(x_i)}}$ and $g_\theta(x_i)$ is the input to the sigmoid function [88].
Then the derivative of the logarithm of the posterior in equation 2.20 with respect to the model parameters can be written as

$$\frac{\partial}{\partial\theta}\log p_\theta(x_i|Z) = x_i(1 - f)\frac{\partial g}{\partial\theta} - (1 - x_i)f\frac{\partial g}{\partial\theta}. \tag{2.23}$$

Consider the case $x_i = 1$; then we have:

$$\frac{\partial}{\partial\theta}\log p_\theta(x_i = 1|Z) = \frac{1}{1 + e^{g_\theta(x_i)}}\frac{\partial g}{\partial\theta}. \tag{2.24}$$

For the β-ELBO, using equation 2.19 we have:

$$\frac{\partial}{\partial\theta}\, p_\theta^{\beta}(x_i|Z) = p_\theta(x_i = 1|Z)^{\beta}\,\frac{\partial}{\partial\theta}\log p_\theta(x_i = 1|Z) \tag{2.25}$$

$$= \frac{1}{\left(1 + e^{-g_\theta(x_i)}\right)^{\beta}}\cdot\frac{1}{1 + e^{g_\theta(x_i)}}\,\frac{\partial g}{\partial\theta}. \tag{2.26}$$

Comparing these terms, which give the difference between the original and β-variational inference as stated in equations 2.19 and 2.20, we can infer that $p_\theta(x_i = 1|Z)^{\beta}$ acts as a weight on the gradient. This value is higher for normal data samples (since β > 0 and $0 < p_\theta(x_i = 1|Z) < 1$), so their gradients are weighted more heavily and have more impact on the model parameters.

2.1.5 Experimental Results

We presented formulations of the RVAE for Gaussian and Bernoulli models as well as for categorical and mixed-type data, and we designed a series of experiments to show the effectiveness of the RVAE for these three different choices of posterior distribution. In each case, we optimize the ELBO and β-ELBO using stochastic gradient descent with reparameterization [142]. We note that all quantitative performance evaluation below was performed on independent hold-out data.

Here we evaluate the performance of the RVAE using datasets contaminated with outliers and compare it with the traditional VAE. We conducted four experiments using: the MNIST [158], EMNIST [52], and Fashion-MNIST [272] benchmark datasets; two real-world magnetic resonance (MR) brain imaging datasets, the Maryland MagNeTs study of neurotrauma (https://fitbir.nih.gov) and the Ischemic Stroke Lesion Segmentation (ISLES) database [170] (http://www.isles-challenge.org); and three benchmark datasets made available by the cyber security community: KDDCup 99 [20], NSL-KDD [60], and UNSW-NB15 [1]. Both brain imaging datasets consist of three sets of co-registered MR images corresponding to FLAIR, T1, and T2 weighting [259]. The experiments are summarized as follows:

• EXPERIMENT 1 (section 2.1.5.1): a simple simulation using MNIST as inliers and Gaussian noise as outliers, showing how the VAE encoding is corrupted in the presence of outliers and how this can be fixed using the RVAE.

• EXPERIMENT 2 (section 2.1.5.2): an outlier detection experiment using benchmark datasets for different choices of posterior distribution: Gaussian (Fashion-MNIST dataset; shoes: inliers, other categories: outliers) and Bernoulli (binarized MNIST: inliers, binarized EMNIST: outliers). In section 2.1.5.3, we introduce two methods for choosing the robustness parameter β and repeat Experiment 2.
• EXPERIMENT 3 (section 2.1.5.4): we use the Gaussian formulation for a real-world lesion detection task, using two real-world magnetic resonance (MR) brain imaging datasets: the Maryland MagNeTs study of neurotrauma (https://fitbir.nih.gov) and the Ischemic Stroke Lesion Segmentation (ISLES) database [170].

• EXPERIMENT 4 (section 2.1.5.5): we use the formulation for tabular data with mixed categorical and continuous features and apply it to an outlier detection task on three benchmark datasets made available by the cyber security community: KDDCup 99 [20], NSL-KDD [60], and UNSW-NB15 [1]. In section 2.1.5.6, we compare RVAE performance with other methods: the VAE, denoising VAE (DVAE) [118], robust AE (RAE) [291], and coupled-VAE (CVAE) [45]. Finally, in section 2.1.5.7, we investigate the performance of the VAE when the outliers in the training and test data are qualitatively different, using the Fashion-MNIST and brain imaging datasets with the Gaussian formulation.

The network architectures for the VAEs were chosen based on previously established designs and are summarized below:

EXPERIMENT 1: We use fully-connected layers with a single hidden layer consisting of 400 units for both the encoder and the decoder, and a bottleneck of dimension 2.

EXPERIMENT 2: We use fully-connected layers with a single hidden layer consisting of 400 units for both the encoder and the decoder, and a bottleneck of dimension 20, as described in [150].

EXPERIMENT 3: The encoder consists of three consecutive blocks of convolutional layers, a batch normalization layer, and a rectified linear unit (ReLU) activation function, with two fully-connected layers in the bottleneck. Similarly, the decoder consists of a fully-connected layer and three consecutive blocks of deconvolutional layers, a batch normalization layer, and ReLU, with a final deconvolutional layer.

EXPERIMENT 4: For the cyber security datasets, we use fully-connected neural networks with three hidden layers of 128 units in both the encoder and decoder, with tanh and softmax activation functions for continuous and categorical variables, respectively.

We used PyTorch [190], scikit-learn [193], and NumPy [262] for the implementation. We used the Adam optimizer [141] with a learning rate of 0.001 for training. For the first three experiments, the bias correction parameters for the Adam optimizer were 0.9 and 0.999 (the default parameters) for gradients and squared gradients, respectively. For the tabular data, these values were 0.5 and 0.999, respectively. We provide a public version of our code at https://github.com/HaleAkrami/RVAE.

Figure 2.4: The performance measure (the ratio between the overall absolute reconstruction error in outlier samples and their counterparts in the normal samples) as a function of the parameter β (x-axis) and the fraction of outliers present in the training data (y-axis) for the datasets used in Experiment 2: (a) MNIST + EMNIST (Bernoulli model); (b) Fashion-MNIST (Gaussian model); (c) Fashion-MNIST (Bernoulli model).

Figure 2.5: ROC curves showing the performance of outlier detection using VAE and RVAE with different fractions of outliers present in the training data for the datasets used in Experiment 2: (a) MNIST + EMNIST; (b) Fashion-MNIST (Gaussian model); (c) Fashion-MNIST (Bernoulli model).

2.1.5.1 Experiment 1: Effect on Latent Representation

First, we used the MNIST dataset, comprising 70,000 28×28 grayscale images of handwritten digits [158]. We replaced 10% of the MNIST data with synthetic outlier images generated by white Gaussian noise. We binarized the data by thresholding at 0.5 of the maximum intensity value and used the Bernoulli model of the β-ELBO (equation 2.10) with β = 0.005. The latent dimension was chosen to be 2 by visual inspection.
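For concreteness, a minimal PyTorch sketch of the fully-connected architecture used in Experiments 1 and 2 follows; this is an illustrative outline (class and variable names are ours), and the full implementation is available in our public repository:

import torch
import torch.nn as nn

class FCVAE(nn.Module):
    # Fully-connected VAE: 784 -> 400 -> latent -> 400 -> 784.
    # latent_dim = 2 in Experiment 1 (for visualization); 20 in Experiment 2.
    def __init__(self, latent_dim=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(784, 400), nn.ReLU())
        self.fc_mu = nn.Linear(400, latent_dim)
        self.fc_logvar = nn.Linear(400, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 400), nn.ReLU(),
                                 nn.Linear(400, 784), nn.Sigmoid())  # Bernoulli means

    def forward(self, x):
        h = self.enc(x.view(-1, 784))
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # S = 1 sample
        return self.dec(z), mu, logvar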
Figure 2.2 (e) shows examples of the reconstructed images using the VAE (second row) and the RVAE (third row), along with the original images (first row). The results show that outlier images are encoded when the VAE is used. The RVAE, on the other hand, as desired, did not accurately encode the outlier noise images, but rather encoded them such that they produce images consistent with the MNIST (inlier) training data after decoding. Moreover, we visually inspected the embeddings computed using both the VAE and the RVAE (Figure 2.2 (a)-(d)). In the VAE case (Figure 2.2 (a) and (b)), the distributions of the digits were strongly perturbed by the outlier noise images. In contrast, the RVAE was not significantly affected by the outliers (Figure 2.2 (c) and (d)), illustrating its robustness.

For a quantitative comparison, we calculated the negative BCE (binary cross-entropy) for each sample, which is equivalent to the log-likelihood since we assumed a Bernoulli posterior. The average log-likelihood of outliers was much lower for the RVAE (-4036.42) than for the VAE (-545.46), while inliers had similar log-likelihoods for the VAE (-144.28) and the RVAE (-145.97). The much larger difference between the average log-likelihoods of inliers and outliers for the RVAE than for the VAE indicates a superior ability to distinguish between the two, and hence increased robustness.

2.1.5.2 Experiment 2: Reconstruction and Outlier Detection

For this experiment, instead of using Gaussian random noise as outliers, we replaced a fraction of the MNIST data with Extended MNIST (EMNIST) data [52], which contains images that are the same size as MNIST but do not depict digits. We again binarized the data by thresholding at 0.5 of the maximum intensity value and used the Bernoulli model of the β-ELBO (equation 2.10) for the RVAE loss function. Similarly, we repeated the above experiment using the Fashion-MNIST dataset [272], which consists of 70,000 28×28 grayscale images of fashion products from 10 categories (7,000 images per category). Here we chose shoes and sneakers as the inlier classes and samples from the other categories as outliers. Since these images contain a significant range of gray scales, we chose the Gaussian model for the β-ELBO (equation 2.12).

An apparently common, yet theoretically unclear, practice is to use a Bernoulli model for grayscale data. This is pervasive in VAE tutorials, research literature, and default implementations of VAEs in deep learning frameworks, where researchers effectively treat the data as probability values rather than samples from the distribution. Using the Bernoulli model for continuous data on [0, 1] is inconsistent with the interpretation of the VAE in terms of probabilistic inference. This issue is discussed in detail in [165]; in particular, they note that even when treating the algebraic form of the Bernoulli distribution as representing a continuous variable on [0, 1], the formulation is still incorrect since the distribution is not correctly normalized. Despite these theoretical concerns, to be consistent with common practice we do include results here for the Fashion-MNIST grayscale images using the Bernoulli model of the β-ELBO (equation 2.10), with grayscale values interpreted as probabilities.

To investigate the performance of the autoencoders, we start with a fixed fraction of outliers (10%). For the MNIST-EMNIST experiment, we trained both the VAE and the RVAE with β varying from 0.001 to 0.02. Figure 2.3 (a) shows the reconstructed images from the RVAE with β = 0.005, 0.01, and 0.015, in comparison to the regular VAE. As in Experiment 1, with an appropriate β (β = 0.01 in this case), the RVAE did not reconstruct the outliers (letters).
As expected, an RVAE with too small a β has performance similar to the regular VAE, while an RVAE with too large a β rejects outliers but also rejects some normal samples. Next, we explored the impact of the parameter β and of the fraction of outliers in the data on the performance of the RVAE. Performance was measured as the ratio between the overall absolute reconstruction error in the outlier samples (letters) and their counterparts in the inlier samples (digits). The higher this metric, the more robust the model, since a robust model should, in this example, encode digits well but letters (outliers) poorly. Figure 2.4 (a) shows this measure as a heatmap, as a function of β (x-axis) and the fraction of outliers (y-axis). When only a few outliers are present, a wide range of βs (< 0.01) works almost equally well. On the other hand, when a significant fraction of the data is outliers, the best performance was achieved only when β is close to 0.01. When β > 0.01, the performance degraded regardless of the fraction of outliers. These results are consistent with the results in Figure 2.3.

Figure 2.6: Generating samples using the decoder with different values of β (β = 0, 0.005, 0.01, 0.02) for the Fashion-MNIST experiment. The optimal value of β is 0.01, which generates different types of shoes. This optimal value also matches the maximum value achieved in the heat map in Figure 2.4 (b).

We further investigated the performance of the RVAE as a method for outlier detection as follows (a code sketch of this procedure is given at the end of this subsection). We thresholded the mean squared error between the reconstructed images and the original images: errors exceeding a given threshold identified the image as an outlier. The resulting labels were compared to the ground truth to determine true- and false-positive rates, and we varied the threshold to compute receiver operating characteristic (ROC) curves. Figure 2.5 shows the ROC curves, with the RVAE shown as a solid line and the VAE as a dashed line. The results were similar for the Fashion-MNIST dataset (Figures 2.3, 2.4, and 2.5, panels (b) and (c)). The RVAE outperformed the VAE in all settings, with the difference increasing with the fraction of outliers.

To illustrate the robustness of the RVAE as a generative model, we input Gaussian noise to the decoder trained on the Fashion-MNIST data with 10% outliers, for a range of β values. Figure 2.6 shows that for β = 0.01 the network exhibits the best trade-off between generating a range of shoe images consistent with the inlier training data and not generating outliers. In contrast, smaller values (β = 0, 0.005) result in outliers affecting training, while the larger value (β = 0.02) produces almost no variability. Note that the maximum value of the heatmap in Figure 2.4 (b) is achieved at this optimal value of β = 0.01 when the percentage of outliers is 10%.
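The outlier-scoring procedure referenced above amounts to a few lines; a sketch using scikit-learn's roc_curve (function and variable names are ours, and the model is assumed to follow the FCVAE interface sketched earlier):

import torch
from sklearn.metrics import roc_curve, auc

def outlier_roc(model, x_test, y_true):
    # Score each image by its mean reconstruction error and sweep a threshold;
    # y_true is 1 for outliers and 0 for inliers.
    with torch.no_grad():
        x = torch.as_tensor(x_test, dtype=torch.float32)
        x_hat, _, _ = model(x)
        scores = ((x_hat - x.view(x.size(0), -1)) ** 2).mean(dim=1).numpy()
    fpr, tpr, _ = roc_curve(y_true, scores)  # larger error => more outlier-like
    return fpr, tpr, auc(fpr, tpr)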
2.1.5.3 How to Choose the Robustness Parameter β?

In practice, the outliers in the training data are not known in advance; hence, we cannot compute the reconstruction error in outlier samples in the training data as described above. Here, we propose two approaches to tuning β: a semi-supervised validation-based approach and an unsupervised clustering-based approach. We used two gradient-free methods for parameter optimization: Brent's method [40, 205] and Bayesian optimization [71, 167]. Bayesian optimization is useful when evaluating the objective function is expensive. This approach keeps track of past evaluation results, using them to form a probabilistic model that maps hyperparameters to the probability of a score on the objective function; this model is then used to predict the next value to evaluate [71, 167]. Brent's method [40, 205] has a lower computational cost than Bayesian optimization; it performs iterative optimization using a combination of the bisection method, the secant method, and inverse quadratic interpolation. At each iteration, Brent's method estimates an optimal value of β, trains the model with that β on the training set, computes the defined metric, and, based on this, finds a new estimate of β.

Validation-based approach: We choose a small subset of the training data as a validation dataset in which inliers and outliers are labeled. Specifically, for the MNIST-EMNIST and Fashion-MNIST experiments above, we chose 1000 samples, of which 10% were identified as outliers. We trained the RVAE on the rest of the training data and then computed the ratio of the reconstruction error for the inlier samples (digits) to that for the outlier samples (letters) on the labeled validation dataset (section 2.1.5.2). The lower this metric, the more robust the model: in this example, a robust model should reconstruct digits/shoes (inliers) well but letters/other categories (outliers) poorly. We minimize this ratio with respect to β using both Brent's method and Bayesian optimization.

Figure 2.7: Searching for the optimal β by optimizing different metrics using Brent's method: (a) MNIST-EMNIST experiment; (b) Fashion-MNIST experiment with 10% outliers. Validation method: the ratio of reconstruction error in inliers vs. outliers is used as the cost function. Clustering method: the silhouette score for k-means clustering (k = 2) is used as the cost function.

We demonstrate Brent's method for finding an optimal β for the MNIST-EMNIST and Fashion-MNIST experiments (section 2.1.5.2) in Figure 2.7. The red curve shows the convergence of the β values; after only a few iterations, we are able to compute the optimal value of β. A similar optimal value of β (0.01 for both the MNIST-EMNIST and Fashion-MNIST experiments) was achieved with Bayesian optimization, using a maximum of 20 iterations and a log-uniform search space. The validation-based approach needs a small validation set with samples labeled as inliers and outliers. As an alternative, for the case where such a validation set is not available, we propose a clustering-based approach, explained below.

Clustering-based approach: In some datasets, labeled data may not be available to generate a validation set. For such cases, we suggest a clustering-based approach. From the heatmaps in section 2.1.5.2, it can be inferred that when the β value is too low, the reconstruction error is low for both the outliers and the inliers; hence, the samples are not clusterable based on this measurement. Conversely, when β is large, the reconstruction error is large for both groups, so again they cannot be partitioned into separate clusters.
For the optimal β value, the reconstruction error should most easily allow differentiation, and hence clustering, of inliers and outliers into two groups. Based on this observation, we maximize the silhouette score [223], a measure of how similar a data sample is to its own cluster (cohesion) compared to other clusters (separation). The silhouette score is calculated using the mean intra-cluster distance x and the mean nearest-cluster distance y: the silhouette score for a sample is (y − x)/max(y, x). As noted above, suboptimal values of β will produce similarly small (β too small) or large (β too large) reconstruction errors, so that inliers and outliers will not cluster into distinct groups based on their reconstruction errors and the silhouette score will be low. But for two clusters in which outliers have large and inliers small reconstruction errors, the silhouette score will be large. Simple k-means clustering (k = 2) can be used to evaluate the score for each candidate value of β. The black curves in Figure 2.7 show the convergence of the β values using the silhouette objective function; after only a few iterations we are able to find the optimal value of β using Brent's method. The optimal value of β found using Bayesian optimization was similar (0.01), using a maximum of 20 iterations and a log-uniform search space.
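A sketch of the clustering-based selection criterion, using scikit-learn's silhouette score on per-sample reconstruction errors and SciPy's bounded Brent-style minimization (the retraining step is abbreviated as train_and_score, a hypothetical helper supplied by the user; the search bounds are illustrative):

import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def select_beta(train_and_score, log10_bounds=(-3.0, -1.0)):
    # train_and_score(beta) is assumed to retrain the RVAE at the given beta
    # and return the vector of per-sample training reconstruction errors.
    def neg_silhouette(log_beta):
        errors = np.asarray(train_and_score(10.0 ** log_beta)).reshape(-1, 1)
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(errors)
        return -silhouette_score(errors, labels)

    # 'bounded' performs Brent-style minimization on the given interval.
    res = minimize_scalar(neg_silhouette, bounds=log10_bounds, method="bounded")
    return 10.0 ** res.x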
2.1.5.4 Experiment 3: Detecting Abnormalities in Brain Images using RVAE

Recently, machine learning methods have been introduced to accelerate the identification of abnormal structures in medical images of the brain and other organs [30]. Since supervised methods require a large amount of annotated data, unsupervised methods have attracted considerable attention for lesion detection in brain images. A popular approach among these methods leverages the VAE for anomaly detection [48] by training the VAE using nominally normal (anomaly-free) data. However, if outliers, lesions, or dropouts are present in the training data, VAEs cannot distinguish between normal brain images and those with outliers. Here, we tackle this real-world problem by investigating the effectiveness of the RVAE for automated detection of outliers using both simulated and real outliers.

We used the VAE architecture proposed in [150] and 20 central axial slices of brain MRI datasets from 119 subjects from the Maryland MagNeTs study of neurotrauma (https://fitbir.nih.gov). We split this dataset into 107 subjects for training and 12 subjects for testing. The experiment using simulated outliers included 10% of two types of outliers: random data dropout (lower-intensity lines with a thickness of 5 pixels) and randomly generated simulated lesions (higher-intensity Gaussian blobs). For the experiment with real outliers (lesions), we used 142 central axial slices from 24 subjects from the ISLES (Ischemic Stroke Lesion Segmentation) database [170]; we used 21 subjects for training and 3 subjects for testing.

In the experiments with both simulated and real outliers, unlike the VAE, the RVAE is robust to the outliers in the sense that they are not reconstructed in the decoded images and can therefore be detected by differencing from the original images (Figure 2.8). Due to the small number of samples and the greater variability and noise in the data, the quality of the reconstructions for the dataset with real outliers is worse than for the dataset with simulated outliers.

Figure 2.8: Reconstructions of brain images using VAE and RVAE: (a) after randomly dropping rows with a height of 5 pixels for 10% of the Maryland MagNeTs dataset; (b) after adding simulated lesions to 10% of the Maryland MagNeTs dataset; (c) for the ISLES data with true lesions.

For a quantitative comparison, we applied a pixel-wise ROC study to the data contaminated with simulated outliers and to the ISLES dataset. In this study, each pixel is classified as either 'normal' or 'lesion' based on the thresholded error and compared to ground-truth delineations of the lesions to compute true- and false-positive rates. By varying the threshold we generated a set of ROC curves. The area under the ROC curve was 0.20 for the VAE and 0.85 for the RVAE on the simulated data, and 0.92 for the VAE and 0.98 for the RVAE on the ISLES dataset, which quantitatively demonstrates the success of the RVAE compared to the VAE for lesion detection.

2.1.5.5 Experiment 4: RVAE for Tabular Data

We compared the performance of the regular VAE and the RVAE by gradually contaminating the training dataset with more outliers to evaluate robustness. We use three benchmark datasets made available by the cyber security community: KDDCup 99 [20], NSL-KDD [60], and UNSW-NB15 [1]. The goal is to detect cyber attacks at the network level. All datasets are in tabular format with categorical and continuous columns. We measured the area under the receiver operating characteristic curve (AUC) as the evaluation metric.

KDDCup 99 [20]: the dataset used for the Third Knowledge Discovery and Data Mining Tools competition. The task was to build an automated network intrusion detector that can distinguish between attacks and normal connections. There are 41 columns, of which 8 are categorical. We use the complementary 10% data for training and the labeled test data for testing.

NSL-KDD [60]: a refined version of KDDCup 99 designed to resolve some of its inherent problems. Specifically, redundant connection records were removed to prevent detection models from becoming biased towards frequent connection records. We used the available full training dataset for training and the test dataset for testing.

UNSW-NB15 [1]: this dataset was introduced by a cyber security research team from the Australian Centre for Cyber Security. We used the available partitioned datasets for training and testing. The data has 43 columns, of which 9 features are categorical.

Thanks to the abundance of labeled data for this application, model selection for β and early stopping were done based on the best AUC on a hold-out validation dataset (20% of the training dataset). We ran each experiment with five different initializations and report the average and standard error of the AUCs across these five runs.

Figure 2.9: Performance comparison of VAE and RVAE as a function of contamination in the training data for the datasets: (a) KDDCup99, (b) NSL-KDD, and (c) UNSW-NB15.

The results in Figure 2.9 show that the performance of the VAE degrades significantly even with a small amount of contamination (1%) for all three datasets (Figure 2.9 (a) KDDCup99, (b) NSL-KDD, and (c) UNSW-NB15). The RVAE, on the other hand, stays robust to the outliers in the training datasets.

2.1.5.6 Comparison with Other Methods

We compared the performance of the RVAE for outlier detection with the denoising VAE (DVAE) [118], the robust AE (RAE) [291], and the coupled-VAE (CVAE) [45] on the following datasets: (i) MNIST+EMNIST (section 2.1.5.2), (ii) the Fashion-MNIST experiment with 10% outliers (section 2.1.5.2), and (iii) the lesion detection experiment on the ISLES dataset (section 2.1.5.4). The areas under the ROC curve (AUCs) for outlier detection are shown in Table 2.1. The RVAE has the highest AUC among the methods tested. Although the RAE [291] shows competitive results and is robust to outliers, it is not a generative model.
Further, RAEs are only applicable for a Gaussian posterior with real-valued input data, where the loss is calculated using the reconstruction error; they are not generally applicable, for example, to categorical or tabular datasets. Furthermore, in earlier medical imaging applications, VAEs were typically shown to outperform AEs [30]. The RAE performs a decomposition of the input data X into two components, $X = L_D + S$, where $L_D$ is a low-rank component that we want to reconstruct and S represents a sparse component assumed to contain outliers or noise. Training the model requires a two-phase optimization framework, so the RAE is computationally more expensive than the VAE. Finally, another drawback of this method is that, in contrast to the RVAE, no principled way of choosing its hyperparameter has been described.

Table 2.1: Comparison of different autoencoders, in terms of AUC, for the MNIST+EMNIST and Fashion-MNIST anomaly detection experiments with 10% outliers and for the lesion detection experiment on the ISLES dataset.

Dataset       | VAE  | RVAE | RAE  | CVAE | DVAE
MNIST+EMNIST  | 0.84 | 0.90 | 0.89 | 0.86 | 0.84
Fashion-MNIST | 0.79 | 0.96 | 0.93 | 0.82 | 0.79
ISLES         | 0.92 | 0.98 | 0.98 | 0.96 | 0.89

Using a cross-entropy cost, the CVAE [45] models pixel data as Bernoulli even when the data are continuous, which causes pervasive errors [165], as described in section 2.1.5.2. Further, the CVAE does not include general settings for other priors, such as the Gaussian, nor does it provide a mechanism for tuning hyperparameters. An alternative robust framework for the VAE was proposed in [78], using a two-component mixture model for each feature, where one component represents the clean data and the other robustifies the model by isolating outliers. However, that work focuses on categorical data rather than images, which are the primary focus here.

2.1.5.7 Inconsistency in Outliers Between Training and Testing

Up to this point we have focused on the case where the training data are polluted with outliers similar to those we intend to detect. In this section we investigate the performance of the VAE when the outliers in the training and test data are qualitatively different.

Fashion-MNIST experiment: Here we added 10% outliers to both the training and test datasets. Inliers are different types of shoes and sneakers. The outliers in the training set are from EMNIST, while the outliers in the test set are other fashion categories from Fashion-MNIST (the test set is similar to Experiment 2). Figure 2.10 shows that the standard VAE does not reconstruct the test outliers properly; as a result, the reconstruction error can be used to detect outliers. Consequently, in this case there may be no need to use a robust formulation. There was no significant difference in AUC (0.99) on the test set between the RVAE and the VAE in the outlier detection task using the reconstruction error.

Figure 2.10: Examples of reconstructed inlier images (first 4 columns) and outliers (last 4 columns) from the test set using the original VAE when the outliers in the training data are qualitatively different from those in the test data.

Lesion detection: We performed the lesion detection task on 20 slices each from 15 subjects from the ISLES dataset as a test set that included outliers containing real lesions. Twenty lesion-free central axial slices of brain MRI datasets from 119 subjects from the Maryland MagNeTs study of neurotrauma (https://fitbir.nih.gov) were used for training.
We separately added either simulated lesions or simulated drop-outs to these data to achieve 10% outliers, generating two different corrupted training sets. Note that the simulated lesions are qualitatively different in shape from their real counterparts in the test data. We trained the network using the VAE architecture proposed in [150]. Results are shown in Figure 2.11. We see that the RVAE continues to perform well with both types of outliers in the training data. Conversely, while the VAE does not reconstruct the true lesions in the test input data, the reconstructions of these test images do, in some instances, contain features similar to those of the outliers used in the training data: for example, in the 2nd row, 4th column for training with simulated lesions, and the 3rd row, 5th column for training with drop-outs. This could clearly lead to errors in detecting and localizing lesions, since the artifacts introduced could be interpreted as lesions when computing differences from the input images. For this experiment there was no significant difference in AUC (0.95) between the VAE and the RVAE, because errors of the type just described occur relatively infrequently. Nevertheless, the fact that the presence of outliers in VAE training can lead to artifacts in the reconstructed images, as shown in the 2nd and 3rd rows of Figure 2.11, argues in favor of using the RVAE over the VAE even in cases where the outliers are suspected to differ between the training and testing sets.

Figure 2.11: Reconstructions of brain images using VAE and RVAE with an inconsistency between the outliers in training and testing. The VAE did not reconstruct the real lesions in the test (input) images, but the reconstructions are sometimes corrupted by artifacts similar to the outliers that were present in the training data (see red squares).

2.1.6 Discussion and Conclusion

The presence of outliers in the form of noise, mislabeled data, and anomalies can impact the performance of machine learning models for labeling and anomaly detection tasks. In this work, we developed an effective approach for learning representations, the RVAE, to ensure the robustness of learning to outliers. Our approach relies on the notion of β-divergence from robust statistics. We formulated cost functions for Bernoulli, Gaussian, and categorical distributions. Furthermore, we provided an unsupervised approach to selecting the robustness hyperparameter β in the RVAE using gradient-free optimization. We demonstrated the effectiveness of our approach using benchmark datasets from computer vision, real-world brain imaging, and tabular cyber security datasets. Our experimental results indicate that the RVAE is robust to outliers in representation learning and can also be useful for outlier detection. Our approach can be used for automated anomaly detection applications in medical images and cyber security datasets.

Our results show that the RVAE tends to decrease the resolution of reconstructed images relative to the VAE. In our approach there is a tradeoff between robustness and the quality of reconstructed images, similar to the efficiency-robustness tradeoff in well-known robust models [28]. The β-divergence is an M-estimator [88] that reduces the influence of outliers by applying a non-linear function to the loss; the associated efficiency-robustness tradeoff is well documented for M-estimates [28]. In future work this could be addressed using an enhancement framework to increase the quality of samples generated using the RVAE [64].
We note that our formulation can also be extended to generative adversarial networks (GANs) [95] by optimizing a divergence that is robust to outliers; this may also lead to improved image quality.

2.2 Neuroanatomic Markers of Post-Traumatic Epilepsy Based on Magnetic Resonance Imaging and Machine Learning

The onset of post-traumatic epilepsy (PTE) after traumatic brain injury (TBI) is relatively common [194]. Epidemiological studies have found that PTE accounts for 10-20% of all symptomatic epilepsies in the general population and ∼5% of all epilepsies [107, 199]. Significant risk factors for seizure onset more than one week after TBI include the occurrence of seizures within the first week, acute intra-cerebral (especially subdural) hematoma, brain contusion, greater injury severity, and age over 65 at the time of injury [59]. As many as 86% of patients with one seizure early after TBI experience a second one within the next two years [151]. Despite the reported relationship between TBI and PTE, identifying biomarkers of epileptogenesis after TBI remains a fundamental challenge.

Preliminary studies in adult male Sprague-Dawley rats indicate the potential involvement of the perilesional cortex, hippocampus, and thalamus in PTE and demonstrate the potential of leveraging MRI analysis to find PTE biomarkers [119, 203]. Previous MRI studies have shown correlations between PTE incidence and (a) the presence of lesions in T2-weighted scans, (b) injury severity, and (c) injury type [18, 72, 79, 201]. Studies of PTE have reported correlations between PTE and the existence of frontal, parietal, and temporal lesions [210, 2, 224, 82, 254]. Nevertheless, the association between PTE and lesion size or location remains poorly understood. Additionally, the heterogeneous nature of TBI injury types, pathology, and lesions presents additional challenges for biomarker discovery. Because the locations, spatial extent, and content of lesions vary considerably between patients with vs. without PTE, there is no complete spatial overlap of injury profiles across the two groups. This heterogeneity needs to be accounted for in statistical analyses because of its potentially confounding effect. The prediction of post-traumatic seizure onset and frequency based on neurological and radiological examinations has been only moderately successful, and more research is needed to understand the relationship between TBI and PTE [211, 207, 90, 76, 121]. Thus, the identification of imaging biomarkers can help in developing better PTE prediction strategies.

This study uses multimodal MRI to identify location- and contrast-related biomarkers of PTE. We perform two analyses aimed at characterizing changes in brain structure using two distinct strategies:

• Morphometric analysis: We perform a population analysis of morphometric changes in the brain associated with TBI. In contrast to the lesion analysis described below, this analysis focuses on identifying changes in brain shape rather than alterations in tissue composition.

• Lesion analysis: We use a machine learning (ML) method to identify abnormal contrasts in multimodal MR images, which are indicative of lesions and tissue abnormalities such as edema, hematoma, and hemorrhage.
2.2.1 Materials and Methods

2.2.1.1 Data

We used three datasets in this study: 1) the Maryland TBI MagNeTs dataset [99], in which 74 subjects were used for statistical comparison and the remaining 41 subjects (of 115 in total) were used for training the neural network; 2) the TRACK-TBI Pilot dataset [283], from which 97 subjects were used for training the neural network; and 3) the ISLES (Ischemic Stroke Lesion Segmentation) dataset with manually delineated lesions [170], which was used as validation data for measuring the accuracy of our lesion delineation method. Information about these three datasets is provided below.

Maryland MagNeTs data: The total number of subjects available from this dataset was 115. The dataset was collected as part of a prospective study that includes longitudinal imaging and behavioral data from TBI patients with Glasgow Coma Scale (GCS) scores in the range 3-15 (mild to severe TBI). Injury mechanisms included falls, bicycle or sports accidents, motor vehicle collisions, and assaults. The individual and group-wise GCS scores, injury mechanisms, and clinical information are not shared. The imaging data are available from FITBIR (https://fitbir.nih.gov), with FLAIR, T1, T2, diffusion, and other modalities available for download. In this study, we used imaging data acquired within 10 days after injury; seizure information was recorded using follow-up appointment questionnaires. Exclusion criteria included a history of white matter disease or neurodegenerative disorders, including multiple sclerosis, Huntington's disease, Alzheimer's disease, and Pick's disease, and a history of stroke or brain tumors. Imaging was performed on a 3T Siemens TIM Trio scanner (Siemens Medical Solutions, Erlangen, Germany) using a 12-channel receive-only head coil. For statistical analysis, we used 37 subjects with epilepsy (26M/11F) from this dataset and 37 randomly selected subjects without epilepsy (distinct from the set used to train the lesion detection algorithm; 27M/10F) from the same dataset [99]. The age range was 19-65 years for the epilepsy group and 18-70 years for the non-epilepsy group. The analysis of population differences was performed using the T1-weighted, T2-weighted, and FLAIR MRIs in the Maryland TBI MagNeTs dataset [99]. The remaining 41 subjects with TBI but without PTE were used for training the machine learning algorithm.

TRACK-TBI Pilot dataset: This is a multi-site study with data across the injury spectrum, along with CT/MRI imaging, blood biospecimens, and detailed clinical outcomes [283]. Patients were scanned at 3T, and their information was collected according to the 26 core Common Data Elements (CDEs) standard developed by the neuroimaging working group, including 93 CDEs. The 3T research MRI scanners were manufactured by General Electric, Philips, and Siemens. The MRI protocols complemented the CDE tiers used in the Alzheimer's Disease Neuroimaging Initiative (ADNI), with TR/TE = 2200/2.96 ms, an effective TI of 880 ms, an echo spacing of 7.1 ms, a bandwidth of 240 Hz/pixel, and a total scan time of 4 minutes and 56 seconds. The T1-weighted sequence adapted for non-ADNI studies was adopted because the MP-RAGE (magnetization-prepared rapid gradient-echo) sequence with 180-degree radio-frequency pulses was a key component of the ADNI MRI protocol and increased the probability of obtaining at least one high-quality morphometric scan at each examination.
To train the machine learning algorithm, we used 2D slices of brain MRIs from a combined group of 41 subjects (33M/8F, age range 18-82 yrs) with TBI but without PTE from the Maryland TBI MagNeTs study (excluding the 74 subjects used for statistical testing) [99] and 97 subjects with TBI but without epilepsy from the TRACK-TBI Pilot study [283] (70M/27F, age range 11-73 yrs), available for download from https://fitbir.nih.gov.

ISLES dataset: For validation of the ability of the machine learning algorithm to identify lesions, we used 15 subjects from the ISLES (Ischemic Stroke Lesion Segmentation 2015) database. The data is available for download at http://www.isles-challenge.org/ISLES2015. This dataset was used because it provides manually delineated lesions as ground truth [170]. The data consist of T1, T2, and FLAIR images as well as manual delineations of the lesions. Imaging was performed on 3T Philips scanners with routine clinical acquisition parameters.

2.2.1.2 Pre-processing

The pre-processing of all three datasets was performed using the BrainSuite software (https://brainsuite.org). The three modalities (T1, T2, FLAIR) were co-registered to each other by registering T2 and FLAIR to T1, and the result was co-registered to the MNI atlas by registering the T1 images to the MNI atlas using a rigid (translation, scaling, and rotation) transformation model. As a result, all three modality images are in MNI space at 1 mm3 resolution. Skull and other non-brain tissue were removed using BrainSuite: brain extraction was performed by stripping away the skull, scalp, and any non-brain tissue from the image. This was followed by tissue classification and generation of the inner and pial cortex surfaces. Subsequently, for training and validation, all images were reshaped to 128 × 128 pixels and histogram-equalized to a lesion-free subject. The extracted cortical surface representations and brain image volumes for each subject were jointly registered to the BCI-DNI Brain Atlas (http://brainsuite.org/svreg_atlas_description/) using BrainSuite's Surface-Volume Registration (SVReg18a) module [131, 134]. SVReg uses anatomical information from both the surface and volume of the brain for accurate automated co-registration, which allows consistent surface and volume mapping to a labeled atlas. This co-registration establishes a one-to-one correspondence between individual subjects' T1 MRIs and the BCI-DNI brain atlas. The deformation map between the subject and the atlas encodes the deformation field that transforms images between the subject and the atlas.

2.2.1.3 Tensor-based Morphometry

To perform a morphometric analysis that compares the brain shapes of PTE patients to those of non-PTE participants, we used tensor-based morphometry (TBM) [68, 114]. TBM is an established neuroimaging method that identifies regional differences in brain structure in groups or individuals relative to a control group using the determinant of the Jacobian matrix computed from the deformation field; the latter defines a nonlinear mapping that warps the brain into a common (atlas) space [23, 161]. Regions of the brain that differ most from the reference atlas brain are characterized by significantly smaller (e.g., atrophy/tissue loss) or larger (e.g., enlarged ventricles) Jacobian determinants relative to healthy controls (HCs).
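To make the TBM measure concrete, the snippet below computes a voxel-wise Jacobian determinant map from a dense deformation field using finite differences. This is a minimal illustration rather than the BrainSuite implementation; the array layout, the 1 mm isotropic grid, and the synthetic example data are assumptions.

```python
import numpy as np

def jacobian_determinant_map(phi):
    """Voxel-wise Jacobian determinant of a dense 3D mapping phi.

    phi: array of shape (3, X, Y, Z) giving the atlas-to-subject mapping
    in voxel units (a 1 mm isotropic grid is assumed). In the returned
    (X, Y, Z) map, values below 1 indicate local contraction (e.g.,
    tissue loss) and values above 1 indicate local expansion (e.g.,
    enlarged ventricles).
    """
    J = np.empty(phi.shape[1:] + (3, 3))
    for i in range(3):
        # Gradients of component i along x, y, z: J[..., i, j] = d(phi_i)/d(x_j)
        gx, gy, gz = np.gradient(phi[i])
        J[..., i, 0], J[..., i, 1], J[..., i, 2] = gx, gy, gz
    return np.linalg.det(J)

# Synthetic example: a near-identity deformation should give det(J) ~ 1
X, Y, Z = 32, 32, 32
grid = np.stack(np.meshgrid(np.arange(X), np.arange(Y), np.arange(Z),
                            indexing="ij")).astype(float)
phi = grid + 0.01 * np.random.randn(3, X, Y, Z)
print(jacobian_determinant_map(phi).mean())
```

In practice the determinant maps would also be smoothed (here, with a 3-mm standard deviation Gaussian) before statistical testing, as described next.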
We used BrainSuite's TBM pipeline to map structural brain changes resulting from TBI and to identify regions that are more strongly associated with the onset of PTE [113, 68]. The Jacobians are computed from the deformation fields associated with the cortically constrained volumetric subject-to-atlas registration described above. We applied 3-mm standard deviation (7 mm full width at half-maximum, or FWHM) isotropic smoothing to the Jacobian determinant maps to account for residual misregistration and to increase statistical power. We analyzed the Jacobian determinants at each voxel in two ways: (1) a t-test to determine if there are differences in the mean local shape across the PTE and non-PTE groups, and (2) an F-test to determine if there are group differences in the variances of this measure. The null hypothesis for (1) was that the mean of the Jacobian determinants is the same in the PTE and non-PTE groups. For (2), the null hypothesis was that the variance of the Jacobian determinants is the same in the PTE and non-PTE groups. The t-test would reveal whether there are consistent TBI-related brain shape differences between the PTE and non-PTE groups. Since trauma affects different areas of the brain in different subjects across groups, it seems unlikely that consistent localized differences between the two groups would be observed. For this reason, we also included the F-test, which would allow us to observe larger variances in localized shape differences in the PTE group in regions at higher risk for developing PTE foci. Since there may be more than one such region, only a subset of PTE subjects would have TBI-related shape differences in any particular area, leading to a larger variance across the PTE group in these areas relative to the non-PTE group.

2.2.1.4 Lesion-based Analysis

To complement the TBM analysis, which captures morphometric brain changes, we also performed a lesion-based analysis to analyze changes in the underlying tissue microstructure, edema, and other TBI-related factors revealed by MRI contrast changes. For lesion mapping, we use multimodal MR images (T1, T2, FLAIR) and machine learning (ML) to automatically identify and delineate abnormal tissues. Lesions can be identified by visual inspection after extensive training, but this time-consuming process makes ML an attractive alternative. Approaches based on supervised ML have already achieved noticeable success, reaching high accuracy for lesion detection [160, 136, 195]. However, many manual lesion delineations are required to train supervised models. Unsupervised approaches, on the other hand, do not require labeled data but are typically less accurate. A popular unsupervised ML approach to lesion identification leverages a form of deep learning (DL) neural network known as a variational auto-encoder (VAE) [48, 142]. When the VAE is trained using nominally normal imaging data, the network learns to encode only normal images. As a result, the associated image 'decoder' can reconstruct normal images. When presented with images containing lesions or anomalies, the VAE encodes and reconstructs the image as if it contained only normal structures, as illustrated in Figure 2.12. Lesions can then be identified from the differences between the original and VAE-decoded images. In practice, VAEs exhibit some degree of robustness to outliers (in our case, lesions) in the training data. To identify lesion-based PTE biomarkers, we trained the VAE using the T1-weighted, T2-weighted, and FLAIR MRIs in the Maryland TBI MagNeTs dataset [99].
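Conceptually, the lesion mapping used throughout this section reduces to comparing each image with its VAE reconstruction. A minimal PyTorch-style sketch of that step follows; the trained vae model, its return signature, and the tensor shapes are hypothetical placeholders, and the median filtering used later for validation is included for completeness.

```python
import torch
from scipy.ndimage import median_filter

def lesion_map(vae, x):
    """Per-pixel anomaly map: absolute difference between an image
    and its VAE reconstruction.

    x: tensor of shape (batch, 3, 128, 128) with T1/T2/FLAIR slices.
    Returns a numpy array with the same shape.
    """
    vae.eval()
    with torch.no_grad():
        # Assumes the VAE returns (reconstruction, mu, logvar)
        x_hat, _, _ = vae(x)
    err = (x - x_hat).abs().cpu().numpy()
    # Median filtering suppresses isolated high-error pixels
    return median_filter(err, size=(1, 1, 3, 3))
```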
Figure 2.12: The VAE network and an input/output sample pair from the ISLES dataset. X denotes the input data and Z denotes its low-dimensional latent representation. The VAE consists of an encoder network that computes an approximate posterior qϕ(Z|X) and a decoder network that computes pθ(X|Z). The VAE model takes T1, T2, and FLAIR images from individual subjects (left), compresses them to generate a latent representation (Z), and regenerates three images (right). The VAE is trained on a dataset that contains few lesions. After training, when presented with a new lesioned brain, the reconstruction effectively removes the lesion from the image, resulting in a normal (lesion-free) version of the brain.

The lesions were delineated based on the VAE reconstruction error in the FLAIR images. We used the VAE for unsupervised delineation of lesions in these two groups. A detailed description and validation of our method are available elsewhere [10]. The VAE is a directed probabilistic graphical model whose posteriors are approximated by a neural network. Since a VAE learns the general pattern in the training data, its encoder does not encode the lesion and its decoder reconstructs a lesion-free version of the image. The difference between the input and output (reconstructed) images can then reveal the locations of lesions as well as other pathology, which may include hematoma, edema, and hemorrhage. We used a combination of the Maryland TBI MagNeTs data and TRACK-TBI Pilot data to train our VAE. These datasets are not lesion-free, but a VAE can handle occasional lesions in the training set since it has some robustness. To verify this, we compared its performance to our proposed robust VAE [10, 14] and confirmed that there was no significant difference between their results. In the following, we refer to this procedure of computing differences between input and reconstructed brains as 'lesion mapping'. We then investigate the relationship between the sizes and locations of lesions and PTE onset. For the VAE's architecture, we used the convolutional neural network (CNN) proposed in [150]: the encoder consists of three consecutive blocks of convolutional layers, a batch normalization layer, and a rectified linear unit (ReLU) activation function, followed by two fully-connected layers in the bottleneck. For the decoder, we used a fully-connected layer and three consecutive blocks of deconvolutional layers, a batch normalization layer, and a ReLU, with a final deconvolutional layer (Figure 2.12). The size of the input layer is 3 × 128 × 128. The VAE error indicates deviation from normal tissue, and we therefore refer to it as the VAE lesion map. This map was then warped to the BCI-DNI atlas space by applying the deformation field computed during the registration to the atlas performed during preprocessing. Similar to the TBM analysis, we analyzed the VAE lesion maps using two methods: (1) a t-test to determine whether there are differences in the means of the VAE lesion maps between the PTE and non-PTE groups, and (2) an F-test to determine whether there are statistically significant differences in the variances of the lesion maps between the PTE and non-PTE TBI groups. Additionally, we performed a regional analysis by quantifying lesion volume from binarized lesion maps in each lobe using our USCLobes brain atlas (http://brainsuite.org/usclobes-description). This atlas consists of lobar delineations (left and right frontal, parietal, temporal, and occipital lobes, as well as the bilateral insulae and cerebellum).
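Before turning to the lobar lesion masks, the following is a minimal PyTorch sketch of the VAE architecture described above: an encoder with three convolutional blocks (convolution, batch normalization, ReLU) and two fully-connected bottleneck layers, and a decoder with a fully-connected layer, three deconvolutional blocks, and a final deconvolutional layer. The channel widths, latent dimension, kernel sizes, and strides are illustrative assumptions, since the text does not specify them.

```python
import torch
import torch.nn as nn

class ConvVAE(nn.Module):
    """VAE for 3-channel 128 x 128 inputs (T1, T2, FLAIR slices)."""

    def __init__(self, latent_dim=128):
        super().__init__()

        def conv_block(cin, cout):    # conv + batch norm + ReLU, halves size
            return nn.Sequential(nn.Conv2d(cin, cout, 4, stride=2, padding=1),
                                 nn.BatchNorm2d(cout), nn.ReLU())

        def deconv_block(cin, cout):  # deconv + batch norm + ReLU, doubles size
            return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1),
                                 nn.BatchNorm2d(cout), nn.ReLU())

        # Encoder: three conv blocks, spatial size 128 -> 64 -> 32 -> 16
        self.encoder = nn.Sequential(conv_block(3, 32), conv_block(32, 64),
                                     conv_block(64, 128))
        # Two fully-connected bottleneck layers (mean and log-variance)
        self.fc_mu = nn.Linear(128 * 16 * 16, latent_dim)
        self.fc_logvar = nn.Linear(128 * 16 * 16, latent_dim)
        # Decoder: FC layer, three deconv blocks, and a final deconv layer
        self.fc_dec = nn.Linear(latent_dim, 128 * 16 * 16)
        self.decoder = nn.Sequential(deconv_block(128, 64), deconv_block(64, 32),
                                     deconv_block(32, 32),
                                     nn.ConvTranspose2d(32, 3, 3, stride=1, padding=1),
                                     nn.Sigmoid())

    def forward(self, x):
        h = self.encoder(x).flatten(1)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        x_hat = self.decoder(self.fc_dec(z).view(-1, 128, 16, 16))
        return x_hat, mu, logvar
```

Training would minimize the usual VAE objective (reconstruction error plus a KL-divergence term on the latent distribution); the loss is omitted here for brevity.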
To identify the lesions as binary masks in each lobe, a one-class support vector machine (SVM) [288] was applied to the VAE lesion maps at each voxel and across subjects to identify subjects with abnormally large errors at that voxel. The one-class SVM is a commonly used unsupervised learning algorithm for outlier detection [288, 75]. We used the outliers marked by the one-class SVM as lesion delineations and computed lesion volumes per lobe by counting the number of outlier voxels in each lobe for each subject. Since lesion locations vary across subjects, at a given location in the brain some subjects in either group have healthy-appearing tissue whereas others have lesions. However, if lesions in a brain region increase the chance of PTE, then in that region we would expect to see greater heterogeneity of image intensities across the PTE group than across the non-PTE group, leading to an increase in variance. To test this hypothesis at each voxel, we performed an F-test on the lesion maps. The resulting p-values were corrected for multiple comparisons using Benjamini and Hochberg's false discovery rate (FDR) procedure [34].

Validation of VAE Lesion Detection: After training the VAE, we evaluated its performance using the ISLES dataset. We calculated the pixel-wise absolute reconstruction error and applied median filtering to the resulting image to remove isolated pixels. Ground truth was defined using hand-traced delineations of lesions on FLAIR images [170]. We then generated ROC (receiver operating characteristic) curves and computed the AUC (area under the curve) based on the concordance between pixels in the labeled lesions and those pixels in which the absolute error image exceeded a given threshold. The ROC curves were generated by varying the lesion threshold intensity in the error image to control the true- and false-positive rates. We also generated reconstruction error maps for the Maryland MagNeTs test set and used these for the lesion mapping analysis described above.

2.2.2 Results

2.2.2.1 TBM-based Analysis

The results of the TBM analysis using a t-test of the Jacobian determinant maps are shown in Figure 2.13, while the results for the F-test are shown in Figure 2.14. As anticipated, in the case of the t-statistic map, the TBM analysis results did not survive multiple comparisons correction for the false discovery rate (FDR) using the Benjamini-Hochberg procedure [34]. This may be because of the heterogeneity of lesion locations and sizes across both groups. In contrast, the F-test is sensitive to significant differences in variance between the two groups and does show regions where the Jacobian determinant is significantly different, even after FDR correction (q = 0.05).

Figure 2.13: Three orthogonal views through the t-statistic map thresholded at p = 0.05 (uncorrected) for TBM analysis using Jacobian determinants. No regions in the map survived multiple comparisons correction using FDR (q = 0.05).

The voxels close to the pial surface associated with significant differences may indicate group differences in the locations of edema, hematoma, or hemorrhage and may, therefore, be associated with an increased risk for PTE.

2.2.2.2 Lesion-based Analysis

The VAE model was trained using the independent set of 138 non-PTE training subjects, and its performance was measured using ROC analysis on the ISLES dataset (Figure 2.15).
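For reference, the pixel-wise ROC evaluation described in the validation paragraph above can be sketched in a few lines; the error maps and ground-truth masks here are random placeholders standing in for the real data.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Placeholder inputs: median-filtered reconstruction-error maps and
# binary ground-truth lesion masks, each of shape (n_slices, 128, 128)
error_maps = np.random.rand(15, 128, 128)
gt_masks = np.random.rand(15, 128, 128) > 0.95

# Sweeping a threshold on the error intensity traces out the ROC curve
fpr, tpr, _ = roc_curve(gt_masks.ravel(), error_maps.ravel())
print(f"AUC = {auc(fpr, tpr):.2f}")
```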
Due to the infrequent occurrence of lesions in the training data, the VAE learned not to reconstruct lesions, so that they appeared in the error map as shown in Figure 2.15, where we illustrate performance for cases where lesions are present. Note that the reconstructed images in (b) are 'de-lesioned' approximations of the input images in (a). The normal tissues are reconstructed, whereas the anomalies and lesions are not. The error maps in (c) are indicative of anomalies in the brain. The error maps after median filtering in (d) show reasonable correspondence with the ground truth (e). The AUC for the lesion detection ROC study on the ISLES data was 0.81.

Figure 2.14: Three orthogonal views through the F-statistic map thresholded at q = 0.05 (FDR-corrected) for TBM analysis using Jacobian determinants.

Figure 2.15: Reconstruction results obtained by applying the VAE to the ISLES dataset: (a) sample slices from input images; (b) slices reconstructed by the VAE; (c) difference between input and reconstructed images; (d) error maps after applying median filtering to reduce the occurrence of spurious voxels; (e) manually delineated lesion masks used as ground truth to evaluate VAE performance.

The trained model was then applied to the study population of 37 PTE and 37 non-PTE subjects. Mapping the results of the t-test on the VAE-computed absolute reconstruction errors shows differences in the left temporal and right frontal lobes (Figure 2.16). However, these results do not survive correction for multiple comparisons (q = 0.05). As explained earlier, this is likely because the heterogeneity of lesions across both groups reduces the power of the t-test that compares the means of the two populations. On the other hand, the results of the F-test showed significant differences between groups in both the left and right temporal lobes. We also saw significant differences in the right occipital lobe and the cerebellum (Figure 2.17).

Figure 2.16: Orthogonal views through the t-statistic map, thresholded at p = 0.05 (uncorrected), comparing lesion maps for the PTE and non-PTE groups.

Figure 2.17: Orthogonal views through the F-statistic map, thresholded at q = 0.05 (FDR-corrected), comparing lesion maps for the PTE and non-PTE groups.

The results of the lobar analysis (Table 2.2) were consistent with the voxel-wise analysis, showing an increased variance in the PTE population relative to non-PTE subjects in the left and right temporal lobes, right occipital lobe, and cerebellum.

Lobe | % of lobe volume, PTE subjects, median (std) | % of lobe volume, non-PTE subjects, median (std) | p-value (F-test)
Right Temporal | 4.746 (1.496) | 4.803 (0.819) | 0.002648
Left Temporal | 4.993 (1.223) | 4.871 (0.7661) | 0.01704
Right Occipital | 4.131 (1.749) | 4.724 (1.225) | 0.04928
Left Occipital | 4.336 (1.358) | 4.384 (1.001) | 0.07867
Right Frontal | 4.794 (1.416) | 4.897 (1.868) | 0.9497
Left Frontal | 5.251 (1.646) | 5.016 (1.399) | 0.3054
Right Parietal | 4.794 (1.416) | 4.897 (1.868) | 0.6491
Left Parietal | 4.610 (1.183) | 4.859 (1.320) | 0.817
Right Insula | 5.319 (1.259) | 5.147 (1.383) | 0.6491
Left Insula | 5.082 (1.067) | 4.612 (1.045) | 0.817
Cerebellum | 4.758 (1.269) | 5.046 (0.8524) | 0.03532

Table 2.2: Average lesion volumes as measured by identifying lesions using a one-class SVM on the VAE lesion maps.
Red indicates cases of significant differences in the variance of lesion volume between the PTE and non-PTE groups (F-test). The FDR-corrected p-values are shown at a significance level of α = 0.05.

2.2.3 Discussion and Conclusion

Our results are consistent with earlier TBI studies that show a relationship between lesion location and the probability of PTE onset. In particular, the F-test in our lesion study indicates a correlation between the presence of PTE and the frequency of lesion occurrence in the temporal lobes, consistent with previous studies [210, 2, 224, 82, 254]. Interestingly, the TBM F-test shows areas of significant differences between groups that are, in large part, clustered on or just below the pial surface as well as in the cerebellum. Whereas the near-surface clusters could be false positives and need further investigation, this result may indicate the increased occurrence of edema or hematoma in acute TBI patients, which are known to alter cortex shape [122] and which may be associated with an increased chance of developing PTE. TBM and its extensions [23, 161] have been used for whole-brain analysis of structural abnormalities in temporal lobe epilepsy (TLE) patients. Significant volume reductions were found in brain regions including the hippocampus, cingulate gyrus, precentral gyrus, right temporal lobe, and cerebellum [161]. Cross-sectional studies of children with chronic localization-related epilepsy (LRE) using traditional volumetric and voxel-based morphometry have revealed abnormalities in the cerebellum, frontal and temporal lobes, hippocampus, amygdala, and thalamus [55, 66, 98, 153, 154, 152, 252]. These studies emphasize the role of the temporal lobe in epilepsy and provide further evidence for the role of temporal lobe lesions in PTE. The findings from our study further support this evidence for the involvement of the temporal lobe. Furthermore, studies like ours may assist or complement efforts to study post-traumatic metabolic crisis [269] or to localize post-traumatic epileptic foci for surgical resection via electroencephalography (EEG) [123, 124]. The novel use of a VAE here to automatically delineate lesions may prove useful for future studies over large datasets, or collections of datasets like FITBIR, where manual segmentation is very time-consuming and/or subject to large inter-rater variability.

2.3 Prediction of Post-Traumatic Epilepsy using MRI-based Imaging Markers

Traumatic Brain Injury (TBI) survivors often carry a tremendous burden of disability as a result of their injuries [187]. Such injuries can have wide-ranging physical and psychological effects, with some signs or symptoms appearing immediately after the traumatic event, while others appear days or weeks later. TBI is one of the major causes of epilepsy [4], yet the link between TBI and epilepsy is not fully understood. Post-traumatic epilepsy (PTE) refers to recurrent and unprovoked post-traumatic seizures occurring more than one week after injury [258]. Patients with PTE perform worse across several clinical and performance metrics, such as independence and cognitive scores, and have a significantly reduced quality of life [43]. They are also prone to higher rates of mental illness such as depression and addiction [44]. Significant risk factors for the development of seizures more than one week after TBI include seizures within the first week, acute intracerebral hematoma (especially subdural hematoma), brain contusion, increased injury severity, and age over 65 years at the time of injury [62].
The incidence of PTE ranges from 4% to 53%, with the risk approaching 50% in cases with direct injury to the brain parenchyma. PTE risk also varies with the time after injury, the age range under study, and the spectrum of severity of the inciting injuries [62, 86, 197]. Epidemiological studies have found that PTE accounts for 10%-20% of symptomatic epilepsies and 5% of all epilepsies [106, 200]. It is estimated that in the USA and the European Union (EU), with a total population of about 800 million, at least 0.5 million surviving individuals live with PTE [258]. Data on the economic burden of PTE are unavailable, but some idea is provided by the average lifetime cost of TBI, which in the USA is around $200,000 per case scaled to 2004 prices [35, 117]. Thus, in addition to the personal burden, the economic burden caused by PTE is also substantial. Therefore, prediction and, if possible, prevention of PTE remains an important challenge. Biomarkers for post-traumatic epilepsy can vary from imaging and electrophysiologic measurements to changes in gene expression and metabolites in blood or tissues. MRI offers a powerful, non-invasive, radiation-free modality for marker development. While a wealth of biomarkers exists once the epilepsy condition is already established, these markers can only reveal mechanisms that exist after the epileptogenic process, which can allow partial or full pharmacoresistance to become established before treatment starts [196, 197, 204]. Prognostic markers and approaches to identify the risk of post-traumatic epilepsy would eliminate the need to wait for spontaneous epileptic seizures to occur before starting treatment. The ability to identify high-risk subjects can enable the mitigation of risks to subjects whose seizures could result in serious injury or death. The pathogenesis of spontaneous recurrent seizures is certainly multifactorial. Once epilepsy is established, the seizure threshold, which is a measure of the balance between excitatory (glutaminergic) and inhibitory (GABA-ergic) forces in the brain [186, 226], is thought to vary over time depending on several factors, such as periodicities in seizure occurrence [29]. Current anti-seizure medications raise the seizure threshold and thus reduce the propensity for seizures to occur. An individualized prognostic marker for the development of PTE could be used in clinical trials to study compounds that may have true anti-epileptic potential. Current medications used for epilepsy treat only the symptoms, not the underlying pathophysiology that leads to epilepsy. The value of individualized prognostic markers is five-fold [81]:
(i) Prediction of the development of an epilepsy condition: prognostic markers that eliminate the need to wait for spontaneous epileptic seizures to occur would reduce the time and cost required for TBI patients to start participation in clinical trials, as well as the risks to subjects whose seizures could result in serious injury or death.
(ii) Identification of the presence and severity of tissue capable of generating spontaneous seizures: an imaging-based marker can identify anomalous brain regions, which could help in surgical as well as noninvasive treatment planning even before the condition is established.
(iii) Measuring progression after the condition is established: MR imaging-based markers can help in quantifying the progression of epilepsy and understanding pharmacoresistance.
Identification of localized biomarkers of epileptogenic brain abnormalities would allow longitudinal tracking of the seizure threshold at later time points and would presumably reveal the time points at which the epileptogenic process reaches a critical point, such that clinical seizures would likely occur.
(iv) Creating animal models of PTE: the identified markers can be used to create animal models, and prediction algorithms can be used for more cost-effective screening of animal models for treatment with potential anti-epileptogenic and anti-seizure drugs and devices.
(v) Cost reduction in clinical trials by screening patients: PTE risk prediction can be used for recruitment into clinical trials of potential anti-epileptogenic interventions by enriching the trial population with patients identified as being at higher risk for developing epilepsy.
Prediction and, if possible, prevention of the development of PTE is a major unmet challenge. Animal studies in adult male Sprague-Dawley rats have shown the potential of using MRI-based image analysis to find biomarkers for PTE [120, 202]. These studies point towards the involvement of the perilesional cortex, hippocampus, and temporal lobe in PTE [200]. Despite important progress, brain imaging is still underexploited in the context of PTE biomarker research. Numerous human neuroimaging studies have provided important insights into TBI [67, 83, 140] and epilepsy [161, 175, 239], but imaging-based PTE prediction work is scarce. Group analyses of TBI patients compared to controls revealed volume reductions in brain tissue across multiple cortical areas, including the corpus callosum, corona radiata, anterior, mid, and posterior cingulate gyrus, precuneus, and parahippocampal gyrus [83, 89, 140, 236, 280], as well as subcortical regions, including the hippocampus, amygdala, putamen, globus pallidus, caudate, and midbrain [67, 83, 89, 236, 249, 267]. Clinical and research studies in epilepsy often include both anatomical (MRI, CT) and functional (PET, EEG, MEG, ECoG, depth electrodes, fMRI) mapping. While epileptogenic zones can be found in almost any location in the brain, the temporal lobe and the hippocampus are the most common sites causing focal epileptic seizures [239]. Multimodal MRI and PET imaging have been used to predict the laterality of temporal lobe epilepsy [208, 239]. Extensive changes in brain networks due to epilepsy have been reported using PET, fMRI, and diffusion imaging [161, 202, 208, 7, 239]. Research using resting-state fMRI (rs-fMRI) over the last two decades has uncovered important properties of brain dynamics and network organization by revealing the existence of patterns of spontaneous neural activity that occur in the absence of a specific task or stimulus. These patterns, known as resting-state networks, have been found to be consistent across individuals and are thought to reflect the underlying functional organization of the brain [22]. The presence of lesions in TBI patients is expected to alter these resting-state brain dynamics either locally, through changes in the lesioned area, or at the level of networks affected by the lesion [184]. In addition, previous work has shown that cases of focal epilepsy are often associated with changes in network activity extending beyond the seizure onset zone [276, 192].
Interestingly, [290] trained an SVM classifier on brain imaging data for the diagnosis of mesial temporal lobe epilepsy and found that combining fMRI and structural MRI features provided better classification than either modality alone. Taken together, these studies motivate the exploration of local and network-level brain abnormalities as potential predictive markers of PTE in patients who have suffered a TBI. A few recent studies have employed machine learning to identify functional brain changes that may serve as PTE biomarkers [216, 7, 58]. Common features used in such brain-based prediction approaches are the pair-wise correlation patterns observed between rs-fMRI signals. However, because of the high dimensionality of such connectivity metrics, using them as features in a machine learning framework can quickly lead to model overfitting. A standard approach to dealing with this issue is to use a dimensionality reduction method such as principal components analysis (PCA) to reduce the feature space to a subset of principal components that nevertheless represents most of the information in the data. Moreover, regularized models [103, 251] with a ridge or lasso penalty can also be used to select a subset of features and prevent overfitting by penalizing the weights. In the specific case of exploiting brain connectivity features in a dataset with a limited number of subjects, the presence of groups of highly correlated features leads to an ill-conditioned feature space. As a result, methods that use simple penalties that discard most of the correlated features can become unstable [251]. The goal of the present study was to probe the utility of multiple structural and functional features extracted from MR imaging in characterizing PTE, as well as in predicting its occurrence. We explored both classical group-level statistics and cross-validated machine learning methods. Our results extend previous reports by uncovering key MR-based differences between TBI patients who develop PTE and those who do not. In addition, we were able to leverage machine learning analyses to assess the relative contribution of different types of structural and functional features to the out-of-sample prediction of PTE.

2.3.1 Materials and Methods

2.3.1.1 Data

We extracted functional and structural features from two datasets: 1) The Maryland TBI MagNeTs dataset [99]: of the 113 individual datasets, 72 (36 PTE and 36 non-PTE) were used for group-level difference comparisons as well as for supervised training and testing of a machine learning algorithm (i.e., constructing a model from training samples to predict the presence of PTE in previously unseen data). The remaining 41 non-PTE subjects (total 113) were used to train an artificial neural network for automatic lesion delineation, using a method recently developed by our group [13]. 2) The TRACK-TBI Pilot dataset [283]: 97 subjects from this dataset, in addition to the 41 non-PTE subjects from MagNeTs, were used to train the artificial neural network for automated delineation of lesions, as described briefly below in Section 2.3.1.2. A more detailed description can be found in [13].

Maryland MagNeTs data: This is our main dataset for group comparisons as well as for PTE prediction. The dataset was collected as part of a prospective study that includes longitudinal imaging and behavioral data from TBI patients with Glasgow Coma Scale (GCS) scores in the range of 3-15 (mild to severe TBI).
Injury mechanisms included falls, bicycle or sports accidents, motor vehicle collisions, and assaults. The individual and group-wise GCS scores, injury mechanisms, and clinical information are not shared. The imaging data are available from FITBIR (https://fitbir.nih.gov), with FLAIR, T1, T2, fMRI, diffusion, and other modalities available for download. In this study, we used imaging data acquired within 10 days after injury; seizure information was recorded using follow-up appointment questionnaires. Exclusion criteria included a history of white matter disease or neurodegenerative disorders, including multiple sclerosis, Huntington's disease, Alzheimer's disease, and Pick's disease, and a history of stroke or brain tumors. The imaging was performed on a 3T Siemens TIM Trio scanner (Siemens Medical Solutions, Erlangen, Germany) using a 12-channel receive-only head coil. For statistical analysis, we used 36 fMRI subjects with PTE (25M/11F) from this dataset and 36 randomly selected fMRI subjects without PTE (28M/8F) [99, 292]. The age range was 19-65 years for the epilepsy group and 18-70 years for the non-epilepsy group. Our analysis of population differences was performed using T1-weighted, T2-weighted, and FLAIR MRI as well as resting fMRI [99]. The remaining 41 subjects with TBI but without PTE were used for training the automatic lesion detection algorithm. Standard gradient-echo echo-planar resting-state functional MR imaging (repetition time/echo time, 2000/30 ms; flip angle, 75°; field of view, 220 × 220 mm; matrix, 128 × 128; 153 volumes) was performed in the axial plane, parallel to a line through the anterior and posterior commissures (section thickness, 5 mm; section gap, 1 mm) and positioned to cover the entire cerebrum (spatial resolution, 1.72 × 1.72 × 6.00 mm), with an acquisition time of 5 minutes 6 seconds. The individuals were instructed to close their eyes for better relaxation but to stay awake during the imaging protocol.

TRACK-TBI Pilot dataset: This is a multi-site study with data across the injury spectrum, along with CT/MRI imaging, blood biospecimens, and detailed clinical outcomes [283]. Here we use 3T MRI data in addition to information collected according to the 26 core Common Data Elements (CDEs) standard developed by the TRACK-TBI Neuroimaging Working Group. The 3T MRI protocols (implemented on systems from General Electric, Philips, and Siemens) complemented those used in the Alzheimer's Disease Neuroimaging Initiative (ADNI), with TR/TE = 2200/2.96 ms, an effective TI of 880 ms, an echo spacing of 7.1 ms, a bandwidth of 240 Hz/pixel, and a total scan time of 4 minutes and 56 seconds. The data are available for download from https://fitbir.nih.gov. To train the unsupervised deep learning model (a variational autoencoder for lesion delineation, as described in Section 2.3.1.2), we used 2D slices of brain MRIs from a combined group of 41 TBI subjects (33M/8F, age range 18-82 yrs) from the Maryland TBI MagNeTs study [99] and 97 TBI subjects (70M/27F, age range 11-73 yrs) from the TRACK-TBI Pilot study [283]. These TBI data were taken from patients without PTE and are strictly distinct from the set of 72 subjects that we subsequently used for statistical testing and PTE prediction.

2.3.1.2 Methods

Preprocessing: Pre-processing of the MR datasets was performed using the BrainSuite software (https://brainsuite.org).
The three modalities (T1, T2, FLAIR) were co-registered with each other by registering T2 and FLAIR to T1, and the result was co-registered to the MNI atlas (Colin 27 Average Brain) [53] by registering the T1 images to the MNI atlas using a rigid (translation, scaling, and rotation) transformation model. As a result, all three image modalities were registered to a common MNI space at 1 mm3 resolution. Skull and other non-brain tissue were removed using BrainSuite [133]: brain extraction was performed by stripping away the skull, scalp, and any non-brain tissue from the image. This was followed by tissue classification and generation of the inner and pial cortex surfaces. Subsequently, for training and validation of the lesion detection model, all images were reshaped to 128 × 128 pixels and histogram-equalized to a lesion-free subject. The extracted cortical surface representations and brain image volumes for each subject were jointly registered to the BCI-DNI Brain Atlas (http://brainsuite.org/svreg_atlas_description/) [129] using BrainSuite's Surface-Volume Registration (SVReg18a) module [132, 130]. The BCI-DNI brain atlas is derived from a single subject, with parcellation defined by anatomical landmarks. SVReg uses anatomical information from both the surface and volume of the brain for accurate automated co-registration, which allows consistent surface and volume mapping to a labeled atlas. This co-registration establishes a one-to-one correspondence between individual subjects' T1 MRIs and the BCI-DNI brain atlas. The deformation map between the subject and the atlas encodes the deformation field that transforms images between the subject and the atlas. We used the BrainSuite fMRI Pipeline (BFP) to process the rs-fMRI subject data and generated grayordinate representations of the preprocessed rs-fMRI signals [94]. BFP is a software workflow that processes fMRI and T1-weighted MR data using a combination of software that includes BrainSuite, AFNI, FSL, and MATLAB scripts to produce processed fMRI data represented in a common grayordinate system containing both cortical surface vertices and subcortical volume voxels. Starting from raw T1 and fMRI images, BFP produces processed fMRI data co-registered with BrainSuite's BCI-DNI atlas and includes both volumetric and grayordinate representations of the data.

MRI Based Measures: Lesion Detection. To extract lesions from the anatomical MRI data, we used an unsupervised framework that we recently developed to automatically detect lesions in MR data. The method, which has been validated on other datasets, is based on a variational auto-encoder (VAE), a class of autoencoders whose latent representation can be used as a generative model [48, 142, 15]. When the VAE is trained using nominally healthy (lesion-free) imaging data, the network learns to encode normal brain images. As a result, applying such a model to an image that contains a lesion will yield a VAE-decoded image that does not contain anomalies: the lesions can then be identified from the differences between the original and VAE-decoded images. One complication here is that we did not have access to normal imaging data with characteristics matching those of the PTE dataset. Instead, we trained the VAE using the T1-weighted, T2-weighted, and FLAIR MRIs in the Maryland TBI MagNeTs dataset [99], leveraging the VAE's robustness to outliers [12].
While lesions are present in most of the volumetric TBI images, they are typically confined to a limited region in each brain, so that in any particular anatomical region (at the scale of the major gyri delineated in the BCI-DNI atlas) the fraction of images with lesions is relatively low. In this study, the lesions were delineated based on the VAE reconstruction error in the FLAIR images. We have previously evaluated the performance of this lesion detection algorithm using an independent validation set with delineated lesions [13, 15].

Statistical analyses: Once we determined the lesions using the methods described above, we analyzed the VAE lesion maps using a one-sided nonparametric (permutation-based) F-test to determine whether there were any statistically significant differences in the variances of the lesion maps across the PTE and non-PTE TBI groups [5]. The null hypothesis is that the variance of the lesion maps across subjects in the PTE group is less than or equal to that of the non-PTE group. Our decision to use the F-test was guided by the fact that traumatic brain injuries affect different areas in different subjects across the groups, so that consistently localized differences between the two groups were expected to be very unlikely. However, a higher frequency of lesions in a particular region should result in a higher sample variance in the lesion maps. This rationale was in fact supported by our observation that assessing differences in the group means using a standard t-test did not show any significant effects. To apply the F-test with permutations, we computed point-wise group variances and computed their ratio to obtain an unpermuted F-statistic. We then permuted the group labels to recompute the F-statistic for nperm = 1000 permutations. The p-value was computed pointwise by comparing the unpermuted F-statistic to the permuted F-statistics. Finally, the resulting pointwise map of p-values was corrected for multiple comparisons using Benjamini-Hochberg FDR correction [32]. Additionally, we performed a regional analysis by quantifying lesion volume from binarized lesion maps in each ROI using the USCLobes brain atlas [129] (http://brainsuite.org/usclobes-description). The USCLobes atlas segments the brain into larger regions (lobes) than those provided by the default BrainSuite atlases. It has 15 ROIs delineated on the volumetric labels of the atlas: (L/R) Frontal Lobe, (L/R) Parietal Lobe, (L/R) Temporal Lobe, (L/R) Occipital Lobe, (L/R) Insula, (L/R) Cingulate, Brainstem, Cerebellum, and Corpus Callosum. To identify lesions as binary masks, a one-class SVM [288] was applied to the VAE lesion maps at each voxel and across subjects to identify subjects with abnormally large errors (i.e., discrepancies between the original input image and the VAE-decoded image) at that voxel [288, 74]. We defined the outliers marked by the one-class SVM as lesions and computed lesion volumes per ROI by counting the number of outlier voxels in each ROI for each subject. In the following, we consider this measure to be a proxy for an ROI-wise lesion volume, which we use for PTE prediction.

fMRI Based Measures: Connectivity. We compute ROI-wise connectivity using rs-fMRI data and the USCLobes ROIs. The BFP fMRI pipeline produces a standardized fMRI signal in the grayordinate system that is in point-wise correspondence across subjects. Using the USCLobes parcellation with respect to the grayordinates, we computed the ROI-wise signal by averaging over each ROI (the full feature construction, including the correlation and Fisher z-transform steps described next, is sketched below).
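A minimal numpy sketch of this connectivity feature construction, covering the ROI averaging just described together with the correlation and Fisher z-transform steps that follow; the array names, shapes, and label conventions are hypothetical.

```python
import numpy as np

def connectivity_features(ts, labels, roi_ids):
    """Build an ROI-wise connectivity feature vector from rs-fMRI data.

    ts: (n_grayordinates, n_timepoints) preprocessed time series
    labels: (n_grayordinates,) ROI label for each grayordinate
    roi_ids: the 15 USCLobes ROI labels
    """
    # Average the time series over each ROI
    roi_ts = np.stack([ts[labels == r].mean(axis=0) for r in roi_ids])
    # Pearson correlation between the ROI-averaged signals (15 x 15)
    corr = np.corrcoef(roi_ts)
    # Keep the upper triangle and apply the Fisher z-transform
    iu = np.triu_indices(len(roi_ids), k=1)
    return np.arctanh(corr[iu])

# Hypothetical usage with placeholder data
ts = np.random.randn(10000, 153)          # 153 volumes, as in the protocol
labels = np.random.randint(1, 16, 10000)
features = connectivity_features(ts, labels, list(range(1, 16)))
print(features.shape)                      # (105,) = 15*14/2 features per subject
```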
A 15 × 15 matrix was then computed from the Pearson correlations of the averaged fMRI signals between each pair of the 15 ROIs in the USCLobes atlas. We used the elements of the upper triangle of the correlation matrix as a feature vector and applied the Fisher z-transform to normalize the feature distribution. This feature vector was subsequently used for classification.

fMRI Based Measures: ALFF. The amplitude of low-frequency fluctuation (ALFF) [298, 263] is an rs-fMRI-based metric that measures the magnitude of spontaneous fluctuations in BOLD-fMRI signal intensity for a given region. We calculated the ALFF metric as the signal power in a frequency band defined by low- and high-frequency cutoffs, which we set to 0.01 Hz and 0.1 Hz, respectively. The ALFF measure was first computed in the native fMRI space and then mapped to the mid-cortical surface of the USCBrain atlas [129] using the BrainSuite registration method described above.

Statistical analyses: Similar to the statistical assessment of lesion differences, we used a one-sided F-test (null hypothesis: the variance in the PTE population is less than or equal to that in the non-PTE population). We tested for significance using a permutation method (nperm = 1000) for voxel-wise group-level comparison of variance across the PTE and non-PTE groups. The resulting p-values were corrected for multiple comparisons using the false discovery rate (Benjamini-Hochberg procedure) [32]. As a result, we end up with a voxel-wise p-value map in the USCBrain atlas space, indicating local differences in slow-frequency fluctuations of BOLD between the PTE and non-PTE groups.

PTE Prediction Using Machine Learning: In addition to the statistical analyses described above, we examined the same brain features using a supervised machine learning framework. This was motivated by several factors. First, the machine learning framework allows us to implement an out-of-sample analysis that assesses the ability of the features extracted from the data to classify individual subjects; this is particularly important when searching for potential biomarkers. Second, tools such as multi-feature classification and feature importance quantification readily provide useful insights into the significance, complementarity, or redundancy across the set of explored features. This is also key when seeking to identify the most efficient prognostic biomarkers. Finally, the diversity of available machine learning algorithms opens novel opportunities to tease apart variable distributions that may be harder to separate using standard statistical tools. In the machine learning pipeline implemented in the present study, we used the anatomical and functional features described above (i.e., lesion information, connectivity, and ALFF) that were extracted from MR imaging data collected during the early (acute) phase prior to the onset of PTE. The goal of the data-driven classifier approach is to build a model that learns to distinguish between PTE and non-PTE subjects using labeled training data. We use a leave-one-out stratified cross-validation scheme to reduce the risks of selection bias and overfitting. To fine-tune model hyperparameters while adhering to a strict separation of training and test data, we used a standard nested cross-validation procedure. To assess the feasibility of building models that can predict PTE from structural and functional MR data, we implemented a multi-feature binary classification framework using three distinct types of algorithms (details below).
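A minimal scikit-learn sketch of this nested scheme, using PCA for dimensionality reduction and an RBF-kernel SVM as the inner estimator (the classifiers themselves are described below). For brevity, stratified k-fold cross-validation stands in for the leave-one-pair-out loop, and the feature matrix and grid values are placeholders.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Placeholder inputs: one row per subject (lesion, connectivity, and ALFF
# features concatenated) with binary PTE labels
X = np.random.randn(72, 120)
y = np.array([0, 1] * 36)

pipe = Pipeline([
    ("scale", StandardScaler()),   # zero mean, unit variance
    ("pca", PCA()),                # dimensionality reduction
    ("clf", SVC(kernel="rbf")),    # KSVM with an RBF kernel
])
grid = {"pca__n_components": [5, 10, 20], "clf__C": [0.1, 1, 10]}

# Inner loop tunes the hyperparameters; outer loop estimates out-of-sample AUC
inner = GridSearchCV(pipe, grid, cv=StratifiedKFold(5), scoring="roc_auc")
scores = cross_val_score(inner, X, y, scoring="roc_auc",
                         cv=StratifiedKFold(5, shuffle=True, random_state=0))
print(f"AUC = {scores.mean():.2f} +/- {scores.std():.2f}")
```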
We concatenated the extracted features into an input vector, to which we applied PCA to reduce the dimensionality of the feature space [77, 214]. We use the area under the receiver operating characteristic curve (AUC) as the primary performance evaluation metric. We applied the following three machine learning algorithms:

Random Forests [39]: We used a random forest classifier, an ensemble learning method that works by training multiple decision trees on random subsets of the data and then averaging the predictions of each tree to make a final prediction. This technique reduces overfitting and improves the overall accuracy of the model compared to using a single decision tree. At each iteration, a random subsample of the data is taken, and a new decision tree is fit. This process is repeated multiple times, and the final output is the majority vote of all the decision trees.

Support Vector Machines (SVMs) and Kernel Support Vector Machines (KSVMs): The basic idea underlying the SVM is to find the hyperplane in a high-dimensional space that maximally separates the different classes [56]. The data points that are closest to the hyperplane are called support vectors and have the greatest impact on the position of the hyperplane. Once the hyperplane is found, new data can be easily classified by determining on which side of the hyperplane they fall. By contrast, a kernel support vector machine (KSVM) is an extension of the basic SVM algorithm that uses a kernel operator to map the input data into a higher-dimensional space, in which they can be more easily separated. The use of a kernel allows the SVM to handle non-linearly separable data by finding a higher-dimensional space in which they are linearly separable. One of the most popular kernels for this purpose, and the one used here, is the radial basis function (RBF).

Multi-Layer Perceptron: We also used a multilayer perceptron (MLP), a feedforward artificial neural network in which the input is passed through multiple layers of artificial neurons. Each layer applies a non-linear transformation to the input before passing it to the next layer [162]. MLPs are trained using back-propagation and stochastic gradient descent. The MLP model we used here consisted of 3 hidden layers (32, 16, and 16 neurons, respectively). While more complex and deeper neural network architectures are available, we chose to use a simple MLP given the limited size of the data at hand (N=72). We expected the RF and SVM algorithms to be more suitable for our classification task, but included the MLP method for the sake of comparison.

2.3.2 Results

2.3.2.1 Lesion Analysis

To compare lesion patterns across the PTE and non-PTE patients, we assessed group differences in lesion scores, defined as the difference between the grayscale values in the original anatomical MR images and the VAE-decoded versions thereof (see Methods). Statistical assessment using the F-test revealed statistically significant differences (p < 0.05, corrected) between the two groups in multiple brain areas (Figure 2.18). These differences were prominent in the left and right temporal lobes, the right occipital lobe, and the cerebellum, reflecting higher variability of lesion scores in these areas in PTE patients.

Table 2.3: PTE vs non-PTE group comparison of lesion and ALFF measures (p-values obtained using an F-test).
In addition, our lobe-wise analysis (Table 2.3) yielded results consistent with the voxel-wise analysis, confirming an increased variance in the PTE population relative to non-PTE subjects in the same regions identified in the grayordinate-wise analysis.

2.3.2.2 PTE-related Modulations of ALFF

The results of the lobar analysis of ALFF (Table 2.3) are consistent with the grayordinate-wise analysis (Figure 2.19), both showing an increased variance in the PTE population relative to non-PTE subjects in the right temporal lobe, both occipital lobes, and the right parietal lobe. The asymmetry in the lobe-based analysis is possibly due to the limited sample size.

Figure 2.18: Voxel-based PTE vs. non-PTE group comparison of lesion maps overlaid on the USCBrain atlas. The color code depicts F-values, shown in regions where the p-value < 0.05, resulting from the F-test (with permutations). Prominent significant clusters are located in the left temporal lobe, bilateral occipital lobes, cerebellum, and right parietal lobe.

Figure 2.19: Differences in ALFF between the PTE and non-PTE groups. The results are color-coded F-statistics thresholded by FDR-corrected p-values (p < 0.05) derived using a permutation test. Significant clusters are visible in the left temporal lobe, bilateral occipital lobes, cerebellum, and right parietal lobe.

Table 2.4: Classification accuracy of PTE vs. non-PTE subjects using different classifiers and feature types. The mean and standard deviation of the AUC are shown for KSVM, SVM, RF, and NN. The last column shows the performance obtained when the models were trained simultaneously on all three feature types.

2.3.2.3 Classification of PTE and non-PTE Subjects using Machine Learning

To test the feasibility of training an ML model to distinguish PTE and non-PTE data, we performed leave-one-pair-out nested stratified cross-validation over 100 iterations. Nesting was used for parameter tuning (the number of PCA components and the model hyperparameters). Single-feature classification, using either the lesion, connectivity, or ALFF metrics, was followed by a multi-feature classification approach combining all three features. The input features were computed for distinct ROIs, based on the USCLobes atlas, and were normalized to zero mean and unit variance. From Table 2.4, we can see that combining all three feature types yields the best model performance in terms of the AUC scores. This is likely a reflection of the complementarity of the information about PTE captured across the lesion, connectivity, and ALFF data. Among the four ML methods we used, KSVM achieved the best performance. This is probably due to high variability in the feature space, and improved feature separation through mapping to a higher-dimensional space. The neural network performed relatively poorly on this classification, which, given the moderate training sample size, is not surprising. In addition to determining the feature space based on the USCLobes atlas, we also computed ROI-based features using atlases with a larger number of parcels. To this end, we used the USCBrain atlas, the BCI-DNI atlas [129], and the AAL3 atlas [219]. Running the classification pipeline based on these atlases did not lead to any significant improvements in PTE prediction. In order to gain further insights and improve the interpretability of these ML results, we sought to assess the distinct spatial contributions of the lesion features to the overall prediction score. To this end, we computed feature importance maps derived from the positive SVM model coefficients.
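One reasonable way to derive such maps, shown below as a hedged sketch, is to fit a linear SVM in PCA space and project its weight vector back onto the original ROI features, keeping the positive coefficients. This illustrates the idea under stated assumptions rather than reproducing the exact procedure used here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Hypothetical ROI-wise lesion-volume features and PTE labels
X = np.random.randn(72, 15)
y = np.array([0, 1] * 36)

pca = PCA(n_components=10).fit(X)
svm = SVC(kernel="linear").fit(pca.transform(X), y)

# Map the SVM weights from PCA space back to ROI space:
# w_roi = components^T @ w_pca; keep the positive weights (features
# pushing the decision toward the PTE class)
w_roi = pca.components_.T @ svm.coef_.ravel()
importance = np.clip(w_roi, 0, None)
for roi, imp in sorted(enumerate(importance), key=lambda t: -t[1])[:3]:
    print(f"ROI {roi}: importance {imp:.3f}")
```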
Figure 2.20 shows the lesion variability maps (in non-PTE and PTE subjects), followed by the SVM feature importances across the USCBrain atlas. Comparison between the lesion and feature importance maps points towards a reasonable spatial overlap between the two. Most importantly, this analysis suggests that the lesion volume data that contribute most to distinguishing PTE and non-PTE subjects are located in the right temporal and left prefrontal cortices.

2.3.3 Discussion

The complex pathophysiology as well as the variability in the degree of severity of TBI pose a significant challenge to any research in this field. TBI, with the resultant coup injury (focal cortical injury) and contrecoup injury (secondary contusion opposite the coup injury), results in a great variety of cortical lesion sites, which in turn gives rise to a diversity of neurocognitive impairments [180]. The cortical injury inflicted by TBI is well known to have some predilection for the polar regions of the brain, particularly the temporal and frontal lobes [84], owing to their adjacency to the bony structures of the skull. However, the presence of cortical lesions in specific regions has not been consistently linked to the development of PTE. In one study, a left parietal lobe lesion and the presence of hemosiderin staining were linked to the development of PTE [210]. Yet in other studies, rather than a specific cortical region, the degree of leakage across the blood-brain barrier around cortical sites after TBI was a prognostic marker for the development of PTE [202]. Due to the extreme heterogeneity of PTE, a reliable biomarker to enhance prediction of PTE is highly desirable. A number of promising and novel therapeutic interventions targeting the complex pathophysiology of TBI have shown promise in preclinical and phase I/II trials, yet have gone on to fail in phase III clinical trials [180]. One reason for failure could be the large sample size required to determine benefit if the effect size of the intervention is small. The use of biomarkers could enrich the target population so as to maximize the likelihood of discovery of a potential therapeutic intervention.

Figure 2.20: (a) Brain-wide mean lesion volume variability shown for non-PTE (upper row) and PTE subjects (lower row). (b) Feature importance map shown as color-coded ROIs overlaid on the USCBrain atlas. Both cortical surface and volumetric ROIs are shown.

In this study, we explored the feasibility of predicting PTE using functional and structural imaging features consisting of fMRI connectivity, lesion volumes, and the amplitude of low-frequency fluctuation in fMRI data. In particular, we assessed the performance of widely employed machine learning algorithms in predicting post-traumatic epilepsy using these features. The aim was two-fold: (1) to assess the feasibility of building predictive ML algorithms for PTE based on functional and structural brain features, and (2) to leverage this data-driven framework to pinpoint discriminant brain features that may provide useful mechanistic insights into the clinical underpinnings of PTE. Among the machine learning models examined here, KSVM combined with a standard PCA approach for feature dimensionality reduction led to the highest prediction score. Additionally, our results suggest that combining all three feature types (connectivity, lesion, and ALFF data) leads to better prediction than using only one of the three types of features.
The feasibility of successfully training a model to discriminate PTE from non-PTE subjects demonstrates that complementary brain features extracted from multi-spectral MR imaging can collectively capture anatomo-functional alterations that underpin PTE. The ability of the ML approach to generalize to data from individuals who were not used in training the classifier (i.e., via cross-validation) indicates the feasibility of using the identified features to predict, with a reasonable degree of accuracy, whether a patient who suffers a traumatic brain injury is likely to go on to develop PTE. Our results show a maximum AUC of 0.78 using KSVM. Improvements in this value may result from the use of larger training sets, as we discuss below. Although we employed several methods to compare the PTE and non-PTE groups, we observed reasonable consistency across the results: the group difference analyses based on F-tests, the ML classifier results, and the feature importance maps (based on the SVM coefficients) provide converging evidence for alterations in the temporal and occipital cortices, and to some extent in the cerebellum, in both functional and structural features. We note that the discrepancy between some of the structural and functional patterns observed in the right parietal regions may be due to the functional connectivity between parietal areas and epilepsy-related networks. Furthermore, it is noteworthy that while the findings from the lesion-based analysis were largely left-right symmetric, the ALFF-based analysis results showed a certain degree of asymmetry. To probe this further, we compared lesion volumes in left-hemisphere ROIs and in the corresponding right-hemisphere ROIs. This revealed that the left-right lesion volume differences were not significant for any of the ROIs. In the ALFF-based analysis, the region-wise results were largely symmetric, but in the left temporal and left parietal lobes, the p-values approached significance. The p-value in the ALFF analysis for the cerebellum also approached significance. The AUC metric is a commonly used performance criterion in binary classification. While performance requirements vary across tasks and application domains, an AUC score of 0.7 is frequently used to indicate minimally acceptable discrimination [229, 243, 246]. However, a higher AUC is often necessary to stratify a significant portion of the population into high-risk or low-risk subjects. [229] suggested that to classify the majority of the population into a clinically distinct risk group (high or low risk), an AUC of 0.85 was needed. In our analysis, the highest performance (AUC=0.77) was obtained with the KSVM algorithm in a total sample size of 72 individuals (36 PTE and 36 non-PTE subjects). To understand the impact of the number of subjects on the AUC, we repeated the ML analysis for different subsets of subjects (maintaining balanced sample sizes for PTE and non-PTE subjects). Figure 2.21 depicts the AUC's mean and variance based on stratified cross-validation as a function of the number of training subjects. Our results suggest that the AUC starts to increase monotonically after 26 subjects. A quadratically fit curve that extrapolates this trend is shown as a dotted red line. We fit a curve to the last 6 data points to obtain an over-optimistic estimate of 0.98 AUC for 50 PTE subjects. By fitting the last 10 points, we obtain a more conservative AUC value of 0.85 for 50 PTE subjects.
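The extrapolation itself is a simple polynomial fit to the tail of the learning curve; a sketch with placeholder (sample size, AUC) pairs standing in for the measured values is shown below.

```python
import numpy as np

# Hypothetical learning-curve measurements: number of subjects per group
# vs. mean cross-validated AUC (placeholder values, not the actual data)
n = np.arange(16, 38, 2)              # 16, 18, ..., 36 subjects
auc = 0.5 + 0.27 * (n - 16) / 20      # placeholder upward trend

# Quadratic fits to the tail of the curve, extrapolated to 50 subjects;
# fitting fewer (more recent) points yields the more optimistic estimate
for k in (6, 10):
    coeffs = np.polyfit(n[-k:], auc[-k:], deg=2)
    print(f"fit to last {k} points -> AUC(50) ~ {np.polyval(coeffs, 50):.2f}")
```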
Assuming a 15% prevalence of PTE in TBI subjects, this analysis suggests that N = 300 would yield the desired AUC with the existing algorithm. This is, of course, an attempt to extrapolate our results to outline a pragmatic and clinically relevant approach to using our method for the prediction of PTE, and a number of unknown variables and factors can influence this analysis. These include the heterogeneity of TBI and the wide range of properties of TBI patients (including demographics, types and severity of injuries, treatment at the acute stage, etc.). In particular, the properties of the Maryland data used for this analysis can be quite different from those of other datasets.

2.3.3.1 Limitations

Despite the encouraging and clinically relevant observations reported here, the proposed PTE prediction method is a first step that paves the way for more efficient and elaborate prediction approaches. Among relevant future steps, we believe that increasing the size of the training data and adding features such as diffusion imaging data may lead to an improved AUC. Similarly, prediction is likely to benefit from incorporating non-imaging clinical data, such as scores on the Glasgow Coma Scale (GCS), which grades the severity of traumatic brain injuries, alongside demographics and injury mechanisms. As noted earlier, this information was unfortunately not available in the public domain for the cohort we used in this study. Furthermore, including information on the type of epilepsy may turn out to be useful for training the ML models. This and other clinical information can also be used for further sub-grouping of the clinical population.

2.3.4 Conclusion

In this paper, we investigated the utility of functional and structural brain features as PTE biomarkers. Leveraging a machine-learning framework, we compared PTE prediction performance across an array of standard classifiers and a variety of brain features. The best results were obtained with KSVM, possibly due in part to the heterogeneity of the alterations in the PTE group around the mean feature.

Figure 2.21: Number of samples vs. AUC resulting from the KSVM (PCA) method. The blue curve shows the mean AUC and shaded areas indicate the standard deviation in the leave-one-out stratified cross-validation. Conservative (red) and optimistic (blue) extrapolations are shown as dotted curves.

Our results using kernel-based methods are promising. In both the lesion and ALFF comparison studies, the bilateral temporal lobes and cerebellum show significant effects, and there is potential involvement of the parietal and occipital lobes. Cross-sectional studies of children with chronic localization-related epilepsy (LRE) using traditional volumetric and voxel-based morphometry have revealed abnormalities in the cerebellum, frontal and temporal lobes, hippocampus, amygdala, and thalamus [54, 65, 97, 156, 157, 155, 253]. The temporal lobe findings from our study further support this evidence. One of the limitations of our study is its relatively small population size (N = 74). A larger study using the TrackTBI dataset might be possible in the future. The leave-one-out cross-validation shows relatively stable prediction performance; however, the AUC can likely be improved further with a larger dataset.

2.4 Semi-supervised Learning using Robust Loss

The amount of manually labeled data is limited in medical applications, so semi-supervised learning and automatic labeling strategies can be an asset for training deep neural networks.
However, the quality of the automatically generated labels can be uneven and inferior to that of manual labels. In this paper, we suggest a semi-supervised training strategy that leverages both manually labeled data and extra unlabeled data. In contrast to existing approaches, we apply a robust loss to the automatically labeled data to compensate for their uneven quality within a teacher-student framework. First, we generate pseudo-labels for unlabeled data using a teacher model pre-trained on labeled data. These pseudo-labels are noisy, and using them along with labeled data for training a deep neural network can severely degrade the learned feature representations and the generalization of the network. Here we mitigate the effect of these pseudo-labels by using robust loss functions; specifically, we use three robust loss functions, namely beta cross-entropy, symmetric cross-entropy, and generalized cross-entropy. We show that our proposed strategy improves model performance by compensating for the uneven quality of labels in image classification as well as segmentation applications.

Deep neural networks usually require a large amount of labeled training data to achieve good performance. However, manual annotations, especially for medical images, are very time-consuming and costly to acquire. It is therefore desirable to incorporate extra knowledge from unlabeled data into the training process to assist supervised training. The dominant methods that leverage unlabeled data for classification, and specifically for semantic segmentation, include (1) consistency training [85, 138], which ensures consistency of predictions in the presence of various perturbations: a standard supervised loss term (e.g., cross-entropy loss) is combined with an unsupervised consistency loss term that enforces consistent predictions in response to perturbations applied to unsupervised samples; and (2) pseudo-labeling [293, 296, 163, 297, 50], in which labels for the unlabeled images are generated by another, or even the same, neural network trained on the labeled images. Generating pseudo-labels is a straightforward and effective way to enrich supervised information [50]. However, pseudo-labeled datasets inevitably include mislabeled data that introduce noise. These weakly labeled data can have a disproportionate impact on the learning process, and the model may overfit to the outliers [287]. A major challenge in using auto-labeled data for training is accounting for noise in the pseudo-labels [213, 293, 296, 163, 297]. Several approaches deal with noise in pseudo-labeled data by heuristically controlling their use: (1) lowering the ratio of pseudo-labels in each mini-batch [296]; (2) selecting pseudo-labels with high confidence [297]; or (3) assigning lower weights to pseudo-labels when computing the loss [213]. Alternative approaches that mitigate the effect of noisy labels can be categorized into three classes: (1) label correction methods that improve the quality of raw labels by modeling characteristics of the noise and correcting incorrect labels [273]; (2) methods with robust losses that are inherently robust to labeling errors [266]; and (3) refined adaptive training strategies that are more robust to noisy labels [282]. Here we focus on robust loss functions, which offer a theoretically grounded approach to the noisy-label problem [266].
Previous studies have shown that some loss functions, such as the Mean Absolute Error (MAE), that were originally designed for regression problems can also be used in classification settings [93]. However, training with MAE has been found to be very challenging because of the gradient saturation issue [289]. The Generalized Cross-Entropy (GCE) loss [289] applies a Box-Cox transformation to the probabilities and has been shown to be a generalized mixture of MAE and Cross-Entropy (CE). Using a similar idea of applying a power-law function, the beta cross-entropy (BCE) loss has been developed to mitigate the effect of noise in the training data [6, 12]. Minimizing beta cross-entropy is equivalent to minimizing beta-divergence [27], which is the robust counterpart of KL-divergence. BCE has an extra normalization term compared to the GCE loss. Another study proposed the Symmetric Cross-Entropy (SCE) loss, which combines Reverse Cross-Entropy (RCE) with the CE loss [266].

Here we develop a semi-supervised learning strategy that utilizes both labeled and weakly labeled data, adopting a teacher-student training framework [274, 50]. We propose to first generate pseudo-labels for the unlabeled data using a teacher model trained on ground-truth labels. We then train a student model using a combination of ground-truth and pseudo-labels, applying a robust loss to enhance model robustness to noise in the pseudo-labels, so that supervised and weakly supervised knowledge are combined in an optimized learning strategy. We demonstrate the effectiveness of the proposed strategy on a simple classification task and a brain tumor segmentation task. In this work, our contribution is three-fold:

• We improved model performance when only limited labeled data is available, which is especially meaningful for medical images.
• We proposed a simple yet effective semi-supervised learning strategy by introducing a plug-and-play module: the robust loss function.
• The proposed strategy is agnostic to the specific model architecture and can be applied to various segmentation, classification, or even regression tasks.

2.4.1 Materials and Methods

First, we introduce the robust loss functions used for handling noise in pseudo-labels in Sec. 2.4.2 and illustrate their utility using a simulation. We then describe our semi-supervised strategy in Sec. 2.4.3, which employs a teacher-student framework [274]. In a multi-class classification setting, the most commonly used loss function is the multivariate cross-entropy, given by:

\mathcal{L}_{CE} = \sum_{k=1}^{K} -q(k|x) \log p(k|x)    (2.27)

where x is the input variable, p(k|x) is the probability output of a deep neural network (DNN) classifier, K is the number of classes, and q(k|x) is the one-hot encoding of the label. We use this as a baseline in our experiments below; however, this loss function is susceptible to label noise, which we address with the robust approaches described next.

2.4.2 Robust Loss Functions

We evaluated three different robust loss functions to reduce the effect of pseudo-label noise. The Generalized Cross-Entropy (GCE) loss is defined as follows:

\mathcal{L}_{GCE} = \frac{1 - p(y|x)^{q}}{q}    (2.28)

Figure 2.22: Framework: We develop a teacher-student semi-supervised framework using both manually labeled and pseudo-labeled data.
We propose to first generate pseudo-labels of the unlabeled data using a model trained on manually labeled ground-truth data; we then train a second model using these ground-truth labels and the generated pseudo-labels simultaneously, applying a robust loss to enhance robustness to noise in the pseudo-labels.

GCE applies a Box-Cox transformation to the probabilities (a power-law function). Using L'Hôpital's rule, it can be shown that GCE is equivalent to CE in the limit q → 0 and to MAE when q → 1, so this loss is a generalization of CE and MAE [289]. The Beta Cross-Entropy (BCE) loss can be expressed as:

\mathcal{L}_{BCE} = \frac{\beta + 1}{\beta}\left(1 - p(y|x)^{\beta}\right) + \sum_{k=1}^{K} p(k|x)^{\beta + 1}    (2.29)

BCE minimizes the β-divergence [27] between the posterior and empirical distributions when the posterior is a categorical distribution [13, 12]. Using L'Hôpital's rule, it can be shown that BCE is equivalent to CE in the limit β → 0, where β-divergence also converges to KL-divergence. β-divergence is the robust counterpart of KL-divergence obtained using a power-law function. BCE has an extra regularization term compared to GCE. This loss had not previously been applied to classification tasks [6, 12].

The Reverse Cross-Entropy (RCE) loss is defined as [266]:

\mathcal{L}_{RCE} = \sum_{k=1}^{K} -p(k|x) \log q(k|x) = -p(y|x)\log 1 - \sum_{k \neq y} p(k|x)\, A = -A \sum_{k \neq y} p(k|x) = -A\,(1 - p(y|x))    (2.30)

where A is the smoothed/clipped replacement of log 0. MAE is a special case of RCE at A = −2. RCE has also been proved to be robust to label noise, and it can be combined with CE to obtain the Symmetric Cross-Entropy (SCE), \mathcal{L}_{SCE} = \alpha \mathcal{L}_{CE} + \gamma \mathcal{L}_{RCE} [266], where α and γ are hyper-parameters.

As an illustrative example (Figure 2.23), we created a simulated dataset with three classes using Gaussian mixtures plus outliers. The figure shows data color-coded according to class. A single-layer perceptron was trained with the multivariate cross-entropy loss function (Eq. 2.27), which is a non-robust loss. The decision boundary determined by the network can be seen to be impacted by the presence of outliers in the data. The procedure was repeated by training the perceptron with the multivariate β-cross-entropy loss (Eq. 2.29), which is a robust loss. It can be seen that the decision boundaries, in this case, are minimally impacted by the presence of outliers.

Figure 2.23: An illustrative example of using a robust loss for classification: (left) a simulated dataset with three distinct classes and mislabeled subclass data, shown as a smaller cluster on the bottom left; (middle) the decision boundary computed using a single-layer perceptron and the multivariate cross-entropy loss (non-robust); (right) the decision boundary computed using a single-layer perceptron with the multivariate beta cross-entropy (robust loss). It can be seen that the decision boundary computed using the non-robust loss is affected by the mislabeled outliers, whereas the decision boundaries calculated using the robust loss are minimally impacted by the mislabeled data.

2.4.3 Training using Unlabeled Data and a Robust Loss Function

We introduce a teacher-student framework [274] to perform semi-supervised learning for classification and its special case, semantic segmentation. We assume we have a small number of labeled training samples and a large set of extra unlabeled samples. We first train a teacher model with the standard cross-entropy loss using the labeled set and then generate pseudo-labels for the unlabeled set. The quality of the pseudo-labels depends on the performance and generalizability of the teacher model: when the number of labeled samples is smaller, the generated pseudo-labels become noisier. We then combine the true and pseudo-labels to re-train a student model. We used CE for the human-annotated labels and the robust loss functions for the pseudo-labels. An illustration of our framework is shown in Fig. 2.22.
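As a concrete illustration of Eqs. (2.28)-(2.30), a minimal PyTorch sketch of the three robust losses is given below. The hyper-parameter defaults are illustrative, not the tuned values reported in the experiments.

```python
# Minimal sketches of the robust losses from Eqs. (2.28)-(2.30); defaults are
# illustrative only. All functions take raw logits of shape (B, K) and integer
# class targets of shape (B,).
import torch
import torch.nn.functional as F

def gce_loss(logits, target, q=0.7):
    """Generalized cross-entropy (Eq. 2.28): (1 - p_y^q) / q."""
    p_y = F.softmax(logits, dim=1).gather(1, target.unsqueeze(1)).squeeze(1)
    return ((1.0 - p_y.clamp_min(1e-7) ** q) / q).mean()

def bce_robust_loss(logits, target, beta=0.1):
    """Beta cross-entropy (Eq. 2.29):
    (beta+1)/beta * (1 - p_y^beta) + sum_k p_k^(beta+1)."""
    p = F.softmax(logits, dim=1).clamp_min(1e-7)
    p_y = p.gather(1, target.unsqueeze(1)).squeeze(1)
    return ((beta + 1.0) / beta * (1.0 - p_y ** beta)
            + (p ** (beta + 1.0)).sum(dim=1)).mean()

def sce_loss(logits, target, alpha=0.1, gamma=1.0, A=-4.0):
    """Symmetric cross-entropy: alpha*CE + gamma*RCE, with RCE = -A*(1 - p_y)
    from Eq. (2.30); A < 0 is the clipped replacement of log 0."""
    ce = F.cross_entropy(logits, target)
    p_y = F.softmax(logits, dim=1).gather(1, target.unsqueeze(1)).squeeze(1)
    rce = (-A * (1.0 - p_y)).mean()
    return alpha * ce + gamma * rce
```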
2.4.4 Experiments and Results

First, we investigated the application of the robust loss functions in a noisy brain segmentation task. Then, to explore the effectiveness of our semi-supervised strategy, we performed image classification and segmentation tasks: a simple classification task on the CIFAR-10 dataset [147], chosen because robust loss functions had previously been applied to this dataset with simulated noisy labels [168] and we wanted to investigate whether these loss functions are also useful for semi-supervised learning on the same data; and a brain tumor segmentation task on the BraTS 2018 dataset [174, 25, 26], to further explore the benefits of the proposed strategy for medical imaging applications. To create a semi-supervised experimental setting, we first divide the training set into two groups, with a specified percentage p of subjects forming the labeled group and the remaining subjects the unlabeled group. Denoting the total number of training subjects by N, we define the performance of the teacher model trained on N·p subjects as the lower bound and that of the model trained on all N subjects (the entire training set) as the upper bound. We generated pseudo-labels for the N·(1−p) unlabeled subjects using the teacher model trained on the other N·p subjects; a student model was then trained on both the ground-truth labels and the pseudo-labels.

2.4.4.1 Brain Segmentation

Here our goal is MRI head segmentation when only a T1-weighted image is available. Head segmentation (including gray and white matter, CSF, skull, scalp, and other extra-cranial tissue) is commonly used for measuring and visualizing the brain's anatomical structures and is also a necessary step for other applications, for example in generating head models for current-source reconstruction in electroencephalography and magnetoencephalography (EEG/MEG) [115]. T1-weighted (T1-W) images are widely available and employ the most cost-effective and fastest pulse sequences for creating accurate head segmentations. However, adding T2-weighted (T2-W) images can improve segmentation of the skull due to their superior contrast, particularly between CSF and skull [182]. Our goal was to develop a fast head segmentation method using only the T1-W modality. We investigated three scenarios: (i) First, we generated head segmentations ("software-generated labels", SGLs) using SimNIBS (https://simnibs.github.io) with T1-W and T2-W images as input, and trained a CNN using only T1-W as input to predict these labels. (ii) We obtained SGLs using only T1-W data and trained a CNN using T1-W as input to predict these labels. (iii) We repeated (ii) using a robust loss function to compensate for the poor quality of the training labels for some classes caused by the absence of T2-W images. We used the CamCan dataset (https://www.cam-can.org), which contains T1-W and T2-W MRI images, and used SimNIBS to segment the head into 9 classes: 0 = background (outside head), 1 = WM, 2 = GM, 3 = CSF, 4 = bone, 5 = skin, 6 = cavities, 7 = eyes, 8 = ventricles. We split the data 60%/20%/20% into training/validation/test sets.
We trained a TransUNet [46] for 10 epochs, using 2D slices resized to 256×256 as input. (i) In the first experiment, we generated SGLs using both T1-W and T2-W images as input to SimNIBS and then used the T1-W images as input, with the T1-W+T2-W SGLs as ground-truth labels, for training the TransUNet (T1-W+T2-W TransUNet). (ii) We obtained T1-W SGLs by using T1-W images only as input to SimNIBS; we then used the T1-W images as input and the T1-W SGLs as noisy labels for training the TransUNet with a cross-entropy loss (T1-W TransUNet). (iii) Finally, we repeated experiment (ii) with the robust BCE loss (T1-W TransUNet Robust). The BCE loss is expressed as:

\mathcal{L}_{BCE} = \frac{\beta + 1}{\beta}\left(1 - p(y|x)^{\beta}\right) + \sum_{k=1}^{K} p(k|x)^{\beta + 1}    (2.31)

where x is the input variable, y is the response variable, and β is a hyper-parameter; p(k|x) is the probability output of a deep neural network (DNN) classifier, and K is the number of classes. BCE minimizes the β-divergence between the posterior and empirical distributions when the posterior is a categorical distribution [12]. Using L'Hôpital's rule, it can be shown that BCE is equivalent to CE in the limit β → 0, where β-divergence also converges to KL-divergence. β-divergence is the robust counterpart of KL-divergence obtained using a power-law function. We tuned the parameter β = 0.0001 using the validation data and warmed up the robust model for two epochs using the cross-entropy loss.

Figure 2.24: Segmentation results using different methods.

To measure the performance of our method, we calculated the Dice score between each segmentation and the T1-W+T2-W SGL labels (Table 2.5). The results show that our T1-W+T2-W TransUNet model can be used to predict head segmentations when only a T1-W image is available at inference; the Dice score for bone/skull for this model is improved compared to the T1-W SGL segmentation. Further, as expected, we find an improvement in the Dice score for the robust model trained with T1-W-only SGL labels compared to the non-robust model trained on the same labels. The Dice score between the T1-W+T2-W SGLs and the T1-W SGLs indicates the level of inaccuracy in each label. The degradation of the Dice coefficient for the eye label when using the robust loss is probably due to the low occurrence of these voxels in the training data, leading to their interpretation as outliers; a hybrid CE/BCE loss would address this issue.

Table 2.5: Dice scores relative to ground-truth T1-W+T2-W SGLs for the test dataset

Model                  background  WM    GM    CSF   bones  skin  cavities  eyes  ventricles
T1-W SGL               0.99        0.99  0.98  0.82  0.84   0.94  0.73      0.75  0.99
T1-W+T2-W TransUNet    0.99        0.92  0.90  0.81  0.89   0.94  0.71      0.78  0.94
T1-W TransUNet         0.99        0.93  0.90  0.74  0.83   0.93  0.68      0.74  0.94
T1-W TransUNet Robust  0.99        0.90  0.87  0.78  0.85   0.94  0.72      0.68  0.94

2.4.4.2 CIFAR-10

First, we performed experiments on the CIFAR-10 dataset [147] (50000 training and 10000 test images). We set p = 10% and trained the teacher network with ground-truth labels. We then generated pseudo-labels for the remaining 90% of the data and combined this set with the 10% ground-truth labels to train the student model with CE, GCE, SCE, and BCE. The hyper-parameters for the GCE, SCE, and BCE losses were chosen based on a validation set (10% of the training data): β = 5 for BCE, q = 0.9 for GCE, and α = 0.1, γ = 0.01 for SCE. For both the teacher and student networks, we used ResNet-18 [104] and trained the networks for 150 epochs using stochastic gradient descent [215] (momentum = 0.9, weight decay = 1e-4). The initial learning rate was 0.1 and was reduced to 0.01 after 100 epochs.
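To make the training procedure concrete, a minimal sketch of the student update that mixes the two label sources is given below. The model, optimizer, and loader names are placeholders, and robust_loss can be any of the functions from the earlier sketch (gce_loss, bce_robust_loss, sce_loss).

```python
# Sketch of the teacher-student update: standard CE on the labeled fraction,
# a robust loss on the teacher's pseudo-labels. Names are placeholders.
import torch

@torch.no_grad()
def generate_pseudo_labels(teacher, unlabeled_loader):
    """Run the frozen teacher over unlabeled images and keep its argmax labels."""
    teacher.eval()
    return [(x, teacher(x).argmax(dim=1)) for x in unlabeled_loader]

def train_student_epoch(student, optimizer, labeled_loader, pseudo_loader, robust_loss):
    ce = torch.nn.CrossEntropyLoss()
    student.train()
    for (x_l, y_l), (x_p, y_p) in zip(labeled_loader, pseudo_loader):
        optimizer.zero_grad()
        # CE for human-annotated labels, robust loss for noisy pseudo-labels
        loss = ce(student(x_l), y_l) + robust_loss(student(x_p), y_p)
        loss.backward()
        optimizer.step()
```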
The results are summarized in Table 2.6. All robust losses improved the model performance.

Table 2.6: Performance of the proposed semi-supervised strategy on the CIFAR-10 dataset. Results show the test accuracy of models trained using only 10% of the ground-truth labels.

Dataset             Lower bound  CE      BCE     GCE     SCE     Upper bound
CIFAR10 (Accuracy)  66.17%       69.62%  79.53%  77.68%  78.74%  89.31%

2.4.4.3 Brain Tumour Segmentation

Backbone Model: We adopted a 3-dimensional CNN called TransBTS [265] as our backbone model, which combines the U-Net [221] and Transformer [256] architectures. TransBTS is based on an encoder-decoder structure and takes advantage of the Transformer to learn not only local contextual information but also global semantic correlations [265]. TransBTS achieved superior performance compared to previous state-of-the-art models for the brain tumor segmentation task [265]. We trained the model for 1000 epochs from scratch with an Adam optimizer; the initial learning rate was 0.0002 and the batch size was set to 4.

Dataset and Evaluation: We performed experiments on a publicly available dataset provided by the Brain Tumor Segmentation (BraTS) 2018 challenge [174, 25, 26]. The BraTS 2018 training dataset includes 285 subjects with ground-truth labels. The Magnetic Resonance Images (MRIs) have been registered into a common space, and the image dimension is 240 × 240 × 155. The ground-truth labels contain three tumor tissue classes (necrotic and non-enhancing tumor: label 1; peritumoral edema: label 2; and GD-enhancing tumor: label 4) and background (label 0). We split the 285 subjects into a training set (200 subjects), a validation set (28 subjects), and a test set (57 subjects). We used the validation set to select the hyper-parameter values, which are β = 0.001 for the BCE loss, q = 0.7 for the GCE loss, and α = 0.01, γ = 1.0 for the SCE loss. We used the Dice score to quantitatively evaluate segmentation accuracy for the enhancing tumor region (ET, label 4), the tumor core (TC, labels 1 and 4), and the whole tumor region (WT, labels 1, 2, and 4). To explore the benefits of our strategy when different numbers of ground-truth labels are available, we evaluated the segmentation accuracy with p = 30%, 50%, and 70% of the data considered as labeled. We ran each experiment three times, selecting different subjects as labeled data each time, to compute the mean and standard deviation of the Dice scores. Table 2.7 compares the Dice scores of the segmentation results when different loss functions (CE, BCE, GCE, SCE) are applied during training. From Table 2.7 and Fig. 2.25 we observe that applying a robust loss improves segmentation accuracy compared to the CE loss and the lower bound. The improvement is more significant when only a small amount of labeled data is available: a relatively large proportion of noisy labels has a more negative effect on model performance, and applying a robust loss makes the model less perturbed by the noise.

Table 2.7: Comparison of the mean and standard deviation of Dice scores for different methods on test subjects for different tumor classes (WT, TC, ET).
             WT                                          TC                                          ET
Method       30%           50%           70%            30%           50%           70%            30%           50%           70%
Lower bound  0.834(0.015)  0.845(0.005)  0.864(0.004)   0.722(0.011)  0.746(0.011)  0.772(0.009)   0.613(0.004)  0.634(0.004)  0.648(0.012)
CE           0.842(0.014)  0.858(0.003)  0.870(0.004)   0.747(0.011)  0.768(0.006)  0.781(0.003)   0.634(0.005)  0.648(0.007)  0.655(0.004)
BCE          0.851(0.013)  0.865(0.006)  0.866(0.003)   0.753(0.009)  0.776(0.010)  0.782(0.005)   0.634(0.008)  0.648(0.008)  0.658(0.002)
GCE          0.848(0.013)  0.865(0.005)  0.865(0.002)   0.756(0.001)  0.779(0.007)  0.786(0.004)   0.638(0.004)  0.653(0.001)  0.662(0.003)
SCE          0.850(0.016)  0.858(0.006)  0.871(0.002)   0.758(0.015)  0.777(0.006)  0.774(0.009)   0.641(0.010)  0.649(0.005)  0.658(0.007)

Upper bound (model trained on all labeled data): WT 0.873, TC 0.788, ET 0.668.

The proposed strategy at p = 30% and p = 50% achieved even better performance than the lower-bound model at p = 50% and p = 70%, respectively, and comparable performance to the model with CE loss at p = 50% and p = 70%. When p = 70%, some of the robust-loss results showed slightly worse Dice scores compared to the CE loss for the WT and TC classes. This is probably because the teacher model generates higher-quality pseudo-labels when more ground-truth labels are available; the noise level is then negligible, and these two classes are relatively easy to segment, so adding a robust loss does not further boost model performance. To qualitatively evaluate the segmentation results, we selected two representative test subjects and show the segmentation results produced by the different approaches, along with the ground-truth labels, in Fig. 2.26. Evidently, the segmentation results generated by the models with robust losses are more accurate, which verifies the benefits of our semi-supervised strategy.

Figure 2.25: A graphical illustration of segmentation accuracy improvement using the proposed strategy for different fractions of pseudo-labels (p) in the training data. Panels (a), (b), and (c) show the average Dice scores of tumor classes WT, TC, and ET, respectively.

Figure 2.26: Comparison of brain tumor segmentation results: from left to right, the segmentation results of the lower-bound model, the model with CE loss, the model with GCE loss, and the ground-truth label. Segmentation results indicate label 1 (yellow), label 2 (blue), and label 4 (red), where ET is label 4, TC is labels 1 and 4, and WT is labels 1, 2, and 4.

2.4.5 Conclusion

We developed a semi-supervised learning strategy that uses ground-truth labels and generated pseudo-labels during training and applies robust loss functions to mitigate the negative effect of noise in the pseudo-labels. The proposed semi-supervised learning strategy is simple to deploy because of the plug-and-play robust loss module, and it opens possibilities for various applications as it is agnostic to the specific model architecture. The experimental results on classification and segmentation tasks show that the proposed strategy improves model performance, especially in scenarios where only a small amount of ground-truth labels is available.

Figure 2.27: A) Single-task model. B) Multi-task model. C) Sequential training. Non-trainable parameters are shown with dashed lines. D) ResNet50 model in which the first 20 layers are frozen for sequential training and transfer learning.
2.5 Sequential Multi-task Learning for Histopathology-based Prediction of Genetic Mutations with Extremely Imbalanced Labels

Demand for predicting genetic mutations to assist targeted or biomarker-based therapies [87, 57, 137] is growing, and deep learning techniques for predicting genetic mutations from diagnostic histopathology images have achieved success in recent years. This is generally regarded as a weakly supervised learning problem: patches derived from hematoxylin and eosin (H&E)-stained whole-slide images (WSI) are fed into deep convolutional neural networks (CNN) carrying the corresponding WSI-level labels in the training phase. In the testing phase, WSI-level predictions are aggregated from patch-level results using max or mean pooling or more sophisticated techniques, e.g., a weighted sum [wulczyn2020deep] or self-attention [li2021multi]. Recent findings show that learned computational histopathological features are associated with a wide range of recurrent genetic aberrations across cancer types [87, 137]. A single model that can predict many genetic properties may be preferable for its lower computational overhead and better generalizability. Here, our goal is to train a single CNN model capable of predicting multiple molecular biomarkers from H&E WSIs. To generate such a model, efforts have been made to assemble many digital pathology datasets in supervised or self-supervised multi-task learning frameworks [graham2022one, ciga2022self, 137]. However, multi-task learning can encounter optimization challenges arising from the varying learning speeds of different tasks, plateaus in the optimization landscape, or even conflicting gradients during learning [sener2018multi]. Moreover, histopathological datasets are extremely imbalanced for some gene mutation labels, and it is nearly impossible to use augmentation/re-sampling techniques to reach meaningful ratios between positive and negative samples for all labels simultaneously in a multi-task learning framework. Most of the time, results converge to trivial answers for extremely imbalanced labels. Here we investigate the multi-task learning of genetic mutations in an extremely imbalanced setting. To address the patch-level label quality issues in the weakly supervised learning framework, we formulated the problem as a noisy-label problem and applied a trimming strategy suited to the imbalanced setting. It is known that when training a model with noisy labels, the model fits the clean labels first and then starts memorizing the noisy labels [arpit2017closer, kim2022large]. We therefore trimmed the data for each class separately based on the value of their loss, to fit the imbalanced nature of the data [arpit2017closer]. To avoid generating trivial answers when predicting extremely imbalanced class labels, our approach makes key contributions in the following areas: (1) We propose a sequential training strategy for a multi-task problem to handle the extremely imbalanced data. (2) We compare our strategy with two standard methods for handling class imbalance: (i) using the self-supervised pre-trained model Bootstrap Your Own Latent (BYOL) [96] for feature extraction followed by a linear model, and (ii) using a weighted loss based on the proportion of each class. To assess our approach, both AUROC and F1 scores are reported.
However, our key observation is that AUROC can be misleading, and the F1 score is more reliable for comparing model performance on datasets with extremely imbalanced class labels [ciga2022self]. Our proposed trimming strategy combined with sequential learning improved the predictions for all of the genetic mutations. In addition, we investigated applying continual learning (CL) to mitigate the effect of task order and improve model performance on the tasks at the beginning of the sequence.

2.5.1 Methods

A total of 670 WSIs were downloaded from the TCGA-LUAD∗ dataset. We trained the network to predict ten binary genetic mutation labels. The numbers of positive (1) and negative (0) class samples are shown in Fig. 2.28(A) for each label: STK11 (92/490), EGFR (87/495), SETBP1 (71/511), TP53 (307/275), FAT1 (74/508), KRAS (175/407), KEAP1 (111/471), LRP1B (211/371), NF1 (88/494), and tumor mutation burden (TMB) (258/320). A threshold of 175 was applied to classify the WSIs into TMB-H (1) and TMB-L (0). Fewer than five WSIs were used for each of the 370 unique subjects. Patches of 512×512 pixels were extracted at the 20× magnification ratio, and only patches with more than 85% tissue content were used. Example WSIs and patches are shown in Fig. 2.28(B). The dataset was split in a 70/10/20 ratio into training/validation/testing sets at the subject level. First, we compared the performance of single-task learning with different strategies for handling weak and imbalanced labels in terms of AUROC and F1 score; these methods under the single-task training framework can be considered a set of strong baseline solutions. Then we compared the performance of various methods under the multi-task training framework.

∗ https://portal.gdc.cancer.gov/projects/TCGA-LUAD

2.5.1.1 Single-Task Training

Here we started by training a separate neural network for each of the 10 binary classification tasks, as shown in Fig. 2.27(A). We used a ResNet50 model and compared the baseline model, denoted "regular", with four different techniques for handling weak and imbalanced labels: (i) Training with re-sampling so that both classes have a similar number of samples; this method is denoted "reSample". (ii) Starting from a BYOL model [96] pre-trained for 80 epochs with a batch size of 256, followed by re-sampling. BYOL does not explicitly use negative samples; instead, it consists of two parallel networks and predicts the projection, or encoding, of an augmented view of the input passed through a target and an averaging network. According to [96], BYOL's performance is comparable to counterparts that require negative samples, while needing a smaller batch size than other contrastive learning models. After training the BYOL model, we froze the first 20 layers of the pre-trained ResNet50 and trained each task separately with re-sampling. Since it is outside the scope of this work, we did not specifically optimize the number of trainable parameters for this model. This method is denoted "reSample+BYOL". (iii) Starting from the model for the previous task, freezing the first 20 layers of the ResNet50 during training, and applying re-sampling; this method is denoted "transfer". (iv) Trimming with under-sampling of the dominant class: in each batch, we trimmed the data for each class separately based on the value of their loss, removing a given percentage (0%, 10%, 20%, or 50%) of the samples with the highest error from training in each batch.
The trimming percentage was chosen based on the F1-score performance on the validation set. This method is denoted "trim" in subsequent sections. The F1 score is calculated following the equations below:

\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}    (2.32)

\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}    (2.33)

F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}    (2.34)

The Micro F1 score is calculated as the average of the F1 scores over all tasks.

2.5.1.2 Multi-Task Training

Following the framework in [57], we trained a ResNet50 with 10 parallel binary classification heads, namely a multi-head layer, as shown in Fig. 2.27(B). The loss was the sum of the binary cross-entropy losses of the classification tasks. This baseline method is denoted "multi-task". To mitigate the effect of class imbalance, we tried three strategies under the multi-task framework: (i) We used a weighted loss where the weights were based on the ratio of positive labels in the training set; this method is denoted "multi-task-weighted". (ii) We used pre-trained weights trained on ImageNet and only continued to train the multi-head layer for one epoch; this method is denoted "pre-trained-ImageNet". (iii) We trained the network using the self-supervised learning method BYOL for 80 epochs and then fine-tuned only the multi-head layer using the labels for one epoch; this method is denoted "pre-trained-BYOL" in subsequent sections.

2.5.1.3 Sequential Training

In sequential training, we started with a pre-trained BYOL model and froze the first 20 layers of the ResNet50 for the rest of the training in order to obtain more generalizable features. We then trained the model sequentially for tasks 1 to 10, following the order of genetic mutation labels introduced at the beginning of this section: we trained the model for task 1 (predicting STK11) for 10 epochs, keeping the best weights based on the F1 score on validation data, then continued to train the model for task 2 (predicting EGFR), and followed the same procedure for the remaining tasks until the model had been trained on all tasks. We evaluated the performance of the model on the testing dataset for all 10 labels after sequential training finished; this differs from the single-task training in Sec. 2.5.1.1, where models were trained/tuned only for the current task and the best testing results on the current mutation label were reported. To mitigate the effect of imbalanced classes on each task, we resampled the data in each batch so as to have a similar number of samples for both classes; this is denoted "sequential-resample". Additionally, we used a trimming strategy, as sketched below, to handle the noisy labels at the tile level. For each task, we applied 0%, 10%, 20%, or 50% trimming based on the WSI-level F1 score on validation data; the trimming was applied to each batch, after warming up the model, by keeping the specified percentage of samples with the lowest loss. We also under-sampled the majority class to account for class imbalance. This method is denoted "sequential-trimming" in subsequent sections.
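A minimal sketch of this per-class loss trimming is given below, assuming PyTorch, a per-sample cross-entropy loss, and a fixed trimming fraction; it is an illustration, not the study's actual code.

```python
# Per-class trimming: within each batch, drop the highest-loss samples of each
# class before averaging the loss, so noisy (likely mislabeled) tiles of both
# the majority and minority class are removed at the same rate.
import torch
import torch.nn.functional as F

def trimmed_loss(logits, target, trim_frac=0.2):
    per_sample = F.cross_entropy(logits, target, reduction="none")
    keep_mask = torch.zeros_like(per_sample, dtype=torch.bool)
    for c in target.unique():
        idx = (target == c).nonzero(as_tuple=True)[0]
        n_keep = max(1, int(round(len(idx) * (1.0 - trim_frac))))
        # keep the n_keep lowest-loss samples of this class
        keep = idx[per_sample[idx].argsort()[:n_keep]]
        keep_mask[keep] = True
    return per_sample[keep_mask].mean()
```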
2.5.1.4 Continual Learning

The goal of applying CL is to prevent catastrophic forgetting. Proposed methods to address this problem have focused on (i) regularizing intrinsic levels of plasticity to protect acquired knowledge [135, 164, 73], (ii) allocating new neurons or network layers to accommodate novel knowledge [225, 189], and (iii) using complementary learning networks with experience replay for memory consolidation [188, 240, 235, 135]. We used Averaged Gradient Episodic Memory (AGEM) [Lopez2017gradient], which combines regularization and replay. AGEM modifies the gradients used for updates and minimizes catastrophic forgetting by storing a subset of the observed examples from previous tasks and constraining the gradient based on those samples. When training on a new task, AGEM ensures that the average loss over the episodic memory does not increase [Lopez2017gradient]. AGEM solves the constrained optimization problem:

\underset{\bar{g}}{\text{minimize}} \ \frac{1}{2}\,\|g - \bar{g}\|_2^2 \quad \text{s.t.} \quad \bar{g}^{T} g_{ref} \geq 0    (2.35)

This constrained problem can be solved very quickly: when the gradient g violates the constraint, it is projected via

\bar{g} = g - \frac{g^{T} g_{ref}}{g_{ref}^{T} g_{ref}}\, g_{ref}    (2.36)

where g_{ref} is a gradient computed on a batch randomly sampled from the episodic memory of all past tasks. We applied re-sampling to draw the samples from the previous tasks. Because the input samples are shared across all tasks, in the sequential learning scenario we only need memory to store the labels.
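As a concrete illustration of the projection in Eqs. (2.35)-(2.36), a minimal sketch operating on flattened gradient vectors is given below; in practice, the gradients of all model parameters must be flattened into, and restored from, a single vector.

```python
# AGEM gradient projection (Eq. 2.36): g is the current-task gradient and
# g_ref the gradient on a batch sampled from episodic memory, both flattened
# into 1-D tensors.
import torch

def agem_project(g: torch.Tensor, g_ref: torch.Tensor) -> torch.Tensor:
    dot = torch.dot(g, g_ref)
    if dot >= 0:
        return g                                   # constraint satisfied
    return g - (dot / torch.dot(g_ref, g_ref)) * g_ref
```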
2.5.2 Experiments and Results

We first investigated the effectiveness of re-sampling and trimming in the single-task framework; we then compared the performance of models in the multi-task framework. We trained the ResNet50 using the WSI-level labels for the tiles and obtained the final WSI-level prediction by simply averaging the predicted probabilities over the tiles of each WSI.

Figure 2.28: A) Class distribution for the 10 genetic mutations. B) Example of a WSI with TMB-H (left) and a patch of 512×512 pixels extracted from the WSI (right).

Table 2.8: AUROC/F1-score testing WSI-level results for single-task training.

Task      regular    reSample   reSample+BYOL  transfer   trim       coin-toss F1
STK11     0.54/0.16  0.53/0.20  0.60/0.15      -          0.58/0.30  0.24
EGFR      0.50/0.12  0.61/0.41  0.73/0.34      0.75/0.50  0.73/0.34  0.23
SETBP1    0.54/0     0.44/0.16  0.50/0         0.50/0.08  0.48/0.31  0.21
TP53      0.70/0.62  0.68/0.59  0.60/0.59      0.65/0.62  0.69/0.64  0.47
FAT1      0.58/0     0.57/0.14  0.55/0         0.60/0     0.71/0.24  0.21
KRAS      0.41/0.13  0.46/0.40  0.53/0.37      0.55/0.40  0.48/0.47  0.38
KEAP1     0.51/0     0.41/0.15  0.55/0.21      0.57/0.34  0.52/0.31  0.28
LRP1B     0.52/0.20  0.46/0.32  0.57/0.37      0.54/0.52  0.62/0.58  0.42
NF1       0.54/0     0.50/0     0.41/0.08      0.46/0.09  0.72/0.26  0.23
TMB       0.58/0.31  0.52/0.50  0.50/0.52      0.55/0.55  0.66/0.65  0.47
Micro F1  0.15       0.29       0.26           0.34       0.41       0.31

Single-task training was conducted following the methods introduced in Sec. 2.5.1.1. Evaluations were performed on the testing dataset of each task when the training of the model for that task was finished. The results from the various methods are shown in Table 2.8. To compare the F1 scores with a random baseline, we calculated the F1 score for a random coin-toss scenario in which the model randomly assigns positive labels to half of the samples and negative labels to the rest for each task. Our results demonstrate that AUROC may not be an appropriate measure for comparing different models in this setting, as it does not reflect the models that have converged to trivial solutions with F1 scores of 0 (e.g., in Table 2.8, the regular model for FAT1 has an AUROC of 0.58 and an F1 score of 0, indicating convergence to a trivial solution).

The re-sampling technique increased the Micro F1 score from 0.15 to 0.29, which is still below the random coin-toss performance. Transferring from the pre-trained self-supervised BYOL model led to a Micro F1 score of 0.26, which did not improve performance compared with re-sampling alone. However, transferring from the previous tasks increased the Micro F1 score to 0.34. Moreover, since the samples were weakly labeled, the best Micro F1 score of 0.41 was obtained using the trimming technique.

Multi-task and sequential training experiments were conducted under the multi-task framework, i.e., the models trained with the various methods were evaluated on the testing datasets associated with all 10 tasks. Results are shown in Table 2.9. As expected, the multi-task baseline model converged to a non-informative trivial solution and had F1 scores of 0 for the imbalanced genetic mutation labels. The weighting strategy improved performance in general but still failed to raise the Micro F1 score above that of the random coin-toss model. Sequential learning increased the Micro F1 score from 0.09 to 0.38, and trimming increased it further to 0.41, the best Micro F1 score amongst all methods under the multi-task framework. Results of the AGEM-based CL with re-sampling are shown in Table 2.10, alongside the results of sequential learning with re-sampling; CL did not improve performance in terms of the Micro F1 score.

Table 2.9: F1 score/AUROC testing WSI-level results for multi-task training.

Task      multi-task  multi-task-weighted  pre-trained-BYOL  pre-trained-ImageNet  sequential-resample  sequential-trimming
STK11     0/0.50      0.08/0.45            0/0.43            0/0.64                0.30/0.60            0.25/0.51
EGFR      0/0.63      0.25/0.56            0/0.66            0/0.62                0.46/0.68            0.38/0.53
SETBP1    0/0.54      0.14/0.50            0/0.42            0/0.49                0.24/0.40            0.25/0.37
TP53      0.51/0.66   0.59/0.66            0.10/0.60         0.51/0.63             0.50/0.56            0.59/0.67
FAT1      0/0.59      0.22/0.57            0/0.60            0/0.73                0.16/0.52            0.50/0.56
KRAS      0/0.40      0.33/0.40            0.04/0.45         0/0.43                0.46/0.57            0.25/0.64
KEAP1     0/0.50      0.13/0.43            0.08/0.49         0/0.54                0.37/0.57            0.37/0.52
LRP1B     0.26/0.57   0.45/0.60            0/0.60            0/0.55                0.56/0.52            0.59/0.69
NF1       0/0.51      0.13/0.53            0.06/0.42         0/0.63                0.22/0.54            0.23/0.54
TMB       0.13/0.55   0.49/0.52            0.34/0.65         0.13/0.66             0.55/0.55            0.66/0.67
Micro F1  0.09        0.28                 0.06              0.06                  0.38                 0.41

2.5.3 Conclusion and Discussion

We targeted the class-imbalance problem in the prediction of genetic mutations from H&E WSIs. Experiments were conducted on the TCGA-LUAD dataset for the prediction of 10 genetic mutation labels. Our results demonstrate that AUROC may not be a suitable measure of prediction accuracy, as it does not reflect models converging to trivial solutions with an F1 score of 0, which is more likely to occur when predicting multiple labels with imbalanced classes. The standard multi-task model obtained a very low F1 score due to the extreme class imbalance across multiple tasks.

Table 2.10: F1 score/AUROC testing WSI-level results comparing sequential learning with re-sampling (sequential-resample) and continual learning with re-sampling (CL-resample).

Task      sequential-resample  CL-resample
STK11     0.30/0.60            0.05/0.38
EGFR      0.46/0.68            0.29/0.57
SETBP1    0.24/0.40            0/0.43
TP53      0.50/0.56            0.16/0.28
FAT1      0.16/0.52            0.15/0.61
KRAS      0.46/0.57            0.26/0.43
KEAP1     0.37/0.57            0.21/0.57
LRP1B     0.56/0.52            0.42/0.68
NF1       0.22/0.54            0.13/0.48
TMB       0.55/0.55            0.49/0.68
Micro F1  0.38                 0.22
Starting from a ResNet50 model pre-trained with the self-supervised BYOL approach, combining trimming with sequential multi-task learning was effective in handling this problem, especially for the tasks at the end of the sequence. The Micro F1 scores of our proposed sequential multi-task learning approach were on par with the strong baseline results obtained under the single-task framework, and significantly outperformed the established baseline multi-task solution [57]. To mitigate catastrophic forgetting, we made a limited investigation of continual learning; however, we were not able to obtain an improvement in the Micro F1 score using the popular CL method AGEM. It has been reported that AGEM may fail for long task sequences [169] due to differences between new and previous tasks, which could have had a negative impact in our setting with multiple labels. Overall, the results show that the prediction of multiple genetic mutations remains a challenging problem. Future research should study the effect of changing the order of the sequential learning tasks, and more sophisticated CL methods will also need to be investigated.

Chapter 3

Brain Lesion Detection using Robust Variational Autoencoder and Transfer Learning

Accurate detection of lesions in the human brain is crucial for early diagnosis and treatment. Medical imaging techniques such as MRI are now standard clinical tools for detecting and quantifying lesions. Humans excel at identifying lesions by visual inspection after extensive training, but the subjective and expensive nature of human detection and delineation makes machine learning methods an attractive alternative or complement. Furthermore, machine learning might be able to achieve better-than-human performance for this specific task by leveraging multispectral MRI. Research based on supervised machine learning has already achieved significant success [160, 136, 195], with human-level or better performance. However, large numbers of manual lesion delineations are required for training supervised methods. Unsupervised approaches, on the other hand, do not require labeled data but are generally less accurate. Unsupervised approaches such as the autoencoder, the variational autoencoder (VAE) [142], and their variants [171] have shown that we can approximate the underlying distributions of high-dimensional data. A common application of unsupervised approaches is outlier detection [3], where the goal is to identify data samples whose representation deviates from that of the normal samples. For a population of brain images, assuming that lesions and other abnormalities occur rarely and in different locations across subjects, we conjecture that it is possible to learn a distribution that reflects healthy brain structure using a VAE.
Once this distribution is learned, we can measure the reconstruction error between a given image and its reconstruction to identify and localize abnormalities in that image. A VAE is a probabilistic autoencoder that uses the variational lower bound of the marginal likelihood of the data as its objective function. It has been shown that VAEs achieve higher accuracy in lesion detection tasks than standard autoencoders [48, 30, 191]. VAEs are based on the assumption that the training and test datasets are sampled from the same distribution. However, this assumption may not hold in real-world settings such as medical imaging applications, since different datasets can use different acquisition and pre-processing techniques. Ideally, we should still be able to leverage a pre-trained VAE model to develop a new model adapted to our dataset. The field of transfer learning focuses on addressing this problem [185]: with the aid of transfer learning, it is possible to store the knowledge gained while solving one problem and apply it to a different problem. The VAE's objective function contains a KL-divergence term that does not cope well with outliers and is therefore not robust. This may lead to unintended effects when applying transfer learning to adapt pre-trained VAE models whose initial training data differ significantly in characteristics from the new dataset. To this end, we propose a robust VAE, based on the notion of β-divergence from robust statistics [88], for transfer learning from pre-trained unsupervised lesion detection models. By varying the robustness hyperparameter β, we can control how much influence is granted to samples with low probability. We demonstrate the effectiveness of our approach on brain MRI datasets. Our results show that the combination of robust VAE and transfer learning allows us to use training data whose imaging parameters and demographics differ from those of the test dataset. We demonstrate this using a quantitative comparison to VAE models.

3.1 Mathematical Formulation

In this section, we first present a summary of VAEs and robust variational inference. We then formulate a robust VAE that can be trained on a mixture of normal and lesion images, based on the assumption that the lesion-free images are drawn from a Gaussian distribution.

3.1.1 Variational Autoencoder

The VAE is a directed probabilistic graphical model whose posteriors are approximated by a neural network. Let X denote the input data, x^{(i)} the samples of X, and Z its low-dimensional latent representation. The VAE consists of an encoder network that computes an approximate posterior q_φ(Z|X) and a decoder network that computes p_θ(X|Z) [142]; p_θ(Z) denotes the prior distribution from which Z is generated. The model parameters φ and θ are found by maximizing the evidence lower bound (ELBO) [142]:

\mathcal{L}(\theta, \phi; x^{(i)}) = \mathbb{E}_{q_\phi(Z|x^{(i)})}\left[\log p_\theta(x^{(i)}|Z)\right] - D_{KL}\left(q_\phi(Z|x^{(i)}) \,\|\, p_\theta(Z)\right)    (3.1)

The first term (the log-likelihood) can be interpreted as the reconstruction loss and the second term (the KL divergence) as a regularizer. Using empirical estimates of the expectation, we form the Stochastic Gradient Variational Bayes cost [142]:

\mathcal{L}(\theta, \phi; x^{(i)}) \approx \frac{1}{S} \sum_{j=1}^{S} \log p_\theta(x^{(i)}|z^{(j)}) - D_{KL}\left(q_\phi(Z|x^{(i)}) \,\|\, p_\theta(Z)\right)    (3.2)

where S is the number of samples drawn from q_φ(Z|X). In practice, we can choose S = 1 as long as the minibatch size is large enough. Assuming p_θ(X|Z) is a Gaussian distribution whose mean is the output of the network, the log-likelihood term simplifies to the mean squared error.

Figure 3.1: VAE network with input and output samples (T1, T2, FLAIR) from the ISLES dataset.

3.1.2 Robust Variational Autoencoder

Robust variational inference is based on a β-ELBO loss function that replaces the log-likelihood term with β-divergence, which is equivalent to minimizing the β-cross entropy [88, 51]. The β-ELBO is given by:

\mathcal{L}_\beta(q, \theta) = -N\, \mathbb{E}_{q_\phi(Z|x^{(i)})}\left[H_\beta\left(\hat{p}(X) \,\|\, p_\theta(X|Z)\right)\right] - D_{KL}\left(q(Z) \,\|\, p_\theta(Z)\right)    (3.3)
where p_θ(Z|X) is the posterior distribution, the empirical distribution is \hat{p}(X) = \frac{1}{N}\sum_{i=1}^{N} \delta(X, x^{(i)}) with δ the Dirac delta function, Z represents the latent variable, N is the number of samples, and θ contains the generative model's parameters. The β-cross entropy is given by [51]:

H_\beta\left(\hat{p}(X) \,\|\, p_\theta(X|Z)\right) = -\frac{\beta + 1}{\beta} \int \hat{p}(X)\left(p_\theta(X|Z)^{\beta} - 1\right) dX + \int p_\theta(X|Z)^{\beta + 1}\, dX    (3.4)

By replacing the log-likelihood with the β-cross entropy in the VAE formulation, we obtain a new cost function that is robust to outliers [14]. For a Gaussian distribution, the β-ELBO cost of the RVAE for the j-th sample simplifies to [14]:

\mathcal{L}_\beta(\theta, \phi; x^{(i)}) = \frac{\beta + 1}{\beta}\left[\frac{1}{(2\pi\sigma^2)^{\beta D/2}} \exp\left(-\frac{\beta}{2\sigma^2} \sum_{d=1}^{D} \|\hat{x}^{(j)}_d - x^{(i)}_d\|^2\right) - 1\right] - D_{KL}\left(q_\phi(Z|x^{(i)}) \,\|\, p_\theta(Z)\right)    (3.5)

As with the VAE, we train the robust VAE by optimizing the β-ELBO using Stochastic Gradient Variational Bayes cost minimization with sampling.
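To make Eq. (3.5) concrete, a minimal PyTorch sketch of the resulting training loss (the negative β-ELBO) is given below. The fixed likelihood standard deviation σ, the β default, and the function signature are illustrative assumptions, not the exact training code used here.

```python
# Negative beta-ELBO for a Gaussian likelihood (Eq. 3.5). mu and logvar are the
# encoder outputs parameterizing q(z|x); x_hat is the decoder reconstruction.
import math
import torch

def rvae_loss(x_hat, x, mu, logvar, beta=0.005, sigma=1.0):
    d = x[0].numel()                                    # data dimensionality D
    sq_err = ((x_hat - x).flatten(1) ** 2).sum(dim=1)   # sum_d ||x_hat_d - x_d||^2
    norm = (2.0 * math.pi * sigma ** 2) ** (-beta * d / 2.0)
    # Robust reconstruction term; note that for large D the exponent can
    # underflow, so beta must be kept small in practice.
    recon = (beta + 1.0) / beta * (norm * torch.exp(-beta / (2.0 * sigma ** 2) * sq_err) - 1.0)
    # KL between q(z|x) = N(mu, diag(exp(logvar))) and the standard normal prior
    kl = -0.5 * (1.0 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)
    return (-recon + kl).mean()
```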
Next, we describe the use of the VAE and robust VAE, in combination with transfer learning, for lesion delineation tasks.

3.2 The Model and Experiments

We used the VAE architecture proposed in [150]. The encoder consists of three consecutive blocks, each with a convolutional layer, a batch normalization layer, and a rectified linear unit (ReLU) activation, followed by two fully-connected layers in the bottleneck; the decoder consists of a fully-connected layer, three consecutive blocks of deconvolutional layers with batch normalization and ReLU, and a final deconvolutional layer. The size of the input layer is 3 × 64 × 64.

3.2.1 Data and Preprocessing

For the initial training, we used 20 central axial slices of brain MRI datasets from a combination of 119 subjects from the Maryland MagNeTS study of neurotrauma [99] and 112 subjects from the TrackTBI-Pilot dataset [283], both available for download from https://fitbir.nih.gov. We used 2D slices rather than 3D images to ensure a large enough dataset for training the VAE. These datasets contain T1, T2, and FLAIR images for each subject and have sparse lesions. The three imaging modalities (T1, T2, FLAIR) were rigidly co-registered within subject and to the MNI atlas reference, and re-sampled to 1 mm isotropic resolution. Skull and other non-brain tissue were removed using BrainSuite (https://brainsuite.org). Subsequently, we reshaped each sample into 64 × 64 images and performed histogram equalization to a lesion-free subject that was intensity-normalized by the value of its 99th-percentile voxel. We used 191 subjects for training and 40 subjects for validation, randomly sampled from the MagNeTS and TrackTBI-Pilot datasets.

Experiments for the pre-trained model: In this experiment, we evaluated the performance of a pre-trained model on a dataset that was pre-processed similarly to the training set. We used 20 central axial slices from 15 subjects of the ISLES (Ischemic Stroke Lesion Segmentation) database [170] as a test set and performed the same pre-processing as for the training set.

Experiments for re-training models (VAEbr, RVAEbr): In this experiment, we re-trained the VAE and RVAE models from scratch using a combination of the initial dataset and an additional 20 independent subjects from the BRATS dataset (https://www.smir.ch/BRATS/Start2015). We used 20 central axial slices from the remaining 20 subjects of BRATS 2015 as test data.

Experiments for transfer learning (PreVAE, PreRVAE): In this final experiment, we assume that we only have access to the pre-trained models, and the training datasets used for the pre-trained models are not available. We updated the pre-trained models using 20 subjects from the BRATS 2015 dataset. As in the re-training experiments, we tested the updated models on 20 central axial slices from 20 subjects of the BRATS 2015 dataset.

3.2.2 Results

The absolute error maps between the reconstructed and original images were computed for segmentation of the lesions. A median filter of size 7×7 was applied to remove isolated pixels. The filtered lesion error maps were used to plot ROC (Receiver Operating Characteristic) curves, from which we computed the AUC (Area Under the Curve). Hand-traced lesions were used to define ground truth, and only the pixels inside the brain mask were used for the AUC computation. An example input image from the ISLES test dataset and its reconstruction using the pre-trained VAE model are shown in Figure 3.1. The AUC for this experiment was 0.93. Experimental results of re-training the models and using transfer learning are illustrated in Figure 3.2, with the ROC curves and AUC values shown in Figure 3.3. Figure 3.2A shows that the RVAE did not reconstruct the lesions, while the lesions are more apparent in the reconstructed images from the VAE model. As a result, the RVAE can capture the locations of the lesions more accurately through the error between the original and reconstructed images. The AUC of the pre-trained VAE was 0.75. When the VAE was re-trained from scratch using the BRATS dataset (VAEbr), the AUC increased to 0.9; however, the AUC was only 0.82 when transfer learning was applied to the pre-trained VAE model (PreVAE). The AUC of the RVAE model re-trained using the initial and BRATS datasets (RVAEbr) was 0.92, and it increased to 0.93 when transfer learning was applied to the RVAE model (PreRVAE). The values of β for these experiments were chosen using the validation dataset; we chose a β value that prevents the RVAE from reconstructing lesions in the validation dataset.

Figure 3.2: (A) Original and reconstructed test images using different models. (B) Absolute reconstruction error of the test images and associated hand-delineated lesions (GTruth). VAEbr: VAE model re-trained from scratch using the initial data and the BRATS samples; RVAEbr: RVAE model re-trained from scratch using the initial data and BRATS samples; PreVAE: transfer learning of the VAE from the pre-trained VAE model using additional BRATS samples; PreRVAE: transfer learning of the RVAE from the pre-trained RVAE model using additional BRATS samples.
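A minimal sketch of the evaluation procedure of Sec. 3.2.2, assuming 2D numpy arrays err (absolute reconstruction error), mask (brain mask), and gt (hand-traced lesion labels); the names and shapes are illustrative placeholders.

```python
# Median-filter the error map, then compute the ROC AUC over brain pixels only.
import numpy as np
from scipy.ndimage import median_filter
from sklearn.metrics import roc_auc_score

def lesion_auc(err, mask, gt):
    filtered = median_filter(err, size=7)        # remove isolated high-error pixels
    scores = filtered[mask > 0].ravel()          # restrict to pixels inside the brain
    labels = (gt[mask > 0] > 0).ravel().astype(int)
    return roc_auc_score(labels, scores)
```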
3.3 Discussion and Conclusion

After training the VAE using nominally normal (anomaly-free) data, we can use it for anomaly detection and, specifically, for the identification of abnormal structures in medical images. We focused on delineating lesions from MRI scans that may have differing characteristics and pre-processing, which degrade the performance of the VAE. Utilizing the robustness of the RVAE, we described a framework that enables us to fine-tune the model for new test sets with differing attributes. We used a pre-trained model and re-trained it with additional subjects from the new dataset for model refinement. The robustness of the RVAE forces the model to learn only the features common to these data samples rather than their anomalous features (lesions). We have shown quantitatively and qualitatively that the RVAE outperforms the VAE both before and after model refinement.

A previous study on the BRATS 2015 dataset [48] reported an AUC of 0.9 using a VAE. We achieved a similar level of performance using only a subset of this dataset and a pre-trained model from a different dataset.

Figure 3.3: ROC curves of different models. RVAE outperforms VAE both when trained from scratch using BRATS samples in addition to the initial data (RVAEbr vs. VAEbr) and when updated from the pre-trained models (PreRVAE vs. PreVAE).

Chapter 4

Deep Quantile Regression for Uncertainty Estimation

4.1 Deep Quantile Regression for Uncertainty Estimation in Unsupervised and Supervised Lesion Detection

Inference based on deep learning methods that do not take uncertainty into account can lead to overconfident predictions, particularly with limited training data [212]. Quantifying uncertainty is particularly important in critical applications such as clinical diagnosis, where a realistic assessment of uncertainty is essential in determining disease status and appropriate treatment. For example, in the lesion detection task, knowing the uncertainty in detected boundaries may help in defining tumor margins. In the literature, predictive uncertainty is categorized into two types based on its source: (i) aleatoric uncertainty [149], which results from uncertainty inherent in the data, and (ii) epistemic uncertainty, often referred to as model uncertainty because it is due to model limitations. Access to unlimited training data would not reduce the former, in contrast to the latter. Here we focus on aleatoric uncertainty and its estimation using quantile regression (QR) [143].

A recent approach proposed using conditional QR to estimate aleatoric uncertainty in neural networks [220, 248]. QR can be used to estimate the conditional median (0.5 quantile) or other quantiles of the response variable, conditioned on the input feature variable [281]. QR is most commonly applied in cases where a parametric likelihood cannot be specified [281]; here we develop QR methods for Gaussian (unsupervised) and binary (supervised) applications.

Lesion detection is an important application of deep learning in medical image processing. Here we address the problem of learning uncertainty in order to perform statistically informed inference for this application. Lesion detection can be performed either in a supervised framework, when labels are available, or in an unsupervised framework using a generative model such as the VAE. We describe how aleatoric uncertainty can be quantified in both settings using quantile regression to define confidence intervals, which are then used to identify lesions. In both frameworks, we apply quantile regression by changing the loss function of the network. For quantile regression in the unsupervised setting, we use the formulation developed in [105]; our goal is to learn the characteristics of the input distribution to separate inliers from outliers, with the quantiles defining confidence intervals from which we can identify outliers. For the supervised setting, we use binary quantile regression [144] to capture uncertainty in binary classification problems. In both scenarios, our goal is to estimate aleatoric uncertainty in segmentation and calculate a confidence interval associated with each segmentation.

Unsupervised Lesion Detection: Generative models, including autoencoders, can be used for unsupervised lesion detection.
Once the distribution of anomaly-free samples is learned during training, we can compute, at inference time, the reconstruction error between a given image and its reconstruction to identify abnormalities [3, 9, 13, 14]. Decisions on the presence of outliers are often based on empirically chosen thresholds. Here we use quantile regression to define a principled approach to thresholding.

The variational autoencoder (VAE) [142] and its variants can approximate the underlying distribution of high-dimensional data. VAEs are trained using the variational lower bound of the marginal likelihood of the data as the objective function. They can then be used to generate samples from the data distribution, where probabilities at the output are modeled as parametric distributions, such as Gaussian or Bernoulli, that are conditionally independent across output dimensions [142]. Using the VAE, An and Cho [16] proposed the reconstruction probability rather than the reconstruction error to detect outliers. This allows a more principled approach to anomaly detection, since inference is based on quantitative statistical measures and can include corrections for multiple comparisons. To determine the reconstruction probability, we need the VAE to predict both the conditional mean and the conditional variance for each output dimension. The estimated variance represents an aleatoric uncertainty associated with the conditional variance of the estimates given the data [212].

Estimating the variance is more challenging than estimating the mean in generative networks because of the unbounded likelihood [237]. In the case of VAEs, if the conditional mean network prediction is nearly perfect (zero reconstruction error), then maximizing the log-likelihood pushes the estimated variance towards zero. This also makes VAEs susceptible to overfitting the training data, giving a near-perfect reconstruction on the training data and very small uncertainty. This near-zero variance does not reflect the true performance of the VAE on the test data; near-zero variance estimates, with the log-likelihood approaching an infinite supremum, do not lead to a good generative model. It has been shown that there is a strong link between this likelihood blow-up and the mode-collapse phenomenon [173, 212]; in this case, the VAE behaves much like a deterministic autoencoder [37]. While the classical formulation of VAEs allows both mean and variance estimates [142], because of the variance shrinkage problem most, if not all, implementations of the VAE, including the standard implementations in the PyTorch and TensorFlow libraries, estimate only the mean with a fixed value of variance [237]. Here we describe an approach that overcomes the variance shrinkage problem in VAEs using quantile regression (QR) in place of variance estimation. We then demonstrate the application of this new QR-VAE by computing reconstruction probabilities for a brain lesion detection task.

Supervised Lesion Detection: Labelled training data are preferable, if available, as they lead to better performance compared to unsupervised models [279]. One approach to estimating uncertainty is to use the softmax probability of the cross-entropy loss [70]. Softmax probabilities are, however, known to be poorly calibrated, and imperceptible perturbations to the input image can change the deep network's softmax output significantly [100]. Softmax confidence also conflates two different sources of uncertainty (aleatoric and epistemic).
Bayesian neural networks [178] can be used to estimate aleatoric uncertainty by measuring conditional entropy; however, these models are unable to capture multimodal uncertainty profiles [248]. An alternative method for capturing aleatoric uncertainty is quantile regression [247]. Here we use binary quantile regression [146, 172] to capture quantiles of the labels, which can be used to define multiple nested segmentation masks with increasing uncertainty. Our goal is to capture the source of uncertainty within the data distribution when there is more than one plausible answer to the segmentation problem due to disagreement between the specialists who labeled the data. Binary quantile regression is also robust to label noise [183]. Finally, Kordas et al. [146] showed that binary quantile regression can be useful for unbalanced data and leads to a more comprehensive view of how the predictor variables influence the response.

Related Work: A few recent papers have targeted the variance shrinkage problem. Among these, Detlefsen et al. [69] describe reliable estimation of the variance using Comb-VAE, a locally aware mini-batching framework that includes a scheme for unbiased weight updates for the variance network. In an alternative approach, Stirn and Knowles [244] suggest treating the variance variationally, assuming a Student's t likelihood for the posterior to prevent optimization instabilities, with a Gamma prior for the precision parameter of this distribution; the resulting Kullback-Leibler (KL) divergence induces gradients that prevent the variance from approaching zero [244]. In the supervised framework, several papers estimate uncertainty for segmentation, but only a few separately consider aleatoric uncertainty and focus on multi-rater labels [61, 112, 127, 145]. Czolbe et al. [61] compared these methods to investigate whether they are helpful for assessing segmentation quality and for active learning. Recently, Monteiro et al. [176] used a stochastic segmentation network for modeling spatially correlated uncertainty in image segmentation; they applied a multivariate normal distribution over the softmax logits and used a low-rank approximation to estimate the full covariance matrix across all pixels in the image [176].

Our Contribution: In the unsupervised setting, to obtain a probabilistic threshold and address the variance shrinkage problem, we suggest an alternative and attractively simple solution. Assuming the output of the VAE has a Gaussian distribution, we quantify uncertainty in the VAE estimates using conditional quantile regression (QR-VAE). The aim of conditional quantile regression [143] is to estimate a quantile of interest; here we use these quantiles to compute the variance, thus sidestepping the shrinkage problem. It has been shown that quantile regression is able to capture aleatoric uncertainty [248]. We demonstrate the effectiveness of our method quantitatively and qualitatively on simulated and brain MRI datasets. Our approach is computationally efficient and does not complicate the training or sampling procedures. In contrast to the VAE loss function, the QR-VAE loss function has no interaction term between the quantiles, and therefore shrinkage does not occur. Since quantile regression does not by itself satisfy finite-sample coverage guarantees, we applied conformalized quantile regression [220], which combines conformal prediction with classical quantile regression, to obtain a theoretical guarantee of valid coverage.
We also use binary quantile regression in a supervised framework to capture the uncertainty of lesion annotations. We demonstrate estimation of multiple quantiles in imaging data in which each lesion is delineated by four human observers, and we compare against the human-rater ground truth and a binary cross-entropy formulation. A preliminary version of these results was presented in [9]. The novel extensions presented in the current work include: (1) application of conformalized quantile regression to unsupervised learning (Section 4.1.2.1); (2) extension of the unsupervised approach to a supervised approach using binary quantile regression (Section 4.1.2.2); (3) application of binary quantile regression for lesion detection and uncertainty estimation (Section 4.1.3.3); and (4) additional results and validation that extend those in the earlier paper. We provide a public version of our code at https://github.com/ajoshiusc/QRSegment and https://github.com/ajoshiusc/QRVAE.

4.1.1 Background

4.1.1.1 Variance Shrinkage Problem in Variational Autoencoders

Let $x_i \in \mathbb{R}^D$ be an observed sample of the random variable $X$, where $i \in \{1, \cdots, N\}$, $D$ is the number of features, and $N$ is the number of samples; and let $z_i$ be an observed sample of the latent variable $Z$. Given a sample $x_i$ representing the input data, the VAE is a probabilistic graphical model that estimates the posterior distribution $p_\theta(Z|X)$ as well as the model evidence $p_\theta(X)$, where $\theta$ are the generative model parameters [142]. The VAE approximates the posterior distribution of $Z$ given $X$ by a tractable parametric distribution and maximizes the evidence lower bound (ELBO) [16]. It consists of an encoder network that computes $q_\phi(Z|X)$ and a decoder network that computes $p_\theta(X|Z)$ [268], where $\phi$ and $\theta$ are model parameters. Since neural networks are used to learn the distributions $q_\phi(Z|X)$ and $p_\theta(X|Z)$, the parameters $\theta$ and $\phi$ are modeled by the weights of the encoder and decoder networks and are learnt from the data during training. The marginal likelihood of an individual data point can be rewritten as follows:

$$\log p_\theta(x_i) = D_{KL}\big(q_\phi(Z|x_i)\,\|\,p_\theta(Z|x_i)\big) + \mathcal{L}(\theta, \phi; x_i), \qquad (4.1)$$

where

$$\mathcal{L}(\theta, \phi; x_i) = \mathbb{E}_{q_\phi(Z|x_i)}\big[\log p_\theta(x_i|Z)\big] - D_{KL}\big(q_\phi(Z|x_i)\,\|\,p_\theta(Z)\big). \qquad (4.2)$$

The first term (log-likelihood) in equation 4.2 can be interpreted as the reconstruction loss and the second term (KL divergence) as the regularizer. The total objective over all samples can be written as:

$$\mathcal{L}(\theta, \phi; X) = L_{REC} - L_{KL} \qquad (4.3)$$

where $L_{REC} := \mathbb{E}_{q_\phi(Z|X)}[\log p_\theta(X|Z)]$ and $L_{KL} := D_{KL}(q_\phi(Z|X)\,\|\,p_\theta(Z))$. Assuming the posterior distribution is Gaussian and using a 1-sample approximation [237], the likelihood term simplifies to:

$$L_{REC} = \sum_i -\frac{1}{2}\log\big(\sigma^2_\theta(z_i)\big) - \frac{\big(x_i - \mu_\theta(z_i)\big)^2}{2\sigma^2_\theta(z_i)} \qquad (4.4)$$

where $Z \sim p(Z) = \mathcal{N}(0, I)$ ($I$ is the identity matrix), $X|Z \sim p_\theta(X|Z) = \mathcal{N}\big(X|\mu_\theta(Z), \sigma_\theta(Z)\big)$, and $Z|X \sim q_\phi(Z|X) = \mathcal{N}\big(Z|\mu_\phi(X), \sigma_\phi(X)\big)$. Here $\mu_\theta(Z)$ and $\sigma_\theta(Z)$ are the posterior (decoder) mean and variance, and $\mu_\phi(X)$ and $\sigma_\phi(X)$ are the encoder mean and variance.

Optimizing VAEs over mean and variance with a Gaussian posterior is challenging [237]. If the model has sufficient capacity that there exist $(\phi, \theta)$ for which $\mu_\theta(z)$ provides a sufficiently good reconstruction, then the second term in equation 4.4 pushes the variance to zero before the term $-\frac{1}{2}\log(\sigma^2_\theta(z_i))$ catches up [37, 237].
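The mechanism is easy to see in the likelihood term itself. The following minimal PyTorch sketch shows the per-pixel Gaussian negative log-likelihood (the negative of equation 4.4, up to a constant); the helper name is ours:

```python
import torch

def gaussian_nll(x, mu, logvar):
    """Per-pixel Gaussian negative log-likelihood. When the squared error
    (x - mu)**2 is driven to ~0 by a high-capacity mean network, this loss is
    minimized by logvar -> -inf, i.e. the estimated variance collapses."""
    return 0.5 * (logvar + (x - mu) ** 2 / logvar.exp()).sum(dim=-1).mean()
```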
One practical example of this behavior arises in speech processing applications [37]: the input is a spectral envelope, a relatively smooth 1D curve, and representing it as a 2D image produces highly structured and simple training images. As a result, the model quickly learns to reconstruct the input accurately; reconstruction errors become small and the estimated variance becomes vanishingly small. Another example is 2D reconstruction of MRI images, where images from neighbouring 2D slices are highly correlated, leading again to variance shrinkage [261]. To sidestep this problem, variance estimation networks can be avoided by using a Bernoulli distribution or by simply setting the variance to a constant value [237].

4.1.1.2 Conditional Quantile Regression

In contrast to classical parameter estimation, where the goal is to estimate the conditional mean of the response variable given the feature variable, the goal of quantile regression is to estimate conditional quantiles based on the data [281]. The most common application of quantile regression models is in cases in which a parametric likelihood cannot be specified [281]. Another motivation for quantile regression is that quantiles are robust to outliers [128]. Quantile regression can be used to estimate the conditional median (0.5 quantile) or other quantiles of the response variable conditioned on the input data. The α-th conditional quantile function is defined as $q_\alpha(x) := \inf\{y \in \mathbb{R} : F(y|X = x) \geq \alpha\}$, where $F(y|X = x) = P(Y \leq y \mid X = x)$ is a strictly monotonic cumulative distribution function.

Similar to classical regression analysis, which estimates the conditional mean, the α-th quantile regression ($0 < \alpha < 1$) seeks a solution to the following minimization problem for input $x$ and output $y$ [143, 281]:

$$\arg\min_\theta \sum_i \rho_\alpha\big(y_i - f_\theta(x_i)\big) \qquad (4.5)$$

where $x_i$ are the inputs, $y_i$ are the responses, $\rho_\alpha$ is the check function or pinball loss [143], and $f$ is the model parameterized by $\theta$. The goal is to estimate the parameter $\theta$ of the model $f$. The pinball loss is defined as:

$$\rho_\alpha\big(y, f_\theta(x_i)\big) := \begin{cases} \alpha\,\big(y - f_\theta(x_i)\big) & \text{if } y - f_\theta(x_i) > 0 \\ (1-\alpha)\,\big(f_\theta(x_i) - y\big) & \text{otherwise.} \end{cases} \qquad (4.6)$$

Due to its simplicity and generality, quantile regression is widely applicable in classical regression and machine learning to obtain a conditional prediction interval [217]. It can be shown that minimization of the loss function in equation 4.5 is equivalent to maximization of the likelihood function formed by combining independently distributed asymmetric Laplace densities [281]:

$$\arg\max_\theta L(\theta) = \frac{\alpha(1-\alpha)}{\sigma}\exp\left(-\frac{\sum_i \rho_\alpha\big(y_i - f_\theta(x_i)\big)}{\sigma}\right)$$

where $\sigma$ is the scale parameter. Individual quantiles can be shown to be maximum likelihood estimates under an asymmetric Laplace density. In this paper we estimate two quantiles jointly, and our loss function can therefore be seen as a sum of two Laplacian likelihoods.
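A minimal PyTorch implementation of the check (pinball) loss of equation 4.6 might look like this:

```python
import torch

def pinball_loss(y, y_hat, alpha):
    """Check (pinball) loss for quantile level alpha in (0, 1).
    Equivalent to max(alpha * d, (alpha - 1) * d) with d = y - y_hat."""
    d = y - y_hat
    return torch.where(d > 0, alpha * d, (alpha - 1) * d).mean()
```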
4.1.2 Deep Uncertainty Estimation with Quantile Regression

4.1.2.1 Quantile Regression Variational Autoencoder (QR-VAE)

Instead of estimating the conditional mean and conditional variance directly at each pixel (or feature), the outputs of our QR-VAE are multiple quantiles of the output distribution at each pixel. This is achieved by replacing the Gaussian likelihood term in the VAE loss function with the quantile (check or pinball) loss. For the QR-VAE, if we assume a Gaussian output, then only two quantiles are needed to fully characterize the Gaussian distribution. Specifically, we estimate the median and the 0.15 quantile, which lies approximately one standard deviation (more precisely, 1.036 standard deviations) below the mean under the Gaussian model. The QR-VAE outputs, $Q_L$ (low quantile) and $Q_H$ (high quantile), are then used to calculate the mean and the variance.

To find these conditional quantiles, fitting is achieved by minimizing the pinball loss for each quantile. The resulting reconstruction loss for the proposed model is:

$$L_{REC} = \sum_i \rho_L\big(x_i - f_{\theta_L}(x_i)\big) + \sum_i \rho_H\big(x_i - f_{\theta_H}(x_i)\big)$$

where $\theta_L$ and $\theta_H$ are the parameters of the models corresponding to the quantiles $Q_L$ and $Q_H$, respectively. Minimizing this loss yields the desired quantile estimates for each output pixel.

We reduce the chance of quantile crossing (i.e., we encourage consistency of the quantiles, $Q_{\tau_1} \subset Q_{\tau_2}$ when $\tau_1 > \tau_2$) by limiting the flexibility of independent quantile regression: both quantiles are estimated simultaneously with one neural network rather than training a separate network for each quantile [217]. Note that the estimated quantiles share network parameters except for the last layer.

While quantile regression guarantees coverage of the data (based on the quantiles chosen) in the training set, performance on held-out validation data is not guaranteed. To obtain a finite-sample coverage guarantee on unseen data, we deployed conformalized quantile regression using a calibration set, as explained in [220]. Conformal prediction provides a non-asymptotic, distribution-free coverage guarantee [231]: the main idea is to fit a model on the training data and then use the residuals on held-out calibration data to quantify the uncertainty in future predictions, offering finite-sample, distribution-free performance guarantees. The conformalized quantile regression approach combines conformal prediction with quantile regression [230]. We use the approach presented in [220], which inherits both the finite-sample, distribution-free validity of conformal prediction and the statistical efficiency of quantile regression.
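Under the Gaussian assumption, the two estimated quantiles determine the mean and standard deviation in closed form, since $\Phi^{-1}(0.15) \approx -1.036$. A sketch with our own helper names:

```python
import torch

def pinball(y, y_hat, alpha):
    d = y - y_hat
    return torch.where(d > 0, alpha * d, (alpha - 1) * d).mean()

def qrvae_rec_loss(x, q_low, q_med):
    """QR-VAE reconstruction loss: one pinball term per decoder output head
    (the 0.15 quantile and the median)."""
    return pinball(x, q_low, 0.15) + pinball(x, q_med, 0.5)

def quantiles_to_gaussian(q_low, q_med):
    """Map (Q_0.15, Q_0.5) to (mean, std) under the Gaussian assumption:
    Q_0.5 = mu and Q_0.15 = mu - 1.036 * sigma."""
    mu = q_med
    sigma = (q_med - q_low) / 1.036   # positive when q_low < q_med
    return mu, sigma
```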
4.1.2.2 Binary Quantile Regression U-Net (BQR U-Net)

To calculate quantiles of a binary response, consider the following model:

$$Y^* = h(x) + \epsilon, \qquad Y = \mathbb{I}\{Y^* \geq 0\}$$

where $Y^*$ is a hidden variable, $h(x)$ is the true model, and $\epsilon$ is the noise; no distribution is assumed for $\epsilon$ [146]. Since the indicator function is monotone increasing, and since quantiles are invariant under monotone transforms, we have:

$$Q_\tau(Y|X) = \mathbb{I}\{Q_\tau(Y^*|X) \geq 0\}$$

where $Q_\tau(Y|X)$ is the τ-th conditional quantile of $Y$ given $X$. Modeling $Q_\tau(Y^*|X) = f_\tau(X, \beta)$ with parameter $\beta$, the parameter can be estimated by:

$$\arg\min_\beta \sum_i \rho_\tau\big(y_i - \mathbb{I}\{f_\tau(x_i, \beta) \geq 0\}\big)$$

which can be shown to be equivalent to the maximization problem [146]:

$$\arg\max_\beta \sum_i \big[y_i - (1-\tau)\big]\,\mathbb{I}\{f_\tau(x_i, \beta) \geq 0\}.$$

However, this objective is not differentiable because of the indicator function. To apply gradient-based optimization methods for training the neural network, we use the smoothed approximation [146]:

$$\arg\max_\beta \sum_i \big[y_i - (1-\tau)\big]\,K\big(f_\tau(x_i, \beta)\big) \qquad (4.7)$$

where $K(t)$ is a smoothed version of the indicator function with the properties $K(t) \geq 0 \;\; \forall t \in \mathbb{R}$, $\lim_{t\to+\infty} K(t) = 1$, and $\lim_{t\to-\infty} K(t) = 0$. Specifically, to train the neural networks we choose $K(t) = \frac{1}{1+e^{-t}}$, the sigmoid function, which has the desired properties.

In this paper we use the BQR loss to solve the lesion detection and segmentation task. We use a U-Net architecture with multiple heads (output branches), where each head estimates a specific quantile for the labels at the pixel level. We observed that joint estimation of multiple quantiles is computationally faster than solving for each separately, and it also avoids the quantile-crossing problem. A standard U-Net would use a cross-entropy loss for this segmentation task; here we replace it with the BQR loss. To estimate the $n$ quantiles, the BQR objective is:

$$\text{Loss} = \sum_n \sum_i \big[y_i - (1-\tau_n)\big]\,K\big(f_{\tau_n}(x_i, \beta_n)\big) \qquad (4.8)$$

where each $f_{\tau_n}$ corresponds to a head of the U-Net, $\tau_1, \ldots, \tau_n$ are the different quantiles, and $\beta_1, \ldots, \beta_n$ are the estimated parameters for each quantile. A single network with an output branch per quantile is used to estimate all quantiles: since the network is shared across quantiles except for the last layer, training of the quantiles is consistent, avoiding crossing of the estimated quantiles. We choose $K(t)$ to be the sigmoid function.
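A PyTorch sketch of the smoothed BQR objective of equation 4.8, written as a minimization loss by negating the objective that the text maximizes; the tensor layout is an assumption of ours (one output channel per quantile head):

```python
import torch

def bqr_loss(logits, y, taus):
    """Smoothed binary quantile regression loss (negative of eq. 4.8).
    logits: (batch, n_quantiles, H, W) raw outputs f_tau of each head.
    y:      (batch, H, W) binary labels in {0, 1}.
    taus:   iterable of quantile levels, e.g. (0.125, 0.375, 0.625, 0.875)."""
    loss = 0.0
    for k, tau in enumerate(taus):
        k_t = torch.sigmoid(logits[:, k])               # smoothed indicator K(f_tau)
        loss = loss - ((y - (1.0 - tau)) * k_t).mean()  # negate to minimize
    return loss
```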
4.1.3 Experiments and Results

We evaluate our proposed approaches for supervised and unsupervised deep quantile regression on (i) a simulated dataset for density estimation, (ii) unsupervised lesion detection in a brain imaging dataset, and (iii) supervised lesion detection in a lung cancer dataset. For the simulated data, we compare our results qualitatively and quantitatively, using the KL divergence between the learned and original distributions, with Comb-VAE [237] and the VAE as baselines. For unsupervised lesion detection, we compare our results with a VAE that estimates both mean and variance; the area under the receiver operating characteristic curve (AUC) and the Dice coefficient are used as performance metrics. We also performed the unsupervised lesion detection task nonparametrically, estimating upper and lower quantiles of the images and then assigning lesion labels to voxels whose intensities fall outside those quantiles. Using the BQR U-Net, we estimated the thresholded probability of the labels for a dataset with multiple (four) annotators per image. We compared the Dice coefficients of the thresholded areas obtained using the BQR U-Net with the corresponding areas calculated using both the softmax probability of a deterministic U-Net and the ground truth (as determined by the four human raters).

4.1.3.1 Simulations for VAE

Following [237], we first evaluate variance estimation using the VAE, Comb-VAE, and QR-VAE on a simulated dataset. The data-generation process for this simulation is inspired by the two-moon dataset∗. First, we generate 500 points in $\mathbb{R}^2$ in a two-moon arrangement to form a known two-dimensional latent space. These are then mapped to four dimensions $(v_1, v_2, v_3, v_4)$ using the following equations:

$$v_1(z_1, z_2) = z_1 - z_2 + \epsilon\sqrt{0.03 + 0.05(3 + z_1)}$$
$$v_2(z_1, z_2) = z_1^2 - \tfrac{1}{2}z_2 + \epsilon\sqrt{0.03 + 0.03\,\|z_1\|^2}$$
$$v_3(z_1, z_2) = z_1 z_2 - z_1 + \epsilon\sqrt{0.03 + 0.05\,\|z_1\|^2}$$
$$v_4(z_1, z_2) = z_1 + z_2 + \epsilon\sqrt{0.03 + \frac{0.03}{0.02 + \|z_1\|^2}}$$

where $\epsilon$ is sampled from a normal distribution. For more details about the simulation, please refer to [237]†. After training the models, we first sample $z$ from the Gaussian prior and pass the sample through the decoder to generate the parameters of the posterior $p_\theta(x|z)$; we then sample again from this posterior using the estimated means and variances from the decoder. The distribution of these generated samples represents the distribution learned by the generative model.

∗https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_moons
†https://github.com/SkafteNicki/john/blob/master/toy_vae.py
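A NumPy sketch of this data-generation process; the two-moon noise level and seed here are illustrative choices, not taken from [237]:

```python
import numpy as np
from sklearn.datasets import make_moons

def simulate_4d(n=500, rng=None):
    """Two-moon latents (z1, z2) mapped to four noisy dimensions as in the
    equations above. The heteroscedastic noise makes the conditional
    variance depend on the latent position."""
    rng = np.random.default_rng(rng)
    z, _ = make_moons(n_samples=n, noise=0.05, random_state=0)
    z1, z2 = z[:, 0], z[:, 1]
    e = lambda: rng.standard_normal(n)            # fresh N(0, 1) noise per dimension
    v1 = z1 - z2 + e() * np.sqrt(0.03 + 0.05 * (3 + z1))
    v2 = z1**2 - 0.5 * z2 + e() * np.sqrt(0.03 + 0.03 * z1**2)
    v3 = z1 * z2 - z1 + e() * np.sqrt(0.03 + 0.05 * z1**2)
    v4 = z1 + z2 + e() * np.sqrt(0.03 + 0.03 / (0.02 + z1**2))
    return np.stack([v1, v2, v3, v4], axis=1)
```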
Figure 4.1: Pairwise joint distributions of the ground-truth and generated data. Top: v1 vs. v2. Bottom: v2 vs. v3. From left to right: the original distribution and the distributions computed using the VAE (KL = 0.64), Comb-VAE (KL = 0.30), and QR-VAE (KL = 0.25). The KL divergence between the learned and original distributions is listed in each case.

In Figure 4.1, we plot the pairwise joint distributions for the input data as well as for the generated samples from the various models. We used Gaussian kernel density estimation to model the distributions from 1000 samples in each case. We observe that the standard VAE underestimates the variance, resulting in insufficient learning of the data distribution. The samples from our QR-VAE capture a data distribution more similar to the ground truth than either the standard VAE or Comb-VAE. Our model also outperforms VAE and Comb-VAE in terms of the KL divergence between input samples and generated samples, as shown in Figure 4.1. The KL divergence is calculated using universal-divergence, which estimates the KL divergence based on k-nearest-neighbor (k-NN) distances [264]‡.

‡https://pypi.org/project/universal-divergence

4.1.3.2 Unsupervised Lesion Detection

Network Architecture: Next, we investigate the utility of the proposed QR-VAE for the medical imaging application of detecting brain lesions. Multiple automatic lesion detection approaches have been developed to assist clinicians in identifying and delineating lesions caused by congenital malformations, tumors, stroke, or brain injury. The VAE is a popular framework among the class of unsupervised methods [48, 30, 191]. After training a VAE on a lesion-free dataset, presentation of a lesioned brain to the VAE will typically result in the reconstruction of a lesion-free equivalent, so the error between input and output images can be used to detect and localize lesions. However, selecting an appropriate threshold that differentiates lesion from noise is difficult, and a single global threshold across the entire image will inevitably lead to a poor trade-off between true and false positive rates. Using the QR-VAE, we can compute the conditional mean and variance of each output pixel. This allows a more reliable and statistically principled approach to detecting anomalies by thresholding based on computed p-values, and it also allows us to correct for multiple comparisons.

The network architectures of the VAE and QR-VAE are based on previously established architectures [150]. Both consist of three consecutive blocks of a convolutional layer, a batch normalization layer, and a ReLU activation function, with a fully-connected layer in the bottleneck for the encoder. The decoder includes three consecutive blocks of deconvolutional layers, batch normalization, and ReLU, followed by an output layer with two separate deconvolution layers (one for each output) with sigmoid activations. For the VAE, the outputs represent the mean and variance, while for the QR-VAE the outputs represent two quantiles, from which the conditional mean and variance are computed at each voxel. The size of the input layer is 3 × 64 × 64, where the first dimension represents three MRI contrasts: T1-weighted, T2-weighted, and FLAIR.

Training, Validation, and Testing Data: For training we use 20 central axial slices of brain MRI datasets from a combination of 119 subjects from the Maryland MagNeTS study [99] of neurotrauma and 112 subjects from the TrackTBI-Pilot dataset [283], both available for download from https://fitbir.nih.gov. We use 2D slices rather than 3D images to make sure we have a large enough dataset for training the VAE. These datasets contain T1, T2, and FLAIR images for each subject and have sparse lesions. We have found that in practice both VAEs have some robustness to the lesions in these training data, so the data are sufficient for the networks to learn to reconstruct lesion-free images, as required for our anomaly detection task.

The three imaging modalities (T1, T2, FLAIR) were rigidly co-registered within subject and to the MNI brain atlas reference, and re-sampled to 1 mm isotropic resolution. Skull and other non-brain tissue were removed using BrainSuite (https://brainsuite.org). Subsequently, we reshaped each sample into 64 × 64 images and performed histogram equalization to a lesion-free reference. We separated 40 subjects as the validation/calibration set. We evaluated the performance of our model on a test set consisting of 20 central axial slices of 28 subjects from the ISLES (Ischemic Stroke Lesion Segmentation) database [170], for which ground truth, in the form of manually segmented lesions, is also provided. We performed pre-processing similar to that of the training set.

Model-free Anomaly Detection: For simplicity, we first performed the lesion detection task using the QR-VAE without the Gaussian assumption, as shown in Figure 4.2. We trained the QR-VAE to estimate the Q0.025 and Q0.975 quantiles and used these quantiles directly to threshold the input images for anomalies. This leads to a nominal 5% (per pixel) false positive rate. The method is simple and avoids the need for validation data to determine an appropriate threshold. However, without access to p-values we are unable to determine a threshold that corrects for multiple comparisons by controlling the false-discovery or family-wise error rate; a model needs to be assumed to obtain p-values, as is done in the Gaussian experiment below. The Dice coefficient for this model was 0.37 with conformalization and 0.32 without (see Table 4.1). To validate the accuracy of the computed quantiles, we calculated the percentage of pixels that lie below each estimated quantile. Even for the extreme quantiles, the percentage of pixels with intensity below the threshold predicted by each quantile was very close to the corresponding quantile value (Figures 4.4 and 4.5).

Figure 4.2: Model-free lesion detection for the ISLES dataset using QL = Q0.025 and QH = Q0.975 (columns: input, QL, QH, pixels with input < QL or input > QH, and ground truth). Pixels outside the [QL, QH] interval are marked as outliers. The estimated quantiles are the outputs of the QR-VAE.

Figure 4.3: Estimating two quantiles in the ISLES dataset using the QR-VAE. Under the Gaussian assumption for the posterior, there is a one-to-one mapping from these quantiles to the mean and standard deviation.

Figure 4.4: Pixel-wise quantile image thresholds for a single test image as a function of quantile level, computed using the QR-VAE.

Figure 4.5: The vertical axis indicates the fraction of pixels in the entire testing set whose intensity is below the corresponding quantile for that pixel, as computed using the QR-VAE. Aggregated over the entire test set, the computed pixel-wise quantiles closely match the true distribution assuming anomaly-free data (in practice, anomalous pixels are a very small fraction of the total, so the presence of lesions in the data should not substantially affect this plot).
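The model-free detection rule and its coverage check reduce to a few lines of NumPy; the names are illustrative:

```python
import numpy as np

def model_free_detect(x, q_low, q_high):
    """Flag pixels falling outside the [Q_0.025, Q_0.975] interval predicted
    by the QR-VAE; returns a boolean anomaly mask."""
    return (x < q_low) | (x > q_high)

def empirical_coverage(x, q):
    """Fraction of pixels whose intensity lies below the predicted quantile
    image; used to validate the calibration of the estimated quantiles."""
    return float((x < q).mean())
```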
Gaussian Model for Anomaly Detection: In a second experiment, we trained a VAE with a Gaussian posterior and the QR-VAE as illustrated in Figure 4.3, in both cases estimating the conditional mean and variance. Specifically, we estimated the Q0.15 and Q0.5 quantiles with the QR-VAE and used these values to compute the pixel-wise mean and variance under the Gaussian assumption. Comparing each pixel intensity to this Gaussian model, we converted image intensities to p-values at each pixel. To identify or segment lesions, we then threshold the pixel-wise p-values.

Naively applying a threshold separately at each pixel would result in a large number of false positives because of the multiple comparisons problem [232]. For example, if all pixels were independent and followed the null distribution, then thresholding at an α = 0.05 significance level would identify 5% of all pixels as lesion, even if none were present. In practice this number is much lower because of spatial correlation in the image, but it is still important to account for multiple comparisons. The best-known adjustment is the Bonferroni correction [38]; in medical imaging applications, however, this correction tends to be too conservative, since pixels are correlated. Other methods for multiple comparison correction are designed to control the family-wise error rate (FWER, the probability of making one or more false discoveries) [255] or the false discovery rate (FDR) [33]. The FDR is the expected ratio of the number of false positives to the total number of positives (rejections of the null); in other words, with FDR-corrected thresholding at an α = 0.05 significance level, we would expect 5% of the detected lesion pixels to be false positives. Here we use the Benjamini-Hochberg procedure [33] with α = 0.05.

As shown in Figure 4.6, the VAE underestimates the variance, so that most of the brain shows significant p-values even with FDR correction. In contrast, the QR-VAE's thresholded results detect anomalies that reasonably match the ground truth. To produce a quantitative measure of performance, we also computed the area under the ROC curve (AUC) for the VAE and QR-VAE. To do this we first computed z-score images by subtracting the mean and normalizing by the standard deviation, and then applied median filtering with a 7 × 7 window. By varying the threshold on the resulting images and comparing to ground truth, we obtained AUC values of 0.52 for the VAE and 0.92 for the QR-VAE, and Dice coefficients of 0.006 for the VAE and 0.37 for the QR-VAE. All quantitative measurements were computed on a voxel-wise basis. Note also that with the conformalized formulation, the Dice coefficient further increased to 0.41 (Table 4.1).
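A sketch of the thresholding step under the Gaussian model: z-scores are converted to p-values (two-sided here, as one possible choice) and thresholded with the Benjamini-Hochberg procedure; helper names are ours:

```python
import numpy as np
from scipy.stats import norm

def fdr_detect(x, mu, sigma, alpha=0.05):
    """Pixel-wise Gaussian p-values thresholded with Benjamini-Hochberg:
    find the largest k with p_(k) <= alpha * k / n, reject all p <= p_(k)."""
    z = (x - mu) / sigma
    p = 2 * norm.sf(np.abs(z))                      # two-sided p-values
    p_sorted = np.sort(p.ravel())
    n = p_sorted.size
    crit = alpha * np.arange(1, n + 1) / n          # BH critical values
    below = np.nonzero(p_sorted <= crit)[0]
    if below.size == 0:
        return np.zeros_like(p, dtype=bool)         # nothing significant
    return p <= p_sorted[below.max()]               # boolean lesion mask
```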
Using the conformalized formulation improved the performance of the QR models by calibrating the quantiles, increasing the Dice coefficients in both the Gaussian and model-free cases.

Figure 4.6: Lesion detection for the ISLES dataset. (A) VAE with mean and variance estimation. (B) QR-VAE. Rows: input; mean (median and Q0.15 for the QR-VAE); 3× standard deviation; |input − median|; z-score; FDR-significant p-values; ground truth. The error value is normalized using the model's pixel-wise estimates of mean and variance; the resulting z-score is converted to an FDR-corrected p-value, and the images are thresholded at a significance level of 0.05. The bottom rows show ground truth based on expert manual segmentation of lesions.

Table 4.1: Comparison of the performance of unsupervised lesion detection for VAE and QR-VAE, with and without conformalization. QR-VAE-conf: conformalized QR-VAE; QR-VAE-GS: Gaussian QR-VAE; QR-VAE-GS-conf: Gaussian conformalized QR-VAE.

                   VAE     QR-VAE   QR-VAE-conf   QR-VAE-GS   QR-VAE-GS-conf
AUC                0.52    N/A      N/A           0.92        0.92
Dice coefficient   0.006   0.32     0.37          0.37        0.41

4.1.3.3 Supervised Lesion Detection

We evaluated our supervised BQR approach on the LIDC-IDRI dataset [21]. This dataset consists of 1018 3D thorax CT scans annotated by four radiologists tasked with finding multiple lung nodules in each scan. The data are well suited to capturing the inherent uncertainty in data labels that comes from disagreement between experts. The data were preprocessed as described in [145]: 2D slices centered around the annotated nodules were extracted, generating 180 × 180 images wherever at least one expert segmented a nodule. This process resulted in 8882 images in the training set, 1996 images in the validation set, and 1992 images in the test set. We reshaped the data into 128 × 128 images for input to the neural network.

We used a U-Net architecture [222] with 2D convolutional layers. The output layer was modified to generate four quantiles with four output branches using softmax activations. We compared the performance of the BQR U-Net with the deterministic U-Net [222]. As ground truth for the comparison, we first estimated agreement maps for each test image by combining the lesion annotations of the four raters. This generates, for each image, an annotation with P(Y = 1|X) values in {0, 0.25, 0.5, 0.75, 1}, where X represents the input image. Given an input image, the BQR U-Net generates output regions where the probability of the label Y = 1 is at or above the given quantile threshold. The BQR U-Net was trained with the loss function in equation 4.8; for comparison, the deterministic U-Net was trained using the binary cross-entropy loss.

The BQR U-Net was trained to output the 0.125, 0.375, 0.625, and 0.875 quantiles. These values are the centroids of the intervals between the test-data quantiles (0, 0.25, 0.5, 0.75, 1) that correspond to 0-4 rater agreements, respectively. We used these thresholds rather than the same quantiles as the test data to avoid operating at the boundary points between operators that result from combining data from only four raters. To generate the corresponding estimated quantiles for the deterministic U-Net, we thresholded its softmax probability at 0.125, 0.375, 0.625, and 0.875.
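The construction of the rater-agreement ground truth and the nested quantile regions can be sketched as follows (illustrative helper names):

```python
import numpy as np

def agreement_map(masks):
    """Combine binary masks from several raters (shape: raters x H x W) into
    an agreement map P(Y=1|X); with four raters the values lie in
    {0, 0.25, 0.5, 0.75, 1}."""
    return np.mean(masks, axis=0)

def quantile_regions(prob_map, taus=(0.125, 0.375, 0.625, 0.875)):
    """Nested regions where the (estimated or rater-derived) probability of
    Y = 1 meets or exceeds each threshold."""
    return {t: prob_map >= t for t in taus}
```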
Both the deep BQR and the deterministic U-Net diverged to the trivial solution of predicting all labels as zero when initialized with a cold start, owing to the extreme class imbalance. We therefore warmed up both models using a weighted cross-entropy loss for one epoch, weighting samples by 1/166, reflecting the ratio of zero to one labels in the training set. We then trained both models for 5 epochs.

Our results show no significant improvement for deep BQR compared to the cross-entropy loss in terms of Dice coefficients for the different agreement areas. The Dice coefficients between estimated probability areas for the deterministic U-Net and BQR U-Net are plotted in Figure 4.8 and summarized in Table 4.2. In the figure we show the distributions of the Dice coefficients between BQR and ground-truth quantiles, between DT (deterministic U-Net) and ground truth, and between DT and BQR. In some cases, particularly for the higher quantiles, there was low agreement between the human raters in the ground-truth test data; for example, 64 percent of the test data showed zero pixels in common between all four human raters, leaving the 0.875 quantile empty for those images. We therefore computed the Dice coefficients only over those regions in which the test data had a non-empty set for that quantile. The results show reasonable agreement for the 0.125, 0.375, and 0.625 quantiles, but relatively poor results for the 0.875 quantile. Surprisingly, the results for BQR and the deterministic U-Net (DT U-Net) are very similar, even though the actual degree of overlap between the two is no better than between each of them and the ground-truth labels. The fact that both methods perform poorly for the 0.875 quantile reflects both that the data are of relatively poor quality in this region and that performance is likely limited by the imbalance in the training data between non-lesional areas and lesions confidently identified by all four raters.

Here we used the deterministic U-Net as a backbone since it is arguably the most commonly used network for medical image segmentation tasks; other networks could also be used. To investigate whether we would expect further improvements using a probabilistic backbone, we also implemented the Probabilistic U-Net (Prob. U-Net) [145] to capture rater uncertainty. Our results show that for the quantile estimation task the performance of the Prob. U-Net and U-Net are comparable, so it appears unlikely that replacing the U-Net with its probabilistic form in the backbone would lead to significant improvements.

Figure 4.7: Top row: results of BQR U-Net delineation of lesion boundaries. Bottom row: results of the deterministic cross-entropy U-Net. (a) The original slice of the lung image; (b) estimated probability regions corresponding to the 0.125, 0.375, 0.625, and 0.875 quantile levels, shown in red, green, purple, and yellow, respectively; (c) the thresholded lesion boundaries from human raters corresponding to agreement between 1, 2, 3, and 4 raters.

Table 4.2: The mean (standard deviation) of the Dice coefficients between estimated probability regions P(Y = 1|X) ≥ α, where α ∈ {0.25, 0.5, 0.75, 1}; GT: ground truth, DT: deterministic, P = P(Y = 1|X).

                        P ≥ 0.25      P ≥ 0.5       P ≥ 0.75      P = 1
BQR U-Net vs GT         0.68 (0.27)   0.60 (0.34)   0.50 (0.40)   0.27 (0.35)
DT U-Net vs GT          0.67 (0.27)   0.59 (0.34)   0.50 (0.38)   0.32 (0.37)
DT U-Net vs BQR U-Net   0.81 (0.27)   0.63 (0.40)   0.40 (0.43)   0.16 (0.33)
Prob. U-Net vs GT       0.60 (0.26)   0.60 (0.39)   0.51 (0.43)   0.31 (0.36)

Figure 4.8: Violin plots of the Dice coefficients between quantiles (0.125, 0.375, 0.625, 0.875) and rater agreement maps for the test datasets. GT: ground truth; DT: binary cross-entropy (deterministic U-Net); QR: quantile regression (BQR U-Net).
The fractions of empty quantiles in the ground truth (excluded from the Dice coefficient computations) were 0.07, 0.31, 0.45, and 0.64, respectively. The width of each violin indicates the fraction of the dataset as a function of the Dice coefficient.

4.1.4 Conclusion

Quantile regression is a simple yet powerful method for estimating uncertainty in both supervised and unsupervised lesion detection. We proposed novel cost functions to apply quantile regression and capture confidence intervals for lesion segmentation. In the unsupervised framework we used the VAE, a popular model for unsupervised lesion detection [48, 30, 191]. VAEs can be used to estimate the reconstruction probability instead of the reconstruction error for anomaly detection tasks. For calculating the reconstruction probability, the VAE models the output as a conditionally independent Gaussian characterized by a mean and variance for each output dimension. Simultaneous estimation of the mean and the variance in the VAE underestimates the true variance, leading to instabilities in optimization [237]. For this reason, classical VAE formulations that include both mean and variance estimates are rarely used in practice; typically, only the mean is estimated, with the variance assumed constant [237]. To address this problem, we proposed an alternative quantile-regression model (QR-VAE) that improves the quality of variance estimation. We used quantile regression and leveraged the Gaussian assumption to obtain the mean and variance by estimating two quantiles. We showed that our approach outperforms the VAE, as well as Comb-VAE, an alternative approach to the same issue, on both a synthetic and a real-world dataset. Our approach also has a more straightforward implementation than Comb-VAE.

As a demonstrative application, we used our QR-VAE model to obtain a probabilistic, spatially heterogeneous threshold for a brain lesion detection task. This threshold results in a completely unsupervised lesion (or anomaly) detection method that avoids the need for a labeled validation dataset for the principled selection of a threshold that controls the false discovery rate. Beyond the current application, we note that quantile regression is applicable to deep learning models for medical imaging beyond the VAE and anomaly detection: the pinball loss function is easy to implement and optimize and can be applied to a broad range of network architectures and cost functions.

For supervised lesion detection, we presented deep binary quantile regression to estimate label uncertainty. Specifically, we use this technique to estimate quantiles of the labels that represent uncertainty; the lesion segmentations generated for each quantile reflect this uncertainty. Using the LIDC data with four annotations, we aimed to estimate the disagreement between the annotators. Although it has been reported that deep binary QR performs better on imbalanced datasets, in lesion segmentation tasks with extreme imbalance toward class zero (normal), warming up the model was still needed to prevent it from converging to the trivial solution. Our results show no significant improvement in terms of the Dice coefficient between ground truth and estimated areas of agreement for deep binary QR compared to a deterministic U-Net.
We found relatively small agreement between the two estimators for the 0.875 quantile region (row three of Table 4.2), demonstrating that although the two achieve similar performance, they are not annotating the same regions. This finding indicates the potential for further improvements in the performance of both methods. Based on the current results, while the numerical results are similar, the fact that with fewer training samples QR is less likely to diverge than the deterministic U-Net suggests that the QR approach may be more robust and stable.

We investigated the advantages of using quantile regression in both supervised and unsupervised settings. In the unsupervised framework, the estimated confidence interval is used to capture uncertainty, from which we can identify the outliers that represent the detected lesions; we demonstrated the advantage of this quantile regression approach in the VAE setting. In the supervised framework, we used BQR to estimate the uncertainty of raters for the case where multi-rater data are available for training (and testing).

4.2 Beta Quantile Regression for Robust Estimation of Uncertainty in the Presence of Outliers§

Quantile regression offers an alternative to mean regression in various applications where accurate predictions and their associated reliability are crucial. For instance, in clinical diagnosis, a realistic assessment of prediction uncertainty is essential for determining disease status and planning appropriate treatment. In the context of deep learning, two types of uncertainty are encountered: aleatoric and epistemic. Aleatoric uncertainty arises from the inherent stochasticity of the data, while epistemic uncertainty, often referred to as model uncertainty, is due to limitations in the model itself. It is worth noting that an infinite amount of training data would not reduce aleatoric uncertainty, although it could mitigate epistemic uncertainty. A multitude of methods exist for estimating these uncertainties, including Gaussian process regression, uncertainty-aware neural networks, Bayesian neural networks, and ensemble methods [237, 220, 92].

Recent studies have proposed using conditional quantile regression to estimate aleatoric uncertainty in neural networks [220, 248, 9, 8, 19] and showed that it can compute well-calibrated intervals. The most common application of quantile regression models is in cases where a parametric likelihood cannot be specified [281]. Similar to classical regression analysis, which estimates the conditional mean, the α-th quantile regression ($0 < \alpha < 1$) seeks a solution to the following minimization problem [281]:

$$\arg\min_\theta \sum_i \rho_\alpha\big(y_i - f_\theta(x_i)\big), \qquad (4.9)$$

where $x_i$ are the inputs, $y_i$ are the responses, $f$ is the model parameterized by $\theta$, and $\rho_\alpha$ is the check function or pinball loss [281], defined as:

$$\rho_\alpha\big(y_i - f_\theta(x_i)\big) = \begin{cases} \big(y_i - f_\theta(x_i)\big)\,\alpha & \text{if } y_i \geq f_\theta(x_i) \\ \big(f_\theta(x_i) - y_i\big)\,(1-\alpha) & \text{if } y_i < f_\theta(x_i). \end{cases}$$

It has been shown that minimization of the loss function in (4.9) is equivalent to maximizing the likelihood function formed by combining independently distributed asymmetric Laplace densities [281]:

$$\arg\max_\theta L(\theta) = \frac{\alpha(1-\alpha)}{\sigma}\exp\left(-\frac{\sum_i \rho_\alpha\big(y_i - f_\theta(x_i)\big)}{\sigma}\right),$$

where $\alpha$ is the quantile and $\sigma$ is the scale parameter.

Recently, quantile regression has been employed for uncertainty estimation in regression tasks such as image translation [11] and anomaly detection [19] in medical imaging.

§This work is in equal collaboration with Omar Zamzam from the University of Southern California.
In these domains, obtaining a reliable uncertainty estimate is of critical importance. Compared to alternative methods for uncertainty estimation, such as sampling using generative models [19] or Bayesian uncertainty estimation, quantile regression offers computational efficiency and speed and does not require sampling.

Statistical machine learning models that involve maximizing a likelihood are particularly sensitive to outliers [116]. Although quantile regression is quite robust to outlying response observations, it can be sensitive to outlying covariate observations (features). It has been shown that perturbing a single $(x_i, y_i)$ data point in an arbitrary manner can force all quantile regression hyperplanes to intersect at the perturbed point [179]. Despite this, only a limited number of papers have explored the robustness of quantile regression with respect to covariate observations, particularly within deep learning frameworks.

We outline our contributions in this paper as follows: (i) we propose a robust quantile regression approach that leverages concepts from robust divergence; (ii) we compare the performance of our proposed method, particularly in the presence of outliers, to existing techniques such as least trimmed quantile regression [179], which serves as the only available baseline, and robust regression methods that rely on regularization of case-specific parameters, on both a simple dataset and a simulated dataset; and (iii) to illustrate the practical utility of the proposed method, we apply it to a medical image translation task, employing state-of-the-art diffusion models.

4.2.1 Method

We start by briefly explaining the formulations of least trimmed quantile regression [179] and robust regression based on the regularization of case-specific parameters.

4.2.1.1 Least Trimmed Quantile Regression (TQR)

The objective function for TQR is defined as:

$$\arg\min_\theta \sum_{i \in I_C} \rho_\alpha\big(y_i - f_\theta(x_i)\big) \qquad (4.10)$$

where $I_C$ is the subset of $C$ samples from the training dataset that generates the smallest error. The optimization is similar to quantile regression with an additional iterative process: after initializing with $C$ random samples, at each iteration the samples with the smallest error are selected for training in the next iteration, and the process is repeated until there is no significant change in the loss value relative to the previous iteration. We utilized TQR within a gradient descent optimization framework, using only the subset of each batch with the lowest error for backpropagation.
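Within SGD, the trimming step amounts to backpropagating only through the lowest-error fraction of each batch. A PyTorch sketch, where keep_frac is our name for the trimming hyperparameter:

```python
import torch

def tqr_batch_loss(model, x, y, alpha, keep_frac=0.9):
    """Per-sample pinball loss, trimmed to the keep_frac fraction of the
    batch with the smallest loss; only that subset contributes gradients."""
    d = y - model(x)
    per_sample = torch.where(d > 0, alpha * d, (alpha - 1) * d)
    per_sample = per_sample.view(per_sample.shape[0], -1).mean(dim=1)
    k = max(1, int(keep_frac * per_sample.shape[0]))
    trimmed, _ = torch.topk(per_sample, k, largest=False)  # smallest errors
    return trimmed.mean()
```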
4.2.1.2 Robust Regression Based on Regularization of Case-Specific Parameters (RCP)

She and Owen [234] proposed a robust regression method that introduces case-specific indicator parameters into a mean-shift model together with regularization. Generalizing their method to quantile regression, the final loss can be simplified to:

$$\arg\min_{\theta, \gamma} \sum_i \rho_\alpha\big(y_i - f_\theta(x_i) - \gamma_i\big) + \lambda \sum_i |\gamma_i| \qquad (4.11)$$

This optimization can be solved using an alternating approach with soft thresholding. RCP can be used with any likelihood-based model.

4.2.1.3 β-Quantile Regression (β-QR)

For parameter estimation, maximizing the likelihood is equivalent to minimizing the KL divergence between the empirical distribution of the input and the statistical model $q(\phi)$. Similarly, a robust β-loss ($L_\beta$) can be derived by replacing the KL divergence with the β-divergence $D_\beta$ [27, 12, 6]:

$$D_\beta\big(f(x)\,\|\,g(x)\big) = \frac{1}{\beta}\int \big(f(x)^{\beta} - g(x)^{\beta}\big)f(x)\,dx - \frac{1}{\beta+1}\int \big(f(x)^{\beta+1} - g(x)^{\beta+1}\big)\,dx$$

$$L_\beta = -\frac{1}{N}\sum_i \frac{\exp\big(\beta\,l(x_i, q(\phi))\big) - 1}{\beta} + \frac{1}{\beta+1}\int q(\phi)^{\beta+1}$$

where $l(x_i, q(\phi))$ denotes the log-likelihood of observation $x_i$. This loss weights each observation according to the magnitude of its likelihood, mitigating the influence of outliers on model training [27]. In the case of quantile regression, the loss simplifies to:

$$L_{\beta\alpha} = \frac{1}{N}\sum_i \frac{1 - \exp\big(-\beta\,\rho_\alpha\big((y_i - f_\theta(x_i))/\sigma\big)\big)}{\beta} \qquad (4.12)$$

The hyperparameter $\sigma$ can be set to 1 for simplicity. This loss can be interpreted as an M-estimate, and the hyperparameter $\beta$ specifies the degree of robustness.
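A PyTorch sketch of the β-QR loss of equation 4.12, written in the minimization orientation: as β → 0 it reduces to the ordinary pinball loss, while larger β exponentially down-weights samples with large quantile error (outliers):

```python
import torch

def beta_pinball_loss(y, y_hat, alpha, beta, sigma=1.0):
    """Robust beta-quantile-regression loss (eq. 4.12)."""
    d = (y - y_hat) / sigma
    rho = torch.where(d > 0, alpha * d, (alpha - 1) * d)   # pinball loss
    return ((1.0 - torch.exp(-beta * rho)) / beta).mean()
```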
4.2.1.4 Quantile Regression with Diffusion Models for Regression Tasks

Diffusion probabilistic models [110] are composed of two essential processes: a forward process that gradually adds Gaussian noise to a data sample, and a reverse process that transforms Gaussian noise into the empirical data distribution through gradual denoising. Conditional diffusion models [228] incorporate input samples to condition the denoising process. Image translation problems can be modeled as conditional diffusion models, represented as $p(y|x)$, where $y$ is the target image and $x$ is the input conditioning image. In this paper, we deal with image translation problems where the input images $x$ are T1-weighted brain MRI images and the targets $y$ are the corresponding T2-weighted images. The diffusion model $f_\theta(x)$ is trained to recover T2-weighted images $y$ from Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$, conditioned on the input T1-weighted images $x$. For the details of diffusion and conditional diffusion models, we refer to works that provide a complete treatment of the mathematical formulations [181, 238, 228, 110, 227].

Instead of minimizing the mean squared error between the targets $y$ and the estimates $f_\theta(x)$, which yields a mean regression problem, the minimization problem in (4.9) is adopted to predict the α quantiles of the target images. We show that in the presence of outliers in the training set, replacing the loss function in (4.9) with the proposed loss function in (4.12) yields a model that is minimally affected by the outliers, making it closer to a model trained only on inlier samples. The details of the conducted experiments are presented in the following section.

4.2.2 Experiments and Results

In this section, we evaluate our proposed method on a simple real dataset, a simulation-based dataset, and a medical image translation problem.

4.2.2.1 Star Cluster CYG OB1

First, we consider a simple dataset on the star cluster CYG OB1, which was analyzed in [179]. This dataset consists of 47 observations, of which four high-leverage points do not follow the trend of the rest of the data. It has one explanatory variable, the logarithm of the effective temperature at the surface of the star; the response variable is the logarithm of the star's light intensity. The authors of [179] showed the efficacy of least trimmed quantile regression compared to quantile regression, using linear programming to find the model's parameters. Our goal, however, is to investigate robustness in neural networks, where the solution is computed using stochastic gradient descent (SGD).

We estimated the 0.25, 0.5, and 0.75 quantiles with a neural network, implementing the linear quantile regression problem as a one-layer neural network with linear activation. We then applied the three robust methods: TQR, RCP, and β-QR. We used gradient descent with the ADAM optimizer to train the network and chose the hyperparameters for each model (trimming percentage, λ, and β) using a grid search, with a batch size of 47 and 5000 iterations. The results are shown in Fig. 4.9. For a quantitative comparison of the models, we calculated the Frobenius norm between each estimated quantile and the solution learned using only the inliers (Table 4.3 and Fig. 4.9). The β-QR method shows the best performance among the methods.

Figure 4.9: Robust linear quantile regression using TQR, RCP, and β-QR for the star cluster CYG OB1 dataset (panels: quantile regression on the corrupted data, TQR, β-QR, RCP, and quantile regression on clean data; curves show the 0.25, 0.5, and 0.75 quantiles).

For optimizing the RCP cost, we used the Alternating Direction Method of Multipliers (ADMM), in which we split the objective into $\sum_i \rho_\alpha(y_i - f_\theta(x_i) - \gamma_i)$ and $\lambda \sum_i |\gamma_i|$. We optimized the former using gradient descent with the ADAM optimizer, and for the latter we used the proximal operator of the L1 objective:

$$\text{prox}_{\lambda,\ell_1}(x_i) := \begin{cases} x_i - \lambda & \text{if } x_i > \lambda \\ x_i + \lambda & \text{if } x_i < -\lambda \\ 0 & \text{otherwise.} \end{cases} \qquad (4.13)$$

We iterated between the optimization of the two components of the cost function until convergence.
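The γ update in equation 4.13 is the familiar soft-thresholding operator; a compact PyTorch sketch:

```python
import torch

def soft_threshold(x, lam):
    """Proximal operator of lam * |x| (eq. 4.13): shrink toward zero by lam.
    Used for the gamma update in the alternating RCP optimization."""
    return torch.sign(x) * torch.clamp(torch.abs(x) - lam, min=0.0)
```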
4.2.2.2 Toy Example for Uncertainty Estimation

Here we used the simple synthetic dataset introduced in [248], to which we added 1% outliers. Tagasovska and Lopez-Paz [248] applied Simultaneous Quantile Regression (SQR) to estimate aleatoric uncertainty, suggesting that all quantile levels be estimated simultaneously. We modeled the data using a three-layer neural network with ReLU activations and then applied the three robust methods TQR, RCP, and β-QR. We used SGD with the ADAM optimizer for training. We trained TQR and β-QR with a batch size of 128 for 500 epochs each; RCP was trained for ten epochs and 500 steps of iterative optimization. We evaluated the performance of the robust models for the 0.25, 0.5, and 0.75 quantiles. Our results show that β-QR estimates robust quantiles, with results comparable to TQR (Fig. 4.10).

Figure 4.10: Robust non-linear quantile regression for a toy example using a simple neural network: TQR (10% trimming), β-QR (β = 3), and RCP (L = 0.1), shown against the ground truth, the corrupted data, and the fitted model.

4.2.2.3 Quantile Regression for Uncertainty Estimation in Diffusion Models

In this section, we present an experiment that showcases the effectiveness of our proposed robust quantile regression approach in a medical imaging task. Specifically, we focus on the outlier problem in an image translation task, where we employ a diffusion model to predict various quantiles of T2-weighted brain MRI images from input T1-weighted images.

Figure 4.11: Estimating the T2 MRI quantiles QL(0.05), QM(0.5), and QH(0.95) from T1 MRI using diffusion models, comparing the quantiles estimated by the non-robust and robust (β-QR) models with those of the outlier-free model.

Table 4.3: Comparison of the performance of TQR, RCP, and β-QR. Each entry shows the Frobenius norm of the difference between the estimated quantiles and their (outlier-free) ground truth for the star cluster CYG OB1 dataset.

Method   CYG-Q1   CYG-Q2   CYG-Q3
TQR      1.04     1.12     1.12
RCP      3.43     4.82     3.74
β-QR     0.93     0.77     0.85

Our training dataset consists of two distinct groups of subjects: (i) lesion-free subjects from the Cam-CAN dataset (inliers) [250], representing individuals without any brain lesions, and (ii) lesioned subjects sourced from the BRATS dataset [174], who do have brain lesions, thereby introducing outliers into the dataset. Training the diffusion model solely on the Cam-CAN data using the loss function in (4.9) yields a reliable model that successfully captures the relationship between T1 images and the quantiles of T2 images. However, introducing the "outlier" lesioned brain images from the BRATS dataset into the training set while using the same loss function significantly perturbs the training process, resulting in a notably less reliable model and corrupted quantiles. To mitigate the adverse effects of the outlier samples and restore model reliability, we integrate the proposed robust loss function in (4.12) into the training process. This loss function down-weights the influence of outliers during training, effectively bringing the model's performance closer to that of the model trained solely on clean Cam-CAN data. The robust loss function was employed to train the model on the combined Cam-CAN and BRATS datasets. The results of this experiment are illustrated in Fig. 4.11, providing a qualitative comparison of the trained models, and Table 4.4 shows our quantitative results. These results demonstrate that including the robust loss function during model training significantly enhances the model's robustness to outliers, yielding a reliable model that closely approximates the performance of the model trained exclusively on clean data. We estimated the 0.05, 0.5, and 0.95 quantiles for this dataset. To compare the robust and non-robust models, we calculated: (1) the MSE between the estimated quantiles and the quantiles predicted by the outlier-free model; and (2) the MSE between the predicted median and the ground-truth T2 image. We tuned the β parameter using a validation set.

Table 4.4: Comparison of the performance of β-QR and TQR with the outlier-free and baseline (non-robust) models. The prediction error is the MSE between the ground-truth T2 image and the median predicted by each model; the quantile error is the MSE between each model's quantiles and those of the outlier-free model.

Method         Prediction error   Quantile error
Outlier free   0.0086             -
Baseline       0.0132             0.0097
β-QR           0.0074             0.0013
TQR            0.0107             0.0015

4.2.3 Conclusion

In this paper, we introduced a robust quantile regression approach designed to enhance the reliability of deep learning models in the presence of outliers. Our method leverages concepts from robust divergences to down-weight outlier influence during training. We demonstrated the effectiveness of our approach on a simple real dataset, showcasing its ability to improve quantile regression accuracy compared to existing robust quantile regression methods. Extending the application to medical imaging, the proposed approach proved effective in mitigating outlier effects when training a diffusion model to translate brain MRI from the T1-weighted to the T2-weighted modality, bringing its performance close to that of a model trained solely on clean data. These findings highlight the practical value of the proposed method, particularly in training scenarios compromised by outliers.

Chapter 5
Using Diffusion Models for In-painting, Segmentation, and Registration of Abnormal Brains

Image registration and segmentation play fundamental roles in preprocessing and analyzing brain MRI images. However, traditional deformable image registration methods assume a diffeomorphic spatial correspondence between images, which may not hold in medical imaging scenarios involving significant tissue changes due to disease progression or interventions.
This non-correspondence challenge has led to the development of innovative approaches to address the issue, including cost function masking, non-correspondence detection via intensity criteria, and conversion of pathological images to normal appearance. We focus on converting pathological images to normal appearance using diffusion models, which have shown promising results in various medical imaging applications. We propose a novel model that combines a registration module with a diffusion model to segment and inpaint abnormalities in T1 brain MRI images, and we introduce a fine-tuning strategy to enhance the in-painting capability of diffusion models. Our approach aims to generate a pseudo-healthy version of the lesioned brain, preserving the intact regions while replacing abnormal areas with synthetic, healthy-looking tissue. This pseudo-healthy brain can potentially improve the accuracy and reliability of subsequent preprocessing steps such as registration and segmentation, offering a promising solution to the non-correspondence challenge in medical image registration and segmentation. We compare our model's lesion delineation performance with 2D, 3D, and latent diffusion models, and we show segmentation of the lesioned brain using the BrainSuite software.

5.1 Introduction

Image registration is one of the first necessary preprocessing and analysis steps for brain MRI images. Image-to-atlas registration involves creating a mapping that aligns an individual patient's image with a reference atlas. This process enables the transformation of each patient's image into a shared coordinate system, facilitating population-based analyses. Additionally, this mapping can be employed to transfer prior atlas information to an individual subject, allowing tasks such as segmentation to be performed on the individual's data. However, standard methods in deformable image registration often operate under the assumption that a diffeomorphism exists between the images being aligned [295, 270, 241, 108]; that is, they presume the existence of a one-to-one, smoothly invertible map between the two images [257]. In many medical imaging scenarios, such as registration of pathological images to an atlas or longitudinal studies comparing preoperative and postoperative images, this assumption does not hold [284]. The fundamental and common assumptions of diffeomorphism and topology preservation in image registration are often violated when pathology, such as lesions, is present in the brain images, due to infiltration of normal tissue by pathological tissue, tissue death, and the emergence of tumor tissue. Additionally, large and non-smooth distortions can be caused by a growing tumor, which violates the usual assumption of smoothness of the deformation fields [284, 242]. These cases often involve significant changes in the morphology of the anatomical regions of interest in the brain due to disease progression, surgical interventions, or other factors. Nonetheless, registering images containing pathologies serves crucial purposes such as facilitating disease diagnosis, tracking, and treatment planning through atlas-based tissue segmentation [125, 42].

Figure 5.1: Processing of the lesioned brain.

Direct registration of pathology images, without considering these unique factors, can lead to inaccuracies. Specifically, the impact of focal tissue changes (like those seen in pathological alterations) can be overlooked.
The challenge of non-correspondence in medical image registration has led to the development of several innovative approaches, each addressing the issue in a unique way. These methods can be categorized into three main classes:

1. Cost function masking: In this approach, the regions of the images that do not correspond (non-corresponding regions) are first identified through segmentation. These segmented areas are then "masked", i.e., excluded from the image similarity measure during the optimization process of image registration. This ensures that non-corresponding regions do not adversely affect registration accuracy [41, 139].

2. Non-correspondence detection via intensity criteria: This category combines segmentation with registration, aiming to detect non-corresponding regions based on intensity criteria during the registration process itself [47, 17, 159].

3. Conversion of pathological images to normal appearance: These methods alter an image's pathological or abnormal areas to make them appear as normal tissue. The idea is to guide the registration process by reconstructing or replacing the focal abnormal areas with what is considered normal tissue. In this context, techniques such as low-rank and sparse image decomposition, as well as generative models, have been employed [102, 277].

Cost function masking requires ground truth or accurate labels during registration and may decrease alignment accuracy, particularly when the focal area is large. Non-correspondence detection methods, which typically rely on a carefully designed loss function, are very sensitive to the dataset, making it difficult to find a unified set of parameters. Therefore, our interest lies in converting pathological images to normal appearance using diffusion models, which generate state-of-the-art results in image generation across various applications, including the medical imaging domain. Various modifications of diffusion models have been widely used for unsupervised lesion segmentation by generating a pseudo-healthy version of the brain [271, 31, 275, 198]. However, with few exceptions, these have been applied to 2D slices of brain MRI in the T2 modality, using only one dataset with fewer than 500 subjects for training, thereby limiting the approach's generalizability at scale. When generating a pseudo-healthy version of the brain, the model can benefit from learning the structure of the whole brain to better in-paint the missing information; moreover, 2D backbone-driven approaches can easily produce volumetric inconsistencies. In T1-weighted imaging, identifying and delineating lesions can be more difficult than in T2-weighted imaging, since lesions often present as more than just areas of hyperintensity. This chapter aims to employ a diffusion model capable of selectively inpainting abnormal tissue with healthy equivalents while preserving the intact, healthy regions of the original brain. To achieve this, we utilize the T1 modality, owing to its superior performance in most preprocessing software and the greater availability of training data.

Figure 5.2: Registering the lesioned brain to a normal version of itself reconstructed using a diffusion model, then thresholding the determinant of the Jacobian of the registration deformation field against a normative validation set of Jacobian determinants to find the anomalous part of the lesioned brain.

The model first identifies and differentiates between healthy and abnormal tissues.
To identify the brain's abnormal tissue, we proposed an innovative approach (RegPaint) in which we combine a diffusion model with a registration module. Once these regions are defined, the diffusion model is applied for in-painting, replacing the abnormal areas with synthetic, healthy-looking tissue and resulting in a pseudo-healthy brain that represents a good surrogate for processing. The resulting pseudo-healthy brain can then be subjected to further preprocessing steps like registration and segmentation, potentially offering more accurate and reliable results when dealing with lesioned brains. Our contributions can be summarized as follows:

1. We developed a novel model for generating a surrogate brain that combines a registration module with a diffusion model for segmenting and in-painting abnormalities in the T1 brain.

2. We proposed a fine-tuning strategy to improve the in-painting capability of diffusion models on 3D brains.

3. We compared the lesion delineation capability of our model with 2D, 3D, and latent diffusion models.

5.2 Method

Our aim is to generate a pseudo-healthy version of a lesioned brain and segment the abnormality. First, we generate an individualized atlas for a specific subject by applying a forward-backward step with a diffusion model: noise is added to the brain image, and the image is then denoised back to a normal brain that is similar to the original image. Next, we register the lesioned brain to its generated individualized atlas and calculate the Jacobian determinant of the deformation field. We threshold this Jacobian determinant using a validation set of normative Jacobian determinants, generated by deforming healthy brains to individualized atlases produced in the same way. The Jacobian determinant of a deformation provides insight into the local properties of the transformation applied to the image: a value greater than 1 indicates local expansion, a value equal to 1 means no local volume change, and a value less than 1 indicates local contraction. After that, we segment the lesion and in-paint it on the moved image to obtain the surrogate brain, which is then processed by the preprocessing software (we used BrainSuite). The segmentation labels can be mapped back to the original image by inverting the deformation field (DDF^{-1}) and applying it to the generated labels. A detailed explanation of each module is provided below. For both the diffusion model and the registration model, we used a U-Net with three main levels, with an attention mechanism integrated into the third stage of the model.

Figure 5.3: The input of the network is the noisy image, which is a combination of the noisy localized input and the non-masked input, together with the mask. The network's task is to predict the noise.

5.2.1 Datasets

We integrated T1w images from three publicly available datasets: the Human Connectome Project (HCP)∗, the Cambridge Centre for Ageing and Neuroscience (Cam-CAN)†, and the UK Biobank (UKB) [245], along with the IXI‡ dataset.

UK Biobank (UKB): The UKB is a large-scale biomedical database and research resource containing in-depth genetic and health information from half a million UK participants. The study aims to improve the prevention, diagnosis, and treatment of a wide range of serious and life-threatening illnesses. We utilized a subset of 5,000 random participants with T1w images out of 45,564 subjects, all of whom were healthy individuals aged between 44 and 82 years.
Human Connectome Project (HCP): The HCP aims to construct a detailed map of the human brain's structural and functional connections. Our study incorporated 1,113 subjects from the HCP dataset, aged 22-35 years, all of whom underwent minimal preprocessing. This preprocessing includes spatial artifact/distortion removal, surface generation, cross-modal registration, and alignment to standard space, ensuring a high standard of data quality for our analyses.

Cambridge Centre for Ageing and Neuroscience (Cam-CAN): The Cam-CAN dataset is a resource for studying cognitive and brain aging over the adult lifespan. We included T1-weighted MRI images from 653 adults aged 18-88. This subset, known as "CC700", provides a wide age range.

IXI Dataset: The IXI dataset consists of 560 pairs of T1- and T2-weighted brain MRI scans acquired in three different London hospitals. From this dataset, we used 158 samples for testing and partitioned the remainder into 358 training samples and 44 validation samples, following Behrendt et al. [31]. This dataset enhances the diversity of our data, adding more depth to our analysis across different populations and imaging settings.

∗ https://humanconnectome.org
† https://www.cam-can.org
‡ https://brain-development.org/ixi-dataset

Each dataset contributes a unique demographic and technical perspective to our study, providing a robust and comprehensive dataset for training our models. The integration of these diverse datasets is expected to enhance the generalizability and reliability of our findings. To assess the performance and adaptability of our in-painting modules, we employed two distinct datasets, each reflecting a unique population type: the test subset of the IXI dataset and the Neurofeedback Skull-stripped (NFBS) dataset. The IXI dataset served as a benchmark for evaluating the in-painting model's performance on data similar to the training distribution, providing a baseline for expected performance under ideal conditions.

Figure 5.4: In-painting the identified part of the lesioned brain using a masked diffusion model to obtain a completely normal brain that can be easily processed by any software, and mapping the generated processing back to the space of the original brain using the inverse of the deformation field obtained in the first step, when the lesioned brain was registered to the new normal space.

The Neurofeedback Skull-stripped (NFBS) dataset§, available from the Preprocessed Connectomes Project repository, includes 125 manually skull-stripped T1-weighted anatomical MRI scans of individuals aged 21 to 45 years. Each scan in this dataset has been meticulously checked to ensure no brain abnormalities, as confirmed by a board-certified neuroradiologist. This dataset was selected to test the model's performance on out-of-distribution healthy brain images. It provides insight into the model's generalizability and its ability to maintain performance when confronted with healthy brain images that are structurally different from those seen during training.

§ http://preprocessed-connectomes-project.org

For testing lesion detection and segmentation on the generated pseudo-healthy images, we used 100 subjects from BraTS21. The Brain Tumor Segmentation Challenge 2021 (BraTS21) dataset [24] was employed as representative of an unhealthy, out-of-distribution population. It comprises 1,251 brain MRI scans in four different weightings, including T1, T1-CE, T2, and FLAIR, of which 100 random samples were used in this study. Each scan in this dataset is accompanied by expert annotations in the form of pixel-wise segmentation maps, providing a detailed ground truth for evaluating the model's performance.

5.2.2 Preprocessing

To align all MRI scans, we register the brain scans to the SRI24 atlas [218] by affine transformation. Next, we apply skull stripping with HD-BET [126]. Note that these steps are already applied to the BraTS21 dataset by default. Subsequently, we remove black borders, leading to a fixed resolution of [192 × 192 × 160] voxels. Lastly, we perform a bias field correction. To save computational resources, we reduce the volume resolution by a factor of two, resulting in [96 × 96 × 80] voxels. To normalize the intensity, we divided each image by the highest peak in its histogram.
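A minimal sketch of this normalization step is given below; restricting the histogram to nonzero (brain) voxels and the bin count are assumptions of this illustration.

import numpy as np

def normalize_by_histogram_peak(volume, nbins=256):
    # Normalize intensities by the location of the highest histogram peak,
    # computed over nonzero voxels so the background bin does not dominate.
    voxels = volume[volume > 0]
    counts, edges = np.histogram(voxels, bins=nbins)
    k = int(np.argmax(counts))
    peak = 0.5 * (edges[k] + edges[k + 1])  # center of the tallest bin
    return volume / peak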
5.2.3 Generating a pseudo-healthy individualized atlas using a diffusion model

Diffusion models [110] are a class of generative models that learn to generate data by reversing a diffusion process. The diffusion process gradually adds noise to the data until it turns into a Gaussian distribution; the generative model then learns to reverse this process to construct data from noise. In the forward diffusion process, data is gradually corrupted by adding Gaussian noise over a sequence of time steps. This can be represented by the following equation:

x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1 - \alpha_t}\, \epsilon, \qquad (5.1)

where:
• x_t is the data at time step t,
• α_t represents the noise scale at each step, and
• ϵ is a random noise vector sampled from a standard Gaussian distribution N(0, I).

Figure 5.5: Lesion segmentation results for RegPaint (ours), the diffusion model with 300-level noise (DF-300), and the diffusion model with 500-level noise (DF-500). For the diffusion models, the left column shows the segmentation and the right column the reconstruction error. GT: ground truth.

As t increases, the data x_t becomes increasingly similar to pure noise. The reverse process aims to reconstruct the original data from the noise: it iteratively estimates the noise that was added at each step and removes it. The update rule is

x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right), \qquad (5.2)

where \bar{\alpha}_t is the product of all α values up to time t, and ϵ_θ(x_t, t) is the noise predicted by the neural network with parameters θ.

Optimization objective: The model is trained to minimize the difference between the true noise and the predicted noise using the loss function

L(\theta) = \mathbb{E}_{t, x_0, \epsilon} \big[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \big], \qquad (5.3)

where L(θ) is the loss for the neural network parameters θ, and the expectation is taken over the time steps t, the original data x_0, and the noise ϵ.

As the baseline pseudo-healthy image, which is also used as an atlas for registration, we train the diffusion model on the healthy population, add level-999 noise to the lesioned image, and then sample through the reverse process; we call the resulting image the individualized atlas.

Figure 5.6: In-painting results comparing RePaint and our fine-tuned model.
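A minimal Python sketch of the forward noising (5.1) and a single reverse update (5.2) underlying this forward-backward atlas generation is given below; eps_model stands in for the trained noise-prediction network, alpha and alpha_bar are precomputed 1-D tensors of the noise schedule, and the forward step uses the standard closed-form jump implied by iterating (5.1). The stochastic term added for t > 0 in full DDPM sampling is omitted for brevity.

import torch

def forward_noise(x0, t, alpha_bar, noise=None):
    # Closed-form jump from x_0 to x_t implied by (5.1):
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps.
    if noise is None:
        noise = torch.randn_like(x0)
    return alpha_bar[t].sqrt() * x0 + (1.0 - alpha_bar[t]).sqrt() * noise

@torch.no_grad()
def reverse_step(eps_model, x_t, t, alpha, alpha_bar):
    # One denoising update following (5.2).
    eps = eps_model(x_t, t)
    return (x_t - (1.0 - alpha[t]) / (1.0 - alpha_bar[t]).sqrt() * eps) / alpha[t].sqrt()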
5.2.4 Registration assistant module

The registration assistant module is central to our methodology, leveraging a network reg_ϕ(·) trained on normal brain subjects. This network is trained to deform one (moving) brain image x to match a target image y, with the objective of minimizing the mean squared error between the deformed and target images. A critical aspect of this training is regularization of the network using the bending energy of the deformation field, which ensures a balance between the accuracy of the registration and the tractability of the deformation fields generated to achieve it. The loss function used to train the model is

L_{reg}(\phi) = \sum_{i=1}^{N} \| y_i - reg_\phi(x_i) \|_2^2 + \lambda \cdot \text{BendingEnergy}(reg_\phi),

where N is the number of training samples and λ weights the regularization term; in this training, λ was chosen to be 1. The network was trained with moving and target images randomly sampled from a pool of healthy T1 brain images. Upon completion of training, the model is employed on a validation set of 44 normal IXI subjects, where each brain is registered to its corresponding individualized atlas, generated using the method explained in Section 5.2.3. The pool of Jacobian determinants from these registrations constitutes a validation set of normative Jacobian determinants, against which the registrations of test subjects are compared for thresholding and identifying abnormalities. The Jacobian determinant is used because it is an accurate measure of any shrinkage or expansion of the original volume during the deformation. In the context of lesioned brain imaging, shrinkage and expansion are the primary alterations the registration network is expected to "cause" in abnormal regions of the brain, so analyzing the Jacobian determinant is a reasonable and effective way to detect these changes. This collection of Jacobian determinant maps, derived from normal subjects, establishes a normative framework against which abnormalities can be detected and assessed. The precision afforded by this approach ensures a robust basis for subsequent analyses identifying and characterizing brain abnormalities.

Table 5.1: Comparison of diffusion models on the BraTS dataset.

Metric             2D D-500   3D D-500   3D LD-500   RegPaint (ours)   3D D-300   3D D-700
Dice               0.095      0.115      0.078       0.30              0.132      0.100
SSIM (non-lesion)  0.74       0.78       0.72        0.98              0.87       0.72

5.2.5 Abnormality Localization and Processing

The essence of abnormality localization is the application of our trained registration network to the task of aligning lesioned brains with their pseudo-healthy counterparts. The lack of true correspondence in abnormal regions of the brain leads the registration network to generate a deformation field (DDF) characterized by extreme values in these specific regions. By examining the Jacobian determinant of this field and contrasting it with the normative baseline of 44 Jacobian determinant maps from normal subjects, we identify anomalies: a threshold, set at the 95th percentile of the Jacobian determinants from the normal cohort, yields a binary map that indicates the location of abnormalities within the brain. Following this localization, we apply the in-painting module; the result is a brain image that remains identical to the original in all but the abnormal regions, where pathological tissues are replaced with normal ones.
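A minimal sketch of this Jacobian-determinant screening is given below, assuming the displacement field is sampled on the voxel grid with unit spacing; the one-sided (expansion-only) threshold mirrors the 95th-percentile rule described above.

import numpy as np

def jacobian_determinant(ddf):
    # ddf: displacement field of shape (3, X, Y, Z) in voxel units.
    # The deformation is phi(v) = v + ddf(v), so its Jacobian is
    # I + d(ddf)/dv, estimated here with finite differences.
    grads = np.stack([np.stack(np.gradient(ddf[c]), axis=0) for c in range(3)], axis=0)
    jac = grads + np.eye(3)[:, :, None, None, None]
    return np.linalg.det(np.moveaxis(jac, (0, 1), (-2, -1)))

def anomaly_mask(jac_det, normative_dets, pct=95):
    # Binary abnormality map: voxels whose determinant exceeds the 95th
    # percentile of determinants pooled from healthy validation registrations.
    tau = np.percentile(np.concatenate([d.ravel() for d in normative_dets]), pct)
    return jac_det > tau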
For more coherent in-painting and using the non-masked information, we generated the noisy input as a combination of noisy localized input and non-masked input: xt = √ αtxt−1 + √ 1 − αtϵ, x ′ t = xt ⊙ m + x0 ⊙ (1 − m) (5.4) Where m is the inpainting mask. x ′ t and mask m are the input of the network. Following equation 5.3 we train a network ϵθ to predict the noise ϵ from the noisy xt : L(θ) = Et,x0,ϵ h ∥ϵ − ϵθ(x ′ t , t, m)∥ 2 i , (5.5) We combine the in-painting and the non-masked information in the reverse process. x unknown t−1 = 1 √ αt xt − 1 − αt √ 1 − α¯t ϵθ(xt , t) , xt−1 = x0 ⊙ m + x unknown t−1 ⊙ 1 − m (5.6) To generalize irregular masks, we developed a module for 3D mask generation. The 3D mask is created by first forming a basic sphere-like mesh with vertices. This mesh is then refined by adding more vertices, increasing its complexity. To give the sphere an irregular and natural appearance, Perlin noise is applied to each vertex, slightly altering their positions. This creates a random, uneven surface. Finally, to make sure the mask is solid and not hollow, any empty spaces inside the 3D structure are filled in. We compared this module with a "Repaint" method [166] built on a pre-trained unconditional diffusion model and proposed to resample in each reverse step. Finally, we are able to process this reconstructed brain using standard software normally, then apply the inverse of the initial deformation field (DDF −1 ) to map the processed information back to the original 160 Figure 5.7: BrainSuit Processing of the lesioned brain, Surrogate brain, and Surrogate brain moved back to original space. The arrows show that the right temporal lobe in the original brain has been mislabelled. The medial temporal gyrus and temporal pole have been completely misidentified in the original brain. brain’s spatial configuration. The end product is a comprehensively processed image of the lesioned brain, with the abnormal regions distinctly marked, leveraging the binary map generated earlier. This methodological approach not only enhances the precision of brain imaging analysis but also significantly contributes to the understanding and treatment of brain abnormalities. 5.3 Results and Disscussion 5.3.0.1 Evaluating lesion detection First, we evaluate our lesion detection performance on a synthetic dataset. We also evaluated our lesion detection method on BraTS21 T1W images. Since our final goal is to (1) localize and (2) inpaint the lesion, 161 Figure 5.8: BrainSuit Processing of the lesioned brain, Surrogate brain, and Surrogate brain moved back to original space in 3D we calculated the dice score for the lesion and also the Structural Similarity Index (SSIM) of reconstruction for the non-lesion areas. We compared our model with the 2D diffusion model, the 3D diffusion model, and the latent diffusion model using the reconstruction error. For all three baseline models, we segmented the lesion by first adding noise with 500 steps noise and denoise it in the reverse process[271]. The thresholding was based on the healthy population validation set reconstruction error for the 2D diffusion model, 3D diffusion model, and latent diffusion model. We chose the threshold which is higher than the 0.95 percentile of the healthy validation set error. The performance of 3D diffusion is sensitive to the level of noise for the forward-backward process(table 5.1 and figure 5.5). 
To generalize to irregular masks, we developed a module for 3D mask generation. The 3D mask is created by first forming a basic sphere-like mesh, which is then refined by adding more vertices to increase its complexity. To give the sphere an irregular, natural appearance, Perlin noise is applied to each vertex, slightly altering its position and creating a random, uneven surface. Finally, to ensure that the mask is solid rather than hollow, any empty spaces inside the 3D structure are filled in. We compared this module with the "RePaint" method [166], which builds on a pre-trained unconditional diffusion model and proposes resampling at each reverse step. Finally, we are able to process the reconstructed brain normally using standard software and then apply the inverse of the initial deformation field (DDF^{-1}) to map the processed information back to the original brain's spatial configuration. The end product is a comprehensively processed image of the lesioned brain, with the abnormal regions distinctly marked using the binary map generated earlier. This methodological approach not only enhances the precision of brain imaging analysis but also contributes to the understanding and treatment of brain abnormalities.

Figure 5.7: BrainSuite processing of the lesioned brain, the surrogate brain, and the surrogate brain moved back to the original space. The arrows show that the right temporal lobe in the original brain has been mislabelled; the medial temporal gyrus and temporal pole have been completely misidentified in the original brain.

5.3 Results and Discussion

5.3.0.1 Evaluating lesion detection

First, we evaluate our lesion detection performance on a synthetic dataset; we also evaluated our lesion detection method on BraTS21 T1w images. Since our final goal is to (1) localize and (2) inpaint the lesion, we calculated the Dice score for the lesion as well as the Structural Similarity Index (SSIM) of the reconstruction for the non-lesion areas. We compared our model with the 2D diffusion model, the 3D diffusion model, and the latent diffusion model, which all use the reconstruction error. For all three baseline models, we segmented the lesion by first adding noise for 500 steps and then denoising in the reverse process [271]. For the 2D, 3D, and latent diffusion models, thresholding was based on the reconstruction error over the healthy-population validation set: we chose the threshold at the 95th percentile of the healthy validation set error. The performance of the 3D diffusion model is sensitive to the level of noise used in the forward-backward process (Table 5.1 and Figure 5.5).

Figure 5.8: BrainSuite processing of the lesioned brain, the surrogate brain, and the surrogate brain moved back to the original space, shown in 3D.

Since we use the deformation field for segmenting the lesion, our framework is less sensitive to the level of noise as long as the lesion has been removed completely from the individualized atlas (the reconstruction of the diffusion model). Our results in Table 5.1 show that we significantly improved lesion detection performance in terms of the Dice coefficient, and the SSIM also improved significantly. For the diffusion models, this SSIM is calculated on the non-lesion part of the brain, between the reconstruction and the original brain; in this case the reconstruction can serve as an alternative surrogate brain, and changing the non-lesion tissue would result in inaccurate segmentation. For RegPaint, we calculated the SSIM between the input image and the final inpainted image mapped back to the subject space.
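For reference, a minimal sketch of the two scores used in this evaluation is given below; zeroing the lesion region before computing SSIM is a simplification of restricting the comparison to non-lesion tissue.

import numpy as np
from skimage.metrics import structural_similarity

def dice_score(pred_mask, gt_mask, eps=1e-8):
    # Dice = 2 |A ∩ B| / (|A| + |B|).
    inter = np.logical_and(pred_mask, gt_mask).sum()
    return 2.0 * inter / (pred_mask.sum() + gt_mask.sum() + eps)

def nonlesion_ssim(recon, original, lesion_mask):
    # SSIM of the reconstruction against the original with the lesion
    # zeroed out, so only healthy tissue drives the score.
    keep = (~lesion_mask.astype(bool)).astype(recon.dtype)
    return structural_similarity(recon * keep, original * keep,
                                 data_range=float(original.max() - original.min()))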
The comparative analysis of our model against various diffusion models, including 2D, 3D, and latent diffusion models, demonstrates its superior performance in lesion delineation. In addition, we demonstrated that these inpainted images enable preprocessing software to be applied effectively on lesioned brains, resulting in reasonable atlas-based segmentation. 164 Algorithm 1 lesioned brain Processing Pipeline 1: Input: lesioned brain image Iabnormal 2: Output: Processed brain image in abnormal space Iprocessed−abnormal 3: Initialize diffusion model Mdif fuse 4: Initialize registration network Mregister 5: Initialize deformation field D 6: Initialize Jacobian determinant pool Jpool 7: Initialize threshold τ at 95th percentile of Jpool 8: procedure DenoiseBrain(Iabnormal) 9: Inoisy ← Mdif fuse.addNoise(Iabnormal, levels = 999) 10: Inormal ← Mdif fuse.denoise(Inoisy) 11: return Inormal 12: end procedure 13: procedure RegisterBrains(Iabnormal, Inormal) 14: D, Iregistered ← Mregister.register(Iabnormal, Inormal) 15: J ← det(Jac(D)) 16: return J, Iregistered 17: end procedure 18: procedure Threshold(J) 19: M ask ← J > τ 20: return M ask 21: end procedure 22: procedure InpaintAbnormality(Iregistered, M ask) 23: Iin−painted ← Mdif fuse.maskAndInpaint(Iregistered, M ask) 24: return Iin−painted 25: end procedure 26: procedure ProcessImage(Iin−painted) 27: Iprocessed ← BrainSoftware.process(Iin−painted) 28: return Iprocessed 29: end procedure 30: procedure ApplyInverseDeformation(Iprocessed, D) 31: Iprocessed−abnormal ← D−1 (Iprocessed) 32: return Iprocessed−abnormal 33: end procedure 34: Inormal ← DenoiseBrain(Iabnormal) 35: J, Iregistered ← RegisterBrains(Iabnormal, Inormal) 36: M ask ← Threshold(J) 37: Iin−painted ← InpaintAbnormality(Iregistered, M ask) 38: Iprocessed ← ProcessImage(Iin−painted) 39: Iprocessed−abnormal ← ApplyInverseDeformation(Iprocessed, D) 165 Chapter 6 Conclusion This proposal aims to provide novel deep learning (DL) methods that address the difficulty of applying conventional approaches and recent DL methods on real-world datasets such as biomedical datasets due to imperfect and limited data. More precisely, the result of this research expect to achieve improvement in three major ongoing research areas: 1. robustness: develop robust DL models to be applied in medical imaging research with complex training data. This research developed robust loss both in supervised and unsupervised setting to handle the outliers in the training set. 2. generalizability: build DL methods to increase generalizability of DL methods to be able to aggregate datasets from different studies to increase the statistical power of medical studies. The developed robust loss used in a transfer learning frame-work for unsupervised lesion detection. 3. uncertainty estimation: improve DL models with estimating the uncertainty to calculate a risk along with the decision. Deep quantile regression used for capturing aleatoric uncertainty both in unsupervised and supervised setting. Our tools and methods will also be useful in other applications that frequently suffer from poor quality training data such as network traffic modeling and speech recognition. 166 Bibliography [1] ACCS. UNSW-NB15. https://www.unsw.adfa.edu.au/unsw-canberra-cyber/cybersecurity/ADFA-NB15-Datasets/. [Online; accessed May 2020]. [2] Adelola Adeloye and E Latunde Odeku. “The radiology of missile head wounds”. In: Clinical radiology 22.3 (1971), pp. 312–320. [3] Charu C Aggarwal. “Outlier analysis”. In: Data mining. Springer. 2015, pp. 
237–263. [4] Amit Agrawal, Jake Timothy, Lekha Pandit, and Murali Manju. “Post-traumatic epilepsy: an overview”. In: Clinical neurology and neurosurgery 108.5 (2006), pp. 433–439. [5] H Akrami, RM Leahy, A Irimia, PE Kim, CN Heck, and AA Joshi. “Neuroanatomic Markers of Posttraumatic Epilepsy Based on MR Imaging and Machine Learning”. In: American Journal of Neuroradiology 43.3 (2022), pp. 347–353. [6] Haleh Akrami, Sergul Aydore, Richard M Leahy, and Anand A Joshi. “Robust Variational Autoencoder for Tabular Data with Beta Divergence”. In: arXiv preprint arXiv:2006.08204 (2020). [7] Haleh Akrami, Andrei Irimia, Wenhui Cui, Anand A Joshi, and Richard M Leahy. “Prediction of posttraumatic epilepsy using machine learning”. In: Medical Imaging 2021: Biomedical Applications in Molecular, Structural, and Functional Imaging. Vol. 11600. SPIE. 2021, pp. 424–430. [8] Haleh Akrami, Anand Joshi, Sergul Aydore, and Richard Leahy. “Deep Quantile Regression for Uncertainty Estimation in Unsupervised and Supervised Lesion Detection”. In: arXiv preprint arXiv:2109.09374 (2021). [9] Haleh Akrami, Anand Joshi, Sergul Aydore, and Richard Leahy. “Quantile Regression for Uncertainty Estimation in VAEs with Applications to Brain Lesion Detection”. In: International Conference on Information Processing in Medical Imaging. Springer. 2021, pp. 689–700. [10] Haleh Akrami, Anand Joshi, Jian Li, Sergul Aydore, and Richard Leahy. “Brain Lesion Detection Using a Robust Variational Autoencoder and Transfer Learning”. In: Proc. ISBI, 2020. 2020. 167 [11] Haleh Akrami, Anand A Joshi, Sergül Aydöre, and Richard M Leahy. “Deep Quantile Regression for Uncertainty Estimation in Unsupervised and Supervised Lesion Detection”. In: The journal of machine learning for biomedical imaging 1 (2022). [12] Haleh Akrami, Anand A Joshi, Jian Li, Sergul Aydore, and Richard M Leahy. “A robust variational autoencoder using beta divergence”. In: Knowledge-Based Systems 238 (2022), p. 107886. [13] Haleh Akrami, Anand A Joshi, Jian Li, Sergul Aydore, and Richard M Leahy. “Brain Lesion Detection Using A Robust Variational Autoencoder and Transfer Learning”. In: (2020), pp. 786–790. [14] Haleh Akrami, Anand A Joshi, Jian Li, Sergul Aydore, and Richard M Leahy. “Robust variational autoencoder”. In: arXiv preprint arXiv:1905.09961 (2019). [15] Haleh Akrami, Richard M Leahy, Andrei Irimia, Paul E Kim, Christianne Heck, and Anand Joshi. “Neuroanatomic markers of post-traumatic epilepsy based on magnetic resonance imaging and machine learning”. In: medRxiv (2020). [16] Jinwon An and Sungzoon Cho. “Variational autoencoder based anomaly detection using reconstruction probability”. In: Special Lecture on IE 2 (2015), pp. 1–18. [17] Julia Andresen, Timo Kepp, Jan Ehrhardt, Claus von der Burchard, Johann Roider, and Heinz Handels. “Deep learning-based simultaneous registration and unsupervised non-correspondence segmentation of medical images with pathologies”. In: International Journal of Computer Assisted Radiology and Surgery 17.4 (2022), pp. 699–710. [18] F Angeleri, J Majkowski, G Cacchio, A Sobieszek, S D’acunto, R Gesuita, A Bachleda, G Polonara, L Krolicki, M Signorino, et al. “Posttraumatic epilepsy risk factors: one-year prospective study after head injury”. In: Epilepsia 40.9 (1999), pp. 1222–1230. [19] Anastasios N Angelopoulos, Amit Pal Kohli, Stephen Bates, Michael Jordan, Jitendra Malik, Thayer Alshaabi, Srigokul Upadhyayula, and Yaniv Romano. 
“Image-to-image regression with distribution-free uncertainty quantification and applications in imaging”. In: International Conference on Machine Learning. PMLR. 2022, pp. 717–730. [20] The UCI KDD Archive. KDD Cup 1999 Data. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html. [Online; accessed May 2020]. 1999. [21] Samuel G Armato III, Geoffrey McLennan, Luc Bidaut, Michael F McNitt-Gray, Charles R Meyer, Anthony P Reeves, Binsheng Zhao, Denise R Aberle, Claudia I Henschke, Eric A Hoffman, et al. “The lung image database consortium (LIDC) and image database resource initiative (IDRI): a completed reference database of lung nodules on CT scans”. In: Medical physics 38.2 (2011), pp. 915–931. [22] Salim Arslan, Sofia Ira Ktena, Antonios Makropoulos, Emma C Robinson, Daniel Rueckert, and Sarah Parisot. “Human brain mapping: A systematic comparison of parcellation methods for the human cerebral cortex”. In: Neuroimage 170 (2018), pp. 5–30. [23] John Ashburner. “A fast diffeomorphic image registration algorithm”. eng. In: NeuroImage 38.1 (Oct. 2007), pp. 95–113. issn: 1053-8119. doi: 10.1016/j.neuroimage.2007.07.007. 168 [24] Ujjwal Baid, Satyam Ghodasara, Suyash Mohan, Michel Bilello, Evan Calabrese, Errol Colak, Keyvan Farahani, Jayashree Kalpathy-Cramer, Felipe C Kitamura, Sarthak Pati, et al. “The rsna-asnr-miccai brats 2021 benchmark on brain tumor segmentation and radiogenomic classification”. In: arXiv preprint arXiv:2107.02314 (2021). [25] Spyridon Bakas, Hamed Akbari, Aristeidis Sotiras, Michel Bilello, Martin Rozycki, Justin S Kirby, John B Freymann, Keyvan Farahani, and Christos Davatzikos. “Advancing The Cancer Genome Atlas glioma MRI collections with expert segmentation labels and radiomic features”. In: Scientific Data 4.1 (2017), p. 170117. issn: 2052-4463. doi: 10.1038/sdata.2017.117. [26] Spyridon Bakas, Mauricio Reyes, Andras Jakab, Stefan Bauer, Markus Rempfler, Alessandro Crimi, Russell Takeshi Shinohara, Christoph Berger, Sung Min Ha, Martin Rozycki, et al. “Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the BRATS challenge”. In: arXiv preprint arXiv:1811.02629 (2018). [27] Ayanendranath Basu, Ian R Harris, Nils L Hjort, and MC Jones. “Robust and efficient estimation by minimising a density power divergence”. In: Biometrika 85.3 (1998), pp. 549–559. [28] Ayanendranath Basu and Sahadeb Sarkar. “The trade-off between robustness and efficiency and the effect of model smoothing in minimum disparity inference”. In: Journal of statistical computation and simulation 50.3-4 (1994), pp. 173–185. [29] Maxime O Baud, Jonathan K Kleen, Emily A Mirro, Jason C Andrechak, David King-Stephens, Edward F Chang, and Vikram R Rao. “Multi-day rhythms modulate seizure risk in epilepsy”. In: Nature communications 9.1 (2018), pp. 1–10. [30] Christoph Baur, Benedikt Wiestler, Shadi Albarqouni, and Nassir Navab. “Deep autoencoding models for unsupervised anomaly segmentation in brain mr images”. In: International MICCAI Brainlesion Workshop. Springer. 2018, pp. 161–169. [31] Finn Behrendt, Debayan Bhattacharya, Julia Krüger, Roland Opfer, and Alexander Schlaefer. “Patched diffusion models for unsupervised anomaly detection in brain mri”. In: arXiv preprint arXiv:2303.03758 (2023). [32] Yoav Benjamini and Yosef Hochberg. “Controlling the false discovery rate: a practical and powerful approach to multiple testing”. In: Journal of the Royal Statistical Society: Series B (Methodological) 57.1 (Jan. 
1995), pp. 289–300. doi: 10.1111/j.2517-6161.1995.tb02031.x. [33] Yoav Benjamini and Yosef Hochberg. “Controlling the false discovery rate: a practical and powerful approach to multiple testing”. In: Journal of the Royal statistical society: series B (Methodological) 57.1 (1995), pp. 289–300. [34] Yoav Benjamini and Daniel Yekutieli. “The control of the false discovery rate in multiple testing under dependency”. In: Annals of Statistics 29 (2001), pp. 1165–1188. url: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.124.8492. [35] J Berg, Fernanda Tagliaferri, and Franco Servadei. “Cost of trauma in Europe”. In: Eur J Neurol 12.Suppl 1 (2005), pp. 85–90. 169 [36] Christopher M Bishop. Pattern recognition and machine learning. Springer, 2006. [37] Merlijn Blaauw and Jordi Bonada. “Modeling and transforming speech using variational autoencoders”. In: Morgan N, editor. Interspeech 2016; 2016 Sep 8-12; San Francisco, CA.[place unknown]: ISCA; 2016. p. 1770-4. (2016). [38] J Martin Bland and Douglas G Altman. “Multiple significance tests: the Bonferroni method”. In: Bmj 310.6973 (1995), p. 170. [39] Leo Breiman. “Random forests”. In: Machine learning 45.1 (2001), pp. 5–32. [40] Richard P Brent. Algorithms for minimization without derivatives. Courier Corporation, 2013. [41] Matthew Brett, Alexander P Leff, Chris Rorden, and John Ashburner. “Spatial normalization of brain images with focal lesions using cost function masking”. In: Neuroimage 14.2 (2001), pp. 486–500. [42] Kristy K Brock, Sasa Mutic, Todd R McNutt, Hua Li, and Marc L Kessler. “Use of image registration and fusion algorithms and techniques in radiotherapy: Report of the AAPM Radiation Therapy Committee Task Group No. 132”. In: Medical physics 44.7 (2017), e43–e76. [43] John Burke, James Gugger, Kan Ding, Jennifer A Kim, Brandon Foreman, John K Yue, Ava M Puccio, Esther L Yuh, Xiaoying Sun, Miri Rabinowitz, et al. “Association of Posttraumatic Epilepsy With 1-Year Outcomes After Traumatic Brain Injury”. In: JAMA Network Open 4.12 (2021), e2140191–e2140191. [44] Tracey Bushnik, Jocelynn L Cook, A Albert Yuzpe, Suzanne Tough, and John Collins. “Estimating the prevalence of infertility in Canada”. In: Human reproduction 27.3 (2012), pp. 738–746. [45] Shichen Cao, Jingjing Li, Kenric P Nelson, and Mark A Kon. “Coupled VAE: Improved Accuracy and Robustness of a Variational Autoencoder”. In: arXiv preprint arXiv:1906.00536 (2019). [46] Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L Yuille, and Yuyin Zhou. “Transunet: Transformers make strong encoders for medical image segmentation”. In: arXiv preprint arXiv:2102.04306 (2021). [47] Kanglin Chen, Alexander Derksen, Stefan Heldmann, Marc Hallmann, and Benjamin Berkels. “Deformable image registration with automatic non-correspondence detection”. In: Scale Space and Variational Methods in Computer Vision: 5th International Conference, SSVM 2015, Lège-Cap Ferret, France, May 31-June 4, 2015, Proceedings 5. Springer. 2015, pp. 360–371. [48] Xiaoran Chen and Ender Konukoglu. “Unsupervised Detection of Lesions in Brain MRI using constrained adversarial auto-encoders”. In: arXiv preprint arXiv:1806.04972 (2018). [49] Xiaoran Chen, Suhang You, Kerem Can Tezcan, and Ender Konukoglu. “Unsupervised lesion detection via image restoration with a normative prior”. In: Medical Image Analysis (2020), p. 101713. 170 [50] Zhenghao Chen, Rui Zhang, Gang Zhang, Zhenhuan Ma, and Tao Lei. 
“Digging Into Pseudo Label: A Low-Budget Approach for Semi-Supervised Semantic Segmentation”. In: IEEE Access 8 (2020), pp. 41830–41837. doi: 10.1109/ACCESS.2020.2975022. [51] Andrzej Cichocki and Shun-ichi Amari. “Families of alpha-beta-and gamma-divergences: Flexible and robust measures of similarities”. In: Entropy 12.6 (2010), pp. 1532–1568. [52] Gregory Cohen, Saeed Afshar, Jonathan Tapson, and André van Schaik. “EMNIST: an extension of MNIST to handwritten letters”. In: arXiv preprint arXiv:1702.05373 (2017). [53] D Louis Collins, Colin J Holmes, Terrence M Peters, and Alan C Evans. “Automatic 3-D model-based neuroanatomical segmentation”. In: Human brain mapping 3.3 (1995), pp. 190–208. [54] Francesca Cormack, David G Gadian, Faraneh Vargha-Khadem, J Helen Cross, Alan Connelly, and Torsten Baldeweg. “Extra-hippocampal grey matter density abnormalities in paediatric mesial temporal sclerosis”. In: Neuroimage 27.3 (2005), pp. 635–643. [55] Francesca Cormack, David G. Gadian, Faraneh Vargha-Khadem, J. Helen Cross, Alan Connelly, and Torsten Baldeweg. “Extra-hippocampal grey matter density abnormalities in paediatric mesial temporal sclerosis”. eng. In: NeuroImage 27.3 (Sept. 2005), pp. 635–643. issn: 1053-8119. doi: 10.1016/j.neuroimage.2005.05.023. [56] Corinna Cortes and Vladimir Vapnik. “Support-vector networks”. In: Machine learning 20.3 (1995), pp. 273–297. [57] Nicolas Coudray, Paolo Santiago Ocampo, Theodore Sakellaropoulos, Navneet Narula, Matija Snuderl, David Fenyö, Andre L Moreira, Narges Razavian, and Aristotelis Tsirigos. “Classification and mutation prediction from non–small cell lung cancer histopathology images using deep learning”. In: Nature medicine 24.10 (2018), pp. 1559–1567. [58] Wenhui Cui, Haleh Akrami, Ganning Zhao, Anand A Joshi, and Richard M Leahy. “Meta Transfer of Self-Supervised Knowledge: Foundation Model in Action for Post-Traumatic Epilepsy Prediction”. In: arXiv preprint arXiv:2312.14204 (2023). [59] G. Curia, M. Levitt, J. S. Fender, J. W. Miller, J. Ojemann, and R. D’Ambrosio. “Impact of Injury Location and Severity on Posttraumatic Epilepsy in the Rat: Role of Frontal Neocortex”. en. In: Cerebral Cortex 21.7 (July 2011), pp. 1574–1592. issn: 1047-3211, 1460-2199. doi: 10.1093/cercor/bhq218. (Visited on 08/30/2018). [60] Canadian Institute for Cybersecurity. NSL-KDD dataset. https://www.unb.ca/cic/datasets/nsl.html. [Online; accessed May 2020]. [61] Steffen Czolbe, Kasra Arnavaz, Oswin Krause, and Aasa Feragen. “Is segmentation uncertainty useful?” In: International Conference on Information Processing in Medical Imaging. Springer. 2021, pp. 715–726. [62] Raimondo D’Ambrosio and Emilio Perucca. “Epilepsy after head injury”. In: Current opinion in neurology 17.6 (2004), p. 731. 171 [63] Bin Dai, Yu Wang, John Aston, Gang Hua, and David Wipf. “Connections with robust PCA and the role of emergent sparsity in variational autoencoder models”. In: The Journal of Machine Learning Research 19.1 (2018), pp. 1573–1614. [64] Bin Dai and David Wipf. “Diagnosing and enhancing VAE models”. In: arXiv preprint arXiv:1903.05789 (2019). [65] Melita Daley, Prabha Siddarth, Jennifer Levitt, Suresh Gurbani, W Donald Shields, Raman Sankar, Arthur Toga, and Rochelle Caplan. “Amygdala volume and psychopathology in childhood complex partial seizures”. In: Epilepsy & Behavior 13.1 (2008), pp. 212–217. [66] Melita Daley, Prabha Siddarth, Jennifer Levitt, Suresh Gurbani, W. Donald Shields, Raman Sankar, Arthur Toga, and Rochelle Caplan. 
“Amygdala volume and psychopathology in childhood complex partial seizures”. eng. In: Epilepsy & Behavior: E&B 13.1 (July 2008), pp. 212–217. issn: 1525-5069. doi: 10.1016/j.yebeh.2007.12.021. [67] Emily L Dennis, Xue Hua, Julio Villalon-Reina, Lisa M Moran, Claudia Kernan, Talin Babikian, Richard Mink, Christopher Babbitt, Jeffrey Johnson, Christopher C Giza, et al. “Tensor-based morphometry reveals volumetric deficits in moderate/severe pediatric traumatic brain injury”. In: Journal of neurotrauma 33.9 (2016), pp. 840–852. [68] Emily L. Dennis, Xue Hua, Julio Villalon-Reina, Lisa M. Moran, Claudia Kernan, Talin Babikian, Richard Mink, Christopher Babbitt, Jeffrey Johnson, Christopher C. Giza, Paul M. Thompson, and Robert F. Asarnow. “Tensor-Based Morphometry Reveals Volumetric Deficits in Moderate=Severe Pediatric Traumatic Brain Injury”. eng. In: Journal of Neurotrauma 33.9 (May 2016), pp. 840–852. issn: 1557-9042. doi: 10.1089/neu.2015.4012. [69] Nicki S Detlefsen, Martin Jørgensen, and Søren Hauberg. “Reliable training and estimation of variance networks”. In: arXiv preprint arXiv:1906.03260 (2019). [70] Terrance DeVries and Graham W Taylor. “Leveraging uncertainty estimates for predicting segmentation quality”. In: arXiv preprint arXiv:1807.00502 (2018). [71] Ian Dewancker, Michael McCourt, and Scott Clark. “Bayesian optimization for machine learning: A practical guidebook”. In: arXiv preprint arXiv:1612.04858 (2016). [72] Ramon Diaz-Arrastia, Mark A Agostini, Christopher J Madden, and Paul C Van Ness. “Posttraumatic epilepsy: the endophenotypes of a human model of epileptogenesis”. In: Epilepsia 50 (2009), pp. 14–20. [73] Arthur Douillard, Yifu Chen, Arnaud Dapogny, and Matthieu Cord. “PLOP: Learning without Forgetting for Continual Semantic Segmentation”. In: arXiv preprint arXiv:2011.11390 (2020). [74] Richard O Duda, Peter E Hart, and David G Stork. Pattern classification. John Wiley & Sons, 2012. [75] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern classification. John Wiley & Sons, 2012. 172 [76] Dominique Duncan, Giuseppe Barisano, Ryan Cabeen, Farshid Sepehrband, Rachael Garner, Adebayo Braimah, Paul Vespa, Asla Pitkänen, Meng Law, and Arthur W. Toga. “Analytic Tools for Post-traumatic Epileptogenesis Biomarker Search in Multimodal Dataset of an Animal Model and Human Patients”. In: Frontiers in Neuroinformatics 12 (Dec. 2018), p. 86. issn: 1662-5196. doi: 10.3389/fninf.2018.00086. (Visited on 05/24/2020). [77] George H Dunteman. Principal components analysis. 69. Sage, 1989. [78] Simao Eduardo et al. “Robust Variational Autoencoders for Outlier Detection in Mixed-Type Data”. In: arXiv preprint arXiv:1907.06671 (2019). [79] Behzad Eftekhar, Mohammad Ali Sahraian, Banafsheh Nouralishahi, Ali Khaji, Zahra Vahabi, Mohammad Ghodsi, Hassan Araghizadeh, Mohammad Reza Soroush, Sima Karbalaei Esmaeili, and Mehdi Masoumi. “Prognostic factors in the persistence of posttraumatic epilepsy after penetrating head injuries sustained in war”. In: Journal of neurosurgery 110.2 (2009), pp. 319–326. [80] Shinto Eguchi and Shogo Kato. “Entropy and divergence associated with power function and the statistical application”. In: Entropy 12.2 (2010), pp. 262–274. [81] Jerome Engel Jr, Asla Pitkänen, Jeffrey A Loeb, F Edward Dudek, Edward H Bertram III, Andrew J Cole, Solomon L Moshé, Samuel Wiebe, Frances E Jensen, Istvan Mody, et al. “Epilepsy biomarkers”. In: Epilepsia 54 (2013), pp. 61–69. 
Abstract
This research explores the development of novel deep learning (DL) methods tailored for biomedical applications, where data is often limited or imperfect. Despite the success of DL in many domains, its application to medical imaging faces unique challenges: real-world datasets rarely conform to standard machine learning assumptions, models generalize poorly to unseen datasets, and DL methods tend to produce overconfident predictions, particularly when trained on limited data. These issues are especially consequential in clinical settings, where accurate uncertainty assessment is essential for disease diagnosis and treatment planning.
The primary objective of this study is to create robust, generalizable, and uncertainty-aware DL models that can effectively handle the complexities of biomedical datasets. We aim to: (1) enhance the robustness of DL models for complex medical imaging data, (2) improve the generalizability of DL methods across diverse datasets, thereby increasing the statistical power of medical studies, and (3) refine DL models to better estimate uncertainties, providing a risk assessment alongside diagnostic decisions.
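To make the uncertainty-estimation objective concrete, below is a minimal sketch of single-model uncertainty via quantile regression with the pinball loss, a standard route to prediction intervals that can serve as a risk assessment alongside a point prediction. This is an illustrative sketch only, not the dissertation's exact implementation: the network, toy data, and quantile levels are hypothetical, and PyTorch is assumed.

```python
# Illustrative sketch: a generic two-head quantile regressor trained with the
# pinball loss. Heads estimate the 5th and 95th conditional percentiles, so
# together they form a 90% prediction interval per sample.
import torch
import torch.nn as nn

def pinball_loss(pred, target, q):
    # Asymmetric loss whose minimizer is the q-th conditional quantile.
    err = target - pred
    return torch.mean(torch.maximum(q * err, (q - 1.0) * err))

net = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

x = torch.randn(256, 16)   # toy inputs standing in for image-derived features
y = torch.randn(256, 1)    # toy scalar targets

for _ in range(200):
    opt.zero_grad()
    out = net(x)           # column 0 -> 5th percentile, column 1 -> 95th
    loss = pinball_loss(out[:, :1], y, 0.05) + pinball_loss(out[:, 1:], y, 0.95)
    loss.backward()
    opt.step()

lower, upper = net(x).chunk(2, dim=1)  # per-sample 90% prediction interval
```

When the model is well trained, `lower` and `upper` bracket the target with roughly 90% coverage; pairing such intervals with conformal calibration is a common way to obtain a distribution-free coverage guarantee.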
A key application of our research is the detection of lesions and the prediction of post-traumatic epilepsy (PTE) following traumatic brain injury (TBI). Given the high prevalence and long-term impact of TBI, and the difficulty of identifying biomarkers for PTE, our DL methods have the potential to aid in the early identification of at-risk patients and to guide preventive care. Beyond medical imaging, the developed methods have implications for other fields that suffer from poor-quality training data, such as network traffic modeling and speech recognition, illustrating the broad applicability of our research.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Scalable optimization for trustworthy AI: robust and fair machine learning
Invariant representation learning for robust and fair predictions
Leveraging training information for efficient and robust deep learning
Integrated wireless piezoelectric ultrasonic transducer system for biomedical applications
Improving arterial spin labeling in clinical application with deep learning
Robust causal inference with machine learning on observational data
Detection and decoding of cognitive states from neural activity to enable a performance-improving brain-computer interface
Human motion data analysis and compression using graph based techniques
Efficient deep learning for inverse problems in scientific and medical imaging
Detecting semantic manipulations in natural and biomedical images
Efficient stochastic simulations of hydrogeological systems: from model complexity to data assimilation
Deep learning for characterization and prediction of complex fluid flow systems
Efficient learning: exploring computational and data-driven techniques for efficient training of deep learning models
Optimization strategies for robustness and fairness
Inference of computational models of tendon networks via sparse experimentation
Efficient machine learning techniques for low- and high-dimensional data sources
Deep learning for subsurface characterization and forecasting
Learning logical abstractions from sequential data
Model-based approaches to objective inference during steady-state and adaptive locomotor control
New theory and methods for accelerated MRI reconstruction
Asset Metadata
Creator
Akrami, Haleh (author)
Core Title
Learning from limited and imperfect data for brain image analysis and other biomedical applications
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Biomedical Engineering
Degree Conferral Date
2024-05
Publication Date
07/25/2024
Defense Date
01/23/2024
Publisher
Los Angeles, California (original), University of Southern California (original), University of Southern California. Libraries (digital)
Tag
biomedical dataset, deep learning, generalizability, OAI-PMH Harvest, robustness, uncertainty estimation
Format
theses (aat)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Leahy, Richard (committee chair), Thompson, Paul (committee member), Valero-Cuevas, Francisco (committee member)
Creator Email
akrami@usc.edu, hale.akrami@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC113814099
Unique identifier
UC113814099
Identifier
etd-AkramiHale-12634.pdf (filename)
Legacy Identifier
etd-AkramiHale-12634
Document Type
Dissertation
Rights
Akrami, Haleh
Internet Media Type
application/pdf
Type
texts
Source
20240130-usctheses-batch-1123 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu